
Program Optimization for Multi-core: Hardware side of it

Contents
Virtual Memory and Caches (Recap)
Fundamentals of Parallel Computers: ILP vs. TLP
Parallel Programming: Shared Memory and Message Passing
Performance Issues in Shared Memory
Shared Memory Multiprocessors: Consistency and Coherence
Synchronization
Memory consistency models
Case Studies of CMP
June-July 2009 2

RECAP: VIRTUAL MEMORY AND CACHE


Mainak Chaudhuri [email protected]

Why virtual memory?


With a 32-bit address you can access 4 GB of physical memory (you will never get the full memory though)
Seems enough for most day-to-day applications
But there are important applications that have much bigger memory footprints: databases, scientific apps operating on large matrices etc.
Even if your application fits entirely in physical memory, it seems unfair to load the full image at startup
Just takes away memory from other processes, but probably doesn't need the full image at any point of time during execution: hurts multiprogramming

Need to provide an illusion of bigger memory: Virtual Memory (VM)


June-July 2009

Virtual memory
Need an address to access virtual memory


Virtual Address (VA)
Every process sees 4 GB of virtual memory
This is much better than a 4 GB physical memory shared between multiprogrammed processes
The size of the VA is really fixed by the processor data path width
64-bit processors (Alpha 21264, 21364; Sun UltraSPARC; AMD Athlon64, Opteron; IBM POWER4, POWER5; MIPS R10000 onwards; Intel Itanium etc., and recently Intel Pentium 4) provide bigger virtual memory to each process
Large virtual and physical memory is very important in the commercial server market: need to run large databases
June-July 2009 5

Addressing VM
Assume a 32-bit VA
There are primarily three ways to address VM
Paging, Segmentation, Segmented paging
We will focus on flat paging only

Paged VM
The entire VM is divided into small units called pages Virtual pages are loaded into physical page frames as and when needed (demand paging) Thus the physical memory is also divided into equal sized page frames The processor generates virtual addresses But memory is physically addressed: need a VA to PA translation
June-July 2009 6

VA to PA translation
The VA generated by the processor is divided into two parts:
Page offset and Virtual page number (VPN) Assume a 4 KB page: within a 32-bit VA, lower 12 bits will be page offset (offset within a page) and the remaining 20 bits are VPN (hence 1 M virtual pages total) The page offset remains unchanged in the translation Need to translate VPN to a physical page frame number (PPFN) This translation is held in a page table resident in memory: so first we need to access this page table How to get the address of the page table?
June-July 2009 7

VA to PA translation
Accessing the page table
The Page table base register (PTBR) contains the starting physical address of the page table PTBR is normally accessible in the kernel mode only Assume each entry in page table is 32 bits (4 bytes) Thus the required page table address is PTBR + (VPN << 2) Access memory at this address to get 32 bits of data from the page table entry (PTE) These 32 bits contain many things: a valid bit, the much needed PPFN (may be 20 bits for a 4 GB physical memory), access permissions (read, write, execute), a dirty/modified bit etc.
June-July 2009 8

Page fault
The valid bit within the 32 bits tells you if the translation is valid If this bit is reset that means the page is not resident in memory: results in a page fault In case of a page fault the kernel needs to bring in the page to memory from disk The disk address is normally provided by the page table entry (different interpretation of 31 bits) Also kernel needs to allocate a new physical page frame for this virtual page If all frames are occupied it invokes a page replacement policy
June-July 2009 9

VA to PA translation
Page faults take a long time: order of ms
Once the page fault finishes, the page table entry is updated with the new VPN to PPFN mapping
Of course, if the valid bit was set, you get the PPFN right away without taking a page fault
Finally, the PPFN is concatenated with the page offset to get the final PA (PA = PPFN, page offset)
The processor can now issue a memory request with this PA to get the necessary data
Really two memory accesses are needed
Can we improve on this?
June-July 2009 10
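To make the address arithmetic above concrete, here is a minimal C sketch of the whole walk for 4 KB pages and 4-byte PTEs. The exact PTE layout (valid bit in bit 31, PPFN in the low 20 bits) and the phys_read32 helper are assumptions chosen for illustration, not a fixed architectural format.

#include <stdint.h>

#define PAGE_SHIFT  12               /* 4 KB pages */
#define OFFSET_MASK 0xFFFu           /* low 12 bits of the VA */
#define PTE_VALID   (1u << 31)       /* assumed position of the valid bit */
#define PTE_PPFN    0xFFFFFu         /* assumed 20-bit PPFN field */

/* Hypothetical helper: read 32 bits from physical memory. */
extern uint32_t phys_read32(uint32_t pa);

/* Walk a one-level page table: VA -> PA. Sets *fault on a page fault. */
uint32_t translate(uint32_t va, uint32_t ptbr, int *fault)
{
    uint32_t vpn    = va >> PAGE_SHIFT;                 /* upper 20 bits */
    uint32_t offset = va & OFFSET_MASK;                 /* lower 12 bits */
    uint32_t pte    = phys_read32(ptbr + (vpn << 2));   /* PTBR + VPN*4 */

    if (!(pte & PTE_VALID)) {   /* page not resident: kernel must bring it in */
        *fault = 1;
        return 0;
    }
    *fault = 0;
    return ((pte & PTE_PPFN) << PAGE_SHIFT) | offset;   /* PPFN . offset */
}

With a TLB (next slide) the common case skips the memory read of the PTE entirely.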

Need a good page replacement policy

TLB
Why can't we cache the most recently used translations?
Translation Look-aside Buffers (TLB) Small set of registers (normally fully associative) Each entry has two parts: the tag which is simply VPN and the corresponding PTE The tag may also contain a process id On a TLB hit you just get the translation in one cycle (may take slightly longer depending on the design) On a TLB miss you may need to access memory to load the PTE in TLB (more later) Normally there are two TLBs: instruction and data
June-July 2009 11

Caches
Once you have completed the VA to PA translation you have the physical address. What's next?
You need to access memory with that PA
Instruction and data caches hold most recently used (temporally close) and nearby (spatially close) data
Use the PA to access the cache first
Caches are organized as arrays of cache lines
Each cache line holds several contiguous bytes (32, 64 or 128 bytes)
June-July 2009 12

Addressing a cache
The PA is divided into several parts: TAG, INDEX, BLK. OFFSET
The block offset determines the starting byte address within a cache line
The index tells you which cache line to access
In that cache line you compare the tag to determine hit/miss

June-July 2009

13

Addressing a cache
[Figure: the PA (TAG, INDEX, BLK. OFFSET) indexes the tag and data arrays; the stored TAG and STATE of the selected line are checked against the address tag to signal HIT/MISS, and the DATA array supplies the requested bytes (the access size determines how many).]

Addressing a cache
An example
PA is 32 bits Cache line is 64 bytes: block offset is 6 bits Number of cache lines is 512: index is 9 bits So tag is the remaining bits: 17 bits Total size of the cache is 512*64 bytes i.e. 32 KB Each cache line contains the 64 byte data, 17-bit tag, one valid/invalid bit, and several state bits (such as shared, dirty etc.) Since both the tag and the index are derived from the PA this is called a physically indexed physically tagged cache
June-July 2009 15
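As a quick check of the field widths in this example, the split can be written out directly (the PA value itself is arbitrary):

#include <stdio.h>
#include <stdint.h>

#define BLK_OFFSET_BITS 6   /* 64-byte line */
#define INDEX_BITS      9   /* 512 lines    */

int main(void)
{
    uint32_t pa     = 0x12345678u;                              /* arbitrary example PA */
    uint32_t offset = pa & ((1u << BLK_OFFSET_BITS) - 1);
    uint32_t index  = (pa >> BLK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = pa >> (BLK_OFFSET_BITS + INDEX_BITS);     /* remaining 17 bits */

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}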

Set associative cache


The example assumes one cache line per index
Called a direct-mapped cache
An access to a different line that maps to the same index evicts the resident cache line
This is either a capacity or a conflict miss

Conflict misses can be reduced by providing multiple lines per index Access to an index returns a set of cache lines
For an n-way set associative cache there are n lines per set

Carry out multiple tag comparisons in parallel to see if any one in the set hits
June-July 2009 16

2-way set associative


[Figure: 2-way set associative lookup — the PA (TAG, INDEX, BLK. OFFSET) selects one set; the two stored tags (TAG0, TAG1) and their STATE bits are compared with the address tag in parallel, and the matching way's DATA is returned.]

Set associative cache


When you need to evict a line in a particular set you run a replacement policy
LRU is a good choice: keeps the most recently used lines (favors temporal locality) Thus you reduce the number of conflict misses

Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single set)

Example: 32 KB cache, 2-way set associative, line size of 64 bytes: number of indices or number of sets=32*1024/(2*64)=256 and hence index is 8 bits wide Example: Same size and line size, but fully associative: number of sets is 1, within the set there are 32*1024/64 or 512 lines; you need 512 tag comparisons for each access
June-July 2009 18
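The set arithmetic in the two examples above, written out with the cache size, associativity, and line size as the only inputs:

#include <stdio.h>

static unsigned log2u(unsigned x) { unsigned b = 0; while (x >>= 1) b++; return b; }

int main(void)
{
    unsigned size = 32 * 1024, line = 64;

    unsigned ways = 2;                     /* 2-way example */
    unsigned sets = size / (ways * line);  /* = 256 sets    */
    printf("2-way: %u sets, %u index bits\n", sets, log2u(sets));

    ways = size / line;                    /* fully associative: one big set */
    sets = size / (ways * line);           /* = 1 set, 512 tag comparisons   */
    printf("fully associative: %u set, %u lines in it\n", sets, ways);
    return 0;
}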

Cache hierarchy
Ideally want to hold everything in a fast cache
Never want to go to the memory

But, with increasing size the access time increases A large cache will slow down every access So, put increasingly bigger and slower caches between the processor and the memory Keep the most recently used data in the nearest cache: register file (RF) Next level of cache: level 1 or L1 (same speed or slightly slower than RF, but much bigger) Then L2: way bigger than L1 and much slower
June-July 2009 19

Cache hierarchy
Example: Intel Pentium 4 (NetBurst)
128 registers accessible in 2 cycles
L1 data cache: 8 KB, 4-way set associative, 64-byte line size, accessible in 2 cycles for integer loads
L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 7 cycles

Example: Intel Itanium 2 (code name Madison)
128 registers accessible in 1 cycle
L1 instruction and data caches: each 16 KB, 4-way set associative, 64-byte line size, accessible in 1 cycle
Unified L2 cache: 256 KB, 8-way set associative, 128-byte line size, accessible in 5 cycles
Unified L3 cache: 6 MB, 24-way set associative, 128-byte line size, accessible in 14 cycles

States of a cache line


The life of a cache line starts off in invalid state (I)
An access to that line takes a cache miss and fetches the line from main memory
If it was a read miss the line is filled in shared state (S) [we will discuss it later; for now just assume that this is equivalent to a valid state]
In case of a store miss the line is filled in modified state (M); instruction cache lines do not normally enter the M state (no store to Icache)
The eviction of a line in M state must write the line back to the memory (this is called a writeback cache); otherwise the effect of the store would be lost

Inclusion policy
A cache hierarchy implements inclusion if the contents of the level n cache (excluding the register file) are a subset of the contents of the level n+1 cache
Eviction of a line from L2 must ask the L1 caches (both instruction and data) to invalidate that line if present
A store miss fills the L2 cache line in M state, but the store really happens in the L1 data cache; so the L2 cache does not have the most up-to-date copy of the line
Eviction of an L1 line in M state writes back the line to L2
Eviction of an L2 line in M state first asks the L1 data cache to send the most up-to-date copy (if any), then it writes the line back to the next higher level (L3 or main memory)
Inclusion simplifies the on-chip coherence protocol (more later)

The first instruction


Accessing the first instruction
Take the starting PC
Access the iTLB with the VPN extracted from the PC: iTLB miss
Invoke the iTLB miss handler
Calculate the PTE address
If PTEs are cached in the L1 data and L2 caches, look them up with the PTE address: you will miss there also
Access the page table in main memory: the PTE is invalid: page fault
Invoke the page fault handler
Allocate a page frame, read the page from disk, update the PTE, load the PTE into the iTLB, restart the fetch
June-July 2009 23

The first instruction


Now you have the physical address
Access the Icache: miss
Send a refill request to the higher levels: you miss everywhere
Send the request to the memory controller (north bridge)
Access main memory
Read the cache line
Refill all levels of cache as the cache line returns to the processor
Extract the appropriate instruction from the cache line with the block offset

This is the longest possible latency in an instruction/data access


June-July 2009

24

TLB access
For every cache access (instruction or data) you need to access the TLB first Puts the TLB in the critical path Want to start indexing into cache and read the tags while TLB lookup takes place
Virtually indexed physically tagged cache Extract index from the VA, start reading tag while looking up TLB Once the PA is available do tag comparison Overlaps TLB reading and tag reading
June-July 2009 25
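A well-known corollary of this overlap, stated here as a hedged aside: the virtual index is guaranteed to equal the physical index only when the index and block-offset bits fit inside the page offset, i.e. when cache size / associativity <= page size. A tiny check with illustrative numbers:

#include <stdio.h>

int main(void)
{
    unsigned page_size  = 4096;        /* 4 KB pages (as in the recap) */
    unsigned cache_size = 32 * 1024;   /* illustrative L1 size         */
    unsigned ways       = 8;           /* illustrative associativity   */

    /* Index + block offset stay within the page offset only if one way
       of the cache is no bigger than a page. */
    if (cache_size / ways <= page_size)
        printf("index bits are untranslated: no virtual-alias problem\n");
    else
        printf("index uses translated bits: aliases must be handled\n");
    return 0;
}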

Memory op latency
L1 hit: ~1 ns
L2 hit: ~5 ns
L3 hit: ~10-15 ns
Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
If a load misses in all caches it will eventually come to the head of the ROB and block instruction retirement (in-order retirement is a must)
Gradually, the pipeline backs up, the processor runs out of resources such as ROB entries and physical registers
Ultimately, the fetcher stalls: severely limits ILP
June-July 2009 26

MLP
Simply speaking, need to mutually overlap several memory operations: need memory-level parallelism (MLP)
Step 1: Non-blocking cache
Allow multiple outstanding cache misses
Mutually overlap multiple cache misses
Supported by all microprocessors today (Alpha 21364 supported 16 outstanding cache misses)
Step 2: Out-of-order load issue
Issue loads out of program order (the address is not known at the time of issue)
How do you know the load didn't issue before a store to the same address?
Issuing stores must check for this memory-order violation

Out-of-order loads
sw 0(r7), r6
/* other instructions */
lw r2, 80(r20)
Assume that the load issues before the store because r20 gets ready before r6 or r7
The load accesses the store buffer (used for holding already executed store values before they are committed to the cache at retirement)
If it misses in the store buffer it looks up the caches and, say, gets the value somewhere
After several cycles the store issues and it turns out that 0(r7)==80(r20) or they overlap; now what?

Load/store ordering
Out-of-order load issue relies on speculative memory disambiguation
Assumes that there will be no conflicting store If the speculation is correct, you have issued the load much earlier and you have allowed the dependents to also execute much earlier If there is a conflicting store, you have to squash the load and all the dependents that have consumed the load value and re-execute them systematically Turns out that the speculation is correct most of the time To further minimize the load squash, microprocessors use simple memory dependence predictors (predicts if a load is going to conflict with a pending store based on that load's or load/store pair's past behavior)
June-July 2009 29

MLP and memory wall

Today microprocessors try to hide cache misses by initiating early prefetches:


Hardware prefetchers try to predict next several load addresses and initiate cache line prefetch if they are not already in the cache All processors today also support prefetch instructions; so you can specify in your program when to prefetch what: this gives much better control compared to a hardware prefetcher

Researchers are working on load value prediction Even after doing all these, memory latency remains the biggest bottleneck Today microprocessors are trying to overcome one single wall: the memory wall
June-July 2009 30

Fundamentals of Parallel Computers

Agenda
Convergence of parallel architectures Fundamental design issues ILP vs. TLP

June-July 2009

32

Historically, parallel architectures are tied to programming models

Communication architecture

Today parallel architecture is seen as an extension of microprocessor architecture with a communication architecture
Defines the basic communication and synchronization operations and provides hw/sw implementation of those
June-July 2009 33

Diverse designs made it impossible to write portable parallel software But the driving force was the same: need for fast processing

A parallel architecture can be divided into several layers


Parallel applications Programming models: shared address, message passing, multiprogramming, data parallel, dataflow etc Compiler + libraries Operating systems support Communication hardware Physical communication medium

Layered architecture

Communication architecture = user/system interface + hw implementation (roughly defined by the last four layers)
Compiler and OS provide the user interface to communicate between and synchronize threads
June-July 2009 34

Communication takes place through a logically shared portion of memory


User interface is normal load/store instructions Load/store instructions generate virtual addresses The VAs are translated to PAs by TLB or page table The memory controller then decides where to find this PA Actual communication is hidden from the programmer

Shared address

The general communication hw consists of multiple processors connected over some medium so that they can talk to memory banks and I/O devices
The architecture of the interconnect may vary depending on projected cost and target performance
June-July 2009 35

Shared address
Communication medium
[Figure: dance-hall organization — processors on one side of the INTERCONNECT, memory banks (MEM) and I/O on the other.]

Interconnect could be a crossbar switch so that any processor can talk to any memory bank in one hop (provides latency and bandwidth advantages) Scaling a crossbar becomes a problem: cost is proportional to square of the size Instead, could use a scalable switch-based network; latency increases and bandwidth decreases because now multiple processors contend for switch ports
June-July 2009

36

Shared address Communication medium

From mid 80s shared bus became popular leading to the design of SMPs Pentium Pro Quad was the first commodity SMP Sun Enterprise server provided a highly pipelined wide shared bus for scalability reasons; it also distributed the memory to each processor, but there was no local bus on the boards i.e. the memory was still symmetric (must use the shared bus) NUMA or DSM architectures provide a better solution to the scalability problem; the symmetric view is replaced by local and remote memory and each node (containing processor(s) with caches, memory controller and router) gets connected via a scalable network (mesh, ring etc.); Examples include Cray/SGI T3E, SGI Origin 2000, Alpha GS320, Alpha/HP GS1280 etc.
June-July 2009 37

Very popular for large-scale computing
The system architecture looks exactly the same as DSM, but there is no shared memory
The user interface is via send/receive calls to the message layer
The message layer is integrated into the I/O system instead of the memory system
Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
A matching receive at the destination node with the same tag reads in the data from a kernel space buffer to user memory
Effectively, provides a memory-to-memory copy

Message passing

Actual implementation of message layer

Message passing

Initially it was very topology dependent A node could talk only to its neighbors through FIFO buffers These buffers were small in size and therefore while sending a message send would occasionally block waiting for the receive to start reading the buffer (synchronous message passing) Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a send can initiate a transfer from memory to I/O buffers and finish immediately (DMA happens in background); same applies to the receiving end also The parallel algorithms were designed specifically for certain topologies: a big problem
June-July 2009 39

To improve usability of machines, the message layer started providing support for arbitrary source and destination (not just nearest neighbors)

Message passing

Essentially involved storing a message in intermediate hops and forwarding it to the next node on the route
Later this store-and-forward routing got moved to hardware where a switch could handle all the routing activities
Further improved to do pipelined wormhole routing so that the time taken to traverse the intermediate hops became small compared to the time it takes to push the message from processor to network (limited by node-to-network bandwidth)
Examples include IBM SP2, Intel Paragon
Each node of Paragon had two i860 processors, one of which was dedicated to servicing the network (send/recv. etc.)

Shared address and message passing are two distinct programming models, but the architectures look very similar

Convergence

Both have a communication assist or network interface to initiate messages or transactions
In shared memory this assist is integrated with the memory controller
In message passing this assist normally used to be integrated with the I/O, but the trend is changing
There are message passing machines where the assist sits on the memory bus or machines where DMA over network is supported (direct transfer from source memory to destination memory)
Finally, it is possible to emulate send/recv. on shared memory through shared buffers, flags and locks
Possible to emulate a shared virtual memory on message passing machines through modified page fault handlers

A generic architecture
CA = network interface (NI) + communication controller

In all the architectures we have discussed thus far a node essentially contains processor(s) + caches, memory and a communication assist (CA) The nodes are connected over a scalable network The main difference remains in the architecture of the CA
And even under a particular programming model (e.g., shared memory) there is a lot of choices in the design of the CA Most innovations in parallel architecture takes place in the communication assist (also called communication controller or node controller)
June-July 2009 42

A generic architecture
[Figure: a generic architecture — NODEs connected by a SCALABLE NETWORK; within a node, the processor P and its CACHE connect through the CA and a crossbar (XBAR) to the local MEM.]
June-July 2009 43

Design issues
Need to understand architectural components that affect software
Compiler, library, program
User/system interface and hw/sw interface
How do programming models efficiently talk to the communication architecture?
How to implement efficient primitives in the communication layer?
In a nutshell, what issues of a parallel machine will affect the performance of the parallel applications?

Naming, Operations, Ordering, Replication, Communication cost


June-July 2009 44

Naming
How are the data in a program referenced?
In sequential programs a thread can access any variable in its virtual address space In shared memory programs a thread can access any private or shared variable (same load/store model of sequential programs) In message passing programs a thread can access local data directly

Clearly, naming requires some support from hw and OS


Need to make sure that the accessed virtual address gets translated to the correct physical address
June-July 2009 45

Operations
What operations are supported to access data?
For sequential and shared memory models load/store are sufficient For message passing models send/receive are needed to access remote data For shared memory, hw (essentially the CA) needs to make sure that a load/store operation gets correctly translated to a message if the address is remote For message passing, CA or the message layer needs to copy data from local memory and initiate send, or copy data from receive buffer to local memory

June-July 2009

46

How are the accesses to the same data ordered?

Ordering

For the sequential model, it is the program order: true dependence order
For shared memory, within a thread it is the program order; across threads, some valid interleaving of accesses as expected by the programmer and enforced by synchronization operations (locks, point-to-point synchronization through flags, global synchronization through barriers)
Ordering issues are very subtle and important in the shared memory model (some microprocessor re-ordering tricks may easily violate correctness when used in a shared memory context)
For message passing, ordering across threads is implied through point-to-point send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent (no shared variable)

Replication
How is the shared data locally replicated?
This is very important for reducing communication traffic In microprocessors data is replicated in the cache to reduce memory accesses In message passing, replication is explicit in the program and happens through receive (a private copy is created) In shared memory a load brings in the data to the cache hierarchy so that subsequent accesses can be fast; this is totally hidden from the program and therefore the hardware must provide a layer that keeps track of the most recent copies of the data (this layer is central to the performance of shared memory multiprocessors and is called the cache coherence protocol)
June-July 2009 48

Communication cost
Three major components of the communication architecture that affect performance
Latency: time to do an operation (e.g., load/store or send/recv.) Bandwidth: rate of performing an operation Overhead or occupancy: how long is the communication layer occupied doing an operation Already a big problem for microprocessors Even bigger problem for multiprocessors due to remote operations Must optimize application or hardware to hide or lower latency (algorithmic optimizations or prefetching or overlapping computation with communication)
June-July 2009 49

Latency

Bandwidth

Communication cost
How many ops in unit time e.g. how many bytes transferred per second Local BW is provided by heavily banked memory or faster and wider system bus Communication BW has two components: 1. node-tonetwork BW (also called network link BW) measures how fast bytes can be pushed into the router from the CA, 2. within-network bandwidth: affected by scalability of the network and architecture of the switch or router

Linear cost model: Transfer time = T0 + n/B where T0 is start-up overhead, n is number of bytes transferred and B is BW

Not sufficient since overlap of comp. and comm. is not considered; also does not count how the transfer is done (pipelined or not)

Better model:

Communication cost
Communication time for n bytes = Overhead + CA occupancy + Network latency + Size/BW + Contention
T(n) = Ov + Oc + L + n/B + Tc
Overhead and occupancy may be functions of n
Contention depends on the queuing delay at various components along the communication path e.g. waiting time at the communication assist or controller, waiting time at the router etc.
Overall communication cost = frequency of communication x (communication time - overlap with useful computation)
Frequency of communication depends on various factors such as how the program is written or the granularity of communication supported by the underlying hardware
June-July 2009 51
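The model above translates directly into code; every argument is a parameter to be measured or assumed, not a built-in constant:

/* T(n) = Ov + Oc + L + n/B + Tc, as in the model above. */
double comm_time(double Ov, double Oc, double L,
                 double n_bytes, double B, double Tc)
{
    return Ov + Oc + L + n_bytes / B + Tc;
}

/* Overall cost = frequency x (communication time - overlap with computation). */
double comm_cost(double freq, double t_comm, double overlap)
{
    return freq * (t_comm - overlap);
}

For example, comm_cost(f, comm_time(Ov, Oc, L, n, B, Tc), overlap) gives the contribution of one communication pattern to execution time under this model.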

Microprocessors enhance performance of a sequential program by extracting parallelism from an instruction stream (called instruction-level parallelism) Multiprocessors enhance performance of an explicitly parallel program by running multiple threads in parallel (called thread-level parallelism) TLP provides parallelism at a much larger granularity compared to ILP In multiprocessors ILP and TLP work together

ILP vs. TLP

Within a thread ILP provides performance boost Across threads TLP provides speedup over a sequential version of the parallel program
June-July 2009 52

Parallel Programming

Prolog: Why bother?


As an architect why should you be concerned with parallel programming?
Understanding program behavior is very important in developing high-performance computers An architect designs machines that will be used by the software programmers: so need to understand the needs of a program Helps in making design trade-offs and cost/performance analysis i.e. what hardware feature is worth supporting and what is not Normally an architect needs to have a fairly good knowledge in compilers and operating systems
June-July 2009 54

Agenda
Steps in writing a parallel program Example

June-July 2009

55

Start from a sequential description Identify work that can be done in parallel Partition work and/or data among threads or processes
Decomposition and assignment Orchestration

Writing a parallel program

Add necessary communication and synchronization Map threads to processors (Mapping) How good is the parallel program?
June-July 2009

Measure speedup = sequential execution time/parallel execution time = number of processors ideally
56

Task

Some definitions

Arbitrary piece of sequential work Concurrency is only across tasks Fine-grained task vs. coarse-grained task: controls granularity of parallelism (spectrum of grain: one instruction to the whole sequential program)

Process/thread
Logical entity that performs a task Communication and synchronization happen between threads

Processors
Physical entity on which one or more processes execute
June-July 2009 57

Decomposition
Find concurrent tasks and divide the program into tasks
Level or grain of concurrency needs to be decided here Too many tasks: may lead to too much of overhead communicating and synchronizing between tasks Too few tasks: may lead to idle processors Goal: Just enough tasks to keep the processors busy

Number of tasks may vary dynamically


New tasks may get created as the computation proceeds: new rays in ray tracing Number of available tasks at any point in time is an upper bound on the achievable speedup
June-July 2009 58

Static assignment
Given a decomposition it is possible to assign tasks statically
For example, some computation on an array of size N can be decomposed statically by assigning a range of indices to each process: for k processes, P0 operates on indices 0 to (N/k)-1, P1 operates on N/k to (2N/k)-1, ..., Pk-1 operates on (k-1)N/k to N-1 (a short sketch follows below)
For regular computations this works great: simple and low-overhead

What if the nature of the computation depends on the index?

For certain index ranges you do some heavy-weight computation while for others you do something simple Is there a problem?
June-July 2009 59
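A minimal sketch of the static block assignment just described; the handling of a remainder (when k does not divide N) is one common convention and an assumption here:

/* Static block assignment of N indices to k processes: process pid
   works on [*lo, *hi). The remainder convention (spread the leftover
   indices over the first N%k processes) is just one common choice. */
void block_range(int N, int k, int pid, int *lo, int *hi)
{
    int base = N / k, rem = N % k;
    *lo = pid * base + (pid < rem ? pid : rem);
    *hi = *lo + base + (pid < rem ? 1 : 0);
}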

Static assignment may lead to load imbalance depending on how irregular the application is Dynamic decomposition/assignment solves this issue by allowing a process to dynamically choose any available task whenever it is done with its previous task
Normally in this case you decompose the program in such a way that the number of available tasks is larger than the number of processes Same example: divide the array into portions each with 10 indices; so you have N/10 tasks An idle process grabs the next available task Provides better load balance since longer tasks can execute concurrently with the smaller ones
June-July 2009 60

Dynamic assignment

Dynamic assignment
Dynamic assignment comes with its own overhead
Now you need to maintain a shared count of the number of available tasks The update of this variable must be protected by a lock Need to be careful so that this lock contention does not outweigh the benefits of dynamic decomposition

More complicated applications where a task may not just operate on an index range, but could manipulate a subtree or a complex data structure
Normally a dynamic task queue is maintained where each task is probably a pointer to the data
The task queue gets populated as new tasks are discovered
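A minimal sketch of the shared-counter scheme described above, written with plain pthreads rather than the LOCK/UNLOCK macros used later in these slides; the chunk size of 10 indices follows the example:

#include <pthread.h>

#define CHUNK 10                 /* 10 indices per task, as in the example above */

static int next_task = 0;        /* shared count of handed-out tasks */
static int num_tasks;            /* set to N/CHUNK during setup      */
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

/* An idle worker calls this to grab the next task; returns -1 when
   the pool is empty. The lock protects the shared counter. */
int grab_task(void)
{
    int t = -1;
    pthread_mutex_lock(&task_lock);
    if (next_task < num_tasks)
        t = next_task++;
    pthread_mutex_unlock(&task_lock);
    return t;
}

Task t then covers indices [t*CHUNK, (t+1)*CHUNK); a worker loops on grab_task() until it returns -1.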

Decomposition types
Decomposition by data
The most commonly found decomposition technique The data set is partitioned into several subsets and each subset is assigned to a process The type of computation may or may not be identical on each subset Very easy to program and manage

Computational decomposition
Not so popular: tricky to program and manage All processes operate on the same data, but probably carry out different kinds of computation More common in systolic arrays, pipelined graphics processor units (GPUs) etc.
June-July 2009

62

Involves structuring communication and synchronization among processes, organizing data structures to improve locality, and scheduling tasks
This step normally depends on the programming model and the underlying architecture

Orchestration

Goal is to
Reduce communication and synchronization costs Maximize locality of data reference Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel Reduce overhead of parallelization and concurrency management (e.g., management of the task queue, overhead of initiating a task etc.)
June-July 2009

63

At this point you have a parallel program Could be specified by the program

Mapping

Just need to decide which and how many processes go to each processor of the parallel machine Pin particular processes to a particular processor for the whole life of the program; the processes cannot migrate to other processors

Could be controlled entirely by the OS

Schedule processes on idle processors
Various scheduling algorithms are possible e.g., round robin: process #k goes to processor #k
A NUMA-aware OS normally takes into account multiprocessor-specific metrics in scheduling
How many processes per processor? Most common is one-to-one

An example
Iterative equation solver
Main kernel in Ocean simulation
Update each 2-D grid point via Gauss-Seidel iterations
A[i,j] = 0.2*(A[i,j]+A[i,j+1]+A[i,j-1]+A[i+1,j]+A[i-1,j])
Pad the n by n grid to (n+2) by (n+2) to avoid corner problems
Update only the interior n by n grid
One iteration consists of updating all n² points in-place and accumulating the difference from the previous value at each point
If the difference is less than a threshold, the solver is said to have converged to a stable grid equilibrium
June-July 2009 65

Sequential program
int n; float **A, diff;
begin main()
  read (n); /* size of grid */
  Allocate (A);
  Initialize (A);
  Solve (A);
end main

begin Solve (A)
  int i, j, done = 0;
  float temp;
  while (!done)
    diff = 0.0;
    for i = 0 to n-1
      for j = 0 to n-1
        temp = A[i,j];
        A[i,j] = 0.2*(A[i,j]+A[i,j+1]+A[i,j-1]+A[i-1,j]+A[i+1,j]);
        diff += fabs (A[i,j] - temp);
      endfor
    endfor
    if (diff/(n*n) < TOL) then done = 1;
  endwhile
end Solve
June-July 2009 66

Decomposition
Look for concurrency in loop iterations
In this case iterations are really dependent Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)

Each anti-diagonal can be computed in parallel
Must synchronize after each anti-diagonal (or pt-to-pt)
Alternative: red-black ordering (different update pattern)

Can update all red points first, synchronize globally with a barrier and then update all black points
May converge faster or slower compared to sequential program Converged equilibrium may also be different if there are multiple solutions Ocean simulation uses this decomposition

Decomposition

We will ignore the loop-carried dependence and go ahead with a straight-forward loop decomposition

Allow updates to all points in parallel This is yet another different update order and may affect convergence Update to a point may or may not see the new updates to the nearest neighbors (this parallel algorithm is nondeterministic) 68 June-July 2009

while (!done)
  diff = 0.0;
  for_all i = 0 to n-1
    for_all j = 0 to n-1
      temp = A[i, j];
      A[i, j] = 0.2*(A[i, j]+A[i, j+1]+A[i, j-1]+A[i-1, j]+A[i+1, j]);
      diff += fabs (A[i, j] - temp);
    end for_all
  end for_all
  if (diff/(n*n) < TOL) then done = 1;
end while

Decomposition

Offers concurrency across elements: degree of concurrency is n²
Make the j loop sequential to have row-wise decomposition: degree n concurrency
June-July 2009

69

Possible static assignment: block row decomposition Another static assignment: cyclic row decomposition Dynamic assignment

Assignment

Process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1 etc.

Process 0 gets rows 0, p, 2p, ...; process 1 gets rows 1, p+1, 2p+1, ...
Grab the next available row, work on that, grab a new row, ...

Static block row assignment minimizes nearest neighbor communication by assigning contiguous rows to the same process
June-July 2009 70

Shared memory version

/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
  LOCKDEC (diff_lock);
  BARDEC (barrier);
  float **A, diff;
} *gm;

int main (int argc, char **argv)
{
  int i;
  MAIN_INITENV;
  gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
  LOCKINIT (gm->diff_lock);
  BARINIT (gm->barrier);
  n = atoi (argv[1]);
  P = atoi (argv[2]);
  gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
  for (i = 0; i < n+2; i++) {
    gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
  }
  Initialize (gm->A);
  for (i = 1; i < P; i++) { /* starts at 1 */
    CREATE (Solve);
  }
  Solve ();
  WAIT_FOR_END (P-1);
  MAIN_END;
}

void Solve (void)
{
  int i, j, pid, done = 0;
  float temp, local_diff;
  GET_PID (pid);
  while (!done) {
    local_diff = 0.0;
    if (!pid) gm->diff = 0.0;
    BARRIER (gm->barrier, P); /* why? */
    for (i = pid*(n/P); i < (pid+1)*(n/P); i++) {
      for (j = 0; j < n; j++) {
        temp = gm->A[i][j];
        gm->A[i][j] = 0.2*(gm->A[i][j] + gm->A[i][j-1] + gm->A[i][j+1] + gm->A[i+1][j] + gm->A[i-1][j]);
        local_diff += fabs (gm->A[i][j] - temp);
      } /* end for */
    } /* end for */
    LOCK (gm->diff_lock);
    gm->diff += local_diff;
    UNLOCK (gm->diff_lock);
    BARRIER (gm->barrier, P);
    if (gm->diff/(n*n) < TOL) done = 1;
    BARRIER (gm->barrier, P); /* why? */
  } /* end while */
}

Shared memory version

Use LOCK/UNLOCK around critical sections

Mutual exclusion

Updates to shared variable diff must be sequential Heavily contended locks may degrade performance Try to minimize the use of critical sections: they are sequential anyway and will limit speedup This is the reason for using a local_diff instead of accessing gm->diff every time Also, minimize the size of critical section because the longer you hold the lock, longer will be the waiting time for other processors at lock acquire

June-July 2009

73

LOCK optimization
Suppose each processor updates a shared variable holding a global cost value, only if its local cost is less than the global cost: found frequently in minimization problems
/* Naive version:
   may lead to heavy lock contention if everyone tries to update at the same time */
LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
  gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);

/* Optimized version: test before locking.
   This works because gm->cost is monotonically decreasing */
if (my_cost < gm->cost) {
  LOCK (gm->cost_lock);
  if (my_cost < gm->cost) { /* make sure */
    gm->cost = my_cost;
  }
  UNLOCK (gm->cost_lock);
}

June-July 2009

74

More synchronization
Global synchronization
Through barriers Often used to separate computation phases

Point-to-point synchronization
A process directly notifies another about a certain event on which the latter was waiting
Producer-consumer communication pattern
Semaphores are used for concurrent programming on uniprocessors through P and V functions
Normally implemented through flags on shared memory multiprocessors (busy wait or spin)
P0: A = 1; flag = 1;
P1: while (!flag); use (A);
June-July 2009 75

What is different from shared memory?

Message passing

Grid solver example

No shared variable: expose communication through send/receive No lock or barrier primitive Must implement synchronization through send/receive

P0 allocates and initializes matrix A in its local memory
Then it sends the block rows, n, P to each processor, i.e. P1 waits to receive rows n/P to 2n/P-1 etc. (this is one-time)
Within the while loop the first thing that every processor does is to send its first and last rows to the upper and the lower processors (corner cases need to be handled)
Then each processor waits to receive the neighboring two rows from the upper and the lower processors
June-July 2009 76

Message passing
At the end of the loop each processor sends its local_diff to P0 and P0 sends back the accumulated diff so that each processor can locally compute the done flag

June-July 2009

77

Major changes
[These two slides repeat the shared memory main() and Solve() above with the message passing changes marked:
in main(), the grid is allocated locally in each process (Local Alloc.) instead of in shared memory;
in Solve(), a process with pid != 0 first receives its block of rows along with n and P (Recv rows, n, P);
each iteration it sends its boundary rows up/down and receives the neighboring rows (Send up/down, Recv up/down);
the lock/barrier-protected global diff is replaced by sending local_diff to P0 and receiving the accumulated diff back (Send local_diff to P0, Recv diff).]

Message passing
This algorithm is deterministic May converge to a different solution compared to the shared memory version if there are multiple solutions: why?
There is a fixed specific point in the program (at the beginning of each iteration) when the neighboring rows are communicated This is not true for shared memory

June-July 2009

80

Message Passing Grid Solver

MPI-like environment
MPI stands for Message Passing Interface PVM (Parallel Virtual Machine) is another wellknown platform for message passing programming Background in MPI is not necessary for understanding this lecture Only need to know
When you start an MPI program every thread runs the same main function We will assume that we pin one thread to one processor just as we did in shared memory A C library that provides a set of message passing primitives (e.g., send, receive, broadcast etc.) to the user

Instead of using the exact MPI syntax we will use some macros that call the MPI functions

MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97

int main(int argc, char **argv)
{
  int pid, P, done, i, j, N;
  float tempdiff, local_diff, temp, **A;

  MAIN_INITENV;
  GET_PID(pid);
  GET_NUMPROCS(P);
  N = atoi(argv[1]);
  tempdiff = 0.0;
  done = 0;
  A = (float **) malloc ((N/P+2) * sizeof(float *));
  for (i=0; i < N/P+2; i++) {
    A[i] = (float *) malloc (sizeof(float) * (N+2));
  }
  initialize(A);

  while (!done) {
    local_diff = 0.0;
    /* MPI_CHAR means raw byte format */
    if (pid) { /* send my first row up */
      SEND(&A[1][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
    }
    if (pid != P-1) { /* recv last row */
      RECV(&A[N/P+1][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
    }
    if (pid != P-1) { /* send last row down */
      SEND(&A[N/P][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
    }
    if (pid) { /* recv first row from above */
      RECV(&A[0][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
    }
    for (i=1; i <= N/P; i++)
      for (j=1; j <= N; j++) {
        temp = A[i][j];
        A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
        local_diff += fabs(A[i][j] - temp);
      }


Performance Issues
85

June-July 2009

Agenda
Partitioning for performance Data access and communication Summary Goal is to understand simple trade-offs involved in writing a parallel program keeping an eye on parallel performance
Getting good performance out of a multiprocessor is difficult
Programmers need to be careful
A little carelessness may lead to extremely poor performance

Partitioning for perf.


Partitioning plays an important role in the parallel performance
This is where you essentially determine the tasks

A good partitioning should achieve


Load balance Minimal communication Low overhead to determine and manage task assignment (sometimes called extra work)

A well-balanced parallel program automatically has low barrier or point-to-point synchronization time
Ideally I want all the threads to arrive at a barrier at the same time
June-July 2009 87

Achievable speedup is bounded above by

Load balancing

What leads to a high variance?

Sequential exec. time / Max. time for any processor Thus speedup is maximized when the maximum time and minimum time across all processors are close (want to minimize the variance of parallel execution time) This directly gets translated to load balancing

Ultimately all processors finish at the same time But some do useful work all over this period while others may spend a significant time at synchronization points This may arise from a bad partitioning There may be other architectural reasons for load imbalance beyond the scope of a programmer e.g., network congestion, unforeseen cache conflicts etc. (slows down a few threads)
June-July 2009 88

Dynamic task queues


Introduced in the last lecture Normally implemented as part of the parallel program Two possible designs
Centralized task queue: a single queue of tasks; may lead to heavy contention because insertion and deletion to/from the queue must be critical sections Distributed task queues: one queue per processor

Issue with distributed task queues


When a queue of a particular processor is empty what does it do? Task stealing
June-July 2009 89

A processor may choose to steal tasks from another processor's queue if the former's queue is empty

Task stealing

How many tasks to steal? Whom to steal from? The biggest question: how to detect termination? Really a distributed consensus! Task stealing, in general, may increase overhead and communication, but a smart design may lead to excellent load balance (normally hard to design efficiently) This is a form of a more general technique called Receiver Initiated Diffusion (RID) where the receiver of the task initiates the task transfer In Sender Initiated Diffusion (SID) a processor may choose to insert into another processors queue if the formers task queue is full above a threshold
June-July 2009 90

Normally load balancing is a responsibility of the programmer

Architects job

However, an architecture may provide efficient primitives to implement task queues and task stealing For example, the task queue may be allocated in a special shared memory segment, accesses to which may be optimized by special hardware in the memory controller But this may expose some of the architectural features to the programmer There are multiprocessors that provide efficient implementations for certain synchronization primitives; this may improve load balance Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow threads dynamically
June-July 2009 91

Partitioning and communication


Need to reduce inherent communication
This is the part of communication determined by assignment of tasks There may be other communication traffic also (more later)

Goal is to assign tasks such that accessed data are mostly local to a process
Ideally I do not want any communication But in life sometimes you need to talk to people to get some work done!
June-July 2009 92

Domain decomposition
Normally applications show a local bias on data usage
Communication is short-range e.g. nearest neighbor Even if it is long-range it falls off with distance View the dataset of an application as the domain of the problem e.g., the 2-D grid in equation solver If you consider a point in this domain, in most of the applications it turns out that this point depends on points that are close by Partitioning can exploit this property by assigning contiguous pieces of data to each process Exact shape of decomposed domain depends on the application and load balancing requirements
June-July 2009 93

Comm-to-comp ratio
Surely, there could be many different domain decompositions for a particular problem
For the grid solver we may have a square block decomposition, block row decomposition or cyclic row decomposition
How to determine which one is good? Communication-to-computation ratio
Assume P processors and an NxN grid for the grid solver
[Figure: square block decomposition for P=16 — the grid is split into a 4x4 arrangement of blocks assigned to P0 through P15.]
Size of each block: N/√P by N/√P
Communication (perimeter): 4N/√P
Computation (area): N²/P
Comm-to-comp ratio = 4√P/N

Comm-to-comp ratio
For block row decomposition
Each strip has N/P rows
Communication (boundary rows): 2N
Computation (area): N²/P (same as square block)
Comm-to-comp ratio: 2P/N

For cyclic row decomposition
Each processor gets N/P isolated rows
Communication: 2N²/P
Computation: N²/P
Comm-to-comp ratio: 2

Normally N is much much larger than P


Asymptotically, square block yields the lowest comm-to-comp ratio

Idea is to measure the volume of inherent communication per computation

Comm-to-comp ratio
In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp ratio But depends on the application structure i.e. picking the lowest comm-to-comp may have other problems Normally this ratio gives you a rough estimate about average communication bandwidth requirement of the application i.e. how frequent is communication But it does not tell you the nature of communication i.e. bursty or uniform For grid solver comm. happens only at the start of each iteration; it is not uniformly distributed over computation Thus the worst case BW requirement may exceed the average comm-to-comp ratio
June-July 2009 96
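For a sanity check of the three ratios derived above, they can be computed for concrete (illustrative) N and P:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N = 4096, P = 16;                   /* illustrative sizes  */
    double comp = N * N / P;                   /* area per processor  */

    printf("square block: %g\n", (4 * N / sqrt(P)) / comp);   /* = 4*sqrt(P)/N */
    printf("block row   : %g\n", (2 * N) / comp);             /* = 2*P/N       */
    printf("cyclic row  : %g\n", (2 * N * N / P) / comp);     /* = 2           */
    return 0;
}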

Extra work in a parallel version of a sequential program may result from


Decomposition Assignment techniques Management of the task pool etc.

Extra work

Speedup is bounded above by Sequential work / Max (Useful work + Synchronization + Comm. cost + Extra work) where the Max is taken over all processors But this is still incomplete
We have only considered communication cost from the viewpoint of the algorithm and ignored the architecture completely

Data access and communication


The memory hierarchy (caches and main memory) plays a significant role in determining communication cost For uniprocessor, the execution time of a program is given by useful work time + data access time
May easily dominate the inherent communication of the algorithm

Useful work time is normally called the busy time or busy cycles
Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality

In multiprocessors

Data access

Every processor wants to see the memory interface as its own local cache and the main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still caches of other processors; if the memory is distributed then some part of it is local and some is remote
For shared memory, data movement from local or remote memory to cache is transparent while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy where the extension includes caches of other processors, remote memory modules and the network topology

Communication caused by artifacts of extended memory hierarchy


Data accesses not satisfied in the cache or local memory cause communication Inherent communication is caused by data transfers determined by the program Artifactual communication is caused by poor allocation of data across distributed memories, unnecessary data in a transfer, unnecessary transfers due to system-dependent transfer granularity, redundant communication of data, finite replication capacity (in cache or memory)

Artifactual comm.

Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred

Capacity problem
Most probable reason for artifactual communication
Due to finite capacity of cache, local memory or remote memory May view a multiprocessor as a three-level memory hierarchy for this purpose: local cache, local memory, remote memory Communication due to cold or compulsory misses and inherent communication are independent of capacity Capacity and conflict misses generate communication resulting from finite capacity Generated traffic may be local or remote depending on the allocation of pages General technique: exploit spatial and temporal locality to use the cache properly
June-July 2009 101

Temporal locality
Maximize reuse of data
Schedule tasks that access same data in close succession Many linear algebra kernels use blocking of matrices to improve temporal (and spatial) locality Example: Transpose phase in Fast Fourier Transform (FFT); to improve locality, the algorithm carries out blocked transpose i.e. transposes a block of data at a time

Block transpose
June-July 2009 102
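A sketch of a blocked transpose in the spirit of the FFT example above; BLK is a placeholder that would be tuned to the cache size:

#define BLK 64   /* block edge; would be tuned to the cache size (placeholder) */

/* Transpose the n x n matrix src into dst one BLK x BLK block at a time,
   so each block is reused while it is still resident in the cache. */
void blocked_transpose(int n, const float *src, float *dst)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int jj = 0; jj < n; jj += BLK)
            for (int i = ii; i < ii + BLK && i < n; i++)
                for (int j = jj; j < jj + BLK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}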

Spatial locality
Consider a square block decomposition of grid solver and a C-like row major layout i.e. A[i][j] and A[i][j+1] have contiguous memory locations
[Figure: memory allocation with the row-major 2D layout — pages and cache lines straddle partition boundaries, so a cache line or page may span data belonging to different partitions.]

The same page is local to a processor while remote to others; same applies to straddling cache lines. Ideally, I want to have all pages within a partition local to a single processor. The standard trick is to convert the 2D array to 4D.

Essentially you need to change the way memory is allocated

2D to 4D conversion

The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous
The first two dimensions of the new 4D matrix are the block row and column indices, i.e. for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors)
The next two dimensions hold the data elements within that partition
Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P]
The element B[3][2][5][10] corresponds to the element in 10th column, 5th row of the partition of P14
Now all elements within a partition have contiguous addresses
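A hedged sketch of the 2D-to-4D allocation (assuming P is a perfect square and N is divisible by √P); the index arithmetic in the comment shows how a global (i,j) maps into the 4D layout:

#include <stdlib.h>
#include <math.h>

/* Allocate the grid as sqrt(P) x sqrt(P) partitions, each one a contiguous
   (N/sqrt(P)) x (N/sqrt(P)) chunk, so that a partition's pages can all be
   placed local to its processor. Element (i,j) of the global grid is
   B[i/bs][j/bs][i%bs][j%bs] where bs = N/sqrt(P). */
float ****alloc_4d(int N, int P)
{
    int sp = (int) sqrt((double) P), bs = N / sp;
    float ****B = malloc(sp * sizeof(float ***));
    for (int bi = 0; bi < sp; bi++) {
        B[bi] = malloc(sp * sizeof(float **));
        for (int bj = 0; bj < sp; bj++) {
            float *chunk = malloc((size_t) bs * bs * sizeof(float)); /* contiguous partition */
            B[bi][bj] = malloc(bs * sizeof(float *));
            for (int i = 0; i < bs; i++)
                B[bi][bj][i] = chunk + i * bs;
        }
    }
    return B;
}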

Transfer granularity
How much data do you transfer in one communication?
For message passing it is explicit in the program For shared memory this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of cache hierarchy)

In shared memory you have to be careful


Since the minimum transfer size is a cache line you may end up transferring extra data e.g., in grid solver the elements of the left and right neighbors for a square block decomposition (you need only one element, but must transfer the whole cache line): no good solution
June-July 2009 105

Worse: false sharing


If the algorithm is designed so poorly that
Two processors write to two different words within a cache line at the same time The cache line keeps on moving between two processors The processors are not really accessing or updating the same element, but whatever they are updating happen to fall within a cache line: not a true sharing, but false sharing For shared memory programs false sharing can easily degrade performance by a lot Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance)
June-July 2009 106
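One common way to apply the padding fix just described: give each processor's accumulator its own cache line. The 64-byte line size and the GCC alignment attribute are assumptions for the sketch:

#define LINE 64                          /* assumed cache line size */

/* One accumulator per processor, padded so no two ever share a line. */
struct padded_counter {
    volatile long value;
    char pad[LINE - sizeof(long)];
};

struct padded_counter counters[16] __attribute__((aligned(LINE)));  /* 16 = example P */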

It is very easy to ignore contention effects when designing algorithms


Can severely degrade performance by creating hot-spots

Contention

Location hot-spot:
Consider accumulating a global variable; the accumulation takes place on a single node i.e. all nodes access the variable allocated on that particular node whenever they try to increment it
CA on this node becomes a bottleneck

June-July 2009

Scalable tree accumulation

107

Hot-spots
Avoid a location hot-spot by either staggering accesses to the same location or by designing the algorithm to exploit a tree-structured communication (a sketch follows below)
Module hot-spot
Normally happens when a particular node saturates handling too many messages (need not be to the same memory location) within a short amount of time
Normal solution again is to design the algorithm in such a way that these messages are staggered over time
Rule of thumb: design the communication pattern such that it is not bursty; want to distribute it uniformly over time
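A rough sketch of tree-structured accumulation in C (my illustration, not the course's code): every processor reduces its own data first, and partial sums are then combined pairwise up a binary tree in log2(P) steps, so no single location becomes a hot-spot. P, partial, ready and pid are assumed names; memory fences and flag re-initialization are omitted for brevity.

#define P 16   /* number of processors, assumed to be a power of two */

volatile double partial[P];   /* per-processor partial sums (ideally line-padded) */
volatile int    ready[P];     /* ready[i] set once processor i has folded in its subtree */

void tree_accumulate(int pid, double my_sum)
{
    int stride;
    partial[pid] = my_sum;
    /* keep combining as long as this processor is the left child at the current level */
    for (stride = 1; stride < P && (pid % (2 * stride)) == 0; stride *= 2) {
        while (!ready[pid + stride])
            ;                               /* wait for the right child's subtree */
        partial[pid] += partial[pid + stride];
    }
    ready[pid] = 1;                         /* publish this subtree's sum to the parent */
    /* processor 0 leaves the loop with partial[0] holding the global sum */
}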

Overlap
Increase overlap between communication and computation
Not much to do at algorithm level unless the programming model and/or OS provide some primitives to carry out prefetching, block data transfer, non-blocking receive etc. Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit overlap)

June-July 2009

109

Summary
Parallel programs introduce three overhead terms: busy overhead (extra work), remote data access time, and synchronization time
Goal of a good parallel program is to minimize these three terms Goal of a good parallel computer architecture is to provide sufficient support to let programmers optimize these three terms (and this is the focus of the rest of the course)

June-July 2009

110

Shared Memory Multiprocessors

Shared cache

Four organizations
(Figure: P0 ... Pn connected through a switch to an interleaved shared cache and interleaved memory)
The switch is a simple controller for granting access to cache banks
Interconnect is between the processors and the shared cache
Which level of the cache hierarchy is shared depends on the design: chip multiprocessors today normally share the outermost level (L2 or L3 cache)
The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent accesses
Normally small scale due to heavy bandwidth demand on the switch and the shared cache
Four organizations
Bus-based SMP
(Figure: P0 ... Pn with private cache hierarchies connected by a shared bus to memory)
Interconnect is a shared bus located between the private cache hierarchies and the memory controller
The most popular organization for small to medium-scale servers
Possible to connect 30 or so processors with smart bus design
Scalability is limited by the shared bus bandwidth
Why? Bus bandwidth requirement is lower compared to the shared cache approach
Four organizations
Dancehall
(Figure: P0 ... Pn with private caches connected through a scalable interconnect to memory banks)
Better scalability compared to the previous two designs
The difference from bus-based SMP is that the interconnect is a scalable point-to-point network (e.g. crossbar or other topology)
Memory is still symmetric from all processors
Drawback: a cache miss may take a long time since all memory banks are far off from the processors (may be several network hops)
Four organizations
Distributed shared memory
(Figure: P0 ... Pn, each node with its own cache and local memory banks, connected by a scalable interconnect)
The most popular scalable organization
Each node now has local memory banks
Shared memory on other nodes must be accessed over the network
Non-uniform memory access (NUMA): latency to access local memory is much smaller compared to remote memory
Caching is very important to reduce remote memory accesses

In all four organizations caches play an important role in reducing latency and bandwidth requirement
If an access is satisfied in cache, the transaction will not appear on the interconnect and hence the bandwidth requirement of the interconnect will be less (shared L1 cache does not have this advantage)

Four organizations

In distributed shared memory (DSM) cache and local memory should be used cleverly Bus-based SMP and DSM are the two designs supported today by industry vendors
In bus-based SMP every cache miss is launched on the shared bus so that all processors can see all transactions In DSM this is not the case
June-July 2009 116

Hierarchical design
Possible to combine bus-based SMP and DSM to build hierarchical shared memory
Sun Wildfire connects four large SMPs (28 processors) over a scalable interconnect to form a 112p multiprocessor IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); connect these chips over a network to form scalable multiprocessors

Next few lectures will focus on bus-based SMPs only


June-July 2009

117

Cache Coherence
Intuitive memory model
For sequential programs we expect a memory location to return the latest value written to that location For concurrent programs running on multiple threads or processes on a single processor we expect the same model to hold because all threads see the same cache hierarchy (same as shared L1 cache) For multiprocessors there remains a danger of using a stale value: in SMP or DSM the caches are not shared and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and they satisfy programmers intuitive memory model
June-July 2009 118

Example
Assume a write-through cache i.e. every store updates the value in cache as well as in memory
P0: reads x from memory, puts it in its cache, and gets the value 5 P1: reads x from memory, puts it in its cache, and gets the value 5 P1: writes x=7, updates its cached value and memory value P0: reads x from its cache and gets the value 5 P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent) P2: writes x=10, updates its cached value and memory value
June-July 2009 119

Example
Consider the same example with a writeback cache i.e. values are written back to memory only when the cache line is evicted from the cache
P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since caches are not write through) The state of the line in P1 and P2 is M while the line in P0 is clean Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from P0 will not issue a writeback (clean lines do not need writeback) Suppose P2 evicts the line first, and then P1 Final memory value is 7: we lost the store x=10 from P2
June-July 2009 120

What went wrong?


For write through cache
The memory value may be correct if the writes are correctly ordered But the system allowed a store to proceed when there is already a cached copy Lesson learned: must invalidate all cached copies before allowing a store to proceed

Writeback cache
Problem is even more complicated: stores are no longer visible to memory immediately Writeback order is important Lesson learned: do not allow more than one copy of a cache line in M state
June-July 2009 121

What went wrong?


Need to formalize the intuitive memory model
In sequential programs the order of read/write is defined by the program order; the notion of last write is welldefined For multiprocessors how do you define last write to a memory location in presence of independent caches? Within a processor it is still fine, but how do you order read/write across processors?

June-July 2009

122

Definitions
Memory operation: a read (load), a write (store), or a read-modify-write
A memory operation is said to issue when it leaves the issue queue and looks up the cache
A memory operation is said to perform with respect to a processor when, as far as that processor can tell, it has taken place relative to other issued memory operations
Assumed to take place atomically

A read is said to perform with respect to a processor when subsequent writes issued by that processor cannot affect the returned read value A write is said to perform with respect to a processor when a subsequent read from that processor to the same address returns the new value
June-July 2009 123

Ordering memory op
A memory operation is said to complete when it has performed with respect to all processors in the system Assume that there is a single shared memory and no caches
Memory operations complete in shared memory when they access the corresponding memory locations Operations from the same processor complete in program order: this imposes a partial order among the memory operations Operations from different processors are interleaved in such a way that the program order is maintained for each processor: memory imposes some total order (many are 124 June-July 2009 possible)

Example
P0: x=8; u=y; v=9; P1: r=5; y=4; t=v; Legal total order: x=8; u=y; r=5; y=4; t=v; v=9; Another legal total order: x=8; r=5; y=4; u=y; v=9; t=v; Last means the most recent in some legal total order A system is coherent if
Reads get the last written value in the total order All processors see writes to a location in the same order
June-July 2009 125

Cache coherence: formal definition
A memory system is coherent if the values returned by reads to a memory location during an execution of a program are such that all operations to that location can form a hypothetical total order that is consistent with the serial order and has the following two properties:
1. Operations issued by any particular processor perform according to the issue order
2. The value returned by a read is the value written to that location by the last write in the total order
Two necessary features that follow from the above:
A. Write propagation: writes must eventually become visible to all processors
B. Write serialization: every processor should see the writes to a location in the same order (if I see w1 before w2, you should not see w2 before w1)

Extend the philosophy of uniprocessor bus transactions

Bus-based SMP

Three phases: arbitrate for bus, launch command (often called request) and address, transfer data Every device connected to the bus can observe the transaction Appropriate device responds to the request In SMP, processors also observe the transactions and may take appropriate actions to guarantee coherence The other device on the bus that will be of interest to us is the memory controller (north bridge in standard mother boards) Depending on the bus transaction a cache block executes a finite state machine implementing the coherence protocol
June-July 2009 127

Cache coherence protocols implemented in bus-based machines are called snoopy protocols

Snoopy protocols

The processors snoop or monitor the bus and take appropriate protocol actions based on snoop results
Cache controller now receives requests both from the processor and the bus
Since cache state is maintained on a per-line basis, that also dictates the coherence granularity
Cannot normally take a coherence action on parts of a cache line
The coherence protocol is implemented as a finite state machine on a per cache line basis
The snoop logic in each processor grabs the address from the bus and decides if any action should be taken on the cache line containing that address (only if the line is in cache)

Write through caches


There are only two cache line states
Invalid (I): not in cache Valid (V): present in cache, may be present in other caches also

Read access to a cache line in I state generates a BusRd request on the bus
Memory controller responds to the request and after reading from memory launches the line on the bus
Requester matches the address and picks up the line from the bus and fills the cache in V state
A store to a line always generates a BusWr transaction on the bus (since write-through); other sharers either invalidate the line in their caches or update the line with the new value
State transition
The finite state machine for each cache line (A/B means: A is generated by the processor, B is the resulting bus transaction, if any):
I -> V on PrRd/BusRd
V: PrRd/- (read hit)
I: PrWr/BusWr (on a write miss no line is allocated, so the state stays I)
V: PrWr/BusWr (write-through)
V -> I on an observed BusWr (snoop)
Changes for write-through write-allocate?

The state remains at I: this policy is called write-through write-no-allocate
Ordering memory op
Assume that the bus is atomic: it takes up the next transaction only after finishing the previous one
Read misses and writes appear on the bus and hence are visible to all processors
What about read hits? They take place transparently in the cache
But they are correct as long as they are correctly ordered with respect to writes
And all writes appear on the bus and hence are visible immediately in the presence of an atomic bus
In general, in between writes reads can happen in any order without violating coherence
Writes establish a partial order
Write-through is bad
High bandwidth requirement
Every write appears on the bus
Assume a 3 GHz processor running an application with 10% store instructions, and assume a CPI of 1
If the application runs for 100 cycles it generates 10 stores; assume each store is 4 bytes; 40 bytes are generated per 100/3 ns i.e. a BW of 1.2 GB/s
A 1 GB/s bus cannot even support one processor
There are multiple processors and also there are read misses
Writeback caches absorb most of the write traffic
Writes that hit in cache do not go on the bus (not visible to others)
Complicated coherence protocol with many choices
Need a more formal description of memory ordering

Memory consistency
How to establish the order between reads and writes from different processors?

The most clear way is to use synchronization P0: A=1; flag=1 P1: while (!flag); print A; Another example (assume A=0, B=0 initially) P0: A=1; print B; P1: B=1; print A;
What do you expect?

Memory consistency model is a contract between programmer and hardware regarding memory ordering

Consistency model
A multiprocessor normally advertises the supported memory consistency model
This essentially tells the programmer what the possible correct outcome of a program could be when run on that machine Cache coherence deals with memory operations to the same location, but not different locations Without a formally defined order across all memory operations it often becomes impossible to argue about what is correct and what is wrong in shared memory

Various memory consistency models


Sequential consistency (SC) is the most intuitive one and we will focus on it now (more consistency models later)
June-July 2009 134
Sequential consistency
Total order achieved by interleaving accesses from different processors
The accesses from the same processor are presented to the memory system in program order
Essentially, behaves like a randomly moving switch connecting the processors to memory: it picks the next access from a randomly chosen processor
Lamport's definition of SC: a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
What is program order?
The program order is the order of instructions from a sequential piece of code where the programmer's intuition is preserved
Any legal re-ordering is allowed? The order must produce the result a programmer expects
Can out-of-order execution violate program order? No. All microprocessors commit instructions in-order and that is where the state becomes visible; for modern microprocessors the program order is really the commit order
Can out-of-order (OOO) execution violate SC? Yes. Need extra logic to support SC on top of OOO
OOO and SC
Consider a simple example (all variables are zero initially)
P0: x=w+1; r=y+1;
P1: y=2; w=y+1;
Suppose the load that reads w takes a miss and so w is not ready for a long time; therefore, x=w+1 cannot complete immediately; eventually w returns with value 3
Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and gets the old value of y (possibly from cache); eventually instructions commit in order with x=4, r=1, y=2, w=3
So we have the following partial orders: P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1
Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2
Combine these to get a contradictory total order
What went wrong? We will discuss it in detail later

SC example
Consider the following example Possible outcomes for an SC machine
P0: A=1; print B; P1: B=1; print A;

(A, B) = (0,1); interleaving: B=1; print A; A=1; print B (A, B) = (1,0); interleaving: A=1; print B; B=1; print A (A, B) = (1,1); interleaving: A=1; B=1; print A; print B A=1; B=1; print B; print A (A, B) = (0,0) is impossible: read of A must occur before write of A and read of B must occur before write of B i.e. print A < A=1 and print B < B=1, but A=1 < print B and B=1 < print A; thus print B < B=1 < print A < A=1 < print B which implies print B < print B, a contradiction
June-July 2009 138

Implementing SC
Two basic requirements
Memory operations issued by a processor must become visible to others in program order Need to make sure that all processors see the same total order of memory operations: in the previous example for the (0,1) case both P0 and P1 should see the same interleaving: B=1; print A; A=1; print B

The tricky part is to make sure that writes become visible in the same order to all processors
Write atomicity: as if each write is an atomic operation Otherwise, two processors may end up using different values (which may still be correct from the viewpoint of cache coherence, but will violate SC)
June-July 2009 139

Write atomicity
Example (A=0, B=0 initially)
P0: A=1; P1: while (!A); B=1; P2: while (!B); print A;

A correct execution on an SC machine should print A=1

A=0 will be printed only if write to A is not visible to P2, but clearly it is visible to P1 since it came out of the loop Thus A=0 is possible if P1 sees the order A=1 < B=1 and P2 sees the order B=1 < A=1 i.e. from the viewpoint of the whole system the write A=1 was not atomic Without write atomicity P2 may proceed to print 0 with a stale value from its cache
June-July 2009 140

Program order from each processor creates a partial order among memory operations Interleaving of these partial orders defines a total order Sequential consistency: one of many total orders A multiprocessor is said to be SC if any execution on this machine is SC compliant Sufficient but not necessary conditions for SC

Summary of SC

Issue memory operation in program order Every processor waits for write to complete before issuing the next operation Every processor waits for read to complete and the write that affects the returned value to complete before issuing the next operation (important for write atomicity)
June-July 2009 141
Back to shared bus
Centralized shared bus makes it easy to support SC
Writes and reads are all serialized in a total order through the bus transaction ordering
If a read gets the value of a previous write, that write is guaranteed to be complete because that bus transaction is complete
The write order seen by all processors is the same in a write-through system because every write causes a transaction and hence is visible to all in the same order
In a nutshell, every processor sees the same total bus order for all memory operations and therefore any bus-based SMP with write-through caches is SC
What about a multiprocessor with writeback caches?
No SMP uses a write-through protocol due to high BW

No change to processor or cache

Snoopy protocols

We will focus on writeback caches only

Just extend the cache controller with snoop logic and exploit the bus
Possible states of a cache line: Invalid (I), Shared (S), Modified or dirty (M), Clean exclusive (E), Owned (O); not every processor supports all five states
E state is equivalent to M in the sense that the line has permission to write, but in E state the line is not yet modified and the copy in memory is the same as in cache; if someone else requests the line, the memory will provide the line
O state is exactly the same as E state but in this case memory is not responsible for servicing requests to the line; the owner must supply the line (just as in M state)
Stores really read the memory (as opposed to write)

Stores
Look at stores a little more closely
There are three situations at the time a store issues: the line is not in the cache, the line is in the cache in S state, or the line is in the cache in one of the M, E and O states
If the line is in I state, the store generates a read-exclusive request on the bus and gets the line in M state
If the line is in S or O state, that means the processor only has read permission for that line; the store generates an upgrade request on the bus and the upgrade acknowledgment gives it the write permission (this is a data-less transaction)
If the line is in M or E state, no bus transaction is generated; the cache already has write permission for the line (this is the case of a write hit; the previous two are write misses)

Invalidation vs. update


Two main classes of protocols:
Invalidation-based and update-based Dictates what action should be taken on a write Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX) appears on the bus Update-based protocols update the sharer caches with new value on a write: requires write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches) Advantage of update-based protocols: sharers continue to hit in the cache while in invalidation-based protocols sharers will miss next time they try to access the line Advantage of invalidation-based protocols: only write misses go on bus (suited for writeback caches) and subsequent stores to the same line are cache hits
June-July 2009 145
Which one is better?
Difficult to answer
Depends on program behavior and hardware cost
When is an update-based protocol good?
What sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates
When is an invalidation-based protocol good?
Sequence of multiple writes to a cache line: saves intermediate write transactions
Also think about the overhead of initiating small updates for every write in update protocols
Invalidation-based protocols are much more popular
Some systems support both or maybe some hybrid based on the dynamic sharing pattern of a cache line

MSI protocol
Forms the foundation of invalidation-based writeback protocols
Assumes only three supported cache line states: I, S, and M There may be multiple processors caching a line in S state There must be exactly one processor caching a line in M state and it is the owner of the line If none of the caches have the line, memory must have the most up-to-date copy of the line

Processor requests to cache: PrRd, PrWr Bus transactions: BusRd, BusRdX, BusUpgr, BusWB June-July 2009

147
State transition
The MSI finite state machine per cache line (processor or bus event / resulting action):
I -> S on PrRd/BusRd
I -> M on PrWr/BusRdX
S -> M on PrWr/BusUpgr
S: PrRd/- (hit)
S -> I on {BusRdX, BusUpgr}/- (snoop) or on CacheEvict/-
M: PrRd/-, PrWr/- (hits)
M -> S on BusRd/Flush
M -> I on BusRdX/Flush or on CacheEvict/BusWB
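The snoop side of these transitions can be written down compactly; the following C fragment is only a schematic encoding of the diagram above (state and request names are illustrative), not an actual controller implementation.

typedef enum { STATE_I, STATE_S, STATE_M } msi_state;
typedef enum { BUS_RD, BUS_RDX, BUS_UPGR } bus_req;

/* Next state of one cache line when a bus request from another processor
   is snooped; *flush tells whether this cache must source the line. */
msi_state msi_snoop(msi_state cur, bus_req req, int *flush)
{
    *flush = 0;
    switch (cur) {
    case STATE_M:
        *flush = 1;                                  /* owner supplies the line       */
        return (req == BUS_RD) ? STATE_S : STATE_I;
    case STATE_S:
        return (req == BUS_RD) ? STATE_S : STATE_I;  /* BusRdX/BusUpgr invalidate     */
    default:
        return STATE_I;                              /* not cached: nothing to do     */
    }
}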

MSI protocol
Few things to note
Flush operation essentially launches the line on the bus Processor with the cache line in M state is responsible for flushing the line on bus whenever there is a BusRd or BusRdX transaction generated by some other processor On BusRd the line transitions from M to S, but not M to I. Why? Also at this point both the requester and memory pick up the line from the bus; the requester puts the line in its cache in S state while memory writes the line back. Why does memory need to write back? On BusRdX the line transitions from M to I and this time memory does not need to pick up the line from bus. Only the requester picks up the line and puts it in M state in its cache. Why? 149 June-July 2009

BusRd takes a cache line in M state to S state

M to S, or M to I?

The assumption here is that the processor will read it soon, so save a cache miss by going to S
May not be good if the sharing pattern is migratory: P0 reads and writes cache line A, then P1 reads and writes cache line A, then P2 ...
For migratory patterns it makes sense to go to I state so that a future invalidation is saved
But for bus-based SMPs it does not matter much because an upgrade transaction will be launched anyway by the next writer, unless there is special hardware support to avoid that: how?
The big problem is that the sharing pattern for a cache line may change dynamically: adaptive protocols are good and are supported by Sequent Symmetry and MIT Alewife

MSI example Take the following example

P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
Assume the state of the cache line containing the address of x is I in all processors
P0 generates BusRd, memory provides the line, P0 puts the line in S state
P1 generates BusRd, memory provides the line, P1 puts the line in S state
P1 generates BusUpgr, P0 snoops and invalidates the line, memory does not respond, P1 sets the state of the line to M
P0 generates BusRd, P1 flushes the line and goes to S state, P0 puts the line in S state, memory writes back
P2 generates BusRd, memory provides the line, P2 puts the line in S state
P3 generates BusRdX, P0, P1 and P2 snoop and invalidate, memory provides the line, P3 puts the line in its cache in M state

MESI protocol
The most popular invalidation-based protocol e.g., appears in Intel Xeon MP Why need E state?
The MSI protocol requires two transactions to go from I to M even if there is no intervening request for the line: BusRd followed by BusUpgr
We can save one transaction by having the memory controller respond to the first BusRd with E state if there is no other sharer in the system
How to know if there is no other sharer? Needs a dedicated control wire that gets asserted by a sharer (wired OR)
Processor can write to a line in E state silently and take it to M state
State transition
The MESI finite state machine per cache line (S in BusRd(S) is the shared signal):
I -> S on PrRd/BusRd(S)
I -> E on PrRd/BusRd(!S)
I -> M on PrWr/BusRdX
S: PrRd/-; BusRd/Flush (line may be sourced from cache)
S -> M on PrWr/BusUpgr
S -> I on {BusRdX, BusUpgr}/Flush or on CacheEvict/-
E: PrRd/-
E -> M on PrWr/- (silent)
E -> S on BusRd/Flush
E -> I on BusRdX/Flush or on CacheEvict/-
M: PrRd/-, PrWr/-
M -> S on BusRd/Flush
M -> I on BusRdX/Flush or on CacheEvict/BusWB

MESI protocol
If a cache line is in M state definitely the processor with the line is responsible for flushing it on the next BusRd or BusRdX transaction If a line is not in M state who is responsible?
Memory or other caches in S or E state? Original Illinois MESI protocol assumed cache-to-cache transfer i.e. any processor in E or S state is responsible for flushing the line However, it requires some expensive hardware, namely, if multiple processors are caching the line in S state who flushes it? Also, memory needs to wait to know if it should source the line Without cache-to-cache sharing memory always sources the line unless it is in M state
June-July 2009 154

Take the following example

MESI example

P0 reads x, P0 writes x, P1 reads x, P1 writes x, P0 generates BusRd, memory provides line, P0 puts line in cache in E state P0 does write silently, goes to M state P1 generates BusRd, P0 provides line, P1 puts line in cache in S state, P0 transitions to S state Rest is identical to MSI Consider this example: P0 reads x, P1 reads x, P0 generates BusRd, memory provides line, P0 puts line in cache in E state P1 generates BusRd, memory provides line, P1 puts line in cache in S state, P0 transitions to S state (no cache-tocache sharing) Rest is same as MSI
June-July 2009 155

Some SMPs implement MOESI today e.g., AMD Athlon MP and the IBM servers Why is the O state needed?

MOESI protocol

O state is very similar to E state with four differences: 1. If a cache line is in O state in some cache, that cache is responsible for sourcing the line to the next requester; 2. The memory may not have the most up-to-date copy of the line (this implies 1); 3. Eviction of a line in O state generates a BusWB; 4. A write to a line in O state must generate a bus transaction
When a line transitions from M to S it is necessary to write the line back to memory
For a migratory sharing pattern (frequent in database workloads) this leads to a series of writebacks to memory
These writebacks just keep the memory banks busy and consume memory bandwidth

MOESI protocol
Take the following example
P0 reads x, P0 writes x, P1 reads x, P1 writes x, P2 reads x, P2 writes x, Thus at the time of a BusRd response the memory will write the line back: one writeback per processor handover O state aims at eliminating all these writebacks by transitioning from M to O instead of M to S on a BusRd/Flush Subsequent BusRd requests are replied by the owner holding the line in O state The line is written back only when the owner evicts it: one single writeback
June-July 2009 157

MOESI protocol
State transitions pertaining to O state
I to O: not possible (or maybe; next slide)
E to O or S to O: not possible
M to O: on a BusRd/Flush (but no memory writeback)
O to I: on CacheEvict/BusWB or {BusRdX, BusUpgr}/Flush
O to S: not possible (or maybe; next slide)
O to E: not possible (or maybe, if silent eviction is not allowed)
O to M: on PrWr/BusUpgr

At most one cache can have a line in O state at any point in time
June-July 2009 158

MOESI protocol
Two main design choices for MOESI
Consider the example P0 reads x, P0 writes x, P1 reads x, P2 reads x, P3 reads x, When P1 launches BusRd, P0 sources the line and now the protocol has two options: 1. The line in P0 goes to O and the line in P1 is filled in state S; 2. The line in P0 goes to S and the line in P1 is filled in state O i.e. P1 inherits ownership from P0 For bus-based SMPs the two choices will yield roughly the same performance For DSM multiprocessors we will revisit this issue if time permits According to the second choice, when P2 generates a BusRd request, P1 sources the line and transitions from O to S; P2 becomes the new owner
June-July 2009 159

MOSI protocol
Some SMPs do not support the E state
In many cases it is not helpful, only complicates the protocol MOSI allows a compact state encoding in 2 bits Sun WildFire uses MOSI protocol

June-July 2009

160

Synchronization

June-July 2009

161

Types
Mutual exclusion
Synchronize entry into critical sections Normally done with locks

Point-to-point synchronization
Tell a set of processors (normally set cardinality is one) that they can proceed Normally done with flags

Global synchronization
Bring every processor to sync Wait at a point until everyone is there Normally done with barriers
June-July 2009 162

Synchronization
Normally a two-part process: acquire and release; acquire can be broken into two parts: intent and wait
Intent: express intent to synchronize (i.e. contend for the lock, arrive at a barrier) Wait: wait for your turn to synchronization (i.e. wait until you get the lock) Release: proceed past synchronization and enable other contenders to synchronize

Waiting algorithms do not depend on the type of synchronization


June-July 2009 163

Busy wait (common in multiprocessors)

Waiting algorithms
Waiting processes repeatedly poll a location (implemented as a load in a loop) Releasing process sets the location appropriately May cause network or bus transactions

Block
Waiting processes are de-scheduled
Frees up processor cycles for doing something else
Busy waiting is better if
De-scheduling and re-scheduling take longer than busy waiting
There is no other active process
(Busy waiting does not work for a single processor)
Hybrid policies: busy wait for some time and then block

Implementation
Popular trend
Architects offer some simple atomic primitives Library writers use these primitives to implement synchronization algorithms Normally hardware primitives for acquire and possibly release are provided Hard to offer hardware solutions for waiting Also hardwired waiting may not offer that much of flexibility

June-July 2009

165

Hardwired locks
Not popular today
Less flexible Cannot support large number of locks

Possible designs
Dedicated lock line in bus so that the lock holder keeps it asserted and waiters snoop the lock line in hardware Set of lock registers shared among processors and lock holder gets a lock register (Cray Xmp)

June-July 2009

166
Software locks
Bakery algorithm
Shared: choosing[P] = FALSE, ticket[P] = 0;
Acquire: choosing[i] = TRUE;
         ticket[i] = max(ticket[0],...,ticket[P-1]) + 1;
         choosing[i] = FALSE;
         for j = 0 to P-1
            while (choosing[j]);
            while (ticket[j] && ((ticket[j], j) < (ticket[i], i)));
         endfor
Release: ticket[i] = 0;
Does it work for multiprocessors? Too much overhead: need faster and simpler lock algorithms
Need some hardware support
Hardware support
Start with a simple software lock (assume sequential consistency)
Shared: lock = 0;
Acquire: while (lock); lock = 1;
Release or Unlock: lock = 0;
Assembly translation
Lock:   lw register, lock_addr    /* register is any processor register */
        bnez register, Lock
        addi register, register, 0x1
        sw register, lock_addr
Unlock: xor register, register, register
        sw register, lock_addr
Does it work? What went wrong? We wanted the read-modify-write sequence to be atomic
Performance issues related to coherence?
Atomic exchange
We can fix this if we have an atomic exchange instruction
      addi register, r0, 0x1     /* r0 is hardwired to 0 */
Lock: xchg register, lock_addr   /* An atomic load and store */
      bnez register, Lock
Unlock remains unchanged
Various processors support this type of instruction
Intel x86 has xchg, Sun UltraSPARC has ldstub (load-store-unsigned byte), UltraSPARC also has swap
Normally easy to implement for bus-based systems: whoever wins the bus for xchg can lock the bus
Difficult to support in distributed memory systems
Test & set
Less general compared to exchange: loads the current lock value in a register and always sets the location to 1
Exchange allows swapping in any value
Lock: ts register, lock_addr
      bnez register, Lock
Unlock remains unchanged
A similar type of instruction is fetch & op
Fetch the memory location into a register and apply op on the memory location
Op can be one of a set of supported operations e.g. add, increment, decrement, store etc.
In test & set, op = set
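For readers who prefer C to assembly, the same test & set lock can be sketched with a GCC-style atomic exchange builtin (__sync_lock_test_and_set); this is an illustration under that assumption, not the lecture's code.

typedef volatile int spinlock_t;

/* __sync_lock_test_and_set atomically stores 1 and returns the old value,
   which is exactly the test & set primitive described above. */
static void spin_lock(spinlock_t *l)
{
    while (__sync_lock_test_and_set(l, 1))
        ;                        /* old value 1 means someone holds the lock */
}

static void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);      /* equivalent to the simple store lock = 0 */
}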

Possible to implement a lock with fetch & clear then add (used to be supported in BBN Butterfly 1)
Lock: addi reg1, r0, 0x1
      fetch & clr then add reg1, reg2, lock_addr   /* fetch in reg2, clear, add reg1 */
      bnez reg2, Lock

Fetch & op

Butterfly 1 also supports fetch & clear then xor Sequent Symmetry supports fetch & store More sophisticated: compare & swap
Takes three operands: reg1, reg2, memory address Compares the value in reg1 with address and if they are equal swaps the contents of reg2 and address Not in line with RISC philosophy (same goes for fetch & 171 June-July 2009 add)

Compare & swap


Lock: addi reg1, r0, 0x0   /* reg1 has 0x0 */
      addi reg2, r0, 0x1   /* reg2 has 0x1 */
      compare & swap reg1, reg2, lock_addr
      bnez reg2, Lock

June-July 2009

172
Traffic of test & set
In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported: every such instruction will generate a transaction (may be good or bad depending on the support in the memory controller; will discuss later)
Let us assume that the lock location is cacheable and is kept coherent
Every invocation of test & set must generate a bus transaction; why? What is the transaction? What are the possible states of the cache line holding lock_addr?
Therefore all lock contenders repeatedly generate bus transactions even if someone is still in the critical section and is holding the lock
Can we improve this? Test & set with backoff
Backoff test & set
Instead of retrying immediately, wait for a while
How long to wait? Waiting for too long may lead to long latency and lost opportunity
Constant and variable backoff
Special kind of variable backoff: exponential backoff (after the ith attempt the delay is k*c^i where k and c are constants)
Test & set with exponential backoff works pretty well:
      delay = k
Lock: ts register, lock_addr
      bez register, Enter_CS
      pause (delay)          /* Can be simulated as a timed loop */
      delay = delay*c
      j Lock
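The same loop in C, as a sketch: K and C stand for the constants k and c, the delay loop stands in for the pause, and the GCC-style exchange builtin is assumed.

#define K 4     /* initial delay (assumed constant k) */
#define C 2     /* growth factor (assumed constant c) */

static void backoff_lock(volatile int *l)
{
    unsigned delay = K;
    while (__sync_lock_test_and_set(l, 1)) {     /* failed test & set          */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                    /* timed busy loop as "pause" */
        delay *= C;                              /* delay grows as k*c^i       */
    }
}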
Test & test & set
Reduce traffic further: before trying test & set make sure that the lock is free
Lock: ts register, lock_addr
      bez register, Enter_CS
Test: lw register, lock_addr
      bnez register, Test
      j Lock
How good is it?
In a cacheable lock environment the Test loop will execute from cache until it receives an invalidation (due to the store in unlock); at this point the load may return a zero value after fetching the cache line
Only if the location is zero does everyone try test & set

Recall that unlock is always a simple store In the worst case everyone will try to enter the CS at the same time

TTS traffic analysis

First time, P transactions for ts and one succeeds; every other processor suffers a miss on the load in the Test loop; then it loops from cache
The lock-holder when unlocking generates an upgrade (why?) and invalidates all others
All other processors suffer a read miss and get the value zero now; so they break the Test loop and try ts, and the process continues until everyone has visited the CS
(P+(P-1)+1+(P-1)) + ((P-1)+(P-2)+1+(P-2)) + ... = (3P-1) + (3P-4) + (3P-7) + ... ~ 1.5P^2 asymptotically
For distributed shared memory the situation is worse because each invalidation becomes a separate message (more later)

Goals of a lock algorithm


Low latency: if no contender the lock should be acquired fast Low traffic: worst case lock acquire traffic should be low; otherwise it may affect unrelated transactions Scalability: Traffic and latency should scale slowly with the number of processors Low storage cost: Maintaining lock states should not impose unrealistic memory overhead Fairness: Ideally processors should enter CS according to the order of lock request (TS or TTS does not guarantee this)
June-July 2009 177
Ticket lock
Similar to the Bakery algorithm but simpler
A nice application of fetch & inc
Basic idea is to come and hold a unique ticket and wait until your turn comes
The Bakery algorithm failed to offer this uniqueness, thereby increasing complexity
Shared: ticket = 0, release_count = 0;
Lock:   fetch & inc reg1, ticket_addr
Wait:   lw reg2, release_count_addr       /* while (release_count != ticket); */
        sub reg3, reg2, reg1
        bnez reg3, Wait
Unlock: addi reg2, reg2, 0x1              /* release_count++ */
        sw reg2, release_count_addr
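A C sketch of the same ticket lock, assuming a GCC-style __sync_fetch_and_add builtin for fetch & inc; the names are illustrative.

typedef struct {
    volatile unsigned ticket;          /* next ticket to hand out            */
    volatile unsigned release_count;   /* ticket currently allowed to enter  */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l)
{
    unsigned my = __sync_fetch_and_add(&l->ticket, 1);   /* fetch & inc      */
    while (l->release_count != my)
        ;                                                /* wait for my turn */
}

static void ticket_release(ticket_lock_t *l)
{
    l->release_count++;                                  /* release_count++  */
}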

Ticket lock
Initial fetch & inc generates O(P) traffic on bus-based machines (may be worse in DSM depending on the implementation of fetch & inc)
But the waiting algorithm still suffers from ~0.5P^2 messages asymptotically
Researchers have proposed proportional backoff i.e. in the wait loop put a delay proportional to the difference between the ticket value and the last read release_count

Latency and storage-wise better than Bakery Traffic-wise better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery) Guaranteed fairness: the ticket value induces a FIFO queue
June-July 2009 179

Array-based lock
Solves the O(P2) traffic problem The idea is to have a bit vector (essentially a character array if boolean type is not supported) Each processor comes and takes the next free index into the array via fetch & inc Then each processor loops on its index location until it becomes set On unlock a processor is responsible to set the next index location if someone is waiting Initial fetch & inc still needs O(P) traffic, but the wait loop now needs O(1) traffic Disadvantage: storage overhead is O(P)
June-July 2009 180

Performance concerns

Array-based lock

Correctness concerns

Avoid false sharing: allocate each array location on a different cache line
Assume a cache line size of 128 bytes and a character array: allocate an array of size 128P bytes and use every 128th position in the array
For distributed shared memory the location a processor loops on may not be in its local memory: on acquire it must take a remote miss; allocate P pages and let each processor loop on one bit in a page? Too much wastage; a better solution is the MCS lock (Mellor-Crummey & Scott)
Make sure to handle corner cases such as determining if someone is waiting on the next location (this must be an atomic operation) while unlocking
Remember to reset your index location to zero while unlocking
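An array-based queue lock might look like the following C sketch (my illustration under stated assumptions: 128-byte lines, at most MAXP simultaneous contenders, GCC-style builtins); each contender spins on its own padded slot.

#define LINE 128
#define MAXP 64               /* upper bound on simultaneous contenders (assumed) */

typedef struct {
    struct { volatile int must_wait; char pad[LINE - sizeof(int)]; } slot[MAXP];
    volatile unsigned next;   /* next free slot index, taken with fetch & inc */
} array_lock_t;

static void array_lock_init(array_lock_t *l)
{
    for (int i = 0; i < MAXP; i++)
        l->slot[i].must_wait = 1;
    l->slot[0].must_wait = 0;          /* the first ticket enters immediately */
    l->next = 0;
}

static unsigned array_acquire(array_lock_t *l)
{
    unsigned my = __sync_fetch_and_add(&l->next, 1) % MAXP;
    while (l->slot[my].must_wait)
        ;                              /* spin on a private cache line */
    return my;                         /* caller remembers its slot for release */
}

static void array_release(array_lock_t *l, unsigned my)
{
    l->slot[my].must_wait = 1;                 /* reset my slot, as the slide warns */
    l->slot[(my + 1) % MAXP].must_wait = 0;    /* hand the lock to the next waiter  */
}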

RISC processors
All these atomic instructions deviate from the RISC line
Instruction needs a load as well as a store

Also, it would be great if we can offer a few simple instructions with which we can build most of the atomic primitives
Note that it is impossible to build atomic fetch & inc with xchg instruction

MIPS, Alpha and IBM processors support a pair of instructions: LL and SC


Load linked and store conditional
June-July 2009 182
LL/SC
Load linked behaves just like a normal load with some extra tricks
Puts the loaded value in the destination register as usual
Sets a load_linked bit residing in the cache controller to 1
Puts the address in a special lock_address register residing in the cache controller
Store conditional is a special store
sc reg, addr stores the value in reg to addr only if the load_linked bit is set; also it copies the value of the load_linked bit to reg and resets the load_linked bit
Any intervening operation (e.g., bus transaction or cache replacement) to the cache line containing the address in the lock_address register clears the load_linked bit so that a subsequent sc fails
Locks with LL/SC
Test & set
Lock: LL r1, lock_addr
      addi r2, r0, 0x1
      SC r2, lock_addr
      beqz r2, Lock        /* Check if SC succeeded */
      bnez r1, Lock        /* Check if someone is in CS */
LL/SC is best-suited for test & test & set locks
Lock: LL r1, lock_addr     /* Normal read miss/BusRead */
      bnez r1, Lock        /* Check if someone is in CS */
      addi r1, r0, 0x1
      SC r1, lock_addr     /* Possibly upgrade miss */
      beqz r1, Lock        /* Check if SC succeeded */
Fetch & op with LL/SC
Fetch & inc
Try: LL r1, addr
     addi r1, r1, 0x1
     SC r1, addr
     beqz r1, Try
Compare & swap: compare with r1, swap r2 and the memory location (here we keep on trying until the comparison passes)
Try: LL r3, addr
     sub r4, r3, r1
     bnez r4, Try
     add r4, r2, r0
     SC r4, addr
     beqz r4, Try
     add r2, r3, r0
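The same retry structure can be expressed in portable C11 atomics; this sketch uses compare-and-swap, whose failure plays the role of a failed SC.

#include <stdatomic.h>

/* fetch & inc built from a compare-and-swap retry loop: read the old value,
   try to install old+1, and retry if another processor updated the location
   in between (the analogue of the SC failing in the LL/SC version). */
static int fetch_and_inc(atomic_int *p)
{
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;   /* on failure, old is reloaded with the current value */
    return old;
}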

Point-to-point synch.
Normally done in software with flags
P0: A = 1; flag = 1; P1: while (!flag); print A;

Some old machines supported full/empty bits in memory


Possible optimization for shared memory

Each memory location is augmented with a full/empty bit Producer writes the location only if bit is reset Consumer reads location if bit is set and resets it Lot less flexible: one producer-one consumer sharing only (one producer-many consumers is very popular); all accesses to a memory location become synchronized (unless compiler flags some accesses as special)

Allocate flag and data structures (if small) guarded by 186 June-July 2009 flag in same cache line e.g., flag and A in above example
Barrier
High-level classification of barriers: hardware and software barriers
Will focus on two types of software barriers
Centralized barrier: every processor polls a single count
Distributed tree barrier: shows much better scalability
Performance goals of a barrier implementation
Low latency: after all processors have arrived at the barrier, they should be able to leave quickly
Low traffic: minimize bus transactions and contention
Scalability: latency and traffic should scale slowly with the number of processors
Low storage: barrier state should not be big
Fairness: preserve some strict order of barrier exit (could be FIFO according to arrival order); a particular processor should not always be the last one to exit
Centralized barrier
struct bar_type {
  int counter;
  struct lock_type lock;
  int flag = 0;
} bar_name;

BARINIT (bar_name) {
  LOCKINIT(bar_name.lock);
  bar_name.counter = 0;
}

BARRIER (bar_name, P) {
  int my_count;
  LOCK (bar_name.lock);
  if (!bar_name.counter) {
    bar_name.flag = 0;   /* first one */
  }
  my_count = ++bar_name.counter;
  UNLOCK (bar_name.lock);
  if (my_count == P) {
    bar_name.counter = 0;
    bar_name.flag = 1;   /* last one */
  }
  else {
    while (!bar_name.flag);
  }
}
Sense reversal
The last implementation fails to work for two consecutive barrier invocations
Need to prevent a process from entering a barrier instance until all have left the previous instance
Reverse the sense of a barrier i.e. every other barrier will have the same sense: basically attach parity or sense to a barrier
BARRIER (bar_name, P) {
  local_sense = !local_sense;   /* this is private per processor */
  LOCK (bar_name.lock);
  bar_name.counter++;
  if (bar_name.counter == P) {
    UNLOCK (bar_name.lock);
    bar_name.counter = 0;
    bar_name.flag = local_sense;
  }
  else {
    UNLOCK (bar_name.lock);
    while (bar_name.flag != local_sense);
  }
}

How fast is it?

Centralized barrier
Assume that the program is perfectly balanced and hence all processors arrive at the barrier at the same time Latency is proportional to P due to the critical section (assume that the lock algorithm exhibits at most O(P) latency) The amount of traffic of acquire section (the CS) depends on the lock algorithm; after everyone has settled in the waiting loop the last processor will generate a BusRdX during release (flag write) and others will subsequently generate BusRd before releasing: O(P) Scalability turns out to be low partly due to the critical section and partly due to O(P) traffic of release No fairness in terms of who exits first
June-July 2009 190

Does not need a lock, only uses flags

Tree barrier

Arrange the processors logically in a binary tree (higher degree is also possible)
Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag while the other sets it on arrival)
One of them moves up the tree to participate in the next level of the barrier
Introduces concurrency in the barrier algorithm since independent subtrees can proceed in parallel
Takes log(P) steps to complete the acquire
A fixed processor starts a downward pass of release, waking up other processors that in turn set other flags
Shows much better scalability compared to centralized barriers in DSM multiprocessors; the advantage in small bus-based systems is not much, since all transactions are anyway serialized on the bus; in fact the additional log(P) delay may hurt performance in bus-based SMPs
Tree barrier
TreeBarrier (pid, P) {
  unsigned int i, mask;
  for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
    while (!flag[pid][i]);
    flag[pid][i] = 0;
  }
  if (pid < (P - 1)) {
    flag[pid + mask][i] = 1;
    while (!flag[pid][MAX - 1]);
    flag[pid][MAX - 1] = 0;
  }
  for (mask >>= 1; mask > 0; mask >>= 1) {
    flag[pid - mask][MAX - 1] = 1;
  }
}
Convince yourself that this works
Take 8 processors and arrange them on the leaves of a tree of depth 3
You will find that only odd nodes move up at every level during acquire (implemented in the first for loop)
The even nodes just set the flags (the first statement in the if condition): they bail out of the first loop with mask=1
The release is initiated by the last processor in the last for loop; only odd nodes execute this loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0)

Tree barrier
Each processor will need at most log (P) + 1 flags Avoid false sharing: allocate each processors flags on a separate chunk of cache lines With some memory wastage (possibly worth it) allocate each processors flags on a separate page and map that page locally in that processors physical memory
Avoid remote misses in DSM multiprocessor Does not matter in bus-based SMPs

June-July 2009

193

Memory Consistency Models

Coherence protocol is not enough to completely specify the output(s) of a parallel program

Memory consistency
Coherence protocol only provides the foundation to reason about legal outcome of accesses to the same memory location Consistency model tells us the possible outcomes arising from legal ordering of accesses to all memory locations A shared memory machine advertises the supported consistency model; it is a contract with the writers of parallel software and the writers of parallelizing compilers Implementing memory consistency model is really a hardware-software tradeoff: a strict sequential model (SC) offers execution that is intuitive, but may suffer in terms of performance; relaxed models (RC) make program reasoning difficult, but may offer better performance 195 June-July 2009

SC
Recall that an execution is SC if the memory operations form a valid total order i.e. it is an interleaving of the partial program orders
Sufficient conditions require that a new memory operation cannot issue until the previous one is completed This is too restrictive and essentially disallows compiler as well as hardware re-ordering of instructions No microprocessor that supports SC implements sufficient conditions Instead, all out-of-order execution is allowed, and a proper recovery mechanism is implemented in case of a memory order violation Lets discuss the MIPS R10000 implementation
June-July 2009 196

Issues instructions out of program order, but commits in order

SC in MIPS R10000
The problem is with speculatively executed loads: a load may execute and use a value long before it finally commits In the meantime, some other processor may modify that value through a store and the store may commit (i.e. become globally visible) before the load commits: may violate SC (why?) How do you detect such a violation? How do you recover and guarantee an SC execution? Any special consideration for prefetches? Binding and non-binding prefetches
June-July 2009 197
SC in MIPS R10000
In MIPS R10000 a store remains at the head of the active list until it is completed in cache
Can we just remove it as soon as it issues and let the other instructions commit (the store can complete from the store buffer at a later point)? How far can we go and still guarantee SC?
The Stanford DASH multiprocessor, on receiving a read reply that is already invalidated, forces the processor to retry that load
Why can't it use the value in the cache line and then discard the line?
Does the cache controller need to take any special action when a line is replaced from the cache?
Relaxed models
Implementing SC requires complex hardware
Is there an example that clearly shows the disaster of not implementing all these? But such violations are rare
Observe that the cache coherence protocol is orthogonal
Does it make sense to invest so much time (for verification) and hardware (associative lookup logic in the load queue)?
Many processors today relax the consistency model to get rid of complex hardware and achieve some extra performance at the cost of making program reasoning complex
SC is too restrictive; relaxing it does not always violate the programmer's intuition, e.g.
P0: A=1; B=1; flag=1;
P1: while (!flag); print A; print B;
Case Studies

CMP is the mantra of today's microprocessor industry
Intel's dual-core Pentium 4: each core is still hyperthreaded (just uses existing cores)
Intel's quad-core Whitefield is coming up in a year or so
For the server market Intel has announced a dual-core Itanium 2 (code named Montecito); again each core is 2-way threaded
AMD released the dual-core Opteron in 2005
IBM released their first dual-core processor POWER4 circa 2001; the next-generation POWER5 also uses two cores but each core is also 2-way threaded
Sun's UltraSPARC IV (released in early 2004) is a dual-core processor and integrates two UltraSPARC III cores
Why CMP?
Today microprocessor designers can afford to have a lot of transistors on the die
Ever-shrinking feature size leads to dense packing
What would you do with so many transistors?
Can invest some in caches, but beyond a certain point it doesn't help
Natural choice was to think about a greater level of integration
A few chip designers decided to bring the memory and coherence controllers along with the router onto the die
The next obvious choice was to replicate the entire core; it is fairly simple: just use the existing cores and connect them through a coherent interconnect
The number of transistors on a die doubles every 18-24 months

Moores law

Exponential growth in available transistor count If transistor utilization is constant, this would lead to exponential performance growth; but life is slightly more complicated Wires dont scale with transistor technology: wire delay becomes the bottleneck Short wires are good: dictates localized logic design But superscalar processors exercise a centralized control requiring long wires (or pipelined long wires) However, to utilize the transistors well, we need to overcome the memory wall problem To hide memory latency we need to extract more independent instructions i.e. more ILP
June-July 2009 203

Extracting more ILP directly requires more available in-flight instructions


But for that we need bigger ROB which in turn requires a bigger register file Also we need to have bigger issue queues to be able to find more parallelism None of these structures scale well: main problem is wiring So the best solution to utilize these transistors effectively with a low cost must not require long wires and must be able to leverage existing technology: CMP satisfies these goals exactly (use existing processors and invest transistors to have more of these on-chip instead of trying to scale the existing processor for more ILP)
June-July 2009 204

Moores law

Moores law

June-July 2009

205
Power consumption?
Hey, didn't I just make my power consumption roughly N-fold by putting N cores on the die?
Yes, if you do not scale down voltage or frequency
Usually CMPs are clocked at a lower frequency (oops! my games run slower!)
Voltage scaling happens due to smaller process technology
Overall, roughly cubic dependence of power on voltage or frequency
Need to talk about different metrics: Performance/Watt (same as the reciprocal of energy), or more generally Performance^(k+1)/Watt (k > 0)
Need smarter techniques to further improve these metrics, e.g. online voltage/frequency scaling
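A tiny illustration of the cubic claim (numbers made up): dynamic power is roughly proportional to C*V^2*f, and since achievable frequency tracks voltage, scaling both by a factor s scales power by about s^3.

#include <stdio.h>

int main(void)
{
    double Ceff = 1.0, V = 1.0, f = 3.0e9;        /* arbitrary baseline values     */
    double base = Ceff * V * V * f;               /* dynamic power ~ C * V^2 * f   */
    double s = 0.8;                               /* scale voltage and frequency   */
    double scaled = Ceff * (s * V) * (s * V) * (s * f);
    printf("power ratio = %.3f (s^3 = %.3f)\n", scaled / base, s * s * s);
    return 0;
}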

ABCs of CMP
Where to put the interconnect?
Do not want to access the interconnect too frequently because these wires are slow
It probably does not make much sense to have the L1 cache shared among the cores: requires very high bandwidth and may necessitate a redesign of the L1 cache and the surrounding load/store unit, which we do not want to do; so settle for private L1 caches, one per core
Makes more sense to share the L2 or L3 caches
Need a coherence protocol at the L2 interface to keep the private L1 caches coherent: may use a high-speed custom-designed snoopy bus connecting the L1 controllers or may use a simple directory protocol
An entirely different design choice is not to share the cache hierarchy at all (dual-core AMD and Intel): rids you of the on-chip coherence protocol, but no gain in communication latency

IBM POWER4

June-July 2009

208

IBM POWER4
Dual-core chip multiprocessor

June-July 2009

209

4-chip 8-way NUMA

June-July 2009

210

32-way: ring bus

June-July 2009

211

Private L1 instruction and data caches (on chip)

POWER4 caches

On-chip shared L2 (on-chip coherence point)

L1 icache: 64 KB/direct mapped/128 bytes line L1 dcache: 32 KB/2-way associative/128 bytes line/LRU No M state in L1 data cache (write through) 1.5 MB/8-way associative/128 bytes line/pseudo LRU For on-chip coherence, L2 tag is augmented with a twobit sharer vector; used to invalidate L1 on other cores write Three L2 controllers and each L2 controller has four local coherence units; each L2 controller handles roughly 512 KB of data divided into four SRAM partitions For off-chip coherence, each L2 controller has four snoop engines; executes enhanced MESI with seven states
June-July 2009 212

POWER4 L2 cache

June-July 2009

213

On-chip tag (IBM calls it directory), off-chip data


32 MB/8-way associative/512 bytes line Contains eight coherence/snoop controllers Does not maintain inclusion with L2: requires L3 to snoop fabric interconnect also Maintains five coherence states Putting the L3 cache on the other side of the fabric requires every L2 cache miss (even local miss) to cross the fabric: increases latency quite a bit

POWER4 L3 cache

June-July 2009

214

POWER4 L3 cache

June-July 2009

215

POWER4 die photo

June-July 2009

216

IBM POWER5

June-July 2009

217

Carries on POWER4 to the next generation
Each core of the dual-core chip is 2-way SMT: 24% area growth per core
More than two threads not only add complexity, but may not provide extra performance benefit; in fact, performance may degrade because of resource contention and cache thrashing unless all shared resources are scaled up accordingly (hits a complexity wall)
L3 cache is moved to the processor side so that the L2 cache can directly talk to it: reduces bandwidth demand on the interconnect (L3 hits at least do not go on the bus)
This change enabled POWER5 designers to scale to 64-processor systems (i.e., 32 chips with a total of 128 threads)
Bigger L2 and L3 caches: 1.875 MB L2, 36 MB L3
On-chip memory controller
Same pipeline structure as POWER4
Added SMT facility
Like Pentium 4, fetches from each thread in alternate cycles (8-instruction fetch per cycle, just like POWER4)
Threads share the ITLB and the ICache
Increased size of the register file compared to POWER4 to support two threads: 120 integer and floating-point registers (POWER4 has 80 integer and 72 floating-point registers); improves single-thread performance compared to POWER4; the smaller technology (0.13 µm) made it possible to access a bigger register file in the same or shorter time, leading to the same pipeline as POWER4
Doubled associativity of the L1 caches to reduce conflict misses: icache is 2-way and dcache is 4-way
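The alternate-cycle fetch policy can be pictured with a toy round-robin selector: each cycle the fetch stage prefers the other thread, falls back to the same thread if the other one is stalled, and grabs up to eight instructions from the chosen thread's PC. The stall pattern, PCs, and structures below are illustrative assumptions, not POWER5 internals.

```c
/* Toy 2-way SMT fetch selector: alternate cycles between threads,
 * skipping a thread that is stalled (e.g., on an icache miss).
 * Illustrative only; not the actual POWER5 fetch logic. */
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS    2
#define FETCH_WIDTH 8   /* up to 8 instructions fetched per cycle */

int main(void)
{
    unsigned pc[NTHREADS] = { 0x1000, 0x2000 };  /* assumed per-thread fetch PCs */
    bool stalled[NTHREADS] = { false, false };
    int last = 1;                                /* so thread 0 goes first */

    for (int cycle = 0; cycle < 6; cycle++) {
        stalled[1] = (cycle == 3);               /* pretend thread 1 misses in cycle 3 */

        int t = (last + 1) % NTHREADS;           /* prefer the other thread */
        if (stalled[t])
            t = last;                            /* fall back to the same thread */

        if (!stalled[t]) {
            printf("cycle %d: fetch %d instrs for thread %d at PC 0x%x\n",
                   cycle, FETCH_WIDTH, t, pc[t]);
            pc[t] += FETCH_WIDTH * 4;            /* 4-byte fixed-length instructions */
            last = t;
        } else {
            printf("cycle %d: both threads stalled, no fetch\n", cycle);
        }
    }
    return 0;
}
```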
POWER5 dynamic power management
With SMT and CMP, the average number of switchings per cycle increases, leading to more power consumption
Need to reduce power consumption without losing performance: a simple solution is to clock the chip at a slower frequency, but that hurts performance
POWER5 employs fine-grain clock-gating: in every cycle the power management logic decides whether a certain latch will be used in the next cycle; if not, it disables or gates the clock for that latch so that it will not unnecessarily switch in the next cycle
Clock-gating and power management logic themselves should be very simple
If both threads are running at priority level 1, the processor switches to a low-power mode where it dispatches instructions at a much slower pace
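A rough way to see what fine-grain clock gating buys is to model each latch group with a "needed next cycle" predicate and to charge clock switching energy only in cycles where the group is actually used. The activity pattern and energy unit in the sketch below are arbitrary assumptions; it is not modeled on the actual POWER5 gating logic.

```c
/* Toy model of fine-grain clock gating: a latch group only burns switching
 * energy in cycles where the control logic predicts it will be used.
 * Activity pattern and energy unit are arbitrary assumptions. */
#include <stdio.h>
#include <stdbool.h>

#define NGROUPS 4
#define NCYCLES 100

/* Assumed predicate: will latch group g be used in the given cycle?
 * Here group g is "used" in a fraction of cycles that shrinks with g. */
static bool used_next_cycle(int g, int cycle)
{
    return (cycle % (g + 1)) == 0;
}

int main(void)
{
    double e_switch = 1.0;          /* energy per latch group per clocked cycle */
    double e_ungated = 0.0, e_gated = 0.0;

    for (int cycle = 0; cycle < NCYCLES; cycle++) {
        for (int g = 0; g < NGROUPS; g++) {
            e_ungated += e_switch;                  /* free-running clock: always toggles */
            if (used_next_cycle(g, cycle))          /* gated clock: toggles only when needed */
                e_gated += e_switch;
        }
    }
    printf("clock energy without gating: %.0f units\n", e_ungated);
    printf("clock energy with gating   : %.0f units (%.0f%% saved)\n",
           e_gated, 100.0 * (1.0 - e_gated / e_ungated));
    return 0;
}
```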
POWER5 die photo
Intel Montecito
Features
Dual-core Itanium 2, each core dual-threaded
1.7 billion transistors, 21.5 mm x 27.7 mm die
27 MB of on-chip cache in three levels, not shared among the cores
1.8+ GHz, 100 W
Single-thread enhancements:
Extra shifter improves performance of crypto codes by 100%
Improved branch prediction
Improved data and control speculation recovery
Separate L2 instruction and data caches buy 7% improvement over Itanium 2; four times bigger L2I (1 MB)
Asynchronous 12 MB L3 cache
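The 27 MB of on-chip cache is where most of the 1.7 billion transistors go: at the classic 6 transistors per SRAM bit, the data arrays alone account for well over a billion transistors. The quick calculation below uses only the figures quoted above plus the 6T-cell assumption, and ignores tags, ECC, and peripheral circuitry.

```c
/* Rough transistor budget for Montecito's on-chip cache data arrays.
 * Uses the quoted 27 MB and 1.7 billion transistors; 6T SRAM cells assumed;
 * tag, ECC, and peripheral overheads ignored. */
#include <stdio.h>

int main(void)
{
    const double cache_bytes         = 27.0 * 1024 * 1024;  /* 27 MB of cache */
    const double transistors_per_bit = 6.0;                 /* classic 6T SRAM cell */
    const double total_transistors   = 1.7e9;

    double cache_transistors = cache_bytes * 8.0 * transistors_per_bit;
    printf("cache data arrays: ~%.2f billion transistors (%.0f%% of %.1f billion total)\n",
           cache_transistors / 1e9,
           100.0 * cache_transistors / total_transistors,
           total_transistors / 1e9);
    return 0;
}
```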
Overview (figure reproduced from IEEE Micro)
Foxton technology
Power efficiency:
Blind replication of Itanium 2 cores at 90 nm would lead to roughly 300 W peak power consumption (Itanium 2 consumes 130 W peak at 130 nm)
When power consumption is below the ceiling, the voltage is increased, leading to higher frequency and performance: 10% boost for enterprise applications
Software or the OS can also dictate a frequency change if power saving is required
100 ms response time for the feedback loop
Frequency control is achieved by 24 voltage sensors distributed across the chip: the entire chip runs at a single frequency (other than the asynchronous L3)
Clock gating found limited application in Montecito
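The feedback loop can be pictured as a periodic controller: every interval (on the order of the 100 ms response time quoted above) it compares measured power against the budget and nudges the voltage/frequency operating point down if over budget, or up if there is headroom. The sketch below is a toy controller with an invented power model and step sizes; it is not Intel's actual Foxton algorithm.

```c
/* Toy power-budget feedback loop in the spirit of a Foxton-style controller:
 * measure power, compare against the budget, and nudge the voltage/frequency
 * scale factor up or down. The power model and step sizes are assumptions. */
#include <stdio.h>

static double measure_power(double scale, double workload)
{
    /* Assumed model: power grows roughly cubically with the V/f scale factor
     * and linearly with workload activity. */
    return 120.0 * scale * scale * scale * workload;
}

int main(void)
{
    const double budget = 100.0;   /* W, the quoted Montecito power envelope */
    double scale = 1.0;            /* combined voltage/frequency scale factor */

    for (int interval = 0; interval < 10; interval++) {
        double workload = (interval < 5) ? 1.0 : 0.6;   /* activity drops halfway */
        double power = measure_power(scale, workload);

        if (power > budget)
            scale *= 0.97;         /* over budget: back off voltage/frequency */
        else if (scale < 1.10)
            scale *= 1.02;         /* headroom: creep up for extra performance */

        printf("interval %d: power %6.1f W, new scale %.3f\n",
               interval, power, scale);
    }
    return 0;
}
```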
Foxton technology (figure reproduced from IEEE Micro): an embedded microcontroller runs a real-time scheduler to execute the various tasks
Montecito die photo
Sun Niagara (UltraSPARC T1)
Features
Eight pipelines or cores, each shared by 4 threads
32-way multithreading on a single chip
Starting frequency of 1.2 GHz, consumes 60 W
Shared 3 MB L2 cache, 4-way banked, 12-way set associative, 200 GB/s bandwidth
Single-issue, six-stage pipe
Target market is web service, where ILP is limited but TLP is huge (independent transactions)
Throughput matters
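A back-of-the-envelope model shows why four threads per simple core suit this workload: if each thread computes for r cycles and then stalls for L cycles, interleaving N threads keeps a single-issue pipeline busy roughly min(1, N*r/(r+L)) of the time. The r and L values in the sketch below are illustrative assumptions, not Niagara measurements.

```c
/* Back-of-the-envelope utilization of a single-issue core running N
 * interleaved threads, each alternating r busy cycles with L stall cycles.
 * r and L below are assumed values, not measured Niagara numbers. */
#include <stdio.h>

static double utilization(int nthreads, double r, double L)
{
    double u = nthreads * r / (r + L);
    return u > 1.0 ? 1.0 : u;
}

int main(void)
{
    const double r = 20.0;   /* assumed busy cycles between long-latency events */
    const double L = 60.0;   /* assumed stall cycles per event (e.g., cache miss) */

    for (int n = 1; n <= 8; n++)
        printf("%d thread(s): pipeline utilization ~ %.0f%%\n",
               n, 100.0 * utilization(n, r, L));
    return 0;
}
```

With the assumed numbers, four threads are already enough to keep the pipe busy, which is the design point the slide describes.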
Pipeline details (figure reproduced from IEEE Micro)
Cache hierarchy
L1 instruction cache
16 KB / 4-way / 32 bytes / random replacement
Fetches two instructions every cycle
If both instructions are useful, the next cycle is free for icache refill
L1 data cache
8 KB / 4-way / 16 bytes / write-through, no-allocate
On average 10% miss rate for the target benchmarks
The L2 cache extends the tag to maintain a directory for keeping the cores' L1s coherent
L2 cache is writeback with silent clean eviction
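The write-through, no-allocate policy means a store updates the L1 only if the line is already present and always forwards the data toward the L2; a store miss does not bring the line into the L1. The fragment below is a minimal single-line sketch of that policy with invented structures, not the real Niagara datapath.

```c
/* Minimal sketch of a write-through, no-allocate L1 data cache policy:
 * stores always go to L2; the L1 is updated only on a hit, and a store
 * miss does not allocate a line. Structures here are invented for
 * illustration, not Niagara's actual datapath. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    unsigned tag;
    unsigned data;
} l1_line;                      /* pretend the L1 holds a single line */

static unsigned l2_backing;     /* stand-in for the shared L2 copy */

static void store(l1_line *l1, unsigned addr_tag, unsigned value)
{
    if (l1->valid && l1->tag == addr_tag) {
        l1->data = value;                       /* hit: update the L1 copy... */
        printf("store hit : L1 updated, ");
    } else {
        printf("store miss: L1 untouched (no-allocate), ");
    }
    l2_backing = value;                         /* ...and always write through to L2 */
    printf("L2 now holds %u\n", l2_backing);
}

int main(void)
{
    l1_line line = { true, 0x12, 7 };
    store(&line, 0x12, 42);     /* hit: both L1 and L2 updated */
    store(&line, 0x34, 99);     /* miss: only L2 updated, no L1 allocation */
    printf("L1 line: valid=%d tag=0x%x data=%u\n", line.valid, line.tag, line.data);
    return 0;
}
```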