
NPTEL Online - IIT Bombay

Course Name: Parallel Computer Architecture

Department: Computer Science and Engineering, IIT Kanpur

Instructor: Dr. Mainak Chaudhuri

Module 1: "Multi-core: The Ultimate Dose of Moore's Law"


Lecture 1: "Evolution of Processor Architecture"

The Lecture Contains:

Multi-core: The Ultimate Dose of Moore’s Law


A gentle introduction to the multi-core landscape as a tale of four decades of glory and success

Mind-boggling Trends in Chip Industry

Agenda

Unpipelined Microprocessors

Pipelining

Pipelining Hazards

Control Dependence

Data Dependence

Structural Hazard

Out-of-order Execution

Multiple Issue

Out-of-order Multiple Issue


Mind-boggling Trends in Chip Industry:

Long history since 1971


Introduction of Intel 4004
http://www.intel4004.com/
Today we talk about more than one billion transistors on a chip
Intel Montecito (in market since July’06) has 1.7B transistors
Die size has increased steadily (what is a die?)
Intel Prescott: 112 mm2, Intel Pentium 4EE: 237 mm2, Intel Montecito: 596 mm2
Minimum feature size has shrunk from 10 micron in 1971 to 0.045 micron today

Agenda:

Unpipelined microprocessors
Pipelining: simplest form of ILP
Out-of-order execution: more ILP
Multiple issue: drink more ILP
Scaling issues and Moore’s Law
Why multi-core
TLP and de-centralized design
Tiled CMP and shared cache
Implications on software
Research directions

Unpipelined Microprocessors:

Typically an instruction enjoys five phases in its life


Instruction fetch from memory
Instruction decode and operand register read
Execute
Data memory access
Register write
Unpipelined execution would take a long single cycle or multiple short cycles
Only one instruction inside processor at any point in time


Pipelining:

One simple observation


Exactly one piece of hardware is active at any point in time
Why not fetch a new instruction every cycle?
Five instructions in five different phases
Throughput increases five times (ideally)
Bottom-line is
If consecutive instructions are independent, they can be processed in parallel
The first form of instruction-level parallelism (ILP)

Pipelining Hazards:

Instruction dependence limits achievable parallelism


Control and data dependence (aka hazards)
Finite amount of hardware limits achievable parallelism
Structural hazards
Control dependence
On average, every fifth instruction is a branch (coming from if-else, for, do-while,…)
Branches execute in the third phase
Introduces bubbles unless you are smart

Control Dependence:

What do you fetch in the X and Y slots?

Options: fetch nothing, fetch the fall-through, or learn from past history and predict (today's best
predictors achieve on average 97% accuracy for SPEC2000)


Data Dependence:

Take three bubbles?

Back-to-back dependence is too frequent

Solution: hardware bypass paths

Allow the ALU to bypass the produced value in time: not always possible

Need a live bypass! (requires some negative time travel: not yet feasible in real world)

No option but to take one bubble

Bigger problems: load latency is often high; you may not find the data in cache

Structural Hazard:

The usual solution is to add more resources (e.g., more functional units or memory ports)


Out-of-order Execution:


Multiple Issue:

Out-of-order Multiple Issue:

Some hardware nightmares


Complex issue logic to discover independent instructions
Increased pressure on cache
Impact of a cache miss is much bigger now in terms of lost opportunity
Various speculative techniques are in place to “ignore” the slow and stupid memory
Increased impact of control dependence
Must feed the processor with multiple correct instructions every cycle
One cycle of bubble means lost opportunity of multiple instructions
Complex logic to verify

Module 1: "Multi-core: The Ultimate Dose of Moore's Law"


Lecture 2: "Moore's Law and Multi-cores"

The Lecture Contains:

Moore’s Law

Scaling Issues

Multi-core

Thread-level Parallelism

Communication in Multi-core

Tiled CMP (Hypothetical Floor-plan)

Shared Cache CMP

Niagara Floor-plan

Implications on Software

Research Directions

References


Moore’s Law:

Number of transistors on-chip doubles every 18 months


So much innovation was possible only because we had the transistors
Phenomenal 58% performance growth every year
Moore’s Law is in danger today
Power consumption is too high when clocked at multi-GHz frequency and it is
proportional to the number of switching transistors
Wire delay doesn’t decrease with transistor size
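As an aside, the annual rate implied by an 18-month doubling period is easy to work out: 2^(12/18) is about 1.59, i.e. roughly 59% growth per year. A small C sketch of that arithmetic (illustrative only):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* A quantity that doubles every 18 months grows per year by 2^(12/18). */
    double months_to_double = 18.0;
    double yearly_factor = pow(2.0, 12.0 / months_to_double);   /* ~1.587 */
    printf("yearly growth factor = %.3f (about %.0f%% per year)\n",
           yearly_factor, (yearly_factor - 1.0) * 100.0);
    return 0;
}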

Scaling Issues:

Hardware for extracting ILP has reached the point of diminishing returns
Need a large number of in-flight instructions
Supporting such a large population inside the chip requires power-hungry, delay-sensitive logic and storage
Verification complexity is getting out of control
How to exploit so many transistors?
Must be a decentralized design that avoids long wires

Multi-core:

Put a few reasonably complex processors or many simple processors on the chip
Each processor has its own primary cache and pipeline
Often a processor is called a core
Often called a chip-multiprocessor (CMP)
Did we use the transistors properly?
Depends on whether you can keep the cores busy
Introduces the concept of thread-level parallelism (TLP)


Thread-level Parallelism:

Look for concurrency at a granularity coarser than instructions


Put a chunk of consecutive instructions together and call it a thread (largely wrong!)
Each thread can be seen as a “dynamic” subgraph of the sequential control-flow
graph: take a loop and unroll its graph
The edges spanning the subgraphs represent data dependence across threads (the spanning
control edges are usually converted to data edges through suitable transformations)
The goal of parallelization is to minimize such edges
Threads should mostly compute independently on different cores; but need to
talk once in a while to get things done!
Parallelizing sequential programs is fun, but often tedious for non-experts
So look for parallelism at even coarser grain
Run multiple independent programs simultaneously
Known as multi-programming
The biggest reason why quotidian Windows fans would buy small-scale
multiprocessors and multi-core today
Can play games while running heavy-weight simulations and downloading
movies
Have you seen the state of the poor machine when running anti-virus?

Communication in Multi-core:

Ideal for shared address space


Fast on-chip hardwired communication through cache (no OS intervention)
Two types of architectures
Tiled CMP: each core has its private cache hierarchy (no cache sharing); Intel
Pentium D, Dual Core Opteron, Intel Montecito, Sun UltraSPARC IV, IBM Cell
(more specialized)
Shared cache CMP: Outermost level of cache hierarchy is shared among cores;
Intel Woodcrest (server-grade Core duo), Intel Conroe (Core2 duo for desktop),
Sun Niagara, IBM Power4, IBM Power5


Tiled CMP (Hypothetical Floor-plan):


Shared Cache CMP:


Niagara Floor-plan:

Implications on Software:

A tall memory hierarchy


Each core could run multiple threads
Each core in Niagara runs four threads
Within core, threads communicate through private cache (fastest)
Across cores communication happens through shared L2 or coherence controller (if
tiled)
Multiple such chips can be connected over a scalable network
Adds one more level of memory hierarchy
A very non-uniform access stack


Research Directions:

Hexagon of puzzles
Running single-threaded programs efficiently on this sea of cores
Managing energy envelope efficiently
Allocating shared cache efficiently
Allocating shared off-chip bandwidth and memory banks efficiently
Making parallel programming easy
Transactional memory
Speculative parallelization
Verification of hardware and parallel software, and tolerating faults

References:

A good reading is Parallel Computer Architecture by Culler, Singh with Gupta


Caveat: does not talk about multi-core, but introduces the general area of shared
memory multiprocessors
Papers
Check out the most recent issue of Intel Technology Journal
http://www.intel.com/technology/itj/
http://www.intel.com/technology/itj/archive.htm
Conferences: ASPLOS, ISCA, HPCA, MICRO, PACT
Journals: IEEE Micro, IEEE TPDS, ACM TACO

Module 2: "Parallel Computer Architecture: Today and Tomorrow"


Lecture 3: "Evaluating Performance"

The Lecture Contains:

Parallel Computer Architecture: Today and Tomorrow

What is computer architecture?

Architect’s job

58% growth rate

The computer market

The applications

Parallel architecture

Why parallel arch.?

Why study it?

Performance metrics

Throughput metrics

Application trends

Commercial sector

Desktop market

[From Chapter 1 of Culler, Singh, Gupta]


What is computer architecture?

Amdahl, Blaauw, and Brooks, 1964 (IBM 360 team):


The structure of a computer that a machine language programmer must understand to
write a correct (timing independent) program for that machine
Loosely speaking, it is the science of designing computers “leading to glorious failures and
some notable successes”

Architect’s job

Design and engineer various parts of a computer system to maximize performance and
programmability within the technology limits and cost budget
Technology limit could mean process/circuit technology in case of microprocessor architecture
For bigger systems technology limit could mean interconnect technology (how one component
talks to another at macro level)

Slightly outdated data

58% growth rate

Two major architectural reasons


Advent of RISC (Reduced Instruction Set Computer) made it easy to implement many
aggressive architectural techniques for extracting parallelism
Introduction of caches
Made easy by Moore’s law
Two major impacts
Highest performance microprocessors today outperform supercomputers designed less
than 10 years ago
Microprocessor-based products have dominated all sectors of computing: desktops and
workstations; minicomputers are replaced by servers, mainframes are replaced by
multiprocessors, and supercomputers are built out of commodity microprocessors (a cost
factor also dictated this trend)


The computer market

Three major sectors


Desktop: ranges from low-end PCs to high-end workstations; market trend is very
sensitive to price-performance ratio
Server: used in large-scale computing or service-oriented market such as heavy-
weight scientific computing, databases, web services, etc; reliability, availability and
scalability are very important; servers are normally designed for high throughput
Embedded: fast growing sector; very price-sensitive; present in most day-to-day
appliances such as microwave ovens, washing machines, printers, network switches,
palmtops, cell phones, smart cards, game engines; software is usually specialized/tuned
for one particular system

The applications

Very different in three sectors


This difference is the main reason for different design styles in these three areas
Desktop market demands leading-edge microprocessors, high-performance graphics
engines; must offer balanced performance for a wide range of applications; customers
are happy to spend a reasonable amount of money for high performance i.e. the metric
is price-performance
Server market integrates high-end microprocessors into scalable multiprocessors;
throughput is very important; could be floating-point or graphics or transaction
throughput
Embedded market adopts high-end microprocessor techniques paying immense
attention to low price and low power; processors are either general purpose (to some
extent) or application-specific

Parallel architecture

Collection of processing elements that co-operate to solve large problems fast


Design questions that need to be answered
How many processing elements (scalability)?
How capable is each processor (computing power)?
How to address memory (shared or distributed)?
How much addressable memory (address bit allocation)?
How do the processors communicate (through memory or by messages)?
How do the processors avoid data races (synchronization)?
How do you answer all these to achieve highest performance within your cost envelope?


Why parallel arch.?

Parallelism helps
There are applications that can be parallelized easily
There are important applications that require an enormous amount of computation (10 GFLOPS
to 1 TFLOPS)
NASA taps SGI, Intel for Supercomputers: 20 512p SGI Altix using Itanium 2
(http://zdnet.com.com/2100-1103_2-5286156.html) [27th July, 2004]
There are important applications that need to deliver high throughput

Why study it?

Parallelism is ubiquitous
Need to understand the design trade-offs
Microprocessors are now multiprocessors (more later)
Today a computer architect’s primary job is to find out how to efficiently extract
parallelism
Get involved in interesting research projects
Make an impact
Shape the future development
Have fun

Performance metrics

Need benchmark applications


SPLASH (Stanford ParalleL Applications for SHared memory)
SPEC (Standard Performance Evaluation Corp.) OMP
ScaLAPACK (Scalable Linear Algebra PACKage) for message-passing machines
TPC (Transaction Processing Performance Council) for database/transaction
processing performance
NAS (Numerical Aerodynamic Simulation) for aerophysics applications
NPB2 port to MPI for message-passing only
PARKBENCH (PARallel Kernels and BENCHmarks) for message-passing only
Comparing two different parallel computers
Execution time is the most reliable metric
Sometimes MFLOPS, GFLOPS, TFLOPS are used, but could be misleading
Evaluating a particular machine
Use speedup to gauge scalability of the machine (provided the application itself scales)
Speedup(P) = Uniprocessor time/Time on P processors
Normally the input data set is kept constant when measuring speedup
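A tiny illustration of the speedup metric; the timings below are hypothetical, chosen only to show the calculation:

#include <stdio.h>

int main(void)
{
    /* Speedup(P) = uniprocessor time / time on P processors, same input. */
    int    p     = 8;
    double t_uni = 120.0;   /* seconds on one processor (made-up value) */
    double t_p   = 18.0;    /* seconds on P processors (made-up value)  */

    printf("Speedup(%d) = %.2f (ideal would be %d)\n", p, t_uni / t_p, p);
    return 0;
}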


Throughput metrics

Sometimes metrics like jobs/hour may be more important than just the turn-around time of a
job
This is the case for transaction processing (the biggest commercial application for
servers)
Needs to serve as many transactions as possible in a given time provided time per
transaction is reasonable
Transactions are largely independent; so throw in as many hardware threads as
possible
Known as throughput computing

Application trends

Equal to or below 1 GFLOPS requirements


2D airfoil, oil reservoir modeling, 3D plasma modeling, 48-hour weather
Below 100 GFLOPS requirements
Chemical dynamics, structural biology, 72-hour weather
Tomorrow’s applications (beyond 100 GFLOPS)
Human genome, protein folding, superconductor modeling, quantum chromodynamics,
molecular geometry, real-time vision and speech recognition, graphics, CAD, space
exploration, global-warming etc.
Insatiable demand for CPU cycles (need large-scale supercomputers)

Commercial sector

Slightly different story


Transactions per minute (tpm)
Scale of computers is much smaller
4P machines to maybe 32P servers
But use of parallelism is tremendous
Need to serve as many transaction threads as possible (maximize the number of
database users)
Need to handle large data footprint and offer massive parallelism (also economics kicks
in: should be low-cost)

Desktop market

Demand to improve throughput for sequential multi-programmed workload


I want to run as many simulations as I can and want them to finish before I come back
next morning
Possibly the biggest application for small-scale multiprocessors (e.g. 2 or 4-way
SMPs)
Even on a uniprocessor machine I would be happy if I could play AOE without affecting the
performance of my simulation running in the background (simultaneous multi-threading and chip
multi-processing; more later)

Module 2: "Parallel Computer Architecture: Today and Tomorrow"


Lecture 4: "Shared Memory Multiprocessors"

The Lecture Contains:

Technology trends

Architectural trends

Exploiting TLP: NOW

Supercomputers

Exploiting TLP: Shared memory

Shared memory MPs

Bus-based MPs

Scaling: DSMs

On-chip TLP

Economics

Summary

[From Chapter 1 of Culler, Singh, Gupta]


Technology trends

The natural building block for multiprocessors is the microprocessor


Microprocessor performance increases 50% every year
Transistor count doubles every 18 months
Intel Pentium 4 EE 3.4 GHz has 178 M transistors on a 237 mm2 die
130 nm Itanium 2 has 410 M transistors on a 374 mm2 die
90 nm Intel Montecito has 1.7 B transistors on a 596 mm2 die
Die area is also growing
Intel Prescott had 125 M transistors on a 112 mm2 die
Ever-shrinking process technology
Shorter gate length of transistors
Can afford to sweep electrons through channel faster
Transistors can be clocked at faster rate
Transistors also get smaller
Can afford to pack more on the die
And die size is also increasing
What to do with so many transistors?
Could increase L2 or L3 cache size
Does not help much beyond a certain point
Burns more power
Could improve microarchitecture
Better branch predictor or novel designs to improve instruction-level parallelism (ILP)
If cannot improve single-thread performance have to look for thread-level
parallelism (TLP)
Multiple cores on the die (chip multiprocessors): IBM POWER4, POWER5, Intel
Montecito, Intel Pentium 4, AMD Opteron, Sun UltraSPARC IV
TLP on chip
Instead of putting multiple cores could put extra resources and logic to run multiple
threads simultaneously (simultaneous multi-threading): Alpha 21464 (cancelled), Intel
Pentium 4, IBM POWER5, Intel Montecito
Today’s microprocessors are small-scale multiprocessors (dual-core, 2-way SMT)
Tomorrow’s microprocessors will be larger-scale multiprocessors or highly multi-threaded
Sun Niagara is an 8-core (each 4-way threaded) chip: 32 threads on a single chip

Architectural trends

Circuits: bit-level parallelism


Started with 4 bits (Intel 4004) [http://www.intel4004.com/]
Now 32-bit processor is the norm
64-bit processors are taking over (AMD Opteron, Intel Itanium, Pentium 4 family);
started with Alpha, MIPS, Sun families
Architecture: instruction-level parallelism (ILP)
Extract independent instruction stream
Key to advanced microprocessor design
Gradually hitting a limit: memory wall
Memory operations are bottleneck
Need memory-level parallelism (MLP)

Also technology limits such as wire delay are pushing for a more distributed control
rather than the centralized control in today’s processors
If cannot boost ILP what can be done?
Thread-level parallelism (TLP)
Explicit parallel programs already have TLP (inherent)
Sequential programs that are hard to parallelize or ILP-limited can be speculatively
parallelized in hardware
Thread-level speculation (TLS)
Today’s trend: if cannot do anything to boost single-thread performance invest transistors and
resources to exploit TLP


Exploiting TLP: NOW

Simplest solution: take the commodity boxes, connect them over gigabit ethernet and let them
talk via messages
The simplest possible message-passing machine
Also known as Network of Workstations (NOW)
Normally PVM (Parallel Virtual Machine) or MPI (Message Passing Interface) is used
for programming
Each processor sees only local memory
Any remote data access must happen through explicit messages (send/recv calls
trapping into kernel)
Optimizations in the messaging layer are possible (user level messages, active messages)

Supercomputers

Historically used for scientific computing


Initially used vector processors
But the uniprocessor performance gap between vector processors and microprocessors is narrowing
Microprocessors now have heavily pipelined floating-point units, large on-chip caches,
modern techniques to extract ILP
Microprocessor-based supercomputers come at large scale: 100 to 1000 processors (called
massively parallel processors or MPPs)
However, vector processor based supercomputers are much smaller scale due to cost
disadvantage
Cray finally decided to use Alpha µP in T3D

Exploiting TLP: Shared memory

Hard to build, but offers better programmability compared to message-passing clusters


The “conventional” load/store architecture continues to work
Communication takes place through load/store instructions
Central to design: a cache coherence protocol
Handling data coherency among different caches
Special care needed for synchronization


Shared memory MPs

What is the communication protocol?


Could be bus-based
Processors share a bus and snoop every transaction on the bus

The most common design in server and enterprise market

Bus-based MPs

The memory is “equidistant” from all processors


Normally called symmetric multiprocessors (SMPs)
Fast processors can easily saturate the bus
Bus bandwidth becomes a scalability bottleneck
In the '90s, when processors were slow, 32P SMPs could be seen
Now mostly Sun pushes for large-scale SMPs with advanced bus
architecture/technology
The bus speed and width have also increased dramatically: Intel Pentium 4 boxes
normally come with 400 MHz front-side bus, Xeons have 533 MHz or 800 MHz FSB,
PowerPC G5 can clock the bus up to 1.25 GHz

Scaling: DSMs

Large-scale shared memory MPs are normally built over a scalable switch-based network
Now each node has its local memory
Access to remote memory happens through load/store, but may take longer
Non-Uniform Memory Access (NUMA)
Distributed Shared Memory (DSM)
The underlying coherence protocol is quite different compared to a bus-based SMP
Need specialized memory controller to handle coherence requests and a router to connect to
the network


On-chip TLP

Current trend:
Tight integration
Minimize communication latency (data communication is the bottleneck)
Since we have transistors
Put multiple cores on chip (Chip multiprocessing)
They can communicate via either a shared bus or switch-based fabric on-chip (can be
custom designed and clocked faster)
Or put support for multiple threads without replicating cores (Simultaneous multi-
threading)
Both choices provide a good cost/performance trade-off

Economics

Ultimately who controls what gets built?


It is cost vs. performance trade-off
Given a time budget (to market) and a revenue projection, how much performance can be
afforded
Normal trend is to use commodity microprocessors as building blocks unless there is a very
good reason
Reuse existing technology as much as possible
Large-scale scientific computing mostly exploits message-passing machines (easy to build,
less costly); even Google uses the same kind of architecture [uses commodity parts]
Small to medium-scale shared memory multiprocessors are needed in the commercial market
(databases)
Although large-scale DSMs (256 or 512 nodes) are built by SGI, demand is less

Summary

Parallel architectures will be ubiquitous soon


Even on desktop (already we have SMT/HT, multi-core)
Economically attractive: can build with COTS (commodity-off-the-shelf) parts
Enormous application demand (scientific as well as commercial)
More attractive today with positive technology and architectural trends
Wide range of parallel architectures: SMP servers, DSMs, large clusters, CMP, SMT,
CMT, …
Today’s microprocessors are, in fact, complex parallel machines trying to extract ILP as
well as TLP

Module 3: "Recap: Single-threaded Execution"


Lecture 5: "Pipelining and Hazards"

The Lecture Contains:

RECAP: SINGLE-THREADED EXECUTION

Long history

Single-threaded execution

CPI equation: analysis

Life of an instruction

Multi-cycle execution

Pipelining

More on pipelining

Control hazard

Branch delay slot

What else can we do?

Branch prediction

Data hazards

More on RAW

Multi-cycle EX stage

WAW hazard

Overall CPI

Multiple issue


Long history

Starting from long cycle/multi-cycle execution


Big leap: pipelining
Started with single issue
Matured into multiple issue
Next leap: speculative execution
Out-of-order issue, in-order completion
Today’s microprocessors feature
Speculation at various levels during execution
Deep pipelining
Sophisticated branch prediction
And many more performance-boosting hardware features

Single-threaded execution

Goal of a microprocessor
Given a sequential set of instructions it should execute them correctly as fast as
possible
Correctness is guaranteed as long as the external world sees the execution in-order
(i.e. sequential)
Within the processor it is okay to re-order the instructions as long as the changes to
states are applied in-order
Performance equation
Execution time = average CPI × number of instructions × cycle time

CPI equation: analysis

To reduce the execution time we can try to lower one or more of the three terms
Reducing average CPI (cycles per instruction):
The starting point could be CPI=1
But complex arithmetic operations e.g. multiplication/division take more than a cycle
Memory operations take even longer
So normally average CPI is larger than 1
How to reduce CPI is the core of this lecture
Reducing number of instructions
Better compiler, smart instruction set architecture (ISA)
Reducing cycle time: faster clock
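A tiny worked example of the performance equation; all numbers below are hypothetical, chosen only to illustrate the product of the three terms:

#include <stdio.h>

int main(void)
{
    /* Execution time = average CPI x number of instructions x cycle time */
    double avg_cpi    = 1.5;       /* average cycles per instruction        */
    double insn_count = 2.0e9;     /* 2 billion dynamic instructions        */
    double cycle_time = 0.5e-9;    /* 0.5 ns cycle time, i.e. a 2 GHz clock */

    double exec_time = avg_cpi * insn_count * cycle_time;
    printf("Execution time = %.2f s\n", exec_time);   /* prints 1.50 s */
    return 0;
}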

Life of an instruction

Fetch from memory


Decode/read (figure out the opcode, source and dest registers, read source registers)
Execute (ALUs, address calculation for memory op)
Memory access (for load/store)
Writeback or commit (write result to destination reg)
During execution the instruction may talk to
Register file (for reading source operands and writing results)
Cache hierarchy (for instruction fetch and for memory op)


Multi-cycle execution

Simplest implementation
Assume each of five stages takes a cycle
Five cycles to execute an instruction
After instruction i finishes you start fetching instruction i+1
Without “long latency” instructions CPI is 5
Alternative implementation
You could have a five times slower clock to accommodate all the logic within one cycle
Then you can say CPI is 1 excluding mult/div, mem op
But overall execution time really doesn’t change
What can you do to lower the CPI?

Pipelining

Simple observation
In the multi-cycle implementation when the ALU is executing, say, an add instruction
the decoder is idle
Exactly one stage is active at any point in time
Wastage of hardware
Solution: pipelining
Process five instructions in parallel
Each instruction is in a different stage of processing
Each stage is called a pipeline stage
Need registers between pipeline stages to hold partially processed instructions (called
pipeline latches): why?

More on pipelining

What do you gain?


Parallelism: called instruction-level parallelism (ILP)
Ideal CPI of 1 at the same clock speed as multi-cycle implementation: ideally 5 times
reduction in execution time
What are the problems?
Slightly more complex
Control and data hazards
These hazards put a limit on available ILP


Control hazard

Branches pose a problem

Two pipeline bubbles: increases average CPI


Can we reduce it to one bubble?

Branch delay slot

MIPS R3000 has one bubble


Called branch delay slot
Exploit clock cycle phases
On the positive half compute branch condition
On the negative half fetch the target

The PC update hardware (selection between target and next PC) works on the lower edge
Can we utilize the branch delay slot?
Ask the compiler guy
The delay slot is always executed (irrespective of the fate of the branch)
Boost instructions common to fall through and target paths to the delay slot
Not always possible to find
You have to be careful also
Must boost something that does not alter the outcome of fall-through or target basic
blocks
If the BD slot is filled with a useful instruction then we don’t lose anything in CPI;
otherwise we pay a branch penalty of one cycle

What else can we do?

Branch prediction
We can put a branch target cache in the fetcher
Also called branch target buffer (BTB)
Use the lower bits of the instruction PC to index the BTB
Use the remaining bits to match the tag
In case of a hit the BTB tells you the target of the branch when it executed last time
You can hope that this is correct and start fetching from that predicted target provided
by the BTB

One cycle later you get the real target, compare with the predicted target, and throw
away the fetched instruction in case of misprediction; keep going if predicted correctly
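A toy sketch of the BTB indexing and tag match just described; the table size, the bit positions, and the function names are illustrative and not taken from any particular processor:

#include <stdio.h>

#define BTB_ENTRIES 1024        /* illustrative size; must be a power of two */

struct btb_entry {
    unsigned tag;               /* upper PC bits                             */
    unsigned target;            /* target seen the last time this branch ran */
    int      valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Look up the BTB with the fetch PC; on a hit, return the predicted target. */
static int btb_lookup(unsigned pc, unsigned *predicted_target)
{
    unsigned index = (pc >> 2) & (BTB_ENTRIES - 1);   /* lower PC bits (word aligned) */
    unsigned tag   = pc >> 12;                        /* remaining upper bits         */
    if (btb[index].valid && btb[index].tag == tag) {
        *predicted_target = btb[index].target;
        return 1;                                     /* hit: fetch from the target   */
    }
    return 0;                                         /* miss: fetch the fall-through */
}

/* Record the actual target when the branch resolves. */
static void btb_update(unsigned pc, unsigned actual_target)
{
    unsigned index = (pc >> 2) & (BTB_ENTRIES - 1);
    btb[index].tag    = pc >> 12;
    btb[index].target = actual_target;
    btb[index].valid  = 1;
}

int main(void)
{
    unsigned target = 0;
    btb_update(0x1000, 0x2000);             /* branch at 0x1000 went to 0x2000 last time */
    int hit = btb_lookup(0x1000, &target);
    printf("hit=%d predicted target=0x%X\n", hit, target);
    return 0;
}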

Branch prediction

BTB will work great for


Loop branches
Subroutine calls
Unconditional branches
Conditional branch prediction
Rather dynamic in nature
The last target is not very helpful in general (if-then-else)
Need a direction predictor (predicts taken or not taken)
Once that prediction is available we can compute the target
Return address stack (RAS): push/pop interface


Data hazards

Data dependency in instruction stream limits ILP


True dependency (Read After Write: RAW)

Need a bypass network to avoid losing cycles


Without the bypass, the fetching of the subtraction would have to be delayed by three cycles
This is an example of RAW hazard

More on RAW

The most problematic dependencies involve memory ops


The memory ops may take a large number of cycles to return the value (if missed in cache)

This type of dependency is the primary cause of increased CPI and lower ILP

Multi-cycle EX stage

Thus far we have assumed a single cycle EX


Consider multiplication and division
Assume a four-cycle multiplication unit: mult r5, r4, r3 flows through IF ID EX1 EX2 EX3 EX4 MEM WB
Normally the multiplier is separate
So the next instruction can start executing when mult moves to EX2 stage and, in fact,
can finish before mult
More data hazards


WAW hazard

Write After Write (WAW)

The problem: out-of-order completion


The final value in r5 will nullify the effect of the add instruction
The bigger issue: precise exception is violated
Next load instruction raises an exception (may be due to page fault)
You handle the exception and start from the load
But value in r5 does not reflect precise state
Solution: disallow out-of-order completion

Overall CPI

CPI = 1.0 + pipeline overhead


Pipeline overhead comes from
Branch penalty (useless delay slots, mispredictions)
True data dependencies
Multi-cycle instructions (load/store, mult/div)
Other data hazards
So to improve CPI further
Need to have better branch prediction
Need to hide latency of memory ops, mult/div

Multiple issue

Thus far we have assumed that at most one instruction gets advanced to EX stage every cycle
If we have four ALUs we can issue four independent instructions every cycle
This is called superscalar execution
Ideally CPI should go down by a factor equal to issue width (more parallelism)
Extra hardware needed:
Wider fetch to keep the ALUs fed
More decode bandwidth, more register file ports; decoded instructions are put in an
issue queue
Selection of independent instructions for issue
In-order completion

Module 3: "Recap: Single-threaded Execution"


Lecture 6: "Instruction Issue Algorithms"

The Lecture Contains:

Instruction selection

In-order multi-issue

Out-of-order issue

WAR hazard

Modified bypass

WAR and WAW

Register renaming

The pipeline

What limits ILP now?

Cycle time reduction

Alternative: VLIW

Current research in µP


Instruction selection

Simplest possible design


Issue the instructions sequentially (in-order)
Scan the issue queue, stop as soon as you come to an instruction dependent on one
already issued

Cannot issue the last two even though they are independent of the first two: in-order
completion is a must for precise exception support

In-order multi-issue

Complexity of selection logic


Need to check for RAW and WAW
Comparisons for RAW: N(N-1) where N is the issue width
Comparisons for WAW: N(N-1)/2
18 comparators for 4-issue (the arithmetic is spelled out just below)
Still need to make sure instructions write back in-order to support precise exception
As instructions issue, they are removed from the issue queue and put in a re-order
buffer (also called active list in MIPS processors) [Isn’t WAW check sufficient?]
Instructions write back or retire in-order from re-order buffer (ROB)
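One way to see the comparator counts, assuming two source registers per instruction (a sketch of the arithmetic only):

#include <stdio.h>

int main(void)
{
    int n = 4;                       /* issue width */

    /* RAW: each later instruction compares its two source registers against
       the destinations of all earlier instructions in the group:
       2 * N(N-1)/2 = N(N-1) comparisons. */
    int raw = n * (n - 1);

    /* WAW: destinations compared pairwise: N(N-1)/2 comparisons. */
    int waw = n * (n - 1) / 2;

    printf("issue width %d: %d RAW + %d WAW = %d comparators\n",
           n, raw, waw, raw + waw);             /* 12 + 6 = 18 */
    return 0;
}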

Out-of-order issue

Taking the parallelism to a new dimension


Central to all modern microprocessors
Scan the issue queue completely, select independent instructions and issue as many as
possible limited only by the number of functional units
Need more comparators
Able to extract more ILP: CPI goes down further
Possible to overlap the latency of mult/div, load/store with execution of other independent
instructions


WAR hazard

Modified bypass

An executing instruction must broadcast results to the issue queue


Waiting instructions compare their source register numbers with the destination register
number of the bypassed value
Also, now it needs to make sure that it is consuming the right value in program order to
avoid WAR

Need to tag every instruction with its last producer


Can we simplify this?

WAR and WAW

These are really false dependencies


They arise due to register allocation by the compiler
Thus far we have assumed that ROB has space to hold the destination values: needs wide
ROB entries
These values are written back to the register file when the instructions retire or commit in-
order from ROB
Also, bypass becomes complicated
Better way to solve it: rename the destination registers


Register renaming

Registers visible to the compiler


Logical or architectural registers
Normally 32 in number for RISC and is fixed by the ISA
Physical registers inside the processor
Much larger in number
The destination logical register of every instruction is renamed to a physical register number
The dependencies are tracked based on physical registers
MIPS R10000 has 32 logical and 64 physical regs
Intel Pentium 4 has 8 logical and 128 physical regs

Now it is safe to issue them in parallel: they are really independent (compiler introduced
WAW)
Register renaming maintains a map table that records logical register to physical register map
After an instruction is decoded, its logical register numbers are available
The renamer looks up the map table to find mapping for the logical source regs of this
instruction, assigns a free physical register to the destination logical reg, and records the new
mapping
If the renamer runs out of physical registers, the pipeline stalls until at least one register is
available
When do you free a physical register?
Suppose a physical register P is mapped to a logical register L which is the destination
of instruction I
It is safe to free P only when the next producer of L retires (Why not earlier?)
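A highly simplified sketch of the map table and free list just described; the register counts follow the MIPS R10000 numbers above, the data structures and function names are invented, and the freeing of old mappings at retirement is omitted:

#include <stdio.h>

#define NUM_LOGICAL  32    /* architectural registers fixed by the ISA      */
#define NUM_PHYSICAL 64    /* physical registers (MIPS R10000-like numbers) */

static int map_table[NUM_LOGICAL];    /* logical register -> physical register */
static int free_list[NUM_PHYSICAL];   /* stack of free physical registers      */
static int free_top;

/* Rename one instruction with two sources and one destination. Returns 0 on
   success, -1 if no free physical register is left (the pipeline would stall). */
static int rename(int src1, int src2, int dst, int *psrc1, int *psrc2, int *pdst)
{
    *psrc1 = map_table[src1];          /* sources read the current mappings   */
    *psrc2 = map_table[src2];
    if (free_top == 0)
        return -1;                     /* out of physical registers: stall    */
    *pdst = free_list[--free_top];     /* allocate a free physical register   */
    map_table[dst] = *pdst;            /* record the new mapping              */
    return 0;
}

int main(void)
{
    /* Initially logical register i maps to physical register i; the rest are free. */
    for (int i = 0; i < NUM_LOGICAL; i++)
        map_table[i] = i;
    for (int p = NUM_PHYSICAL - 1; p >= NUM_LOGICAL; p--)
        free_list[free_top++] = p;

    int ps1, ps2, pd;
    if (rename(4, 5, 3, &ps1, &ps2, &pd) == 0)     /* e.g., add r3, r4, r5 */
        printf("r3 renamed to p%d (sources p%d, p%d)\n", pd, ps1, ps2);
    return 0;
}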

More physical registers


more in-flight instructions
possibility of more parallelism
But cannot make the register file very big
Takes time to access
Burns power


The pipeline

Fetch, decode, rename, issue, register file read, ALU, cache, retire
Fetch, decode, rename are in-order stages, each handles multiple instructions every cycle
The ROB entry is allocated in rename stage
Issue, register file, ALU, cache are out-of-order
Retire is again in-order, but multiple instructions may retire each cycle: need to free the
resources and drain the pipeline quickly

What limits ILP now?

Instruction cache miss (normally not a big issue)


Branch misprediction
Observe that you predict a branch in decode, and the branch executes in ALU
There are four pipeline stages before you know outcome
Misprediction amounts to loss of at least 4F instructions where F is the fetch width
Data cache miss
Assuming an issue width of 4, a frequency of 3 GHz, and a memory latency of 120 ns, you need
to find 1440 independent instructions to issue so that you can hide the memory latency:
this is impossible (resource shortage)
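The 1440 figure is simply the product of these parameters; a small C sketch of the arithmetic:

#include <stdio.h>

int main(void)
{
    /* Parameters from the example above. */
    double freq_hz       = 3.0e9;      /* 3 GHz clock           */
    double mem_latency_s = 120.0e-9;   /* 120 ns memory latency */
    int    issue_width   = 4;

    double miss_cycles = mem_latency_s * freq_hz;     /* 360 cycles */
    double independent = miss_cycles * issue_width;   /* 1440       */
    printf("miss latency = %.0f cycles -> need about %.0f independent instructions\n",
           miss_cycles, independent);
    return 0;
}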

Cycle time reduction

Execution time = CPI × instruction count × cycle time


Talked about CPI reduction or improvement in IPC (instructions retired per cycle)
Cycle time reduction is another technique to boost performance
Faster clock frequency
Pipelining poses a problem
Each pipeline stage should be one cycle for balanced progress
Smaller cycle time means need to break pipe stages into smaller stages
Superpipelining
Faster clock frequency necessarily means deep pipes
Each pipe stage contains small amount of logic so that it fits in small cycle time
May severely degrade CPI if not careful
Now branch penalty is even bigger (31 cycles for Intel Prescott): branch mispredictions
cause massive loss in performance (93 micro-ops are lost, F=3)
Long pipes also put more pressure on resources such as ROB and registers because
instruction latency increases (in terms of cycles, not in absolute terms)
Instructions occupy ROB entries and registers longer
The design becomes increasingly complicated (long wires)


Alternative: VLIW

Very Long Instruction Word computers


Compiler carries out all dependence analysis
Bundles as many independent instructions as allowed by the number of functional units
into an instruction packet
Hardware is a lot less complex
The instructions in the packet issue in parallel
Each packet of instructions is pretty long (hence the name)
Problem: compiler may not be able to extract as much ILP as a dynamic out-of-order
core; many packets may go unutilized
Big leap from VLIW: EPIC (Explicitly Parallel Instruction Computing) [Itanium family]

Current research in µP

Micro-architectural techniques to extract more ILP


Directly helps improve IPC and reduce CPI
Various speculative techniques to hide cache miss latency: prefetching, load value
prediction, etc.
Better branch prediction
Helps deep pipelines
Faster clocking
Need to cool the chip
Various techniques to reduce power consumption: clock gating, dynamic
voltage/frequency scaling (DVFS), power-aware resource usage
Fighting the long wires: scaling micro-architectures against the complexity wall

Module 4: "Recap: Virtual Memory and Caches"


Lecture 7: "Virtual Memory, TLB, and Caches"

RECAP: VIRTUAL MEMORY AND CACHE

Why virtual memory?

Virtual memory

Addressing VM

VA to PA translation

Page fault

VA to PA translation

TLB

Caches

Addressing a cache


Why virtual memory?

With a 32-bit address you can access 4 GB of physical memory (you will never get the full
memory though)
Seems enough for most day-to-day applications
But there are important applications that have much bigger memory footprint:
databases, scientific apps operating on large matrices etc.
Even if your application fits entirely in physical memory it seems unfair to load the full
image at startup
Just takes away memory from other processes, but probably doesn’t need the full
image at any point of time during execution: hurts multiprogramming
Need to provide an illusion of bigger memory: Virtual Memory (VM)

Virtual memory

Need an address to access virtual memory


Virtual Address (VA)
Assume a 32-bit VA
Every process sees a 4 GB of virtual memory
This is much better than a 4 GB physical memory shared between multiprogrammed
processes
The size of VA is really fixed by the processor data path width
64-bit processors (Alpha 21264, 21364; Sun UltraSPARC; AMD Athlon64, Opteron;
IBM POWER4, POWER5; MIPS R10000 onwards; Intel Itanium etc., and recently Intel
Pentium4) provide bigger virtual memory to each process
Large virtual and physical memory is very important in commercial server market: need
to run large databases

Addressing VM

There are primarily three ways to address VM


Paging, Segmentation, Segmented paging
We will focus on flat paging only
Paged
The entire VM is divided into small units called pages
Virtual pages are loaded into physical page frames as and when needed
(demand paging)
Thus the physical memory is also divided into equal sized page frames
The processor generates virtual addresses
But memory is physically addressed: need a VA to PA translation


VA to PA translation

The VA generated by the processor is divided into two parts:


Page offset and Virtual page number (VPN)
Assume a 4 KB page: within a 32-bit VA, lower 12 bits will be page offset (offset within
a page) and the remaining 20 bits are VPN (hence 1 M virtual pages total)
The page offset remains unchanged in the translation
Need to translate VPN to a physical page frame number (PPFN)
This translation is held in a page table resident in memory: so first we need to
access this page table
How to get the address of the page table?
Accessing the page table
The Page table base register (PTBR) contains the starting physical address of the
page table
PTBR is normally accessible in the kernel mode only
Assume each entry in page table is 32 bits (4 bytes)
Thus the required page table address is PTBR + VPN × 4 (the entry size)

Access memory at this address to get 32 bits of data from the page table entry (PTE)
These 32 bits contain many things: a valid bit, the much needed PPFN (may be 20 bits
for a 4 GB physical memory), access permissions (read, write, execute), a
dirty/modified bit etc.
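A minimal C sketch of this address arithmetic, assuming a 32-bit VA, 4 KB pages, and 4-byte PTEs as above; the addresses and the PPFN below are invented for the example:

#include <stdio.h>

int main(void)
{
    unsigned page_shift = 12;                     /* 4 KB pages                */
    unsigned pte_size   = 4;                      /* 32-bit page table entries */

    unsigned va   = 0x00403A10;                   /* example virtual address        */
    unsigned ptbr = 0x00200000;                   /* example page table base (a PA) */

    unsigned vpn    = va >> page_shift;                 /* upper 20 bits            */
    unsigned offset = va & ((1u << page_shift) - 1);    /* lower 12 bits, unchanged */

    unsigned pte_addr = ptbr + vpn * pte_size;    /* address of the PTE for this VPN */

    unsigned ppfn = 0x1A2B3;                      /* pretend the PTE returned this   */
    unsigned pa   = (ppfn << page_shift) | offset;

    printf("VPN=0x%05X offset=0x%03X PTE address=0x%08X PA=0x%08X\n",
           vpn, offset, pte_addr, pa);
    return 0;
}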

Page fault

The valid bit within the 32 bits tells you if the translation is valid
If this bit is reset that means the page is not resident in memory: results in a page fault
In case of a page fault the kernel needs to bring in the page to memory from disk
The disk address is normally provided by the page table entry (a different interpretation of the
remaining 31 bits)
Also kernel needs to allocate a new physical page frame for this virtual page
If all frames are occupied it invokes a page replacement policy

VA to PA translation

Page faults take a long time: order of ms


Need a good page replacement policy
Once the page fault finishes, the page table entry is updated with the new VPN to PPFN
mapping
Of course, if the valid bit was set, you get the PPFN right away without taking a page fault
Finally, PPFN is concatenated with the page offset to get the final PA
Processor now can issue a memory request with this PA to get the necessary data
Really two memory accesses are needed
Can we improve on this?


TLB

Why can’t we cache the most recently used translations?


Translation Look-aside Buffers (TLB)
Small set of registers (normally fully associative)
Each entry has two parts: the tag which is simply VPN and the corresponding PTE
The tag may also contain a process id
On a TLB hit you just get the translation in one cycle (may take slightly longer
depending on the design)
On a TLB miss you may need to access memory to load the PTE in TLB (more later)
Normally there are two TLBs: instruction and data

Caches

Once you have completed the VA to PA translation you have the physical address. What’s
next?
You need to access memory with that PA
Instruction and data caches hold most recently used (temporally close) and nearby (spatially
close) data
Use the PA to access the cache first
Caches are organized as arrays of cache lines
Each cache line holds several contiguous bytes (32, 64 or 128 bytes)


Addressing a cache

The PA is divided into several parts

The block offset determines the starting byte address within a cache line
The index tells you which cache line to access
In that cache line you compare the tag to determine hit/miss

An example
PA is 32 bits
Cache line is 64 bytes: block offset is 6 bits
Number of cache lines is 512: index is 9 bits
So tag is the remaining bits: 17 bits
Total size of the cache is 512*64 bytes i.e. 32 KB
Each cache line contains the 64 byte data, 17-bit tag, one valid/invalid bit, and several
state bits (such as shared, dirty etc.)
Since both the tag and the index are derived from the PA this is called a physically
indexed physically tagged cache
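A small C sketch that carries out the bit slicing from this example; the physical address value itself is arbitrary:

#include <stdio.h>

int main(void)
{
    /* Parameters from the example: 64-byte lines, 512 lines, 32-bit PA. */
    unsigned offset_bits = 6;                 /* 64 = 2^6  */
    unsigned index_bits  = 9;                 /* 512 = 2^9 */
    unsigned pa = 0xDEADBEEF;                 /* an arbitrary physical address */

    unsigned block_offset = pa & ((1u << offset_bits) - 1);
    unsigned index = (pa >> offset_bits) & ((1u << index_bits) - 1);
    unsigned tag   = pa >> (offset_bits + index_bits);    /* remaining 17 bits */

    printf("tag=0x%05X index=%u offset=%u (total data = %u KB)\n",
           tag, index, block_offset, 512u * 64u / 1024u);
    return 0;
}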

Module 4: "Recap: Virtual Memory and Caches"


Lecture 8: "Cache Hierarchy and Memory-level Parallelism"

Set associative cache

2-way set associative

Set associative cache

Cache hierarchy

States of a cache line

Inclusion policy

The first instruction

TLB access

Memory op latency

MLP

Out-of-order loads

Load/store ordering

MLP and memory wall


Set associative cache

The example assumes one cache line per index


Called a direct-mapped cache
A different access to a line evicts the resident cache line
This is either a capacity or a conflict miss
Conflict misses can be reduced by providing multiple lines per index
Access to an index returns a set of cache lines
For an n-way set associative cache there are n lines per set
Carry out multiple tag comparisons in parallel to see if any one in the set hits

2-way set associative

Set associative cache

When you need to evict a line in a particular set you run a replacement policy
LRU is a good choice: keeps the most recently used lines (favors temporal locality)
Thus you reduce the number of conflict misses
Two extremes of set size: direct-mapped (1-way) and fully associative (all lines are in a single
set)
Example: 32 KB cache, 2-way set associative, line size of 64 bytes: number of indices
or number of sets=32*1024/(2*64)=256 and hence index is 8 bits wide
Example: Same size and line size, but fully associative: number of sets is 1, within the
set there are 32*1024/64 or 512 lines; you need 512 tag comparisons for each access
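The set and index arithmetic from these examples, written out for a 32 KB cache with 64-byte lines at a few associativities (a sketch of the arithmetic only):

#include <stdio.h>

int main(void)
{
    unsigned cache_bytes = 32 * 1024;                   /* 32 KB cache     */
    unsigned line_bytes  = 64;                          /* 64-byte lines   */
    unsigned lines       = cache_bytes / line_bytes;    /* 512 lines total */

    /* Direct-mapped, 2-way, and fully associative (512 = all lines in one set). */
    unsigned ways_list[] = { 1, 2, 512 };

    for (int i = 0; i < 3; i++) {
        unsigned ways = ways_list[i];
        unsigned sets = lines / ways;          /* sets = size / (ways * line size) */
        unsigned index_bits = 0;
        for (unsigned s = sets; s > 1; s >>= 1)
            index_bits++;
        printf("%3u-way: %3u sets, %u index bits, %3u tag comparisons per access\n",
               ways, sets, index_bits, ways);
    }
    return 0;
}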


Cache hierarchy

Ideally want to hold everything in a fast cache


Never want to go to the memory
But, with increasing size the access time increases
A large cache will slow down every access
So, put increasingly bigger and slower caches between the processor and the memory
Keep the most recently used data in the nearest cache: register file (RF)
Next level of cache: level 1 or L1 (same speed or slightly slower than RF, but much bigger)
Then L2: way bigger than L1 and much slower
Example: Intel Pentium 4 (Netburst)
128 registers accessible in 2 cycles
L1 data cache: 8 KB, 4-way set associative, 64 bytes line size, accessible in 2 cycles
for integer loads
L2 cache: 256 KB, 8-way set associative, 128 bytes line size, accessible in 7 cycles
Example: Intel Itanium 2 (code name Madison)
128 registers accessible in 1 cycle
L1 instruction and data caches: each 16 KB, 4-way set associative, 64 bytes line size,
accessible in 1 cycle
Unified L2 cache: 256 KB, 8-way set associative, 128 bytes line size, accessible in 5
cycles
Unified L3 cache: 6 MB, 24-way set associative, 128 bytes line size, accessible in 14
cycles

States of a cache line

The life of a cache line starts off in invalid state (I)


An access to that line takes a cache miss and fetches the line from main memory
If it was a read miss the line is filled in shared state (S) [we will discuss it later; for now just
assume that this is equivalent to a valid state]
In case of a store miss the line is filled in modified state (M); instruction cache lines do not
normally enter the M state (no store to Icache)
The eviction of a line in M state must write the line back to the memory (this is called a
writeback cache); otherwise the effect of the store would be lost

Inclusion policy

A cache hierarchy implements inclusion if the contents of the level n cache (excluding the
register file) are a subset of the contents of the level n+1 cache
Eviction of a line from L2 must ask L1 caches (both instruction and data) to invalidate
that line if present
A store miss fills the L2 cache line in M state, but the store really happens in L1 data
cache; so L2 cache does not have the most up-to-date copy of the line
Eviction of an L1 line in M state writes back the line to L2
Eviction of an L2 line in M state first asks the L1 data cache to send the most up-to-
date copy (if any), then it writes the line back to the next higher level (L3 or main
memory)
Inclusion simplifies the on-chip coherence protocol (more later)


The first instruction

Accessing the first instruction


Take the starting PC
Access iTLB with the VPN extracted from PC: iTLB miss
Invoke iTLB miss handler
Calculate PTE address
If PTEs are cached in L1 data and L2 caches, look them up with PTE address: you will
miss there also
Access page table in main memory: PTE is invalid: page fault
Invoke page fault handler
Allocate page frame, read page from disk, update PTE, load PTE in iTLB, restart fetch
Now you have the physical address
Access Icache: miss
Send refill request to higher levels: you miss everywhere
Send request to memory controller (north bridge)
Access main memory
Read cache line
Refill all levels of cache as the cache line returns to the processor
Extract the appropriate instruction from the cache line with the block offset
This is the longest possible latency in an instruction/data access

TLB access

For every cache access (instruction or data) you need to access the TLB first
Puts the TLB in the critical path
Want to start indexing into cache and read the tags while TLB lookup takes place
Virtually indexed physically tagged cache
Extract index from the VA, start reading tag while looking up TLB
Once the PA is available do tag comparison
Overlaps TLB reading and tag reading
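
A small C sketch of why the overlap works: the index comes from page-offset bits of the virtual address, while the tag comparison needs the physical address. The field widths assumed here (64-byte lines, 64 sets, 4 KB pages) are illustrative, chosen so that block offset plus index fits within the page offset:

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_OFFSET_BITS 6
    #define INDEX_BITS        6
    #define PAGE_OFFSET_BITS  12

    /* The index uses only page-offset bits of the VA, so it can be computed
       in parallel with the TLB lookup. */
    static inline uint32_t cache_index(uint64_t va) {
        return (uint32_t)((va >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1));
    }

    /* The tag comparison waits for the TLB, since it needs the physical address. */
    static inline bool tag_match(uint64_t pa, uint64_t stored_tag) {
        return (pa >> PAGE_OFFSET_BITS) == stored_tag;   /* tag = physical page number */
    }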

Memory op latency

L1 hit: ~1 ns
L2 hit: ~5 ns
L3 hit: ~10-15 ns
Main memory: ~70 ns DRAM access time + bus transfer etc. = ~110-120 ns
If a load misses in all caches it will eventually come to the head of the ROB and block
instruction retirement (in-order retirement is a must)
Gradually, the pipeline backs up, processor runs out of resources such as ROB entries and
physical registers
Ultimately, the fetcher stalls: severely limits ILP


Module 4: "Recap: Virtual Memory and Caches"


Lecture 8: "Cache Hierarchy and Memory-level Parallelism"

MLP

Need memory-level parallelism (MLP)


Simply speaking, need to mutually overlap several memory operations
Step 1: Non-blocking cache
Allow multiple outstanding cache misses
Mutually overlap multiple cache misses
Supported by all microprocessors today (Alpha 21364 supported 16 outstanding cache
misses)
Step 2: Out-of-order load issue
Issue loads out of program order (address is not known at the time of issue)
How do you know the load didn’t issue before a store to the same address? Issuing
stores must check for this memory-order violation

Out-of-order loads

sw 0(r7), r6
… /* other instructions */
lw r2, 80(r20)

Assume that the load issues before the store because r20 gets ready before r6 or r7
The load accesses the store buffer (used for holding already executed store values before
they are committed to the cache at retirement)
If it misses in the store buffer it looks up the caches and, say, gets the value somewhere
After several cycles the store issues and it turns out that 0(r7)==80(r20) or they overlap; now
what?

Load/store ordering

Out-of-order load issue relies on speculative memory disambiguation


Assumes that there will be no conflicting store
If the speculation is correct, you have issued the load much earlier and you have
allowed the dependents to also execute much earlier
If there is a conflicting store, you have to squash the load and all the dependents that
have consumed the load value and re-execute them systematically
Turns out that the speculation is correct most of the time
To further minimize the load squash, microprocessors use simple memory dependence
predictors (predicts if a load is going to conflict with a pending store based on that
load’s or load/store pairs’ past behavior)
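
A minimal C sketch of one possible memory dependence predictor, assuming a simple table of per-load saturating counters (real designs such as store sets are more sophisticated; all names and sizes here are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    static uint8_t conflict_ctr[1024];     /* hypothetical 2-bit counters, one per load PC slot */

    static unsigned dep_idx(uint64_t load_pc) { return (unsigned)((load_pc >> 2) & 1023); }

    /* Predict: issue the load early only if it has not been conflicting recently. */
    bool safe_to_issue_early(uint64_t load_pc) {
        return conflict_ctr[dep_idx(load_pc)] < 2;
    }

    /* Train at retirement: strengthen on a detected store conflict, decay otherwise. */
    void train_on_retire(uint64_t load_pc, bool conflicted) {
        uint8_t *c = &conflict_ctr[dep_idx(load_pc)];
        if (conflicted) { if (*c < 3) (*c)++; }       /* saturate at 3 */
        else            { if (*c > 0) (*c)--; }
    }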


Module 4: "Recap: Virtual Memory and Caches"


Lecture 8: "Cache Hierarchy and Memory-level Parallelism"

MLP and memory wall

Today microprocessors try to hide cache misses by initiating early prefetches:


Hardware prefetchers try to predict next several load addresses and initiate cache line
prefetch if they are not already in the cache
All processors today also support prefetch instructions; so you can specify in your
program when to prefetch what: this gives much better control compared to a hardware
prefetcher
Researchers are working on load value prediction
Even after doing all these, memory latency remains the biggest bottleneck
Today microprocessors are trying to overcome one single wall: the memory wall
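
As an illustration of the software prefetch point above, a small C example using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an assumed value that would need tuning per machine:

    /* Sum an array while prefetching data that will be needed soon. */
    double sum(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read access, low temporal locality */
            s += a[i];
        }
        return s;
    }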


Module 5: "MIPS R10000: A Case Study"


Lecture 9: "MIPS R10000: A Case Study"

MIPS R10000

A case study in modern microarchitecture

Overview

Stage 1: Fetch

Stage 2: Decode/Rename

Branch prediction

Branch predictor

Register renaming

Preparing to issue

Stage 3: Issue

Load-dependents

Functional units

Result writeback

Retirement or commit

[Reference: K. C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2): 28-40, April 1996.]


Module 5: "MIPS R10000: A Case Study"


Lecture 9: "MIPS R10000: A Case Study"

Overview

Mid 90s: One of the first dynamic out-of-order superscalar RISC microprocessors
6.8 M transistors on 298 mm2 die (0.35 µm CMOS)
Out of 6.8 M transistors 4.4 M are devoted to L1 instruction and data caches
Fetches, decodes, renames 4 instructions every cycle
64-bit registers: the data path width is 64 bits
On-chip 32 KB L1 instruction and data caches, 2-way set associative
Off-chip L2 cache of variable size (512 KB to 16 MB), 2-way set associative, line size 128
bytes

Stage 1: Fetch

The instructions are slightly pre-decoded when the cache line is brought into Icache
Simplifies the decode stage
Processor fetches four sequential instructions every cycle from the Icache
The iTLB has eight entries, fully associative
No BTB
So the fetcher really cannot do anything about branches other than fetching sequentially

Stage 2: Decode/Rename

Decodes and renames four instructions every cycle


The targets of branches, unconditional jumps, and subroutine calls (named jump and link or
jal) are computed in this stage
Unconditional jumps are not fed into the pipeline and the fetcher PC is modified directly by
the decoder
Conditional branches look up a simple predictor to predict the branch direction (taken or not
taken) and accordingly modify the fetch PC


Module 5: "MIPS R10000: A Case Study"


Lecture 9: "MIPS R10000: A Case Study"

Branch prediction

Branches are predicted and unconditional jumps are computed in stage 2


There is always a one-cycle bubble (four instructions)
In case of branch misprediction (which will be detected later) the processor may need to roll
back and restart fetching from the correct target
Need to checkpoint (i.e. save) the register map right after the branch is renamed (will
be needed to restore in case of misprediction)
The processor supports at most four register map checkpoints; this is stored in a structure
called branch stack (really, it is a FIFO queue, not a stack)
Can support up to four in-flight branches

Branch predictor

The predictor is an array of 512 two-bit saturating counters


Can count up to 3; if already 3, an increment does not have any effect (remains at 3)
Similarly, if the count is 0, a decrement does not have any effect (remains at 0)
The array is indexed by PC[11:3]
Ignore lower 3 bits, take the next 9 bits
The outcome is the count at that index of the predictor
If count >= 2 then predict taken; else not taken
Very simple algorithm; prediction accuracy of 85+% on most benchmarks; works fine for short
pipes
Commonly known as bimodal branch predictor
The branch predictor is updated when a conditional branch retires (in-order update because
retirement is in-order)
At retirement we know the correct outcome of the branch
So we use that to train the predictor
If the branch is taken the count in the index for that branch is incremented (remains at
3 if already 3)
If the branch is not taken the count is decremented (remains at zero if already 0)
This predictor will fail to predict many simple patterns including alternating branches
depending on where the count starts
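
A small C sketch that mirrors the bimodal predictor described above (512 two-bit saturating counters indexed by PC[11:3]); it follows the slide's description, not any internal R10000 implementation:

    #include <stdint.h>
    #include <stdbool.h>

    static uint8_t counters[512];                 /* 2-bit saturating counters */

    static unsigned bp_index(uint32_t pc) { return (pc >> 3) & 0x1FF; }   /* PC[11:3] */

    bool predict_taken(uint32_t pc) {
        return counters[bp_index(pc)] >= 2;       /* count of 2 or 3 => predict taken */
    }

    /* Trained in order, when the branch retires with its resolved outcome. */
    void train_predictor(uint32_t pc, bool taken) {
        uint8_t *c = &counters[bp_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }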

Register renaming

Takes place in the second pipeline stage


As we have discussed, every destination is assigned a new physical register from the free list
The sources are assigned the existing map
Map table is updated with the newly renamed dest.
For every destination physical register, a busy bit is set high to signify that the value in this
register is not yet ready; this bit is cleared after the instruction completes execution
The integer and floating-point instructions are assigned registers from two separate free lists
The integer and fp register files are separate (each has 64 registers)
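
A compact C sketch of the renaming step just described (map table, free list, busy bits); the table sizes and names are illustrative only:

    #include <stdbool.h>

    #define NUM_ARCH 32
    #define NUM_PHYS 64

    static int  map_table[NUM_ARCH];   /* architectural -> physical mapping */
    static int  free_list[NUM_PHYS];   /* pool of unallocated physical registers */
    static int  free_top;
    static bool busy[NUM_PHYS];        /* set until the producing instruction completes */

    void init_rename(void) {
        for (int r = 0; r < NUM_ARCH; r++) map_table[r] = r;
        for (int p = NUM_ARCH; p < NUM_PHYS; p++) free_list[free_top++] = p;
    }

    /* Rename one instruction: sources read the current map, the destination gets a
       fresh physical register whose busy bit is set; the stage would stall if the
       free list were empty (not handled here). */
    void rename_one(int src1, int src2, int dest, int *psrc1, int *psrc2, int *pdest) {
        *psrc1 = map_table[src1];
        *psrc2 = map_table[src2];
        *pdest = free_list[--free_top];
        busy[*pdest] = true;
        map_table[dest] = *pdest;
    }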


Module 5: "MIPS R10000: A Case Study"


Lecture 9: "MIPS R10000: A Case Study"

Preparing to issue

Finally, during the second stage every instruction is assigned an active list entry
The active list is a 32-entry FIFO queue which keeps track of all in-flight instructions
(at most 32) in-order
Each entry contains various info about the allocated instruction such as physical dest
reg number etc.
Also, each instruction is assigned to one of the three issue queues depending on its type
Integer queue: holds integer ALU instructions
Floating-point queue: holds FPU instructions
Address queue: holds the memory operations
Therefore, stage 2 may stall if the processor runs out of: active list entries, physical regs,
issue queue entries

Stage 3: Issue

Three issue queue selection logics work in parallel


Integer and fp queue issue logics are similar
Integer issue logic
Integer queue contains 16 entries (can hold at most 16 instructions)
Search for ready-to-issue instructions among these 16
Issue at most two instructions to two ALUs
Address queue
Slightly more complicated
When a load or a store is issued the address is still not known
To simplify matters, R10000 issues load/stores in-order (we have seen problems
associated with out-of-order load/store issue)

Load-dependents

The loads take two cycles to execute


During the first cycle the address is computed
During the second cycle the dTLB and data cache are accessed
Ideally I want to issue an instruction dependent on the load so that the instruction can
pick up the load value from the bypass just in time
Assume that a load issues in cycle 0, computes address in cycle 1, and looks up cache
in cycle 2
I want to issue the dependent in cycle 2 so that it can pick up the load value just
before executing in cycle 3
Thus the load looks up cache in parallel with the issuing of the dependent; the
dependent is issued even before it is known whether the load will hit in
the cache; this is called load hit speculation (re-execute later if the load misses)


Module 5: "MIPS R10000: A Case Study"


Lecture 9: "MIPS R10000: A Case Study"

Functional units

Right after an instruction is issued it reads the source operands (dictated by physical reg
numbers) from the register file (integer or fp depending on instruction type)
From stage 4 onwards the instructions execute
Two ALUs: branch and shift can execute on ALU1, multiply/divide can execute on
ALU2, all other instructions can execute on any of the two ALUs; ALU1 is responsible
for triggering rollback in case of branch misprediction (marks all instructions after the
branch as squashed, restores the register map from correct branch stack entry, sets
fetch PC to the correct target)
Four FPUs: one dedicated for fp multiply, one for fp divide, one for fp square root, most
of the other instructions execute on the remaining FPU
LSU (Load/store unit): Address calc. ALU, dTLB is fully assoc. with 64 entries and
translates 44-bit VA to 40-bit PA, PA is used to match dcache tags (virtually indexed
physically tagged)

Result writeback

As soon as an instruction completes execution the result is written back to the destination
physical register
No need to wait till retirement since the renamer has guaranteed that this physical
destination is associated with a unique instruction in the pipeline
Also the results are launched on the bypass network (from outputs of ALU/FPU/dcache to
inputs of ALU/FPU/address calculation ALUs)
This guarantees that dependents can be issued back-to-back and still they can receive
the correct value
add r3, r4, r5; add r6, r4, r3; (can be issued in consecutive cycles, although the second
add will read a wrong value of r3 from the register file)

Retirement or commit

Immediately after the instructions finish execution, they may not be able to leave the pipe
In-order retirement is necessary for precise exception
When an instruction comes to the head of the active list it can retire
R10k retires 4 instructions every cycle
Retirement involves
Updating the branch predictor and freeing its branch stack entry if it is a branch
instruction
Moving the store value from the speculative store buffer entry to the L1 data cache if it
is a store instruction
Freeing old destination physical register and updating the register free list
And, finally, freeing the active list entry itself


Self-assessment Exercise

These problems should be tried after module 05 is completed.

1. Consider the following memory organization of a processor. The virtual


address is 40 bits, the physical address is 32 bits, the page size is 8 KB. The
processor has a 4-way set associative 128-entry TLB i.e. each way has 32 sets.
Each page table entry is 32 bits in size. The processor also has a 2-way set
associative 32 KB L1 cache with line size of 64 bytes.

(A) What is the total size of the page table?


(B) Clearly show (with the help of a diagram) the addressing scheme if the
cache is virtually indexed and physically tagged. Your diagram should show the
width of TLB and cache tags.
(C) If the cache was physically indexed and physically tagged, what part of the
addressing scheme would change?

2. A set associative cache has longer hit time than an equally sized direct-
mapped cache. Why?

3. The Alpha 21264 has a virtually indexed virtually tagged instruction cache.
Do you see any security/protection issues with this? If yes, explain and offer a
solution. How would you maintain correctness of such a cache in a multi-
programmed environment?

4. Consider the following segment of C code for adding the elements in each
column of an NxN matrix A and putting it in a vector x of size N.

for(j=0;j<N;j++) {
for(i=0;i<N;i++) {
x[j] += A[i][j];
}
}

Assume that the C compiler carries out a row-major layout of matrix A i.e. A[i][j]
and A[i][j+1] are adjacent to each other in memory for all i and j in the legal
range and A[i][N-1] and A[i+1][0] are adjacent to each other for all i in the legal
range. Assume further that each element of A and x is a floating point double
i.e. 8 bytes in size. This code is executed on a modern speculative out-of-order
processor with the following memory hierarchy: page size 4 KB, fully associative
128-entry data TLB, 32 KB 2-way set associative single level data cache with
32 bytes line size, 256 MB DRAM. You may assume that the cache is virtually
indexed and physically tagged, although this information is not needed to
answer this question. For N=8192, compute the following (please show all the
intermediate steps). Assume that every instruction hits in the instruction cache.
Assume LRU replacement policy for physical page frames, TLB entries, and
cache sets.

(A) Number of page faults.


(B) Number of data TLB misses.


(C) Number of data cache misses. Assume that x and A do not conflict with
each other in the cache.
(D) At most how many memory operations can the processor overlap before
coming to a halt? Assume that the instruction selection logic (associated with
the issue unit) gives priority to older instructions over younger instructions if both
are ready to issue in a cycle.

5. Suppose you are running a program on two machines, both having a single
level of cache hierarchy (i.e. only L1 caches). In one machine the cache is
virtually indexed and physically tagged while in the other it is physically indexed
and physically tagged. Will there be any difference in cache miss rates when the
program is run on these two machines?


Solution of Self-assessment Exercise

1. Consider the following memory organization of a processor. The virtual


address is 40 bits, the physical address is 32 bits, the page size is 8 KB. The
processor has a 4-way set associative 128-entry TLB i.e. each way has 32 sets.
Each page table entry is 32 bits in size. The processor also has a 2-way set
associative 32 KB L1 cache with line size of 64 bytes.

(A) What is the total size of the page table?

Solution: The physical memory is 2^32 bytes, since the physical address is 32
bits. Since the page size is 8 KB, the number of pages is (2^32)/(2^13) i.e.,
2^19. Since each page table entry is four bytes in size and each page must
have one page table entry, the size of the page table is (2^19)*4 bytes or 2 MB.

(B) Clearly show (with the help of a diagram) the addressing scheme if the
cache is virtually indexed and physically tagged. Your diagram should show the
width of TLB and cache tags.

Solution: I will describe the addressing scheme here. You can derive the
diagram from that. The processor generates 40-bit virtual addresses for memory
operations. This address must be translated to a physical address that can be
used to look up the memory (through the caches). The first step in this
translation is TLB lookup. Since the TLB has 32 sets, the index width for TLB
lookup is five bits. The lowest 13 bits of the virtual address constitute the page
offset and are not used for TLB lookup. The next lower five bits are used for
indexing into the TLB. This leaves the upper 22 bits of the virtual address to be
used as the TLB tag. On a TLB hit, the TLB entry provides the necessary page
table entry, which is 32 bits in width. On a TLB miss, the page table entry must
be read from the page table resident in memory or cache. Nonetheless, the net
effect of whichever path is taken is that we have the 32-bit page table entry.
From these 32 bits, the necessary 19-bit physical page frame number is
extracted (recall that the number of physical pages is 2^19). When the 13-bit
page offset is concatenated to this 19-bit frame number, we get the target
physical address. We must first look up the cache to check if the data
corresponding to this address is already resident there before querying the
memory. Since the cache is virtually indexed and physically tagged, the cache
lookup can start at the same time as TLB lookup. The cache has (2^15)/(64*2)
or 256 sets. So eight bits are needed to index the cache. The lower six bits of
the virtual address are the block offset and not used for cache indexing. The
next eight bits are used as cache index. The tags resident at both the ways of
this set are read out. The target tag is computed from the physical address and
must be compared against both the read out tags to test for a cache hit. Let's
try to understand how the target tag is computed from the physical address that
we have formed above with the help of the page table entry. Usually, the tag is
derived by removing the block offset and cache index bits from the physical
address. So, in this case, it is tempting to take the upper 18 bits of the physical
address as the cache tag. Unfortunately, this does not work for virtually indexed


physically tagged cache where the page offset is smaller than the block offset
plus cache index. In this particular example, they differ by one bit. Let's see what
the problem is. Consider two different cache blocks residing at the same
cache index derived from the virtual address. This means that these blocks
have identical lower 14 bits of the virtual address. This guarantees that these
two blocks will have identical lower 13 bits of physical address because virtual
to physical address translation does not change page offset bits. However,
nothing stops these two blocks from having identical upper 18 bits of the
physical address but a different 14th bit. Now, it is clear why the traditional tag
computation would make mistakes in identifying the correct block. So the cache
tag must also include the 14th bit. In other words, the cache tag needs to be
identical to the physical page frame number. This completes the cache lookup.
On a cache miss, the 32-bit physical address must be sent to memory for
satisfying the cache miss.
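
The addressing scheme just described, laid out field by field (this stands in for the diagram the question asks for; all widths come from the solution above):

    Virtual address (40 bits):    [ TLB tag: 22 bits | TLB index: 5 bits | page offset: 13 bits ]
    Bits used for cache lookup:   [ cache index: 8 bits | block offset: 6 bits ]  (lowest 14 bits of the VA)
    Physical address (32 bits):   [ physical page frame number: 19 bits | page offset: 13 bits ]
    Cache tag stored/compared:    the 19-bit physical page frame number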

(C) If the cache was physically indexed and physically tagged, what part of the
addressing scheme would change?

Solution: Almost everything remains unchanged, except that the cache index
comes from the physical address now. As a result, the cache lookup cannot
start until the TLB lookup completes. The cache tag now can be only upper 18
bits of the physical address.

2. A set associative cache has longer hit time than an equally sized direct-
mapped cache. Why?

Solution: Iso-capacity direct-mapped cache has wider index decoder than a


set associative cache. So index decoding takes longer in the direct-mapped
cache. However, in set associative cache, a multiplexing stage is needed to
choose from the possible candidates within the target set based on tag
comparison outcome. While the decoder width falls logarithmically with set-
associativity, the multiplexer width grows linearly. For example, a k-way set
associative cache would require log(k) fewer index bits compared to an iso-
capacity direct-mapped cache, but the multiplexing stage of the set associative
cache would require a k-to-1 multiplexer. Overall, the multiplexer delay
outweighs the gain in the decoder delay.

3. The Alpha 21264 has a virtually indexed virtually tagged instruction cache.
Do you see any security/protection issues with this? If yes, explain and offer a
solution. How would you maintain correctness of such a cache in a multi-
programmed environment?

Solution: The main purpose of having a virtually indexed virtually tagged


instruction cache is to get rid of the TLB from the instruction lookup path.
However, this also removes the much-needed protection provided by the TLB
entries. For example, now buggy codes can easily overwrite the instructions in
the instruction cache. The minimal solution to this problem would still retain the
read-write-execute permission bits in a TLB-like structure. This is essentially a
translation-less TLB. In a multi-programmed environment, it becomes difficult to
distinguish codes belonging to different processes in a virtually indexed virtually
tagged cache because every process has the same virtual address map. Two
possible solutions exist. On a context switch, one can flush the entire instruction
cache. This may slightly elongate the context switch time and the process that is


switching in will see the cold start effect. Another solution would incorporate
process id in the cache tag. However, this may increase the cache latency
depending on the width of the process id. In general, it is very difficult to say
which one is going to be better; it depends on the class of applications that
will run.

4. Consider the following segment of C code for adding the elements in each
column of an NxN matrix A and putting it in a vector x of size N.

for(j=0;j<N;j++) {
for(i=0;i<N;i++) {
x[j] += A[i][j];
}
}

Assume that the C compiler carries out a row-major layout of matrix A i.e., A[i][j]
and A[i][j+1] are adjacent to each other in memory for all i and j in the legal
range and A[i][N-1] and A[i+1][0] are adjacent to each other for all i in the legal
range. Assume further that each element of A and x is a floating point double
i.e., 8 bytes in size. This code is executed on a modern speculative out-of-order
issue processor with the following memory hierarchy: page size 4 KB, fully
associative 128-entry data TLB, 32 KB 2-way set associative single level data
cache with 32 bytes line size, 256 MB DRAM. You may assume that the cache
is virtually indexed and physically tagged, although this information is not
needed to answer this question. For N=8192, compute the following (please
show all the intermediate steps). Assume that every instruction hits in the
instruction cache. Assume LRU replacement policy for physical page frames,
TLB entries, and cache sets.

(A) Number of page faults.

Solution: The total size of x is 64 KB and the total size of A is 512 MB. So,
these do not fit in the physical memory, which is of size 256 MB. Also, we note
that one row of A is of size 64 KB. As a result, every row of A starts on a new
page. As the computation starts, the first outer loop iteration suffers from one
page fault due to x and 8192 page faults due to A. Since one page can hold 512
elements of x and A, the next 511 outer loop iterations do not take any page
faults. The j=512 iteration again suffers from one page fault in x and 8192 fresh
page faults in A. This pattern continues until the memory gets filled up. At this
point we need to invoke the replacement policy, which is LRU. As a result, the
old pages of x and A will get replaced to make room for the new ones. Instead
of calculating the exact iteration point where the memory gets exhausted, we
only note that the page fault pattern continues to hold even beyond this point.
Therefore, the total number of page faults is 8193*(8192/512) or 8193*16 or
131088.

(B) Number of data TLB misses.

Solution: The TLB can hold 128 pages at a time. The TLB gets filled up at j=0,
i=126 with one translation for x[0] and 127 translations for A[0][0] to A[126][0]. At
this point, the LRU replacement policy is invoked and it replaces the translations
of A. The translation of x[0] does not get replaced because it is touched in every
inner loop iteration. By the time the j=0 iteration is finished, only the last 127


translations of A survive in the TLB. As a result, every access of A suffers from


a TLB miss: A can never reuse its TLB translations because the
reuse distance exceeds the TLB reach. On the other hand, x enjoys maximum
possible reuse in the TLB. Therefore, every page of x suffers from exactly one
TLB miss, while every access of A suffers from a TLB miss. So, the total
number of TLB misses is 16+8192*8192 or 16+64M.

(C) Number of data cache misses. Assume that x and A do not conflict with
each other in the cache.

Solution: In this case also, x enjoys maximum reuse, while A suffers from a
cache miss on every access. This is because the number of blocks in the cache
is 1024, which is much smaller than the reuse distance in A. One cache block
can hold four elements of x. As a result, x takes a cache miss on every fourth
element. So, the total number of cache misses is 2048+8192*8192 or 2K+64M.

(D) At most how many memory operations can the processor overlap before
coming to a halt? Assume that the instruction selection logic (associated with
the issue unit) gives priority to older instructions over younger instructions if both
are ready to issue in a cycle.

Solution: Since every access of A suffers from a TLB miss and the TLB
misses are usually implemented as restartable exceptions, there cannot be any
overlap among multiple memory operations. A typical iteration would involve load
of x, TLB miss followed by load of A, addition, and store to x. No two memory
operations can overlap because the middle one always suffers from a TLB miss
leading to a pipe flush.

5. Suppose you are running a program on two machines, both having a single
level of cache hierarchy (i.e. only L1 caches). In one machine the cache is
virtually indexed and physically tagged while in the other it is physically indexed
and physically tagged. Will there be any difference in cache miss rates when the
program is run on these two machines?

Solution: Depending on how the application is written, the virtually indexed


cache may exhibit a higher miss rate only if the cache organization is such that
the block offset plus the index bits exceed the page offset. We have seen above
that in such situations the tag of the virtually indexed cache has to be extended
to match the physical page frame number. As a result, the number of different
cache blocks that can map to a cache index is larger in a virtually indexed
cache. This increases the chance of conflict misses.


Module 6: "Fundamentals of Parallel Computers"


Lecture 10: "Communication Architecture"

Fundamentals of Parallel Computers

Agenda

Communication architecture

Layered architecture

Shared address

Message passing

Convergence

Data parallel arch.

[From Chapter 1 of Culler, Singh, Gupta]


Module 6: "Fundamentals of Parallel Computers"


Lecture 10: "Communication Architecture"

Agenda

Convergence of parallel architectures


Fundamental design issues
ILP vs. TLP

Communication architecture

Historically, parallel architectures are tied to programming models


Diverse designs made it impossible to write portable parallel software
But the driving force was the same: need for fast processing
Today parallel architecture is seen as an extension of microprocessor architecture with a
communication architecture
Defines the basic communication and synchronization operations and provides hw/sw
implementation of those

Layered architecture

A parallel architecture can be divided into several layers


Parallel applications
Programming models: shared address, message passing, multiprogramming, data
parallel, dataflow etc
Compiler + libraries
Operating systems support
Communication hardware
Physical communication medium
Communication architecture = user/system interface + hw implementation (roughly defined by
the last four layers)
Compiler and OS provide the user interface to communicate between and synchronize
threads


Module 6: "Fundamentals of Parallel Computers"


Lecture 10: "Communication Architecture"

Shared address

Communication takes place through a logically shared portion of memory


User interface is normal load/store instructions
Load/store instructions generate virtual addresses
The VAs are translated to PAs by TLB or page table
The memory controller then decides where to find this PA
Actual communication is hidden from the programmer
The general communication hw consists of multiple processors connected over some medium
so that they can talk to memory banks and I/O devices
The architecture of the interconnect may vary depending on projected cost and target
performance
Communication medium

Interconnect could be a crossbar switch so that any processor can talk to any memory
bank in one “hop” (provides latency and bandwidth advantages)
Scaling a crossbar becomes a problem: cost is proportional to square of the size
Instead, could use a scalable switch-based network; latency increases and bandwidth
decreases because now multiple processors contend for switch ports
Communication medium
From mid 80s shared bus became popular leading to the design of SMPs
Pentium Pro Quad was the first commodity SMP
Sun Enterprise server provided a highly pipelined wide shared bus for scalability
reasons; it also distributed the memory to each processor, but there was no local bus
on the boards i.e. the memory was still “symmetric” (must use the shared bus)
NUMA or DSM architectures provide a better solution to the scalability problem; the
symmetric view is replaced by local and remote memory and each node (containing
processor(s) with caches, memory controller and router) gets connected via a scalable
network (mesh, ring etc.); Examples include Cray/SGI T3E, SGI Origin 2000, Alpha
GS320, Alpha/HP GS1280 etc.


Module 6: "Fundamentals of Parallel Computers"


Lecture 10: "Communication Architecture"

Message passing

Very popular for large-scale computing


The system architecture looks exactly same as DSM, but there is no shared memory
The user interface is via send/receive calls to the message layer
The message layer is integrated to the I/O system instead of the memory system
Send specifies a local data buffer that needs to be transmitted; send also specifies a tag
A matching receive at dest. node with the same tag reads in the data from kernel space
buffer to user memory
Effectively, provides a memory-to-memory copy
Actual implementation of message layer
Initially it was very topology dependent
A node could talk only to its neighbors through FIFO buffers
These buffers were small in size and therefore while sending a message send would
occasionally block waiting for the receive to start reading the buffer (synchronous
message passing)
Soon the FIFO buffers got replaced by DMA (direct memory access) transfers so that a
send can initiate a transfer from memory to I/O buffers and finish immediately (DMA
happens in background); same applies to the receiving end also
The parallel algorithms were designed specifically for certain topologies: a big problem
To improve usability of machines, the message layer started providing support for arbitrary
source and destination (not just nearest neighbors)
Essentially involved storing a message in intermediate “hops” and forwarding it to the
next node on the route
Later this store-and-forward routing got moved to hardware where a switch could
handle all the routing activities
Further improved to do pipelined wormhole routing so that the time taken to traverse
the intermediate hops became small compared to the time it takes to push the
message from processor to network (limited by node-to-network bandwidth)
Examples include IBM SP2, Intel Paragon
Each node of Paragon had two i860 processors, one of which was dedicated to
servicing the network (send/recv. etc.)
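
To make the send/receive-with-tag interface described above concrete, a minimal MPI example in C (MPI is used here only as one familiar message layer; the buffer contents and the tag value are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data[4] = {1, 2, 3, 4};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Send a local buffer to node 1 with tag 99. */
            MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The matching receive (same source and tag) copies it into user memory. */
            MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("got %d %d %d %d\n", data[0], data[1], data[2], data[3]);
        }
        MPI_Finalize();
        return 0;
    }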


Module 6: "Fundamentals of Parallel Computers"


Lecture 10: "Communication Architecture"

Convergence

Shared address and message passing are two distinct programming models, but the
architectures look very similar
Both have a communication assist or network interface to initiate messages or
transactions
In shared memory this assist is integrated with the memory controller
In message passing this assist normally used to be integrated with the I/O, but the
trend is changing
There are message passing machines where the assist sits on the memory bus or
machines where DMA over network is supported (direct transfer from source memory
to destination memory)
Finally, it is possible to emulate send/recv. on shared memory through shared buffers
and flags
Possible to emulate a shared virtual mem. on message passing machines through
modified page fault handlers

Data parallel arch.

Array of processing elements (PEs)


Each PE operates on a data element within a large matrix
The operation is normally specified by a control processor
Essentially, single-instruction-multiple-data (SIMD) architectures
So the parallelism is exposed at the data level
Processor arrays were outplayed by vector processors in mid-70s
Vector processors provide a more general framework to operate on large matrices in a
controlled fashion
No need to design a specialized processor array in a certain topology
Advances in VLSI circuits in mid-80s led to design of large arrays of single-bit PEs
Also, arbitrary communication (rather than just nearest neighbor) was made possible
Gradually, this architecture evolved into SPMD (single-program-multiple-data)
All processors execute the same copy of a program in a more controlled fashion
But parallelism is expressed by partitioning the data
Essentially, the same as the way shared memory or message passing machines are
used for running parallel applications


Module 6: "Fundamentals of Parallel Computers"


Lecture 11: "Design Issues in Parallel Computers"

Fundamentals of Parallel Computers

Dataflow architecture

Systolic arrays

A generic architecture

Design issues

Naming

Operations

Ordering

Replication

Communication cost

ILP vs. TLP

[From Chapter 1 of Culler, Singh, Gupta]


Module 6: "Fundamentals of Parallel Computers"


Lecture 11: "Design Issues in Parallel Computers"

Dataflow architecture

Express the program as a dataflow graph


Logical processor at each node is activated when both operands are available
Mapping of logical nodes to PEs is specified by the program
On finishing an operation, a message or token is sent to the destination processor
Arriving tokens are matched against a token store and a match triggers the operation

Systolic arrays

Replace the pipeline within a sequential processor by an array of PEs

Each PE may have small instruction and data memory and may carry out a different operation
Data proceeds through the array at regular “heartbeats” (hence the name)
The dataflow may be multi-directional or optimized for specific algorithms
Optimize the interconnect for specific application (not necessarily a linear topology)
Practical implementation in iWARP
Uses general purpose processors as PEs
Dedicated channels between PEs for direct register to register communication

A generic architecture

In all the architectures we have discussed thus far a node essentially contains processor(s) +
caches, memory and a communication assist (CA)
CA = network interface (NI) + communication controller
The nodes are connected over a scalable network
The main difference remains in the architecture of the CA
And even under a particular programming model (e.g., shared memory) there is a lot of
choices in the design of the CA
Most innovations in parallel architecture take place in the communication assist (also
called communication controller or node controller)


Module 6: "Fundamentals of Parallel Computers"


Lecture 11: "Design Issues in Parallel Computers"

Design issues

Need to understand architectural components that affect software


Compiler, library, program
User/system interface and hw/sw interface
How programming models efficiently talk to the communication architecture?
How to implement efficient primitives in the communication layer?
In a nutshell, what issues of a parallel machine will affect the performance of the
parallel applications?
Naming, Operations, Ordering, Replication, Communication cost

Naming

How are the data in a program referenced?


In sequential programs a thread can access any variable in its virtual address space
In shared memory programs a thread can access any private or shared variable (same
load/store model of sequential programs)
In message passing programs a thread can access local data directly
Clearly, naming requires some support from hw and OS
Need to make sure that the accessed virtual address gets translated to the correct
physical address

Operations

What operations are supported to access data?


For sequential and shared memory models load/store are sufficient
For message passing models send/receive are needed to access remote data
For shared memory, hw (essentially the CA) needs to make sure that a load/store
operation gets correctly translated to a message if the address is remote
For message passing, CA or the message layer needs to copy data from local memory
and initiate send, or copy data from receive buffer to user area in local memory

Ordering

How are the accesses to the same data ordered?


For sequential model, it is the program order: true dependence order
For shared memory, within a thread it is the program order, across threads some “valid
interleaving” of accesses as expected by the programmer and enforced by
synchronization operations (locks, point-to-point synchronization through flags, global
synchronization through barriers)
Ordering issues are very subtle and important in shared memory model (some
microprocessor re-ordering tricks may easily violate correctness when used in shared
memory context)
For message passing, ordering across threads is implied through point-to-point
send/receive pairs (producer-consumer relationship) and mutual exclusion is inherent
(no shared variable)


Module 6: "Fundamentals of Parallel Computers"


Lecture 11: "Design Issues in Parallel Computers"

Replication

How is the shared data locally replicated?


This is very important for reducing communication traffic
In microprocessors data is replicated in the cache to reduce memory accesses
In message passing, replication is explicit in the program and happens through receive
(a private copy is created)
In shared memory a load brings in the data to the cache hierarchy so that subsequent
accesses can be fast; this is totally hidden from the program and therefore the
hardware must provide a layer that keeps track of the most recent copies of the data
(this layer is central to the performance of shared memory multiprocessors and is
called the cache coherence protocol)

Communication cost

Three major components of the communication architecture that affect performance


Latency: time to do an operation (e.g., load/store or send/recv.)
Bandwidth: rate of performing an operation
Overhead or occupancy: how long is the communication layer occupied doing an
operation
Latency
Already a big problem for microprocessors
Even bigger problem for multiprocessors due to remote operations
Must optimize application or hardware to hide or lower latency (algorithmic
optimizations or prefetching or overlapping computation with communication)
Bandwidth
How many ops in unit time e.g. how many bytes transferred per second
Local BW is provided by heavily banked memory or faster and wider system bus
Communication BW has two components: 1. node-to-network BW (also called network
link BW) measures how fast bytes can be pushed into the router from the CA, 2.
within-network bandwidth: affected by scalability of the network and architecture of the
switch or router
Linear cost model: Transfer time = T0 + n/B where T0 is start-up overhead, n is number of
bytes transferred and B is BW
Not sufficient since overlap of comp. and comm. is not considered; also does not count
how the transfer is done (pipelined or not)
Better model:
Communication time for n bytes = Overhead + CA occupancy + Network latency +
Size/BW + Contention
T(n) = O_v + O_c + L + n/B + T_c, where O_v is the overhead, O_c is the occupancy, L is the network latency and T_c is the contention delay
Overhead and occupancy may be functions of n
Contention depends on the queuing delay at various components along the
communication path e.g. waiting time at the communication assist or controller, waiting
time at the router etc.
Overall communication cost = frequency of communication x (communication time –
overlap with useful computation)
Frequency of communication depends on various factors such as how the program is


written or the granularity of communication supported by the underlying hardware
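
As a worked example with purely illustrative numbers (not taken from any specific machine): take O_v = 1 us of software overhead, O_c = 0.5 us of assist occupancy, L = 2 us of network latency, B = 1 GB/s, a 4 KB message and no contention. Then T(n) = 1 + 0.5 + 2 + 4096/1000 us, i.e., roughly 7.6 us, of which about 3.5 us is fixed per-message cost; this is why the granularity of communication and its overlap with computation matter so much.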

ILP vs. TLP

Microprocessors enhance performance of a sequential program by extracting parallelism from


an instruction stream (called instruction-level parallelism)
Multiprocessors enhance performance of an explicitly parallel program by running multiple
threads in parallel (called thread-level parallelism)
TLP provides parallelism at a much larger granularity compared to ILP
In multiprocessors ILP and TLP work together
Within a thread ILP provides performance boost
Across threads TLP provides speedup over a sequential version of the parallel program


Module 7: "Parallel Programming"


Lecture 12: "Steps in Writing a Parallel Program"

Parallel Programming

Prolog: Why bother?

Agenda

Ocean current simulation

Galaxy simulation

Ray tracing

Writing a parallel program

Some definitions

Decomposition of Iterative Equation Solver

Static assignment

Dynamic assignment

Decomposition types

Orchestration

Mapping

An example

Sequential program

[From Chapter 2 of Culler, Singh, Gupta]


Module 7: "Parallel Programming"


Lecture 12: "Steps in Writing a Parallel Program"

Prolog: Why bother?

As an architect why should you be concerned with parallel programming?


Understanding program behavior is very important in developing high-performance
computers
An architect designs machines that will be used by the software programmers: so need
to understand the needs of a program
Helps in making design trade-offs and cost/performance analysis i.e. what hardware
feature is worth supporting and what is not
Normally an architect needs to have a fairly good knowledge in compilers and
operating systems

Agenda

Parallel application case studies


Steps in writing a parallel program
Example

Ocean current simulation

Regular structure, scientific computing, important for weather forecast


Want to simulate the eddy current along the walls of ocean basin over a period of time
Discretize the 3-D basin into 2-D horizontal grids
Discretize each 2-D grid into points
One time step involves solving the equation of motion for each grid point
Enough concurrency within and across grids
After each time step synchronize the processors

Galaxy simulation

Simulate the interaction of many stars evolving over time


Want to compute force between every pair of stars for each time step
Essentially O(n^2) computations (massive parallelism)
Hierarchical methods take advantage of square law
If a group of stars is far enough it is possible to approximate the group entirely by a
single star at the center of mass
Essentially four subparts in each step: divide the galaxy into zones until further division
does not improve accuracy, compute center of mass for each zone, compute force,
update star position based on force
Lot of concurrency across stars


Module 7: "Parallel Programming"


Lecture 12: "Steps in Writing a Parallel Program"

Ray tracing

Want to render a scene using ray tracing


Generate rays through pixels in the image plane
The rays bounce from objects following reflection/refraction laws
New rays get generated: tree of rays from a root ray
Need to correctly simulate paths of all rays
The outcome is color and opacity of the objects in the scene: thus you render a scene
Concurrency across ray trees and subtrees

Writing a parallel program

Start from a sequential description


Identify work that can be done in parallel
Partition work and/or data among threads or processes
Decomposition and assignment
Add necessary communication and synchronization
Orchestration
Map threads to processors (Mapping)
How good is the parallel program?
Measure speedup = sequential execution time/parallel execution time = number of
processors ideally

Some definitions

Task
Arbitrary piece of sequential work
Concurrency is only across tasks
Fine-grained task vs. coarse-grained task: controls granularity of parallelism
(spectrum of grain: one instruction to the whole sequential program)
Process/thread
Logical entity that performs a task
Communication and synchronization happen between threads
Processors
Physical entity on which one or more processes execute

Decomposition of Iterative Equation Solver

Find concurrent tasks and divide the program into tasks


Level or grain of concurrency needs to be decided here
Too many tasks: may lead to too much of overhead communicating and
synchronizing between tasks
Too few tasks: may lead to idle processors
Goal: Just enough tasks to keep the processors busy
Number of tasks may vary dynamically
New tasks may get created as the computation proceeds: new rays in ray tracing
Number of available tasks at any point in time is an upper bound on the achievable
speedup


Module 7: "Parallel Programming"


Lecture 12: "Steps in Writing a Parallel Program"

Static assignment

Given a decomposition it is possible to assign tasks statically


For example, some computation on an array of size N can be decomposed statically by
assigning a range of indices to each process: for k processes P0 operates on indices 0
to (N/k)-1, P1 operates on N/k to (2N/k)-1,…, Pk-1 operates on (k-1)N/k to N-1
For regular computations this works great: simple and low-overhead
What if the nature of the computation depends on the index?
For certain index ranges you do some heavy-weight computation while for others you
do something simple
Is there a problem?

Dynamic assignment

Static assignment may lead to load imbalance depending on how irregular the application is
Dynamic decomposition/assignment solves this issue by allowing a process to dynamically
choose any available task whenever it is done with its previous task
Normally in this case you decompose the program in such a way that the number of
available tasks is larger than the number of processes
Same example: divide the array into portions each with 10 indices; so you have N/10
tasks
An idle process grabs the next available task
Provides better load balance since longer tasks can execute concurrently with the
smaller ones
Dynamic assignment comes with its own overhead
Now you need to maintain a shared count of the number of available tasks
The update of this variable must be protected by a lock
Need to be careful so that this lock contention does not outweigh the benefits of
dynamic decomposition
More complicated applications where a task may not just operate on an index range, but
could manipulate a subtree or a complex data structure
Normally a dynamic task queue is maintained where each task is probably a pointer to
the data
The task queue gets populated as new tasks are discovered
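
A minimal sketch of dynamic assignment from a lock-protected shared counter, written here with POSIX threads; the pool structure and the do_task callback are hypothetical names, not from the lecture:

    #include <pthread.h>

    /* Hypothetical shared task pool: a counter of handed-out tasks plus a lock. */
    struct task_pool {
        pthread_mutex_t lock;
        int next_task;               /* next unclaimed task index */
        int num_tasks;               /* total number of tasks */
    };

    /* Each process repeatedly grabs the next available task until none remain. */
    void worker(struct task_pool *pool, void (*do_task)(int)) {
        for (;;) {
            pthread_mutex_lock(&pool->lock);
            int t = pool->next_task++;   /* the shared count must be updated under the lock */
            pthread_mutex_unlock(&pool->lock);
            if (t >= pool->num_tasks)
                break;
            do_task(t);                  /* e.g., operate on indices 10*t .. 10*t+9 */
        }
    }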

Decomposition types

Decomposition by data
The most commonly found decomposition technique
The data set is partitioned into several subsets and each subset is assigned to a
process
The type of computation may or may not be identical on each subset
Very easy to program and manage
Computational decomposition
Not so popular: tricky to program and manage
All processes operate on the same data, but probably carry out different kinds of
computation
More common in systolic arrays, pipelined graphics processor units (GPUs) etc.


Orchestration

Involves structuring communication and synchronization among processes, organizing data


structures to improve locality, and scheduling tasks
This step normally depends on the programming model and the underlying architecture
Goal is to
Reduce communication and synchronization costs
Maximize locality of data reference
Schedule tasks to maximize concurrency: do not schedule dependent tasks in parallel
Reduce overhead of parallelization and concurrency management (e.g., management of
the task queue, overhead of initiating a task etc.)


Module 7: "Parallel Programming"


Lecture 12: "Steps in Writing a Parallel Program"

Mapping

At this point you have a parallel program


Just need to decide which and how many processes go to each processor of the parallel
machine
Could be specified by the program
Pin particular processes to a particular processor for the whole life of the program; the
processes cannot migrate to other processors
Could be controlled entirely by the OS
Schedule processes on idle processors
Various scheduling algorithms are possible e.g., round robin: process#k goes to
processor#k
NUMA-aware OS normally takes into account multiprocessor-specific metrics in scheduling
How many processes per processor? Most common is one-to-one

An example

Iterative equation solver


Main kernel in Ocean simulation
Update each 2-D grid point via Gauss-Seidel iterations
A[i,j] = 0.2(A[i,j]+A[i,j+1]+A[i,j-1]+A[i+1,j]+A[i-1,j])
Pad the n by n grid to (n+2) by (n+2) to avoid corner problems
Update only interior n by n grid
One iteration consists of updating all n^2 points in-place and accumulating the difference
from the previous value at each point
If the difference is less than a threshold, the solver is said to have converged to a stable
grid equilibrium

Sequential program

int n;
float **A, diff;

begin main()
    read (n);        /* size of grid */
    Allocate (A);
    Initialize (A);
    Solve (A);
end main

begin Solve (A)
    int i, j, done = 0;
    float temp;
    while (!done)
        diff = 0.0;
        for i = 0 to n-1
            for j = 0 to n-1
                temp = A[i,j];
                A[i,j] = 0.2(A[i,j]+A[i,j+1]+A[i,j-1]+A[i-1,j]+A[i+1,j]);
                diff += fabs (A[i,j] - temp);
            endfor
        endfor
        if (diff/(n*n) < TOL) then done = 1;
    endwhile
end Solve


Module 7: "Parallel Programming"


Lecture 13: "Parallelizing a Sequential Program"

Parallel Programming

Decomposition of Iterative Equation Solver

Assignment

Shared memory version

Mutual exclusion

LOCK optimization

More synchronization

Message passing

Major changes

Message passing

Message Passing Grid Solver

MPI-like environment

[From Chapter 2 of Culler, Singh, Gupta]


Module 7: "Parallel Programming"


Lecture 13: "Parallelizing a Sequential Program"

Decomposition of Iterative Equation Solver

Look for concurrency in loop iterations


In this case iterations are really dependent
Iteration (i, j) depends on iterations (i, j-1) and (i-1, j)

Each anti-diagonal can be computed in parallel


Must synchronize after each anti-diagonal (or pt-to-pt)
Alternative: red-black ordering (different update pattern)
Can update all red points first, synchronize globally with a barrier and then update all black points
May converge faster or slower compared to sequential program
Converged equilibrium may also be different if there are multiple solutions
Ocean simulation uses this decomposition
We will ignore the loop-carried dependence and go ahead with a straight-forward loop decomposition
Allow updates to all points in parallel
This is yet another different update order and may affect convergence
Update to a point may or may not see the new updates to the nearest neighbors (this parallel
algorithm is non-deterministic)

while (!done)
    diff = 0.0;
    for_all i = 0 to n-1
        for_all j = 0 to n-1
            temp = A[i, j];
            A[i, j] = 0.2(A[i, j]+A[i, j+1]+A[i, j-1]+A[i-1, j]+A[i+1, j]);
            diff += fabs (A[i, j] - temp);
        end for_all
    end for_all
    if (diff/(n*n) < TOL) then done = 1;
end while

Offers concurrency across elements: degree of concurrency is n^2


Make the j loop sequential to have row-wise decomposition: degree n concurrency


Module 7: "Parallel Programming"


Lecture 13: "Parallelizing a Sequential Program"

Assignment

Possible static assignment: block row decomposition


Process 0 gets rows 0 to (n/p)-1, process 1 gets rows n/p to (2n/p)-1 etc.
Another static assignment: cyclic row decomposition
Process 0 gets rows 0, p, 2p,…; process 1 gets rows 1, p+1, 2p+1,….
Dynamic assignment
Grab next available row, work on that, grab a new row,…
Static block row assignment minimizes nearest neighbor communication by assigning
contiguous rows to the same process

Shared memory version

/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
    LOCKDEC (diff_lock);
    BARDEC (barrier);
    float **A, diff;
} *gm;

int main (int argc, char **argv)
{
    int i;
    MAIN_INITENV;
    gm = (struct gm_t*) G_MALLOC (sizeof (struct gm_t));
    LOCKINIT (gm->diff_lock);
    BARINIT (gm->barrier);
    n = atoi (argv[1]);
    P = atoi (argv[2]);
    gm->A = (float**) G_MALLOC ((n+2)*sizeof (float*));
    for (i = 0; i < n+2; i++) {
        gm->A[i] = (float*) G_MALLOC ((n+2)*sizeof (float));
    }
    Initialize (gm->A);
    for (i = 1; i < P; i++) { /* starts at 1 */
        CREATE (Solve);
    }
    Solve ();
    WAIT_FOR_END (P-1);
    MAIN_END;
}

void Solve (void)


{
int i, j, pid, done = 0;


float temp, local_diff;


GET_PID (pid);
while (!done) {
local_diff = 0.0;
if (!pid) gm->diff = 0.0;
BARRIER (gm->barrier, P);/*why?*/
for (i = pid*(n/P) + 1; i <= (pid+1)*(n/P); i++) { /* interior rows are 1..n; each process gets n/P of them */
for (j = 1; j <= n; j++) {
temp = gm->A[i][j];
gm->A[i][j] = 0.2*(gm->A[i][j] + gm->A[i][j-1] + gm->A[i][j+1] + gm->A[i+1][j] + gm->A[i-1][j]);
local_diff += fabs (gm->A[i][j] - temp);
} /* end for */
} /* end for */
LOCK (gm->diff_lock);
gm->diff += local_diff;
UNLOCK (gm->diff_lock);
BARRIER (gm->barrier, P);
if (gm->diff/(n*n) < TOL) done = 1;
BARRIER (gm->barrier, P); /* why? */
} /* end while */
}


Module 7: "Parallel Programming"


Lecture 13: "Parallelizing a Sequential Program"

Mutual exclusion

Use LOCK/UNLOCK around critical sections


Updates to shared variable diff must be sequential
Heavily contended locks may degrade performance
Try to minimize the use of critical sections: they are sequential anyway and will limit
speedup
This is the reason for using a local_diff instead of accessing gm->diff every time
Also, minimize the size of critical section because the longer you hold the lock, longer
will be the waiting time for other processors at lock acquire

LOCK optimization

Suppose each processor updates a shared variable holding a global cost value, only if its
local cost is less than the global cost: found frequently in minimization problems

LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);
/* May lead to heavy lock contention if everyone tries to update at the same time */

if (my_cost < gm->cost) {


LOCK (gm->cost_lock);
if (my_cost < gm->cost)
{ /* make sure*/
gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);
} /* this works because gm->cost is monotonically decreasing */

More synchronization

Global synchronization
Through barriers
Often used to separate computation phases
Point-to-point synchronization
A process directly notifies another about a certain event on which the latter was
waiting
Producer-consumer communication pattern
Semaphores are used for concurrent programming on uniprocessor through P and V
functions
Normally implemented through flags on shared memory multiprocessors (busy wait or
spin)

P0 : A = 1; flag = 1;
P1 : while (!flag); use (A);
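A minimal C sketch of this flag-based producer-consumer synchronization (an illustration only; flag and A are assumed to be in shared memory and declared volatile so the compiler keeps re-reading them, and on a real machine the memory consistency model may additionally require fences):

#include <stdio.h>

volatile int A = 0;
volatile int flag = 0;

/* P0: produce the data, then raise the flag. */
void producer(void)
{
    A = 1;
    flag = 1;
}

/* P1: spin (busy wait) on the flag, then consume the data. */
void consumer(void)
{
    while (!flag)
        ;                       /* spin */
    printf("consumer sees A = %d\n", A);
}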


Module 7: "Parallel Programming"


Lecture 13: "Parallelizing a Sequential Program"

Message passing

What is different from shared memory?


No shared variable: expose communication through send/receive
No lock or barrier primitive
Must implement synchronization through send/receive
Grid solver example
P0 allocates and initializes matrix A in its local memory
Then it sends the block rows, n, P to each processor i.e. P1 waits to receive rows n/P
to 2n/P-1 etc. (this is one-time)
Within the while loop the first thing that every processor does is to send its first and
last rows to the upper and the lower processors (corner cases need to be handled)
Then each processor waits to receive the neighboring two rows from the upper and the
lower processors
At the end of the loop each processor sends its local_diff to P0 and P0 sends back the done
flag

Major changes


Module 7: "Parallel Programming"


Lecture 13: "Parallelizing a Sequential Program"

Message passing

This algorithm is deterministic


May converge to a different solution compared to the shared memory version if there are
multiple solutions: why?
There is a fixed specific point in the program (at the beginning of each iteration) when
the neighboring rows are communicated
This is not true for shared memory

Message Passing Grid Solver

MPI-like environment

MPI stands for Message Passing Interface


A C library that provides a set of message passing primitives (e.g., send, receive,
broadcast etc.) to the user
PVM (Parallel Virtual Machine) is another well-known platform for message passing
programming
Background in MPI is not necessary for understanding this lecture
Only need to know
When you start an MPI program every thread runs the same main function
We will assume that we pin one thread to one processor just as we did in shared
memory
Instead of using the exact MPI syntax we will use some macros that call the MPI functions

MAIN_ENV;
/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97
int main(int argc, char **argv)
{
int pid, P, done, i, j, N;
float tempdiff, local_diff, temp, **A;
MAIN_INITENV;
GET_PID(pid);
GET_NUMPROCS(P);
N = atoi(argv[1]);
tempdiff = 0.0;
done = 0;
A = (float **) malloc ((N/P+2) * sizeof(float *));
for (i=0; i < N/P+2; i++) {
A[i] = (float *) malloc (sizeof(float) * (N+2));
}
initialize(A);
while (!done) {
local_diff = 0.0;
/* MPI_CHAR means raw byte format */


if (pid) { /* send my first row up */


SEND(&A[1][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
}
if (pid != P-1) { /* recv last row */
RECV(&A[N/P+1][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
}
if (pid != P-1) { /* send last row down */
SEND(&A[N/P][1], N*sizeof(float), MPI_CHAR, pid+1, ROW);
}
if (pid) { /* recv first row from above */
RECV(&A[0][1], N*sizeof(float), MPI_CHAR, pid-1, ROW);
}
for (i=1; i <= N/P; i++) for (j=1; j <= N; j++) {
temp = A[i][j];
A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
local_diff += fabs(A[i][j] - temp);
}
if (pid) { /* tell P0 my diff */
SEND(&local_diff, sizeof(float), MPI_CHAR, 0, DIFF);
RECV(&done, sizeof(int), MPI_CHAR, 0, DONE);
}
else { /* recv from all and add up */
for (i=1; i < P; i++) {
RECV(&tempdiff, sizeof(float), MPI_CHAR, MPI_ANY_SOURCE, DIFF);
local_diff += tempdiff;
}
if (local_diff/(N*N) < TOL) done=1;
for (i=1; i < P; i++) {
/* tell all if done */
SEND(&done, sizeof(int), MPI_CHAR, i, DONE);
}
}
} /* end while */
MAIN_END;
} /* end main */

Note the matching tags in SEND and RECV


Macros used in this program
GET_PID
GET_NUMPROCS
SEND
RECV
These will get expanded into specific MPI library calls
Syntax of SEND/RECV
Starting address, how many elements, type of each element (we have used byte only),
source/dest, message tag
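As an illustration only, the macros could expand to standard MPI calls roughly as follows; this mapping is an assumption for readability, not the actual macro package used in the course:

#include <mpi.h>

/* Hypothetical expansions of the macros used above. MPI_CHAR is used because
   the program ships raw bytes and computes the byte count itself. */
#define MAIN_ENV
#define MAIN_INITENV      MPI_Init(&argc, &argv)     /* used inside main() */
#define MAIN_END          MPI_Finalize()
#define GET_PID(pid)      MPI_Comm_rank(MPI_COMM_WORLD, &(pid))
#define GET_NUMPROCS(P)   MPI_Comm_size(MPI_COMM_WORLD, &(P))
#define SEND(buf, nbytes, type, dest, tag) \
        MPI_Send((buf), (nbytes), (type), (dest), (tag), MPI_COMM_WORLD)
#define RECV(buf, nbytes, type, src, tag) \
        MPI_Recv((buf), (nbytes), (type), (src), (tag), MPI_COMM_WORLD, MPI_STATUS_IGNORE)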


Module 8: "Performance Issues"


Lecture 14: "Load Balancing and Domain Decomposition"

Performance Issues

Agenda

Partitioning for perf.

Load balancing

Dynamic task queues

Task stealing

Architect’s job

Partitioning and communication

Domain decomposition

Comm-to-comp ratio

Extra work

Data access and communication

Data access

[From Chapter 3 of Culler, Singh, Gupta]


Module 8: "Performance Issues"


Lecture 14: "Load Balancing and Domain Decomposition"

Agenda

Partitioning for performance


Data access and communication
Summary
Goal is to understand simple trade-offs involved in writing a parallel program keeping an eye
on parallel performance
Getting good performance out of a multiprocessor is difficult
Programmers need to be careful
A little carelessness may lead to extremely poor performance

Partitioning for perf.

Partitioning plays an important role in the parallel performance


This is where you essentially determine the tasks
A good partitioning should achieve
Load balance
Minimal communication
Low overhead to determine and manage task assignment (sometimes called extra
work)
A well-balanced parallel program automatically has low barrier or point-to-point
synchronization time
Ideally I want all the threads to arrive at a barrier at the same time

Load balancing

Achievable speedup is bounded above by


Sequential exec. time / Max. time for any processor
Thus speedup is maximized when the maximum time and minimum time across all
processors are close (want to minimize the variance of parallel execution time)
This directly gets translated to load balancing
What leads to a high variance?
Ultimately all processors finish at the same time
But some do useful work all over this period while others may spend a significant time
at synchronization points
This may arise from a bad partitioning
There may be other architectural reasons for load imbalance beyond the scope of a
programmer e.g., network congestion, unforeseen cache conflicts etc. (slows down a
few threads)
Effect of decomposition/assignment on load balancing
Static partitioning is good when the nature of computation is predictable and regular
Dynamic partitioning normally provides better load balance, but has more runtime
overhead for task management; also it may increase communication
Fine grain partitioning (extreme is one instruction per thread) leads to more overhead,
but better load balance
Coarse grain partitioning (e.g., large tasks) may lead to load imbalance if the tasks are
not well-balanced


Module 8: "Performance Issues"


Lecture 14: "Load Balancing and Domain Decomposition"

Dynamic task queues

Introduced in the last lecture


Normally implemented as part of the parallel program
Two possible designs
Centralized task queue: a single queue of tasks; may lead to heavy contention because
insertion and deletion to/from the queue must be critical sections
Distributed task queues: one queue per processor
Issue with distributed task queues
When a queue of a particular processor is empty what does it do? Task stealing

Task stealing

A processor may choose to steal tasks from another processor’s queue if the former’s queue
is empty
How many tasks to steal? Whom to steal from?
The biggest question: how to detect termination? Really a distributed consensus!
Task stealing, in general, may increase overhead and communication, but a smart
design may lead to excellent load balance (normally hard to design efficiently)
This is a form of a more general technique called Receiver Initiated Diffusion (RID)
where the receiver of the task initiates the task transfer
In Sender Initiated Diffusion (SID) a processor may choose to insert into another
processor’s queue if the former’s task queue is full above a threshold
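A minimal sketch of distributed task queues with stealing (an illustration, not a complete runtime: queues and locks are assumed to be initialized elsewhere, and termination detection is left out):

#include <pthread.h>

#define MAX_PROCS 64
#define MAX_TASKS 1024

/* One task queue per processor; a task here is just an integer id. */
struct task_queue {
    int tasks[MAX_TASKS];
    int head, tail;               /* dequeue at head, enqueue/steal at tail */
    pthread_mutex_t lock;         /* stands in for the course's LOCK macros */
};

struct task_queue queue[MAX_PROCS];

/* Try the local queue first; if it is empty, scan the other queues and
   steal one task from the tail of a victim queue. Returns 1 on success. */
int get_task(int pid, int P, int *task)
{
    pthread_mutex_lock(&queue[pid].lock);
    if (queue[pid].head < queue[pid].tail) {
        *task = queue[pid].tasks[queue[pid].head++];
        pthread_mutex_unlock(&queue[pid].lock);
        return 1;
    }
    pthread_mutex_unlock(&queue[pid].lock);

    for (int v = (pid + 1) % P; v != pid; v = (v + 1) % P) {
        pthread_mutex_lock(&queue[v].lock);
        if (queue[v].head < queue[v].tail) {
            *task = queue[v].tasks[--queue[v].tail];   /* steal from the other end */
            pthread_mutex_unlock(&queue[v].lock);
            return 1;
        }
        pthread_mutex_unlock(&queue[v].lock);
    }
    return 0;   /* no work found; a real design also needs termination detection */
}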

Architect’s job

Normally load balancing is a responsibility of the programmer


However, an architecture may provide efficient primitives to implement task queues and
task stealing
For example, the task queue may be allocated in a special shared memory segment,
accesses to which may be optimized by special hardware in the memory controller
But this may expose some of the architectural features to the programmer
There are multiprocessors that provide efficient implementations for certain
synchronization primitives; this may improve load balance
Sophisticated hardware tricks are possible: dynamic load monitoring and favoring slow
threads dynamically

Partitioning and communication

Need to reduce inherent communication


This is the part of communication determined by assignment of tasks
There may be other communication traffic also (more later)
Goal is to assign tasks such that accessed data are mostly local to a process
Ideally I do not want any communication
But in life sometimes you need to talk to people to get some work done!


Module 8: "Performance Issues"


Lecture 14: "Load Balancing and Domain Decomposition"

Domain decomposition

Normally applications show a local bias on data usage


Communication is short-range e.g. nearest neighbor
Even if it is long-range it falls off with distance
View the dataset of an application as the domain of the problem e.g., the 2-D grid in
equation solver
If you consider a point in this domain, in most of the applications it turns out that this
point depends on points that are close by
Partitioning can exploit this property by assigning contiguous pieces of data to each
process
Exact shape of decomposed domain depends on the application and load balancing
requirements

Comm-to-comp ratio

Surely, there could be many different domain decompositions for a particular problem
For grid solver we may have a square block decomposition, block row decomposition
or cyclic row decomposition
How to determine which one is good? Communication-to-computation ratio

Assume P processors and an NxN grid for the grid solver

For square block decomposition
Size of each block: N/sqrt(P) by N/sqrt(P)
Communication (perimeter): 4N/sqrt(P)
Computation (area): N^2/P
Comm-to-comp ratio: 4*sqrt(P)/N

[Figure: square block decomposition for P=16]

For block row decomposition
Each strip has N/P rows
Communication (boundary rows): 2N
Computation (area): N^2/P (same as square block)
Comm-to-comp ratio: 2P/N
For cyclic row decomposition
Each processor gets N/P isolated rows
Communication: 2N^2/P
Computation: N^2/P
Comm-to-comp ratio: 2
Normally N is much larger than P
Asymptotically, square block yields the lowest comm-to-comp ratio
The idea is to measure the volume of inherent communication per unit of computation
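As a concrete data point (illustrative numbers, not from the lecture), take N = 1024 and P = 16:

square block : 4*sqrt(P)/N = 16/1024 = 1/64
block row    : 2P/N = 32/1024 = 1/32
cyclic row   : 2

So the square block decomposition asks for the least communication per unit of computation, and cyclic rows the most.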


In most cases it is beneficial to pick the decomposition with the lowest comm-to-comp
ratio
But depends on the application structure i.e. picking the lowest comm-to-comp may
have other problems
Normally this ratio gives you a rough estimate about average communication
bandwidth requirement of the application i.e. how frequent is communication
But it does not tell you the nature of communication i.e. bursty or uniform
For grid solver comm. happens only at the start of each iteration; it is not uniformly
distributed over computation
Thus the worst case BW requirement may exceed the average comm-to-comp ratio


Module 8: "Performance Issues"


Lecture 14: "Load Balancing and Domain Decomposition"

Extra work

Extra work in a parallel version of a sequential program may result from


Decomposition
Assignment techniques
Management of the task pool etc.
Speedup is bounded above by
Sequential work / Max (Useful work + Synchronization + Comm. cost + Extra work)
where the Max is taken over all processors
But this is still incomplete
We have only considered communication cost from the viewpoint of the algorithm and
ignored the architecture completely

Data access and communication

The memory hierarchy (caches and main memory) plays a significant role in determining
communication cost
May easily dominate the inherent communication of the algorithm
For uniprocessor, the execution time of a program is given by useful work time + data access
time
Useful work time is normally called the busy time or busy cycles
Data access time can be reduced either by architectural techniques (e.g., large caches)
or by cache-aware algorithm design that exploits spatial and temporal locality

Data access

In multiprocessors
Every processor wants to see the memory interface as its own local cache and the
main memory
In reality it is much more complicated
If the system has a centralized memory (e.g., SMPs), there are still caches of other
processors; if the memory is distributed then some part of it is local and some is
remote
For shared memory, data movement from local or remote memory to cache is
transparent while for message passing it is explicit
View a multiprocessor as an extended memory hierarchy where the extension includes
caches of other processors, remote memory modules and the network topology


Module 8: "Performance Issues"


Lecture 15: "Locality and Communication Optimizations"

Artifactual comm.

Capacity problem

Temporal locality

Spatial locality

2D to 4D conversion

Transfer granularity

Worse: false sharing

Communication cost

Contention

Hot-spots

Overlap

Summary

[From Chapter 3 of Culler, Singh, Gupta]


Module 8: "Performance Issues"


Lecture 15: "Locality and Communication Optimizations"

Artifactual comm.

Communication caused by artifacts of extended memory hierarchy


Data accesses not satisfied in the cache or local memory cause communication
Inherent communication is caused by data transfers determined by the program
Artifactual communication is caused by poor allocation of data across distributed
memories, unnecessary data in a transfer, unnecessary transfers due to system-
dependent transfer granularity, redundant communication of data, finite replication
capacity (in cache or memory)
Inherent communication assumes infinite capacity and perfect knowledge of what should be
transferred

Capacity problem

Most probable reason for artifactual communication


Due to finite capacity of cache, local memory or remote memory
May view a multiprocessor as a three-level memory hierarchy for this purpose: local
cache, local memory, remote memory
Communication due to cold or compulsory misses and inherent communication are
independent of capacity
Capacity and conflict misses generate communication resulting from finite capacity
Generated traffic may be local or remote depending on the allocation of pages
General technique: exploit spatial and temporal locality to use the cache properly

Temporal locality

Maximize reuse of data


Schedule tasks that access same data in close succession
Many linear algebra kernels use blocking of matrices to improve temporal (and spatial)
locality
Example: Transpose phase in Fast Fourier Transform (FFT); to improve locality, the
algorithm carries out blocked transpose i.e. transposes a block of data at a time

[Figure: block transpose]
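A minimal sketch of a blocked transpose (illustration only; assumes an n x n matrix in row-major order with n divisible by the block size B):

/* Transpose src into dst one BxB block at a time so that both the source
   block and the destination block stay resident in the cache. */
void blocked_transpose(int n, int B, const double *src, double *dst)
{
    int ib, jb, i, j;
    for (ib = 0; ib < n; ib += B)
        for (jb = 0; jb < n; jb += B)
            for (i = ib; i < ib + B; i++)
                for (j = jb; j < jb + B; j++)
                    dst[j * n + i] = src[i * n + j];
}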


Module 8: "Performance Issues"


Lecture 15: "Locality and Communication Optimizations"

Spatial locality

Consider a square block decomposition of grid solver and a C-like row major layout i.e. A[i][j]
and A[i][j+1] have contiguous memory locations
The same page is local to a processor while remote to others; the same applies to straddling cache lines.
Ideally, I want to have all pages within a partition local to a single processor.
The standard trick is to convert the 2D array to 4D.

2D to 4D conversion

Essentially you need to change the way memory is allocated


The matrix A needs to be allocated in such a way that the elements falling within a
partition are contiguous
The first two dimensions of the new 4D matrix are block row and column indices i.e. for
the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16
processors)
The next two dimensions hold the data elements within that partition
Thus the 4D array may be declared as float B[sqrt(P)][sqrt(P)][N/sqrt(P)][N/sqrt(P)]
The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the
partition of P14
Now all elements within a partition have contiguous addresses
Clearly, naming requires some support from hw and OS
Need to make sure that the accessed virtual address gets translated to the correct
physical address
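A minimal sketch of this 4D layout with a flat allocation (illustration only; assumes P is a perfect square, rp = sqrt(P), and n divisible by rp):

#include <stdlib.h>

/* Allocate the grid so that each processor's block is contiguous in memory.
   Conceptually this is float B[rp][rp][nb][nb] with nb = n/rp. */
float *alloc_4d(int n, int rp)
{
    int nb = n / rp;
    return (float *) malloc((size_t) rp * rp * nb * nb * sizeof(float));
}

/* Translate a global index (i, j), 0 <= i, j < n, into the 4D layout. */
float *elem_4d(float *B, int n, int rp, int i, int j)
{
    int nb = n / rp;
    int bi = i / nb, bj = j / nb;   /* which partition (block row, block column) */
    int ii = i % nb, jj = j % nb;   /* offset inside the partition */
    return &B[(((size_t) bi * rp + bj) * nb + ii) * nb + jj];
}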

Transfer granularity

How much data do you transfer in one communication?


For message passing it is explicit in the program
For shared memory this is really under the control of the cache coherence protocol:
there is a fixed size for which transactions are defined (normally the block size of the
outermost level of cache hierarchy)
In shared memory you have to be careful
Since the minimum transfer size is a cache line you may end up transferring extra data
e.g., in grid solver the elements of the left and right neighbors for a square block
decomposition (you need only one element, but must transfer the whole cache line): no
good solution


Module 8: "Performance Issues"


Lecture 15: "Locality and Communication Optimizations"

Worse: false sharing

If the algorithm is designed so poorly that


Two processors write to two different words within a cache line at the same time
The cache line keeps on moving between two processors
The processors are not really accessing or updating the same element, but whatever
they are updating happen to fall within a cache line: not a true sharing, but false
sharing
For shared memory programs false sharing can easily degrade performance by a lot
Easy to avoid: just pad up to the end of the cache line before starting the allocation of
the data for the next processor (wastes memory, but improves performance)
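A minimal sketch of the padding trick (illustration only; assumes a 64-byte cache line and a hypothetical MAX_THREADS):

#define CACHE_LINE  64
#define MAX_THREADS 16

/* Without padding, counters of different threads can share a cache line and
   every increment ping-pongs the line between processors (false sharing).
   Padding each counter to a full line keeps updates local. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter counter[MAX_THREADS];   /* one cache line per thread */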

Communication cost

Given the total volume of communication (in bytes, say) the goal is to reduce the end-to-end
latency
Simple model:

T = f*(o + L + (n/m)/B + tc - overlap), where


f = frequency of messages
o = overhead per message (at receiver and sender)
L = network delay per message (really the router delay)
n = total volume of communication in bytes
m = total number of messages
B = node-to-network bandwidth
tc = contention-induced average latency per message
overlap = how much communication time is overlapped with useful computation

The goal is to reduce T


Reduce o by communicating less: restructure algorithm to reduce m i.e. communicate
larger messages (easy for message passing, but need extra support in memory
controller for shared memory e.g., block transfer)
Reduce L = number of average hops*time per hop
Number of hops can be reduced by mapping the algorithm on the topology properly
e.g., nearest neighbor communication is well-suited for a ring (just left/right) or a mesh
(grid solver example); however, L is not very important because today routers are
really fast (routing delay is ~10 ns); o and tc are the dominant parts in T
Reduce tc by not creating hot-spots in the system: restructure algorithm to make
sure a particular node does not get flooded with messages; distribute uniformly

Contention

It is very easy to ignore contention effects when designing algorithms


Can severely degrade performance by creating hot-spots
Location hot-spot:
Consider accumulating a global variable; the accumulation takes place on a single
node i.e. all nodes access the variable allocated on that particular node whenever it
tries to increment it


Module 8: "Performance Issues"


Lecture 15: "Locality and Communication Optimizations"

Hot-spots

Avoid location hot-spot by either staggering accesses to the same location or by designing the
algorithm to exploit a tree structured communication
Module hot-spot
Normally happens when a particular node saturates handling too many messages
(need not be to same memory location) within a short amount of time
Normal solution again is to design the algorithm in such a way that these messages
are staggered over time
Rule of thumb: design communication pattern such that it is not bursty; want to distribute it
uniformly over time
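A minimal MPI sketch of such a tree-structured accumulation (illustration only; assumes P is a power of two and a hypothetical message tag SUM_TAG):

#include <mpi.h>

#define SUM_TAG 96   /* hypothetical message tag for the reduction */

/* Tree-structured sum: after log2(P) rounds process 0 holds the global sum,
   and no node receives more than one message per round (no location hot-spot). */
float tree_reduce(int pid, int P, float local_sum)
{
    float partial;
    for (int stride = 1; stride < P; stride *= 2) {
        if (pid % (2 * stride) == 0) {
            MPI_Recv(&partial, 1, MPI_FLOAT, pid + stride, SUM_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local_sum += partial;
        } else {
            MPI_Send(&local_sum, 1, MPI_FLOAT, pid - stride, SUM_TAG,
                     MPI_COMM_WORLD);
            break;   /* this process is done after sending its partial sum */
        }
    }
    return local_sum;   /* meaningful only on process 0 */
}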

Overlap

Increase overlap between communication and computation


Not much to do at algorithm level unless the programming model and/or OS provide
some primitives to carry out prefetching, block data transfer, non-blocking receive etc.
Normally, these techniques increase bandwidth demand because you end up
communicating the same amount of data, but in a shorter amount of time (execution
time hopefully goes down if you can exploit overlap)

Summary

Comparison of sequential and parallel execution


Sequential execution time = busy useful time + local data access time
Parallel execution time = busy useful time + busy overhead (extra work) + local data
access time + remote data access time + synchronization time
Busy useful time in parallel execution is ideally equal to sequential busy useful time /
number of processors
Local data access time in parallel execution is also less compared to that in sequential
execution because ideally each processor accesses less than 1/P-th of the local data
(some data now become remote)
Parallel programs introduce three overhead terms: busy overhead (extra work), remote data
access time, and synchronization time
Goal of a good parallel program is to minimize these three terms
Goal of a good parallel computer architecture is to provide sufficient
support to let programmers optimize these three terms (and this is the
focus of the rest of the course)


Exercise : 1

These problems should be tried after module 08 is completed.

1. [10 points] Suppose you are given a program that does a fixed amount of
work, and some fraction s of that work must be done sequentially. The
remaining portion of the work is perfectly parallelizable on P processors. Derive
a formula for execution time on P processors and establish an upper bound on
the achievable speedup.

2. [40 points] Suppose you want to transfer n bytes from a source node S to a
destination node D and there are H links between S and D. Therefore, notice
that there are H+1 routers in the path (including the ones in S and D). Suppose
W is the node-to-network bandwidth at each router. So at S you require n/W
time to copy the message into the router buffer. Similarly, to copy the message
from the buffer of router in S to the buffer of the next router on the path, you
require another n/W time. Assuming a store-and-forward protocol total time
spent doing these copy operations would be (H+2)n/W and the data will end up
in some memory buffer in D. On top of this, at each router we spend R amount
of time to figure out the exit port. So the total time taken to transfer n bytes from
S to D in a store-and-forward protocol is (H+2)n/W+(H+1)R. On the other hand,
if you assume a cut-through protocol the critical path would just be n/W+(H+1)R.
Here we assume the best possible scenario where the header routing delay at
each node is exposed and only the startup n/W delay at S is exposed. The rest
is pipelined. Now suppose that you are asked to compare the performance of
these two routing protocols on an 8x8 grid. Compute the maximum, minimum,
and average latency to transfer an n byte message in this topology for both the
protocols. Assume the following values: W=3.2 GB/s and R=10 ns. Compute for
n=64 and 256. Note that for each protocol you will have three answers
(maximum, minimum, average) for each value of n. Here GB means 10^9 bytes
and not 2^30 bytes.

3. [20 points] Consider a simple computation on an nxn double matrix (each
element is 8 bytes) where each element A[i][j] is modified as follows. A[i][j] =
A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4. Suppose you assign one matrix
element to one processor (i.e. you have n^2 processors). Compute the total
amount of data communication between processors.

4. [30 points] Consider a machine running at 10^8 instructions per second on
some workload with the following mix: 50% ALU instructions, 20% load
instructions, 10% store instructions, and 20% branch instructions. Suppose the
instruction cache miss rate is 1%, the writeback data cache miss rate is 5%, and
the cache line size is 32 bytes. Assume that a store miss requires two cache
line transfers, one to load the newly updated line and one to replace the dirty
line at a later point in time. If the machine provides a 250 MB/s bus, how many
processors can it accommodate at peak bus bandwidth?


Solution of Exercise : 1

1. [10 points] Suppose you are given a program that does a fixed amount of
work, and some fraction s of that work must be done sequentially. The
remaining portion of the work is perfectly parallelizable on P processors. Derive
a formula for execution time on P processors and establish an upper bound on
the achievable speedup.

Solution: Execution time on P processors, T(P) = sT(1) + (1-s)T(1)/P.


Speedup = 1/(s + (1-s)/P). Upper bound is achieved when P approaches infinity.
So maximum speedup = 1/s. As expected, the upper bound on achievable
speedup is inversely proportional to the sequential fraction.

2. [40 points] Suppose you want to transfer n bytes from a source node S to a
destination node D and there are H links between S and D. Therefore, notice
that there are H+1 routers in the path (including the ones in S and D). Suppose
W is the node-to-network bandwidth at each router. So at S you require n/W
time to copy the message into the router buffer. Similarly, to copy the message
from the buffer of router in S to the buffer of the next router on the path, you
require another n/W time. Assuming a store-and-forward protocol total time
spent doing these copy operations would be (H+2)n/W and the data will end up
in some memory buffer in D. On top of this, at each router we spend R amount
of time to figure out the exit port. So the total time taken to transfer n bytes from
S to D in a store-and-forward protocol is (H+2)n/W+(H+1)R. On the other hand,
if you assume a cut-through protocol the critical path would just be n/W+(H+1)R.
Here we assume the best possible scenario where the header routing delay at
each node is exposed and only the startup n/W delay at S is exposed. The rest
is pipelined. Now suppose that you are asked to compare the performance of
these two routing protocols on an 8x8 grid. Compute the maximum, minimum,
and average latency to transfer an n byte message in this topology for both the
protocols. Assume the following values: W=3.2 GB/s and R=10 ns. Compute for
n=64 and 256. Note that for each protocol you will have three answers
(maximum, minimum, average) for each value of n. Here GB means 10^9 bytes
and not 2^30 bytes.

Solution: The basic problem is to compute the maximum, minimum, and
average values of H. The rest is just about substituting the values of the
parameters. The maximum value of H is 14 while the minimum is 1. To compute
the average, you need to consider all possible messages, compute H for them,
and then take the average. Consider S=(x0, y0) and D=(x1, y1). So H = |x0-x1|
+ |y0-y1|. Therefore, average H = (sum over all x0, x1, y0, y1 |x0-x1| + |y0-
y1|)/(64*63), where each of x0, x1, y0, y1 varies from 0 to 7. Clearly, this is
same as (sum over x0, x1 |x0-x1| + sum over y0, y1 |y0-y1|)/63, which in turn is
equal to 2*(sum over x0, x1 |x0-x1|)/63 = 2*(sum over x0=0 to 7, x1=0 to x0
(x0-x1)+ sum over x0=0 to 7, x1=x0+1 to 7 (x1-x0))/63 = 16/3.

3. [20 points] Consider a simple computation on an nxn double matrix (each
element is 8 bytes) where each element A[i][j] is modified as follows. A[i][j] =
A[i][j] - (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])/4. Suppose you assign one matrix
element to one processor (i.e. you have n^2 processors). Compute the total
amount of data communication between processors.

Solution: Each processor requires the four neighbors i.e. 32 bytes. So total
amount of data communicated is 32n^2.

4. [30 points] Consider a machine running at 10^8 instructions per second on
some workload with the following mix: 50% ALU instructions, 20% load
instructions, 10% store instructions, and 20% branch instructions. Suppose the
instruction cache miss rate is 1%, the writeback data cache miss rate is 5%, and
the cache line size is 32 bytes. Assume that a store miss requires two cache
line transfers, one to load the newly updated line and one to replace the dirty
line at a later point in time. If the machine provides a 250 MB/s bus, how many
processors can it accommodate at peak bus bandwidth?

Solution: Let us compute the bandwidth requirement of the processor per
second. Instruction cache misses 10^6 times transferring 32 bytes on each miss.
Out of 20*10^6 loads 10^6 miss in the cache transferring 32 bytes on each
miss. Out of 10^7 stores 5*10^5 miss in the cache transferring 64 bytes on each
miss. Thus, total amount of data transferred per second is 96*10^6 bytes. Thus
at most two processors can be supported on a 250 MB/s bus.


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 16: "Multiprocessor Organizations and Cache Coherence"

Shared Memory Multiprocessors

Shared memory multiprocessors

Shared cache

Private cache/Dancehall

Distributed shared memory

Shared vs. private in CMPs

Cache coherence

Cache coherence: Example

What went wrong?

Implementations


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 16: "Multiprocessor Organizations and Cache Coherence"

Shared memory multiprocessors

What do they look like?


We will assume that each processor has a hierarchy of caches (possibly shared)
We will not discuss shared memory in a time-shared single-thread computer
A degenerate case of the following
Shared cache (popular in CMPs)
Private cache (popular in CMPs and SMPs)
Dancehall (popular in old computers)
Distributed shared memory (popular in medium to large-scale servers)

Shared cache

Private cache/Dancehall

Distributed shared memory


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 16: "Multiprocessor Organizations and Cache Coherence"

Shared vs. private in CMPs

Shared caches are often very large in the CMPs


They are banked to avoid worst-case wire delay
The banks are usually distributed across the floor of the chip on an interconnect

In shared caches, getting a block from a remote bank takes time proportional to the physical
distance between the requester and the bank
Non-uniform cache architecture (NUCA)
This is same for private caches, if the data resides in a remote cache
Shared cache may have higher average hit latency than the private cache
Hopefully most hits in the latter will be local
Shared caches are most likely to have less misses than private caches
Latter wastes space due to replication

Cache coherence

Nothing unique to multiprocessors


Even uniprocessor computers need to worry about cache coherence
For sequential programs we expect a memory location to return the latest value written
For concurrent programs running on multiple threads or processes on a single
processor we expect the same model to hold because all threads see the same cache
hierarchy (same as shared L1 cache)
For multiprocessors there remains a danger of using a stale value: hardware must
ensure that cached values are coherent across the system and they satisfy
programmers’ intuitive memory model

Cache coherence: Example

Assume a write-through cache


P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and memory value
P0: reads x from its cache and gets the value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is
completely incoherent)
P2: writes x=10, updates its cached value and memory value
Consider the same example with a writeback cache
P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since caches are not
write through)


The state of the line in P1 and P2 is M while the line in P0 is clean


Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from
P0 will not issue a writeback (clean lines do not need writeback)
Suppose P2 evicts the line first, and then P1
Final memory value is 7: we lost the store x=10 from P2


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 16: "Multiprocessor Organizations and Cache Coherence"

What went wrong?

For write through cache


The memory value may be correct if the writes are correctly ordered
But the system allowed a store to proceed when there is already a cached copy
Lesson learned: must invalidate all cached copies before allowing a store to proceed
Writeback cache
Problem is even more complicated: stores are no longer visible to memory immediately
Writeback order is important
Lesson learned: do not allow more than one copy of a cache line in M state

Implementations

Must invalidate all cached copies before allowing a store to proceed


Need to know where the cached copies are
Solution1: Never mind! Just tell everyone that you are going to do a store
Leads to broadcast snoopy protocols
Popular with small-scale bus-based CMPs and SMPs
AMD Opteron implements it on a distributed network (the Hammer protocol)
The biggest reason why quotidian Windows fans would buy small-scale
multiprocessors and multi-core today
Solution2: Keep track of the sharers and invalidate them when needed
Where and how is this information stored?
Leads to directory-based scalable protocols
Directory-based protocols
Maintain one directory entry per memory block
Each directory entry contains a sharer bitvector and state bits
Concept of home node in distributed shared memory multiprocessors
Concept of sparse directory for on-chip coherence in CMPs
Do not allow more than one copy of a cache line in M state
Need some form of access control mechanism
Before a processor does a store it must take “permission” from the current “owner” (if
any)
Need to know who the current owner is
Either a processor or main memory
Solution1 and Solution2 apply here also
Latest value must be propagated to the requester
Notion of “latest” is very fuzzy
Once we know the owner, this is easy
Solution1 and Solution2 apply here also
Invariant: if a cache block is not in M state in any processor, memory must provide the block
to the requester
Memory must be updated when a block transitions from M state to S state
Note that a transition from M to I always updates memory in systems with writeback
caches (these are normal writeback operations)
Most of the implementations of a coherence protocol deals with uncommon cases and races


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 17: "Introduction to Cache Coherence Protocols"

Invalidation vs. Update

Sharing patterns

Migratory hand-off

States of a cache line

Stores

MSI protocol

State transition

MSI example

MESI protocol

State transition

MESI example

MOESI protocol

Hybrid inval+update


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 17: "Introduction to Cache Coherence Protocols"

Invalidation vs. Update

Two main classes of protocols:


Dictates what action should be taken on a store
Invalidation-based protocols invalidate sharers when a store miss appears
Update-based protocols update the sharer caches with new value on a store
Advantage of update-based protocols: sharers continue to hit in the cache while in
invalidation-based protocols sharers will miss next time they try to access the line
Advantage of invalidation-based protocols: only store misses go on bus and
subsequent stores to the same line are cache hits
When is update-based protocol good?
What sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates
When is invalidation-protocol good?
Sequence of multiple writes to a cache line
Saves intermediate write transactions
Overhead of initiating small updates
Invalidation-based protocols are much more popular
Some systems support both or maybe some hybrid based on dynamic sharing pattern
of a cache line

Sharing patterns

Producer-consumer (initially flag, done are zero)


T0: while (!exit) {x=y; flag=1; while (done != k); flag=0; done=0;}
T1 to Tk: while (!exit) {while (!flag); use x; done++; while (flag);}
Exit condition not shown
What if T1 to Tk do not have the outer loop?
Migratory (initially flag is zero)
T0: x = f0(x); flag++;
T1 to Tk: while (flag != pid); x = f1(x); flag++;
Migratory hand-off?

Migratory hand-off

Needs a memory writeback on every hand-off


r0, w0, r1, w1, r2, w2, r3, w3, r4, w4, …
How to avoid these unnecessary writebacks?
Saves memory bandwidth
Solution: add an owner state (different from M) in caches
Only owner can write a line back on eviction
Ownership shifts along the migratory chain


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 17: "Introduction to Cache Coherence Protocols"

States of a cache line

Invalid (I), Shared (S), Modified or dirty (M), Clean exclusive (E), Owned (O)
Every processor does not support all five states
E state is equivalent to M in the sense that the line has permission to write, but in E
state the line is not yet modified and the copy in memory is the same as in cache; if
someone else requests the line the memory will provide the line
O state means the line is read-only and possibly shared (like S), but memory is not responsible for
servicing requests to the line; the owner must supply the line (just as in M state)

Stores

Look at stores a little more closely


There are three situations at the time a store issues: the line is not in the cache, the
line is in the cache in S state, the line is in the cache in one of M, E and O states
If the line is in I state, the store generates a read-exclusive request on the bus and
gets the line in M state
If the line is in S or O state, that means the processor only has read permission for
that line; the store generates an upgrade request on the bus and the upgrade
acknowledgment gives it the write permission (this is a data-less transaction)

MSI protocol

Forms the foundation of invalidation-based writeback protocols


Assumes only three supported cache line states: I, S, and M
There may be multiple processors caching a line in S state
There must be exactly one processor caching a line in M state and it is the owner of
the line
If none of the caches have the line, memory must have the most up-to-date copy of
the line
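A minimal sketch of the MSI transitions written as a state machine (illustration only; a real controller must also handle bus arbitration, writebacks, and transient states):

typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Processor-side transitions: what the local controller does on a load/store. */
line_state on_proc_access(line_state st, int is_write)
{
    switch (st) {
    case INVALID:   /* miss: issue BusRd for a load, BusRdX for a store */
        return is_write ? MODIFIED : SHARED;
    case SHARED:    /* a store needs write permission: issue BusUpgr */
        return is_write ? MODIFIED : SHARED;
    case MODIFIED:  /* loads and stores hit silently */
        return MODIFIED;
    }
    return st;
}

/* Snoop-side transitions: what happens when another processor's request is
   observed on the bus. exclusive_req is 1 for BusRdX/BusUpgr, 0 for BusRd. */
line_state on_snoop(line_state st, int exclusive_req)
{
    if (st == MODIFIED)      /* flush the dirty line; memory is also updated */
        return exclusive_req ? INVALID : SHARED;
    if (st == SHARED)
        return exclusive_req ? INVALID : SHARED;
    return INVALID;
}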


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 17: "Introduction to Cache Coherence Protocols"

State transition

MSI example

Take the following example


P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
Assume the state of the cache line containing the address of x is I in all processors

P0 generates BusRd, memory provides line, P0 puts line in S state

P1 generates BusRd, memory provides line, P1 puts line in S state

P1 generates BusUpgr, P0 snoops and invalidates line, memory does not respond, P1
sets state of line to M

P0 generates BusRd, P1 flushes line and goes to S state, P0 puts line in S state,
memory writes back

P2 generates BusRd, memory provides line, P2 puts line in S state

P3 generates BusRdX, P0, P1, P2 snoop and invalidate, memory provides line, P3 puts
line in cache in M state

MESI protocol

The most popular invalidation-based protocol e.g., appears in Intel Xeon MP


Why need E state?
The MSI protocol requires two transactions to go from I to M even if there is no
intervening requests for the line: BusRd followed by BusUpgr
Save one transaction by having memory controller respond to the first BusRd with E
state if there is no other sharer in the system
Needs a dedicated control wire that gets asserted by a sharer (wired OR)
Processor can write to a line in E state silently


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 17: "Introduction to Cache Coherence Protocols"

State transition

MESI example

Take the following example


P0 reads x, P0 writes x, P1 reads x, P1 writes x, …

P0 generates BusRd, memory provides line, P0 puts line in cache in E state

P0 does write silently, goes to M state

P1 generates BusRd, P0 provides line, P1 puts line in cache in S state, P0 transitions


to S state

Rest is identical to MSI

Consider this example: P0 reads x, P1 reads x, …

P0 generates BusRd, memory provides line, P0 puts line in cache in E state

P1 generates BusRd, memory provides line, P1 puts line in cache in S state, P0


transitions to S state (no cache-to-cache sharing)

Rest is same as MSI


Module 9: "Introduction to Shared Memory Multiprocessors"


Lecture 17: "Introduction to Cache Coherence Protocols"

MOESI protocol

State transitions pertaining to O state


I to O: not possible
E to O or S to O: not possible
M to O: on a BusRd/Flush (but no memory writeback)
O to I: on CacheEvict/BusWB or {BusRdX,BusUpgr}/Flush
O to S: not possible
O to E: not possible
O to M: on PrWr/BusUpgr
At most one cache can have a line in O state at any point in time
Two main design choices for MOESI
Consider the example: P0 reads x, P0 writes x, P1 reads x, P2 reads x, P3 reads x, …
When P1 launches BusRd, P0 sources the line and now the protocol has two options:
1. The line in P0 goes to O and the line in P1 is filled in state S; 2. The line in P0
goes to S and the line in P1 is filled in state O i.e. P1 inherits ownership from P0
For distributed shared memory, the second choice is better
According to the second choice, when P2 generates a BusRd request, P1 sources the
line and transitions from O to S; P2 becomes the new owner
Some SMPs do not support the E state
In many cases it is not helpful, only complicates the protocol
MOSI allows a compact state encoding in 2 bits
Sun WildFire uses MOSI protocol

Hybrid inval+update

One possible hybrid protocol


Keep a counter per cache line and make Dragon update protocol the default
Every time the local processor accesses a cache line set its counter to some pre-
defined threshold k
On each received update decrease the counter by one
When the counter reaches zero, the line is locally invalidated hoping that eventually
the writer will switch to M state from Sm state when no sharers are left


Module 10: "Design of Shared Memory Multiprocessors"


Lecture 18: "Introduction to Cache Coherence"

Shared Memory Multiprocessors

Four organizations

Hierarchical design

Cache Coherence

Example

What went wrong?

Definitions

Ordering memory op

Example

Cache coherence

Bus-based SMP

Snoopy protocols

Write through caches

State transition

Ordering memory op

Write through is bad

[From Chapter 5 of Culler, Singh, Gupta]


Module 10: "Design of Shared Memory Multiprocessors"


Lecture 18: "Introduction to Cache Coherence"

Four organizations

Shared cache

The switch is a simple controller for granting access to cache banks


Interconnect is between the processors and the shared cache
Which level of cache hierarchy is shared depends on the design: Chip multiprocessors today
normally share the outermost level (L2 or L3 cache)
The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent
accesses
Normally small scale due to heavy bandwidth demand on switch and shared cache
Bus-based SMP

Scalability is limited by the shared bus bandwidth


Interconnect is a shared bus located between the private cache hierarchies and the memory controller
The most popular organization for small to medium-scale servers
Possible to connect 30 or so processors with smart bus design
Bus bandwidth requirement is lower compared to shared cache approach
Why?
Dancehall

Better scalability compared to previous two designs


The difference from bus-based SMP is that the interconnect is a scalable point-to-point
network (e.g. crossbar or other topology)
Memory is still symmetric from all processors
Drawback: a cache miss may take a long time since all memory banks too far off from the
processors (may be several network hops)
Distributed shared memory

The most popular scalable organization


Each node now has local memory banks
Shared memory on other nodes must be accessed over the network
Remote memory access
Non-uniform memory access (NUMA)
Latency to access local memory is much smaller compared to remote memory
Caching is very important to reduce remote memory access
In all four organizations caches play an important role in reducing latency and bandwidth requirement
If an access is satisfied in cache, the transaction will not appear on the interconnect
and hence the bandwidth requirement of the interconnect will be less (shared L1 cache
does not have this advantage)
In distributed shared memory (DSM) cache and local memory should be used cleverly
Bus-based SMP and DSM are the two designs supported today by industry vendors
In bus-based SMP every cache miss is launched on the shared bus so that all
processors can see all transactions
In DSM this is not the case


Module 10: "Design of Shared Memory Multiprocessors"


Lecture 18: "Introduction to Cache Coherence"

Hierarchical design

Possible to combine bus-based SMP and DSM to build hierarchical shared memory
Sun Wildfire connects four large SMPs (28 processors) over a scalable interconnect to
form a 112p multiprocessor
IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and
L3 caches (this is called a chip multiprocessor); connect these chips over a network to
form scalable multiprocessors
Next few lectures will focus on bus-based SMPs only

Cache Coherence

Intuitive memory model


For sequential programs we expect a memory location to return the latest value written
to that location
For concurrent programs running on multiple threads or processes on a single
processor we expect the same model to hold because all threads see the same cache
hierarchy (same as shared L1 cache)
For multiprocessors there remains a danger of using a stale value: in SMP or DSM the
caches are not shared and processors are allowed to replicate data independently in
each cache; hardware must ensure that cached values are coherent across the
system and they satisfy programmers’ intuitive memory model

Example

Assume a write-through cache i.e. every store updates the value in cache as well as in
memory
P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and memory value
P0: reads x from its cache and gets the value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is
completely incoherent)
P2: writes x=10, updates its cached value and memory value
Consider the same example with a writeback cache i.e. values are written back to memory
only when the cache line is evicted from the cache
P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since caches are not
write through)
The state of the line in P1 and P2 is M while the line in P0 is clean
Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from
P0 will not issue a writeback (clean lines do not need writeback)
Suppose P2 evicts the line first, and then P1
Final memory value is 7: we lost the store x=10 from P2

What went wrong?

For write through cache


The memory value may be correct if the writes are correctly ordered
But the system allowed a store to proceed when there is already a cached copy


Lesson learned: must invalidate all cached copies before allowing a store to proceed
Writeback cache
Problem is even more complicated: stores are no longer visible to memory immediately
Writeback order is important
Lesson learned: do not allow more than one copy of a cache line in M state
Need to formalize the intuitive memory model
In sequential programs the order of read/write is defined by the program order; the
notion of “last write” is well-defined
For multiprocessors how do you define “last write to a memory location” in presence of
independent caches?
Within a processor it is still fine, but how do you order read/write across processors?


Module 10: "Design of Shared Memory Multiprocessors"


Lecture 18: "Introduction to Cache Coherence"

Definitions

Memory operation: a read (load), a write (store), or a read-modify-write


Assumed to take place atomically
A memory operation is said to issue when it leaves the issue queue and looks up the cache
A memory operation is said to perform with respect to a processor when that processor can
tell that the operation has taken place, based on its other issued memory operations
A read is said to perform with respect to a processor when subsequent writes issued
by that processor cannot affect the returned read value
A write is said to perform with respect to a processor when a subsequent read from
that processor to the same address returns the new value

Ordering memory op

A memory operation is said to complete when it has performed with respect to all processors
in the system
Assume that there is a single shared memory and no caches
Memory operations complete in shared memory when they access the corresponding
memory locations
Operations from the same processor complete in program order: this imposes a
partial order among the memory operations
Operations from different processors are interleaved in such a way that the program
order is maintained for each processor: memory imposes some total order (many
are possible)

Example

P0: x = 8; u = y; v = 9;

P1: r = 5; y = 4; t = v;

Legal total order:

x = 8; u = y; r = 5; y = 4; t = v; v = 9;

Another legal total order:

x = 8; r = 5; y = 4; u = y; v = 9; t = v;

“Last” means the most recent in some legal total order


A system is coherent if
Reads get the last written value in the total order
All processors see writes to a location in the same order

Cache coherence

Formal definition
A memory system is coherent if the values returned by reads to a memory location
during an execution of a program are such that all operations to that location can form
a hypothetical total order that is consistent with the serial order and has the following
two properties:


1. Operations issued by any particular processor perform according to the issue order
2. The value returned by a read is the value written to that location by the last write in
the total order
Two necessary features that follow from above:
A. Write propagation: writes must eventually become visible to all processors
B. Write serialization: Every processor should see the writes to a location in the same
order (if I see w1 before w2, you should not see w2 before w1)

Bus-based SMP

Extend the philosophy of uniprocessor bus transactions


Three phases: arbitrate for bus, launch command (often called request) and address,
transfer data
Every device connected to the bus can observe the transaction
Appropriate device responds to the request
In SMP, processors also observe the transactions and may take appropriate actions to
guarantee coherence
The other device on the bus that will be of interest to us is the memory controller (north
bridge in standard mother boards)
Depending on the bus transaction a cache block executes a finite state machine
implementing the coherence protocol


Snoopy protocols

Cache coherence protocols implemented in bus-based machines are called snoopy protocols
The processors snoop or monitor the bus and take appropriate protocol actions based
on snoop results
Cache controller now receives requests both from processor and bus
Since cache state is maintained on a per line basis that also dictates the coherence
granularity
Cannot normally take a coherence action on parts of a cache line
The coherence protocol is implemented as a finite state machine on a per cache line
basis
The snoop logic in each processor grabs the address from the bus and decides if any
action should be taken on the cache line containing that address (only if the line is in
cache)

Write through caches

There are only two cache line states


Invalid (I): not in cache
Valid (V): present in cache, may be present in other caches also
Read access to a cache line in I state generates a BusRd request on the bus
Memory controller responds to the request and after reading from memory launches
the line on the bus
Requester matches the address and picks up the line from the bus and fills the cache
in V state
A store to a line always generates a BusWr transaction on the bus (since write
through); other sharers either invalidate the line in their caches or update the line with
new value

State transition

The finite state machine for each cache line is small (a code sketch follows at the end of this section):

On a write miss no line is allocated


The state remains at I: this policy is called write-through, write-no-allocate
A/B means: A is generated by processor, B is the resulting bus transaction (if any)
Changes for write through write allocate?
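The transition diagram is not reproduced here; as a substitute, the following is a minimal C sketch of the two-state controller for the write-through, write-no-allocate, invalidation-based variant described above. All type, state and event names are made up for illustration; the write-allocate change would simply make a processor write in the I state fetch the line and move to V.

/* Two-state snoopy controller for a write-through, write-no-allocate,
   invalidation-based protocol; one such FSM runs per cache line. */
typedef enum { WT_I, WT_V } wt_state_t;                 /* Invalid, Valid */
typedef enum { WT_PrRd, WT_PrWr, WT_BusRd, WT_BusWr } wt_event_t;

wt_state_t wt_next_state(wt_state_t s, wt_event_t e)
{
    switch (s) {
    case WT_I:
        if (e == WT_PrRd) return WT_V;  /* read miss: BusRd, fill line in V      */
        if (e == WT_PrWr) return WT_I;  /* write miss: BusWr only, no allocation */
        return WT_I;                    /* snooped transactions: nothing to do   */
    case WT_V:
        if (e == WT_BusWr) return WT_I; /* another processor wrote: invalidate   */
        return WT_V;                    /* read hit, or write hit (still BusWr),
                                           or snooped BusRd: line stays valid    */
    }
    return WT_I;
}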

Ordering memory op

Assume that the bus is atomic


It takes up the next transaction only after finishing the previous one
Read misses and writes appear on the bus and hence are visible to all processors


What about read hits?


They take place transparently in the cache
But they are correct as long as they are correctly ordered with respect to writes
And all writes appear on the bus and hence are visible immediately in the presence of
an atomic bus
In general, in between writes reads can happen in any order without violating coherence
Writes establish a partial order

Write through is bad

High bandwidth requirement


Every write appears on the bus
Assume a 3 GHz processor running application with 10% store instructions, assume
CPI of 1
If the application runs for 100 cycles it generates 10 stores; assume each store is 4
bytes; 40 bytes are generated per 100/3 ns i.e. BW of 1.2 GB/s
A 1 GB/s bus cannot even support one processor
There are multiple processors and also there are read misses
Writeback caches absorb most of the write traffic
Writes that hit in cache do not go on bus (not visible to others)
Complicated coherence protocol with many choices


Module 10: "Design of Shared Memory Multiprocessors"


Lecture 19: "Sequential Consistency and Cache Coherence Protocols"

Memory consistency

Consistency model

Sequential consistency

What is program order?

OOO and SC

SC example

Implementing SC

Write atomicity

Summary of SC

Back to shared bus

Snoopy protocols

Stores

Invalidation vs. update

Which one is better?

MSI protocol

State transition

MSI protocol

M to S, or M to I?

MSI example

MESI protocol

State transition

MESI protocol

MESI example


Memory consistency

Need a more formal description of memory ordering


How to establish the order between reads and writes from different processors to
different variables?
The most clear way is to use synchronization
P0: A=1; flag=1
P1: while (!flag); print A;
Another example (assume A=0, B=0 initially)
P0: A=1; print B;
P1: B=1; print A;
What do you expect?
Memory consistency model is a contract between programmer and hardware regarding
memory ordering

Consistency model

A multiprocessor normally advertises the supported memory consistency model


This essentially tells the programmer what the possible correct outcome of a program
could be when run on that machine
Cache coherence deals with memory operations to the same location, but not different
locations
Without a formally defined order across all memory operations it often becomes
impossible to argue about what is correct and what is wrong in shared memory
Various memory consistency models
Sequential consistency (SC) is the most intuitive one and we will focus on it now (more
consistency models later)

Sequential consistency

Total order achieved by interleaving accesses from different processors


The accesses from the same processor are presented to the memory system in program
order
Essentially, behaves like a randomly moving switch connecting the processors to memory
Picks the next access from a randomly chosen processor
Lamport’s definition of SC
A multiprocessor is sequentially consistent if the result of any execution is the same as
if the operations of all the processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in the order specified
by its program

What is program order?

Any legal re-ordering is allowed


The program order is the order of instructions from a sequential piece of code where
programmer’s intuition is preserved
The order must produce the result a programmer expects
Can out-of-order execution violate program order?
No. All microprocessors commit instructions in-order and that is where the state


becomes visible
For modern microprocessors the program order is really the commit order
Can out-of-order (OOO) execution violate SC?
Yes. Need extra logic to support SC on top of OOO


OOO and SC

Consider a simple example (all are zero initially)

P0: x=w+1; r=y+1;

P1: y=2; w=y+1;

Suppose the load that reads w takes a miss and so w is not ready for a long time;
therefore, x=w+1 cannot complete immediately; eventually w returns with value 3
Inside the microprocessor r=y+1 completes (but does not commit) before x=w+1 and
gets the old value of y (possibly from cache); eventually instructions commit in order
with x=4, r=1, y=2, w=3
So we have the following partial orders

P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1

Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2

Combine these to get a contradictory total order: y=2 < w=y+1 < x=w+1 < r=y+1 < y=2, a cycle


What went wrong?

SC example

Consider the following example

P0: A=1; print B;

P1: B=1; print A;

Possible outcomes for an SC machine


(A, B) = (0,1); interleaving: B=1; print A; A=1; print B
(A, B) = (1,0); interleaving: A=1; print B; B=1; print A
(A, B) = (1,1); interleaving: A=1; B=1; print A; print B
A=1; B=1; print B; print A
(A, B) = (0,0) is impossible: read of A must occur before write of A and read of B must
occur before write of B i.e. print A < A=1 and print B < B=1, but A=1 < print B and B=1
< print A; thus print B < B=1 < print A < A=1 < print B which implies print B < print B, a
contradiction
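The same argument can be checked mechanically by enumerating every legal SC interleaving of the four operations. The short C program below is purely illustrative and not part of the course material (it uses the GCC __builtin_popcount helper); it preserves program order within each processor, tries all interleavings, and reports (0,0) as impossible.

#include <stdio.h>

int main(void)
{
    int seen[2][2] = {{0}};
    /* A 4-step interleaving: bit i of mask says whether step i is P1's (1) or
       P0's (0). Valid interleavings have exactly two steps per processor, and
       running each processor's two operations in order preserves program order. */
    for (int mask = 0; mask < 16; ++mask) {
        if (__builtin_popcount(mask) != 2) continue;
        int A = 0, B = 0, pc0 = 0, pc1 = 0, outA = 0, outB = 0;
        for (int step = 0; step < 4; ++step) {
            if (mask & (1 << step)) {                  /* P1: B=1; print A;  */
                if (pc1++ == 0) B = 1; else outA = A;
            } else {                                   /* P0: A=1; print B;  */
                if (pc0++ == 0) A = 1; else outB = B;
            }
        }
        seen[outA][outB] = 1;
    }
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b)
            printf("(A, B) = (%d, %d): %s\n", a, b,
                   seen[a][b] ? "possible" : "impossible");
    return 0;
}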

Implementing SC

Two basic requirements


Memory operations issued by a processor must become visible to others in program
order
Need to make sure that all processors see the same total order of memory operations:
in the previous example for the (0,1) case both P0 and P1 should see the same
interleaving: B=1; print A; A=1; print B
The tricky part is to make sure that writes become visible in the same order to all processors
Write atomicity: as if each write is an atomic operation
Otherwise, two processors may end up using different values (which may still be


correct from the viewpoint of cache coherence, but will violate SC)

Write atomicity

Example (A=0, B=0 initially)

P0: A=1;

P1: while (!A); B=1;

P2: while (!B); print A;

A correct execution on an SC machine should print A=1


A=0 will be printed only if write to A is not visible to P2, but clearly it is visible to P1
since it came out of the loop
Thus A=0 is possible if P1 sees the order A=1 < B=1 and P2 sees the order B=1 <
A=1 i.e. from the viewpoint of the whole system the write A=1 was not “atomic”
Without write atomicity P2 may proceed to print 0 with a stale value from its cache

Summary of SC

Program order from each processor creates a partial order among memory operations
Interleaving of these partial orders defines a total order
Sequential consistency: one of many total orders
A multiprocessor is said to be SC if any execution on this machine is SC compliant
Sufficient but not necessary conditions for SC
Issue memory operation in program order
Every processor waits for write to complete before issuing the next operation
Every processor waits for read to complete and the write that affects the returned
value to complete before issuing the next operation (important for write atomicity)


Back to shared bus

Centralized shared bus makes it easy to support SC


Writes and reads are all serialized in a total order through the bus transaction ordering
If a read gets a value of a previous write, that write is guaranteed to be complete
because that bus transaction is complete
The write order seen by all processors is the same in a write through system because
every write causes a transaction and hence is visible to all in the same order
In a nutshell, every processor sees the same total bus order for all memory operations
and therefore any bus-based SMP with write through caches is SC
What about a multiprocessor with writeback cache?
No SMP uses write through protocol due to high BW

Snoopy protocols

No change to processor or cache


Just extend the cache controller with snoop logic and exploit the bus
We will focus on writeback caches only
Possible states of a cache line: Invalid (I), Shared (S), Modified or dirty (M), Clean
exclusive (E), Owned (O); not every processor supports all five states
E state is equivalent to M in the sense that the line has permission to write, but in E
state the line is not yet modified and the copy in memory is the same as in cache; if
someone else requests the line the memory will provide the line
O state is similar to E except that memory is not responsible for
servicing requests to the line; the owner must supply the line (just as in M state)
Note that store misses really read the line from memory (as opposed to writing to it)

Stores

Look at stores a little more closely


There are three situations at the time a store issues: the line is not in the cache, the
line is in the cache in S state, the line is in the cache in one of M, E and O states
If the line is in I state, the store generates a read-exclusive request on the bus and
gets the line in M state
If the line is in S or O state, that means the processor only has read permission for
that line; the store generates an upgrade request on the bus and the upgrade
acknowledgment gives it the write permission (this is a data-less transaction)
If the line is in M or E state, no bus transaction is generated; the cache already has
write permission for the line (this is the case of a write hit; previous two are write
misses)

Invalidation vs. update

Two main classes of protocols:


Invalidation-based and update-based
Dictates what action should be taken on a write
Invalidation-based protocols invalidate sharers when a write miss (upgrade or readX)
appears on the bus
Update-based protocols update the sharer caches with new value on a write: requires


write transactions (carrying just the modified bytes) on the bus even on write hits (not
very attractive with writeback caches)
Advantage of update-based protocols: sharers continue to hit in the cache while in
invalidation-based protocols sharers will miss next time they try to access the line
Advantage of invalidation-based protocols: only write misses go on bus (suited for
writeback caches) and subsequent stores to the same line are cache hits

Which one is better?

Difficult to answer
Depends on program behavior and hardware cost
When is update-based protocol good?
What sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates
When is invalidation-protocol good?
Sequence of multiple writes to a cache line
Saves intermediate write transactions
Also think about the overhead of initiating small updates for every write in update protocols
Invalidation-based protocols are much more popular
Some systems support both or maybe some hybrid based on dynamic sharing pattern
of a cache line


MSI protocol

Forms the foundation of invalidation-based writeback protocols


Assumes only three supported cache line states: I, S, and M
There may be multiple processors caching a line in S state
There must be exactly one processor caching a line in M state and it is the owner of
the line
If none of the caches have the line, memory must have the most up-to-date copy of
the line
Processor requests to cache: PrRd, PrWr
Bus transactions: BusRd, BusRdX, BusUpgr, BusWB

State transition

MSI protocol

Few things to note


Flush operation essentially launches the line on the bus
Processor with the cache line in M state is responsible for flushing the line on bus
whenever there is a BusRd or BusRdX transaction generated by some other processor
On BusRd the line transitions from M to S, but not M to I. Why? Also at this point both
the requester and memory pick up the line from the bus; the requester puts the line in
its cache in S state while memory writes the line back. Why does memory need to write
back?
On BusRdX the line transitions from M to I and this time memory does not need to pick
up the line from bus. Only the requester picks up the line and puts it in M state in its
cache. Why?
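Since the transition diagram is not reproduced here, the following is a minimal C sketch of the MSI per-line state machine just described. Data movement (flush, fill, memory writeback) and evictions are only noted in comments, and all identifiers are illustrative rather than taken from any real controller.

/* Per-cache-line MSI state machine. Processor-side events: PrRd, PrWr.
   Snooped bus events: BusRd, BusRdX, BusUpgr. Actions noted in comments. */
typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { E_PrRd, E_PrWr, E_BusRd, E_BusRdX, E_BusUpgr } msi_event_t;

msi_state_t msi_next_state(msi_state_t s, msi_event_t e)
{
    switch (s) {
    case MSI_I:
        if (e == E_PrRd) return MSI_S;   /* generate BusRd, fill line in S        */
        if (e == E_PrWr) return MSI_M;   /* generate BusRdX, fill line in M       */
        return MSI_I;                    /* snooped transactions: no action       */
    case MSI_S:
        if (e == E_PrWr) return MSI_M;   /* generate BusUpgr (data-less)          */
        if (e == E_BusRdX || e == E_BusUpgr)
            return MSI_I;                /* another writer: invalidate            */
        return MSI_S;                    /* PrRd hit or snooped BusRd: stay in S  */
    case MSI_M:
        if (e == E_BusRd)  return MSI_S; /* flush line on bus; memory writes back */
        if (e == E_BusRdX) return MSI_I; /* flush line on bus; requester takes M  */
        return MSI_M;                    /* PrRd/PrWr hits: no bus transaction    */
    }
    return MSI_I;
}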

M to S, or M to I?

BusRd takes a cache line in M state to S state


The assumption here is that the processor will read it soon, so save a cache miss by
going to S
May not be good if the sharing pattern is migratory: P0 reads and writes cache line A,


then P1 reads and writes cache line A, then P2…


For migratory patterns it makes sense to go to I state so that a future invalidation is
saved
But for bus-based SMPs it does not matter much because an upgrade transaction will
be launched anyway by the next writer, unless there is special hardware support to
avoid that: how?
The big problem is that the sharing pattern for a cache line may change dynamically:
adaptive protocols are good and are supported by Sequent Symmetry and MIT Alewife

MSI example

Take the following example


P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
Assume the state of the cache line containing the address of x is I in all processors

P0 generates BusRd, memory provides line, P0 puts line in S state

P1 generates BusRd, memory provides line, P1 puts line in S state

P1 generates BusUpgr, P0 snoops and invalidates line, memory does not respond, P1
sets state of line to M

P0 generates BusRd, P1 flushes line and goes to S state, P0 puts line in S state,
memory writes back

P2 generates BusRd, memory provides line, P2 puts line in S state

P3 generates BusRdX, P0, P1, P2 snoop and invalidate, memory provides line, P3 puts
line in cache in M state


MESI protocol

The most popular invalidation-based protocol e.g., appears in Intel Xeon MP


Why need E state?
The MSI protocol requires two transactions to go from I to M even if there is no
intervening requests for the line: BusRd followed by BusUpgr
We can save one transaction by having memory controller respond to the first BusRd
with E state if there is no other sharer in the system
How to know if there is no other sharer? Needs a dedicated control wire that gets
asserted by a sharer (wired OR)
Processor can write to a line in E state silently and take it to M state

State transition

MESI protocol

If a cache line is in M state definitely the processor with the line is responsible for flushing it
on the next BusRd or BusRdX transaction
If a line is not in M state who is responsible?
Memory or other caches in S or E state?
Original Illinois MESI protocol assumed cache-to-cache transfer i.e. any processor in E
or S state is responsible for flushing the line
However, it requires some expensive hardware, namely, if multiple processors are
caching the line in S state who flushes it? Also, memory needs to wait to know if it
should source the line
Without cache-to-cache sharing memory always sources the line unless it is in M state

MESI example

Take the following example


P0 reads x, P0 writes x, P1 reads x, P1 writes x, …

P0 generates BusRd, memory provides line, P0 puts line in cache in E state


P0 does write silently, goes to M state

P1 generates BusRd, P0 provides line, P1 puts line in cache in S state, P0 transitions


to S state
Rest is identical to MSI

Consider this example: P0 reads x, P1 reads x, …

P0 generates BusRd, memory provides line, P0 puts line in cache in E state

P1 generates BusRd, memory provides line, P1 puts line in cache in S state, P0


transitions to S state (no cache-to-cache sharing)
Rest is same as MSI


Module 10: "Design of Shared Memory Multiprocessors"


Lecture 20: "Performance of Coherence Protocols"

MOESI protocol

Dragon protocol

State transition

Dragon example

Design issues

General issues

Evaluating protocols

Protocol optimizations

Cache size

Cache line size

Impact on bus traffic

Large cache line

Performance of update protocol

Hybrid inval+update

Update-based protocol

Shared cache


MOESI protocol

Some SMPs implement MOESI today e.g., AMD Athlon MP and the IBM servers
Why is the O state needed?
O state is very similar to E state with four differences: 1. If a cache line is in O state
in some cache, that cache is responsible for sourcing the line to the next requester; 2.
The memory may not have the most up-to-date copy of the line (this implies 1); 3.
Eviction of a line in O state generates a BusWB; 4. Write to a line in O state must
generate a bus transaction
When a line transitions from M to S it is necessary to write the line back to memory
For a migratory sharing pattern (frequent in database workloads) this leads to a series
of writebacks to memory
These writebacks just keep the memory banks busy and consumes memory bandwidth
Take the following example
P0 reads x, P0 writes x, P1 reads x, P1 writes x, P2 reads x, P2 writes x, …
Thus at the time of a BusRd response the memory will write the line back: one
writeback per processor handover
O state aims at eliminating all these writebacks by transitioning from M to O instead of
M to S on a BusRd/Flush
Subsequent BusRd requests are replied by the owner holding the line in O state
The line is written back only when the owner evicts it: one single writeback
State transitions pertaining to O state
I to O: not possible (or maybe; see below)
E to O or S to O: not possible
M to O: on a BusRd/Flush (but no memory writeback)
O to I: on CacheEvict/BusWB or {BusRdX,BusUpgr}/Flush
O to S: not possible (or maybe; see below)
O to E: not possible (or maybe if silent eviction not allowed)
O to M: on PrWr/BusUpgr
At most one cache can have a line in O state at any point in time
Two main design choices for MOESI
Consider the example P0 reads x, P0 writes x, P1 reads x, P2 reads x, P3 reads x, …
When P1 launches BusRd, P0 sources the line and now the protocol has two options:
1. The line in P0 goes to O and the line in P1 is filled in state S; 2. The line in P0
goes to S and the line in P1 is filled in state O i.e. P1 inherits ownership from P0
For bus-based SMPs the two choices will yield roughly the same performance
For DSM multiprocessors we will revisit this issue if time permits
According to the second choice, when P2 generates a BusRd request, P1 sources the
line and transitions from O to S; P2 becomes the new owner
Some SMPs do not support the E state
In many cases it is not helpful, only complicates the protocol
MOSI allows a compact state encoding in 2 bits
Sun WildFire uses MOSI protocol

Dragon protocol

An update-based protocol for writeback caches


Four states: Two of them are standard E and M


Shared clean (Sc): The standard S state


Shared modified (Sm): This is really the O state
In fact, five states because you always have I i.e. not in cache
So really a MOESI update-based protocol
New bus transaction: BusUpd
Used to update part of cache line
Distinguish between cache hits and misses:
PrRd and PrWr are hits, PrRdMiss and PrWrMiss are misses


Dragon example

Take the following sequence


P0 reads x, P1 reads x, P1 writes x, P0 reads x, P2 reads x, P3 writes x
P0 generates BusRd, shared line remains low, it puts line in E state
P1 generates BusRd, shared line is asserted by P0, P1 puts line in Sc state, P0 also
transitions to Sc state
P1 generates BusUpd, P0 asserts shared line, P1 takes the line to Sm state, P0
applies update but remains in Sc
P0 reads from cache, no state transition
P2 generates BusRd, P0 and P1 assert shared line, P1 sources the line on bus, P2
puts line in Sc state, P1 remains in Sm state, P0 remains in Sc state
P3 generates BusRd followed by BusUpd, P0, P1, P2 assert shared line, P1 sources
the line on bus, P3 puts line in Sm state, line in P1 goes to Sc state, lines in P0 and
P2 remain in Sc state, all processors update line

Design issues

Can we eliminate the Sm state?


Yes. Provided on every BusUpd the memory is also updated; then the Sc state is
sufficient (essentially boils down to a standard MSI update protocol)
However, update to cache may be faster than memory; but updating cache means
occupying data banks during update thereby preventing the processor from accessing
the cache, so not to degrade performance, extra cache ports may be needed
Is it necessary to launch a bus transaction on an eviction of a line in Sc state?
May help if this was the last copy of line in Sc state
If there is a line in Sm state, it can go back to M and save subsequent unnecessary
BusUpd transactions (the shared wire already solves this)


General issues

Thus far we have assumed an atomic bus where transactions are not interleaved
In reality, high performance busses are pipelined and multiple transactions are in
progress at the same time
How do you reason about coherence?
Thus far we have assumed that the processor has only one level of cache
How to extend the coherence protocol to multiple levels of cache?
Normally, the cache coherence protocols we have discussed thus far executes only on
the outermost level of cache hierarchy
A simpler but different protocol runs within the hierarchy to maintain coherence
We will revisit these questions soon

Evaluating protocols

In message passing machines the design of the message layer plays an important role
Similarly, cache coherence protocols are central to the design of a shared memory
multiprocessor
The protocol performance depends on an array of parameters
Experience and intuition help in determining good design points
Otherwise designers use workload-driven simulations for cost/performance analysis
Goal is to decide where to spend money, time and energy
The simulators model the underlying multiprocessor in enough detail to capture correct
performance trends as one explores the parameter space


Protocol optimizations

MSI vs. MESI


Need to measure bus bandwidth consumption with and without E state because E
state saves the otherwise S to M BusUpgr transactions
Turns out that the E state is not very helpful
The main reason is the E to M transition is rare; normally some other processor also
reads the line before the write takes place (if at all)
How important is BusUpgr?
Again need to look at bus bandwidth consumption with BusUpgr and with BusUpgr
replaced by BusRdX
Turns out that BusUpgr helps
Smaller caches demand more bus bandwidth
Especially when the primary working set does not fit in cache

Cache size

With increasing problem size normally working set size also increases
More pressure on cache
With increasing number of processors working set per processor goes down
Less pressure on cache
This effect sometimes leads to superlinear speedup i.e. on P processors you get
speedup more than P
Important to design the parallel program so that the critical working sets fit in cache
Otherwise bus bandwidth requirement may increase dramatically

Cache line size

Uniprocessors have three C misses: cold, capacity, conflict


Multiprocessors add two new types
True sharing miss: inherent in the algorithm e.g., P0 writes x and P1 uses x, so P1 will
suffer from a true sharing miss when it reads x
False sharing miss: artifactual miss due to cache line size e.g. P0 writes x and P1
reads y, but x and y belong to the same cache line
True and false sharing together form the communication or coherence misses in
multiprocessors making it four C misses
Technology is pushing for large cache line sizes, but…
Increasing cache line size helps reduce
Cold misses if there is spatial locality
True sharing misses if the algorithm is properly structured to exploit spatial locality
Increasing cache line size
Reduces the number of sets in a fixed-sized cache and may lead to more conflict
misses
May increase the volume of false sharing
May increase miss penalty depending on the bus transfer algorithm (need to transfer
more data per miss)
May fetch unnecessary data and waste bandwidth
Note that true sharing misses will exist even with an infinite cache


Impact of cache line size on true sharing heavily depends on application characteristics
Blocked matrix computations tend to have good spatial locality with shared data
because they access data in small blocks thereby exploiting temporal as well as spatial
locality
Nearest neighbor computations tend to have little spatial locality when accessing left
and right border elements
The exact proportion of various types of misses in an application normally changes with
cache size, problem size and the number of processors
With small cache, capacity miss may dominate everything else
With large cache, true sharing misses may cause the major traffic
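The false sharing component can be seen, and removed, directly at the data-structure level. The sketch below assumes a 128-byte cache line; the structure names are made up.

#define LINE_SIZE 128

/* Falsely shared: x and y will almost certainly sit on the same cache line,
   so P0 writing x keeps invalidating P1's copy even though P1 only reads y. */
struct counters_bad {
    volatile int x;   /* written by P0 */
    volatile int y;   /* read by P1    */
};

/* Fixed: pad x out to its own cache line. Costs memory, but P0's writes to x
   no longer disturb the line that P1 reads y from. */
struct counters_good {
    volatile int x;
    char pad[LINE_SIZE - sizeof(int)];
    volatile int y;
};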

Impact on bus traffic

When cache line size is increased it may seem that we bring in more data together and have
better spatial locality and reuse
Should reduce bus traffic per unit computation
However, bus traffic normally increases monotonically with cache line size
Unless we have enough spatial and temporal locality to exploit, bus traffic will increase
For most cases bus bandwidth requirement attains a minimum at a block size different
from the minimum size; this is because at very small line sizes the overhead of
communication becomes too high


Large cache line

Large cache lines are intended to amortize the DRAM access and bus transfer latency over a
large number of data points
But false sharing becomes a problem
Hardware solutions
Coherence at subblock level: divide the cache line into smaller blocks and maintain
coherence for each of them; subblock invalidation on a write reduces chances of
coherence misses even in the presence of false sharing
Delay invalidations: send invalidations only after the writer has completed several
writes; but this directly impacts the write propagation model and hence leads to
consistency models weaker than SC
Use update-based protocols instead of invalidation-based: probably not a good idea

Performance of update protocol

Already discussed main trade-offs


Consider running a sequential program on an SMP with update protocol
If the kernel decides to migrate the process to a different processor subsequent
updates will go to caches that are never used: “pack-rat” phenomenon
Possible designs that combine update and invalidation-based protocols
For each page, decide what type of protocol to run and make it part of the translation
(i.e. hold it in TLB)
Otherwise dynamically detect for each cache line what protocol is good

Hybrid inval+update

One possible hybrid protocol


Keep a counter per cache line and make Dragon update protocol the default
Every time the local processor accesses a cache line set its counter to some pre-
defined threshold k
On each received update decrease the counter by one
When the counter reaches zero, the line is locally invalidated hoping that eventually
the writer will switch to M state from Sm state when no sharers are left
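A minimal sketch of this counter mechanism, viewed as per-line metadata kept by the cache controller; the structure, the field names and the threshold value are all hypothetical.

#define HYBRID_THRESHOLD 4      /* the pre-defined threshold k; value is arbitrary */

struct line_meta {
    int update_credit;          /* counts down toward local self-invalidation */
    /* ... tag, coherence state, etc. ... */
};

/* The local processor touched the line: keep accepting updates for a while. */
void on_local_access(struct line_meta *m)
{
    m->update_credit = HYBRID_THRESHOLD;
}

/* A BusUpd for this line was snooped and applied. Returns 1 if the copy should
   now be invalidated locally, so that the writer can eventually move from Sm to
   M and stop sending updates. */
int on_received_update(struct line_meta *m)
{
    if (--m->update_credit <= 0)
        return 1;               /* self-invalidate this copy */
    return 0;                   /* keep the freshly updated copy */
}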

Update-based protocol

Update-based protocols tend to increase capacity misses slightly


Cache lines stay longer in the cache compared to an invalidation-based protocol; why?
Update-based protocols can significantly reduce coherence misses
True sharing misses definitely go down
False sharing misses also decrease due to absence of invalidations
But update-based protocols significantly increase bus bandwidth demand
This increases bus contention and delays other transactions
Possible to delay the updates by merging a number of them in a buffer

Shared cache

Advantages
If there is only one level of cache no need for a coherence protocol


Very fine-grained sharing resulting in fast cross-thread communication


No false sharing
Smaller cache capacity requirement: overlapped working set
One processor’s fetched cache line can be used by others: prefetch effect
Disadvantages
High cache bandwidth requirement and port contention
Destructive interference and conflict misses
Will revisit this when discussing chip multiprocessing and hyper-threading


Module 11: "Synchronization"


Lecture 21: "Introduction to Synchronization"

Cache Coherence & OOO Execution

Complication with stores

What about others?

More example

Yet another example

Types

Synchronization

Waiting algorithms

Implementation

Hardwired locks

Software locks

Hardware support

Atomic exchange

Test & set

Fetch & op

Compare & swap

[From Chapter 5 of Culler, Singh, Gupta]


[Speculative synchronization material taken from ASPLOS 2002 proceedings]


Complication with stores

In OOO execution instructions issue out of program order


A store may issue out of program order
But it cannot write its value to cache until it retires i.e. comes to the head of ROB;
Why? (assume a single processor for now)
So its value is kept in a store buffer (this is normally part of the store queue entry
occupied by the store)
If it hits in the cache (i.e. a write hit), nothing happens
If it misses in the cache, either a ReadX or an Upgrade request is issued on the bus
depending on the state of the requested cache line
Until the store retires subsequent loads from the same processor to the same
address can steal the value from store buffer (why not the old value?)

What about others?

Take the following example (assume invalidation-based protocol)


P0 writes x, P1 reads x
P0 issues store, assume that it hits in cache, but it commits much later (any simple
reason?)
P1 issues BusRd (Can it hit in P1’s cache?)
Snoop logic in P0’s cache controller finds that it is responsible for sourcing the cache
line (M state)
What value of x does the launched cache line contain? New value or the old value?
After this BusRd what is the state of P0’s line?
After this BusRd can the loads from P0 still continue to use the value written by the
store?
What happens when P0 ultimately commits the store?
Take the following example (assume invalidation-based protocol)
P0 writes x, P1 reads x
P0 issues store, assume that it hits in cache, but it commits much later (any simple
reason?)
P1 issues BusRd (Can it hit in P1’s cache?)
Snoop logic in P0’s cache controller finds that it is responsible for sourcing the cache
line (M state)
What value of x does the launched cache line contain? New value or the old value?
OLD VALUE
After this BusRd what is the state of P0’s line? S
After this BusRd can the matching loads from P0 still continue to use the value written
by the store? YES
What happens when P0 ultimately commits the store? UPGRADE MISS

More example

In the previous example same situation may arise even if P0 misses in the cache; the timing
of P1’s read decides whether the race happens or not
Another example
P0 writes x, P1 writes x


Suppose the race does happen i.e. P1 launches BusRdX before P0’s store commits
(Can P1 launch upgrade?)
Surely the launched cache line will have old value of x as before
Is it safe for the matching loads from P0 to use the new value of x from store buffer?
What happens when P0’s store ultimately commits?
In the previous example same situation may arise even if P0 misses in the cache; the timing
of P1’s read decides whether the race happens or not
Another example
P0 writes x, P1 writes x
Suppose the race does happen i.e. P1 launches BusRdX before P0’s store commits
(Can P1 launch upgrade?)
Surely the launched cache line will have old value of x as before
Is it safe for the matching loads from P0 to use the new value of x from store buffer?
YES
What happens when P0 ’s store ultimately commits? READ-EXCLUSIVE MISS

Yet another example

Another example
P0 reads x, P0 writes x, P1 writes x
Suppose the race does happen i.e. P1 launches BusRdX before P0’s store commits
Surely the launched cache line will have old value of x as before
What value does P0’s load commit?


Synchronization Types

Mutual exclusion
Synchronize entry into critical sections
Normally done with locks
Point-to-point synchronization
Tell a set of processors (normally set cardinality is one) that they can proceed
Normally done with flags
Global synchronization
Bring every processor to sync
Wait at a point until everyone is there
Normally done with barriers

Synchronization

Normally a two-part process: acquire and release; acquire can be broken into two parts: intent
and wait
Intent: express intent to synchronize (i.e. contend for the lock, arrive at a barrier)
Wait: wait for your turn to synchronization (i.e. wait until you get the lock)
Release: proceed past synchronization and enable other contenders to synchronize
Waiting algorithms do not depend on the type of synchronization

Waiting algorithms

Busy wait (common in multiprocessors)


Waiting processes repeatedly poll a location (implemented as a load in a loop)
Releasing process sets the location appropriately
May cause network or bus transactions
Block
Waiting processes are de-scheduled
Frees up processor cycles for doing something else
Busy waiting is better if
De-scheduling and re-scheduling take longer than busy waiting
No other active process
Does not work for single processor
Hybrid policies: busy wait for some time and then block

Implementation

Popular trend
Architects offer some simple atomic primitives
Library writers use these primitives to implement synchronization algorithms
Normally hardware primitives for acquire and possibly release are provided
Hard to offer hardware solutions for waiting
Also hardwired waiting may not offer that much of flexibility


Hardwired locks

Not popular today


Less flexible
Cannot support large number of locks
Possible designs
Dedicated lock line in bus so that the lock holder keeps it asserted and waiters snoop
the lock line in hardware
Set of lock registers shared among processors and lock holder gets a lock register
(Cray Xmp)

Software locks

Bakery algorithm
Shared: choosing[P] = FALSE, ticket[P] = 0;
Acquire: choosing[i] = TRUE; ticket[i] = max(ticket[0],…,ticket[P-1]) + 1; choosing[i] =
FALSE;
for j = 0 to P-1
while (choosing[j]);
while (ticket[j] && ((ticket[j], j) < (ticket[i], i)));
endfor
Release: ticket[i] = 0;
Does it work for multiprocessors?
Assume sequential consistency
Performance issues related to coherence?
Too much overhead: need faster and simpler lock algorithms
Need some hardware support
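A C rendering of the Bakery lock above for a fixed thread count P; this is only a sketch and, as noted above, it assumes sequential consistency (on real hardware it would additionally need fences or C11 atomics). All names are illustrative.

#define P 8                        /* number of contending threads */

volatile int choosing[P];          /* all initialized to 0 (FALSE) */
volatile int ticket[P];            /* all initialized to 0         */

static int max_ticket(void)
{
    int m = 0;
    for (int j = 0; j < P; ++j)
        if (ticket[j] > m) m = ticket[j];
    return m;
}

void bakery_lock(int i)
{
    choosing[i] = 1;
    ticket[i] = max_ticket() + 1;
    choosing[i] = 0;
    for (int j = 0; j < P; ++j) {
        while (choosing[j]) ;                         /* wait until j holds a ticket */
        while (ticket[j] &&                           /* j is contending and         */
               (ticket[j] < ticket[i] ||              /* has an earlier ticket, or   */
                (ticket[j] == ticket[i] && j < i))) ; /* ties broken by processor id */
    }
}

void bakery_unlock(int i)
{
    ticket[i] = 0;
}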

Hardware support

Start with a simple software lock


Shared: lock = 0;
Acquire: while (lock); lock = 1;
Release or Unlock: lock = 0;
Assembly translation
Lock: lw register, lock_addr /* register is any processor register */
bnez register, Lock
addi register, register, 0x1
sw register, lock_addr
Unlock: xor register, register, register
sw register, lock_addr
Does it work?
What went wrong?
We wanted the read-modify-write sequence to be atomic


Atomic exchange

We can fix this if we have an atomic exchange instruction

addi register, r0, 0x1 /* r0 is hardwired to 0 */


Lock: xchg register, lock_addr /* An atomic load and store */
bnez register, Lock
Unlock remains unchanged

Various processors support this type of instruction


Intel x86 has xchg, Sun UltraSPARC has ldstub (load-store-unsigned byte),
UltraSPARC also has swap
Normally easy to implement for bus-based systems: whoever wins the bus for xchg
can lock the bus
Difficult to support in distributed memory systems

Test & set

Less general compared to exchange

Lock: ts register, lock_addr


bnez register, Lock
Unlock remains unchanged

Loads current lock value in a register and sets location always with 1
Exchange allows to swap any value
A similar type of instruction is fetch & op
Fetch memory location in a register and apply op on the memory location
Op can be a set of supported operations e.g. add, increment, decrement, store etc.
In Test & set op=set

Fetch & op

Possible to implement a lock with fetch & clear then add (used to be supported in BBN
Butterfly 1)

addi reg1, r0, 0x1


Lock: fetch & clr then add reg1, reg2, lock_addr /* fetch in reg2, clear, add reg1 */
bnez reg2, Lock

Butterfly 1 also supports fetch & clear then xor


Sequent Symmetry supports fetch & store
More sophisticated: compare & swap
Takes three operands: reg1, reg2, memory address
Compares the value in reg1 with address and if they are equal swaps the contents of
reg2 and address
Not in line with RISC philosophy (same goes for fetch & add)

Compare & swap

addi reg1, r0, 0x0 /* reg1 has 0x0 */


addi reg2, r0, 0x1 /* reg2 has 0x1 */


Lock: compare & swap reg1, reg2, lock_addr
bnez reg2, Lock


Module 11: "Synchronization"


Lecture 22: "Scalable Locking Primitives"

Traffic of test & set

Backoff test & set

Test & test & set

TTS traffic analysis

Goals of a lock algorithm

Ticket lock

Array-based lock

RISC processors

LL/SC

Locks with LL/SC

Fetch & op with LL/SC

Store conditional & OOO

Speculative SC?

Point-to-point synch.


Traffic of test & set

In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported
every such instruction will generate a transaction (may be good or bad depending on
the support in memory controller; will discuss later)
Let us assume that the lock location is cacheable and is kept coherent
Every invocation of test & set must generate a bus transaction; Why? What is the
transaction? What are the possible states of the cache line holding lock_addr?
Therefore all lock contenders repeatedly generate bus transactions even if someone is
still in the critical section and is holding the lock
Can we improve this?
Test & set with backoff

Backoff test & set

Instead of retrying immediately wait for a while


How long to wait?
Waiting for too long may lead to long latency and lost opportunity
Constant and variable backoff
Special kind of variable backoff: exponential backoff (after the i-th attempt the delay is
k*c^i where k and c are constants)
Test & set with exponential backoff works pretty well

delay = k
Lock: ts register, lock_addr
bez register, Enter_CS
pause (delay) /* Can be simulated as a timed loop */
delay = delay*c
j Lock

Test & test & set

Reduce traffic further


Before trying test & set make sure that the lock is free

Lock: ts register, lock_addr


bez register, Enter_CS
Test: lw register, lock_addr
bnez register, Test
j Lock

How good is it?


In a cacheable lock environment the Test loop will execute from cache until it receives
an invalidation (due to store in unlock); at this point the load may return a zero value
after fetching the cache line
Only when the location becomes zero will everyone try test & set
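Combining this with the exponential backoff of the previous section, a test & test & set lock might look as follows in C. The GCC __sync builtins stand in for the ts instruction, usleep plays the role of pause, and the backoff constants are arbitrary.

#include <unistd.h>              /* usleep(), used here as the pause */

volatile int lock_var = 0;       /* 0: free, 1: held */

void tts_lock(void)
{
    unsigned delay = 1;          /* the constant k */
    for (;;) {
        while (lock_var) ;       /* test: spin in the cache until the lock looks free */
        /* test & set: atomically store 1 and get the old value back */
        if (__sync_lock_test_and_set(&lock_var, 1) == 0)
            return;              /* old value was 0: we own the lock */
        usleep(delay);           /* backoff before retrying */
        if (delay < 1024) delay *= 2;   /* exponential growth, c = 2, capped */
    }
}

void tts_unlock(void)
{
    __sync_lock_release(&lock_var);     /* unlock is a simple store of 0 */
}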

TTS traffic analysis

Recall that unlock is always a simple store


In the worst case everyone will try to enter the CS at the same time
First time P transactions for ts and one succeeds; every other processor suffers a miss
on the load in Test loop; then loops from cache
The lock-holder when unlocking generates an upgrade (why?) and invalidates all
others
All other processors suffer read miss and get value zero now; so they break Test loop
and try ts and the process continues until everyone has visited the CS

(P+(P-1)+1+(P-1)) + ((P-1)+(P-2)+1+(P-2)) + … = (3P-1) + (3P-4) + (3P-7) + … ~ 1.5P^2 asymptotically

For distributed shared memory the situation is worse because each invalidation
becomes a separate message (more later)


Goals of a lock algorithm

Low latency: if no contender the lock should be acquired fast


Low traffic: worst case lock acquire traffic should be low; otherwise it may affect unrelated
transactions
Scalability: Traffic and latency should scale slowly with the number of processors
Low storage cost: Maintaining lock states should not impose unrealistic memory overhead
Fairness: Ideally processors should enter CS according to the order of lock request (TS or
TTS does not guarantee this)

Ticket lock

Similar to Bakery algorithm but simpler


A nice application of fetch & inc
Basic idea is to come and hold a unique ticket and wait until your turn comes
Bakery algorithm failed to offer this uniqueness thereby increasing complexity

Shared: ticket = 0, release_count = 0;


Lock: fetch & inc reg1, ticket_addr
Wait: lw reg2, release_count_addr /* while (release_count != ticket); */
sub reg3, reg2, reg1
bnez reg3, Wait

Unlock: addi reg2, reg2, 0x1 /* release_count++ */


sw reg2, release_count_addr

Initial fetch & inc generates O(P) traffic on bus-based machines (may be worse in DSM
depending on implementation of fetch & inc)
But the waiting algorithm still suffers from 0.5P^2 messages asymptotically
Researchers have proposed proportional backoff i.e. in the wait loop put a delay
proportional to the difference between ticket value and last read release_count
Latency and storage-wise better than Bakery
Traffic-wise better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery)
Guaranteed fairness: the ticket value induces a FIFO queue
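A minimal C sketch of the ticket lock, with the GCC __sync_fetch_and_add builtin playing the role of fetch & inc; the proportional backoff mentioned above is only indicated in a comment, and names are illustrative.

struct ticket_lock {
    volatile unsigned ticket;          /* next ticket to hand out        */
    volatile unsigned release_count;   /* ticket currently being served  */
};

void ticket_acquire(struct ticket_lock *l)
{
    /* fetch & inc: atomically grab a unique ticket */
    unsigned my_ticket = __sync_fetch_and_add(&l->ticket, 1);
    while (l->release_count != my_ticket)
        ;   /* spin; a proportional backoff would pause here for a time
               proportional to (my_ticket - release_count) */
}

void ticket_release(struct ticket_lock *l)
{
    l->release_count = l->release_count + 1;   /* plain store, as in the notes above */
}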

Array-based lock

Solves the O(P^2) traffic problem


The idea is to have a bit vector (essentially a character array if boolean type is not supported)
Each processor comes and takes the next free index into the array via fetch & inc
Then each processor loops on its index location until it becomes set
On unlock a processor is responsible to set the next index location if someone is waiting
Initial fetch & inc still needs O(P) traffic, but the wait loop now needs O(1) traffic
Disadvantage: storage overhead is O(P)
Performance concerns
Avoid false sharing: allocate each array location on a different cache line
Assume a cache line size of 128 bytes and a character array: allocate an array of size
128P bytes and use every 128th position in the array
For distributed shared memory the location a processor loops on may not be in its local


memory: on acquire it must take a remote miss; allocate P pages and let each
processor loop on one bit in a page? Too much wastage; better solution: MCS lock
(Mellor-Crummey & Scott)
Correctness concerns
Make sure to handle corner cases such as determining if someone is waiting on the
next location (this must be an atomic operation) while unlocking
Remember to reset your index location to zero while unlocking
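A C sketch of the array-based lock: it assumes at most NPROC threads contend at a time, pads every slot to an assumed 128-byte cache line to avoid false sharing, and again uses __sync_fetch_and_add as the fetch & inc. Names are illustrative.

#define NPROC     8
#define LINE_SIZE 128

struct array_lock {
    struct { volatile unsigned char go; char pad[LINE_SIZE - 1]; } slot[NPROC];
    volatile unsigned tail;            /* index of the next free slot */
};

/* Initialization: slot[0].go = 1 (lock starts free), all other slots 0, tail = 0. */

unsigned array_acquire(struct array_lock *l)
{
    unsigned my_slot = __sync_fetch_and_add(&l->tail, 1) % NPROC;
    while (!l->slot[my_slot].go)
        ;                              /* O(1) traffic: spin on a private line */
    return my_slot;                    /* caller keeps its slot for the release */
}

void array_release(struct array_lock *l, unsigned my_slot)
{
    l->slot[my_slot].go = 0;                   /* reset own location */
    l->slot[(my_slot + 1) % NPROC].go = 1;     /* pass the lock to the next slot */
}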


RISC processors

All these atomic instructions deviate from the RISC line


Instruction needs a load as well as a store
Also, it would be great if we can offer a few simple instructions with which we can build most
of the atomic primitives
Note that it is impossible to build atomic fetch & inc with xchg instruction
MIPS, Alpha and IBM processors support a pair of instructions: LL and SC
Load linked and store conditional

LL/SC

Load linked behaves just like a normal load with some extra tricks
Puts the loaded value in destination register as usual
Sets a load_linked bit residing in cache controller to 1
Puts the address in a special lock_address register residing in the cache controller
Store conditional is a special store
sc reg, addr stores value in reg to addr only if load_linked bit is set; also it copies the
value in load_linked bit to reg and resets load_linked bit
Any intervening “operation” (e.g., bus transaction or cache replacement) to the cache line
containing the address in lock_address register clears the load_linked bit so that subsequent
sc fails

Locks with LL/SC

Test & set

Lock: LL r1, lock_addr /* Normal read miss/BusRead */


addi r2, r0, 0x1
SC r2, lock_addr /* Possibly upgrade miss */
beqz r2, Lock /* Check if SC succeeded */
bnez r1, Lock /* Check if someone is in CS */

LL/SC is best-suited for test & test & set locks

Lock: LL r1, lock_addr


bnez r1, Lock
addi r1, r0, 0x1
SC r1, lock_addr
beqz r1, Lock

Fetch & op with LL/SC

Fetch & inc

Try: LL r1, addr


addi r1, r1, 0x1
SC r1, addr
beqz r1, Try


Compare & swap: Compare with r1, swap r2 and memory location (here we keep on trying
until comparison passes)

Try: LL r3, addr


sub r4, r3, r1
bnez r4, Try
add r4, r2, r0
SC r4, addr
beqz r4, Try
add r2, r3, r0


Store conditional & OOO

Execution of SC in an OOO pipeline


Rather subtle
For now assume that SC issues only when it comes to the head of ROB i.e. non-
speculative execution of SC
It first checks the load_linked bit; if reset doesn’t even access cache (saves cache
bandwidth and unnecessary bus transactions) and returns zero in register
If load_linked bit is set, it accesses cache and issues bus transaction if needed
(BusReadX if cache line in I state and BusUpgr if in S state)
Checks load_linked bit again before writing to cache (note that cache line goes to M
state in any case)
Can wake up dependents only when SC graduates (a case where a store initiates a
dependence chain)

Speculative SC?

What happens if SC is issued speculatively?


Actual store happens only when it graduates and issuing a store early only starts the
write permission process
Suppose two processors are contending for a lock
Both do LL and succeed because nobody is in CS
Both issue SC speculatively and due to some reason the graduation of SC in both of
them gets delayed
So although initially both may get the line one after another in M state in their caches,
the load_linked bit will get reset in both by the time SC tries to graduate
They go back and start over with LL and may issue SC again speculatively leading to a
livelock (probability of this type of livelock increases with more processors)
Speculative issue of SC with hardwired backoff may help
Better to turn off speculation for SC
What about the branch following SC?
Can we speculate past that branch?
Assume that the branch predictor tells you that the branch is not taken i.e. fall through:
we speculatively venture into the critical section
We speculatively execute the critical section
This may be good and bad
If the branch prediction was correct we did great
If the predictor went wrong, we might have interfered with the execution of the
processor that is actually in CS: may cause unnecessary invalidations and extra traffic
Any correctness issues?

Point-to-point synch.

Normally done in software with flags

P0: A = 1; flag = 1;
P1: while (!flag); print A;

Some old machines supported full/empty bits in memory


Each memory location is augmented with a full/empty bit


Producer writes the location only if bit is reset
Consumer reads location if bit is set and resets it
Lot less flexible: one producer-one consumer sharing only (one producer-many
consumers is very popular); all accesses to a memory location become synchronized
(unless compiler flags some accesses as special)
Possible optimization for shared memory
Allocate flag and data structures (if small) guarded by flag in same cache line e.g., flag
and A in above example
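That optimization amounts to co-locating the flag and the small data it guards, for example as below (assuming a 128-byte line and the GCC aligned attribute); the consumer's read miss on the flag then also brings the data along.

struct __attribute__((aligned(128))) flagged_data {
    volatile int flag;    /* 0: not ready, 1: ready */
    int data;             /* the small item guarded by the flag */
};

void produce(struct flagged_data *fd, int value)
{
    fd->data = value;
    fd->flag = 1;         /* on an SC machine this ordering is preserved */
}

int consume(struct flagged_data *fd)
{
    while (!fd->flag) ;   /* spin until the producer is done */
    return fd->data;
}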


Module 11: "Synchronization"


Lecture 23: "Barriers and Speculative Synchronization"

Barrier

Centralized barrier

Sense reversal

Centralized barrier

Tree barrier

Hardware support

Hardware barrier

Speculative synch.

Why is it good?

How does it work?

Why is it correct?

Performance concerns

Speculative flags and barriers

Speculative flags and branch prediction


Barrier

High-level classification of barriers


Hardware and software barriers
Will focus on two types of software barriers
Centralized barrier: every processor polls a single count
Distributed tree barrier: shows much better scalability
Performance goals of a barrier implementation
Low latency: after all processors have arrived at the barrier, they should be able to
leave quickly
Low traffic: minimize bus transaction and contention
Scalability: latency and traffic should scale slowly with the number of processors
Low storage: barrier state should not be big
Fairness: Preserve some strict order of barrier exit (could be FIFO according to arrival
order); a particular processor should not always be the last one to exit

Centralized barrier

struct bar_type {
int counter;
struct lock_type lock;
int flag = 0;
} bar_name;

BARINIT (bar_name) {
LOCKINIT(bar_name.lock);
bar_name.counter = 0;
}

BARRIER (bar_name, P) {
int my_count;
LOCK (bar_name.lock);
if (!bar_name.counter) {
bar_name.flag = 0; /* first one */
}
my_count = ++bar_name.counter;
UNLOCK (bar_name.lock);
if (my_count == P) {
bar_name.counter = 0;
bar_name.flag = 1; /* last one */
}
else {
while (!bar_name.flag);
}
}

Sense reversal


The last implementation fails to work for two consecutive barrier invocations
Need to prevent a process from entering a barrier instance until all have left the
previous instance
Reverse the sense of a barrier i.e. every other barrier will have the same sense:
basically attach parity or sense to a barrier

BARRIER (bar_name, P) {
local_sense = !local_sense; /* this is private per processor */
LOCK (bar_name.lock);
bar_name.counter++;
if (bar_name.counter == P) {
UNLOCK (bar_name.lock);
bar_name.counter = 0;
bar_name.flag = local_sense;
}
else {
UNLOCK (bar_name.lock);
while (bar_name.flag != local_sense);
}
}


Centralized barrier

How fast is it?


Assume that the program is perfectly balanced and hence all processors arrive at the
barrier at the same time
Latency is proportional to P due to the critical section (assume that the lock algorithm
exhibits at most O(P) latency)
The amount of traffic of acquire section (the CS) depends on the lock algorithm; after
everyone has settled in the waiting loop the last processor will generate a BusRdX
during release (flag write) and others will subsequently generate BusRd before
releasing: O(P)
Scalability turns out to be low partly due to the critical section and partly due to O(P)
traffic of release
No fairness in terms of who exits first

Tree barrier

Does not need a lock, only uses flags


Arrange the processors logically in a binary tree (higher degree also possible)
Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag while the
other sets it on arrival)
One of them moves up the tree to participate in the next level of the barrier
Introduces concurrency in the barrier algorithm since independent subtrees can
proceed in parallel
Takes log(P) steps to complete the acquire
A fixed processor starts a downward pass of release waking up other processors that
in turn set other flags
Shows much better scalability compared to centralized barriers in DSM
multiprocessors; the advantage in small bus-based systems is not much, since all
transactions are any way serialized on the bus; in fact the additional log (P) delay may
hurt performance in bus-based SMPs

TreeBarrier (pid, P) {
  unsigned int i, mask;
  for (i = 0, mask = 1; (mask & pid) != 0; ++i, mask <<= 1) {
    while (!flag[pid][i]);
    flag[pid][i] = 0;
  }
  if (pid < (P - 1)) {
    flag[pid + mask][i] = 1;
    while (!flag[pid][MAX - 1]);
    flag[pid][MAX - 1] = 0;
  }
  for (mask >>= 1; mask > 0; mask >>= 1) {
    flag[pid - mask][MAX - 1] = 1;
  }
}

Convince yourself that this works


Take 8 processors and arrange them on leaves of a tree of depth 3
You will find that only odd nodes move up at every level during acquire (implemented in the
first for loop)
The even nodes just set the flags (the first statement in the if condition): they bail out of the
first loop with mask=1
The release is initiated by the last processor in the last for loop; only odd nodes execute this
loop (7 wakes up 3, 5, 6; 5 wakes up 4; 3 wakes up 1, 2; 1 wakes up 0)
Each processor will need at most log (P) + 1 flags
Avoid false sharing: allocate each processor’s flags on a separate chunk of cache lines
With some memory wastage (possibly worth it) allocate each processor’s flags on a separate
page and map that page locally in that processor’s physical memory
Avoid remote misses in DSM multiprocessor
Does not matter in bus-based SMPs
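
As a concrete illustration of the padding advice above, a minimal sketch assuming 128-byte cache lines and at most 128 processors (so MAX = 8 flag levels); the proc_flags_t name and the GCC-style aligned attribute are assumptions of this example, not part of the original code.

#define CACHE_LINE 128                  /* assumed cache line size in bytes */
#define MAX 8                           /* log2(P) + 1 levels for P <= 128 */

/* Each processor's flags live in their own cache-line-aligned block, so one
   processor spinning on its flags never false-shares with another processor's
   flags. flag[pid][i] of the pseudocode becomes flag[pid].level[i] here. */
typedef struct {
  volatile int level[MAX];
  char pad[CACHE_LINE - MAX * sizeof(int)];
} __attribute__((aligned(CACHE_LINE))) proc_flags_t;

proc_flags_t flag[128];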

Hardware support

Read broadcast
Possible to reduce the number of bus transactions from P-1 to 1 in the best case
A processor seeing a read miss to flag location (possibly from a fellow processor)
backs off and does not put its read miss on the bus
Every processor picks up the read reply from the bus and the release completes with
one bus transaction
Needs special hardware/compiler support to recognize these flag addresses and resort
to read broadcast

Hardware barrier

Useful if frequency of barriers is high


Need a couple of wired-AND bus lines: one for odd barriers and one for even barriers
A processor arrives at the barrier and asserts its input line and waits for the wired-AND
line output to go HIGH
Not very flexible: assumes that all processors will always participate in all barriers
Bigger problem: what if multiple processes belonging to the same parallel program are
assigned to each processor?
No SMP supports it today
However, possible to provide flexible hardware barrier support in the memory controller
of DSM multiprocessors: memory controller can recognize accesses to special barrier
counter or barrier flag, combine them in memory and reply to processors only when the
barrier is complete (no retry due to failed lock)

Speculative synch.

Speculative synchronization
Basic idea is to introduce speculation in the execution of critical sections
Assume that no other processor will have conflicting data accesses in the critical
section and hence don’t even try to acquire the lock
Just venture into the critical section and start executing
Note the difference between this and speculative execution of critical section due to
speculation on the branch following SC: there you still contend for the lock generating
network transactions
Martinez and Torrellas. In ASPLOS 2002.
Rajwar and Goodman. In ASPLOS 2002.
We will discuss Martinez and Torrellas

Why is it good?

In many cases compiler/user inserts synchronization conservatively


Hard to know exact access pattern
The addresses accessed may depend on input
Take a simple example of a hash table
When the hash table is updated by two processes you really do not know which bins
they will insert into
So you conservatively make the hash table access a critical section

For certain input values it may happen that the processes could actually update the
hash table concurrently
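
A hypothetical sketch of that hash table scenario, written with the LOCK/UNLOCK/LOCKDEC conventions used elsewhere in these notes (NBINS and hash() are assumed helpers): the single table lock serializes every insert even though, for many inputs, two processes hash to different bins and could have proceeded concurrently.

struct bucket { int key; int val; struct bucket *next; };
struct bucket *table[NBINS];
LOCKDEC(table_lock);                  /* one conservative lock for the table */

void insert(struct bucket *b) {
  int bin = hash(b->key) % NBINS;     /* which bin depends on run-time input */
  LOCK(table_lock);                   /* conservative critical section */
  b->next = table[bin];               /* link into the chosen bin */
  table[bin] = b;
  UNLOCK(table_lock);
}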

How does it work?

Speculative locks
Every processor comes to the critical section and tries to acquire the lock
One of them succeeds and the rest fail
The successful processor becomes the safe thread
The failed ones don’t retry but venture into the critical section speculatively as if they
have the lock; at this point a speculative thread also takes a checkpoint of its register
state in case a rollback is needed
The safe thread executes the critical section as usual
The speculative threads are allowed to consume values produced by the safe thread
but not by the other speculative threads
All stores from a speculative thread are kept inside its cache hierarchy in a special
“speculative modified” state; these lines cannot be sent to memory until it is known to
be safe; if such a line is replaced from cache either it can be kept in a small buffer or
the thread can be stalled
Speculative locks (continued)
If a speculative thread receives a request for a cache line that is in speculative M
state, that means there is a data race inside the critical section and by design the
receiver thread is rolled back to the beginning of critical section
Why can’t the requester thread be rolled back?
In summary, the safe thread is never squashed and the speculative threads are not
squashed if there is no cross-thread data race
If a speculative thread finishes executing the critical section without getting squashed,
it still must wait for the safe thread to finish the critical section before committing the
speculative state (i.e. changing speculative M lines to M); why?
Speculative locks (continued)
Upon finishing the critical section, a speculative thread can continue executing beyond
the CS, but still remaining in speculative mode
When the safe thread finishes the CS all speculative threads that have already
completed CS, can commit in some non-deterministic order and revert to normal
execution
The speculative threads that are still inside the critical section remain speculative; a
dedicated hardware unit elects one of them the lock owner and that becomes the safe
non-speculative thread; the process continues
Clearly, under favorable conditions speculative synchronization can reduce lock
contention enormously

Why is it correct?

In a non-speculative setting there is no order in which the threads execute the CS


Even if there is an order that must be enforced by the program itself
In speculative synchronization some threads are considered safe (depends on time of arrival)
and there is exactly one safe thread at a time in a CS
The speculative threads behave as if they complete the CS in some order after the safe
thread(s)
A read from a thread (spec. or safe) after a write from another speculative thread to the same
cache line triggers a squash
It may not be correct to consume the speculative value
Same applies to write after write

Performance concerns

Maintaining a safe thread guarantees forward progress


Otherwise if all were speculative, cross-thread races may repeatedly squash all of them
False sharing?
What if two bins of a hash table belong to the same cache line?
Two threads are really not accessing the same address, but the speculative thread will
still suffer from a squash
Possible to maintain per-word speculative state

Speculative flags and barriers

Speculative flags are easy to support


Just continue past an unset flag in speculative mode
The thread that sets the flag is always safe
The thread(s) that read the flag will speculate
Speculative barriers come for free
Barriers use locks and flags
However, since the critical section in a barrier accesses a counter, multiple threads
venturing into the CS are guaranteed to have conflicts
So just speculate on the flag and let the critical section be executed conventionally

Speculative flags and branch prediction

P0: A=1; flag=1;


P1: while (!flag); print A;
Assembly of P1’s code
Loop: lw register, flag_addr
beqz register, Loop

What if I pass a hint via the compiler (say, a single bit in each branch instruction) to the
branch predictor asking it to always predict not taken for this branch?
Isn’t it achieving the same effect as speculative flag, but with a much simpler
technique? No.

Module 12: "Multiprocessors on a Snoopy Bus"


Lecture 24: "Write Serialization in a Simple Design"

Multiprocessors on A Snoopy Bus

Agenda

Correctness goals

A simple design

Cache controller

Snoop logic

Writebacks

A simple design

Inherently non-atomic

Write serialization

Fetch deadlock

Livelock

Starvation

More on LL/SC

Multi-level caches

[From Chapter 6 of Culler, Singh, Gupta]


Agenda

Goal is to understand what influences the performance, cost and scalability of SMPs
Details of physical design of SMPs
At least three goals of any design: correctness, performance, low hardware complexity
Performance gains are normally achieved by pipelining memory transactions and
having multiple outstanding requests
These performance optimizations occasionally introduce new protocol races involving
transient states leading to correctness issues in terms of coherence and consistency

Correctness goals

Must enforce coherence and write serialization


Recall that write serialization guarantees all writes to a location to be seen in the same
order by all processors
Must obey the target memory consistency model
If sequential consistency is the goal, the system must provide write atomicity and
detect write completion correctly (write atomicity extends the definition of write
serialization for any location i.e. it guarantees that positions of writes within the total
order seen by all processors be the same)
Must be free of deadlock, livelock and starvation
Starvation confined to a part of the system is not as problematic as deadlock and
livelock
However, system-wide starvation leads to livelock

A simple design

Start with a rather naïve design


Each processor has a single level of data and instruction caches
The cache allows exactly one outstanding miss at a time i.e. a cache miss request is
blocked if already another is outstanding (this serializes all bus requests from a
particular processor)
The bus is atomic i.e. it handles one request at a time

Cache controller

Must be able to respond to bus transactions as necessary


Handled by the snoop logic
The snoop logic should have access to the cache tags
A single set of tags cannot allow concurrent accesses by the processor-side and the
bus-side controllers
When the snoop logic accesses the tags the processor must remain locked out from
accessing the tags
Possible enhancements: two read ports in the tag RAM allows concurrent reads;
duplicate copies are also possible; multiple banks reduce the contention also
In all cases, updates to tags must still be atomic or must be applied to both copies in
case of duplicate tags; however, tag updates are a lot less frequent compared to reads

Snoop logic

Couple of decisions need to be taken while designing the snoop logic


How long should the snoop decision take?
How should processors convey the snoop decision?
Snoop latency (three design choices)
Possible to set an upper bound in terms of number of cycles; advantage: no change in
memory controller hardware; disadvantage: potentially large snoop latency (Pentium Pro,
Sun Enterprise servers)
The memory controller samples the snoop results every cycle until all caches have
completed snoop (SGI Challenge uses this approach where the memory controller
fetches the line from memory, but stalls if all caches haven’t yet snooped)
Maintain a bit per memory line to indicate if it is in M state in some cache
Conveying snoop result
For MESI the bus is augmented with three wired-OR snoop result lines (shared,
modified, valid); the valid line is active low
The original Illinois MESI protocol requires cache-to-cache transfer even when the line
is in S state; this may complicate the hardware enormously due to the involved priority
mechanism
Commercial MESI protocols normally allow cache-to-cache sharing only for lines in M
state
SGI Challenge and Sun Enterprise allow cache-to-cache transfers only in M state;
Challenge updates memory when going from M to S while Enterprise exercises a
MOESI protocol

Writebacks

Writebacks are essentially eviction of modified lines


Caused by a miss mapping to the same cache index
Needs two bus transactions: one for the miss and one for the writeback
Definitely the miss should be given first priority since this directly impacts forward
progress of the program
Need a writeback buffer (WBB) to hold the evicted line until the bus can be acquired for
the second time by this cache
In the meantime a new request from another processor may be launched for the evicted
line: the evicting cache must provide the line from the WBB and cancel the pending
writeback (need an address comparator with WBB)
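
A rough sketch of the address comparator mentioned above (structure and names invented for this example): on every snoop the controller matches the snooped line address against the writeback buffer and, on a hit, sources the line from the buffer and cancels the pending writeback.

#include <string.h>

#define LINE_SIZE 128                 /* assumed cache line size in bytes */

struct wbb {
  int valid;
  unsigned long addr;                 /* line-aligned address of the evicted line */
  unsigned char data[LINE_SIZE];
};

/* Called by the snoop logic for every observed bus transaction; returns
   nonzero if the writeback buffer serviced the request. */
int wbb_snoop(struct wbb *w, unsigned long snoop_addr, unsigned char *reply) {
  if (w->valid && w->addr == snoop_addr) {
    memcpy(reply, w->data, LINE_SIZE);  /* source the line from the WBB */
    w->valid = 0;                       /* cancel the pending writeback */
    return 1;
  }
  return 0;
}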

Inherently non-atomic

Even though the bus is atomic, a complete protocol transaction involves quite a few steps
which together form a non-atomic transaction
Issuing processor request
Looking up cache tags
Arbitrating for bus
Snoop action in other cache controller
Refill in requesting cache controller at the end
Different requests from different processors may be in a different phase of a transaction
This makes a protocol transition inherently non-atomic
Consider an example
P0 and P1 have cache line C in shared state
Both proceed to write the line
Both cache controllers look up the tags, put a BusUpgr into the bus request queue, and
start arbitrating for the bus
P1 gets the bus first and launches its BusUpgr
P0 observes the BusUpgr and now it must invalidate C in its cache and change the
request type to BusRdX
So every cache controller needs to do an associative lookup of the snoop address
against its pending request queue and depending on the request type take appropriate
actions
One way to reason about the correctness is to introduce transient states
Possible to think of the last problem as the line C being in a transient S→M state
On observing a BusUpgr or BusRdX, this state transitions to I→M which is also
transient
The line C goes to stable M state only after the transaction completes
These transient states are not really encoded in the state bits of a cache line because
at any point in time there will be a small number of outstanding requests from a
particular processor (today the maximum I know of is 16)
These states are really determined by the state of an outstanding line and the state of
the cache controller

Write serialization

Atomic bus makes it rather easy, but optimizations are possible


Consider a processor write to a shared cache line
Is it safe to continue with the write and change the state to M even before the bus
transaction is complete?
After the bus transaction is launched it is totally safe because the bus is atomic and
hence the position of the write is committed in the total order; therefore no need to wait
any further (note that the exact point in time when the other caches invalidate the line
is not important)
If the processor decides to proceed even before the bus transaction is launched (very
much possible in ooo execution), the cache controller must take the responsibility of
squashing and re-executing offending instructions so that the total order is consistent
across the system

Fetch deadlock

Just a fancy name for a pretty intuitive deadlock


Suppose P0’s cache controller is waiting to get the bus for launching a BusRdX to
cache line A
P1 has a modified copy of cache line A
P1 has launched a BusRd to cache line B and awaiting completion
P0 has a modified copy of cache line B
If both keep on waiting without responding to snoop requests, the deadlock cycle is
pretty obvious
So every controller must continue to respond to snoop requests while waiting for the
bus for its own requests
Normally the cache controller is designed as two separate independent logic units,
namely, the inbound unit (handles snoop requests) and the outbound unit (handles own
requests and arbitrates for bus)

Livelock

Consider the following example


P0 and P1 try to write to the same cache line
P0 gets exclusive ownership, fills the line in cache and notifies the load/store unit (or
retirement unit) to retry the store
While all these are happening P1’s request appears on the bus and P0’s cache
controller modifies tag state to I before the store could retry
This can easily lead to a livelock
Normally this is avoided by giving the load/store unit higher priority for tag access (i.e.
the snoop logic cannot modify the tag arrays when there is a processor access
pending in the same clock cycle)
This is even rarer in multi-level cache hierarchy (more later)

Starvation

Some amount of fairness is necessary in the bus arbiter


An FCFS policy is possible for granting bus, but that needs some buffering in the
arbiter to hold already placed requests
Most machines implement an aging scheme which keeps track of the number of times
a particular request is denied and when the count crosses a threshold that request
becomes the highest priority (this too needs some storage)

More on LL/SC

We have seen that both LL and SC may suffer from cache misses (a read followed by an
upgrade miss)
Is it possible to save one transaction?
What if I design my cache controller in such a way that it can recognize LL instructions
and launch a BusRdX instead of BusRd?
This is called Read-for-Ownership (RFO); also used by Intel atomic xchg instruction
Nice idea, but you have to be careful
By doing this you have just enormously increased the probability of a livelock: before
the SC executes there is a high probability that another LL will take away the line
Possible solution is to buffer incoming snoop requests until the SC completes (buffer
space is proportional to P); may introduce new deadlock cycles (especially for modern
non-atomic busses)
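
For reference, the LL/SC pair under discussion is typically used as in the sketch below, where LL() and SC() are hypothetical intrinsics standing for the load-linked and store-conditional instructions. Without RFO the LL takes a BusRd miss and the SC then needs an upgrade; with RFO the LL itself issues BusRdX, saving one transaction but making it more likely that a competing LL steals the line before this SC completes.

/* Spin lock acquire built from load-linked/store-conditional.
   LL(addr) returns the current value and sets a reservation on the line;
   SC(addr, val) succeeds (returns nonzero) only if the reservation is
   still intact, i.e., no other write to the line intervened. */
void lock_acquire(volatile int *lock) {
  for (;;) {
    while (LL(lock) != 0);            /* spin until the lock looks free */
    if (SC(lock, 1))                  /* try to claim it atomically */
      break;                          /* success */
    /* SC failed: someone intervened between LL and SC; retry */
  }
}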

Multi-level caches

We have talked about multi-level caches and the involved inclusion property
Multiprocessors create new problems related to multi-level caches
A bus snoop result may be relevant to inner levels of cache e.g., bus transactions are
not visible to the first level cache controller
Similarly, modifications made in the first level cache may not be visible to the second
level cache controller which is responsible for handling bus requests
Inclusion property makes it easier to maintain coherence
Since L1 cache is a subset of L2 cache a snoop miss in L2 cache need not be sent to
L1 cache

Module 12: "Multiprocessors on a Snoopy Bus"


Lecture 25: "Protocols for Split-transaction Buses"

Recap of inclusion

Inclusion and snoop

L2 to L1 interventions

Invalidation acks?

Intervention races

Tag RAM design

Exclusive cache levels

Split-transaction bus

New issues

SGI Powerpath-2 bus

Bus interface logic

Snoop results

[From Chapter 6 of Culler, Singh, Gupta]

Recap of inclusion

A processor read
Looks up L1 first and in case of miss goes to L2, and finally may need to launch a
BusRd request if it misses in L2
Finally, the line is in S state in both L1 and L2
A processor write
Looks up L1 first and if it is in I state sends a ReadX request to L2 which may have
the line in M state
In case of L2 hit, the line is filled in M state in L1
In case of L2 miss, if the line is in S state in L2 it launches BusUpgr; otherwise it
launches BusRdX; finally, the line is in state M in both L1 and L2
If the line is in S state in L1, it sends an upgrade request to L2 and either there is an
L2 hit or L2 just conveys the upgrade to bus (Why can’t it get changed to BusRdX?)
L1 cache replacement
Replacement of a line in S state may or may not be conveyed to L2
Replacement of a line in M state must be sent to L2 so that it can hold the most up-to-
date copy
The line is in I state in L1 after replacement, the state of line remains unchanged in L2
L2 cache replacement
Replacement of a line in S state may or may not generate a bus transaction; it must
send a notification to the L1 caches so that they can invalidate the line to maintain
inclusion
Replacement of a line in M state first asks the L1 cache to send all the relevant L1
lines (these are the most up-to-date copies) and then launches a BusWB
The state of line in both L1 and L2 is I after replacement
Replacement of a line in E state from L1?
Replacement of a line in E state from L2?
Replacement of a line in O state from L1?
Replacement of a line in O state from L2?
In summary
A line in S state in L2 may or may not be in L1 in S state
A line in M state in L2 may or may not be in L1 in M state; Why? Can it be in S state?
A line in I state in L2 must not be present in L1

Inclusion and snoop

BusRd snoop
Look up L2 cache tag; if in I state no action; if in S state no action; if in M state assert
wired-OR M line, send read intervention to L1 data cache, L1 data cache sends lines
back, L2 controller launches line on bus, both L1 and L2 lines go to S state
BusRdX snoop
Look up L2 cache tag; if in I state no action; if in S state invalidate and also notify L1;
if in M state assert wired-OR M line, send readX intervention to L1 data cache, L1 data
cache sends lines back, L2 controller launches line on bus, both L1 and L2 lines go to
I state
BusUpgr snoop
Similar to BusRdX without the cache line flush

L2 to L1 interventions

Two types of interventions


One is read/readX intervention that requires data reply
Other is plain invalidation that does not need data reply
Data interventions can be eliminated by making L1 cache write-through
But introduces too much of write traffic to L2
One possible solution is to have a store buffer that can handle the stores in
background obeying the available BW, so that the processor can proceed
independently; this can easily violate sequential consistency unless store buffer also
becomes a part of snoop logic
Useless invalidations can be eliminated by introducing an inclusion bit in L2 cache state

Invalidation acks?

On a BusRdX or BusUpgr in case of a snoop hit in S state L2 cache sends invalidation to L1 caches
Does the snoop logic wait for an invalidation acknowledgment from L1 cache before
the transaction can be marked complete?
Do we need a two-phase mechanism?
What are the issues?

Intervention races

Writebacks introduce new races in multi-level cache hierarchy


Suppose L2 sends a read intervention to L1 and in the meantime L1 decides to
replace that line (due to some conflicting processor access)
The intervention will naturally miss the up-to-date copy
When the writeback arrives at L2, L2 realizes that the intervention race has occurred
(need extra hardware to implement this logic; what hardware?)
When the intervention reply arrives from L1, L2 can apply the newly received writeback
and launch the line on bus
Exactly same situation may arise even in uniprocessor if a dirty replacement from L2
misses the line in L1 because L1 just replaced that line too

Tag RAM design

A multi-level cache hierarchy reduces tag contention


L1 tags are mostly accessed by the processor because L2 cache acts as a filter for
external requests
L2 tags are mostly accessed by the system because hopefully L1 cache can absorb
most of the processor traffic
Still some machines maintain duplicate tags at all or the outermost level only

Exclusive cache levels

AMD K7 (Athlon XP) and K8 (Athlon64, Opteron) architectures chose to have exclusive levels
of caches instead of inclusive
Definitely provides you much better utilization of on-chip caches since there is no
duplication
But complicates many issues related to coherence
The uniprocessor protocol is to refill requested lines directly into L1 without placing a
copy in L2; only on an L1 eviction put the line into L2; on an L1 miss look up L2 and in
case of L2 hit replace line from L2 and put it in L1 (may have to replace multiple L1
lines to accommodate the full L2 line; not sure what K8 does: possible to maintain
inclusion bit per L1 line sector in L2 cache)
For multiprocessors one solution could be to have one snoop engine per cache level
and a tournament logic that selects the successful snoop result

Split-transaction bus

Atomic bus leads to underutilization of bus resources


Between the address is taken off the bus and the snoop responses are available the
bus stays idle
Even after the snoop result is available the bus may remain idle due to high memory
access latency
Split-transaction bus divides each transaction into two parts: request and response
Between the request and response of a particular transaction there may be other
requests and/or responses from different transactions
Outstanding transactions that have not yet started or have completed only one phase
are buffered in the requesting cache controllers

New issues

Split-transaction bus introduces new protocol races


P0 and P1 have a line in S state and both issue BusUpgr, say, in consecutive cycles
Snoop response arrives later because it takes time
Now both P0 and P1 may think that they have ownership
Flow control is important since buffer space is finite
In-order or out-of-order response?
Out-of-order response may better tolerate variable memory latency by servicing other
requests
Pentium Pro uses in-order response
SGI Challenge and Sun Enterprise use out-of-order response i.e. no ordering is
enforced

SGI Powerpath-2 bus

Used in SGI Challenge


Conflicts are resolved by not allowing multiple bus transactions to the same cache line
Allows eight outstanding requests on the bus at any point in time
Flow control on buffers is provided by negative acknowledgments (NACKs): the bus
has a dedicated NACK line which remains asserted if the buffer holding outstanding
transactions is full; a NACKed transaction must be retried
The request order determines the total order of memory accesses, but the responses
may be delivered in a different order depending on the completion time of them
In subsequent slides we call this design Powerpath-2 since it is loosely based on that
Logically two separate buses
Request bus for launching the command type (BusRd, BusWB etc.) and the involved
address
Response bus for providing the data response, if any
Since responses may arrive in an order different from the request order, a 3-bit tag is
assigned to each request
Responses launch this tag on the tag bus along with the data reply so that the
address bus may be left free for other requests
The data bus is 256-bit wide while a cache line is 128 bytes
One data response phase needs four bus cycles along with one additional hardware turnaround cycle

SGI Powerpath-2 bus

Essentially two main buses and various control wires for snoop results, flow control etc.
Address bus: five cycle arbitration, used during request
Data bus: five cycle arbitration, five cycle transfer, used during response
Three different transactions may be in one of these three phases at any point in time

Forming a total order


After the decode cycle during request phase every cache controller takes appropriate
coherence actions i.e. BusRd downgrades M line to S, BusRdX invalidates line
If a cache controller does not get the tags due to contention with the processor, it simply
lengthens the ack phase beyond one cycle
Thus the total order is formed during the request phase itself i.e. the position of each request in
the total order is determined at that point
BusWB case
BusWB only needs the request phase
However needs both address and data lines together
Must arbitrate for both together
BusUpgr case
Consists only of the request phase
No response or acknowledgment
As soon as the “ack” phase of address arbitration is completed by the issuing node, the
upgrade has sealed a position in the total order and hence is marked complete by sending a
completion signal to the issuing processor by its local bus controller (each node has its own bus
controller to handle bus requests)

Bus interface logic

A request table entry is freed when the response is observed on the bus

Snoop results

Three snoop wires: shared, modified, inhibit (all wired-OR)


The inhibit wire helps in holding off snoop responses until the data response is launched
on the bus
Although the request phase determines who will source the data i.e. some cache or
memory, the memory controller does not know it
The cache with a modified copy keeps the inhibit line asserted until it gets the data bus
and flushes the data; this prevents memory controller from sourcing the data
Otherwise memory controller arbitrates for the data bus
When the data appears all cache controllers appropriately assert the shared and
modified line
Why not launch snoop results as soon as they are available?

Module 12: "Multiprocessors on a Snoopy Bus"


Lecture 26: "Case Studies"

Conflict resolution

Path of a cache miss

Write serialization

Write atomicity and SC

Another example

In-order response

Multi-level caches

Dependence graph

Multiple outstanding requests

SGI Challenge

Sun Enterprise

Sun Gigaplane bus

[From Chapter 6 of Culler, Singh, Gupta]

Conflict resolution

Use the pending request table to resolve conflicts


Every processor has a copy of the table
Before arbitrating for the address bus every processor looks up the table to see if there
is a match
In case of a match the request is not issued and is held in a pending buffer
Flow control is needed at different levels
Essentially need to detect if any buffer is full
SGI Challenge uses a separate NACK line for each of address and data phases
Before the phases reach the “ack” cycle any cache controller can assert the NACK line
if it runs out of some critical buffer; this invalidates the transaction and the requester
must retry (may use back-off and/or priority)
Sun Enterprise requires the receiver to generate the retry when it has buffer space
(thus only one retry)

Path of a cache miss

Assume a read miss


Look up request table; in case of a match with BusRd just mark the entry indicating
that this processor will snoop the response from the bus and that it will also assert the
shared line
In case of a request table hit with BusRdX the cache controller must hold on to the
request until the conflict resolves
In case of a request table miss the requester arbitrates for address bus; while
arbitrating if a conflicting request arrives, the controller must put a NOP transaction
within the slot it is granted and hold on to the request until the conflict resolves
Suppose the requester succeeds in putting the request on address/command bus
Other cache controllers snoop the request, register it in request table (the requester
also does this), take appropriate coherence action within own cache hierarchy, main
memory also starts fetching the cache line
If a cache holds the line in M state it should source it on bus during response phase; it
keeps the inhibit line asserted until it gets the data bus; then it lowers inhibit line and
asserts the modified line; at this point the memory controller aborts the data
fetch/response and instead fields the line from the data bus for writing back
If the memory fetches the line even before the snoop is complete, the inhibit line will not allow
the memory controller to launch the data on bus
After the inhibit line is lowered depending on the state of the modified line memory
cancels the data response
If no one has the line in M state, the requester grabs the response from memory
A store miss is similar
Only difference is that even if a cache has the line in M state, the memory controller
does not write the response back
Also any pending BusUpgr to the same cache line must be converted to BusRdX

Write serialization

In a split-transaction bus setting, the request table provides sufficient support for write serialization
Requests to the same cache line are not allowed to proceed at the same time
A read to a line after a write to the same line can be launched only after the write
response phase has completed; this guarantees that the read will see the new value
A write after a read to the same line can be started only after the read response has
completed; this guarantees that the value of the read cannot be altered by the value
written

Write atomicity and SC

Sequential consistency (SC) requires write atomicity i.e. total order of all writes seen by all
processors should be identical
Since a BusRdX or BusUpgr does not wait until the invalidations are actually applied to
the caches, you have to be careful

P0: A=1; B=1;


P1: print B; print A

Under SC (A, B) = (0, 1) is not allowed


Suppose to start with P1 has the line containing A in cache, but not the line containing
B
The stores of P0 queue the invalidation of A in P1’s cache controller
P1 takes read miss for B, but the response of B is re-ordered by P1’s cache
controller so that it overtakes the invalidation (though it may be better to prioritize
reads)

Another example

P0: A=1; print B;

P1: B=1; print A;

Under SC (A, B) = (0, 0) is not allowed


Same problem if P0 executes both instructions first, then P1 executes the write of B (which
let’s assume generates an upgrade so that it is marked complete as soon as the address
arbitration phase finishes), then the upgrade completion is re-ordered with the pending
invalidation of A
So, the reason these two cases fail is that the new values are made visible before older
invalidations are applied
One solution is to have a strict FIFO queue between the bus controller and the cache
hierarchy
But it is sufficient as long as replies do not overtake invalidations; otherwise the bus
responses can be re-ordered without violating write atomicity and hence SC (e.g., if there are
only read and write responses in the queue, it sometimes may make sense to prioritize read
responses)

In-order response

In-order response can simplify quite a few things in the design


The fully associative request table can be replaced by a FIFO queue
Conflicting requests where one is a write can actually be allowed now (multiple reads
were allowed even before although only the first one actually appears on the bus)
Consider a BusRdX followed by a BusRd from two different processors
With in-order response it is guaranteed that the BusRdX response will be granted the
data bus before the BusRd response (which may not be true for ooo response and
hence such a conflict is disallowed)
So when the cache controller generating the BusRdX sees the BusRd it only notes that
it should source the line for this request after its own write is completed
The performance penalty may be huge
Essentially because of the memory
Consider a situation where three requests are pending to cache lines A, B, C in that
order
A and B map to the same memory bank while C is in a different bank
Although the response for C may be ready long before that of B, it cannot get the bus

Multi-level caches

Split-transaction bus makes the design of multi-level caches a little more difficult
The usual design is to have queues between levels of caches in each direction
How do you size the queues? Between processor and L1 one buffer is sufficient
(assume one outstanding processor access), L1-to-L2 needs P+1 buffers (why?), L2-to-
L1 needs P buffers (why?), L1 to processor needs one buffer
With smaller buffers there is a possibility of deadlock: suppose the L1-to-L2 and L2-to-L1
have one queue entry each, there is a request in L1-to-L2 queue and there is also an
intervention in L2-to-L1 queue; clearly L1 cannot pick up the intervention because it
does not have space to put the reply in L1-to-L2 queue while L2 cannot pick up the
request because it might need space in L2-to-L1 queue in case of an L2 hit
Formalizing the deadlock with dependence graph
There are four types of transactions in the cache hierarchy: 1. Processor requests
(outbound requests), 2. Responses to processor requests (inbound responses), 3.
Interventions (inbound requests), 4. Intervention responses (outbound responses)
Processor requests need space in L1-to-L2 queue; responses to processors need space
in L2-to-L1 queue; interventions need space in L2-to-L1 queue; intervention responses
need space in L1-to-L2 queue
Thus a message in L1-to-L2 queue may need space in L2-to-L1 queue (e.g. a processor
request generating a response due to L2 hit); also a message in L2-to-L1 queue may
need space in L1-to-L2 queue (e.g. an intervention response)
This creates a cycle in queue space dependence graph

Dependence graph

Represent a queue by a vertex in the graph


Number of vertices = number of queues
A directed edge from vertex u to vertex v is present if a message at the head of queue u may
generate another message which requires space in queue v
In our case we have two queues
L2-L1 and L1-L2; the graph is not a DAG, hence deadlock

Multi-level caches

In summary
L2 cache controller refuses to drain L1-to-L2 queue if there is no space in L2-to-L1
queue; this is rather conservative because the message at the head of L1-to-L2 queue
may not need space in L2-to-L1 queue e.g., in case of L2 miss or if it is an intervention
reply; but after popping the head of L1-to-L2 queue it is impossible to backtrack if the
message does need space in L2-to-L1 queue
Similarly, L1 cache controller refuses to drain L2-to-L1 queue if there is no space in L1-
to-L2 queue
How do we break this cycle?
Observe that responses for processor requests are guaranteed not to generate any more
messages and intervention requests do not generate new requests, but can only
generate replies
Solving the queue deadlock
Introduce one more queue in each direction i.e. have a pair of queues in each direction
L1-to-L2 processor request queue and L1-to-L2 intervention response queue
Similarly, L2-to-L1 intervention request queue and L2-to-L1 processor response queue
Now L2 cache controller can serve L1-to-L2 processor request queue as long as there is
space in L2-to-L1 processor response queue, but there is no constraint on L1 cache
controller for draining L2-to-L1 processor response queue
Similarly, L1 cache controller can serve L2-to-L1 intervention request queue as long as
there is space in L1-to-L2 intervention response queue, but L1-to-L2 intervention
response queue will drain as soon as bus is granted

Dependence graph

Now we have four queues


Processor request (PR) and intervention reply (IY) are L1 to L2
Processor reply (PY) and intervention request (IR) are L2 to L1

Possible to combine PR and IY into a supernode of the graph and still be cycle-free
Leads to one L1 to L2 queue
Similarly, possible to combine IR and PY into a supernode
Leads to one L2 to L1 queue
Cannot do both
Leads to cycle as already discussed
Bottomline: need at least three queues for two-level cache hierarchy
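
The queue-dependence argument can be checked mechanically. The following illustrative sketch (not tied to any particular simulator) encodes "a message of class u may generate a message of class v" as edges, maps the four classes onto physical queues, and reports deadlock whenever two queues may each need space in the other; it confirms that four queues and the three-queue organization are safe while a single queue per direction is not.

#include <stdio.h>

/* Message classes: PR = processor request (L1 to L2), IY = intervention
   reply (L1 to L2), PY = processor reply (L2 to L1), IR = intervention
   request (L2 to L1). */
enum { PR, IY, PY, IR, N };

/* gen[u][v] = 1 if a message of class u may generate one of class v: a
   processor request may need a processor reply (L2 hit) and an
   intervention request needs an intervention reply. */
static const int gen[N][N] = {
  [PR] = { [PY] = 1 },
  [IR] = { [IY] = 1 },
};

/* group[c] = physical queue holding class c. Two queues that may each
   need space in the other form a cycle, i.e., a potential deadlock. */
static int deadlock(const int group[N]) {
  int need[N][N] = {{0}}, a, b, u, v;
  for (u = 0; u < N; u++)
    for (v = 0; v < N; v++)
      if (gen[u][v] && group[u] != group[v])
        need[group[u]][group[v]] = 1;
  for (a = 0; a < N; a++)             /* with only two generation edges,   */
    for (b = 0; b < N; b++)           /* any cycle must be a 2-cycle       */
      if (a != b && need[a][b] && need[b][a])
        return 1;
  return 0;
}

int main(void) {
  int four_q[N]  = { 0, 1, 2, 3 };                              /* one queue per class     */
  int three_q[N] = { [PR] = 0, [IY] = 0, [PY] = 1, [IR] = 2 };  /* combine PR and IY       */
  int two_q[N]   = { [PR] = 0, [IY] = 0, [PY] = 1, [IR] = 1 };  /* one queue per direction */
  printf("four queues : %s\n", deadlock(four_q)  ? "deadlock" : "safe");
  printf("three queues: %s\n", deadlock(three_q) ? "deadlock" : "safe");
  printf("two queues  : %s\n", deadlock(two_q)   ? "deadlock" : "safe");
  return 0;
}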

Multiple outstanding requests

Today all processors allow multiple outstanding cache misses


We have already discussed issues related to ooo execution
Not much needs to be added on top of that to support multiple outstanding misses
For multi-level cache hierarchy the queue depths may be made bigger for performance
reasons
Various other buffers such as writeback buffer need to be made bigger

SGI Challenge

Supports 36 MIPS R4400 (4 per board) or 18 MIPS R8000 (2 per board)


A-chip has the address bus interface, request table
CC-chip handles coherence through the duplicate set of tags
Each D-chip handles 64 bits of data and as a whole 4 D-chips interface to a 256-bit wide data
bus

Sun Enterprise

Supports up to 30 UltraSPARC processors


2 processors and 1 GB memory per board
Wide 64-byte memory bus and hence two memory cycles to transfer the entire cache line (128
bytes)

Sun Gigaplane bus

Split-transaction, 256 bits data, 41 bits address, 83.5 MHz (compare to 47.6 MHz of SGI
Powerpath-2)
Supports 16 boards
112 outstanding transactions (up to 7 from each board)

Snoop result is available 5 cycles after the request phase


Memory fetches data speculatively
MOESI protocol

Module 12: "Multiprocessors on a Snoopy Bus"


Lecture 27: "Scalable Snooping and AMD Hammer Protocol"

Special Topics

Virtually indexed caches

Virtual indexing

TLB coherence

TLB shootdown

Snooping on a ring

Scaling bandwidth

AMD Opteron

Opteron servers

AMD Hammer protocol

[From Chapter 6 of Culler, Singh, Gupta]

Virtually indexed caches

Recall that to have concurrent accesses to TLB and cache, L1 caches are often made
virtually indexed
Can read the physical tag and data while the TLB lookup takes place
Later compare the tag for hit/miss detection
How does it impact the functioning of coherence protocols and snoop logic?
Even for uniprocessor the synonym problem
Two different virtual addresses may map to the same physical page frame
One simple solution may be to flush all cache lines mapped to a page frame at the
time of replacement
But this clearly prevents page sharing between two processes

Virtual indexing

Software normally employs page coloring to solve the synonym issue


Allow two virtual pages to point to the same physical page frame only if the two virtual
addresses have at least lower k bits common where k is equal to cache line block
offset plus log2 (number of cache sets)
This guarantees that in a virtually indexed cache, lines from both pages will map to the
same index range
What about the snoop logic?
Putting virtual address on the bus requires a VA to PA translation in the snoop so that
physical tags can be generated (adds extra latency to snoop and also requires
duplicate set of translations)
Putting physical address on the bus requires a reverse translation to generate the
virtual index (requires an inverted page table)
Dual tags (Goodman, 1987)
Hardware solution to avoid synonyms in shared memory
Maintain virtual and physical tags; each corresponding tag pair points to each other
Assume no page coloring
Use virtual address to look up cache (i.e. virtual index and virtual tag) from processor
side; if it hits everything is fine; if it misses use the physical address to look up the
physical tag and if it hits follow the physical tag to virtual tag pointer to find the index
If virtual tag misses and physical tag hits, that means the synonym problem has
happened i.e. two different VAs are mapped to the same PA; in this case invalidate the
cache line pointed to by physical tag, replace the line at the virtual index of the current
virtual address, place the contents of the invalidated line there and update the physical
tag pointer to point to the new virtual index
Goodman, 1987
Always use physical address for snooping
Obviates the need for a TLB in memory controller
The physical tag is used to look up the cache for snoop decision
In case of a snoop hit the pointer stored with the physical tag is followed to get the
virtual index and then the cache block can be accessed if needed (e.g., in M state)
Note that even if there are two different types of tags the state of a cache line is the
same and does not depend on what type of tag is used to access the line
Multi-level cache hierarchy

Normally the L1 cache is designed to be virtually indexed and other levels are
physically indexed
L2 sends interventions to L1 by communicating the PA
L1 must determine the virtual index from that to access the cache: dual tags are
sufficient for this purpose

TLB coherence

A page table entry (PTE) may be held in multiple processors in shared memory because all of
them access the same shared page
A PTE may get modified when the page is swapped out and/or access permissions are
changed
Must tell all processors having this PTE to invalidate
How to do it efficiently?
No TLB: virtually indexed virtually tagged L1 caches
On L1 miss directly access PTE in memory and bring it to cache; then use normal
cache coherence because the PTEs also reside in the shared memory segment
On page replacement the page fault handler can flush the cache line containing the
replaced PTE
Too impractical: fully virtual caches are rare, still uses a TLB for upper levels (Alpha
21264 instruction cache)
Hardware solution
Extend snoop logic to handle TLB coherence
PowerPC family exercises a tlbie instruction (TLB invalidate entry)
When OS modifies a PTE it puts a tlbie instruction on bus
Snoop logic picks it up and invalidates the TLB entry if present in all processors
This is well suited for bus-based SMPs, but not for DSMs because broadcast in a
large-scale machine is not good

TLB shootdown

Popular TLB coherence solution


Invoked by an initiator (the processor which modifies the PTE) by sending interrupt to
processors that might be caching the PTE in TLBs; before doing so OS also locks the
involved PTE to avoid any further access to it in case of TLB misses from other
processors
The receiver of the interrupt simply invalidates the involved PTE if it is present in its
TLB and sets a flag in shared memory on which the initiator polls
On completion the initiator unlocks the PTE
SGI Origin uses a lazy TLB shootdown i.e. it invalidates a TLB entry only when a
processor tries to access it next time (will discuss in detail)

Snooping on a ring

Length of the bus limits the frequency at which it can be clocked which in turn limits the
bandwidth offered by the bus leading to a limited number of processors
A ring interconnect provides a better solution
Connect a processor only to its two neighbors
Short wires, much higher switching frequency, better bandwidth, more processors
Each node has private local memory (more like a distributed shared memory
multiprocessor)
Every cache line has a home node i.e. the node where the memory contains this line:
can be determined by upper few bits of the PA
Transactions traverse the ring node by node

Snoop mechanism
When a transaction passes by the ring interface of a node it snoops the transaction,
takes appropriate coherence actions, and forwards the transaction to its neighbor if
necessary
The home node also receives the transaction eventually and let’s assume that it has a
dirty bit associated with every memory line (otherwise you need a two-phase protocol)
A request transaction is removed from the ring when it comes back to the requester
(serves as an acknowledgment that every node has seen the request)
The ring is essentially divided into time slots where a node can insert new request or
response; if there is no free time slot it must wait until one passes by: called a
slotted ring
The snoop logic must be able to finish coherence actions for a transaction before the next
time slot arrives
The main problem of a ring is the end-to-end latency, since the transactions must traverse
hop-by-hop
Serialization and sequential consistency is trickier
The order of two transactions may be differently seen by two processors if the source
of one transaction is between the two processors
The home node can resort to NACKs if it sees conflicting outstanding requests
Introduces many races in the protocol

Scaling bandwidth

Data bandwidth
Make the bus wider: costly hardware
Replace bus by point-to-point crossbar: since only the address portion of a transaction
is needed for coherence, the data transaction can be directly between source and
destination
Add multiple data busses
Snoop or coherence bandwidth
This is determined by the number of snoop actions that can be executed in unit time
Having concurrent non-conflicting snoop actions definitely helps improve the protocol
throughput
Multiple address busses: a separate snoop engine is associated with each bus on each
node
Order the address busses logically to define a partial order among concurrent requests
so that these partial orders can be combined to form a total order

AMD Opteron

Each node contains an x86-64 core, 64 KB L1 data and instruction caches, 1 MB L2 cache,
on-chip integrated memory controller, three fast routing links called hyperTransport, local DDR
memory
Glueless MP: just connect 8 Opteron chips via HT to design a distributed shared memory
multiprocessor
L2 cache supports 10 outstanding misses
Integrated memory controller and north bridge functionality help a lot
Can clock the memory controller at processor frequency (2 GHz)
No need to have a cumbersome motherboard; just buy the Opteron chip and connect it
to a few peripherals (system maintenance is much easier)
Overall, improves performance by 20-25% over Athlon
Snoop throughput and bandwidth is much higher since the snoop logic is clocked at 2
GHz
Integrated hyperTransport provides very high communication bandwidth
Point-to-point links, split-transaction and full duplex (bidirectional links)
On each HT link you can connect a processor or I/O

Opteron servers

(Figure omitted; produced from IEEE Micro.)

AMD Hammer protocol

Opteron uses the snoop-based Hammer protocol


First the requester sends a transaction to home node
The home node starts accessing main memory and in parallel broadcasts the request to
all the nodes via point-to-point messages
The nodes individually snoop the request, take appropriate coherence actions in their
local caches, and sends data (if someone has it in M or O state) or an empty completion
acknowledgment to the requester; the home memory also sends the data speculatively
After gathering all responses the requester sends a completion message to the home
node so that it can proceed with subsequent requests (this completion ack may be
needed for serializing conflicting requests)
This is one example of a snoopy protocol over a point-to-point interconnect unlike the
shared bus

Exercise : 2

These problems should be tried after module 12 is completed.

1. [30 points] For each of the memory reference streams given in the following,
compare the cost of executing it on a bus-based SMP that supports (a) MESI
protocol without cache-to-cache sharing, and (b) Dragon protocol. A read from
processor N is denoted by rN while a write from processor N is denoted by wN.
Assume that all caches are empty to start with and that cache hits take a single
cycle, misses requiring upgrade or update take 60 cycles, and misses requiring
whole block transfer take 90 cycles. Assume that all caches are writeback.

Stream1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3
Stream2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1
Stream3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3

[For each stream for each protocol: 5 points]

2. [15 points] (a) As cache miss latency increases, does an update protocol
become more or less preferable as compared to an invalidation based protocol?
Explain.

(b) In a multi-level cache hierarchy, would you propagate updates all the way to
the first-level cache? What are the alternative design choices?

(c) Why is update-based protocol not a good idea for multiprogramming
workloads running on SMPs?

3. [20 points] Assuming all variables to be initialized to zero, enumerate all
outcomes possible under sequential consistency for the following code
segments.

(a) P1: A=1;
P2: u=A; B=1;
P3: v=B; w=A;

(b) P1: A=1;
P2: u=A; v=B;
P3: B=1;
P4: w=B; x=A;

(c) P1: u=A; A=u+1;
P2: v=A; A=v+1;

(d) P1: fetch-and-inc (A)
P2: fetch-and-inc (A)

4. [30 points] Consider a quad SMP using a MESI protocol (without cache-to-
cache sharing). Each processor tries to acquire a test-and-set lock to gain
access to a null critical section. Assume that test-and-set instructions always go
on the bus and they take the same time as the normal read transactions. The
initial condition is such that processor 1 has the lock and processors 2, 3, 4 are
spinning on their caches waiting for the lock to be released. Every processor
gets the lock once, unlocks, and then exits the program. Consider the bus
transactions related to the lock/unlock operations only.

(a) What is the least number of transactions executed to get from the initial to
the final state? [10 points]

(b) What is the worst-case number of transactions? [5 points]

(c) Answer the above two questions if the protocol is changed to Dragon. [15
points]

5. [30 points] Answer the above question for a test-and-test-and-set lock for a
16-processor SMP. The initial condition is such that the lock is released and no
one has got the lock yet.

6. [10 points] If the lock variable is not allowed to be cached, how will the traffic
of a test-and-set lock compare against that of a test-and-test-and set lock?

7. [15 points] You are given a bus-based shared memory machine. Assume that
the processors have a cache block size of 32 bytes and A is an array of
integers (four bytes each). You want to parallelize the following loop.

for(i=0; i<17; i++) {
for (j=0; j<256; j++) {
A[j] = do_something(A[j]);
}
}

(a) Under what conditions would it be better to use a dynamically scheduled
loop?

(b) Under what conditions would it be better to use a statically scheduled loop?

(c) For a dynamically scheduled inner loop, how many iterations should a
processor pick each time?

8. [20 points] The following barrier implementation is wrong. Make as little
change as possible to correct it.

struct bar_struct {
LOCKDEC(lock);
int count; // Initialized to zero
int releasing; // Initialized to zero
} bar;

void BARRIER (int P) {
LOCK(bar.lock);
bar.count++;
if (bar.count == P) {
bar.releasing = 1;
bar.count--;
}
else {
UNLOCK(bar.lock);
while (!bar.releasing);
LOCK(bar.lock);
bar.count--;
if (bar.count == 0) {
bar.releasing = 0;
}
}
UNLOCK(bar.lock);
}

Solution of Exercise : 2

[Thanks to Saurabh Joshi for some of the suggestions.]

1. [30 points] For each of the memory reference streams given in the following,
compare the cost of executing it on a bus-based SMP that supports (a) MESI
protocol without cache-to-cache sharing, and (b) Dragon protocol. A read from
processor N is denoted by rN while a write from processor N is denoted by wN.
Assume that all caches are empty to start with and that cache hits take a single
cycle, misses requiring upgrade or update take 60 cycles, and misses requiring
whole block transfer take 90 cycles. Assume that all caches are writeback.

Solution:
Stream1: r1 w1 r1 w1 r2 w2 r2 w2 r3 w3 r3 w3

(a) MESI: read miss, hit, hit, hit, read miss, upgrade, hit, hit, read miss,
upgrade, hit, hit. Total latency = 90+1+1+1+2*(90+60+1+1) = 397 cycles
(b) Dragon: read miss, hit, hit, hit, read miss, update, hit, update, read miss,
update, hit, update. Total latency = 90+1+1+1+2*(90+60+1+60) = 515 cycles

Stream2: r1 r2 r3 w1 w2 w3 r1 r2 r3 w3 w1

(a) MESI: read miss, read miss, read miss, upgrade, readX, readX, read miss,
read miss, hit, upgrade, readX. Total latency =
90+90+90+60+90+90+90+90+1+60+90 = 841 cycles
(b) Dragon: read miss, read miss, read miss, update, update, update, hit, hit,
hit, update, update. Total latency = 90+90+90+60+60+60+1+1+1+60+60=573
cycles

Stream3: r1 r2 r3 r3 w1 w1 w1 w1 w2 w3

(a) MESI: read miss, read miss, read miss, hit, upgrade, hit, hit, hit, readX,
readX. Total latency = 90+90+90+1+60+1+1+1+90+90 = 514 cycles
(b) Dragon: read miss, read miss, read miss, hit, update, update, update,
update, update, update. Total latency=90+90+90+1+60*6=631 cycles

[For each stream for each protocol: 5 points]
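
As a quick arithmetic check of the totals above, the per-access costs can be tallied
mechanically. The classification of each access (hit = 1, upgrade/update = 60, block
transfer = 90 cycles) is copied from the solution; only the summation is automated.

#include <stdio.h>

int total(const int *costs, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += costs[i];
    return sum;
}

int main(void)
{
    int s1_mesi[]   = {90,1,1,1, 90,60,1,1, 90,60,1,1};
    int s1_dragon[] = {90,1,1,1, 90,60,1,60, 90,60,1,60};
    int s2_mesi[]   = {90,90,90,60,90,90,90,90,1,60,90};
    int s2_dragon[] = {90,90,90,60,60,60,1,1,1,60,60};
    int s3_mesi[]   = {90,90,90,1,60,1,1,1,90,90};
    int s3_dragon[] = {90,90,90,1,60,60,60,60,60,60};

    printf("Stream1: MESI=%d Dragon=%d\n", total(s1_mesi, 12), total(s1_dragon, 12));
    printf("Stream2: MESI=%d Dragon=%d\n", total(s2_mesi, 11), total(s2_dragon, 11));
    printf("Stream3: MESI=%d Dragon=%d\n", total(s3_mesi, 10), total(s3_dragon, 10));
    return 0;   /* prints 397/515, 841/573, 514/631 as in the solution */
}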

2. [15 points] (a) As cache miss latency increases, does an update protocol
become more or less preferable as compared to an invalidation based protocol?
Explain.

Solution: If the system is bandwidth-limited, invalidation protocol will remain
the choice. However, if there is enough bandwidth, with increasing cache miss
latency, invalidation protocol will lose in importance.

(b) In a multi-level cache hierarchy, would you propagate updates all the way to
the first-level cache? What are the alternative design choices?

Solution: If updates are not propagated to L1 caches, on an update the L1
block must be invalidated/retrieved to the L2 cache.

(c) Why is update-based protocol not a good idea for multiprogramming
workloads running on SMPs?

Solution: The pack-rat problem (discussed in class): under multiprogramming a process may
be descheduled or migrated, but its data lingers in the old processor's cache, which then
keeps receiving useless updates, wasting bus bandwidth and cache space.

3. [20 points] Assuming all variables to be initialized to zero, enumerate all
outcomes possible under sequential consistency for the following code
segments.

(a) P1: A=1;


P2: u=A; B=1;
P3: v=B; w=A;

Solution: If u=1 and v=1, then w must be 1. So (u, v, w) = (1, 1, 0) is not
allowed. All other outcomes are possible.

(b) P1: A=1;


P2: u=A; v=B;
P3: B=1;
P4: w=B; x=A;

Solution: Observe that if u sees the new value A, v does not see the new
value of B, and w sees that new value of B, then x cannot see the old value of
A. So (u, v, w, x) = (1, 0, 1, 0) is not allowed. Similarly, if w sees the new value
of B, x sees the old value of A, u sees the new value of A, then v cannot see
the old value B. So (1, 0, 1, 0) is not allowed, which is already eliminated in the
above case. All other 15 combinations are possible.

(c) P1: u=A; A=u+1;


P2: v=A; A=v+1;

Solution: If v=A happens before A=u+1, then the final (u, v, A) = (0, 0, 1).
If v=A happens after A=u+1, then the final (u, v, A) = (0, 1, 2).
Since u and v are symmetric, we will also observe the outcome (1, 0, 2) in some
cases.

(d) P1: fetch-and-inc (A)


P2: fetch-and-inc (A)

Solution: The final value of A is 2.

4. [30 points] Consider a quad SMP using a MESI protocol (without cache-to-
cache sharing). Each processor tries to acquire a test-and-set lock to gain
access to a null critical section. Assume that test-and-set instructions always go
on the bus and they take the same time as the normal read transactions. The
initial condition is such that processor 1 has the lock and processors 2, 3, 4 are
spinning on their caches waiting for the lock to be released. Every processor
gets the lock once, unlocks, and then exits the program. Consider the bus
transactions related to the lock/unlock operations only.

(a) What is the least number of transactions executed to get from the initial to
the final state? [10 points]

Solution: 1 unlocks, 2 locks, 2 unlocks (no transaction), 3 locks, 3 unlocks (no
transaction), 4 locks, 4 unlocks (no transaction): four bus transactions in all. Notice
that in the best possible
scenario, the timings will be such that when someone is in the critical section no
one will even attempt a test-and-set. So when the lock holder unlocks, the
cache block will still be in its cache in M state.

(b) What is the worst-case number of transactions? [5 points]

Solution: Unbounded. While someone is holding the lock, other contending processors may
keep on invalidating each other an indefinite number of times.

(c) Answer the above two questions if the protocol is changed to Dragon. [15
points]

Solution: Observe that it is an order of magnitude more difficult to implement
shared test-and-set locks (LL/SC-based locks are easier to implement) in a
machine running an update-based protocol. In a straightforward implementation,
on an unlock everyone will update the value in cache and then will try to do
test-and-set. Observe that the processor which wins the bus and puts its update
first, will be the one to enter the critical section. Others will observe the update
on the bus and must abort their test-and-set attempts. While someone is in the
critical section, nothing stops the other contending processors from trying test-
and-set (notice the difference with test-and-test-and-set). However, these
processors will not succeed in getting entry to the critical section until the unlock
happens.

Least number is now 7 (compared to four under MESI, where the unlocks by 2, 3, and 4 were
free): a test-and-set or an unlock each involves putting an update on the bus.

Worst case is still unbounded.

5. [30 points] Answer the above question for a test-and-test-and-set lock for a
16-processor SMP. The initial condition is such that the lock is released and no
one has got the lock yet.

Solution: MESI:

Best case analysis: 1 locks, 1 unlocks, 2 locks, 2 unlocks, ... This involves
exactly 16 transactions (unlocks will not generate any transaction in the best
case timing).

Worst case analysis: Done in the class. The first round will have (16 + 15 + 1 +
15) transactions. The second round will have (15 + 14 + 1 + 14) transactions.
The last but one round will have (2 + 1 + 1 + 1) transactions and the last round
will have one transaction (just locking of the last processor). The last unlock will
not generate any transaction. If you add these up, you will get (1.5P+2)(P-1) +
1. For P=16, this is 391.

Dragon:

Best case analysis: Now both unlocks and locks will generate updates. So the
total number of transactions would be 32.

Worst case analysis: The test & set attempts in each round will generate
updates. The unlocks will also generate updates. Everything else will be cache
hits. So the number of transactions is (16+1)+(15+1)+...+(1+1) = 152.

6. [10 points] If the lock variable is not allowed to be cached, how will the traffic
of a test-and-set lock compare against that of a test-and-test-and-set lock?

Solution: In the worst case both would be unbounded.
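
For reference, a minimal C sketch of the two lock flavours compared in questions 4-6; the
GCC __sync builtins are an assumed toolchain detail, not something specified by the
exercise. The test-and-test-and-set version first spins on an ordinary read, so waiting
processors hit in their caches until the release, while the plain test-and-set version
puts a bus transaction on every attempt (as the problem statement assumes).

typedef volatile int lock_t;              /* 0 = free, 1 = held */

void ts_acquire(lock_t *l)                /* test-and-set lock */
{
    while (__sync_lock_test_and_set(l, 1))
        ;                                 /* every attempt goes to the bus */
}

void tts_acquire(lock_t *l)               /* test-and-test-and-set lock */
{
    for (;;) {
        while (*l)
            ;                             /* spin locally on a cached copy */
        if (!__sync_lock_test_and_set(l, 1))
            return;                       /* won the race to set the lock */
    }
}

void lock_release(lock_t *l)
{
    __sync_lock_release(l);               /* store 0 with release semantics */
}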

7. [15 points] You are given a bus-based shared memory machine. Assume that
the processors have a cache block size of 32 bytes and A is an array of
integers (four bytes each). You want to parallelize the following loop.

for(i=0; i<17; i++) {
for (j=0; j<256; j++) {
A[j] = do_something(A[j]);
}
}

(a) Under what conditions would it be better to use a dynamically scheduled
loop?

Solution: If runtime of do_something varies a lot depending on its argument
value or if nothing is known about do_something.

(b) Under what conditions would it be better to use a statically scheduled loop?

Solution: If runtime of do_something is roughly independent of its argument
value.

(c) For a dynamically scheduled inner loop, how many iterations should a
processor pick each time?

Solution: Multiple of 8 integers (one cache block is eight integers).
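
A minimal sketch of such a dynamically scheduled inner loop, picking chunks of 8 integers
so that each chunk is exactly one 32-byte block and workers do not false-share. The shared
counter, the GCC __sync_fetch_and_add builtin, and the do_something prototype are
illustrative assumptions, not part of the original problem.

#define N     256
#define CHUNK 8                     /* 8 ints x 4 bytes = one 32-byte cache block */

extern int do_something(int x);     /* assumed to be defined elsewhere */

int A[N];
volatile int next_chunk = 0;        /* shared counter, reset before each outer iteration */

void inner_loop_worker(void)        /* executed by every processor */
{
    for (;;) {
        int j = __sync_fetch_and_add(&next_chunk, CHUNK);  /* grab the next chunk */
        if (j >= N)
            break;
        for (int k = j; k < j + CHUNK; k++)
            A[k] = do_something(A[k]);    /* whole chunk lies in one cache block */
    }
}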

8. [20 points] The following barrier implementation is wrong. Make as little
change as possible to correct it.

struct bar_struct {
LOCKDEC(lock);
int count; // Initialized to zero
int releasing; // Initialized to zero
} bar;

void BARRIER (int P) {
LOCK(bar.lock);
bar.count++;
if (bar.count == P) {
bar.releasing = 1;
bar.count--;
}
else {
UNLOCK(bar.lock);
while (!bar.releasing);
LOCK(bar.lock);
bar.count--;
if (bar.count == 0) {
bar.releasing = 0;
}
}
UNLOCK(bar.lock);
}

Solution: There are too many problems with this implementation. I will not list
them here. The correct barrier code is given below which requires addition of
one line of code. Notice that the releasing variable nicely captures the notion of
sense reversal.

void BARRIER (int P) {
while (bar.releasing); // New addition
LOCK(bar.lock);
bar.count++;
if (bar.count == P) {
bar.releasing = 1;
bar.count--;
}
else {
UNLOCK(bar.lock);
while (!bar.releasing);
LOCK(bar.lock);
bar.count--;
if (bar.count == 0) {
bar.releasing = 0;
}
}
UNLOCK(bar.lock);
}

Module 13: "Scalable Multiprocessors"


Lecture 28: "Scalable Multiprocessors"

Scalable Multiprocessors

Agenda

Basics of scalability

Bandwidth scaling

Latency scaling

Cost scaling

Physical scaling

IBM SP-2

Programming model

Common challenges

Spectrum of designs

Physical DMA

nCUBE/2

User-level ports

User-level handling

Message co-processor

Intel Paragon

Meiko CS-2

Shared physical addr.

Caching shared data?

COWs and NOWs

Scalable synchronization

Distributed queue locks

[From Chapter 7 of Culler, Singh, Gupta]

Agenda

Basics of scalability
Programming models
Physical DMA
User-level networking
Dedicated message processing
Shared physical address
Cluster of workstations (COWs) and Network of workstations (NOWs)
Scaling parallel software
Scalable synchronization

Basics of scalability

Main problem is the communication medium


Busses don’t scale
Need more wires that are not always shared by all
Replace bus by a network of switches
Distributed memory multiprocessors
Each node has its own local memory
To access remote memory, node sends a point-to-point message to the destination
How to support efficient messaging?
Main goal is to reduce the ratio of remote memory latency to local memory latency
In shared memory, how to support coherence efficiently?

Bandwidth scaling

Need a large number of independent paths between two nodes


Makes it possible to support a large number of concurrent transactions
They get initiated independently (as opposed to a single centralized bus arbiter)
Local accesses should be higher bandwidth
Since communication takes place via point-to-point messages, only the routers or switches
along the path are involved
No global visibility of messages (unlike a bus)
Must send separate messages to make sure that global visibility is guaranteed when
necessary (e.g., invalidations)

Latency scaling

End-to-end latency of a message involves three parts (log model)


Overhead (o): time to initiate a message and to terminate a message (at sender and
receiver respectively); normally involves kernel overhead in message passing and the
coherence overhead in shared memory
Node-to-network time or gap (g): number of bytes/link bandwidth where this is the
bandwidth offered by the router to/from network (how fast you can push packets into
the network or pull packets from the network); normally the bandwidth between
network interface (NI) and the router is at least as big and hence is not a bottleneck
Routing time or hop time (L): determined by topology, routing algorithm, and router
circuitry (e.g., arbitration, number of ports etc.)

Importance: L < g < o for most scientific applications
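As a purely hypothetical illustration (these numbers are not from the lecture): with
o = 5 microseconds at each end, g = 0.5 microseconds, hop time l = 50 ns, and a 10-hop
route, one-way latency is roughly o + g + 10*l + g + o = 5 + 0.5 + 0.5 + 0.5 + 5 = 11.5
microseconds, of which the software overhead o contributes almost 90%; this is why
reducing o matters most.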

Cost scaling

Cost of anything has two components


Fixed cost
Incremental cost for adding something more (in our case more nodes)
Bus-based SMPs have too much of fixed cost
Scaling involves adding a commodity processor and possibly more DRAM
Need to have more modular cost scaling i.e. don’t want to pay so much even for a
small scale machine
Costup = cost of P nodes / cost of single node
Parallel computing on a machine is cost-effective if speedup > costup on average for target
applications
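For example, with hypothetical prices: if a single node costs $10K and a 32-node machine
costs $400K because of the interconnect, packaging, and cabinets, then costup = 40 and no
application running on 32 processors can achieve speedup above 40; a $250K machine
(costup = 25), on the other hand, is cost-effective for any workload that sustains a
speedup above 25.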

Physical scaling

Integration at various levels


Chip-level integration to keep wires short: nCUBE/2 (1990) puts the processor, router,
MMU, DRAM interface on a single chip
Board-level integration: Thinking machine CM-5 uses the core of a Sun SparcStation 1
and other peripherals on a single board to realize a node
System-level integration: IBM SP1 and SP2 exploit commodity RS6000 workstations and
connect them through an external communication assist / network interface card and a
router

IBM SP-2

Programming model

Shared address space


Communication initiated by load/store instructions
Requires local or remote memory access before the data can be sent
Request/response protocol initiated by receiver
Message passing
A one way communication: normally initiated by the sender
Number of sends can be buffered before a matching receive shows up

Synchronous vs. asynchronous sends


Active messages
Restricted form of remote procedure call

Common challenges

Input buffer overflow


Reserve space per source
Refuse input when full
Let the network back up naturally: tree saturation
Deadlock-free networks
Traffic not bound to hot-spot nodes may get severely affected
Keep a reserved NACK channel
May drop packets depending on the network protocol
Fetch deadlock
Nodes must continue to serve new messages while waiting for queue space so that it
can put new requests
Separate request and response virtual networks (essentially disjoint set of queues in
each port of the router) or have enough buffer space to never run into this problem
(may be too expensive)

Spectrum of designs

In increasing order of hardware support and probably performance and cost


Physical bit stream, physical DMA (nCUBE, iPSC)
User-level network port (CM-5, MIT *T)
User-level handler (MIT J machine, Monsoon)
Remote virtual address (Intel Paragon, Meiko CS-2)
Global physical address (Cray T3D, T3E)
Cache-coherent shared memory (SGI Origin, Alpha GS320, Sun S3.mp)

Physical DMA

A reserved area in physical memory is used for sending and receiving messages
After setting up the memory region the processor takes a trap to the kernel
The interrupt handler typically copies the data into kernel area so that it can be
manipulated
Finally, kernel instructs the DMA device to push the message into the network via the
physical address of the message (typically called DMA channel address)
At the destination the DMA device will deposit the message in a predefined physical
memory area and generates an interrupt for the processor
The interrupt handler now can copy the message into kernel area and inspect and
parse the message to take appropriate actions (this is called blind deposit)

nCUBE/2

Independent DMA channels per link direction


Segmented messages (first segment can be inspected to decide what to do with the rest)
Active messages: 13 µs outbound, 15 µs inbound
Dimension-order routing on hypercube

User-level ports

Network ports and status registers are memory-mapped in user address space
User program can initiate a transaction by composing the message and writing to the
status registers
Communication assist does the protection check and pushes the message into the
physical medium
A message at the destination can sit in the input queue until the user program pops it
off
A system message generates an interrupt through the destination assist, but user
messages do not require OS intervention
Problem with context switch: messages are now really part of the process state; need
to save and restore them
Thinking machine CM-5 has outbound message latency of 1.5 µs and inbound 1.6 µs

User-level handling

Instead of mapping the network ports to user memory, make them processor registers
Even faster messaging
Communication assist now looks really like a functional unit inside the processor (just
like a FPU)
Send and receive are now register to register transfers
iWARP from CMU and Intel, *T from MIT and Motorola, J machine from MIT
iWARP binds two processor registers to the heads of the network input and output
ports; the processor accesses the message word-by-word as it streams in
*T extended Motorola 88110 RISC core to include a network function unit containing
dedicated sets of input and output registers; a message is spread over a set of such
registers and a special instruction initiates the transfer

Message co-processor

Nodes equipped with a dedicated message processor or communication processor (CP)


Two possible organizations: main processor and CP sit on a shared memory bus along
with the main memory and NI; otherwise the CP may be integrated into the NI
The main processor and CP talk to each other via the normal cache coherence
protocol i.e. while sending a message the main processor fills a shared buffer and sets
a flag and while receiving a message CP does the same thing
Possible inefficiency due to invalidation-based coherence protocol (Update protocol
would be worse)
CP may need to handle a lot of concurrent transactions e.g., from main processor and
from network: a single dispatch loop serializes the processing (multi-threaded CP?)

Intel Paragon

One i860XP processor per SMP node (MESI) is dedicated as CP


One receive and one transmit DMA engine for transferring data from/to shared memory
to/from NI transmit/receive queue (each 2 KB)
While sending a large message the NI queue may become full and the network may not drain
that fast
To avoid deadlock the transmit DMA is stalled by hardware flow control and the bus is
relinquished
Takes about 10 µs to send a small message (about two cache lines) from the register file of
source to the register file of destination

Meiko CS-2

The CP is tightly integrated with the NI and has separate concurrent units
The command unit sits directly on the shared bus and is responsible for fielding
processor requests
The processor executes an atomic swap between one register and a fixed memory
location which is mapped to the head of the CP input queue
The command contains a command type and a VA
Depending on the command type the command processor can invoke the DMA
processor (may need assistance from VA to PA unit), an event processor (to wake up
some thread on the main processor), or a thread processor to construct and issue a
message
The input processor fields new messages from the network and may invoke the reply
processor, or any of the above three units

Shared physical addr.

Memory controller on each node accepts PAs from the system bus
The processor initially issues a VA
The TLB provides the translation and the upper few bits of PA represent the home
node for this address (determined when the mapping is established for the first time)
If the address is local i.e. requester is the home node, the memory controller returns
data just as in uniprocessor
If address is remote the memory controller instructs the communication assist
(essentially the NI) to generate a remote memory request
In the remote home the CA issues a request to the memory controller to read memory
and eventually data is returned to the requester

Caching shared data?

All transactions are no longer visible to all


Whether a page should be cached or not is normally part of the VA to PA translation
For example, in some graphics co-processors all operations must be through uncached
writes or storing command/data to memory-mapped control registers is also uncached
Private memory lines can be cached without any problem and does not depend on if the
line is local or remote
Caching shared lines introduces coherence issues

COWs and NOWs

Historically, used to build a multi-programmed multi-user system


Connect a number of PCs or workstations over a cheap commodity network and
schedule independent jobs on machines depending on the load
Increasingly, these clusters are being used as parallel machines
One major reason is the availability of message passing libraries to express parallelism
over commodity LAN or WAN
Also, technology breakthrough in terms of high-speed interconnects (ServerNet, Myrinet,
Infiniband, PCI Express AS, etc.)
Varying specialized support in CA
Conventional TCP/IP stack imposes an enormous overhead to move even a small
amount of data (often more than common Ethernet): network processor architecture has
been a hot research topic
Active messages allow user-level communication
Reflective memory allows writes to special regions of memory to appear as writes into
regions on remote processors
Virtual interface architecture (VIA): each process has a communication end point
consisting of a send queue, receive queue, and status; a process can deposit a
message in its send queue with a dest. id so that it actually gets into the receive queue
of target process
The CA hardware normally plugs on to the I/O bus as opposed to the memory bus (fast PCI
bus supports coherence); it could be on the graphics bus also

Scalable synchronization

In message-passing a send/receive pair provides point-to-point synchronization


Handle all-to-all synchronization via tree barrier
Also, all-to-all communication must be properly staggered to avoid hot-spots in the system
Classical example of matrix transpose
Scalable locks such as ticket or array should be used
Any problem with array locks?
Array locations now may not be local: invalidation causes remote misses

Distributed queue locks

Goodman, Vernon, Woest (1989)


Logically arrange the array as a linked list
Allocate a new node (must be atomic) when a processor enters acquire
The node is allocated on the local memory of the contending processor


A tail pointer is always maintained
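
A minimal C sketch in the spirit of such a list-based queue lock (closer to the later MCS
formulation than to the exact Goodman-Vernon-Woest hardware proposal, and using GCC's
__sync builtins as an assumed toolchain detail): each contender brings its own node,
ideally allocated in its local memory, atomically swings the shared tail pointer to that
node, and then spins only on its own flag, so a release touches exactly one waiter.

#include <stddef.h>

typedef struct qnode {
    struct qnode *volatile next;
    volatile int           locked;         /* 1 while this contender must wait */
} qnode_t;

typedef struct { qnode_t *volatile tail; } qlock_t;   /* shared tail pointer */

void q_acquire(qlock_t *L, qnode_t *me)    /* 'me' lives in local memory */
{
    qnode_t *pred;
    me->next = NULL;
    me->locked = 1;
    do {                                   /* atomic swap of the tail pointer */
        pred = L->tail;
    } while (!__sync_bool_compare_and_swap(&L->tail, pred, me));
    if (pred != NULL) {                    /* queue was not empty */
        pred->next = me;
        while (me->locked)
            ;                              /* spin on our own, locally allocated flag */
    }
}

void q_release(qlock_t *L, qnode_t *me)
{
    if (me->next == NULL) {
        /* no successor visible: try to mark the lock free */
        if (__sync_bool_compare_and_swap(&L->tail, me, NULL))
            return;
        while (me->next == NULL)
            ;                              /* a successor is still linking itself in */
    }
    me->next->locked = 0;                  /* hand the lock to exactly one waiter */
}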

Module 14: "Directory-based Cache Coherence"


Lecture 29: "Basics of Directory"

Directory-based Cache Coherence:

What is needed?

Adv. of MP nodes

Disadvantages

Basics of directory

Directory organization

Is directory useful?

Sharing pattern

Directory organization

Directory overhead

Path of a read miss

Correctness issues

[From Chapter 8 of Culler, Singh, Gupta]


[SGI Origin 2000 material taken from Laudon and Lenoski, ISCA 1997]
[GS320 material taken from Gharachorloo et al., ASPLOS 2000]

What is needed?

On every memory operation


Find the state of the cache line (normally present in cache)
Tell others about the operation if needed (achieved by broadcasting in bus-based or
small-scale distributed systems)
Other processors must take appropriate actions
Also, need a mechanism to resolve races between concurrent conflicting accesses (i.e.
one of them is a write): this essentially needs some central control on a per cache line
basis
Atomic bus provides an easy way of serializing
Split-transaction bus with a distributed request table works only because every request
table can see every transaction
Need to have a table that gets accessed/updated by every cache line access
This is the directory
Every cache line has a separate directory entry
The directory entry stores the state of the line, who the current owner is (if any), the
sharers (if any), etc.
On a miss, the directory entry must be located, and appropriate coherence action must
be taken
A popular architecture is to have a two-level hierarchy: each node is an SMP, kept
coherent via a snoopy or directory protocol, and the nodes are kept coherent by a scalable
directory protocol (Convex Exemplar: directory-directory, Sequent, Data General, HAL,
DASH: snoopy-directory)

Adv. of MP nodes

Amortization of node fixed cost over multiple processors; can use commodity SMPs
Much communication may be contained within a node i.e. less “remote” communication
Request combining by some extra hardware in memory controller
Possible to share caches e.g., chip multiprocessor nodes (IBM POWER4 and POWER5) or
hyper-threaded nodes (Intel Xeon MP)
Exact benefit depends on sharing pattern
Widely shared data or nearest neighbor (if properly mapped) may be good

Disadvantages

Snoopy bus delays all accesses


The local snoop must complete first
Then only a request can be sent to remote home
Same delay may be incurred at the remote home also depending on the coherence
scheme
This dictated SGI Origin 2000 to use dual-processor nodes whose coherence is managed
entirely by the directory (the two processors in a node are not snoop-coherent)
Bandwidth at critical points is shared by all processors
System bus, memory controller, DRAM, router
Bad communication patterns can actually result in execution time larger than with
uniprocessor nodes, even though the MP-node system has fewer nodes and hence a smaller
average hop count; e.g., compare two 16P systems, one with four 4-way SMP nodes and one
with sixteen uniprocessor nodes

Basics of directory

Theoretically speaking each directory entry should have a dirty bit and a bitvector of length P
On a read from processor k, if dirty bit is off read cache line from memory, send it to k,
set bit[k] in vector; if dirty bit is on read owner id from vector (different interpretation of
bitvector), send read intervention to owner, owner replies line directly to k (how?),
sends a copy to home, home updates memory, directory controller sets bit[k] and
bit[owner] in vector
On a write from processor k, if dirty bit is off send invalidations to all sharers marked in
vector, wait for acknowledgments, read cache line from memory, send it to k, zero out
vector and write k in vector, set dirty bit; if dirty bit on same as read, but now
intervention is of readX type and memory does not write the line back, dirty bit is set
and vector=k
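
A schematic C rendering of this bookkeeping, to make the two cases concrete. Message sends
and memory accesses are reduced to comments, P = 64 is an arbitrary assumption, and the
encoding (a dirty bit plus a P-bit vector that is reinterpreted as an owner id when dirty)
follows the description above rather than any particular machine.

#include <stdint.h>

#define P 64                        /* number of nodes (assumed, fits in 64 bits) */

typedef struct {
    int      dirty;                 /* 1: exactly one owner, vector holds its id */
    uint64_t vector;                /* sharer bitvector, or owner id when dirty */
} dir_entry_t;

void dir_read(dir_entry_t *d, int k)        /* read miss from node k */
{
    if (!d->dirty) {
        /* read line from memory and send it to k (not shown) */
        d->vector |= 1ULL << k;             /* k becomes a sharer */
    } else {
        int owner = (int)d->vector;         /* vector reinterpreted as owner id */
        /* send intervention to owner; owner sends the line to k and a copy to
           home; home updates memory (not shown) */
        d->dirty  = 0;
        d->vector = (1ULL << k) | (1ULL << owner);  /* both are sharers now */
    }
}

void dir_write(dir_entry_t *d, int k)       /* write miss or upgrade from node k */
{
    if (!d->dirty) {
        /* invalidate every sharer in d->vector, wait for acknowledgments, then
           send the memory copy of the line to k (not shown) */
        d->dirty  = 1;
        d->vector = (uint64_t)k;            /* k is now the exclusive owner */
    } else {
        /* readX-type intervention to the current owner, (int)d->vector; the
           owner sends the line to k and memory is not updated (not shown) */
        d->vector = (uint64_t)k;            /* dirty stays set, only the owner changes */
    }
}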

Directory organization

Centralized vs. distributed


Centralized directory helps to resolve many races, but becomes a bandwidth bottleneck
One solution is to provide a banked directory structure: with each memory bank
associate its directory bank
But since memory is distributed, this essentially leads to distributed directory structure
i.e. each node is responsible for holding the directory entries corresponding to the
memory lines it is holding
Why did we decide to have a distributed memory organization instead of dance hall?

Is directory useful?

One drawback of directory


Before looking up the directory you cannot decide what to do (even if you start reading
memory speculatively)
So directory introduces one level of indirection in every request that misses in
processor’s cache hierarchy
Therefore, broadcast is definitely preferable over directory if the system can offer
enough memory controller and router bandwidth to handle broadcast messages
(network link bandwidth is normally not the bottleneck since most messages do not
carry data; observe that you would never broadcast a reply); AMD Opteron adopted this
scheme, but target is small scale
Directory is preferable
If number of sharers is small because in this case a broadcast would waste enormous
amount of memory controller bandwidth

Sharing pattern

Problem is with the writes


Frequently written cache lines exhibit a small number of sharers; so small number of
invalidations
Widely shared data are written infrequently; so large number of invalidations, but rare
Synchronization variables are notorious: heavily contended locks are widely shared and
written in quick succession generating a burst of invalidations; require special solutions
such as queue locks or tree barriers
What about interventions? These are very problematic because in these cases you
cannot send the interventions before looking up the directory and any speculative
memory lookup would be useless
For scientific applications interventions are few due to the mostly one producer-many
consumer pattern; for database workloads these take the lion's share due to the migratory
pattern and tend to increase with bigger caches
Optimizing interventions related to migratory sharing has been a major focus of high-end
scalable servers
AlphaServer GS320 employs few optimizations to quickly resolve races related to
migratory hand-off (more later)
Some academic research looked at destination or owner prediction to speculatively
send interventions even before consulting the directory (Martin and Hill 2003, Acacio et
al 2002)
In general, directory provides far better utilization of bandwidth for scalable MPs compared to
broadcast

Directory organization

How to find source of directory information


Centralized: just access the directory (bandwidth limited)
Distributed: flat scheme distributes directory with memory and every cache line has a
home node where its memory and directory reside
Hierarchical scheme organizes the processors as the leaves of a logical tree (need not
be binary) and an internal node stores the directory entries for the memory lines local
to its children; a directory entry essentially tells you which of its children subtrees are
caching the line and if some subtree which is not its children is also caching; finding
the directory entry of a cache line involves a traversal of the tree until the entry is found
(inclusion is maintained between level k and k+1 directory node where the root is at the
highest level i.e. in the worst case may have to go to the root to find dir.)
Format of a directory entry
Varies a lot: no specific rule
Memory-based scheme: directory entry is co-located in the home node with the
memory line; various organizations can be used; the most popular one is a simple bit
vector (with a 128 bytes line, storage overhead for 64 nodes is 6.35%, for 256 nodes
25%, for 1024 nodes 100%); clearly does not scale with P (more later)
Cache-based scheme: Organize the directory as a distributed linked-list where the
sharer nodes form a chain; the cache tag is extended to hold a node number; the
home node only knows the id of the first sharer; on a read miss the requester adds
itself to the head (involves home and first sharer); on a write miss traverse list and
invalidate (essentially serialized chain of messages); advantage: distributes contention
and does not make the home node a hot-spot, storage overhead is fixed; but very
complex (IEEE SCI standard)
Lot of research has been done to reduce directory storage overhead
The trade-off is between preciseness of information and performance
Normal trick is to have a superset of information e.g., group every two sharers into a
cluster and have a bit per cluster: may lead to one useless invalidation per cluster
We will explore this in detail later
Memory-based bitvector scheme is very popular: invalidations can be overlapped or
multicast
Cache-based schemes incur serialized message chain for invalidation
Hierarchical schemes are not used much due to high latency and volume of messages
(up and down tree); also root may become a bandwidth bottleneck

Directory overhead

Quadratic in number of processors for bitvector


Assume P processors, each with M amount of local memory (i.e. total shared memory
size is M*P)
Let the coherence granularity (cache block size) be B
Number of cache blocks per node = M/B = number of directory entries per node
Size of one directory entry = P + O(1)
Total size of directory memory across all processors = (M/B)*(P + O(1))*P = O(P^2)
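
The quadratic growth can be checked against the bitvector overhead numbers quoted earlier
for a 128-byte line (6.35%, 25%, 100%); the calculation below assumes the O(1) part of an
entry is a single state bit, which is what reproduces those figures.

#include <stdio.h>

int main(void)
{
    const double block_bits = 128 * 8;          /* 128-byte coherence granularity */
    const int    nodes[3]   = { 64, 256, 1024 };
    for (int i = 0; i < 3; i++) {
        double entry_bits = nodes[i] + 1;       /* P-bit vector + 1 state bit (assumed) */
        printf("P = %4d : directory overhead = %.2f%%\n",
               nodes[i], 100.0 * entry_bits / block_bits);
    }
    return 0;   /* prints approximately 6.35%, 25.10%, 100.10% */
}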

Path of a read miss

Assume that the line is not shared by anyone


Load issues from load queue (for data) or fetcher accesses icache; looks up TLB and
gets PA
Misses in L1, L2, L3,… caches
Launches address and request type on system bus
The request gets queued in memory controller and registered in OTT or TTT
(Outstanding Transaction Table or Transactions in Transit Table)
Memory controller eventually schedules the request
Decodes home node from upper few bits of address
Local home: access directory and data memory (how?)
Remote home: request gets queued in network interface
From NI onward
Eventually the request gets forwarded to the router and through the network to the
home
At the home the request gets queued in NI and waits for being scheduled by the home
memory controller
After it is scheduled home memory controller looks up directory and data memory
Reply returns through the same path
Total time (by log model and memory latency m)
Local home: max(k_h*o, m)
Remote home: k_r*o + g_{h+a} + N*l + g_{h+a} + max(k_h*o, m) + g_{h+a+d} + N*l + g_{h+a+d} + k_r*o

Correctness issues

Serialization to a location
Schedule order at home
Use NACKs (extra traffic and livelock) or smarter techniques (back-off, NACK-free)
Flow control deadlock
Avoid buffer dependence cycles
Avoid network queue dependence cycles
Virtual networks multiplexed on physical networks
Coherence protocol dictates the virtual network usage

Module 14: "Directory-based Cache Coherence"


Lecture 30: "SGI Origin 2000"

Directory-based Cache Coherence:

Virtual networks

Three-lane protocols

Performance issues

SGI Origin 2000

Origin 2000 network

Origin 2000 I/O

Origin directory

Cache and dir. states

Handling a read miss

Serializing requests

Handling writebacks

[From Chapter 8 of Culler, Singh, Gupta]


[SGI Origin 2000 material taken from Laudon and Lenoski, ISCA 1997]
[GS320 material taken from Gharachorloo et al., ASPLOS 2000]

Virtual networks

Consider a two-node system with one incoming and one outgoing queue on each node

Single queue is not enough to avoid deadlock


Single queue forms a single virtual network
Similar deadlock issues as multi-level caches
An incoming message may generate another message e.g., request generates reply, ReadX
generates reply and invalidation requests, request may generate intervention request
Memory controller refuses to schedule a message if the outgoing queue is full
Same situation may happen on all nodes: deadlock
One incoming and one outgoing queue is not enough
What if we have two in each direction?: one for request and one for reply
Replies can usually sink
Requests generating requests?
What is the length of the longest transaction in terms of number of messages?
This decides the number of queues needed in each direction (Origin 2000 uses a different
scheme)
One type of message is usually assigned to a queue
One queue type connected across the system forms a virtual network of that type e.g.
request network, reply network, third party request (invalidations and interventions) network
Virtual networks are multiplexed over a physical network
Sink message type must get scheduled eventually
Resources should be sized properly so that scheduling of these messages does not depend
on anything
Avoid buffer shortage (and deadlock) by keeping reserved buffer for the sink queue

Three-lane protocols

Quite popular due to its simplicity


Let the request network be R, reply network Y, intervention/invalidation network be RR
Network dependence (aka lane dependence) graph looks something like this


Performance issues

Latency optimizations
Reduce transactions on critical path: 3-hop vs. 4-hop
Overlap activities: protocol processing and data access, invalidations, invalidation
acknowledgments
Make critical path fast: directory cache, integrated memory controller, smart protocol
Reduce occupancy of protocol engine
Throughput optimizations
Pipeline the protocol processing
Multiple coherence engines
Protocol decisions: where to collect invalidation acknowledgments, existence of clean
replacement hints

SGI Origin 2000

Similar to Stanford DASH


Flat memory-based directory organization

Directory state in separate DRAMs, accessed in parallel with data


Up to 512 nodes (1024 processors)
195 MHz MIPS R10k (peak 390 MFLOPS and 780 MIPS per processor)
Peak SysADBus (64 bits) bandwidth is 780 MB/s; same for hub-memory
Hub to router and Xbow (I/O processor) is 1.56 GB/s
Hub is 500 K gates in 0.5 micron CMOS
Outstanding transaction buffer (aka CRB): 4 per processor
Two processors per node are not snoop-coherent

Origin 2000 network

Each router has six pairs of 1.56 GB/s unidirectional links; two to nodes (bristled), four to other
routers
41 ns pin to pin latency
Four virtual networks: request, reply, priority, I/O

Origin 2000 I/O

Any processor can access I/O device either through uncached ops or through coherent DMA
Any I/O device can access any data through router/hub

Origin directory

Directory formats
If exclusive in a cache, entry contains processor number (not node number)
If shared, entry is a bitvector of sharers where each corresponds to a node (not a
processor)
Invalidations sent to a node are broadcast to both processors by the hub


Two sizes
16-bit format (up to 32 processors), kept in DRAM
64-bit format (up to 128 processors), kept in extension DRAM
For machine sizes larger than 128 processors the protocol is coarse-vector (each
bit is for 8 nodes)
Machine can switch between BV and CV dynamically

Cache and dir. states

Cache states: MESI


Six directory states (may not be six bits)
Unowned (I): no cache has a copy, memory copy is valid
Shared (S): one or more caches have copies, memory copy is valid
Dirty exclusive (M or DEX): exactly one cache has block in M or E state
Directory cannot distinguish between M and E
Two pending or busy or transient states (PSH and PDEX): a transaction for the cache block
is in progress; home cannot accept any new request
Poisoned state: used for efficient page migration

Handling a read miss

Origin protocol does not assume anything about ordering of messages in the network
At requesting hub
Address is decoded and home is located
Request forwarded to home if home is remote
At home
Directory lookup and data lookup are initiated in parallel
Directory banks are designed to be slightly faster than other banks
The directory entry may reveal several possible states
Actions taken depends on this
Directory state lookup
Unowned: mark directory to point to requester, state becomes M, send cache line
Shared: mark directory bit, send cache line
Busy: send NACK to requester
Modified: if owner is not home, forward to owner

3-hop vs. 4-hop reply?


Origin has only two virtual networks available to protocol
How to handle interventions?
Directory state M
Actions at home: set PSH state, set the vector with two sharers, NACK all subsequent
requests until state is S
Actions at owner: if cache state is M send reply to requester (how to know who the requester
is?) and send sharing writeback (SWB) to home; if cache state is E send completion
messages to requester and home (no data is sent); in all cases cache state becomes S
Sharing writeback or completion message, on arrival at home, changes directory state to S
If the owner state is E, how does the requester get the data?
The famous speculative reply of Origin 2000
Note how processor design (in this case MIPS R10k) influences protocol decisions
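
Putting the read-miss handling together, the home-side dispatch can be summarized in
C-like form as below; this is only a paraphrase of the actions described above, with
message sends reduced to comments and the state names matching the directory states
listed earlier.

enum dir_state { UNOWNED, SHARED, DEX, BUSY_SHARED, BUSY_DEX, POISONED };

typedef struct {
    enum dir_state state;
    unsigned long  vector;          /* sharer bits, or owner id when state == DEX */
} origin_dir_t;

/* Home-side dispatch for a read request from processor k (sketch only). */
void origin_read_at_home(origin_dir_t *d, int k)
{
    switch (d->state) {
    case UNOWNED:
        /* send the memory copy (read speculatively in parallel) to k */
        d->state  = DEX;            /* requester will hold the line in E state */
        d->vector = (unsigned long)k;
        break;
    case SHARED:
        /* send the memory copy to k */
        d->vector |= 1UL << k;      /* add k to the sharer vector */
        break;
    case BUSY_SHARED:
    case BUSY_DEX:
        /* NACK: an earlier transaction on this line is still in progress */
        break;
    case DEX:
        /* record k and the owner as the two sharers, go to the pending shared
           state, and forward an intervention to the owner; the owner replies to
           k and sends a sharing writeback or completion message to home, which
           moves the directory state to SHARED (not shown) */
        d->state = BUSY_SHARED;
        break;
    case POISONED:
        /* used only for the page-migration optimization; not sketched here */
        break;
    }
}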

Handling a write miss

Request opcode could be upgrade or read exclusive


State busy: send NACK
State unowned: if ReadX send cache block, change state to M, mark owner in vector; if
upgrade what do you do?
State shared: send reply (upgrade ack or ReadX reply) with number of sharers, send
invalidations to sharers, change state to M, mark owner in vector; sharers send
invalidation acknowledgments to requester directly
What if outgoing request network queue fills up before all invalidations are sent?
State M: same as read miss except directory remains in PDEX state until completion
message (no data) is received from owner; directory remains in M state, only the owner
changes; how do you handle upgrades here?

Serializing requests

The tricky situation is collection of invalidation acknowledgments


Note from previous slides that even before all acknowledgments are collected at the
requester, the directory at home goes to M state with the new owner marked
A subsequent request will get forwarded to the new owner (at this point directory goes
to PSH or PDEX state)
The owner is responsible for serializing the new request with the previous write
The write is not complete until all invalidation acknowledgments are collected
OTT (aka CRB) of the owner is equipped to block any incoming request until all
the acknowledgments and the reply from home are collected (early
interventions)
Note that there is no completion message back to home

Handling writebacks

Valid directory states: M or busy; cannot be S or I


State M
Just clear directory entry, write block to memory
Need to send writeback acknowledgment to the evicting processor (explanation coming
up)
State busy
How can this happen? (Late intervention race)
Can NACK writeback? What support needed for this?
Better solution: writeback forwarding
Any special consideration at the evicting node?
Drop intervention (how?)
How does the directory state change in this case?
