TCC Thesis BDC Defense

Programming with Transactional Memory
Brian D. Carlstrom
Computer Systems Laboratory

Stanford University
http://tcc.stanford.edu
The Problem: “The free lunch is over”
Chip manufacturers have switched from making faster uniprocessors to
adding more processor cores per chip
 Software developers can no longer just hope that the next
generation of processor will make their program faster
[Figure: Uniprocessor performance trends (SPECint), performance vs. VAX-11/780 on a log scale from 1 to 10000, 1978 to 2006. Growth segments are labeled 25%/year, then 52%/year, then ??%/year.]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
Parallel Programming for the Masses?
Every programmer is now a parallel programmer
 The black arts now need to be taught to undergraduates
• IBM and Sun went multi-core first on the server side
• AMD/Intel now in core count race for laptops, desktops, and servers

Year  Microprocessor   Proc/chip  Thread/proc  Thread/chip
2004  IBM POWER5       2          2            4
2005  Azul Vega 1      24         1            24
2005  Sun Niagara 1    8          4            32
2005  AMD Opteron      2          1            2
2006  Intel Woodcrest  2          2            4
2006  Intel Barcelona  4          1            4
2006  Azul Vega 2      48         1            48
2007  AMD Barcelona    4          1            4
2007  Sun Niagara 2    8          8            64
2008  Intel            4          2            8
2009  AMD              8          1            8
2009  Intel            8          2            16

What Makes Parallel Programming Hard?
Typical parallel program
 Single memory shared by multiple program threads
 Need to coordinate access to memory shared between threads
 Locks allow temporary exclusive access to shared data
Lock granularity tradeoff

 Coarse grained locks - contention, lack of scaling, …
 Fine grained locks - excessive overhead, deadlock,…
Apparent tradeoff between correctness and performance

 Easier to reason about only a few locks…
 … but only a few locks can lead to contention

Transactional Memory to the Rescue?
Transactional Memory
 Replaces waiting for locks with concurrency
 Allows non-conflicting updates to shared data
 Shown to improve scalability of short critical regions
Promise of Transactional Memory

 Program with coarse transactions
 Performance like fine-grained locks
Focus on correctness, tune for performance

 Easier to reason about only a few transactions…
 … only focus on areas with true contention

Thesis and Contributions
Thesis:
If transactional memory is to make parallel programming
easier, rather than just more scalable, the programming
interface requires more than simple atomic transactions
To support this thesis I will:

• Show why lock based programs cannot be simply
translated to a transactional memory model
• Present the design of Atomos, a parallel programming
language designed for transactional memory
• Show how Atomos can support semantic concurrency
control, allowing programs with coarse transactions to
perform competitively with fine-grained transactions.

Overview
Motivation and Thesis
 How to make parallel programming of chip multiprocessors
easier using transactional memory
 Concepts, implementation, environment
JavaT [SCP 2006]
 Executing Java programs with Transactional Memory
Atomos [PLDI 2006]
 A transactional programming language
Semantic concurrency control [PPoPP 2007]
 Improving scalability of applications with long transactions

Locks versus Transactions
Lock                        Transaction
...                         ...
synchronized (lock) {       atomic {
    x = x + y;                  x = x + y;
}                           }
...                         ...
Mapping from lock to protected data
 lock protects x
Transaction protects all data
 No need to worry if another lock is necessary to protect y

Transactional Memory at Runtime
What if transactions modify the same data?
 First commit causes other transactions to abort & restart
 Can provide programmer with useful feedback!
Original Code:
... = X + Y;
X = ...

Time   Transaction A     Transaction B
  |    LOAD X
  |    STORE X           LOAD X
  |                      STORE X
  |    Commit X          Violation!
  v                      Re-execute with new data:
                         LOAD X
                         STORE X
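
The commit, violate, and re-execute cycle above can be loosely imitated in plain Java with a compare-and-set loop. This is only an analogy under stated assumptions, not how the hardware TM system detects violations: the class `OptimisticCell` and its methods are hypothetical names, and a single `AtomicInteger` stands in for the shared location X.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: one shared cell X updated optimistically.
class OptimisticCell {
    private final AtomicInteger x = new AtomicInteger(0);
    // Counts re-executions, kept only so the effect can be inspected.
    private final AtomicInteger retries = new AtomicInteger(0);

    // Performs X = X + y, re-executing with new data if another writer got in first.
    int addOptimistically(int y) {
        while (true) {
            int seen = x.get();                      // LOAD X
            int updated = seen + y;                  // compute with the value that was read
            if (x.compareAndSet(seen, updated)) {    // "commit" only if X is unchanged
                return updated;
            }
            retries.incrementAndGet();               // conflict: discard work and retry
        }
    }

    int value() { return x.get(); }
    int retryCount() { return retries.get(); }

    // Runs several workers that each add 1 repeatedly and returns the final X.
    static int demo(int threads, int perThread) {
        OptimisticCell cell = new OptimisticCell();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) cell.addOptimistically(1);
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cell.value();
    }
}
```

No update is lost regardless of how many retries occur; the retry count only reflects how much work was repeated, which is the cost the later slides try to reduce.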

Transactional Memory Related Work
 Transactional Memory: Architectural Support for Lock-Free Data
Structures [Herlihy & Moss 1993]
 Software Transactional Memory [Shavit & Touitou 1995]
Database
 Transaction Processing [Gray & Reuter 1993]
4.7) Nested transactions [Moss 1981]
4.9) Multi-level transactions [Weikum & Schek 1984]
4.10) Open nesting [Gray 1981]
16.7.3) Commit and abort handlers [Eppinger et al. 1991]
Recent Transactional Memory
 Language support for lightweight txs [Harris & Fraser 2003]
 Exceptions and side-effects in atomic blocks [Harris 2004]
 Open nesting in STM [Ni et al. 2007]

Hardware Environment
Chip Multiprocessor
 up to 32 CPUs
 write-back L1
 shared L2
 x86 ISA
Lock evaluation
 MESI protocol
TM evaluation
 L1 buffers speculative data
 Bus snooping detects data dependency violations
[Figure: CMP block diagram. CPU 1, CPU 2, ..., CPU N, each with an L1 and Bus & Snoop Control, connect through Bus Arbiters to a Commit Bus and a Refill Bus in front of the on-chip L2 cache. Changes for TM support are highlighted.]

Software Environment
Virtual Machine
 IBM’s Jikes RVM (Research Virtual Machine) 2.4.2+CVS
 GNU Classpath 0.19
HTM extensions
 VM_Magic methods converted by JIT to HTM primitives
Polyglot
 Translate language extensions to VM_Magic calls

Overview
JavaT [SCP 2006]
Atomos [PLDI 2006]

JavaT: Transactional Execution of Java Programs
Goals
 Run existing Java programs using transactional memory
 Require no new language constructs
 Require minimal changes to program source
 Compare performance of locks and transactions
Non-Goals
 Create a new programming language
 Add new transactional extensions
 Run all Java programs correctly without modification

JavaT: Rules for Translating Java to TM
Three rules create transactions in Java programs
 synchronized defines a transaction
 volatile references define transactions
 Object.wait performs a transaction commit
Supports execution of a variety of programs:
 Histogram based on our ASPLOS 2004 paper
 STM benchmarks from Harris & Fraser, OOPSLA 2003
 SPECjbb2000 benchmark
 All of Java Grande (5 kernels and 3 applications)
Performance comparable or better in almost all cases
Many developers already believe that synchronized means atomic, as opposed to mutual exclusion!

JavaT: Defining transactions with synchronized
synchronized blocks define transactions

Java source                                 Transactional execution
public static void main (String args[]){
    a();                                    a();              // non-transactional
    synchronized (x){                       BeginNestedTX();
        b();                                b();              // transactional
    }                                       EndNestedTX();
    c();                                    c();              // non-transactional
}
We use closed nesting for nested synchronized blocks

Java source                                 Transactional execution
public static void main (String args[]){
    a();                                    a();              // non-transactional
    synchronized (x){                       BeginNestedTX();
        b1();                               b1();             // transaction at level 1
        synchronized (y) {                  BeginNestedTX();
        }                                   EndNestedTX();
    }                                       EndNestedTX();
    c();                                    c();              // non-transactional
}

JavaT: Alternative to rollback on wait
JavaT rules say that Object.wait commits transaction
 Other proposals roll back on wait (or prohibit side effects)
• C.A.R. Hoare’s Conditional Critical Regions (CCRs)
• Harris’s retry keyword
• Welc et al.’s Transactional Monitors
Rollback handles one common pattern of condition variables

synchronized (lock) {
    while (!condition)
        lock.wait();
    ...
}

JavaT: Committing on wait
• So why does JavaT commit on wait?
• Motivating example: A simple barrier implementation
synchronized (lock) {
    count++;
    if (count != thread_count) {
        lock.wait();
    } else {
        count = 0;
        lock.notifyAll();
    }
}
Code like this is found in Sun Java Tutorial, SPECjbb2000, and Java Grande
• With commit, barrier works as intended
• With rollback, all threads think they are first to barrier
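
A runnable version of the barrier makes the point concrete: each arriving thread must publish its `count++` before it blocks, which is what commit on wait preserves and what rollback on wait would undo. The sketch below is an assumption-laden illustration rather than the benchmark code: `SimpleBarrier` and `demo` are hypothetical names, and the `generation` field is an addition not on the slide that guards against spurious wakeups.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical runnable variant of the slide's barrier.
class SimpleBarrier {
    private final Object lock = new Object();
    private final int threadCount;
    private int count = 0;
    private int generation = 0;   // addition, not on the slide: survives spurious wakeups

    SimpleBarrier(int threadCount) { this.threadCount = threadCount; }

    void await() throws InterruptedException {
        synchronized (lock) {
            int arrival = generation;
            count++;                          // must be visible to later arrivals
            if (count != threadCount) {
                while (arrival == generation) {
                    lock.wait();              // releases lock, so count++ is published
                }
            } else {
                count = 0;                    // last arrival resets and releases everyone
                generation++;
                lock.notifyAll();
            }
        }
    }

    // Starts the given number of threads and returns how many passed the barrier.
    static int demo(int threads) {
        SimpleBarrier barrier = new SimpleBarrier(threads);
        AtomicInteger passed = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    barrier.await();
                    passed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return passed.get();
    }
}
```

If the increment were discarded whenever a thread waited, every arrival would observe `count == 0` and no thread would ever take the `else` branch.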

JavaT: Commit on wait tradeoff
Major positive of commit on wait
 Allows transactional execution of existing Java code
Major negative of commit on wait
 Nested transaction problem
 We don’t want to commit value of “a” when we wait:
synchronized (x) {
    a = true;
    synchronized (y) {
        while (!b)
            y.wait();
        c = true;
    }
}
 With locks, wait releases specific lock
 With transactions, wait commits all outstanding transactions
 In practice, nesting examples are very rare
• It is bad to wait while holding a lock
• wait and notify are usually used for unnested top level coordination
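
The lock-side behaviour, where waiting releases only the specific lock, can be observed from a second thread. The sketch below is a hedged illustration: it uses `ReentrantLock` and `Condition` from `java.util.concurrent.locks` in place of the monitors x and y (an assumption made so the second thread can probe with `tryLock`), and `NestedWaitDemo` and `observe` are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo: explicit locks stand in for the monitors x and y.
class NestedWaitDemo {
    // Returns {couldTakeX, couldTakeY} as seen by a second thread while the
    // first thread waits on y's condition after acquiring both locks.
    static boolean[] observe() {
        ReentrantLock x = new ReentrantLock();
        ReentrantLock y = new ReentrantLock();
        Condition bChanged = y.newCondition();
        boolean[] b = {false};                       // guarded by y
        CountDownLatch holdingBoth = new CountDownLatch(1);

        Thread waiter = new Thread(() -> {
            x.lock();
            try {
                y.lock();
                try {
                    holdingBoth.countDown();
                    while (!b[0]) {
                        bChanged.await();            // releases y only, x stays held
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    y.unlock();
                }
            } finally {
                x.unlock();
            }
        });
        waiter.start();

        try {
            holdingBoth.await();
            // y becomes available only once the waiter is blocked in await().
            boolean gotY = y.tryLock(10, TimeUnit.SECONDS);
            boolean gotX = x.tryLock();              // expected to fail: waiter still holds x
            if (gotX) x.unlock();
            if (gotY) {
                b[0] = true;                         // satisfy the condition and wake the waiter
                bChanged.signalAll();
                y.unlock();
            } else {
                waiter.interrupt();                  // give up rather than hang
            }
            waiter.join();
            return new boolean[] {gotX, gotY};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With locks, the outer critical section on x stays closed while the thread waits, so a write such as `a = true` remains hidden. Under the JavaT rule the wait commits all outstanding transactions, which is the nested transaction problem named above.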

JavaT: Keeping Scalable Code Simple
TestCompound benchmark from Harris & Fraser, OOPSLA 2003
Atomic swap of Map elements
Java HashMap, Java Hashtable, ConcurrentHashMap
 Simple lock around swap does not scale
ConcurrentHM Fine
 Use ordered key locks to avoid deadlock
JavaT HashMap
 Use simplest code of Java HM, performs best of all!
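
The two lock-based variants of the swap might look roughly as follows. This is a sketch under stated assumptions, not the TestCompound source: the class `MapSwap`, its method names, and the per-key lock table `KEY_LOCKS` are hypothetical, and both keys are assumed to be present in the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper contrasting the coarse and fine-grained swap strategies.
class MapSwap {
    // Coarse: one lock around the whole compound operation. Simple, but every
    // swap serializes on the same lock.
    static <K, V> void swapCoarse(Map<K, V> map, K k1, K k2) {
        synchronized (map) {
            V v1 = map.get(k1);
            V v2 = map.get(k2);
            map.put(k1, v2);
            map.put(k2, v1);
        }
    }

    // One lock object per key, created on demand (illustrative, never cleaned up).
    private static final ConcurrentHashMap<Object, Object> KEY_LOCKS = new ConcurrentHashMap<>();

    // Fine: lock only the two keys involved, always in key order, so two
    // concurrent swaps of the same pair cannot deadlock.
    static <K extends Comparable<K>, V> void swapFine(ConcurrentHashMap<K, V> map, K k1, K k2) {
        int order = k1.compareTo(k2);
        if (order == 0) return;                      // swapping a key with itself is a no-op
        K first = order < 0 ? k1 : k2;
        K second = order < 0 ? k2 : k1;
        Object lockA = KEY_LOCKS.computeIfAbsent(first, k -> new Object());
        Object lockB = KEY_LOCKS.computeIfAbsent(second, k -> new Object());
        synchronized (lockA) {
            synchronized (lockB) {
                V v1 = map.get(k1);
                V v2 = map.get(k2);
                map.put(k1, v2);                     // assumes both keys exist (no null values)
                map.put(k2, v1);
            }
        }
    }
}
```

The transactional version the slide favors keeps the body of `swapCoarse` and leaves conflict detection to the TM system, so the ordering logic in `swapFine` is not needed.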

SPECjbb2000 Overview
• Java Business Benchmark
 3-tier Java benchmark modeled on TPC-C
 5 ops: order, payment, status, delivery, stock level
• Most updates local to single warehouse
 1% case of inter-warehouse transactions
[Figure: SPECjbb2000 structure. Client Tier: Driver Threads. Transaction Server Tier: Transaction Manager. Database Tier: Warehouses, with order, newOrder, and history stored as B-Trees, plus nextID and YTD.]
JavaT: SPECjbb2000 Results
SPECjbb2000
• Close to linear scaling for transactions and locks up to 32 CPUs
 32 CPU scale limited by bus in simulated CMP configuration

JavaT: Transactional Execution of Java Programs
Goals (revisited)
 Run existing Java programs using transactional memory
• Can run a wide variety of existing benchmarks
 Require no new language constructs
• Used existing synchronized, volatile, and Object.wait
 Require minimal changes to program source
• No changes required for these programs
 Compare performance of locks and transactions
• Generally better performance from transactions
Problem
 Conditional waiting semantics not right for all programs
 What can we do if we can change the language?

Overview
JavaT [SCP 2006]
Atomos [PLDI 2006]

The Atomos Programming Language
Atomos derived from Java
 atomic replaces synchronized
 retry replaces wait/notify/notifyAll
Atomos design features
 Open nested transactions
• open blocks commit a nested child transaction before its parent
• Useful for language implementation but also available for applications
 Commit and Abort handlers
• Allow code to run dependent on transaction outcome
 Watch Sets
• Extension to retry for efficient conditional waiting on HTM systems

Atomos: The counter problem
Application                     JIT Compiler
atomic {                        // method prologue
    ...                         ...
    id = nextId();              invocationCounter++;
    ...                         ...
}                               // method body
static long nextId() {          ...
    atomic {                    // method epilogue
        nextID++;               ...
    }
}
• Lower-level updates to global data can lead to violations

• General problem not confined to counters:
 Application level caching
 Cooperative scheduling in virtual machine
Atomos: Open nested counter solution
Solution
 Wrap counter update in open nested transaction
atomic {
    ...
    id = nextId();
    ...
}
static long nextId() {
    open {
        nextID++;
    }
}
Benefits
 Violation of counter just replays open nested transaction
 Open nested commit discards child’s read-set preventing later violations
Issues
 What happens if parent rolls back after child commits?
 Okay for statistical counters and UID
 Not okay for SPECjbb2000 YTD (year-to-date) payment counters
• Need some way to coordinate with parent transaction

Atomos: Commit and Abort Handlers
Programs can specify callbacks at end of transaction
 Separate interfaces for commit and abort outcomes
public interface CommitHandler { boolean onCommit();}
public interface AbortHandler { boolean onAbort ();}
Historical uses for commit and abort handlers
 DB technique for delaying non-transactional operations
 Harris brought the technique to STM for solving I/O problem
• See Exceptions and side-effects in atomic blocks.
• Buffer output until commit, rewind input on abort
Atomos applications
 EITHER Delay updates to shared data until parent commits
• Update YTD field only when parent is committing
 OR Provide compensation action to open nesting
• Undo YTD update when parent is aborted
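
The two strategies for the YTD field can be sketched in plain Java. Everything beyond the two handler interfaces is an assumption for illustration: `HandlerTx` is a hypothetical stand-in that only records and runs handlers (it provides no isolation), `YtdAccount` is a made-up holder for the counter, and the boolean return value is ignored because the slides do not say what it means.

```java
import java.util.ArrayList;
import java.util.List;

// Handler interfaces as shown on the slide (declared package-private here).
interface CommitHandler { boolean onCommit(); }
interface AbortHandler { boolean onAbort(); }

// Hypothetical stand-in for a parent transaction: it only tracks handlers.
class HandlerTx {
    private final List<CommitHandler> commitHandlers = new ArrayList<>();
    private final List<AbortHandler> abortHandlers = new ArrayList<>();

    void registerCommit(CommitHandler h) { commitHandlers.add(h); }
    void registerAbort(AbortHandler h) { abortHandlers.add(h); }

    // Return values are ignored in this sketch.
    void commit() { for (CommitHandler h : commitHandlers) h.onCommit(); }
    void abort()  { for (AbortHandler h : abortHandlers) h.onAbort(); }
}

// Hypothetical holder for a year-to-date payment total.
class YtdAccount {
    long ytd = 0;

    // EITHER: delay the shared update until the parent commits.
    void payDeferred(HandlerTx parent, long amount) {
        parent.registerCommit(() -> { ytd += amount; return true; });
    }

    // OR: update eagerly (as an open-nested child would) and register a
    // compensating action in case the parent later aborts.
    void payCompensated(HandlerTx parent, long amount) {
        ytd += amount;
        parent.registerAbort(() -> { ytd -= amount; return true; });
    }
}
```

In both variants the shared total ends up reflecting only payments whose parent transaction committed; the difference is whether other threads can see the update before the parent's outcome is known.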

Atomos: SPECjbb2000 Results
SPECjbb2000
 Difference between JavaT and Atomos result is handler overhead
 Overhead has negligible impact, Atomos still outperforms Java

Atomos Summary
Atomos similarities to other proposals
 atomic, retry, and commit/abort handlers
Atomos differences
 Open nested transactions for reduced isolation
 watch allows for scalable HTM retry implementation
Open nested transactions controversial

 Some uses straightforward
 More sophisticated uses require proper handlers
Can we give programmers the benefits of open nesting
without expecting them to use it directly?

Overview
JavaT [SCP 2006]
Atomos [PLDI 2006]

What happens to SPECjbb with long transactions?
Old: SPECjbb could scale
 Open nesting addresses counters
 Only 1% of operations touch other warehouse data structures
New: high-contention SPECjbb
 All threads in 1 warehouse
 All transactions touch some shared Map
Open nested results not much better than Baseline
[Figure: High-contention SPECjbb Results]

Violations in logically independent operations
[Figure: The Map starts as {1 => …, 2 => …} with size=2. TX #1 calls put(3,…) and TX #2 calls put(4,…), each in a closed-nested transaction. TX #1 commits, leaving {1 => …, 2 => …, 3 => …} with size=3, and TX #2 aborts.]

Unwanted data dependencies limit scaling
Data structure bookkeeping causing serialization
 Frequent HashMap and TreeMap violations updating size
and modification counts
With short transactions

 Enough parallelism from operations that do not conflict to
make up for the ones that do conflict
With long transactions

 Too much lost work from conflicting operations
How can we eliminate unwanted dependencies?

Reducing unwanted dependencies
Custom hash table
 Don’t need size or modCount? Build stripped down Map
 Disadvantage: Do not want to custom build data structures
Open-nested transactions
 Allows a child transaction to commit before parent
 Disadvantage: Lose transactional atomicity
Segmented hash tables
 Use ConcurrentHashMap (or similar approaches)
• Compiler and Runtime Support for Efficient STM, Intel, PLDI 2006
 Disadvantage:
Reduces, but does not eliminate, unnecessary violations
Is this reduction of violations good enough?

Semantic Concurrency Control
Database concept of multi-level transactions
 Release low-level locks on data after acquiring higher-level
locks on semantic concepts such as keys and size
Example
 Before releasing lock on B-tree node containing key 7
record dependency on key 7 in lock table
 B-tree locks prevent races – lock table provides isolation
[Figure: B-tree with root 4, interior nodes 2 and 6, and leaves 1, 3, 5, 7]
Lock table
TX#    Key  Mode
…      …    …
#2317  7    Read
…      …    …

Semantic Concurrency Control
Applying Semantic Concurrency Control to TM
 Avoid retaining memory level dependencies
 Replace with semantic dependencies
 Add conflict detection on semantic properties
Transactional Collection Classes

 Avoid memory level dependencies on size field, …
 Replace with semantic dependencies on keys, size, …
 Only detect semantic conflicts that are necessary
No more memory conflicts on implementation details

Benefits of Transactional Collection Classes
Programmer just uses the usual collection interfaces
 Code change as simple as replacing
Map map = new HashMap();
 with
Map map = new TransactionalMap();
Similar interface coverage to util.concurrent
 Maps: TransactionalMap, TransactionalSortedMap
 Sets: TransactionalSet, TransactionalSortedSet
 Queue: TransactionalQueue
Only library writers deal directly with open nesting

 Similar to java.util.concurrent.atomic

Implementing Transactional Collection Classes
General Approach
 Read operations
• Acquire semantic dependency
• Open nesting reads underlying state
 Write operations
• Buffer changes until commit
 On commit
• Apply buffer to underlying state
• Check for semantic conflicts
 On commit and abort
• Release semantic dependencies
Simplified Map example
 get(key) adds dependencies on key, returns value from underlying map
 put(key,value) writes to thread local buffer
 On commit: apply buffer to underlying map, violate transactions that depended on the keys we are writing
 On commit and abort: remove key dependencies
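
The simplified Map example can be modeled in a few lines of ordinary Java. This is a single-threaded sketch under stated assumptions, not the Transactional Collection Classes implementation: the class `SketchTransactionalMap` is hypothetical, transactions are identified by plain integers passed in by the caller, open nesting and handlers are replaced by explicit `commit` and `abort` calls, and `put` returns nothing so that it does not count as a read of the key.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical single-threaded model of the simplified Map example.
class SketchTransactionalMap<K, V> {
    private final Map<K, V> underlying = new HashMap<>();
    // Semantic dependencies: key -> ids of transactions that have read it.
    private final Map<K, Set<Integer>> keyReaders = new HashMap<>();
    // Per-transaction write buffers: tx id -> buffered puts.
    private final Map<Integer, Map<K, V>> writeBuffers = new HashMap<>();

    // Read: record a dependency on the key, then read own buffer or underlying map.
    V get(int tx, K key) {
        keyReaders.computeIfAbsent(key, k -> new HashSet<>()).add(tx);
        Map<K, V> buffer = writeBuffers.get(tx);
        if (buffer != null && buffer.containsKey(key)) return buffer.get(key);
        return underlying.get(key);
    }

    // Write: buffer the change until commit.
    void put(int tx, K key, V value) {
        writeBuffers.computeIfAbsent(tx, t -> new HashMap<>()).put(key, value);
    }

    // Commit: apply the buffer, collect other transactions that depended on
    // the written keys (they must be violated), then release dependencies.
    Set<Integer> commit(int tx) {
        Set<Integer> violated = new HashSet<>();
        Map<K, V> buffer = writeBuffers.getOrDefault(tx, new HashMap<>());
        for (Map.Entry<K, V> e : buffer.entrySet()) {
            underlying.put(e.getKey(), e.getValue());
            Set<Integer> readers = keyReaders.get(e.getKey());
            if (readers != null) violated.addAll(readers);
        }
        violated.remove(tx);                 // a transaction never violates itself
        release(tx);
        return violated;
    }

    // Abort: discard the buffer and release dependencies.
    void abort(int tx) { release(tx); }

    private void release(int tx) {
        writeBuffers.remove(tx);
        for (Set<Integer> readers : keyReaders.values()) readers.remove(tx);
    }

    int size() { return underlying.size(); }
}
```

Two transactions that put different keys both commit without violating each other, even though the underlying map's size changes twice. A transaction that has called `get` on a key is reported as violated when another transaction commits a `put` to that key.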

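Example of non-conflicting put operations
[Figure, animation frames overlaid: The Underlying Map starts as {a => 50, b => 17} with size=2. TX #1 calls put(c,23) and TX #2 calls put(d,42), each open-nested. Dependencies become {c => [1], d => [2]}. The Write Buffer of TX #1 holds {c => 23} and that of TX #2 holds {d => 42}. TX #1 commit and handler execution give {a => 50, b => 17, c => 23} with size=3; TX #2 commit and handler execution give {a => 50, b => 17, c => 23, d => 42} with size=4. Buffers and dependencies end as {}.]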

Example of conflicting put and get operations
[Figure, animation frames overlaid: The Underlying Map starts as {a => 50, b => 17} with size=2. TX #1 calls put(c,23) and TX #2 calls get(c), each open-nested. Dependencies become {c => [1,2]}. The Write Buffer of TX #1 holds {c => 23} and that of TX #2 stays {}. TX #1 commit and handler execution give {a => 50, b => 17, c => 23} with size=3; TX #2 aborts and its handler executes.]

Benefits of Semantic Concurrency Approach
Transactional Collection Class works with abstract type
 Can work with any conforming implementation
 HashMap, TreeMap, …
Avoids implementation specific violations

 Not just size and mod count
 HashTable resizing does not abort parent transactions
 TreeMap rotations invisible as well

High-contention SPECjbb2000 results
Java Locks
 Short critical sections
Atomos Baseline
 Full protection of logical ops
Atomos Open
 Use simple open-nesting for UID generation
Atomos Transactional
 Change to Transactional Collection Classes
Performance Limit?
 Semantic violations from calls to SortedMap.firstKey()
SortedMap dependency
SortedMap use overloaded
1. Lookup by ID
2. Get oldest ID for deletion
Replace with Map and Queue
1. Use Map for lookup by ID
2. Use Queue to find oldest
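
The restructuring can be sketched with ordinary `java.util` collections. The names below are illustrative and not taken from SPECjbb2000: `OrderTable` is a hypothetical class, plain `HashMap` and `ArrayDeque` stand in for the transactional Map and Queue, and IDs are assumed to be added in increasing order so that arrival order matches what `SortedMap.firstKey()` would have returned.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical order table that separates the two uses of the old SortedMap.
class OrderTable<V> {
    private final Map<Integer, V> byId = new HashMap<>();      // use 1: lookup by ID
    private final Queue<Integer> arrival = new ArrayDeque<>(); // use 2: find oldest

    // Adds a new order; IDs are assumed to arrive in increasing order.
    void add(int id, V order) {
        if (byId.put(id, order) == null) {
            arrival.add(id);                 // only first insertion joins the queue
        }
    }

    V lookup(int id) { return byId.get(id); }

    // Removes and returns the oldest order, or null if the table is empty.
    V removeOldest() {
        Integer id = arrival.poll();
        return id == null ? null : byId.remove(id);
    }

    int size() { return byId.size(); }
}
```

With this split, a thread that adds a new ID no longer conflicts semantically with a thread that asked for the first key, because finding the oldest entry touches only the head of the queue.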

What else could we do?
 Split larger transactions into smaller ones
 In the limit, we can end up with transactions matching the short critical regions of Java
Return on investment
 Coarse grained transactional version is giving almost 8x on 16 processors
 Coarse grained lock version would not have scaled at all
Focus on correctness, tune for performance

SPECjbb2000 Return on Investment
Version        Speedup on 16 CPUs  Effort
Baseline       1.6                 1 atomic statement
Open           2.7                 4 open statements
Transactional  4.1                 2 Transactional Map, 1 TxnSortedMap, 2 transactional counters
Queue          7.8                 Change TxnSortedMap to TxnMap/TxnQueue (2 new calls: Queue.add & Queue.remove)
Short          12.5                272 atomic statements
Java           13.0                272 synchronized statements
Atomos: 14 changes, 7.8x
Java: 272 changes, 13x

Semantic Concurrency Control Summary
Transactional memory promises to ease parallelization
 Need to support coarse grained transactions
Need to access shared data from within transactions

 While composing operations atomically
 While avoiding unnecessary data dependency violations
 While still having reasonable performance!

Transactional Collection Classes
 Provide needed scalability through familiar library interfaces of Map, SortedMap, Set, SortedSet, and Queue
 Remove need for direct use of open nested transactions

Overview
JavaT [SCP 2006]
Atomos [PLDI 2006]

Summary
Thesis:
If transactional memory is to make parallel programming
easier, rather than just more scalable, the programming
interface requires more than simple atomic transactions
JavaT
 Transactions alone cannot run all existing Java programs
due to incompatibility of monitor conditional waiting
Atomos Programming Language
 Features to support reduced isolation and integration of
non-transactional operations through handlers
 Using semantic concurrency control to improve scalability of
applications using long transactions

Future Work
Transaction-aware I/O libraries
 Semantic concurrency control for structured files such as b-trees
 Support for automatically buffering OutputStreams and Writers
 Support for application logging within transactions
Integrating with other transactional systems (distributed transactions)

 Treat TM as resource manager like DB or transactional file system
Programming Language
 Language support for loop based parallelism
 Task-based, rather than thread-based, models
Virtual Machines
 Garbage Collector

Acknowledgements
• My wife Jennifer and kids Michael, Daniel, and Bethany
• My parents David and Elizabeth
• My advisors Kunle Olukotun and Christos Kozyrakis
• My committee Dawson Engler, Margot Gerritsen, John Mitchell
• Jared Casper, Hassan Chafi, JaeWoong Chung, Austen McDonald
and the rest of TCC group for the simulator and everything else
• Andrew Selle and Jacob Leverich for all those cycles
• Norman Adams, Marc Brown, and John Ellis for encouraging me to
go back to school
• Everyone at Ariba that made it possible to go back to school
• Olin Shivers and Tom Knight and the MIT UROP program for inspiring
me to do research as an undergraduate
• Intel for my PhD fellowship
• DARPA, not just for supporting me for the last five years, but for
employing my father for my first five years…
