TCC Thesis BDC Defense
Brian D. Carlstrom
Uniprocessor Performance Trends (SPECint)
[Figure: performance vs. VAX-11/780 on a log scale (1 to 1000), 1978 to 2006; growth of 25%/year, then 52%/year, then ??%/year?]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
Programming with Transactional Memory 2
Parallel Programming for the Masses?
Every programmer is now a parallel programmer
The black arts now need to be taught to undergraduates
• IBM and Sun went multi-core first on the server side

Year  Microprocessor   Proc/chip  Thread/proc  Thread/chip
2004  IBM POWER5       2          2            4
2005  Azul Vega 1      24         1            24
2005  Sun Niagara 1    8          4            32
2005  AMD Opteron      2          1            2
2006  Intel Woodcrest  2          2            4
[Diagram labels: Transaction A, Transaction B, LOAD X]
Chip Multiprocessor
• up to 32 CPUs
• write-back L1
• shared L2
x86 ISA
Lock evaluation
• MESI protocol
[Diagram: CPU 1, CPU 2, ..., CPU N, each with an L1 cache and Bus & Snoop Control, connected through Bus Arbiters and a Commit Bus]
HTM extensions
VM_Magic methods converted by JIT to HTM primitives
Polyglot
Translate language extensions to VM_Magic calls
Goals
Run existing Java programs using transactional memory
Require no new language constructs
Require minimal changes to program source
Compare performance of locks and transactions
Non-Goals
Create a new programming language
Add new transactional extensions
Run all Java programs correctly without modification
• Java Business Benchmark
3-tier Java benchmark modeled on TPC-C
5 ops: order, payment, status, delivery, stock level
• Most updates local to single warehouse
1% case of inter-warehouse transactions
[Diagram labels: Warehouse; order (B-Tree); newOrder (B-Tree); history (B-Tree)]
JavaT: SPECjbb2000 Results
SPECjbb2000
• Close to linear scaling for transactions and locks up to 32 CPUs
32 CPU scale limited by bus in simulated CMP configuration
Goals (revisited)
Run existing Java programs using transactional memory
• Can run a wide variety of existing benchmarks
Require no new language constructs
• Used existing synchronized, volatile, and Object.wait
Require minimal changes to program source
• No changes required for these programs
Compare performance of locks and transactions
• Generally better performance from transactions
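The mapping from existing constructs to transactions can be illustrated with ordinary Java. The sketch below is plain, lock-based Java that runs on any JVM; the class and method names (`SynchronizedAsTransaction`, `addStock`, `takeStock`, `available`) are hypothetical and not taken from the benchmarks. The assumption, following the bullets above, is that a JavaT-style system would execute each `synchronized` block as a transaction and treat `wait` as a conditional-waiting point, with no change to this source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class: an ordinary monitor-based inventory.
final class SynchronizedAsTransaction {
    private final Map<String, Integer> stock = new HashMap<>();

    // On a conventional JVM this block holds the monitor of `this`.
    // Assumption: under JavaT the same block would run as one transaction.
    void addStock(String item, int quantity) {
        synchronized (this) {
            stock.merge(item, quantity, Integer::sum);
            notifyAll(); // wake any thread waiting for stock
        }
    }

    // Blocks until enough stock is available, then removes it.
    // Returns the quantity taken, or 0 if the thread was interrupted.
    int takeStock(String item, int quantity) {
        synchronized (this) {
            while (stock.getOrDefault(item, 0) < quantity) {
                try {
                    wait(); // conditional waiting inside the atomic region
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return 0;
                }
            }
            stock.merge(item, -quantity, Integer::sum);
            return quantity;
        }
    }

    int available(String item) {
        synchronized (this) {
            return stock.getOrDefault(item, 0);
        }
    }

    public static void main(String[] args) {
        SynchronizedAsTransaction inventory = new SynchronizedAsTransaction();
        inventory.addStock("widget", 5);
        inventory.takeStock("widget", 3);
        System.out.println("widgets left: " + inventory.available("widget"));
    }
}
```

The point of the sketch is that nothing in the source is transaction-specific: the comparison between locks and transactions uses the same program text.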
Problem
Conditional waiting semantics not right for all programs
What can we do if we can change the language?
[Figures: two transactions (TX #1 and TX #2) operating on a Map and its underlying map. Both start with size=2; each inserts locally (size=3); TX #1 commits and TX #2 aborts. Later panels show each transaction's write buffer, the dependencies between them, and commit/abort handler execution.]
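The conflict shown in the diagrams can be sketched as a small simulation. This is not the thesis implementation: `TxFootprint`, `putNewKey`, `memoryConflict`, and `semanticConflict` are hypothetical names, and the model assumes that inserting a new key reads and writes a shared `size` field and that distinct keys land in distinct buckets. Under those assumptions, two inserts of different keys collide at the memory level even though they are logically independent.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of memory-level vs. semantic conflict detection.
final class MapConflictSketch {

    // The memory locations one transaction read and wrote.
    static final class TxFootprint {
        final Set<String> reads = new HashSet<>();
        final Set<String> writes = new HashSet<>();
    }

    // Assumed footprint of map.put(key, value) for a key not yet present:
    // it increments the shared size field and writes the key's bucket.
    static TxFootprint putNewKey(String key) {
        TxFootprint tx = new TxFootprint();
        tx.reads.add("size");
        tx.writes.add("size");
        tx.writes.add("bucket:" + key);
        return tx;
    }

    // Memory-level check: the committing transaction wrote a location
    // that the other transaction has read or written.
    static boolean memoryConflict(TxFootprint committer, TxFootprint other) {
        for (String location : committer.writes) {
            if (other.reads.contains(location) || other.writes.contains(location)) {
                return true;
            }
        }
        return false;
    }

    // Simplified semantic check for two puts: they conflict only when
    // they target the same key.
    static boolean semanticConflict(String committedKey, String otherKey) {
        return committedKey.equals(otherKey);
    }

    public static void main(String[] args) {
        TxFootprint tx1 = putNewKey("a");
        TxFootprint tx2 = putNewKey("b");
        System.out.println("memory-level conflict: " + memoryConflict(tx1, tx2));
        System.out.println("semantic conflict: " + semanticConflict("a", "b"));
    }
}
```

In this model TX #2 is aborted only because of the shared `size` field, which is the kind of unnecessary violation that semantic concurrency control aims to avoid.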
Performance Limit?
Semantic violations from calls to SortedMap.firstKey()
High-contention SPECjbb2000 results
SortedMap dependency
SortedMap use overloaded
1. Lookup by ID
2. Get oldest ID for deletion
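The two overloaded uses listed above can be sketched with the standard `java.util.SortedMap` API. The class `SortedMapUses` and its methods are hypothetical, and the sketch assumes that smaller IDs are older. The lookup touches only one key, while the deletion path calls `firstKey()`, which depends on the current minimum of the whole map and so, as a hedged reading of the slide, creates a dependency on every transaction that adds or removes the smallest entry.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical example of one SortedMap serving two different purposes.
final class SortedMapUses {
    private final SortedMap<Integer, String> ordersById = new TreeMap<>();

    void add(int id, String order) {
        ordersById.put(id, order);
    }

    // Use 1: lookup by ID. Depends only on the entry for `id`.
    String lookup(int id) {
        return ordersById.get(id);
    }

    // Use 2: get the oldest ID for deletion. firstKey() depends on the
    // minimum key, so concurrent changes to the oldest entry interfere.
    String removeOldest() {
        if (ordersById.isEmpty()) {
            return null;
        }
        Integer oldestId = ordersById.firstKey(); // assumes smallest ID is oldest
        return ordersById.remove(oldestId);
    }

    public static void main(String[] args) {
        SortedMapUses orders = new SortedMapUses();
        orders.add(3, "order-3");
        orders.add(1, "order-1");
        orders.add(2, "order-2");
        System.out.println(orders.lookup(2));
        System.out.println(orders.removeOldest());
    }
}
```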
Return on investment
• Coarse grained transactional version is giving almost 8x on 16 processors
• Coarse grained lock version would not have scaled at all
• Focus on correctness, tune for performance
JavaT
Transactions alone cannot run all existing Java programs due to incompatibility of monitor conditional waiting
Atomos Programming Language
Features to support reduced isolation and integration of non-transactional operations through handlers
Transactional Collection Classes
Using semantic concurrency control to improve scalability of
applications using long transactions
Programming Language
Language support for loop based parallelism
Task-based, rather than thread-based, models
Virtual Machines
Garbage Collector