Software Transactional Memory: Why Is It Only a Research Toy?
doi:10.1145/1400214.1400228
The promise of STM may likely be undermined by its overheads and workload applicabilities.

by Călin Cașcaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee

Transactional memory (TM) [13] is a concurrency control paradigm that provides atomic and isolated execution for regions of code. TM is considered by many researchers to be one of the most promising solutions to address the problem of programming multicore processors. Its most appealing feature is that most programmers only need to reason locally about shared data accesses, mark the code region to be executed transactionally, and let the underlying system ensure the correct concurrent execution. This model promises to provide the scalability of fine-grained locking while avoiding common pitfalls of lock composition such as deadlock. In this article, we explore the performance of a highly optimized STM and observe the overall performance of TM is much worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm.

Different implementations of transactional memory systems make tradeoffs that impact both performance and programmability. Larus and Rajwar [16] present an overview of design trade-offs for implementations of transactional memory systems. We summarize some of the design choices here:
˲˲ Software-only (STM) [7, 10, 12, 14, 18, 23, 25] is the focus here. While offering flexibility and no hardware cost, it leads to overhead in excess of most users' tolerance.
˲˲ Hardware-only (HTM) [2, 4, 9, 13, 19, 20, 35] suffers from two major impediments: high implementation and verification costs lead to design risks too large to justify on a niche programming model; hardware capacity constraints lead to significant performance degradation when overflow occurs, and proposals for managing overflows (for example, signatures [5]) incur false positives that add complexity to the programming model. Therefore, from an industrial perspective, HTM designs have to provide more benefits for the cost, on a more diverse set of workloads (with varying transactional characteristics) for hardware designers to consider implementation. (Reuse of hardware for other purposes can also justify its inclusion, as the case may be for Sun's implementation of Scout Threading in the Rock processor [32].)
˲˲ Hybrid [1, 6, 24, 28] is the most likely platform for the eventual adoption of TM by a wide audience, although the exact mix of hardware and software support remains unclear.

A special case of the hybrid systems are hardware-accelerated STMs. In this scenario, the transactional semantics are provided by the STM, and hardware primitives are only used to speed up critical performance bottlenecks in the STM. Such systems could offer an attractive solution if the cost of hardware primitives is modest and may be further amortized by other uses in the system.

Independent of these implementation decisions, there are transactional semantics issues that break the ideal transactional programming model for which the community had hoped. TM introduces a variety of programming issues that are not present in lock-based mutual exclusion. For example, semantics are muddled by:
˲˲ Interaction with non-transactional codes, including access to shared data from outside of a transaction (tolerating weak atomicity) and the use of locks inside a transaction (breaking isolation to make locking operations visible outside transactions);
˲˲ Exceptions and serializability: how to handle exceptions and propagate consistent exception information from within a transactional context, and how to guarantee that transactional execution respects a correct ordering of operations;
˲˲ Interaction with code that cannot be transactionalized, due to either communication with other threads or a requirement barring speculation;
˲˲ Livelock, or the system guarantee that all transactions make progress even in the presence of conflicts.

In addition to the intrinsic semantic issues, there are also implementation-specific optimizations motivated by high transactional overheads, such as programmer annotations for excluding private data. Furthermore, the nondeterminism introduced by aborting transactions complicates debugging: transactional code may be executed and aborted on conflicts, which makes it difficult for the programmer to find deterministic paths with repeatable behavior. Both of these dilute the productivity argument for transactions, especially software-only TM implementations.

Given all these issues, we conclude that TM has not yet matured to the point where it presents a compelling value proposition that will trigger its widespread adoption. While TM can be a useful tool in the parallel programmer's portfolio, it is our view that it is not going to solve the parallel programming dilemma by itself. There is evidence that it helps with building certain concurrent data structures, such as hash tables and binary trees. In addition, there are anecdotal claims that it helps with workloads; however, despite several years of active research and publication in the area, we are disappointed to find no mentions in the research literature of large-scale applications that make use of TM. The STAMP [30] and Lonestar [17] benchmark suites are promising starts, but have a long way to go to be representative of full applications.

We base these conclusions on our work over the past two years building a state-of-the-art STM runtime system and compiler framework, the freely available IBM STM [31]. Here, we describe this experience, starting with a discussion of STM algorithms and design decisions. We then compare the performance of this STM with two other state-of-the-art implementations (the Intel STM [14] and the Sun TL2 STM [7]) as well as dissect the operations executed by the IBM STM and provide a detailed analysis of the performance hotspots of the STM.

Software Transactional Memory
STM implements all the transactional semantics in software. That includes conflict detection, guaranteeing the consistency of transactional reads, preservation of atomicity and isolation (preventing other threads from observing speculative writes before the transaction succeeds), and conflict resolution (transaction arbitration). The pseudo-code for the main operations executed by a typical STM is illustrated in Figure 1. We show two STM algorithms, one that performs full validation and one that uses a global version number (the additional statements marked with the gv# comment).

Figure 1: (a) Pseudo-code for STM begin; (b) Pseudo-code for STM validate; (c) Pseudo-code for STM read barrier; (d) Pseudo-code for STM end.

The advantage of an STM for system programmers is that it offers flexibility in implementing different mechanisms and policies for these operations. For
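The begin, read-barrier, and end operations of a global-version-number STM can be sketched in a few dozen lines. What follows is a minimal illustration under stated assumptions, not the IBM STM or any of the runtimes measured in this article: the names `TVar`, `Txn`, `Conflict`, and `atomically` are hypothetical, writes are buffered until commit, and an aborted transaction is simply retried with no contention management.

```python
import threading

class Conflict(Exception):
    """Signals that the transaction must abort and retry."""

class TVar:
    """Shared location: data plus per-location metadata (version, lock)."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

_gv = 0                          # global version number ("gv#")
_gv_guard = threading.Lock()     # protects increments of _gv

class Txn:
    def __init__(self):          # STM begin
        self.start = _gv         # snapshot of the global version
        self.reads = []          # read set
        self.writes = {}         # write set: TVar -> speculative value

    def read(self, tv):          # STM read barrier
        if tv in self.writes:    # read-after-write: return own buffered value
            return self.writes[tv]
        seen = tv.version
        value = tv.value
        # Abort if a writer holds the location or it changed after our snapshot.
        if tv.lock.locked() or tv.version != seen or seen > self.start:
            raise Conflict()
        self.reads.append(tv)
        return value

    def write(self, tv, value):  # STM write barrier: buffer, do not publish
        self.writes[tv] = value

    def commit(self):            # STM end
        global _gv
        if not self.writes:      # read-only: every read was already checked
            return
        held = []
        try:
            for tv in self.writes:                     # acquire metadata
                if not tv.lock.acquire(blocking=False):
                    raise Conflict()
                held.append(tv)
            with _gv_guard:                            # increment gv#
                _gv += 1
                commit_version = _gv
            for tv in self.reads:                      # validate read set
                if tv.version > self.start or (
                        tv.lock.locked() and tv not in self.writes):
                    raise Conflict()
            for tv, value in self.writes.items():      # write data
                tv.value = value
                tv.version = commit_version
        finally:
            for tv in held:                            # release metadata
                tv.lock.release()

def atomically(fn):
    """Run fn(txn) as a transaction, retrying whenever it conflicts."""
    while True:
        txn = Txn()
        try:
            result = fn(txn)
            txn.commit()
            return result
        except Conflict:
            continue

if __name__ == "__main__":
    a, b = TVar(100), TVar(0)
    def transfer(txn):
        txn.write(a, txn.read(a) - 10)
        txn.write(b, txn.read(b) + 10)
    atomically(transfer)
    print(a.value, b.value)      # prints "90 10"
```

Even in this toy form, every transactional read pays for a metadata load, a lock check, and a version comparison, which is the kind of per-access cost the measurements in this article quantify.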
November 2008 | Vol. 51 | No. 11 | Communications of the ACM 41
practice
Figure 2: Scalability results for three STM runtimes on a quad-core Intel Xeon server: IBM, Intel STM v2, and Sun TL2. (Panel: delaunay; series: Intel, IBM, Sun TL2; y-axis: scalability normalized to sequential.)

Figure 3: Scalability results for manual and compiler instrumented benchmarks on AIX PowerPC with IBM XLCSTM compiler. (y-axis: speedup; x-axis: # of threads, 1 to 8.)

Figure 4: Single-threaded overhead of the STM algorithms. (Series: fv, gv#; y-axis: runtime (norm. to sequential); benchmarks: b+tree, delaunay, kmeans, genome, vacation; values printed in the plot area: 118.1, 43.8, 49.2.)

these overheads can become a high hurdle for STM to achieve performance. The sequential overheads (that is, conflict-free overheads that are incurred regardless of the actions of other concur-

it must continue to be accessed transactionally. With some STM designs, the

ally cannot support transactions calling legacy codes that are not instrumented (for example, third-party libraries) without seriously limiting concurrency, such as by serializing transactions.

Evaluation
Here we use the following set of benchmarks:
˲˲ b+tree is an implementation of database indexing operations on a b-tree data structure for which the data is stored only on the tree leaves. This implementation uses coarse-grain transactions for every tree operation. Each b+tree operation starts from the tree root and descends down to the leaves. A leaf update may trigger a structural modification to rebalance the tree. A rebalancing operation often involves recursive ascent over the child-parent edges. In the worst case, the rebalancing operation modifies the entire tree. Our workload inserts 2,048 items in a b+tree of order 20. For this code we have only a transactional version that is not manually instrumented, therefore experimental results are presented only in configurations where we can use our compiler to provide instrumentation;
˲˲ delaunay implements the Delaunay Mesh Refinement algorithm described in Kulkarni et al. [15] The code produces a guaranteed quality Delaunay mesh. This is a Delaunay triangulation with the additional constraint that no angle in the mesh be less than 30 degrees. The benchmark takes as input an unrefined Delaunay triangulation and produces a new triangulation that satisfies this constraint. In the TM implementation of the algorithm, multiple threads choose their elements from a work-queue and refine the cavities as separate transactions.
˲˲ genome, kmeans, and vacation are part of the STAMP benchmark suite [19] version 0.9.4. For a detailed description of these benchmarks see STAMP [30].

Baseline Performance. In Figure 2 we present a performance comparison of three STMs: the IBM [31, 34], Intel [14], and Sun's TL2 [7] STMs. The runs are on a quad-core, two-way hyperthreaded Intel Xeon 2.3GHz box running Linux Fedora Core 6. In these runs, we used the manually instrumented versions of the codes that aggressively minimize the number of barriers for the IBM and TL2 STMs. Since we do not have access to low-level APIs for the Intel STM, the curves for the Intel STM are from codes instrumented by its compiler, which incur additional barrier overheads due to compiler instrumentation [36]. The graphs are scalability curves with respect to the serial, non-transactionalized version. Therefore a value of 1 on the y-axis represents performance equal to the serial version. The performance of these STMs is mostly on par, with the IBM STM showing better scalability on delaunay and TL2 obtaining better scalability on genome. However, the overall performance obtained is very low: on kmeans the IBM STM barely attains single thread performance at 4 threads, while on vacation none of the STMs actually overcome the overhead of transactional memory even with 8 threads.

Compiler Instrumentation. The compiler is a necessary component of an STM-based programming environment that is to be adopted by mass programmers. Its basic role is to eliminate the need for programmers to manually instrument memory references to STM read- and write-barriers. While offering convenience, compiler instrumentation does add another layer of overheads to the STM system by introducing redundant barriers, often due to conservativeness of compiler analysis, as also observed in Yoo [36].

Figure 3 provides another baseline: the overhead of compiler instrumentation. The performance is measured on a 16-way POWER5 running AIX 5.3. For the STMXLC curve, we use the uninstrumented versions of the codes and annotate transactional regions and functions using the language extensions provided by the compiler [31].
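The scalability curves in this evaluation are normalized to the serial, non-transactionalized run, so a value of 1 means parity with sequential code. The short sketch below works through that arithmetic under an idealized model that is our assumption, not a result from the article: the STM run at one thread is `single_thread_overhead` times slower than sequential, and each added thread contributes a fixed `parallel_efficiency`. The function names and the sample numbers are hypothetical.

```python
import math

def normalized_scalability(t_sequential, t_stm):
    """Value plotted on the y-axis: sequential time divided by STM time."""
    return t_sequential / t_stm

def break_even_threads(single_thread_overhead, parallel_efficiency=1.0):
    """Smallest thread count at which the normalized scalability reaches 1.

    Idealized model (an assumption): T_stm(n) = overhead * T_seq / (n * efficiency),
    so scalability(n) = n * efficiency / overhead.
    """
    return math.ceil(single_thread_overhead / parallel_efficiency)

if __name__ == "__main__":
    # A run that takes twice as long as sequential plots at 0.5.
    print(normalized_scalability(10.0, 20.0))
    # A hypothetical 6x single-thread slowdown at 75% efficiency.
    print(break_even_threads(6.0, 0.75))
```

Under this model, a single-thread slowdown of k needs at least k perfectly scaling threads just to match sequential code, which is why high sequential overheads translate directly into poor results at low thread counts.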
Figure 5: Percentage of time spent in different STM operations. (Legend: other, end, malloc, begin, desc, read, free, write, stack_range, kernel; bars for fv and gv# on b+tree, delaunay, kmeans, genome, vacation.)

Figure 6: Percentage of time spent in STM read sub-operations. (Legend: return, add metadata to read set, check read after write, validate, check if metadata is locked, setup, sync, read metadata, call, read data, calculate metadata, other; y-axis: % of cycles (norm. to fv).)

instrumentation and provides an accurate breakdown of the STM overheads.

We study the performance of two STM algorithms: one that fully validates ("fv") the read set after each transactional read and one that uses a global version number ("gv#") to avoid the full validation, while maintaining the correctness of the operations. The fv algorithm provides more concurrency at a much higher price. The gv# is deemed as one of the best trade-offs for STM implementations.

Figure 4 presents the single-threaded overhead of these algorithms over sequential runs, illustrating again the substantial slowdowns that the algorithms induce. Figure 5 breaks down these overheads into the various STM components. For both algorithms, the overhead of transactional reads dominates due to the frequency of read operations relative to all other operations. The effectiveness of the global version number in reducing overheads is shown in the lower read overhead of "gv#."

Figure 6 gives a fine-grain breakdown of the overheads of the transactional read operation. As expected, the overhead of validating the read set domi-
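A rough count of validation work shows why the fv and gv# algorithms described above differ so much in cost. The sketch below is a simplified cost model, not a measurement: it assumes that fv re-checks every read-set entry after each read, and that gv# performs one global-version comparison per read and falls back to a full read-set validation only at the reads where the global version was seen to have changed. The function names and the `gv_changed_at` parameter are hypothetical.

```python
def fv_validation_checks(num_reads):
    """Full validation: after the i-th read, re-check all i read-set entries."""
    return sum(range(1, num_reads + 1))        # n(n+1)/2, quadratic in reads

def gv_validation_checks(num_reads, gv_changed_at=()):
    """Global version number: one cheap comparison per read, plus a full
    validation at each read index where the global version had changed."""
    changed = set(gv_changed_at)
    checks = 0
    for i in range(1, num_reads + 1):
        checks += 1                            # compare gv# with the snapshot
        if i in changed:
            checks += i                        # validate the i entries read so far
    return checks

if __name__ == "__main__":
    print(fv_validation_checks(100))           # prints 5050
    print(gv_validation_checks(100))           # prints 100
    print(gv_validation_checks(100, [50]))     # prints 150
```

In this model the fv cost grows quadratically with the number of reads in a transaction, while gv# stays linear as long as few concurrent commits occur, which is consistent with the lower read overhead reported for gv#.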
component of the total time.

Overhead Optimizations. There have been many proposals on reducing STM overheads through compiler or runtime techniques, most of which are complementary to STM hardware acceleration.
˲˲ Redundant barrier elimination. One technique is to eliminate barriers to thread-local objects through escape analysis. Such analysis is typically quite effective at identifying thread-local accesses that are close to the object allocation site. It can eliminate both read- and write-barriers, but is often more effective on write-barriers. For example, we observe that an intra-procedural escape analysis can eliminate 40–50% of write barriers in vacation, genome, and b+tree. However, its impact on performance is more limited: from negligible to 12%. To target redundant read-barriers, a whole-program analysis called Not-Accessed-In-Transaction analysis [27] eliminates some barriers to read-only objects in transactions;
˲˲ Barrier strength reduction. These optimizations do not eliminate barriers, but identify at runtime special locations that require only lightweight barrier processing, such as dynamic tracking of thread-local objects [11, 27] and runtime filtering of stack references and duplicate references [11];
˲˲ Code generation optimizations. One common technique is to inline the fast path of barriers. It has the potential benefit of reducing function call overhead, increasing ILP, and exposing reuse of common sub-barrier operations. In our experiments, compiler inlining achieved less than 2% overall improve-

marks by up to 14%.

Such optimizations have a positive impact on STM performance. However,

Figure 7: Percentage of time spent in STM end sub-operations. (Legend: return, write data, check for read-only, cleanup transactional state, validate, setup, release metadata, sync, call, increment gv#, acquire metadata, other.)

Related Work

posed STM systems [7]. Conflict detection is simplified significantly by the static nature because conflicts can be ruled out already when ownership records are acquired (at transaction start).

DSTM [12] is the first dynamic STM system; the design follows a per-object runtime organization (locator object). Variables (objects) in the application heap refer to a locator object. Unlike in a design with ownership records (for example, Harris and Fraser [10]), the locator does not store a version number but refers to the most recently committed version of the object. A particularity of the DSTM design is that objects must be explicitly 'opened' (in read-only or read-write mode) before transactional access; DSTM also allows for early release. The authors argue that both mechanisms facilitate the reduction of conflicts.

The design principles of the RSTM [18] system are similar to DSTM in that it associates transactional metadata with objects. Unlike DSTM, however, the system does not require the dynamic allocation of transactional data but co-locates it with the non-transactional data. This scheme has two benefits: first, it facilitates spatial access locality and hence fosters execution performance and transaction throughput. Second, the dynamic memory management of transactional data (usually done through a garbage collector) is not necessary and hence this scheme is amenable for use in environments where memory management is explicit.

Recent work explored algorithmic optimizations and/or alternative implementations of the basic STM algorithms described here. Riegel et al. propose the use of real-time clocks to enhance the STM scalability using a global version number [22]. JudoSTM [21] and RingSTM [29] reduce the number of atomic operations that must be performed when committing a transaction at the cost of serializing commit and/or incurring spurious aborts due to imprecise conflict detection. Several proposals have been made for STMs that operate via dynamic binary rewriting in order to allow the usage of STM on legacy binaries [8, 21, 33].

Yoo et al. [36] analyze the overhead in the execution of Intel's STM [14, 23]. They identify four major sources of overhead: over-instrumentation, false sharing, amortization costs, and privatization-safety costs. False sharing, privatization-safety, and over-instrumentation are implementation artifacts that can be eliminated by either using finer granularity bookkeeping, more refined analysis, or user annotations. Amortization costs are inherent overheads in an STM that, as we demonstrated here, are not likely to be eliminated.

A large amount of research effort has been spent in analyzing the operations in TM systems. Recent software optimizations have managed to accelerate STM performance by 2%–15%. We believe such analysis is a good practice that should be extended to every piece of system software, especially open source. However, the gains are only a minor dent in the overheads we observed, indicating the challenge that lies before the community in making STM performance compelling.

Conclusion
Based on our results, we believe that the road ahead for STM is quite challenging. Lowering the overheads of STM to a point where it is generally appealing is a difficult task and significantly better results have to be demonstrated. If we could stress a single direction for further research, it is the elimination of dynamically unnecessary read and write barriers, possibly the single most powerful lever toward further reduction of STM overheads. However, given the difficulty of similar problems explored by the research community such as alias analysis, escape analysis, and so on, this may be an uphill battle. And because the argument for TM hinges upon its simplicity and productivity benefits, we are deeply skeptical of any proposed solutions to performance problems that require extra work by the programmer.

We observed that the TM programming model itself, whether implemented in hardware or software, introduces complexities that limit the expected productivity gains, thus reducing the current incentive for migration to transactional programming, and the justification at present for anything more than a small amount of hardware support.

Acknowledgments
We would like to thank Pratap Pattnaik for his continuous support, Christoph von Praun for numerous discussions, work on benchmarks and runtimes, and Rajesh Bordawekar for the B+tree code implementation.

References
1. Baugh, L., Neelakantam, N., and Zilles, C. Using hardware memory protection to build a high-performance, strongly-atomic hybrid transactional memory. In Proceedings of the 35th International Symposium on Computer Architecture. IEEE Computer Society, Washington, D.C., 2008, 115–126.
2. Blundell, C., Devietti, J., Lewis, E.L., and Martin, M.M.K. Making the fast case common and the uncommon case simple in unbounded transactional memory. In Proceedings of the 34th Annual International Symposium on Computer Architecture. ACM, NY, 2007.
3. Blundell, C., Lewis, C., and Martin, M.M.K. Subtleties of transactional memory atomicity semantics. IEEE TCCA Computer Architecture Letters 5, 2 (Nov. 2006).
4. Bobba, J., Goyal, N., Hill, M.D., Swift, M.M., and Wood, D.A. TokenTM: Efficient execution of large transactions with hardware transactional memory. In Proceedings of the 35th International Symposium on Computer Architecture. IEEE Computer Society, Washington, D.C., 2008, 127–138.
5. Ceze, L., Tuck, J., Cascaval, C., and Torrellas, J. Bulk disambiguation of speculative threads in multiprocessors. In Proceedings of the 34th Annual International Symposium on Computer Architecture. ACM, NY, 2006, 237–238.
6. Damron, P., Federova, A., Lev, Y., Luchangco, V., Moir, M., and Nussbaum, D. Hybrid transactional memory. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2006.
7. Dice, D., Shalev, O., and Shavit, N. Transactional Locking II. DISC, Sept. 2006, 194–208.
8. Felber, P., Fetzer, C., Mueller, U., Riegel, T., Suesskraut, M., and Sturzrehm, H. Transactifying applications using an open compiler framework. In Proceedings of the ACM SIGPLAN Workshop on Transactional Computing. Aug. 2007.
9. Hammond, L., Wong, V., Chen, M., Carlstrom, B.D., Davis, J.D., Hertzberg, B., Prabhu, M.K., Wijaya, H., Kozyrakis, C., and Olukotun, K. Transactional memory coherence and consistency. In Proceedings of the 31st Annual International Symposium on Computer Architecture. IEEE Computer Society, June 2004, 102.
10. Harris, T. and Fraser, K. Language support for lightweight transactions. In Proceedings of Object-Oriented Programming, Systems, Languages, and Applications. Oct. 2003, 388–402.
11. Harris, T., Plesko, M., Shinnar, A., and Tarditi, D. Optimizing memory transactions. In Proceedings of the Programming Language Design and Implementation Conference. 2003, 388–402.
12. Herlihy, M., Luchangco, V., Moir, M., and Scherer III, W.N. Software transactional memory for dynamic-sized data structures. In Proceedings of the 22nd ACM Symposium on Principles of Distributed Computing. July 2003, 92–101.
13. Herlihy, M. and Moss, J.E.B. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture. May 1993.
14. Intel C++ STM compiler, prototype edition 2.0; http://softwarecommunity.intel.com/articles/eng/1460.htm/ (2008).
15. Kulkarni, M., Pingali, K., Walter, B., Ramanarayanan, G., Bala, K., and Chew, P.L. Optimistic parallelism requires abstractions. In Proceedings of the PLDI 2007. ACM, NY, 2007, 211–222.
16. Larus, J.R., and Rajwar, R. Transactional Memory. Morgan Claypool, 2006.
17. The Lonestar benchmark suite; http://iss.ices.utexas.edu/lonestar/ (2008).
18. Marathe, V.J., Spear, M.F., Heriot, C., Acharya, A., Eisenstat, D., Scherer III, W.N., and Scott, M.L. Lowering the overhead of software transactional memory. Technical Report TR 893, Computer Science Department, University of Rochester, Mar. 2006. Condensed version submitted for publication.
19. Minh, C.C., Trautmann, M., Chung, J., McDonald, A., Bronson, N., Casper, J., Kozyrakis, C., and Olukotun, K. An effective hybrid transactional memory system with strong isolation guarantees. In Proceedings of the 34th Annual International Symposium on Computer Architecture. ACM, NY, 2007, 69–80.
20. Moore, K.E., Bobba, J., Moravan, M.J., Hill, M.D., and Wood, D.A. LogTM: Log-based transactional memory. In Proceedings of the 12th Annual International Symposium on High Performance Computer Architecture, Feb. 2006.
21. Olszewski, M., Cutler, J., and Steffan, J.G. JudoSTM: A dynamic binary-rewriting approach to software transactional memory. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques. 2007. IEEE Computer Society, Washington, D.C., 365–375.
22. Riegel, T., Fetzer, C., and Felber, P. Time-based transactional memory with scalable time bases. In Proceedings of the 19th ACM Symposium on Parallelism in Algorithms and Architectures, 2007.
23. Saha, B., Adl-Tabatabai, A.R., Hudson, R.L., Minh, C.C., and Hertzberg, B. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the 11th ACM Symposium on Principles and Practice of Parallel Programming. Mar. 2006, ACM, NY, 187–197.
24. Saha, B., Adl-Tabatabai, A.R., and Jacobson, Q. Architectural support for software transactional memory. In Proceedings of the 39th Annual International Symposium on Microarchitecture. Dec. 2006, 185–196.
25. Shavit, N., and Touitou, D. Software Transactional Memory. In Proceedings of the ACM Symposium of Principles of Distributed Computing. ACM, 1995.
26. Shavit, N. and Touitou, D. Software transactional memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing. ACM, NY, 1995.
27. Shpeisman, T., Menon, V., Adl-Tabatabai, A-R., Balensiefer, S., Grossman, D., Hudson, R., Moore, K.F., and Saha, B. Enforcing isolation and ordering in STM. In Proceedings of the Programming Language Design and Implementation Conference. ACM, 2007, 78–88.
28. Shriraman, A., Spear, M.F., Hossain, H., Marathe, V.J., Dwarkadas, S., and Scott, M.L. An integrated hardware-software approach to flexible transactional memory. In Proceedings of the 34th Annual International Symposium on Computer Architecture. ACM, NY, 2007, 104–115.
29. Spear, M.F., Michael, M.M., and von Praun, C. RingSTM: Scalable transactions with a single atomic instruction. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, NY, 275–284.
30. STAMP benchmark; http://stamp.stanford.edu/ (2007).
31. (IBM) XL C/C++ for Transactional Memory for AIX; www.alphaworks.ibm.com/tech/xlcstm/ (2008).
32. Tremblay, M. and Chaudhry, S. A third generation 65nm 16-core 32-thread plus 32-scout-thread CMT. In Proceedings of the IEEE International Solid-State Circuits Conference. Feb. 2008.
33. Wang, C., Chein, W-Y., Wu, Y., Saha, B., and Adl-Tabatabai, A.R. Code generation and optimization for transactional memory constructs in an unmanaged language. In Proceedings of International Symposium on Code Generation and Optimization. 2007, 34–48.
34. Wu, P., Michael, M.M., von Praun, C., Nakaike, T., Bordawekar, R., Cain, H.W., Cascaval, C., Chatterjee, S., Chiras, S., Hou, R., Mergen, M., Shen, X., Spear, M.F., Wang, H.Y., and Wang, K. Compiler and runtime techniques for software transactional memory optimization. To appear in Concurrency and Computation: Practice and Experience, 2008.
35. Yen, L., Bobba, J., Marty, M.M., Moore, K.E., Volos, H., Hill, M.D., Swift, M.M., and Wood, D.A. LogTM-SE: Decoupling hardware transactional memory from caches. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture. Feb. 2007.
36. Yoo, R.M., Ni, Y., Welc, A., Saha, B., Adl-Tabatabai, A-R., and Lee, H-H.S. Kicking the tires of software transactional memory: why the going gets tough. In Proceedings of the 20th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2008.
37. Zhang, R., Budimlić, Z., and Scherer III, W.N. Commit phase in timestamp-based STM. In Proceedings of the 20th Annual Symposium on Parallelism in Algorithms and Architectures. ACM, NY, 326–335.

Călin Cașcaval ([email protected]) is a Research Staff Member and Manager of Programming Models and Tools for Scalable Systems at IBM TJ Watson Research Center, Yorktown Heights, NY.

Colin Blundell is a member of the Architecture and Compilers Group, Department of Computer and Information Science, University of Pennsylvania.

Maged Michael is a Research Staff Member at IBM TJ Watson Research Center, Yorktown Heights, NY.

Trey Cain is a Research Staff Member at IBM TJ Watson Research Center, Yorktown Heights, NY.

Peng Wu is a Research Staff Member at IBM TJ Watson Research Center, Yorktown Heights, NY.

Stefanie Chiras is a manager in IBM's Systems and Technology Group.

Siddhartha Chatterjee is director of the Austin Research Laboratory, IBM Research, Austin, TX.

© 2008 ACM 0001-0782/08/1100 $5.00