
COMMUNICATIONS OF THE ACM
CACM.ACM.ORG   10/2009   VOL. 52, NO. 10

A View of Parallel Computing
A Conversation with David E. Shaw
Smoothed Analysis
A Retrospective from C.A.R. Hoare
Debating Net Neutrality

Association for Computing Machinery
springer.com

Noteworthy Titles

Computational Geometry: Algorithms and Applications
M. de Berg, TU Eindhoven, The Netherlands; O. Cheong, KAIST, Daejeon, Korea; M. van Kreveld, M. Overmars, Utrecht University, Utrecht, The Netherlands
This well-accepted introduction to computational geometry is a textbook for high-level undergraduate and low-level graduate courses. The focus is on algorithms and hence the book is well suited for students in computer science and engineering. The book is largely self-contained and can be used for self-study by anyone with a basic background in algorithms. In this third edition, besides revisions to the second edition, new sections discussing Voronoi diagrams of line segments, farthest-point Voronoi diagrams, and realistic input models have been added.
3rd ed. 2008. XII, 386 p. 370 illus. Hardcover. ISBN 978-3-540-77973-5. $49.95

More Math Into LaTeX
G. Grätzer, University of Manitoba, Winnipeg, MB, Canada
For close to two decades, Math into LaTeX has been the standard introduction and complete reference for writing articles and books containing mathematical formulas. In this fourth edition, the reader is provided with important updates on articles and books. An important new topic is discussed: transparencies (computer projections).
2007. XXXIV, 619 p. 44 illus. Softcover. ISBN 978-0-387-32289-6. $49.95

User Interface Design for Programmers
J. Spolsky, Fog Creek Software, New York, NY, USA
The author of one of the most popular independent web sites gives you a brilliantly readable book with what programmers need to know about User Interface Design. Spolsky concentrates especially on the common mistakes that too many programs exhibit. Most programmers dislike user interface programming, but this book makes it quintessentially easy, straightforward, and fun.
2006. 160 p. Softcover. ISBN 978-1-893115-94-1. $34.99

Robot Building for Beginners
D. Cook, Motorola, Whitestown, IN, USA
Learning robotics by yourself isn't easy. This book by an experienced software developer and self-taught mechanic tells how to build robots from scratch with individual electronic components – in easy-to-understand language with step-by-step instructions.
1st ed. 2002. Corr. 2nd printing 2005. XV, 568 p. Softcover. ISBN 978-1-893115-44-6. $29.95

The Algorithm Design Manual
S. S. Skiena, State University of New York, Stony Brook, NY, USA
This expanded and updated second edition of a classic bestseller continues to take the "mystery" out of designing and analyzing algorithms and their efficacy and efficiency. Expanding on the highly successful formula of the first edition, the book now serves as the primary textbook of choice for any algorithm design course while maintaining its status as the premier practical reference guide to algorithms.
2nd ed. 2008. XVI, 736 p. 115 illus. With online files/update. Hardcover. ISBN 978-1-84800-069-8. $79.95

Fast Track to MDX
M. Whitehorn, University College Worcester, UK; R. Zare, M. Pasumansky, Microsoft Corporation, Redmond, WA, USA
Fast Track to MDX provides all the necessary background needed to write useful, powerful MDX expressions and introduces the most frequently used MDX functions and constructs. No prior knowledge is assumed and examples are used throughout the book to rapidly develop MDX skills to the point where they can solve real business problems. A CD-ROM containing examples from within the book, and a time-limited version of ProClarity, are included.
2nd ed. 2006. XXVI, 310 p. 199 illus. With CD-ROM. Softcover. ISBN 978-1-84628-174-7. $54.95

Easy Ways to Order for the Americas: Write: Springer Order Department, PO Box 2485, Secaucus, NJ 07096-2485, USA. Call: (toll free) 1-800-SPRINGER. Fax: 1-201-348-4505. Email: [email protected]. Or for outside the Americas: Write: Springer Customer Service Center GmbH, Haberstrasse 7, 69126 Heidelberg, Germany. Call: +49 (0) 6221-345-4301. Fax: +49 (0) 6221-345-4229. Email: [email protected].
Prices are subject to change without notice. All prices are net prices.
COMMUNICATIONS OF THE ACM

Departments

5 President's Letter
ACM Europe
By Wendy Hall

6 Letters to the Editor
Time and Computing

7 In the Virtual Extension

8 blog@CACM
The Netflix Prize, Computer Science Outreach, and Japanese Mobile Phones
Greg Linden writes about machine learning and the Netflix Prize, Judy Robertson offers suggestions about getting teenagers interested in computer science, and Michael Conover discusses mobile phone usage and quick response codes in Japan.

10 CACM Online
Following the Leaders
By David Roman

13 Calendar

107 Careers

Last Byte

111 Q&A
The Networker
Jon Kleinberg talks about algorithms, information flow, and the connections between Web search and social networks.
By Leah Hoffmann

News

11 Managing Data
Managing terabytes of data is not the monumental task it once was. The difficult part is presenting enormous amounts of information in ways that are most useful to a wide range of users.
By David Lindley

14 Debating Net Neutrality
Advocates seek to protect users from potential business practices, but defenders of the status quo say that concerns are overblown.
By Alan Joch

16 Shaping the Future
To create shape-shifting robotic ensembles, researchers need to teach micro-machines to work together.
By Tom Geller

Viewpoints

19 The Business of Software
Contagious Craziness, Spreading Sanity
Some examples of the upward or downward spiral of behaviors in the workplace.
By Phillip G. Armour

21 Historical Reflections
Computing in the Depression Era
Insights from difficult times in the computer industry.
By Martin Campbell-Kelly

23 Inside Risks
Reflections on Conficker
An insider's view of the analysis and implications of the Conficker conundrum.
By Phillip Porras

25 Technology Strategy and Management
Dealing with the Venture Capital Crisis
How New Zealand and other small, isolated markets can act as "natural incubators."
By Michael Cusumano

28 Kode Vicious
Kode Reviews 101
A review of code review do's and don'ts.
By George V. Neville-Neil

30 Viewpoint
Retrospective: An Axiomatic Basis for Computer Programming
C.A.R. Hoare revisits his past Communications article on the axiomatic approach to programming and uses it as a touchstone for the future.
By C.A.R. Hoare

Association for Computing Machinery
Advancing Computing as a Science & Profession



10/2009 VOL. 52, NO. 10

Practice

34 Probing Biomolecular Machines with Graphics Processors
GPU acceleration and other computer performance increases will offer critical benefits to biomedical science.
By James C. Phillips and John E. Stone

42 Unifying Biological Image Formats with HDF5
The biosciences need an image format capable of high performance and long-term maintenance. Is HDF5 the answer?
By Matthew T. Dougherty, Michael J. Folk, Erez Zadok, Herbert J. Bernstein, Frances C. Bernstein, Kevin W. Eliceiri, Werner Benger, and Christoph Best

48 A Conversation with David E. Shaw
Stanford professor Pat Hanrahan sits down with the noted hedge fund founder, computational biochemist, and (above all) computer scientist.

Contributed Articles

56 A View of the Parallel Computing Landscape
Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.
By Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick

68 Automated Support for Managing Feature Requests in Open Forums
The result is stable, focused, dynamic discussion threads that avoid redundant ideas and engage thousands of stakeholders.
By Jane Cleland-Huang, Horatiu Dumitru, Chuan Duan, and Carlos Castro-Herrera

Review Articles

76 Smoothed Analysis: An Attempt to Explain the Behavior of Algorithms in Practice
This Gödel Prize-winning work traces the steps toward modeling real data.
By Daniel A. Spielman and Shang-Hua Teng

Research Highlights

86 Technical Perspective
Relational Query Optimization—Data Management Meets Statistical Estimation
By Surajit Chaudhuri

87 Distinct-Value Synopses for Multiset Operations
By Kevin Beyer, Rainer Gemulla, Peter J. Haas, Berthold Reinwald, and Yannis Sismanis

96 Technical Perspective
Data Stream Processing—When You Only Get One Look
By Johannes Gehrke

97 Finding the Frequent Items in Streams of Data
By Graham Cormode and Marios Hadjieleftheriou

Virtual Extension

As with all magazines, page limitations often prevent the publication of articles that might otherwise be included in the print edition. To ensure timely publication, ACM created Communications' Virtual Extension (VE). VE articles undergo the same rigorous review process as those in the print edition and are accepted for publication on their merit. These articles are now available to ACM members in the Digital Library.

Balancing Four Factors in System Development Projects
Girish H. Subramanian, Gary Klein, James J. Jiang, and Chien-Lung Chan

Attaining Superior Complaint Resolution
Sridhar R. Papagari Sangareddy, Sanjeev Jha, Chen Ye, and Kevin C. Desouza

Making Ubiquitous Computing Available
Vivienne Waller and Robert B. Johnson

De-escalating IT Projects: The DMM Project
Donal Flynn, Gary Pan, Mark Keil, and Magnus Mahring

Human Interaction for High-Quality Machine Translation
Francisco Casacuberta, Jorge Civera, Elsa Cubel, Antonio L. Lagarda, Guy Lapalme, Elliott Macklovitch, and Enrique Vidal

How Effective is Google's Translation Service in Search?
Jacques Savoy and Ljiljana Dolamic

Overcoming the J-Shaped Distribution of Product Reviews
Nan Hu, Paul A. Pavlou, and Jie Zhang

Technical Opinion
Do SAP Successes Outperform Themselves and Their Competitors?
Richard J. Goeke and Robert H. Faley

About the Cover: Leonello Calvetti brings to life the bridge analogy connecting users to a parallel IT industry as noted by the authors of the cover story beginning on page 56. Calvetti, a photo-realistic illustrator, is a graduate of Benvenuto Cellini College in Florence, Italy, where he studied technical design and developed his artistic applications of 3D software and digital imagery.



COMMUNICATIONS OF THE ACM
Trusted insights for computing's leading professionals.

Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields. Communications is recognized as the most trusted and knowledgeable source of industry information for today's computing professional. Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology, and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications, public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts, sciences, and applications of information technology.

ACM, the world's largest educational and scientific computing society, delivers resources that advance computing as a science and profession. ACM provides the computing field's premier Digital Library and serves its members and the computing profession with leading-edge publications, conferences, and career resources.

Executive Director and CEO: John White
Deputy Executive Director and COO: Patricia Ryan
Director, Office of Information Systems: Wayne Graves
Director, Office of Financial Services: Russell Harris
Director, Office of Membership: Lillian Israel
Director, Office of SIG Services: Donna Cappo

ACM COUNCIL
President: Wendy Hall
Vice-President: Alain Chesnais
Secretary/Treasurer: Barbara Ryder
Past President: Stuart I. Feldman
Chair, SGB Board: Alexander Wolf
Co-Chairs, Publications Board: Ronald Boisvert, Holly Rushmeier
Members-at-Large: Carlo Ghezzi; Anthony Joseph; Mathai Joseph; Kelly Lyons; Bruce Maggs; Mary Lou Soffa; Fei-Yue Wang
SGB Council Representatives: Joseph A. Konstan; Robert A. Walker; Jack Davidson

PUBLICATIONS BOARD
Co-Chairs: Ronald F. Boisvert and Holly Rushmeier
Board Members: Gul Agha; Michel Beaudouin-Lafon; Jack Davidson; Nikil Dutt; Carol Hutchins; Ee-Peng Lim; M. Tamer Ozsu; Vincent Shen; Mary Lou Soffa; Ricardo Baeza-Yates

ACM U.S. Public Policy Office
Cameron Wilson, Director
1100 Seventeenth St., NW, Suite 50, Washington, DC 20036 USA
T (202) 659-9711; F (202) 667-1066

Computer Science Teachers Association
Chris Stephenson, Executive Director
2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA
T (800) 401-1799; F (541) 687-1840

Association for Computing Machinery (ACM)
2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA
T (212) 869-7440; F (212) 869-0481

STAFF
Group Publisher: Scott E. Delman ([email protected])
Executive Editor: Diane Crawford
Managing Editor: Thomas E. Lambert
Senior Editor: Andrew Rosenbloom
Senior Editor/News: Jack Rosenberger
Web Editor: David Roman
Editorial Assistant: Zarina Strakhan
Rights and Permissions: Deborah Cotton
Art Director: Andrij Borys
Associate Art Director: Alicia Kubista
Assistant Art Director: Mia Angelica Balaquiot
Production Manager: Lynn D'Addesio
Director of Media Sales: Jennifer Ruzicka
Marketing & Communications Manager: Brian Hebert
Public Relations Coordinator: Virginia Gold
Publications Assistant: Emily Eng

Columnists: Alok Aggarwal; Phillip G. Armour; Martin Campbell-Kelly; Michael Cusumano; Peter J. Denning; Shane Greenstein; Mark Guzdial; Peter Harsha; Leah Hoffmann; Mari Sako; Pamela Samuelson; Gene Spafford; Cameron Wilson

CONTACT POINTS
Copyright permission: [email protected]
Calendar items: [email protected]
Change of address: [email protected]
Letters to the Editor: [email protected]

WEB SITE: http://cacm.acm.org
AUTHOR GUIDELINES: http://cacm.acm.org/guidelines

ADVERTISING
ACM Advertising Department, 2 Penn Plaza, Suite 701, New York, NY 10121-0701
T (212) 869-7440; F (212) 869-0481
Director of Media Sales: Jennifer Ruzicka ([email protected])
Media Kit: [email protected]

EDITORIAL BOARD
Editor-in-Chief: Moshe Y. Vardi ([email protected])

NEWS
Co-chairs: Marc Najork and Prabhakar Raghavan
Board Members: Brian Bershad; Hsiao-Wuen Hon; Mei Kobayashi; Rajeev Rastogi; Jeannette Wing

VIEWPOINTS
Co-chairs: Susanne E. Hambrusch; John Leslie King; J Strother Moore
Board Members: P. Anandan; William Aspray; Stefan Bechtold; Judith Bishop; Stuart I. Feldman; Peter Freeman; Seymour Goodman; Shane Greenstein; Mark Guzdial; Richard Heeks; Richard Ladner; Susan Landau; Carlos Jose Pereira de Lucena; Helen Nissenbaum; Beng Chin Ooi; Loren Terveen

PRACTICE
Chair: Stephen Bourne
Board Members: Eric Allman; Charles Beeler; David J. Brown; Bryan Cantrill; Terry Coatta; Mark Compton; Benjamin Fried; Pat Hanrahan; Marshall Kirk McKusick; George Neville-Neil
The Practice section of the CACM Editorial Board also serves as the Editorial Board of ACM Queue.

CONTRIBUTED ARTICLES
Co-chairs: Al Aho and Georg Gottlob
Board Members: Yannis Bakos; Gilles Brassard; Alan Bundy; Peter Buneman; Carlo Ghezzi; Andrew Chien; Anja Feldmann; Blake Ives; James Larus; Igor Markov; Gail C. Murphy; Shree Nayar; Lionel M. Ni; Sriram Rajamani; Jennifer Rexford; Marie-Christine Rousset; Avi Rubin; Abigail Sellen; Ron Shamir; Marc Snir; Larry Snyder; Veda Storey; Manuela Veloso; Michael Vitale; Wolfgang Wahlster; Andy Chi-Chih Yao; Willy Zwaenepoel

RESEARCH HIGHLIGHTS
Co-chairs: David A. Patterson and Stuart J. Russell
Board Members: Martin Abadi; Stuart K. Card; Deborah Estrin; Shafi Goldwasser; Monika Henzinger; Maurice Herlihy; Norm Jouppi; Andrew B. Kahng; Gregory Morrisett; Linda Petzold; Michael Reiter; Mendel Rosenblum; Ronitt Rubinfeld; David Salesin; Lawrence K. Saul; Guy Steele, Jr.; Gerhard Weikum; Alexander L. Wolf

WEB
Co-chairs: Marti Hearst and James Landay
Board Members: Jason I. Hong; Jeff Johnson; Greg Linden; Wendy E. MacKay

ACM Copyright Notice
Copyright © 2009 by Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

For other copying of articles that carry a code at the bottom of the first or last page or screen display, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center; www.copyright.com.

Subscriptions
An annual subscription cost is included in ACM member dues of $99 ($40 of which is allocated to a subscription to Communications); for students, cost is included in $42 dues ($20 of which is allocated to a Communications subscription). A nonmember annual subscription is $100.

ACM Media Advertising Policy
Communications of the ACM and other ACM Media publications accept advertising in both print and electronic formats. All advertising in ACM Media publications is at the discretion of ACM and is intended to provide financial support for the various activities and services for ACM members. Current Advertising Rates can be found by visiting http://www.acm-media.org or by contacting ACM Media Sales at (212) 626-0654.

Single Copies
Single copies of Communications of the ACM are available for purchase. Please contact [email protected].

COMMUNICATIONS OF THE ACM (ISSN 0001-0782) is published monthly by ACM Media, 2 Penn Plaza, Suite 701, New York, NY 10121-0701. Periodicals postage paid at New York, NY 10001, and other mailing offices.

POSTMASTER
Please send address changes to Communications of the ACM, 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA

Printed in the U.S.A.


president's letter

DOI:10.1145/1562764.1562765   Wendy Hall

ACM Europe

Increasing ACM's relevance and influence as a membership organization in the global computing community has been a top priority from the outset of my presidency.

For it is only by making concerted efforts to share ACM's vast array of valued resources and services on a global scale, and by discovering the work and welcoming the talent from all corners of the computing arena, that ACM can truly call itself the world's leading computing society.

To this end, ACM has made impressive strides over the last year. Major internationalization initiatives have resulted in multiyear plans for raising the Association's visibility and promoting membership in the major technology hubs of Europe, India, and China. A growing number of ACM's Special Interest Groups (SIGs) are hosting or planning their flagship conferences throughout Asia, Europe, and Latin America. And ACM's Executive Committee, Council, as well as its many Boards and Committees stand solidly behind this cause, fully recognizing what it means—and what it takes—for ACM to be an international society.

As the Association's first non-North American President, and a European, I am especially pleased to announce ACM Europe—a new effort within ACM to recognize and support European members and ACM activities in Europe. Dedicated ACM volunteers and executive staff have been working with industry and academic leaders in Europe to determine ways to better serve ACM's current European members and to draw more into the fold. By strengthening its ties in the region, ACM will be better positioned to appreciate the key issues and challenges within Europe's academic, research, and professional computing communities and to respond effectively.

To celebrate the introduction of ACM Europe, the Association will host a special event in Paris on October 8 in conjunction with the 2009 European Computer Science Summit, sponsored by Informatics Europe (http://www.informatics-europe.org/ECSS09/). The reception will honor the achievements of European computer scientists and recognize European winners of ACM's A.M. Turing Award, as well as other ACM award winners, and ACM Fellows. The event will also serve to share the work and activities planned by ACM Europe.

ACM Europe Council
ACM Europe is spearheaded by the ACM Europe Council—a group of 15 noted European computer scientists from academia and industry who came together last October pledging to help build an ACM presence that would focus on bringing high-quality technical activities, conferences, and services to ACM members and computing professionals throughout the continent. Chairing the ACM Europe Council is Fabrizio Gagliardi, director of external research programs for Microsoft Research Europe. (Council members are listed here.)

The roots of ACM's European community are as old and run as deep as the Association itself, having been in existence over 50 years. Today's ACM European contingent includes over 15,000 members, more than 100 chapters, and an ever-growing number of ACM-sponsored conferences and symposia. The primary goal of ACM Europe is to serve ACM's European members by improving the exchange of ideas, and by raising awareness about ACM with the public and with European decision makers in order to play an active role in critical technical, educational, and social issues related to computing.

The ACM Europe Council has established three subcommittees to launch its initial work in serving its constituency. One subcommittee will review the current landscape of European chapters and help develop strategies for increasing the number of chapters, particularly student chapters. A second group will focus on generating interest in, and nominations for, the many ACM professional awards and distinguished member grades. This group will also work to suggest names for key ACM volunteer positions to add a stronger European presence and perspective to the Association's future plans. The third group will work toward engaging ACM's SIGs to "think Europe" when planning conferences and major events.

ACM Europe is one of many international initiatives currently at the top of ACM's agenda. Planning is already under way for the newly formed ACM India Council to host a launch event in early 2010 and we will be announcing details of our plans for the development of ACM China next year. ACM SIGs continue to hold record numbers of conferences outside North America.

When I agreed to accept the invitation to run for president of ACM in early 2008, I noted in my election statement that I truly believed ACM's future success depends on its ability to increase its diversity and support of academic and research communities around the world. Given the progress made so far through the efforts and ongoing interest of so many devoted volunteers and ACM leaders to this cause, I am heartened to say that we are well on the way to successfully achieving our aims.

ACM Europe Council
CHAIR: Fabrizio Gagliardi, Switzerland
MEMBERS:
Andrew McGettrick, Scotland
Marc Shapiro, France
Thomas Hofmann, Switzerland
Gerhard Schimpf, Germany
Alexander Wolf, U.K.
Gabriele Kotsis, Austria
Jan van Leeuwen, The Netherlands
Avi Mendelson, Israel
Michel Beaudouin-Lafon, France
Burkhard Neidecker-Lutz, Germany
Bertrand Meyer, Switzerland
Paul Spirakis, Greece
Wendy Hall, U.K. (and ACM President)
Mateo Valero, Spain

Professor Dame Wendy Hall ([email protected]) is president of ACM and a professor of computer science at the University of Southampton, U.K.

© 2009 ACM 0001-0782/09/1000 $10.00



letters to the editor

DOI:10.1145/1562764.1562767

Time and Computing

Edward A. Lee's "Computing Needs Time" (May 2009) might be the most important, stimulating, and timely article I have read in Communications over the past 50 years. It dealt with a major technical topic long neglected by ACM but that is today more important than ever. One sentence in particular resonated: "It is easy to envision new capabilities that are technically within striking distance but that would be extremely difficult to deploy with today's methods." Having been involved in the development of embedded systems in military aircraft from 1968 to 1994, I would say that to succeed, computerized performance systems need the following:

Knowledge. Of the operational objectives and requirements of the system;
Discipline. At all technical and management levels to assure system resources are well defined and carefully budgeted, with timing at the top of the list; and
Selection. Of exceptionally self-directed project personnel, including pruning out those unsuited for such system-development work.

You might view them as platitudes, so consider, too, several examples from four programs involving military aircraft:

In both the P-3C maritime patrol aircraft and the carrier-based S-3A antisubmarine aircraft, senior software engineers (in both the contracting agency and the contractor) developed a set of real-time software "requirements" that could not be met with available memory. In each case Navy program managers were steeped in the operational requirements and had exercised their authority to severely limit them.

In the case of the F-117 stealth fighter, an important new operational subsystem was never deployed. The customer and the Lockheed Advanced Development Company jointly invented a set of utopian system "requirements," basically ignoring management directions. The customer's program manager and Lockheed finally agreed to cancel the project.

It was no surprise 20 or 30 years ago that technical people involved in the development of demanding embedded systems learned primarily on the job, with more trial and error than we care to admit but that we accepted. Are universities today educating the new generation of systems and software engineers so they are able to develop computerized performance systems? In my own recent cynical moments rebooting my Mac, I think such software development is being done by people who would, perhaps, be downright dangerous if assigned to develop critical computerized performance systems.

In any case, Lee's technical vision and thinking were truly impressive, as was his skill in communicating them.
Sherm Mullin, Oxnard, CA

How the Blind Learn Math and Design

Kristen Shinohara and Josh Tenenberg were absolutely correct in "A Blind Person's Interactions with Technology" (Aug. 2009) saying that replacing text with voice does not automatically provide the blind interaction with or even access to technology. However, they might have added that this says more about how the sighted value text and its characteristics (such as left to right, fixed character sets, and lamina) than it does about the absence of that value in the blind.

In trying to teach computer science to two completely blind students, I have found that two key issues—mathematics and design—are both so dependent on spatial arrangement that the standard approaches are at best unreliable and at worst confusing. Even a cursory glance at the excellent Blindmath notice board (http://www.nfbnet.org/mailman/listinfo/blindmath_nfbnet.org) demonstrates the importance of Abraham Nemeth's Braille coding for math-symbol layout to both experienced practitioners and as a tool for the young learning skills in the field. However, symbolic and spatial presentation (for the most part the essence of the mathematical teaching method) has little to do with the ability to achieve proficiency in mathematics or computing. Simply linearizing the symbols and spatial arrangement for access by the blind through voice (or Braille) techniques could still miss the point but seems to be the best we can do today.

Similarly, teaching design as hierarchical decomposition and spatial layout redolent in techniques (such as UML) also misses the point; it might thus be replaced by the simple text "generate a hierarchical decomposition and group the resulting units" with examples of what is meant and any of the tricks that might be available. Noting that time arrangement could be replaced with spatial-arrangement restrictions compounds the challenge. (A bundle of tactile knotted strings might be a better exemplar of hierarchy than a neat 2.5D graph, as I believe Donald Knuth once said.)

It may be that seeing the world as the sighted see it is the goal of practitioners and educators alike; the motes, as it were, could indeed be in the wrong eyes.
Bernard M. Diaz, Liverpool, U.K.

Correction
Brad Chen of Google contributed to the development of the news story "Toward Native Web Execution" (July 2009).

Communications welcomes your opinion. To submit a Letter to the Editor, please limit your comments to 500 words or less and send to [email protected].

© 2009 ACM 0001-0782/09/1000 $10.00



in the virtual extension

DOI:10.1145/1562764.1562768

In the Virtual Extension

Communications' Virtual Extension brings more quality articles to ACM members. These articles are now available in the ACM Digital Library.

Balancing Four Factors in System Development Projects
Girish H. Subramanian, Gary Klein, James J. Jiang, and Chien-Lung Chan
It is often the methodology that dictates system development success criteria. However, an organization will benefit most from a perspective that looks at both the quality of the product and an efficient process. This imperative will more likely incorporate the multiple goals of various stakeholders. To this end, a project view of system development should adjust to incorporate controls into a process that stresses flexibility, promote learning in a process that is more rigid, and be certain to use evaluation criteria that stress the quality of the product as well as the efficiency of the process.

Attaining Superior Complaint Resolution
Sridhar R. Papagari Sangareddy, Sanjeev Jha, Chen Ye, and Kevin C. Desouza
Why is customer service more important than ever for consumer technology companies? How can they retain existing customers and prevent them from discontinuing their products? What can they do to provide superior complaint management and resolution? The authors answer these questions and more by proposing and evaluating a model of complaint management and service evaluation. A holistic approach toward complaint management is recommended for retaining customers.

Making Ubiquitous Computing Available
Vivienne Waller and Robert B. Johnson
Computing artifacts that seamlessly support everyday activities must be made both physically and cognitively "available" to users. Artifacts that are designed using a traditional model of computing tend to get in the way of what we want to do. Drawing on Heidegger, the authors delve deeper into the concept of "availability" than existing studies in human-computer interaction have done. They find two ways that ubiquitous computing can be truly available are through manipulating the space of possible actions and through indicating the possibility for action. This article translates the conceptual findings into principles for design.

De-escalating IT Projects: The DMM Project
Donal Flynn, Gary Pan, Mark Keil, and Magnus Mahring
Taming runaway information technology projects is a continuing challenge for many managers. These are projects that grossly exceed their planned budgets and schedules, often by a factor of 2–3 fold or greater. Many end in failure, not only in the sense of budget or schedule, but in terms of delivered functionality. This article examines three approaches that have been suggested for managing de-escalation. By combining the best elements from the approaches, we provide an integrated framework as well as introducing a de-escalation management maturity (DMM) model that provides a useful roadmap for improving practice.

Human Interaction for High-Quality Machine Translation
Francisco Casacuberta, Jorge Civera, Elsa Cubel, Antonio L. Lagarda, Guy Lapalme, Elliott Macklovitch, and Enrique Vidal
The interactive-predictive approach allows for the construction of machine translation systems that produce high-quality results in a cost-effective manner by placing a human operator at the center of the production process. The human serves as the guarantor of high quality and the role of the automated systems is to ensure increased productivity by proposing well-formed extensions to the current target text, which the operator may then accept, correct, or ignore. Interactivity allows the system to take advantage of the human-validated portion of the text to improve the accuracy of subsequent predictions.

How Effective is Google's Translation Service in Search?
Jacques Savoy and Ljiljana Dolamic
Using freely available translation services, bilingual search is possible and even effective. Compared to a monolingual search, the automatic query translation hurts the retrieval effectiveness (from 12% to 30% depending on the language pairs). Various translation difficulties as well as linguistic features may explain such degradation. Instead of providing a direct translation for all language pairs, we can select an intermediary language or pivot (for example, English), and such a strategy does not always further degrade the search quality.

Overcoming the J-Shaped Distribution of Product Reviews
Nan Hu, Paul A. Pavlou, and Jie Zhang
Product review systems rely on a simple database technology that allows people to rate products. Following the view of information systems as socio-technical systems, product review systems denote the interaction between people and technology. This article provides evidence to support the finding that online product reviews suffer from two kinds of potential biases: purchasing bias and under-reporting bias. Therefore the average of ratings alone may not be representative of product quality, and consumers need to look at the entire distribution of the reviews.

Technical Opinion:
Do SAP Successes Outperform Themselves and Their Competitors?
Richard J. Goeke and Robert H. Faley
Managers and researchers have long debated how to measure the business value of IT investments. ERP systems have recently entered this debate, as research measuring the business value of these large IT investments has produced mixed results. Using regression discontinuity analysis, the authors found that successful SAP implementers improved inventory turnover in their post-implementation periods. These SAP successes also significantly improved inventory turnover relative to their competitors. However, profitability improvements were more difficult to find.


The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish excerpts from selected posts.

DOI:10.1145/1562764.1562769   http://cacm.acm.org/blogs/blog-cacm

The Netflix Prize, Computer Science Outreach, and Japanese Mobile Phones

Greg Linden writes about machine learning and the Netflix Prize, Judy Robertson offers suggestions about getting teenagers interested in computer science, and Michael Conover discusses mobile phone usage and quick response codes in Japan.

From Greg Linden's "The Biggest Gains Come From Knowing Your Data"

Machine learning is hard. It can be awfully tempting to try to skip the work. Can't we just download a machine learning package? Do we really need to understand what we are doing?

It is true that off-the-shelf algorithms are a fast way to get going and experiment. Just plug in your data and go.

The only issue is if development stops there. By understanding the peculiarities of your data and what people want and need on your site, by experimenting and learning, it is likely you can outperform a generic system.

A great example of how understanding the peculiarities of your data can help came out of the Netflix Prize. Progress on the $1 million prize largely stalled until Gavin Potter discovered peculiarities in the data, including that people interpret the rating scale differently.

More recently, Yehuda Koren found additional gains by supplementing the models to allow for temporal effects, such as that people tend to rate older movies higher, that movies rated together in a short time window tend to be more related, and that people over time might start rating all the movies they see higher or lower.

In both cases, looking closely at the data, better understanding how people behave, and then adapting the models yielded substantial gains. Combined with other work, that was enough to win the million-dollar prize.

The Netflix Prize followed a pattern you often see when people try to implement a feature that requires machine learning. Most of the early attempts threw off-the-shelf algorithms at the data, yielding something that works, but not with particularly impressive results.

Without a clear metric for success and a way to test against that metric, development stops there. But, like Google and Amazon do with ubiquitous A/B testing, the Netflix Prize had a clear metric for success and a way to test against that metric.

There are a lot of lessons that can be taken from the Netflix contest, but a big one should be the importance of constant experimentation and learning. By competing algorithms against each other, by looking carefully at the data, by thinking about what people want and why they do what they do, and by continuous testing and experimentation, you can reap big gains.

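Linden's point about rating-scale differences lends itself to a concrete sketch. The following toy example (hypothetical names and data; not the contest-winning model) fits per-user and per-movie offsets so that a habitually harsh rater's 3 is not read the same way as a generous rater's 3; the temporal effects Koren exploited would extend these offsets with time-dependent terms.

```python
# Sketch of a bias-aware rating baseline in the spirit of the Netflix Prize
# discussion above; toy data and names, not the prizewinning system.
from collections import defaultdict

ratings = [  # (user, movie, stars) -- hypothetical examples
    ("alice", "m1", 5), ("alice", "m2", 4),
    ("bob",   "m1", 2), ("bob",   "m2", 1),
]

global_mean = sum(r for _, _, r in ratings) / len(ratings)

# Per-user offset: how far each user's average sits from the global mean,
# capturing that people interpret the rating scale differently.
by_user = defaultdict(list)
for user, _, r in ratings:
    by_user[user].append(r)
user_bias = {u: sum(rs) / len(rs) - global_mean for u, rs in by_user.items()}

# Per-movie offset, computed on user-debiased ratings.
by_movie = defaultdict(list)
for user, movie, r in ratings:
    by_movie[movie].append(r - global_mean - user_bias[user])
movie_bias = {m: sum(rs) / len(rs) for m, rs in by_movie.items()}

def predict(user, movie):
    """Baseline prediction: global mean plus user and movie offsets."""
    return global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)

print(predict("bob", "m2"))  # bob rates low overall, so the prediction sits low
```

Even this crude baseline illustrates the blog's thesis: the gains come not from a fancier generic learner but from encoding something you noticed about how the people behind the data behave.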


From Judy Robertson's "Computer Science Outreach: Meeting the Kids Halfway"

I just spent the afternoon working with teenagers at some of our summer school workshops. As luck would have it, we had two different sessions running on the same afternoon, and while galloping between labs, it occurred to me some interesting things were going on. First, a bit about the workshops; the summer schools were both for 17- and 18-year-olds, both were set up to encourage young people to study computer science, and both involved building virtual worlds. One of the workshops, on making computer games using the Neverwinter Nights 2 toolset, lasted for just two hours, and the other was the final presentation session of an eight-week project on Second Life programming. Both of them went very well from the point of view of introducing young people to the fun aspects of computer science. Whether they pay off in terms of recruiting people to study our degree courses in CS remains to be seen. But you have to start somewhere, right? Here are some things I noticed that might be useful to others who are interested in schools' outreach and recruitment.

• A relaxed atmosphere prevailed. The young people were joking around and enjoying themselves. Importantly, they were laughing with the staff rather than at them. Having some handpicked students who I knew to be friendly and approachable really helped with this.

• The young people were doing stuff rather than listening to me drone on. The games workshop kids spent most of their time exploring the software with minimal time spent in demos. The Second Life project groups were presenting their projects and giving demos while their classmates assessed their work. They seemed to be taking the assessment task seriously and responsibly. And I'll tell you what: It really makes them ask sensible questions at the end of each presentation. This is a contrast to the usual setup in class where students sit like turnips when you ask, "Are there any questions?"

• Both the workshops involved creative tasks where the teenagers chose for themselves what to build. This does have the drawback of revealing my ignorance of the latest pop culture fads, but at least I do know what South Park is. Seriously, though, this is very important. If you want people to take pride in their work, they need to take some ownership of it. For that to happen, they need to have the choice to work on personally meaningful projects and this often means embracing popular culture in a way which we, as grown-up computer scientists, might find baffling or intensely irritating.

Rather than pushing our agenda of what we think is important and berating young people that they ought to find it interesting, we need to meet them halfway. We need to start from their interests, and then help them to see how computer science knowledge can help them achieve something that appeals to them. As in "You're interested in alcohol and The Simpsons. Ideal. How about you make a 3D Homer Simpson whose arm can move up and down to drink beer?" At that point you can start explaining the necessary programming and math concepts to do the rotation in 3D space. Or even just admire what they have figured out by themselves. Once you have them hooked on programming or signed up on your degree program, you can build on it. I'm not saying we don't need to teach sober, serious, and worthy aspects of computer science. Of course we do. I'm just saying we don't need to push it immediately. It's kind of like when you have a new boyfriend and you know you have to introduce him to your weird family. Do you take him to meet the mad uncle with the scary eyebrows straight off? No, you introduce him to a friendly cousin who will make him feel at home and has something in common with him.

What I'm suggesting is not new—there are pockets of excellent outreach work with kids in various parts of the world. I think it's time we tried more of it, even although it is time consuming. After all, we know we can recruit hardcore computer scientists to our degree programs with our current tactics (you know, the people who are born with silver Linux kernels in their mouths). But given there aren't that many of them, it's well worth the effort to reach out to the normal population. Unleash the inner computer scientist in everyone!

From Michael Conover's "Advertainment"

Mobile phones are a way of life in Japan, and this aspect of the culture manifests itself in many ways. Among the more remarkable are the ubiquitous quick response (QR) codes that adorn a sizable percentage of billboards, magazines, and other printed media. In brief, these two-dimensional bar codes offer camera phones with the appropriate software an opportunity to connect with Web-based resources relating to the product or service featured in an advertisement. Encoding a maximum of 4,296 alphanumeric characters, or 1,817 kanji, QR codes are a forerunner of ubiquitous computing technology and portend great things to come.

What's remarkable to me is, for all our similarities, how widely divergent American and Japanese urban cultures can be. The market penetration numbers aren't that strikingly different; a March 2008 study showed that more than 84% of Americans own a cell phone, where a Wolfram Alpha query shows that 83% of Japanese own one. The differences in practice, however, could not be more pronounced. In terms of mobile phone use, walking the streets of Japan is like being on a college campus all the time. It's not unreasonable to estimate that every fifth person is interacting in some way with a mobile device, and here's the rub on this point—Americans make calls on their phones, the Japanese interact.

Ubiquitous Web access and widespread support for the mobile platform, in addition to the vastly increased data-transfer capabilities, mean Japan is a society in which cell phones are a practical mobile computing platform. QR codes have blossomed in this culture not only because they're immensely useful to both organizations and consumers, but because the cultural soil is ripe for their adoption. QR codes have been met with lukewarm response in the U.S., and I fear it may be yet another mobile technology to which we get hip three to five years behind the curve.

Irrespective of this, the applications of QR codes in Japan are at times astounding. For many high-dollar corporations, such as Louis Vuitton and Coca-Cola, the QR code is the ad (art?) itself. Oftentimes, the QR code is the actual content, made of something unexpected or even a medium for digital activism. Because of its robust digital format, creative marketers have a lot of wiggle room when it comes to creating eye-catching, market-driven applications of this technology and, like ubiquitous translation technology, it's the widespread use of Internet-enabled phones that underlies this technological paradigm shift.

Greg Linden is the founder of Geeky Ventures. Michael Conover, a Ph.D. candidate at Indiana University, is a visiting researcher at IBM Tokyo Research Laboratory. Judy Robertson is a lecturer at Heriot-Watt University.

© 2009 ACM 0001-0782/09/1000 $10.00

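As a concrete coda to Conover's post: producing one of the two-dimensional bar codes he describes takes only a few lines. This sketch assumes the third-party Python qrcode package (installed with Pillow support); the URL is a placeholder, not any advertiser's real destination.

```python
# Minimal QR-code generation sketch using the third-party "qrcode"
# package (pip install qrcode[pil]); the URL below is a placeholder.
import qrcode

# Build a code with explicit error correction. Capacity depends on the
# symbol version and error-correction level, which is why the maximum
# quoted above (4,296 alphanumeric characters) applies only at the
# largest version with the lowest correction level.
qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_M)
qr.add_data("http://example.com/campaign")  # placeholder advertisement URL
qr.make(fit=True)  # pick the smallest symbol version that fits the data

img = qr.make_image()
img.save("ad_code.png")  # scannable by any camera-phone reader
```

The higher error-correction levels are what let Japanese marketers embed logos or artwork inside the code and still have it scan, at the cost of some data capacity.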


cacm online

DOI:10.1145/1562764.1562770   David Roman

Following the Leaders

The articles, sections, and services available on Communications' Web site all vie for visitor attention. According to our latest Web statistics, the following features are the most popular in pageviews since the site's formal launch in April 2009.

Top Opinion Content
FYI: All but #2 are from the Viewpoints section of the print issue.
1. Research Evaluation for Computer Science — cacm.acm.org/magazines/2009/4/22954
2. Conferences vs. Journals in Computing Research — cacm.acm.org/magazines/2009/5/24632
3. CS Education in the U.S.: Heading in the Wrong Direction? — cacm.acm.org/magazines/2009/7/32090
4. Time for Computer Science to Grow Up — cacm.acm.org/magazines/2009/8/34492
5. Computing as Social Science — cacm.acm.org/magazines/2009/4/22953
6. Why 'Open Source' Misses the Point of Free Software — cacm.acm.org/magazines/2009/6/28491
7. Advising Students for Success — cacm.acm.org/magazines/2009/3/21781
8. Teaching Computing to Everyone — cacm.acm.org/magazines/2009/5/24643
9. Beyond Computational Thinking — cacm.acm.org/magazines/2009/6/28490
10. An Interview With C.A.R. Hoare — cacm.acm.org/magazines/2009/3/21782

Top Practice Articles
FYI: High readership is common for material from the print edition's Practice pages.
1. The Five-Minute Rule 20 Years Later — cacm.acm.org/magazines/2009/7/32091
2. API Design Matters — cacm.acm.org/magazines/2009/5/24646
3. Whither Sockets? — cacm.acm.org/magazines/2009/6/28495
4. Security in the Browser — cacm.acm.org/magazines/2009/5/24645
5. A Direct Path to Dependable Software — cacm.acm.org/magazines/2009/4/22960

Most Popular Browse by Subject Categories
FYI: The AI page gets more pageviews than the BBS landing page.
1. Artificial Intelligence — cacm.acm.org/browse-by-subject/artificial-intelligence
2. Communications/Networking — cacm.acm.org/browse-by-subject/communications-networking
3. Software — cacm.acm.org/browse-by-subject/software
4. Security — cacm.acm.org/browse-by-subject/security
5. Human-Computer Interaction — cacm.acm.org/browse-by-subject/human-computer-interaction

Top Services
FYI: These pages deliver a service, not just information.
1. Sign In
2. Forgot Your Password
3. Create a Web Account
4. RSS Feeds
5. Alerts and Feeds

ACM Member News

Susan T. Dumais, a principal researcher in the adaptive systems and interaction group at Microsoft Research, is the recipient of the 2009 Gerard Salton Award, presented by the Special Interest Group on Information Retrieval (SIGIR). SIGIR recognized Dumais for her "innovative contributions to information indexing and retrieval systems that have widely impacted the quality of search from the desktop to the Web."

"Obviously I was tremendously honored to be considered in the same category as the previous eight recipients who have shaped the field of information retrieval for almost 50 years," Dumais said in an email interview. "This was quickly followed by the realization that I had worked on search-related problems for almost 30 years."

Asked about her role as a mentor, Dumais said: "Throughout my career, I have had the pleasure of working with amazing mentors and colleagues (notably Drake Bradley, Richard Shiffrin, Thomas Landauer, and Eric Horvitz) who have challenged me to think critically about research problems. I enjoy collaborating with people from different disciplines and perspectives since it continues to broaden the perspective that I have on problems. By mentoring interns, serving on Ph.D. committees, taking part in doctoral consortia and, more generally, by reviewing and editorial activities, I feel that I can give back to the community and encourage the next generation of researchers to tackle key problems using a wide range of theoretical and experimental methods. In doing research, I believe that it is important to understand past contributions and perspectives, and to pursue new solutions to key problems. In mentoring younger researchers, I encourage them to identify key challenges, to pursue them using a broad range of approaches, and not to give up when the first idea or implementation doesn't work."



news

Science | DOI:10.1145/1562764.1562771   David Lindley

Managing Data

Dealing with terabytes of data is not the monumental task it once was. The difficult part is presenting enormous amounts of information in ways that are most useful to a wide variety of users.

The Herbarium at the University of Alaska's Museum of the North houses one of the world's largest collections of arctic plant specimens and Russian flora. For scientists studying the ecology of this region or the changing biodiversity caused by human encroachment and climate change, it's an invaluable resource. However, the Herbarium is not ideally located for most scientists. Because of the considerable expense of traveling to Alaska, scientists often have specimens temporarily shipped to their institutions, but that also costs money, and delicate specimens suffer from the attendant wear and tear. As a result, the Herbarium is scanning and digitizing its extensive collection, making the images and text available on the Internet to scientists, not to mention enthusiastic amateurs, everywhere in the world.

The amount of data involved—about five terabytes so far—is hardly intimidating by today's standards, but it proved to be overwhelming for the Herbarium's previous database partner. Consequently, the Herbarium teamed up with the University of Texas at Austin's Texas Advanced Computing Center (TACC), gaining access to TACC's Corral, an online 1.2 petabyte data repository, via the National Science Foundation's TeraGrid cyberinfrastructure. "We take images and they're immediately downloaded to Texas, and in live time we have a link to that file from our database," says Steffi Ickert-Bond, curator of the Herbarium and an assistant professor of botany at the University of Alaska. The Herbarium's holdings are now accessible via Arctos, an online collaboration of five U.S. university museums.

The ambitious Encyclopedia of Life aims to create a Web page for every known species on Earth.

Nearly 80,000 of the Herbarium's 220,000 specimens have been digitized by hand, with students laboriously keying in detailed information from the specimens' labels, some of which are handwritten and some of which are in Russian. To record the remaining 140,000 specimens, however, the project is shifting to automated label scanning using Google's Tesseract optical character recognition engine.
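A minimal sketch of what such an automated label-scanning step can look like, assuming the third-party pytesseract wrapper and a local Tesseract installation; the file name is a placeholder, and the Herbarium's actual pipeline is certainly more involved.

```python
# Minimal label-digitization sketch: run Tesseract OCR over a scanned
# herbarium label image. Assumes pytesseract and Pillow are installed
# (pip install pytesseract pillow) plus a local Tesseract binary; the
# file name is hypothetical, not the Herbarium's actual workflow.
from PIL import Image
import pytesseract

label = Image.open("specimen_label.png")

# Tesseract ships per-language models, so Russian labels can be handled
# by requesting the "rus" model alongside English.
text = pytesseract.image_to_string(label, lang="eng+rus")

print(text)  # raw label text, ready for parsing into database fields
```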
Handling varied data, such as text, numbers, images, and sounds, no longer poses any great technical difficulties, says Chris Jordan, who is in charge of data infrastructure at TACC. Corral uses Sun Microsystems' Lustre file-handling program to present users with a seamless interface to the data. What's trickier is building a presentation layer that gives scientists in a particular discipline access to the resources in an intuitive and user-friendly way. Unlike researchers in physics and engineering, for example, those working in museums or the humanities aren't accustomed to using computers at the command line level, says Jordan, so "a lot of our effort now is on [building] interfaces for people to locate data along with descriptive metadata in a way that's reasonably easy."

Accessible and Useful
Making vast repositories of biological information widely accessible and useful is the aim of the Encyclopedia of Life (EOL), an ambitious project whose long-term goal is to create a Web page for every known species on Earth. "We want to provide all information about all organisms to all audiences," says Cyndy Parr, EOL's species pages group leader, who is based at the Smithsonian's National Museum of Natural History (NMNH) in Washington, D.C. So far, EOL has about 1.4 million species pages, but most of them are little more than placeholders for known organisms, frequently directing visitors to an external link with more information. Some 158,000 pages, however, contain at least one data object that's been scientifically vetted by EOL. The project is chasing a moving target, since the number of species on the planet is unknown. A figure of 1.8 million known species is commonly accepted, says Parr, but new species are being discovered at a rate of 20,000 or more each year, and some extrapolations suggest that there may be as many as 10 million distinct species.

Far from competing with efforts such as those by the University of Alaska's Herbarium, EOL aims to build species pages that deliver essential information in a uniform style and lead visitors who want to dig deeper to more specialized, detailed resources. Indeed, says EOL Executive Director Jim Edwards, also at NMNH, a principal motive behind EOL was the realization that many research communities are building their own databases—such as AntWeb, FishBase, and others—each with its own design, interface, and search procedures, created to meet the needs of its particular community.

Launched with grants from the MacArthur and Sloan foundations, EOL receives support from several museums and other institutions. EOL invites scientific contributions from amateurs and academics, but uses a network of researchers to decide what information will be included. The site tries to steer a middle course between a pure top-down model, with pages created only by experts, and a self-policed wiki.

EOL resides on servers at the Smithsonian and the Marine Biological Laboratory in Woods Hole, MA, but since it is mainly an index to other information, the data cached on those servers amounts to a few hundred megabytes. EOL's informatics challenges derive in large part from the historical origins of biological information. Even the formal names of species can be treacherous, since some species have been "discovered" on more than one occasion and received duplicate names, and sometimes molecular or DNA analysis demonstrates that what's long been regarded as a single species is, in fact, two distinct species.

In addition, it's essential to be able to search in less formal ways, such as by habitat type, mating behavior, or leaf shape, characteristics that aren't often described by a standardized set of terms. That's a particular issue for one of EOL's partner efforts, the Biodiversity Heritage Library (BHL), a collaboration of 10 natural history museum libraries, botanical libraries, and research institutions in the U.S. and the U.K. that has put nearly 15 million digitized pages from 37,000 books online. Although optical character recognition software has become reliable, scanning millions of pages is labor-intensive and time consuming, says Chris Freeland, BHL's technical director and manager of the bioinformatics department at the Missouri Botanical Garden in St. Louis. Someone—usually a volunteer student—must turn the pages of every book and correctly position them for the camera. After that phase, though, the digitizing process is automated and efficient. Typewritten or commercially printed material is optically well recognized, although unusual scripts, such as older typefaces and cursive characters, can be problematic.


news

Typewritten or commercially printed material is optically well recognized, although unusual scripts, such as older typefaces and cursive characters, can be problematic.
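Once the page images exist, the OCR pass itself is the easily scriptable part of the pipeline. A minimal sketch of such a batch step, using the Tesseract engine mentioned earlier (the file paths and language flag here are invented for illustration):

    import subprocess

    # Run Tesseract over each scanned page image, writing page_0001.txt,
    # etc. for later indexing. The paths below are hypothetical.
    for page in ["scans/page_0001.tif", "scans/page_0002.tif"]:
        out_base = page.rsplit(".", 1)[0]
        # Invocation form: tesseract <image> <output base> -l <language>
        subprocess.run(["tesseract", page, out_base, "-l", "eng"], check=True)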
To date, BHL’s digitized collection
totals 50 terabytes. That’s not so large,
of Life is working of Events
Freeland notes, but replicating it at with computer October 15–16
mirror Web sites and ensuring its reli- scientists on natural 2009 ACM-IEEE International
Symposium on Empirical
ability and security is a challenge. Once
again, though, the larger headache is language processing Software Engineering and
Measurement,
presenting the data. Searching text for software that Lake Buena Vista, FL
Sponsored: SIGSOFT
keywords and phrases is easy, but be-
cause a great deal of biological infor- can intelligently Contact: Laurie Ann Williams
Email: [email protected]
mation is qualitative and descriptive, characterize October 17–18
it’s very difficult to construct a search
that will dig out all references to, say, digitized texts. First Asia-Pacific Symposium
on Internetware,
the mating behavior of bearded drag- Beijing, China
on lizards. As a result, EOL is working Contact: Hong Mei
Email: [email protected]
with computer scientists on natural
language processing software that can October 18–20
intelligently characterize the meaning 4th International Conference
of digitized texts. But accomplishing galaxies are bluer. Moreover, red spiral on Nano-Networks,
Lucerne, Switzerland
that goal automatically and reliably re- galaxies tend to inhabit the outskirts Contact: Alexandre Schmid
mains elusive, Edwards says. of galaxy clusters, a finding that allows Email: Alexandre.schmid@
cosmologists to test theories about the epfl.ch
Crowdsourcing Galaxies galaxies’ origins.
October 19–20
EOL is also investigating crowdsourc- One crowdsourcing lesson Nichol Symposium on Architecture
ing methods in which a Web site visi- has drawn from the success of the Gal- for Networking and
tor is asked to supply keywords for an axy Zoo is that it is crucial to precisely Communication Studies,
image or text extract. One example tell visitors what information is needed Princeton, NJ
Contact: Peter Z. Onufryk
of crowdsourcing paying scientific and why it is important. Some Galaxy Email: [email protected]
dividends is the Galaxy Zoo, an off- Zoo scientists have been discussing
shoot of the Sloan Digital Sky Survey, with other researchers, including bota- October 19–24
which uses an automated telescope nists, how to translate their methods ACM Multimedia Conference,
Beijing, China
to scan the sky for galaxies and digi- to other disciplines, but for a project Sponsored: SIGMM
tize images of them. (For more about as large as EOL, the task is daunting. Contact: Wen Gao
the Sloan Digital Sky Survey, see “Jim The required information is not easily Email: [email protected]
Gray, Astronomer” in the November reduced to a handful of simple ques-
October 20–22
2008 issue of Communications.) The tions, and while pages for popular 4th International Conference
Sloan survey examines so much space mammals and pretty flowers receive on Performance Evaluation
that astronomers cannot hope to look many hits, pages for undistinguished Methodologies and Tools,
Pisa, Italy
at every galaxy image, yet direct visual flies and obscure bacteria languish un-
Contact: Giovanni Stea
inspection is how astronomers have seen in the cyberspace equivalent of a Email: [email protected]
traditionally gained their understand- dusty museum attic. Crowdsourcing
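That redundancy is what turns hobbyist clicks into usable data: with dozens of independent votes per image, a simple consensus rule can accept confident labels and set aside ambiguous galaxies. A minimal sketch of the idea (this is not Galaxy Zoo's actual pipeline, and the 80% agreement threshold is an invented parameter):

    from collections import Counter

    def consensus(labels, threshold=0.8):
        """Combine redundant crowd labels for one galaxy image.

        Returns the majority label if its vote fraction reaches the
        threshold, or None when volunteers disagree too much."""
        label, votes = Counter(labels).most_common(1)[0]
        return label if votes / len(labels) >= threshold else None

    votes = ["spiral"] * 26 + ["elliptical"] * 4   # ~30 classifications
    print(consensus(votes))                        # 'spiral' (87% agreement)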
One statistically solid result to emerge is that 15% of galaxies don't obey the usual rule that the color of elliptical galaxies tends toward the red end of the spectrum and spiral galaxies are bluer. Moreover, red spiral galaxies tend to inhabit the outskirts of galaxy clusters, a finding that allows cosmologists to test theories about the galaxies' origins.

One crowdsourcing lesson Nichol has drawn from the success of the Galaxy Zoo is that it is crucial to precisely tell visitors what information is needed and why it is important. Some Galaxy Zoo scientists have been discussing with other researchers, including botanists, how to translate their methods to other disciplines, but for a project as large as EOL, the task is daunting. The required information is not easily reduced to a handful of simple questions, and while pages for popular mammals and pretty flowers receive many hits, pages for undistinguished flies and obscure bacteria languish unseen in the cyberspace equivalent of a dusty museum attic. Crowdsourcing doesn't work without crowds.

Although EOL has to date attained only a tiny fraction of its ultimate goals, it has earned respect and enthusiasm from those in the biological community who were initially skeptical about its utility, Edwards says. And as EOL continues to grow, he thinks its value will be understood by many different communities, including the public and commercial businesses, so that it will become an indispensable resource.

David Lindley is a science writer and author based in Alexandria, VA.

© 2009 ACM 0001-0782/09/1000 $10.00

Calendar of Events

October 15–16
2009 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, Lake Buena Vista, FL
Sponsored: SIGSOFT
Contact: Laurie Ann Williams
Email: [email protected]

October 17–18
First Asia-Pacific Symposium on Internetware, Beijing, China
Contact: Hong Mei
Email: [email protected]

October 18–20
4th International Conference on Nano-Networks, Lucerne, Switzerland
Contact: Alexandre Schmid
Email: [email protected]

October 19–20
Symposium on Architectures for Networking and Communications Systems, Princeton, NJ
Contact: Peter Z. Onufryk
Email: [email protected]

October 19–24
ACM Multimedia Conference, Beijing, China
Sponsored: SIGMM
Contact: Wen Gao
Email: [email protected]

October 20–22
4th International Conference on Performance Evaluation Methodologies and Tools, Pisa, Italy
Contact: Giovanni Stea
Email: [email protected]

October 21–23
Asia Information Retrieval Symposium, Sapporo, Japan
Contact: Tetsuya Sakai
Email: [email protected]

October 22–24
ACM Special Interest Group for Information Technology Education Conference, Fairfax, VA
Contact: Donald Gantz
Email: [email protected]

Society | DOI:10.1145/1562764.1562773 Alan Joch

Debating Net Neutrality


Advocates seek to protect users from potentially harmful business practices,
but defenders of the status quo say that concerns are overblown.

The controversy about network neutrality—the principle that Internet users should be able to access any Web content or use any applications, without restrictions or limitations from their Internet service provider (ISP)—remains unresolved in the U.S., as does the larger question of who, if anyone, has a legal or commercial right to regulate Internet traffic. Net neutrality proponents advocate for legislation that would keep broadband service providers from controlling Internet content or gaining the ability to impose extra charges for heavy users of the Internet. Opponents argue that existing rules enforced by the Federal Communications Commission (FCC) and others make additional laws unnecessary, or could jeopardize service providers' First Amendment rights.

But now some analysts are warning that the battle may ultimately mean much more than decisions about bits, bandwidth, pricing, and flow control. "The battle for Net neutrality is also a battle over the shape of American culture," says Tim Wu, a Columbia University law professor specializing in copyrights and communications who wrote a 2003 paper, "Network Neutrality, Broadband Discrimination," which popularized the concept of Net neutrality.

Wu fears that in an under-regulated Internet ruled by ISPs, commercial motives could stifle entertainment, culture, and political diversity. For example, Wu says, the current open Internet is "bad for the business models" of traditional newspapers, many of which don't charge for the news they publish on their Web sites. But what if a news organization, for financial reasons, aligned itself with an individual ISP to sell premium content?

"If you imagine a non-open Internet, there could be the official newspapers of AT&T's network, let's say," Wu explains. Part of an AT&T customer's monthly bill would support the Wall Street Journal, for example, but these subscribers "can't get access to USA Today or something else. It starts to become a little bit dicey when you can see how that would help [news organization's] business models."

Innovation and experimentation could also suffer, Wu warns. "In the early days of mobile phones, the only way an application appeared on a mobile phone was if it made money for the phone company," he says. But in an Internet context, because some now-familiar features don't have a direct commercial bent, they may never have been introduced. "In an Internet that [ISPs] can control, why would you put Wikipedia on it?" he asks. "It doesn't make sense because it doesn't make money."

Vinton Cerf, chief Internet evangelist at Google, also sees potential accessibility problems if broadband providers wield too much power. "All of us, in the consumer space especially, stand to lose the opportunity to access new products and services if we don't get [the debate] right," Cerf warns.

Cerf cites the well-publicized legal wrangling over broadband provider Comcast's deliberate slowing down of traffic for customers using BitTorrent and other file-sharing applications on its network in 2007. "As long as we have clumsy or consumer-unfriendly regimes for network management," he says, "we will see problems for some reasonable applications and thus some inhibition of innovative new services."

Cerf envisions a safer Net world where regulations would constrain anti-competitive practices, such as unfair pricing by ISPs that compete with providers of online movie-on-demand services. "Fairly high aggregate caps on total monthly transfers or, preferably, a cap on the minimum assured rate for a given price would be very attractive," Cerf says. The "key point," he adds, is that "the user gets to pay for a guarantee of a certain minimum rate. If the user doesn't use it, others may. That's the whole dynamic of capacity sharing."

Net Neutrality Skeptics
Others dismiss such fears. "The threat level is not red or orange. We are way down in yellow," says Barbara Esbin, director of the Center for Communications and Competition Policy at the Progress and Freedom Foundation, a pro-business Washington, D.C.-based think tank supported in part by broadband providers such as AT&T, Comcast, and Verizon.

Musical artist Moby, left, and U.S. Representative Ed Markey at a SavetheInternet.com press conference on net neutrality in Washington, D.C., with a counterdemonstrator's sign blocking the view of the Capitol dome.


competitive market that’s adequately could impinge on fundamental con-


watched over by the FCC and Federal cerns about personal privacy. David
Trade Commission. The current en- Clark, a senior research scientist at “What we are going
vironment promotes innovation, says Massachusetts Institute of Technol- to be fighting about
Robert Pepper, vice president of global ogy, says the danger is more subtle
technology policy for Cisco Systems. than that of ISPs blatantly snooping on in two years,” says
He cites the ongoing investments in what sort of content is being transmit- David Clark, “is
fiber-optic networks for broadband ted through their pipes. Using deep-
services as well as the move by wireless packet inspections to block content who has the right to
carriers to provide high-speed 3G and “is nonsense in the United States,” observe everything
WiMAX networks. “The whole innova- says Clark. “That’s so 20th century.”
tion engine offers numerous examples The more relevant danger, according you do.”
of new apps being developed, new busi- to Clark, is that ISPs will be motivat-
ness models being tried,” Pepper says. ed by profit opportunities to analyze
“Where is the failure to innovate?” what content subscribers are viewing
Others go further, saying that new and then use that data for commercial
constraints imposed in the name of gain. “What we are going to be fight-
Net neutrality could make it difficult ing about in two years is who has the place ads as precisely tuned to individ-
for service providers to do essential right to observe everything you do,” he ual users’ interests as those inserted
management and maintenance. For predicts. “[ISPs] can completely model by search companies. Clark notes that
example, some Net neutrality advo- you based on your behavior by doing modeling is already happening thanks
cates worry about ISPs examining not a deep-packet inspection. That’s the to companies such as Phorm, which
just the headers of the data traffic— issue that takes over from the rather feed ISPs analyses of consumer behav-
information required for routing it simplistic fear that what these ISPs are ior on the Web.
through the network—but also peer- going to block a packet or degrade ac- But existing regulatory and market
ing into the actual content of the pack- cess or something.” forces may already be working to keep
ets. They worry that such deep packet One attractive commercial option is abuses in check. “I’ve heard of a few
inspections could open the door to for ISPs to replicate the revenue niche companies in the past that have bought
blocking of content for commercial or established by companies like Google. those services,” Farber says. “But they
political reasons. For example, when an Internet user en- usually backed off rapidly because of
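The distinction the advocates are drawing is concrete at the byte level. The toy dissection below (a sketch assuming an unfragmented IPv4/TCP packet, with no validation or option handling) shows how little routing requires versus what deep packet inspection reads:

    def split_packet(packet: bytes):
        """Separate IPv4/TCP headers from the application payload.

        Routing needs only header fields such as the addresses; deep
        packet inspection reads past both headers into the payload."""
        ihl = (packet[0] & 0x0F) * 4               # IPv4 header length
        src = ".".join(str(b) for b in packet[12:16])
        dst = ".".join(str(b) for b in packet[16:20])
        tcp = packet[ihl:]
        offset = (tcp[12] >> 4) * 4                # TCP header length
        return {"src": src, "dst": dst}, tcp[offset:]   # headers, payload

    # Example: 20-byte IP header + 20-byte TCP header + a payload.
    pkt = (bytes([0x45]) + bytes(11) + bytes([10, 0, 0, 1]) +
           bytes([10, 0, 0, 2]) + bytes(12) + bytes([0x50]) +
           bytes(7) + b"GET /")
    print(split_packet(pkt))   # ({'src': '10.0.0.1', ...}, b'GET /')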
But the danger of deep packet inspection should be kept in perspective, say some experts. For one thing, the technique is important for secure network operation, says David Farber, a professor of computer science and public policy at Carnegie Mellon University. "If I can't do deep packet inspections I have no way as a carrier of handling denial-of-service attacks," he says.

Concerns About Privacy
Still, issues surrounding Net neutrality could impinge on fundamental concerns about personal privacy. David Clark, a senior research scientist at Massachusetts Institute of Technology, says the danger is more subtle than that of ISPs blatantly snooping on what sort of content is being transmitted through their pipes. Using deep-packet inspections to block content "is nonsense in the United States," says Clark. "That's so 20th century." The more relevant danger, according to Clark, is that ISPs will be motivated by profit opportunities to analyze what content subscribers are viewing and then use that data for commercial gain. "What we are going to be fighting about in two years is who has the right to observe everything you do," he predicts. "[ISPs] can completely model you based on your behavior by doing a deep-packet inspection. That's the issue that takes over from the rather simplistic fear that these ISPs are going to block a packet or degrade access or something."

One attractive commercial option is for ISPs to replicate the revenue niche established by companies like Google. For example, when an Internet user enters a search query for mortgage rates in Arizona, "that guy is targeted" by search engine companies, Clark points out. These search firms can insert relevant ads into the pages that display query results and charge advertisers a premium for delivering their messages to a highly targeted audience. Various types of analytical applications could give broadband providers an efficient way to slice and dice their customers' usage data, and thus give ISPs an opportunity to argue that they're able to place ads as precisely tuned to individual users' interests as those inserted by search companies. Clark notes that modeling is already happening thanks to companies such as Phorm, which feed ISPs analyses of consumer behavior on the Web.

But existing regulatory and market forces may already be working to keep abuses in check. "I've heard of a few companies in the past that have bought those services," Farber says. "But they usually backed off rapidly because of the noise they were getting" by congressional watchdogs. Indeed, an ad-targeting company similar to Phorm, called NebuAd, shut down its U.S. operation in mid-2009 after becoming the target of congressional scrutiny. That wasn't the final word, however; the company immediately started doing business in England as Insight Ready Ltd.

Alan Joch is a business and technology writer based in Francestown, NH.

© 2009 ACM 0001-0782/09/1000 $10.00

Net Neutrality: A Worldwide Debate

Nations around the world are refereeing similar debates although often in the context of a different competitive environment. For example, industry estimates count more than 200 European network operators that provide Internet services. The U.S., by contrast, has only about a dozen large broadband ISPs.

Nevertheless, the two continents may eventually find themselves influencing how the other resolves the Net neutrality debate. In May, the European Parliament voted for a package of telecommunications policies, including one that affirms the principle of Net neutrality. According to The New York Times, groups of U.S. lobbyists were on site to influence European regulations over the winter when the measure was being debated. The goal: any decisions formalized in Brussels might shape the views of U.S. regulators.

One analyst isn't surprised by the outreach by U.S. lobbyists. "Net neutrality is a rainmaker issue—you can get a lot of funding for your organization if you can point to a crisis that requires your group's advocacy to resolve it," says Barbara Esbin, director of the Center for Communications and Competition Policy at the Progress and Freedom Foundation, a pro-market Washington, D.C.-based think tank.

But Esbin believes the U.S. could ultimately play the most influential role of all. "The U.S. is going to be looked at for how we resolve this," she says, "and it would be great if this country could show leadership again on communications policy."
—Alan Joch


Technology | DOI:10.1145/1562764.1562772 Tom Geller

Shaping the Future


To create shape-shifting robotic ensembles, researchers
need to teach micro-machines to work together.

A prototype of your company's latest product sits on the conference table. You suggest sleeker styling. The 3D model reacts, fluidly changing its angles. A colleague recommends different proportions, and again the model reflects those edicts. A pass of the hands changes the model's color; another reveals its interior, while a third folds it back into a flat sheet. Computer-aided design software records all the changes to replay them later, and to easily produce the final version.

That's the vision of researchers at Carnegie Mellon University and elsewhere as they embark on the new field of robotics known as claytronics. In it, a shape-shifting object is comprised of thousands of miniature robots known as claytronic atoms, or catoms. Each catom has the power to move, receive power, compute, and communicate with others, resulting in an ensemble that dynamically envisions three-dimensional space much as the pixels on a computer monitor depict a two-dimensional plane. Functional scenarios for claytronics include a general-purpose computer changing from laptop to cellphone form as needed, or, say, a plumber's tool that changes from a scuttling, insect-like device in crawl spaces to a slithering snake-like shape when inside pipes.

Ultimately, the claytronics dream goes, we may have morphing humanoids like T-1000 in the movie Terminator 2 or Odo in the TV series "Star Trek: Deep Space Nine." Because their shape-shifting abilities could make them excellent mimics of human forms, such androids might act as reusable stand-ins for surgeons, repair technicians, and other experts who control them remotely.

A top view of two magnet rings, with individual driver boards, from the self-actuating Planar Catom V7 developed by the Carnegie Mellon University-Intel claytronics research team.

Collaborative Research
Two joint ventures have taken the lead in this field. One partnership—between Carnegie Mellon University's Collaborative Research in Programmable Matter project and Intel Research Pittsburgh's Dynamic Physical Rendering project—focuses on programmable materials' shape-shifting aspects as well as the software needed to drive it. The other major effort eyes military applications; this collaboration is between the Defense Advanced Research Projects Agency (DARPA) and a consortium of colleges including the University of California at Berkeley, Harvard University, the University of Pennsylvania, Massachusetts Institute of Technology, and Cornell University.

Both projects face a host of challenges. Most immediately, the individual mechanical devices need to be much smaller than the units most recently demonstrated, which were about the size of large salt shakers. Jason Campbell, a senior staff research scientist at Intel's Pittsburgh laboratory, says his group is working with catoms of a tubular structure that are much smaller—1 millimeter in diameter and 10 millimeters in length. He believes that they need only be small enough to disappear to our eyes and touch. "That gives us a size target between a tenth of a millimeter and one millimeter across," he says, with the size requirement depending on application. A shape-changing radio antenna doesn't need to be very small, for example, whereas for a handheld device, "the consumer is the perceptual system" and miniaturization has great value, says Campbell.

Claytronics research flows out of the field of modular robotics, where individual units are typically a few inches across. But as the units get smaller—and more numerous—the physics controlling them changes. Kasper Støy, associate professor of computer systems engineering at the University of Southern Denmark, points to one difference between the macro and micro levels.

With larger modules, he says, "we have to fight a lot with gravity. All the modules have to be able to lift the others. We've long used electromechanical actuators to overcome gravity in large modules. But at the micro level, you don't want to have a big actuator in every little unit. The question is, How do you go from a bunch of really weak modules and get a big, strong robot?"

Size is not the only challenge. To form stable ensembles, catoms must have power, communication, and adherence to one another. The Carnegie Mellon project is currently attempting to address most of these issues by attaching electrostatic plates to the surface of each catom. Plates given opposite charges would cling to each other; changing the charges in a precisely programmed sequence would force the plates to roll over one another into the desired configuration. Information could be exchanged between adjoining catoms by manipulating their voltage differences.
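The "precisely programmed sequence" can be pictured as a rotating charge pattern. This toy schedule is invented for illustration (the real drive electronics are far more involved): it steps an opposite-charge pair around a ring of plates so the attachment point, and with it the rolling contact, advances one plate per step.

    def charge_schedule(num_plates=8, steps=4):
        """Toy electrostatic actuation: at each step, drive one plate '+'
        and its neighbor '-' so the attracted contact point walks around
        the catom, rolling it against its neighbor."""
        rows = []
        for t in range(steps):
            charges = ["0"] * num_plates           # '0' = undriven plate
            charges[t % num_plates] = "+"
            charges[(t + 1) % num_plates] = "-"
            rows.append("".join(charges))
        return rows

    for row in charge_schedule():
        print(row)   # +-000000, 0+-00000, 00+-0000, 000+-000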
Claytronics researchers are considering several options for providing power. In early versions, "catoms will get power from the table they sit on, through capacitive coupling," says Seth Goldstein, an associate professor of computer science who leads the Carnegie Mellon team. But additional technologies are also under consideration, Goldstein says. "We are looking at magnetic resonance coupling, which goes over a longer distance," he says. "We're also looking at solar power. Because of their size, the catoms are translucent. My guess is that we'll use multiple ways of getting power in."

Mark Yim, an associate professor of mechanical engineering at the University of Pennsylvania who is involved with the DARPA project, has experimented with drawing power from the environment and letting the catoms' intelligence and adhesion turn it into useful motion. "I asked, What if we shake the table to get the modules to move around each other, and the modules determine when to grab or let go?" says Yim. "If they get really small, you might be able to move them with sound waves that shake the air."

As catom production ramps up, these questions assume greater urgency. Goldstein notes that his lab, in collaboration with the U.S. Air Force Research Laboratory, has "refined the process to print the catom's circuits on a standard silicon die, then post-process it to curl up into a sphere or a tube." By this December, Goldstein says, he hopes to fabricate a silicon cylinder about one millimeter in diameter that will "move around under its own control, transfer power from a ground plane, and be able to communicate with other units."

A pair of Planar Catom V8s, each of which weighs 100 grams, with their stack of magnet-sensor rings on the bottom and solid-state electronic control rings on the top.

Education
U.S. CS Courses Decline
The number of computer science courses being taught in high schools throughout America is steadily decreasing, according to the 2009 CSTA Secondary Computer Science Survey conducted by the Computer Science Teachers Association. The survey, based on responses from 1,094 high school teachers in spring 2009, found that only 65% of schools offer an introductory CS course, compared with 73% in 2007 and 78% in 2005. Likewise, the number of advanced placement CS courses has also declined, with 27% in 2009, 32% in 2007, and 40% in 2005.

"There are really three big requirements for reversing the trends highlighted in the survey," said CSTA Director Chris Stephenson in an email interview. "First, we need to ensure that computer science is classified as a rigorous academic subject in K-12 and not as a 'technical' or 'vocational' subject. Second, as a community, we have to do a much better job of explaining to policymakers, parents, and students what unique skills and knowledge computer science provides and what future opportunities it offers. Finally, we have to fix the mess that is computer science teacher preparation and certification in most states, so that computer science teachers can be certified as CS teachers and not as math, science, business, or tech teachers.

"Our highly decentralized education system in the U.S. makes it very difficult to achieve systematic, long-term change," said Stephenson. "In each state, the players, the priorities, and the policies are different."

The survey's findings were not unexpected. "This is the third time we have done this survey," said Stephenson, "and so while we were not really surprised by the results, we were more troubled than ever with the continued decline in the number of courses being offered and the fact that it is getting increasingly harder for students to fit elective computer science courses into their timetables."

The Shapes to Come
A pile of catoms is useless without coordination. And although the challenge of creating claytronic software is similar to that of any massively parallel computer system, there is one key difference: The catoms' physical nature makes error-correction more important.

"Catoms are mechanical devices, and mechanical devices fail," says Campbell.

There's also the question of external control: How much instruction should the aggregate get from an outside source? Boston University electrical and computer engineering professor Tommaso Toffoli expresses skepticism about the ability of a claytronic ensemble to be self-directed. In the 1990s, he says, companies attempting to build massively parallel computers failed because they misread this problem. "People eventually found that instead of having millions of tiny computers the size of ants trying to simulate the brain, they got more work done by having just a few hundred that act as slaves, carrying sand and water to a floating-point processor," Toffoli says.

Campbell disagrees. He believes that for catoms to be as effective as possible, they will each need to have onboard intelligence and that processing tasks will need to be distributed throughout the ensemble. "The more parts you have, the more it matters that you have processing on each part. There are still things you can centralize, but decisions about what modules to move when have to be done inside the aggregate," says Campbell.

The Carnegie Mellon team uses simulations to address these questions, and has created two languages that are optimized to program the ensemble. One, called Meld, is a stateless, declarative, logic-based language used to control the behavior of individual catoms. The second, locally distributed predicates (LDP), is used to recognize patterns and evaluate conditions throughout the ensemble as a whole. Meld and LDP are complementary, says language developer Michael Ashley-Rollman, a computer science graduate student at Carnegie Mellon. "If one robot stops working," he explains, "Meld just throws everything away from that robot and pretends that it was never there." LDP, on the other hand, "permits programmers to manage states themselves"—at the expense of error correction. LDP, Ashley-Rollman says, "gives Meld rules through which it can determine what the state can be."
The Reality Challenge
Commercial viability for catoms will require major advances in the ability to produce large quantities of the devices at low cost. Goldstein hopes that his lab at Carnegie Mellon will build working, submillimeter catoms by the end of 2009. By 2012, he believes, the researchers will have "hundreds of catoms working together intelligently." More fantastically, Goldstein hopes that in 30 years there will be a claytronic humanoid that looks, feels, and moves in ways almost identical to a person.

But such ambitious goals may obscure a more fundamental issue. "Say, you have something like claytronics," Yim says. "You have the mechanical part working. You've got the computational part working so you can control them in a distributed fashion. The question is, What do you do with it? Maybe you want it to go get a beer from the refrigerator. How does it know what is the best shape to be in to open the door?"

Indeed, making claytronics truly useful depends on developments in areas beyond these projects' current scope, including artificial intelligence and human modeling. But the scenarios described only illustrate possible applications for claytronics, and in fact researchers are quicker to say what the technology will be than what it will do. But that's the very problem that general-purpose computers faced until the mid-1980s; like PCs, claytronic ensembles may ultimately prove their worth in ways now unimagined.

Tom Geller is an Oberlin, Ohio-based science, technology, and business writer.

Milestones

Computer Science Awards

The Internet Society and the National Science Foundation (NSF) recently honored members of the computer science community for their innovative research.

POSTEL SERVICE AWARD
The Internet Society awarded the 2009 Jonathan B. Postel Service Award to CSNET, the research networking effort that during the early 1980s provided the critical bridge from the original research undertaken through ARPANET to today's Internet. Four principal investigators—Peter J. Denning, David Farber, Anthony C. Hearn, and Lawrence Landweber—and NSF program officer Kent Curtis, who encouraged and funded CSNET, were recognized for their important roles.

CSNET began in 1981 with a five-year NSF grant, and by 1986 CSNET had connected more than 165 academic, government, and industry computer research groups, which comprised more than 50,000 researchers, educators, and students around the world. Open to computer researchers, CSNET, which was self-governing and self-supporting, demonstrated that researchers valued the type of informal collaboration it made possible. The success of CSNET encouraged NSF to undertake the NSFNET program, which brought open networking to a larger academic community and presaged the emergence of today's Internet.

NSF CAREER AWARDS
NSF recently presented Mathews Jacob, an assistant professor in the department of biomedical engineering at the University of Rochester, and Haijun Su, an assistant professor in the University of Maryland, Baltimore County's mechanical engineering department, with Faculty Early Career Development Awards. Jacob was recognized for his research into the design of computer programs that accelerate the capture of high-resolution images, and that promise to enable early diagnosis of cancer. Su's virtual reality research involves the creation of a virtual design environment in which conceptual designs can be quickly and inexpensively evaluated.




DOI:10.1145/1562764.1562774 Phillip G. Armour

The Business of Software


Contagious Craziness,
Spreading Sanity
Some examples of the upward or downward spiral of behaviors in the workplace.

In the late 1970s the president of a company I worked with asked the company's economists and financial forecasters to provide their future market predictions for the next decade. This was a capital-intensive business with long lead times for equipment purchases, so a realistic appraisal of the likely future business environment was a pretty important thing. When the predictions arrived, they did not match what the president wanted to see—not at all.

It is not uncommon that people in positions of power are strong-willed, opinionated individuals who are used to getting their own way. The president was no exception and his response to the figures presented was quite dramatic. After some blustering, shouting, and numerous derogatory remarks directed at the professional expertise of the financial forecasting group, he scratched out some numbers on a piece of paper and handed them over to the chief economist. "These are the numbers I want to see," he said, "make these happen."

There is no doubt that the will and the drive to do things is a very important attribute. Certainly, having a strong conviction that something cannot be done is usually a self-fulfilling prophecy. If people are convinced that something is not achievable, then they usually won't achieve it—if we argue for our limitations, we get to keep them. But sometimes things cannot be accomplished simply through force of will. Just because we really, really want something doesn't mean we will be able to get it—even if we are the president of a major corporation.

Infectious Conduct
Capricious behavior, particularly on the part of powerful or influential people, can be infectious.


When one person in a system starts acting oddly, nearby people have two choices: to label the behavior as odd or to act like it is normal. If, for whatever reason, people act as if someone's weird actions are OK, they themselves start behaving weirdly. It is almost as if the odd behavior is catching. To compound the problem, we humans have built-in rationalizing capability that kicks in like a reflex when we act in an odd or unethical manner. This rationalization employs a thing called "cognitive dissonance" and allows us to continue to act in a weird way while simultaneously retaining the conviction that we are not acting in a weird way at all.a

The Project Managers: Can-Do
A friend of mine, a project manager at a large electronics company, described this behavior in the planning meetings he attends. The project managers reporting to the strong-willed and forceful vice president in charge of their division almost compete with each other to promise things to the boss and pretend they and their teams can do things they really don't think they can do at all. The boss is the instigator. He puts enormous pressure on his people and browbeats them if they come up with numbers he doesn't like. When one manager "caves" and agrees to something he (secretly and privately) doesn't think can be done, the others feel they have to do so as well. The compliant managers then get praised by the vice president, which reinforces the behavior.

Privately, the managers bemoan their fate and wring their collective managerial hands over what their boss has forced them to commit to. But, until recently, they didn't change their behavior.

The Construction Manager: First Law of Behavior
Years ago, I was coaching a (non-software) manager working in the construction industry. An affable and customer-centric person, his life was being made very difficult by his primary customer. The construction company built telephone switch centers and no matter what the manager promised and agreed to do for his customer, the customer always wanted more. In fact, it seemed like the more he gave the customer the more was wanted. The "customer is always right" approach did not seem to be working. After much discussion we decided that:
• The customer was a very strong-willed and decisive person.
• Asking for "more" is a perfectly appropriate thing for a customer to do, especially from the customer's perspective.
• The customer was asking for more because more was being provided.
• If the construction manager agreed to do more, it meant he must be able to do it.
• As a strong-willed decisive person, the customer was expecting the manager to be equally strong-willed and decisive in deciding what could not be done.
• The customer would continue piling on until the project manager said "no."
This could be called "Newton's First Law of Behavior": every behavior will continue until acted upon by another behavior. We could even extend this to Newton's Third Law and infer that the other behavior must be equal and opposite. This meant the project manager had to learn how to apply equal force in saying "no" to the extra work to balance the force the customer was applying in demanding the extra work.

Breaking the Cycle
To stop this behavior, people and organizations must somehow get out of the cycle. For the construction manager it was to learn to be firm and to realize the customer is not best served by trying to do everything. The customer is best served by most effectively doing the most important things.

For the software project managers dealing with the vice president, my friend bravely decided to break the cycle himself. After working a lot on his project's estimation practice, he vigorously defended his estimates to the vice president and simply refused to back down when pressured to reduce the projections. At one point, he even challenged the vice president to fire him if necessary. The vice president wisely chose not to do this and privately commented that it "took guts" to stand up and hold your ground like that.

Then an interesting thing started happening. Since there were dependencies operating between the division's projects, other project managers started intentionally "hooking" their project estimates and plans to my friend's project plan. Their reasoning was that since the boss doesn't mess with that project, if my project has dependencies on it, he won't mess with my project either. As each project stabilized by being strongly coupled to more realistic project deadlines, everyone started calming down. Inexorably, sanity spread across the organization as people started committing only to things they really believed they could do.

Infectious Insanity
The opposite can be true too, as evidenced in the case of the financial forecasting scenario mentioned earlier. Confronted with the president's demand to "…make these numbers happen…" the Ph.D. economists returned to their financial forecasting groups and had to figure out how to ignore factors or promote factors to make it happen. One can imagine the chief economist's assistant saying "…but why would we ignore this factor, boss? We'd be crazy to do that! That's not how it works!" To which the chief economist might reply "Well, that's how it works if you want to keep your job…".

Then cognitive dissonance kicks in and people start rationalizing: "…well maybe the factor really isn't that important…" and, starting from the top, everyone starts separating from reality.

In the business of software, this cycle results in significant and perennial overcommitments. Lacking well-defined estimation practices, these commitments are simply the wishes of the strongest-willed people with the highest authority—unless the organizations and people that work for them can provide the appropriate counter-response.

It seems that sensible behavior or weird behavior will grow within organizations rather like a disease propagates. It is an upward or a downward spiral. But we can choose the direction.

a Tavris, C. and Aronson, E. Mistakes Were Made (But Not by Me). Harvest Books, 2003.

Phillip G. Armour ([email protected]) is a senior consultant at Corvus International Inc., Deer Park, IL.

Copyright held by author.




DOI:10.1145/1562764.1562775 Martin Campbell-Kelly

Historical Reflections
Computing in the
Depression Era
Insights from difficult times in the computer industry.

Since the beginning of the computer industry, it has been through several major recessions. The first big one was in 1969; there was a major shakeout in the mid-1980s; and most recently there was the dot-com bust in 2001. A common pattern observed in these downturns was that they occurred approximately five years after the establishment of a new computing paradigm—the launch of the IBM System/360 platform in 1964, the personal computer in the late 1970s, and the commercial Internet in 1995. These new computing modes created massive opportunities that the entrepreneurial economy rapidly supplied and then oversupplied. It then took only a small downturn in the wider economy to cause a major recession in the computing industry.

The current recession appears to be quite different when compared to these earlier downturns. Unlike in earlier recessions, computing is not a victim of its own excess, but is suffering from the general economic malaise. Computing is not much different than any other industry in the current recession—it has no unique immunity. However, it has no unique vulnerability either, which offers a small amount of comfort.

Despite the Great Depression, IBM increases its employment, trains more salesmen, and increases engineering efforts.

Lessons from the Great Depression
To get some insight into what is happening today we have to look back to the Great Depression. Electronic computers did not exist at the time of the Wall Street crash of 1929, of course. But there was an office machine industry whose products—typewriters, adding machines, and accounting equipment—performed many of the tasks now done with computers. Indeed, the most successful of the early computer firms sprang from the office machine industry—IBM, NCR, Remington Rand (later Univac), and Burroughs among them.

During the depression years protectionism was seen as one of the policy options. Because this option still surfaces from time to time, it is instructive to see what happened in the 1930s. The U.S. was a net exporter of office machines, so it was not interested in protectionism in this sector, of course. Most of the drive for protection came from nations that imported office machinery—but these policies were often the result of a tit-for-tat elsewhere in the economy. The world's two biggest exporters of office machinery in the 1930s were the U.S. and Germany. Most other advanced countries, such as Britain and France, had their own office machine industries, but they found it difficult to compete at the best of times. Their response in the depression was to impose "ad valorem" import duties of 25% or 50%. In these countries import duties combined with a general lack of business investment made office machines formidably costly, and the domestic products were not always an adequate substitute. Although the import duties on office machinery may have briefly helped domestic manufacturers, retaliatory protectionist measures in other industries simply led to a downward spiral of economic activity. At the height of the depression in 1932, office machine consumption worldwide was down a staggering 60%. It is a difficult lesson, but selective protectionism did not help in the Great Depression and it won't help today.


IBM and NCR
The cases of IBM and NCR make interesting contrasts in weathering the economic storm.a NCR fared badly—it didn't go out of business, but it was not until World War II that it fully recovered. In 1928, the year before the crash, NCR was ranked the world's second largest office machine firm (after Remington Rand) with annual sales of $49 million; IBM was ranked the fourth largest with sales of less than $20 million. A decade later, and well into the recovery, NCR sales were only $37 million, whereas IBM had sales approaching $40 million and it was the largest office machine supplier in the world.

The depression years 1929–1932 were a desperate time for NCR. Before the crash it had been a wonderful firm to work for. Headquartered in Dayton, Ohio, it had built a "daylight factory" set in urban parkland in the 1890s and it had pioneered in employee welfare, with all kinds of health, recreational, and cultural benefits. During the depression years NCR's sales fell catastrophically. Overseas sales, which had formerly amounted to 45% of total sales, were badly affected by protectionism. According to the then-CEO Edward Deeds "commercial treaties, tariff barriers, trades restrictions, and money complications" took "productivity from the Dayton factory." It had to cut its work force by more than half in order to survive—from 8,600 down to 3,500 workers. At the worst of times, all that could be done was to sponsor a relief kitchen, run by the NCR Women's Club, to feed the unemployed and their families. Mirroring the fall in business, NCR's shares fell from a peak of $165 in 1928 to $6.79 in 1932. Recovery was very slow, and was only fully achieved with the coming of World War II when NCR was put to work making armaments, analog computer bombsights, and code-breaking machinery.

The story of IBM in the Great Depression could hardly be more different than NCR's. IBM's main product line between the wars was punched card accounting equipment, which was the most "high-tech" office machinery. There were machines for punching and verifying cards, others for sorting and rearranging them, and the most complex machine—the tabulator—could perform sophisticated accounting procedures and report generation. Although orders for new machines fell drastically, IBM's president Thomas J. Watson, Sr. decided to maintain the manufacturing operation. Watson reasoned that rather than disband IBM's skilled design and manufacturing work force it would be more economical, as well as socially desirable, to stockpile machines for when the upturn came.

For IBM, the upturn came in 1935 when President Franklin D. Roosevelt launched the New Deal and the Social Security Administration. The new administration was hailed as the "world's biggest bookkeeping operation." IBM turned out to be the only office machine firm left that had an adequate inventory of machines to service the operation and it supplied 400 accounting machines to the government. It was a turning point for IBM and its profits soared for the remainder of the 1930s. Watson was justly celebrated for his faith in the future. He became a confidant of Roosevelt and chairman of the International Chamber of Commerce.

Heroic as Watson's strategy was, it would not have been possible for NCR, Remington Rand, or Burroughs to do the same. IBM had an income that was practically recession-proof. First, IBM's machines were not sold but leased. The manufacturing cost of an accounting machine was recovered in the first one or two years' rental, and after that, apart from the cost of maintenance, the machine revenue was pure profit. The accounting machines had an average life of at least 10 years, so it was an extremely profitable business. During the depression, although new orders stalled, very few firms gave up their accounting machines—not only were they dependent on the equipment, but they needed them more than ever to improve efficiency. During the depression years, while IBM did not lease many new machines, it was kept afloat by the revenues from those already in the field.

IBM's second big revenue source was the sale of punched cards. IBM enforced a rule—and got away with it until an antitrust suit in 1936—that only IBM cards could be used on IBM machines. Cards were sold for $1.40 a thousand, far more than they cost to produce. Card revenues accounted for an astounding 30% of IBM's total revenues and an even higher percentage of its profits. Because cards were a consumable, firms had to continually purchase them so long as they were still in business.

The key difference between NCR and IBM was that NCR made almost all its money from the sale of new machines, whereas IBM made its money from two sources: leasing and the sale of punched card consumables. Looking back at the NCR and IBM experiences with the benefit of hindsight, we can see it was an early incarnation of the product-versus-services business models. When they start out, product firms have the advantage that they get a very rapid return on investment. In a single sale, the firm gets all the returns it will ever get from a customer. This helps a firm to grow organically in its early years. In contrast, when a services firm takes a new order it gets a modest annual income extending over many years. This slower rate of return makes it difficult for a firm to retain profits to achieve organic growth and it may need access to capital until it starts to generate a positive income. But the slower growth and steady income makes for much less volatility. As Communications columnist Michael Cusumano noted in 2003—writing in the aftermath of the dot-com bust—the trick for survival of all firms is getting the right mix of products and services.b Products generate a rapid return on investment, but services provide a steady income that gives some immunity against recessions. It's not easy to get the balance right, but IBM did it in the 1930s.
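The arithmetic behind the contrast is easy to sketch. All the dollar figures below are invented for illustration (the column gives no machine prices); what matters is the shape of the two revenue curves:

    def cumulative_revenue(years, upfront=0, annual=0):
        """Cumulative revenue from one customer, year by year."""
        return [upfront + annual * y for y in range(1, years + 1)]

    # Hypothetical: a $1,000 one-time sale vs. a $500/year lease over the
    # ~10-year service life the column cites for accounting machines.
    sale = cumulative_revenue(10, upfront=1000)
    lease = cumulative_revenue(10, annual=500)
    for y in (1, 2, 10):
        print(f"year {y}: sale ${sale[y-1]:,} lease ${lease[y-1]:,}")
    # Year 1: the product sale is ahead. Year 2: the lease has caught up
    # (cost recovered "in the first one or two years' rental"). Year 10:
    # the lease has earned five times the one-time sale.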
a An excellent economic and business history of these firms is: James W. Cortada, Before the Computer: IBM, NCR, Burroughs, and Remington Rand and the Industry They Created 1865–1965, Princeton University Press, 1993.

b Michael A. Cusumano, "Finding Your Balance in the Products and Services Debate," Commun. ACM 46, 3 (Mar. 2003), 15–17.

Martin Campbell-Kelly ([email protected]) is a professor in the Department of Computer Science at the University of Warwick, where he specializes in the history of computing.

Copyright held by author.




DOI:10.1145/1562764.1562777 Phillip Porras

Inside Risks
Reflections on Conficker
An insider’s view of the analysis and implications of the Conficker conundrum.

Conficker is the name applied to a sequence of malicious software. It initially exploited a flaw in Microsoft software, but has undergone significant evolution since then (versions A through E thus far). This column summarizes some of the most interesting aspects of Conficker from the viewpoint of an insider who has been analyzing it. The lessons for the future are particularly relevant.

In mid-February 2009, something unusual, but not unprecedented, occurred in the malware defense community. Microsoft posted its fifth bounty on the heads of those responsible for one of the latest multimillion-node malware outbreaks.4 Previous bounties have included Sasser (awarded), Mydoom, MSBlaster, and Sobig. Conficker's alarming growth rate in early 2009 along with the apparent mystery surrounding its ultimate purpose had raised enough concern among whitehat security researchers that reports were being distributed to the general public and raising concerns in Washington, D.C.

Was it all hype and of relatively small importance among an ever-increasing stream of new and sophisticated malware families? What weaknesses in the ways of the Internet had this botnet brought into focus? Why was it here and when would it end? More broadly, why do some malware outbreaks garner wide attention while other multimillion-victim epidemics (such as Seneka) receive little notice? All are fair questions, and to some degree still remain open.

In several ways, Conficker was not fundamentally novel. The primary infiltration method used by Conficker to infect PCs around the world was known well before Conficker began to spread in late November 2008. The earliest accounts of the Microsoft Windows buffer overflow used by Conficker arose in early September 2008, and a patch to this vulnerability had been distributed nearly a month before Conficker was released. Neither was Conficker the first to introduce dynamic domain generation as a method for selecting the daily Internet rendezvous points used to coordinate its infected population. Prior malware such as Bobax, Kraken, and more recently Torpig and a few other malware families have used dynamic domain generation as well. Conficker's most recent introduction of an encrypted peer-to-peer (P2P) channel to upgrade its ability to rapidly disseminate malware binaries is also preceded by other well-established kin, Storm worm being perhaps the most well-known example.

Nevertheless, among the long history of malware epidemics, very few can claim sustained worldwide infiltration of multiple millions of infected drones. The rapidly evolving set of Conficker variants does represent the state of the art in advanced commercial Internet malware, and provides several valuable insights for those willing to look closely.

What Have We Learned?
Nearly from its inception, Conficker demonstrated just how effectively a random scanning worm can take advantage of the huge worldwide pool of poorly managed and unpatched Internet-accessible computers. Even on those occasions when patches are diligently produced, widely publicized, and auto-disseminated by operating system (OS) and application manufacturers, Conficker demonstrates that millions of Internet-accessible machines may remain permanently vulnerable. In some cases, even security-conscious environments may elect to forgo automated software patching, choosing to trade off vulnerability exposure for some perceived notion of platform stability.7
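The dynamics behind that effectiveness are the standard epidemic ones: each infected host probes addresses at random, so new infections scale with both the current bot count and the remaining vulnerable pool. A minimal simulation, with every parameter invented for illustration:

    def random_scanning_worm(vulnerable=5_000_000, address_space=2**32,
                             scans_per_step=1_000, steps=20):
        """Naive random-scanning epidemic: expected new infections per
        step = infected * scans * P(probe hits a still-clean vulnerable
        host)."""
        infected = 1.0
        history = []
        for _ in range(steps):
            hit_prob = (vulnerable - infected) / address_space
            infected += infected * scans_per_step * hit_prob
            history.append(int(infected))
        return history

    print(random_scanning_worm()[-3:])   # S-curve: slow start, explosive
                                         # middle, saturation near the pool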
Another lesson of Conficker is the ability of malware to manipulate the current facilities through which Internet name space is governed. Dynamic domain generation algorithms (DGAs), along with fast flux (domain name lookups that translate to hundreds or thousands of potential IP addresses), are increasingly adopted by malware perpetrators as a retort to the growing efficiency with which whitehats were able to behead whole botnets by quickly identifying and removing their command and control sites and redirecting all bot client links. While not an original concept, Conficker's DGA produced a new and unique struggle between Conficker's authors and the whitehat community, who fought for control of the daily sets of domains used as Conficker's Internet rendezvous points.2
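The mechanics of that struggle are easiest to see in code. The sketch below is a generic date-seeded generator, not Conficker's actual algorithm (its real one is more elaborate, and the counts, lengths, and TLDs here are invented). Because bots and botmaster derive the same list from nothing but the date, defenders who reverse-engineer the generator can compute tomorrow's domains today and register or block them first, which is exactly the daily contest the column describes.

    import datetime
    import random

    def daily_domains(day, count=250, tlds=(".com", ".net", ".org")):
        """Generic date-seeded DGA sketch: every bot derives the same
        candidate rendezvous domains from today's date alone."""
        rng = random.Random(day.toordinal())      # shared, predictable seed
        domains = []
        for _ in range(count):
            length = rng.randint(8, 12)
            label = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                            for _ in range(length))
            domains.append(label + rng.choice(tlds))
        return domains

    print(daily_domains(datetime.date(2009, 4, 1))[:3])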


Yet another lesson from the study of Conficker is the ominous sophistication with which modern malware is able to terminate, disable, reconfigure, or blackhole native OS and third-party security services.6 Today's malware truly poses a comprehensive challenge to our legacy host-based security products, including Microsoft's own anti-malware and host recovery technologies. Conficker offers a nice illustration of the degree to which security vendors are challenged to not just hunt for malicious logic, but to defend their own availability, integrity, and the network connectivity vital to providing them a continual flow of the latest malware threat intelligence. To address this concern, we may eventually need new OS services (perhaps even hardware support) specifically designed to help third-party security applications maintain their foothold within the host.

What Is Conficker's Purpose?
Perhaps one of the main reasons why Conficker had gained so much early attention was our initial lack of understanding of why it was there. From analyzing its internal binary logic, there is little mystery to Conficker. It is, in fact, a robust and secure distribution utility for disseminating malicious binary applications to millions of computers across the Internet. This utility incorporates a potent arsenal of methods to defend itself from security products, updates, and diagnosis tools. Its authors maintain a rapid development pace and defend their current foothold on their large pool of Internet-connected victim hosts.

Nevertheless, knowing what it can do does not tell us why it is there. What did Conficker's authors plan to send to these infected drones and for what purpose? Early speculation included everything from typical malware business models (spam, rogueAV, host trading, data theft, phishing), to building the next "Dark Google,"5 to fears of full-fledged nation-state information warfare. In some sense, we are fortunate that it now appears that Conficker is currently being used as a platform for conducting wide-scale fraud, spam, and general Internet misuse (rather than traditional uses with well-understood motives). As recently as April 2009, Conficker-infected hosts have been observed downloading Waledec from Waledec server sites, which are known to distribute spam. Conficker has also been observed installing rogue antivirus fraudware, which has proven a lucrative business for malware developers.3

Is Conficker Over?
From October 2008 to April 2009, Conficker's authors had produced five major variants, lettered A through E: a development pace that would rival most Silicon Valley startups. With the exception of Conficker's variant E, which appeared in April and committed suicide on May 5th, Conficker is here to stay, barring some significant organized eradication campaign that goes well beyond security patches. Unlike traditional botnets that lay dormant until instructed, Conficker hosts operate with autonomy, independently and permanently scanning for new victims, and constantly seeking out new coordination points (new Internet rendezvous points and peers for its P2P protocol). However, despite their constant hunt for new victims, our Conficker A and B daily census (C is an overlay on prior-infected hosts) appears to have stabilized at between approximately 5 and 5.5 million unique IP addresses (as of July 2009).1 Nevertheless, any new exploit (a new propagation method) that Conficker's authors decide to distribute is but one peer exchange away from every current Conficker victim. It is most probable that Conficker will remain a profitable criminal tool and relevant threat to the Internet for the foreseeable future.

Is There Any Good News?
Yes. Perhaps in ways the Conficker developers have not foreseen and certainly to their detriment, Conficker has been somewhat of a catalyst to help unify a large group of professional and academic whitehats. Organized groups of whitehats on invitation-only forums have certainly previously self-organized to discuss and share insights. But Conficker brought together a focused group of individuals on a much larger scale with a clearly defined target, now called the Conficker Working Group (CWG).1 The details of the CWG and its structure are outside the scope of this column, but the output from this group provides some insight into their capabilities. Perhaps its most visible action has been the CWG's efforts to work with top-level domain managers to block Internet rendezvous point domains from use by Conficker's authors. Additionally, group members have posted numerous detailed analyses of the Conficker variants, and have used this information to rapidly develop diagnosis and remediation utilities that are widely available to the general public. They actively track the infected population, and have worked with organizations and governments to help identify and remove infected machines. They continue to provide government policymakers, the Internet governance community, and Internet data providers with information to better reduce the impact of future malware epidemics. Whether such organized efforts can be sustained and applied to new Internet threats has yet to be demonstrated.

References
1. Conficker Working Group Web site (June 2009); http://www.confickerworkinggroup.org
2. Giles, J. The inside story of the Conficker worm. New Scientist (June 12, 2009); http://www.newscientist.com/article/mg20227121.500-the-inside-story-of-the-conficker-worm.html?full=true
3. Krebs, B. Massive profits fueling rogue antivirus market. The Washington Post (Mar. 16, 2009); http://voices.washingtonpost.com/securityfix/2009/03/obscene_profits_fuel_rogue_ant.html
4. Lemos, R. Cabal forms to fight Conficker, offers bounty. SecurityFocus (Feb. 13, 2009); http://www.securityfocus.com/news/11546
5. Markoff, J. The Conficker worm: April Fool's joke or unthinkable disaster? Bits, The New York Times (Mar. 19, 2009); http://bits.blogs.nytimes.com/2009/03/19/the-conficker-worm-april-fools-joke-or-unthinkable-disaster/
6. Porras, P.A., Saidi, H., and Yegneswaran, V. Conficker C analysis. SRI International Technical Report (Apr. 4, 2009); http://mtc.sri.com/Conficker/addendumC/#SecurityProductDisablement
7. Williams, C. Conficker seizes city's hospital network. The Register (U.K.) (Jan. 20, 2009); http://www.theregister.co.uk/2009/01/20/sheffield_conficker/

Phillip Porras ([email protected]) leads SRI International's Cyber Threat Analytics effort that includes the Malware Threat Center and BotHunter.

Copyright held by author.


viewpoints

DOI:10.1145/1562764.1562776  Michael Cusumano

Technology Strategy and Management
Dealing with the Venture Capital Crisis
How New Zealand and other small, isolated markets can act as "natural incubators."

THE VENTURE CAPITAL industry, like financial services in general, has fallen on hard times. Venture funds historically have returned about 20% annually to investors, twice the average of U.S. stocks. Like stocks, though, returns over the past year have been sharply negative and investing has fallen dramatically. U.S. investments have dropped from the 2000 peak of $105 billion to a low of $20 billion in 2003 and recovered only to $28 billion in 2008.1 The 2009 numbers are running at half the 2008 level. Other countries see similar trends, including Israel, which usually has the highest level of venture capital investment given the size of its economy.3 Investment there was merely $800 million in 2008, down from $2.7 billion in 2000.a

Part of the problem is that large payoffs have become increasingly scarce: Average valuations of venture-backed startups in 2008 were half those of 2007, while public offerings (IPOs) have dwindled—only six in the U.S. in 2008, compared to 86 in 2007.b But creativity may be another problem for the industry.

a Israel Venture Capital Research Center; http://www.ivc-online.com/
b See VC Industry Statistics Archive from the National Venture Capital Association at http://www.nvca.org/

Venture capital firms have been investing recently in energy and the environment as well as more traditional areas, such as software, health care, biotechnology, medical devices, multimedia, and communications. But perhaps the biggest future challenge will be not the sector but the geography: VC firms put most of their money in home markets to keep a close watch on their entrepreneurs. U.S.-based VC firms also can argue that their home market offers the biggest public offerings and asset sales. But a recent survey of 700 VC firms found that 52% are now investing overseas. Nearly this same number planned to invest more in China, India, and other parts of Asia, while others want to invest in South America.2 It makes sense to explore these big, growing regions. But there is still a lot of competition. What really might jump-start the industry is more creative globalization, with an eye toward using some overseas markets as "natural incubators."

Governments and universities have long supported entrepreneurs with incubators that offer seed money, office space, financial and business advice, and introductions to potential executives and customers. But most incubators I know of have done poorly (I was an advisor to four firms in the late 1990s and worked for several venture funds). In the U.S., entrepreneurs with the best ideas usually do not need to be incubated and get funding directly from VC firms, private individuals serving as "angel" investors, or corporations.
During the Internet boom, incubators also funded weak ideas and weaker teams. Nonetheless, in markets with limited venture capital or angel funding, incubators are still useful to help entrepreneurs get started.

New Zealand-based special-effects company Weta Workshop, whose work includes King Kong shown here, has been awarded NZ$5.8 million in government funding for a research partnership with TechNZ—the business investment program of the Foundation for Research, Science and Technology.

It also occurs to me that potentially even more useful are markets that serve as natural incubators: small countries usually overlooked by international VC firms, either because of their size or isolation, but which have advanced economies, sophisticated customers, good universities, strong intellectual property rights, favorable tax laws, and vibrant entrepreneurial cultures. They might spawn ventures that become important new sources of wealth, social welfare, and employment—for the hosts and the world.

Israel (population seven million) used to be one of these natural incubators. It is the source of numerous important technologies (for example, instant messaging, security software, software testing tools, and SAP's NetWeaver) and was discovered in the 1980s and 1990s by the VC community as well as American and European multinationals. Also, the U.S. provides nearly $3 billion a year in aid to Israel. This, along with large defense and security spending, helps fuel demand for high technology. Another small, advanced market is Finland (population five million). But this country is an integral part of Europe, and Nokia's huge international success has increased the level of international attention and investment. Ireland (population four million) may fall into this category (see my October 2005 Communications column, "Software in Ireland: A Balance of Entrepreneurship and…Lifestyle Management?"), though its entrepreneurs generally prefer to remain small and benefit from proximity to the U.K., with 60 million customers, and Europe, with a half-billion people. There may well be other interesting markets to explore in Southeast Asia, Africa, and Latin America.

But a truly remote country I have come to know well is New Zealand (population four million). Over the past two years I have visited there twice, sponsored by the Foundation for Research, Science, and Technology (FRST; http://www.frst.govt.nz/), a government agency that invests in R&D projects at early-stage companies. This experience has encouraged me to think more about how small, isolated markets (though not small in land area—New Zealand is as big as California or Japan) can add some new life to the venture capital industry.

Economically, New Zealand is an advanced nation, though per-capita income used to rank with Western Europe and has fallen below Spain. The majority of the economy is services, like other industrialized nations, but it also relies heavily on tourism as well as large-scale exports of agricultural commodities. The desire for more variety in the economy is one reason why the government has promoted software, biotech, and other technology-based entrepreneurship. FRST, established in 1990, annually invests about 500 million New Zealand dollars (US$300 million) in local company R&D and other promotional or educational efforts. The government works with local venture capitalists, who in 2008 invested NZ$66 million (US$40 million) in 52 deals (down from NZ$82 million in 60 deals in 2007).4 Americans have not ignored New Zealand entirely. Motorola has invested in Opencloud (mobile software), Vinod Khosla in LanzaTech (bio-fuel), and Sequoia Capital and others in Right Hemisphere (visual collaboration software). In general, though, most of the world's VC firms view New Zealand as too small and remote to merit much attention; this is shortsighted. The following brief descriptions of software-related firms I visited suggest a wide variety of technologies and ideas are being commercialized in just this one sector:

• Endace (http://www.endace.com) sells products that probe and monitor networks by putting "time stamps" on Internet packets. Financial institutions, governments, and other organizations are using the technology to improve security and optimize network performance. The company is public on London's AIM exchange and reported revenues in the last fiscal year of about US$30 million, with customers in 30 countries.

• Framecad (http://www.framecad.com) sells machinery and design software as well as consulting services that enable construction companies to create small or medium-sized steel-framed buildings.
These are relatively inexpensive and do not require much skilled labor to construct. In addition to New Zealand, the company has found customers in developing economies such as China, the Middle East, and Southeast Asia.

• GFG Group (http://www.gfg-group.com) provides off-the-shelf electronic payment solutions (credit and debit cards, mobile phone applications, ATM and POS applications) and related services to 115 million customers in 40 countries. Most of its business outside New Zealand is in Australia, Singapore, the Philippines, and the United Arab Emirates.

• Inro Technologies (http://www.inro.co.nz) sells robotics technology to retrofit manual forklifts and turn them into automated vehicles. Fonterra, New Zealand's largest firm and a major exporter of dairy products, invested because finding forklift drivers is difficult and expensive. It is even more expensive to build new automated warehouses from scratch.

• Methodware (http://www.methodware.com) provides customized risk management and internal audit software to financial services, energy, and utilities companies as well as the public sector. It originally targeted small and medium-size firms but through partnerships has been able to sell to 1,800 corporate clients in 80 countries.

• NextWindow (http://www.nextwindow.com) sells touch-screen computer displays and overlay touch-screen devices and software, initially for large screens and kiosks. It has grown very rapidly through international distribution partnerships and alliances with PC manufacturers around the world, especially in the U.S.

• OpenCloud (http://www.opencloud.com) sells a suite of real-time Java application servers (Rhino) as well as provides development tools and consulting for companies interested in building multimedia products and services, especially for mobile phones. It moved its headquarters to the U.K. to gain better access to customers.

• Smallworlds (http://www.smallworlds.com) is a 3D "virtual world" and social networking site that enables users to set up their own room spaces and then do instant messaging as well as share audio and video content or play games with their friends. Outside companies can use the site and tools as a development platform. The applications run inside a browser and do not require large (and slow) software downloads.

• Weta Workshop (http://www.wetaworkshop.com) is one of the largest video-effects design and production companies in the world serving the movie industry and now branching out into other markets, such as animation for children's TV and technology for video game producers. It is best known for video effects in the Lord of the Rings movies (which were shot in New Zealand). It has won five Academy Awards for visual effects, costumes, and makeup.

• Wherescape (http://www.wherescape.com) combines unique in-house tools with an agile development methodology to build data warehouses quickly and cheaply for a variety of industries. It has hundreds of customers, mainly in New Zealand and Australia, but also works with partners around the world.

• Virtual Infrastructure Professionals (http://www.vipl.co.nz) provides custom-built virtualization, hosting, and disaster recovery solutions (using mostly VMWare and Citrix). It partners with most of the major software and hardware producers, but does most of its business in New Zealand.

These firms seemed very interested in exports, though many lack capital and experienced difficulties growing beyond a certain size. Some companies succeeded only because there was little international competition. The critical decision for government as well as private investors is to determine—before they put in too much time and money—which firms can export in volume. "Horizontal products" that can be sold to almost anyone in any market because they represent common needs, are highly standardized, and do not need much customization or specialized knowledge are the easiest to scale and export. But these businesses attract the most competition (see my July 2003 Communications column, "Beware the Lure of the Horizontal"). The easiest markets to gain a foothold in are "vertical services," such as custom-built applications or specialized services for a particular industry. But labor-intensive or skill-dependent work is difficult to scale and more difficult to export—which is why having an incubator and time to experiment can be important.

The population, physical characteristics, or other unique local requirements of a country can also inspire entrepreneurial creativity. For example, New Zealand has a severe shortage of people, so we see firms use software and other technologies to foster automation (such as retrofit forklifts) and devise inexpensive, fast solutions to common problems (building construction, data warehouse design, virtual hosting, electronic payments). Other firms take advantage of New Zealand's breathtaking scenery and creative art communities. But perhaps most important is for VC firms as well as entrepreneurs investing in small markets to think big—and follow the lead of Nokia or one of the 70 or so Israeli companies that have been listed on the NASDAQ stock exchange (more than any other foreign country). With the proper level of ambition, talent, and opportunity, even a small, isolated company can turn the world into its market.

References
1. Cain Miller, C. What will fix the venture capital crisis? The New York Times (May 4, 2009); http://bits.blogs.nytimes.com/2009/05/04/what-will-fix-the-venture-capital-crisis/ and Sales of startups plummet, along with prices. The New York Times (Apr. 1, 2009); http://bits.blogs.nytimes.com/2009/04/01/sales-of-start-ups-plummet-along-with-prices/
2. Global Economic Downturn Driving Evolution of Venture Capital Industry. National Venture Capital Association Press Release (June 10, 2009); www.nvca.org
3. Megginson, W.L. Towards a global model of venture capital? Journal of Applied Corporate Finance 16 (2004), 89–107.
4. New Zealand Private Equity and Venture Capital Association and Ernst & Young. The New Zealand Private Equity and Venture Capital Monitor 2008; http://www.nzvca.co.nz/Shared/Content/Documents/

Michael Cusumano ([email protected]) is Sloan Management Review Distinguished Professor of Management and Engineering Systems at the MIT Sloan School of Management and School of Engineering in Cambridge, MA.

Copyright held by author.


viewpoints

DOI:10.1145/1562764.1562778  George V. Neville-Neil

Article development led by queue.acm.org

Kode Vicious
Kode Reviews 101
A review of code review do's and don'ts.

Dear KV,
My company recently went through a round of layoffs, and at the same time a lot of people left the company voluntarily. While trying to pick up the pieces, we've discovered that there are just some components of our system that no one understands. Now management is considering hiring back some folks as "consultants" to try to fix this mess. It's not like the code is uncommented or spaghetti; it's just that there are bits of it that no one remaining with the company understands. It all seems pretty stupid and wasteful, but perhaps I'm just a bit grumpy because I didn't get a nice farewell package and instead have to clean up the mess.
Holding the Bag

Dear Holding,
Maybe you should quit and see if you get hired back as a consultant; I hear they get really good rates! Maybe that's not the right advice here. I meant to say, "Welcome to the latest round of recession," wherein companies that grew bloated in good times try to grow lean in bad times, but realize they can't shed all the useless pounds they thought they could. In my career this is round three of this wheel of fortune, and I am sure it will not be the last.

The best way to make sure that everyone on a team or that enough people in a group or company are able to maintain a significant piece of software is to institute system code reviews and to beat senseless anyone who does not attend. Right now, that advice is a bit like closing the barn door after the train has left the station, or…whatever. It does bring me to a few things I would like to say about how to do a proper code review, which is something I don't think most people ever learn how to do—and certainly most programmers would not learn to do this if it were a choice.

There are three phases to any code review: preparation, the review, and afterward. One of the things most people miss when they call for—or in the case of managers, order—a code review is that it is unproductive just to shove a bunch of people in a room and show them unfamiliar code. When I say unproductive, what I mean is that it is a teeth-grinding, head-banging-on-the-desk, vein-pulsing-in-the-head kind of experience. I've stopped such code reviews after 10 minutes when I realized no one in the room had read a single line of the code before they showed up in the meeting room. Please, to preserve life and sanity, prepare for a code review.

Preparations for a code review include selecting a constrained section of the code to look at, informing everyone which piece of code you have picked, and telling them that if they don't read the code before the appointed meeting time, you will be very displeased. When I say constrained, I do not normally mean a single page of code, nor do I mean a set of 30 files. Balance, as in all things KV talks about, is important to the process.
If you're reviewing a particularly nasty bit of code—for example, a job scheduler written by someone who is no longer with the company—then you're going to want to take smaller chunks for each review. The reason for the smaller chunks is that the person who wrote it is not present to explain it to you. A code review without a guide takes about two to three times as long as one conducted with the author of the code.

The next step is to schedule a time to review the code. Do not review code directly after lunch. KV once worked for someone who would schedule meetings at 2 p.m., when lunch was normally from noon to 1 p.m. I never failed to snore in his meetings. Code reviews should be done when people have plenty of brainpower and energy because reading someone else's code normally takes more energy than reading or writing your own: it's extremely difficult to get into someone else's mind-set. Trying to adjust your way of thinking to that of some other programmer takes quite an effort, if only to keep yourself from beating on that other programmer for being so apparently foolish. When I call a code review, it is either early in the day or two to three hours after lunch. Do not perform a code review for more than two hours. There is no such thing as a productive four-hour meeting, except for management types who equate the number of hours they've blathered on with the amount of work they've done.

Now for the review itself. Providing coffee and food is probably a good idea. Food has the effect of making sure people show up, and coffee has the effect of keeping them awake. The person leading the code review, who may or may not be the author of the code, should give a short (no more than 10- to 15-minute) introduction to the code to be reviewed. Make sure to keep the person focused on the code being reviewed. Letting a programmer wax poetic leads to poor poetry and to wishing that, like Ulysses, you had wax in your ears. Once the introduction is complete, you can walk through any header files or places where data structures, base classes, and other elements are defined. With the basics of the code covered, you can move on to the functions or methods and review those next.

One of the most difficult challenges in a code review is to avoid getting distracted by minutiae. Many people like to point out every spelling and grammatical error, as if they're scoring points on a test. Such people are simply distracting the group from understanding the overall structure of the code and possibly digging for deeper problems. The same is true for language fascists, who feel they need to quote the language standard, chapter and verse, as if an ANSI standard is a holy book. Both of these types of issues should be noted, quickly, and then you should move on. Do not dwell here, for it is here that you will lose your way and be dashed upon the rocks by the Scylla of spelling and the Charybdis of syntax.

As in any other engineering endeavor, someone will need to take notes on what problems or issues were found in the review. It is infuriating in the extreme to review the same piece of code twice in six months and to find the same issues, all because everyone was too lazy to write down the issues. A whiteboard or flip chart is fine for this, and in a pinch you might be able to trust the author of the code to do this for you. I would follow the latter path only if I trusted the author of the code, because programmers are generally lazy and will subconsciously avoid work. It's not even that they'll know they left off something to fix, but three months later you'll say to them, "Wait, we told you to fix this. Why isn't this fixed?!" To which you will receive curious looks accompanied by "What? This? Oh, you meant this?" If it's an issue, write it down while the group is thinking about it, and go over the list at the end of the review.

When the review is over, there is still work to be done. Of course, someone has to fix all the spelling, grammatical, and language conformance issues, as well as the genuine bugs, but that's not all. Copies of the notes should be distributed to all participants, just in case something happens to the author of the code, and the marked-up copies of the code should be kept somewhere for future reference. There are now some tools that will handle this electronically. Google has a code-review tool called Rietveld for code kept in the Subversion source-code control system. Although an electronic system for recording and acting on code-review issues is an excellent tool, it is not a substitute for formal code reviews where you discuss the design, as well as the implementation, of a piece of code. A compiler reads your code, but it doesn't understand its purpose or design; for the moment, only a person can do that.
KV

Related articles on queue.acm.org
A conversation with Steve Bourne, Eric Allman, and Bryan Cantrill
http://queue.acm.org/detail.cfm?id=1454460
Kode Vicious: The Return
http://queue.acm.org/detail.cfm?id=1039521
The Yin and Yang of Software Development
http://queue.acm.org/detail.cfm?id=1388787

George V. Neville-Neil ([email protected]) is the proprietor of Neville-Neil Consulting and a member of the ACM Queue editorial board. He works on networking and operating systems code for fun and profit, teaches courses on various programming-related subjects, and encourages your comments, quips, and code snips pertaining to his Communications column.

Copyright held by author.
viewpoints

DOI:10.1145/1562764.1562779  C.A.R. Hoare

Viewpoint
Retrospective: An Axiomatic Basis for Computer Programming
C.A.R. Hoare revisits his past Communications article on the axiomatic approach to programming and uses it as a touchstone for the future.

THIS MONTH MARKS the 40th anniversary of the publication of the first article I wrote as an academic.a I have been invited to give my personal view of the advances that have been made in the subject since then, and the further advances that remain to be made. Which of them did I expect, and which of them surprised me?

a Hoare, C.A.R. An axiomatic basis for computer programming. Commun. ACM 12, 10 (Oct. 1969), 576–580.

C.A.R. Hoare attending the NATO Software Engineering Techniques Conference in 1969.

Retrospective (1969–1999)
My first job (1960–1968) was in the computer industry; and my first major project was to lead a team that implemented an early compiler for ALGOL 60. Our compiler was directly structured on the syntax of the language, so elegantly and so rigorously formalized as a context-free language. But the semantics of the language was even more important, and that was left informal in the language definition. It occurred to me that an elegant formalization might consist of a collection of axioms, similar to those introduced by Euclid to formalize the science of land measurement. My hope was to find axioms that would be strong enough to enable programmers to discharge their responsibility to write correct and efficient programs. Yet I wanted them to be weak enough to permit a variety of efficient implementation strategies, suited to the particular characteristics of the widely varying hardware architectures prevalent at the time.

I expected that research into the axiomatic method would occupy me for my entire working life; and I expected that its results would not find widespread practical application in industry until after I reached retirement age. These expectations led me in 1968 to move from an industrial to an academic career. And when I retired in 1999, both the positive and the negative expectations had been entirely fulfilled.

The main attraction of the axiomatic method was its potential provision of an objective criterion of the quality of a programming language, and the ease with which programmers could use it.
For this reason, I appealed to academic researchers engaged in programming language design to help me in the research. The latest response comes from hardware designers, who are using axioms in anger (and for the same reasons as given above) to define the properties of modern multicore chips with weak memory consistency.

One thing I got spectacularly wrong. I could see that programs were getting larger, and I thought that testing would be an increasingly ineffective way of removing errors from them. I did not realize that the success of tests is that they test the programmer, not the program. Rigorous testing regimes rapidly persuade error-prone programmers (like me) to remove themselves from the profession. Failure in test immediately punishes any lapse in programming concentration, and (just as important) the failure count enables implementers to resist management pressure for premature delivery of unreliable code. The experience, judgment, and intuition of programmers who have survived the rigors of testing are what make programs of the present day useful, efficient, and (nearly) correct. Formal methods for achieving correctness must support the intuitive judgment of programmers, not replace it.

My basic mistake was to set up proof in opposition to testing, where in fact both of them are valuable and mutually supportive ways of accumulating evidence of the correctness and serviceability of programs. As in other branches of engineering, it is the responsibility of the individual software engineer to use all available and practicable methods, in a combination adapted to the needs of a particular project, product, client, or environment. The best contribution of the scientific researcher is to extend and improve the methods available to the engineer, and to provide convincing evidence of their range of applicability. Any more direct advocacy of personal research results actually excites resistance from the engineer.

Progress (1999–2009)
On retirement from University, I accepted a job offer from Microsoft Research in Cambridge (England). I was surprised to discover that assertions, sprinkled more or less liberally in the program text, were used in development practice, not to prove correctness of programs, but rather to help detect and diagnose programming errors. They are evaluated at runtime during overnight tests, and indicate the occurrence of any error as close as possible to the place in the program where it actually occurred. The more expensive assertions were removed from customer code before delivery. More recently, the use of assertions as contracts between one module of program and another has been incorporated in Microsoft implementations of standard programming languages. This is just one example of the use of formal methods in debugging, long before it becomes possible to use them in proof of correctness.
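As a minimal sketch of this practice (the function and its contract are invented here for illustration, not taken from any production code base), an assertion can state the caller's obligation as a precondition and the implementation's promise as a postcondition, both checked at runtime in test builds:

#include <assert.h>
#include <stddef.h>

/* Contract: the caller must supply a non-null array with n > 0;
 * the function promises to return the index of a maximal element.
 * Compiling with -DNDEBUG strips the checks, mirroring the removal
 * of expensive assertions from customer code before delivery. */
int index_of_max(const int *a, int n)
{
    assert(a != NULL && n > 0);     /* precondition: caller's obligation */
    int best = 0;
    for (int i = 1; i < n; i++)
        if (a[i] > a[best])
            best = i;
    for (int i = 0; i < n; i++)     /* postcondition: the promise made */
        assert(a[best] >= a[i]);
    return best;
}

An overnight test run that trips either assertion reports the failure at the module boundary where the contract was broken, rather than wherever the corrupted result finally causes visible damage.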
In 1969, my proof rules for programs were devised to extract easily from a well-asserted program the mathematical 'verification conditions', the proof of which is required to establish program correctness. I expected that these conditions would be proved by the reasoning methods of standard logic, on the basis of standard axioms and theories of discrete mathematics.
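A worked instance, constructed here for illustration (the 1969 paper writes the triple as P {Q} R rather than the now-common {P} Q {R}), shows how such a condition arises. The axiom of assignment is

    {P[e/x]}   x := e   {P}

Instantiating P as x > 0 for the assignment x := x + 1 gives the triple

    {x + 1 > 0}   x := x + 1   {x > 0}

so a program annotated with the precondition x ≥ 0 yields the verification condition

    x ≥ 0  ⇒  x + 1 > 0,

a purely logical formula that is discharged by ordinary arithmetic, with no further reference to the program text.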
What has happened in recent years is exactly the opposite of this, and even more interesting. New branches of applied discrete mathematics have been developed to formalize the programming concepts that have been introduced since 1969 into standard programming languages (for example, objects, classes, heaps, pointers). New forms of algebra have been discovered for application to distributed, concurrent, and communicating processes. New forms of modal logic and abstract domains, with carefully restricted expressive power, have been invented to simplify human and mechanical reasoning about programs. They include the dynamic logic of actions, temporal logic, linear logic, and separation logic. Some of these theories are now being reused in the study of computational biology, genetics, and sociology.

Equally spectacular (and to me unexpected) progress has been made in the automation of logical and mathematical proof. Part of this is due to Moore's Law. Since 1969, we have seen steady exponential improvements in computer capacity, speed, and cost, from megabytes to gigabytes, and from megahertz to gigahertz, and from megabucks to kilobucks. There has been also at least a thousand-fold increase in the efficiency of algorithms for proof discovery and counterexample (test case) generation. Crudely multiplying these factors, a trillion-fold improvement has brought us over a tipping point, at which it has become easier (and certainly more reliable) for a researcher in verification to use the available proof tools than not to do so. There is a prospect that the activities of a scientific user community will give back to the tool-builders a wealth of experience, together with realistic experimental and competition material, leading to yet further improvements of the tools.

For many years I used to speculate about the eventual way in which the results of research into verification might reach practical application. A general belief was that some accident or series of accidents involving loss of life, perhaps followed by an expensive suit for damages, would persuade software managers to consider the merits of program verification.

This never happened. When a bug occurred, like the one that crashed the maiden flight of the Ariane V spacecraft in 1996, the first response of the manager was to intensify the test regimes, on the reasonable grounds that if the erroneous code had been exercised on test, it would have been easily corrected before launch. And if the issue ever came to court, the defense of 'state-of-the-art' practice would always prevail. It was clearly a mistake to try to frighten people into changing their ways. Far more effective is the incentive of reduction in cost. A recent report from the U.S. Department of Commerce has suggested that the cost of programming error to the world economy is measured in tens of billions of dollars per year, most of it falling (in small but frequent doses) on the users of software rather than on the producers.
The phenomenon that triggered interest in software verification from the software industry was totally unpredicted and unpredictable. It was the attack of the hacker, leading to an occasional shutdown of worldwide commercial activity, costing an estimated $4 billion on each occasion. A hacker exploits vulnerabilities in code that no reasonable test strategy could ever remove (perhaps by provoking race conditions, or even bringing dead code cunningly to life). The only way to reach these vulnerabilities is by automatic analysis of the text of the program itself. And it is much cheaper, whenever possible, to base the analysis on mathematical proof, rather than to deal individually with a flood of false alarms. In the interests of security and safety, other industries (automobile, electronics, aerospace) are also pioneering the use of formal tools for programming. There is now ample scope for employment of formal methods researchers in applied industrial research.

Prospective (2009–)
In 1969, I was afraid industrial research would dispose of such vastly superior resources that the academic researcher would be well advised to withdraw from competition and move to a new area of research. But again, I was wrong. Pure academic research and applied industrial research are complementary, and should be pursued concurrently and in collaboration. The goal of industrial research is (and should always be) to pluck the 'low-hanging fruit'; that is, to solve the easiest parts of the most prevalent problems, in the particular circumstances of here and now. But the goal of the pure research scientist is exactly the opposite: it is to construct the most general theories, covering the widest possible range of phenomena, and to seek certainty of knowledge that will endure for future generations. It is to avoid the compromises so essential to engineering, and to seek ideals like accuracy of measurement, purity of materials, and correctness of programs, far beyond the current perceived needs of industry or popularity in the marketplace. For this reason, it is only scientific research that can prepare mankind for the unknown unknowns of the forever uncertain future.

So I believe there is now a better scope than ever for pure research in computer science. The research must be motivated by curiosity about the fundamental principles of computer programming, and the desire to answer the basic questions common to all branches of science: what does this program do; how does it work; why does it work; and what is the evidence for believing the answers to all these questions? We know in principle how to answer them. It is the specification that describes what a program does; it is assertions and other internal interface contracts between component modules that explain how it works; it is programming language semantics that explains why it works; and it is mathematical and logical proof, nowadays constructed and checked by computer, that ensures mutual consistency of specifications, interfaces, programs, and their implementations.

There are grounds for hope that progress in basic research will be much faster than in the early days. I have already described the vastly broader theories that have been proposed to understand the concepts of modern programming. I have welcomed the enormous increase in the power of automated tools for proof. The remaining opportunity and obligation for the scientist is to conduct convincing experiments, to check whether the tools, and the theories on which they are based, are adequate to cover the vast range of programs, design patterns, languages, and applications of today's computers. Such experiments will often be the rational reengineering of existing realistic applications. Experience gained in the experiments is expected to lead to revisions and improvements in the tools, and in the theories on which the tools were based. Scientific rivalry between experimenters and between tool builders can thereby lead to an exponential growth in the capabilities of the tools and their fitness to purpose. The knowledge and understanding gained in worldwide long-term research will guide the evolution of sophisticated design automation tools for software, to match the design automation tools routinely available to engineers of other disciplines.

The End
No exponential growth can continue forever. I hope progress in verification will not slow down until our programming theories and tools are adequate for all existing applications of computers, and for supporting the continuing stream of innovations that computers make possible in all aspects of modern life. By that time, I hope the phenomenon of programming error will be reduced to insignificance: computer programming will be recognized as the most reliable of engineering disciplines, and computer programs will be considered the most reliable components in any system that includes them.

Even then, verification will not be a panacea. Verification technology can only work against errors that have been accurately specified, with as much accuracy and attention to detail as all other aspects of the programming task. There will always be a limit at which the engineer judges that the cost of such specification is greater than the benefit that could be obtained from it; and that testing will be adequate for the purpose, and cheaper. Finally, verification cannot protect against errors in the specification itself. All these limits can be freely acknowledged by the scientist, with no reduction in enthusiasm for pushing back the limits as far as they will go.

C.A.R. Hoare ([email protected]) is a principal researcher at Microsoft Research in Cambridge, U.K., and Emeritus Professor of Computing at Oxford University.

Copyright held by author.


practice

DOI:10.1145/1562764.1562780

Article development led by queue.acm.org

Probing Biomolecular Machines with Graphics Processors

GPU acceleration and other computer performance increases will offer critical benefits to biomedical science.

BY JAMES C. PHILLIPS AND JOHN E. STONE

COMPUTER SIMULATION HAS become an integral part of the study of the structure and function of biological molecules. For years, parallel computers have been used to conduct these computationally demanding simulations and to analyze their results. These simulations function as a "computational microscope," allowing the scientist to observe details of molecular processes too small, fast, or delicate to capture with traditional instruments. Over time, commodity GPUs (graphics processing units) have evolved into massively parallel computing devices, and more recently it has become possible to program them in dialects of the popular C/C++ programming languages.

This has created a tremendous opportunity to employ new simulation and analysis techniques that were previously too computationally demanding to use. In other cases, the computational power provided by GPUs can bring analysis techniques that previously required computation on high-performance computing (HPC) clusters down to desktop computers, making them accessible to application scientists lacking experience with clustering, queuing systems, and the like.

This article is based on our experiences developing software for use by and in cooperation with scientists, often graduate students, with backgrounds in physics, chemistry, and biology. Our programs, NAMD18 and VMD10 (Visual Molecular Dynamics), run on computer systems ranging from laptops to supercomputers and are used to model proteins, nucleic acids, and lipids at the atomic level in order to understand how protein structure enables cellular functions such as catalyzing reactions, harvesting sunlight, generating forces, and sculpting membranes (for additional scientific applications, see http://www.ks.uiuc.edu/). In 2007 we began working with the Nvidia CUDA (Compute Unified Device Architecture) system for general-purpose graphics processor programming to bring the power of many-core computing to practical scientific applications.22

Bottom-up Biology
If one were to design a system to safeguard critical data for thousands of years, it would require massive redundancy, self-replication, easily replaceable components, and easily interpreted formats. These are the same challenges faced by our genes, which build around themselves cells, organisms, populations, and entire species for the sole purpose of continuing their own survival. The DNA of every cell contains both data (the amino acid sequences of every protein required for life) and metadata (large stretches of "junk" DNA that interact with hormones to control whether a sequence is exposed to the cell's protein expression machinery or hidden deep inside the coils of the chromosome).
An early success in applying GPUs to biomolecular modeling involved rapid calculation of electrostatic fields used to place ions in simulated structures. The satellite tobacco mosaic virus model contains hundreds of ions (individual atoms shown in yellow and purple) that must be correctly placed so that subsequent simulations yield correct results.22

The protein sequences of life, once expressed as a one-dimensional chain of amino acids by the ribosome, then fold largely unaided into the unique three-dimensional structures required for their functions. The same protein from different species may have similar structures despite greatly differing sequences. Protein sequences have been selected for the capacity to fold, as random chains of amino acids do not self-assemble into a unique structure in a reasonable time. Determining the folded structure of a protein based only on its sequence is one of the great challenges in biology, for while DNA sequences are known for entire organisms, protein structures are available only through the painstaking work of crystallographers.

Simply knowing the folded structure of a protein is not enough to understand its function. Many proteins serve a mechanical role of generating, transferring, or diffusing forces and torques. Others control and catalyze chemical reactions, efficiently harnessing and expending energy obtained from respiration or photosynthesis. While the amino acid chain is woven into a scaffold of helices and sheets, and some components are easily recognized, there are no rigid shafts, hinges, or gears to simplify the analysis.

To observe the dynamic behavior of proteins and larger biomolecular aggregates, we turn to a computational microscope in the form of a molecular dynamics simulation. As all proteins are built from a fixed set of amino acids, a model of the forces acting on every atom can be constructed for any given protein, including bonded, electrostatic, and van der Waals components.
Newton's laws of motion then describe the dynamics of the protein over time. When experimental observation is insufficient in resolving power, with the computer we have a perfect view of this simple and limited model.

Is it necessary to simulate every atom in a protein to understand its function? Answering no would require a complete knowledge of the mechanisms involved, in which case the simulation could produce little new insight. Proteins are not designed cleanly from distinct components but are in a sense hacked together from available materials. Rising above the level of atoms necessarily abandons some detail, so it is best to reserve this for the study of aggregate-level phenomena that are otherwise too large or slow to simulate.

Tracking the motions of atoms requires advancing positions and velocities forward through millions or billions of femtosecond (10⁻¹⁵ second) time steps to simulate nanoseconds or microseconds of simulated time. Simulation sizes range from a single protein in water with fewer than 100,000 atoms to large multicomponent structures of 1–10 million atoms. Although every atom interacts with every other atom, numerical methods have been developed to calculate long-range interactions for N atoms with order N or N log N rather than N² complexity.
N rather than N2 complexity. for the simulation, analysis, and visual-
Before a molecular dynamics simu- ization of large biomolecular systems.
lation can begin, a model of the biomo- The community of scientists that we
lecular system must be assembled in serve numbers in the tens of thousands
as close to a typical state as possible. and circles the globe, ranging from
First, a crystallographic structure of students with only a laptop to leaders
any proteins must be obtained (pdb.org of their fields with access to the most
provides a public archive), missing de- powerful supercomputers and graph-
tails filled in by comparison with other ics workstations. Some are highly expe-
structures, and the proteins embedded rienced in the art of simulation, while
in a lipid membrane or bound to DNA many are primarily experimental re-
molecules as appropriate. The entire searchers turning to simulation to help
complex is then surrounded by water explain their results and guide future
molecules and an appropriate concen- work.
tration of ions, located to minimize The education of the computational
their electrostatic energy. The simula- scientist is quite different from that of
tion must then be equilibrated at the the scientifically oriented computer
proper temperature and pressure until scientist. Most start out in physics or
the configuration stabilizes. another mathematically oriented field
Processes at the atomic scale are sto- and learn scientific computing infor-
chastic, driven by random thermal fluc- mally from their fellow lab mates and
tuations across barriers in the energy advisors, originally in Fortran 77 and
landscape. Simulations starting from today in Matlab. Although skilled at

36 C OMMUNICATIO NS O F TH E ACM | O C TOB ER 2009 | VO L . 5 2 | N O. 1 0


practice

Although skilled at solving complex problems, they are seldom taught any software design process or the reasons to prefer one solution to another. Some go for years in this environment before being introduced to revision-control systems, much less automated unit tests.

As software users, scientists are similar to programmers in that they are comfortable adapting examples to suit their needs and working from documentation. The need to record and repeat computations makes graphical interfaces usable primarily for interactive exploration, while batch-oriented input and output files become the primary artifacts of the research process.

One of the great innovations in scientific software has been the incorporation of scripting capabilities, at first rudimentary but eventually in the form of general-purpose languages such as Tcl and Python. The inclusion of scripting in NAMD and VMD has blurred the line between user and developer, exposing a safe and supportive programming language that allows the typical scientist to automate complex tasks and even develop new methods. Since no recompilation is required, the user need not worry about breaking the tested, performance-critical routines implemented in C++. Much new functionality in VMD has been developed by users in the form of script-based plug-ins, and C-based plug-in interfaces have simplified the development of complex molecular structure analysis tools and readers for dozens of molecular file formats.

Scientists are quite capable of developing new scientific and computational approaches to their problems, but it is unreasonable to expect the biomedical community to extend their interest and attention so far as to master the ever-changing landscape of high-performance computing. We seek to provide users of NAMD and VMD with the experience of practical supercomputing, in which the skills learned with toy systems running on a laptop remain of use on both the departmental cluster and national supercomputer, and the complexities of the underlying parallel decomposition are hidden. Rather than a fearful and complex instrument, the supercomputer now becomes just another tool to be called upon as the user's work requires.

Given the expense and limited availability of high-performance computing hardware, we have long sought better options for bringing larger and faster simulations to the scientific masses. The last great advance in this regard was the evolution of commodity-based Linux clusters from cheap PCs on shelves into the dominant platform today. The next advance, practical acceleration, will require a commodity technology with strong commercial support, a sustainable performance advantage over several generations, and a programming model that is accessible to the skilled scientific programmer. We believe that this next advance is to be found in 3D graphics accelerators inspired by public demand for visual realism in computer games.

GPU Computing
Biomolecular modelers have always had a need for sophisticated graphics to elucidate the complexities of the large molecular structures commonly studied in structural biology. In 1995, 3D visualization of such molecular structures required desk-side workstations costing tens of thousands of dollars. Gradually, the commodity graphics hardware available for personal computers began to incorporate fixed-function hardware for accelerating 3D rendering. This led to widespread development of 3D games and funded a fast-paced cycle of continuous hardware evolution that has ultimately resulted in the GPUs that have become ubiquitous in modern computers.

GPU hardware design. Modern GPUs have evolved to a high state of sophistication necessitated by the complex interactive rendering algorithms used by contemporary games and various engineering and scientific visualization software. GPUs are now fully programmable massively parallel computing devices that support standard integer and floating-point arithmetic operations.11 State-of-the-art GPU devices contain more than 240 processing units and are capable of performing up to 1TFLOPS. High-end devices contain multiple gigabytes of high-bandwidth on-board memory complemented by several small on-chip memory systems that can be used as program-managed caches, further amplifying effective memory bandwidth.

GPUs are designed as throughput-oriented devices. Rather than optimizing the performance of a single thread or a small number of threads of execution, GPUs are designed to provide high aggregate performance for tens of thousands of independent computations. This key design choice allows GPUs to spend the vast majority of chip die area (and thus transistors) on arithmetic units rather than on caches. Similarly, GPUs sacrifice the use of independent instruction decoders in favor of SIMD (single-instruction multiple-data) hardware designs wherein groups of processing units share an instruction decoder.

Figure 1. The recently constructed Lincoln GPU cluster at the National Center for Supercomputing Applications contains 1,536 CPU cores, 384 GPUs, 3TB of memory, and achieves an aggregate peak floating-point performance of 47.5TFLOPS.
This design choice maximizes the number of arithmetic units per mm² of chip die area, at the cost of reduced performance whenever branch divergence occurs among threads on the same SIMD unit.

The lack of large caches on GPUs means that a different technique must be used to hide the hundreds of clock cycles of latency to off-chip GPU or host memory. This is accomplished by multiplexing many threads of execution onto each physical processing unit, managed by a hardware scheduler that can exchange groups of active and inactive threads as queued memory operations are serviced. In this way, the memory operations of one thread are overlapped with the arithmetic operations of others. Recent GPUs can simultaneously schedule as many as 30,720 threads on an entire GPU. Although it is not necessary to saturate a GPU with the maximum number of independent threads, this provides the best opportunity for latency hiding. The requirement that the GPU be supplied with large quantities of fine-grained data-parallel work is the key factor that determines whether or not an application or algorithm is well suited for GPU acceleration. As a direct result of the large number of processing units, high-bandwidth main memory, and fast on-chip memory systems, GPUs have the potential to significantly outperform traditional CPU architectures on highly data-parallel problems that are well matched to the architectural features of the GPU.

GPU programming. Until recently, the main barrier to using GPUs for scientific computing had been the availability of general-purpose programming tools. Early research efforts such as Brook2 and Sh13 demonstrated the feasibility of using GPUs for nongraphical calculations. In mid-2007 Nvidia released CUDA,16 a new GPU programming toolkit that addressed many of the shortcomings of previous efforts and took full advantage of a new generation of compute-capable GPUs. In late 2008 the Khronos Group announced the standardization of OpenCL,14 a vendor-neutral GPU and accelerator programming interface. AMD, Nvidia, and many other vendors have announced plans to provide OpenCL implementations for their GPUs and CPUs. Some vendors also provide low-level proprietary interfaces that allow third-party compiler vendors such as RapidMind12 and the Portland Group to target GPUs more easily. With the major GPU vendors providing officially supported toolkits for GPU computing, the most significant barrier to widespread use has been eliminated.

Although we focus on CUDA, many of the concepts we describe have analogs in OpenCL. A full overview of the CUDA programming model is beyond the scope of this article, but John Nickolls et al. provide an excellent introduction in their article, "Scalable parallel programming with CUDA," in the March/April 2008 issue of ACM Queue.15 CUDA code is written in C/C++ with extensions to identify functions that are to be compiled for the host, the GPU device, or both. Functions intended for execution on the device, known as kernels, are written in a dialect of C/C++ matched to the capabilities of the GPU hardware. The key programming interfaces CUDA provides for interacting with a device include routines that do the following:

• Enumerate available devices and their properties
• Attach to and detach from a device
• Allocate and deallocate device memory
• Copy data between host and device memory
• Launch kernels on the device
• Check for errors
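For concreteness, the short host-side sketch below exercises each of these routines in order. It is our own minimal example written against the CUDA 2.x runtime; the scale kernel and its launch parameters are illustrative, not part of the CUDA API.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
    if (i < n) v[i] *= s;                           /* guard stragglers */
}

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);                  /* enumerate devices */
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);       /* query properties */
        printf("device %d: %s\n", d, prop.name);
    }
    cudaSetDevice(0);                            /* attach to a device */

    enum { N = 1 << 20 };
    static float host[N];
    float *dev = NULL;
    cudaMalloc((void **)&dev, sizeof(host));     /* allocate device memory */
    cudaMemcpy(dev, host, sizeof(host),
               cudaMemcpyHostToDevice);          /* copy host to device */

    scale<<<(N + 255) / 256, 256>>>(dev, 2.0f, N);   /* launch the kernel */

    cudaMemcpy(host, dev, sizeof(host),
               cudaMemcpyDeviceToHost);          /* copy results back */

    cudaError_t err = cudaGetLastError();        /* check for errors */
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(dev);                               /* deallocate */
    cudaThreadExit();                            /* detach from the device */
    return 0;
}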
When launched on the device, the kernel function is instantiated thousands of times in separate threads according to a kernel configuration that determines the dimensions and number of threads per block and blocks per grid. The kernel configuration maps the parallel calculations to the device hardware and can be selected at runtime for the specific combination of input data and CUDA device capabilities. During execution, a kernel uses its thread and block indices to select desired calculations and input and output data. Kernels can contain all of the usual control structures such as loops and if/else branches, and they can read and write data to shared device memory or global memory as needed. Thread synchronization primitives provide the means to coordinate memory accesses among threads in the same block, allowing them to operate cooperatively on shared data.
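As a toy illustration of these mechanics (our own example, unrelated to NAMD or VMD), the kernel below lets each block cooperatively stage a tile of its input in shared memory, synchronize, and then compute a three-point average; the host selects the configuration at launch time:

#define BLOCK 256   /* threads per block; must match the launch below */

__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2];        /* tile plus one-element halo */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    tile[lid] = (gid < n) ? in[gid] : 0.0f;  /* cooperative load */
    if (threadIdx.x == 0)                    /* left halo element */
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)       /* right halo element */
        tile[BLOCK + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                         /* make all loads block-visible */

    if (gid < n)                             /* if/else-style divergence */
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

/* Host side: the kernel configuration maps the problem onto blocks:  */
/*   smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);      */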
Figure 2. GPU-accelerated electrostatics algorithms enable researchers to place ions appropriately during early stages of model construction. Placement of ions in large structures such as the ribosome shown here previously required the use of HPC clusters for calculation but can now be performed on a GPU-accelerated desktop computer in just a few minutes.
The key challenges involved in developing high-performance CUDA kernels revolve around efficient use of several memory systems and exploiting all available data parallelism. Although the GPU provides tremendous computational resources, this capability comes at the cost of limitations in the number of per-thread registers, the size of per-block shared memory, and the size of constant memory. With hundreds of processing units, it is impractical for GPUs to provide a thread-local stack. Local variables that would normally be placed on the stack are instead allocated from the thread's registers, so recursive kernel functions are not supported.

Analyzing applications for GPU acceleration potential. The first step in analyzing an application to determine its suitability for any acceleration technology is to profile the CPU time consumed by its constituent routines on representative test cases. With profiling results in hand, one can determine to what extent Amdahl's Law limits the benefit obtainable by accelerating only a handful of functions in an application. Applications that focus their runtime into a few key algorithms or functions are usually the best candidates for acceleration.

As an example, if profiling shows that an application spends 10% of its runtime in its most time-consuming function, and the remaining runtime is scattered among several tens of unrelated functions of no more than 2% each, such an application would be a difficult target for an acceleration effort, since the best performance increase achievable with moderate effort would be a mere 10%. A much more attractive case would be an application that spends 90% of its execution time running a single algorithm implemented in one or two functions.
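Stated in Amdahl's-Law terms (notation ours): if a fraction f of the runtime is accelerated by a factor s, the overall speedup is

\[
S(f,s) = \frac{1}{(1-f) + f/s}
\]

With f = 0.10, even s approaching infinity gives S at most 1/0.9, about 1.11; with f = 0.90 and a modest s = 10, S = 1/0.19, more than a five-fold speedup.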
Once profiling analysis has identified the subroutines that are worth accelerating, one must evaluate whether they can be reimplemented with data-parallel algorithms. The scale of parallelism required for peak execution efficiency on the GPU is usually on the order of 100,000 independent computations. The GPU provides extremely fine-grain parallelism with hardware support for multiplexing and scheduling massive numbers of threads onto the pool of processing units. This makes it possible for CUDA to extract parallelism at a level of granularity that is orders of magnitude finer than is usually practical with other parallel-programming paradigms.

GPU-accelerated clusters for HPC. Given the potential for significant acceleration provided by GPUs, there has been a growing interest in incorporating GPUs into large HPC clusters.3,6,8,19,21,24 As a result of this interest, Nvidia now makes high-density rack-mounted GPU accelerators specifically designed for use in such clusters. By housing the GPUs in an external case with its own independent power supply, they can be attached to blade or 1U rackmount servers that lack the required power and cooling capacity for GPUs to be installed internally. In addition to increasing performance, GPU-accelerated clusters also have the potential to provide better power efficiency than traditional CPU clusters.

In a recent test on the AC GPU cluster (http://iacat.uiuc.edu/resources/cluster/) at the National Center for Supercomputing Applications (NCSA, http://www.ncsa.uiuc.edu/), a NAMD simulation of STMV (satellite tobacco mosaic virus) measured the increase in performance provided by GPUs, as well as the increase in performance per watt. In a small-scale test on a single node with four CPU cores and four GPUs (HP xw9400 workstation with a Tesla S1070 attached), the four Tesla GPUs provided a factor of 7.1 speedup over four CPU cores by themselves. The GPUs provided a factor of 2.71 increase in the performance per watt relative to computing only on CPU cores. The increases in performance, space efficiency, power, and cooling have led to the construction of large GPU clusters at supercomputer centers such as NCSA and the Tokyo Institute of Technology. The NCSA Lincoln cluster (http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64TeslaCluster/TechSummary/) containing 384 GPUs and 1,536 CPU cores is shown in Figure 1.

GPU Applications
Despite the relatively recent introduction of general-purpose GPU programming toolkits, a variety of biomolecular modeling applications have begun to take advantage of GPUs.

Molecular Dynamics. One of the most compelling and successful applications for GPU acceleration has been molecular dynamics simulation, which is dominated by N-body atomic force calculation.


One of the early successes with the use of GPUs for molecular dynamics was the Folding@Home project,5,7 where continuing efforts on development of highly optimized GPU algorithms have demonstrated speedups of more than a factor of 100 for a particular class of simulations (for example, protein folding) of very small molecules (5,000 atoms and less). Folding@Home is a distributed computing application deployed on thousands of computers worldwide. GPU acceleration has helped make it the most powerful distributed computing cluster in the world, with GPU-based clients providing the dominant computational power (http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats).

HOOMD (Highly Optimized Object-oriented Molecular Dynamics), a recently developed package specializing in molecular dynamics simulations of polymer systems, is unique in that it was designed from the ground up for execution on GPUs.1 Though in its infancy, HOOMD is being used for a variety of coarse-grain particle simulations and achieves speedups of up to a factor of 30 through the use of GPU-specific algorithms and approaches.

NAMD18 is another early success in the use of GPUs for molecular dynamics.17,19,22 It is a highly scalable parallel program that targets all-atom simulations of large biomolecular systems containing hundreds of thousands to many millions of atoms. Because of the large number of processor-hours consumed by NAMD users on supercomputers around the world, we investigated a variety of acceleration options and have used CUDA to accelerate the calculation of nonbonded forces using GPUs. CUDA acceleration mixes well with task-based parallelism, allowing NAMD to run on clusters with multiple GPUs per node. Using the CUDA streaming API for asynchronous memory transfers and kernel invocations to overlap GPU computation with communication and other work done by the CPU yields speedups of up to a factor of nine relative to CPU-only runs.19

At every iteration NAMD must calculate the short-range interaction forces between all pairs of atoms within a cutoff distance. By partitioning space into patches slightly larger than the cutoff distance, we can ensure that all of an atom's interactions are with atoms in the same or neighboring cubes. Each block in our GPU implementation is responsible for the forces on the atoms in a single patch due to the atoms in either the same or a neighboring patch. The kernel copies the atoms from the first patch in the assigned pair to shared memory and keeps the atoms from the second patch in registers. All threads iterate in unison over the atoms in shared memory, accumulating forces for the atoms in registers only. The accumulated forces for each atom are then written to global memory. Since the forces between a pair of atoms are equal and opposite, the number of force calculations could be cut in half, but the extra coordination required to sum forces on the atoms in shared memory outweighs any savings.
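In schematic form the pattern looks like the kernel below. This is a simplified sketch of ours, not NAMD's production kernel; the patch size, memory layout, and force_term() helper are all illustrative.

#define PATCH 128   /* atoms per patch; assumed equal to the block size */

/* Toy pairwise term: charge product over r^3, softened at r = 0;
   the .w component of each float4 holds the atom's charge. */
__device__ float3 force_term(float4 a, float4 b, float3 acc) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;
    float s  = a.w * b.w * rsqrtf(r2) / r2;
    acc.x += s * dx; acc.y += s * dy; acc.z += s * dz;
    return acc;
}

/* One block per patch pair: patch i is staged in shared memory,
   each thread keeps its own atom of patch j in registers. */
__global__ void patch_pair_forces(const float4 *patch_i,
                                  const float4 *patch_j,
                                  float3 *forces_j) {
    __shared__ float4 tile[PATCH];
    int t = threadIdx.x;
    tile[t] = patch_i[t];                    /* cooperative staging */
    float4 mine = patch_j[t];                /* register-resident atom */
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    __syncthreads();
    for (int k = 0; k < PATCH; k++)          /* threads iterate in unison */
        acc = force_term(mine, tile[k], acc);
    forces_j[blockIdx.x * PATCH + t] = acc;  /* single write to global */
}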
NAMD uses constant memory to store a compressed lookup table of bonded atom pairs for which the standard short-range interaction is not valid. This is efficient because the table fits entirely in the constant cache and is referenced for only a small fraction of pairs. The texture unit, a specialized feature of GPU hardware designed for rapidly mapping images onto surfaces, is used to interpolate the short-range interaction from an array of values that fits entirely in the texture cache. The dedicated hardware of the texture unit can return a separate interpolated value for every thread that requires it faster than the potential function could be evaluated analytically.

Building, visualizing, and analyzing molecular models. Another area where GPUs show great promise is in accelerating many of the most computationally intensive tasks involved in preparing models for simulation, visualizing them, and analyzing simulation results (http://www.ks.uiuc.edu/Research/gpu/).

One of the critical tasks in the simulation of viruses and other structures containing nucleic acids is the placement of ions to reproduce natural biological conditions. The correct placement of ions (see Figure 2) requires knowledge of the electrostatic field in the volume of space occupied by the simulated system. Ions are placed by evaluating the electrostatic potential on a regularly spaced lattice and inserting ions at the minima in the electrostatic field, updating the field with the potential contribution of the newly added ion, and repeating the insertion process as necessary. Of these steps, the initial electrostatic field calculation dominates runtime and is therefore the part best suited for GPU acceleration.


A simple quadratic-time direct Coulomb summation algorithm computes the electrostatic field at each lattice point by summing the potential contributions for all atoms. When implemented optimally, taking advantage of fast reciprocal square-root instructions and making extensive use of near-register-speed on-chip memories, a GPU direct summation algorithm can outperform a CPU core by a factor of 44 or more.17,22 By employing a so-called "short-range cutoff" distance beyond which contributions are ignored, the algorithm can achieve linear time complexity while still outperforming a CPU core by a factor of 26 or more.20 To take into account the long-range electrostatic contributions from distant atoms, the short-range cutoff algorithm must be combined with a long-range contribution. A GPU implementation of the linear-time multilevel summation method, combining both the short-range and long-range contributions, has achieved speedups in excess of a factor of 20 compared with a CPU core.9
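In outline, the direct summation maps one thread to each lattice point. The kernel below is a schematic of ours; the published kernels additionally stage atoms through fast on-chip memory and unroll the inner loop.

/* Potential on one z-plane of the lattice; atoms[k].w is the charge. */
__global__ void direct_coulomb(const float4 *atoms, int natoms,
                               float *potential, int nx, int ny,
                               float spacing, float z) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;

    float x = ix * spacing, y = iy * spacing;
    float sum = 0.0f;
    for (int k = 0; k < natoms; k++) {    /* every atom: O(N) per point */
        float dx = x - atoms[k].x;
        float dy = y - atoms[k].y;
        float dz = z - atoms[k].z;
        /* fast reciprocal square root gives charge / distance */
        sum += atoms[k].w * rsqrtf(dx * dx + dy * dy + dz * dz);
    }
    potential[iy * nx + ix] = sum;
}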
GPU acceleration techniques have proven successful for an increasingly diverse range of other biomolecular applications, including quantum chemistry simulation (http://mtzweb.stanford.edu/research/gpu/) and visualization,23,25 calculation of solvent-accessible surface area,4 and others (http://www.hicomb.org/proceedings.html). It seems likely that GPUs and other many-core processors will find even greater applicability in the future.

Looking Forward
Both CPU and GPU manufacturers now exploit fabrication technology improvements by adding cores to their chips as feature sizes shrink. This trend is anticipated in GPU programming systems, for which many-core computing is the norm, whereas CPU programming is still largely based on a model of serial execution with limited support for tightly coupled on-die parallelism. We therefore expect GPUs to maintain their current factor-of-10 advantage in peak performance relative to CPUs, while their obtained performance advantage for well-suited problems continues to grow. We further note that GPUs have maintained this performance lead despite historically lagging CPUs by a generation in fabrication technology, a handicap that may fade with growing demand.

The great benefits of GPU acceleration and other computer performance increases for biomedical science will come in three areas. The first is doing the same calculations as today, but faster and more conveniently, providing results over lunch rather than overnight to allow hypotheses to be tested while they are fresh in the mind. The second is in enabling new types of calculations that are prohibitively slow or expensive today, such as evaluating properties throughout an entire simulation rather than for a few static structures. The third and greatest is in greatly expanding the user community for high-end biomedical computation to include all experimental researchers around the world, for there is much work to be done and we are just now beginning to uncover the wonders of life at the atomic scale.

Related articles on queue.acm.org

GPUs: A Closer Look
Kayvon Fatahalian and Mike Houston
http://queue.acm.org/detail.cfm?id=1365498

Scalable Parallel Programming with CUDA
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron
http://queue.acm.org/detail.cfm?id=1365500

Future Graphics Architectures
William Mark
http://queue.acm.org/detail.cfm?id=1365501

References
1. Anderson, J.A., Lorenz, C.D., and Travesset, A. General-purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Chemical Physics 227, 10 (2008), 5342–5359.
2. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. Brook for GPUs: Stream computing on graphics hardware. In Proceedings of 2004 ACM SIGGRAPH. ACM, NY, 777–786.
3. Davis, D., Lucas, R., Wagenbreth, G., Tran, J., and Moore, J. A GPU-enhanced cluster for accelerated FMS. In Proceedings of the 2007 DoD High-performance Computing Modernization Program Users Group Conference. IEEE Computer Society, 305–309.
4. Dynerman, D., Butzlaff, E., and Mitchell, J.C. CUSA and CUDE: GPU-accelerated methods for estimating solvent accessible surface area and desolvation. Journal of Computational Biology 16, 4 (2009), 523–537.
5. Elsen, E., Vishal, V., Houston, M., Pande, V., Hanrahan, P., and Darve, E. N-body simulations on GPUs. Technical Report, Stanford University (June 2007); http://arxiv.org/abs/0706.3060.
6. Fan, Z., Qiu, F., Kaufman, A., and Yoakum-Stover, S. GPU cluster for high-performance computing. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 47.
7. Friedrichs, M.S., Eastman, P., Vaidyanathan, V., Houston, M., Legrand, S., Beberg, A.L., Ensign, D.L., Bruns, C.M., and Pande, V.S. Accelerating molecular dynamic simulation on graphics processing units. Journal of Computational Chemistry 30, 6 (2009), 864–872.
8. Göddeke, D., Strzodka, R., Mohd-Yusof, J., McCormick, P., Buijssen, S.H.M., Grajewski, M., and Turek, S. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing 33, 10–11 (2007), 685–699.
9. Hardy, D.J., Stone, J.E., and Schulten, K. Multilevel summation of electrostatic potentials using graphics processing units. Parallel Computing 35 (2009), 164–177.
10. Humphrey, W., Dalke, A., and Schulten, K. VMD: Visual molecular dynamics. Journal of Molecular Graphics 14 (1996), 33–38.
11. Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. Nvidia Tesla: A unified graphics and computing architecture. IEEE Micro 28, 2 (2008), 39–55.
12. McCool, M. Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform. In Proceedings of GSPx Multicore Applications Conference (Oct.–Nov. 2006).
13. McCool, M., Du Toit, S., Popa, T., Chan, B., and Moule, K. Shader algebra. ACM Transactions on Graphics 23, 3 (2004), 787–795.
14. Munshi, A. OpenCL specification version 1.0 (2008); http://www.khronos.org/registry/cl/.
15. Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable parallel programming with CUDA. ACM Queue 6, 2 (2008), 40–53.
16. Nvidia CUDA (Compute Unified Device Architecture) Programming Guide. Nvidia, Santa Clara, CA, 2007.
17. Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., and Phillips, J.C. GPU computing. Proceedings of the IEEE 96 (2008), 879–899.
18. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kale, L., and Schulten, K. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry 26 (2005), 1781–1802.
19. Phillips, J.C., Stone, J.E., and Schulten, K. Adapting a message-driven parallel application to GPU-accelerated clusters. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press, 2008.
20. Rodrigues, C.I., Hardy, D.J., Stone, J.E., Schulten, K., and Hwu, W.W. GPU acceleration of cutoff pair potentials for molecular modeling applications. In Proceedings of the 2008 ACM Conference on Computing Frontiers (2008), 273–282.
21. Showerman, M., Enos, J., Pant, A., Kindratenko, V., Steffen, C., Pennington, R., and Hwu, W. QP: A heterogeneous multi-accelerator cluster. In Proceedings of the 10th LCI International Conference on High-performance Clustered Computing (Mar. 2009).
22. Stone, J.E., Phillips, J.C., Freddolino, P.L., Hardy, D.J., Trabuco, L.G., and Schulten, K. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry 28 (2007), 2618–2640.
23. Stone, J.E., Saam, J., Hardy, D.J., Vandivort, K.L., Hwu, W.W., and Schulten, K. High-performance computation and interactive display of molecular orbitals on GPUs and multicore CPUs. In Proceedings of the 2nd Workshop on General-purpose Processing on Graphics Processing Units. ACM International Conference Proceeding Series 383 (2009), 9–18.
24. Takizawa, H. and Kobayashi, H. Hierarchical parallel processing of large-scale data clustering on a PC cluster with GPU coprocessing. Journal of Supercomputing 36, 3 (2006), 219–234.
25. Ufimtsev, I.S. and Martinez, T.J. Quantum chemistry on graphical processing units. Strategies for two-electron integral evaluation. Journal of Chemical Theory and Computation 4, 2 (2008), 222–231.

James Phillips ([email protected]) is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. For the last decade Phillips has been the lead developer of NAMD, the highly scalable parallel molecular dynamics program for which he received a Gordon Bell Award in 2002.

John Stone ([email protected]) is a senior research programmer in the Theoretical and Computational Biophysics Group at the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. He is the lead developer of the VMD molecular visualization and analysis program.

© 2009 ACM 0001-0782/09/1000 $10.00


practice
DOI:10.1145/1562764.1562781

Article development led by queue.acm.org

The biosciences need an image format capable of high performance and long-term maintenance. Is HDF5 the answer?

BY MATTHEW T. DOUGHERTY, MICHAEL J. FOLK, EREZ ZADOK, HERBERT J. BERNSTEIN, FRANCES C. BERNSTEIN, KEVIN W. ELICEIRI, WERNER BENGER, CHRISTOPH BEST

Unifying Biological Image Formats with HDF5

THE BIOLOGICAL SCIENCES need a generic image format suitable for long-term storage and capable of handling very large images. Images convey profound ideas in biology, bridging across disciplines. Digital imagery began 50 years ago as an obscure technical phenomenon. Now it is an indispensable computational tool. It has produced a variety of incompatible image file formats, most of which are already obsolete.

Several factors are forcing the obsolescence: rapid increases in the number of pixels per image; acceleration in the rate at which images are produced; changes in image designs to cope with new scientific instrumentation and concepts; collaborative requirements for interoperability of images collected in different labs on different instruments; and research metadata dictionaries that must support frequent and rapid extensions. These problems are not unique to the biosciences. Lack of image standardization is a source of delay, confusion, and errors for many scientific disciplines.

There is a need to bridge biological and scientific disciplines with an image framework capable of high computational performance and interoperability. Suitable for archiving, such a framework must be able to maintain images far into the future. Some frameworks represent partial solutions: a few, such as XML, are primarily suited for interchanging metadata; others, such as CIF (Crystallographic Information Framework),2 are primarily suited for the database structures needed for crystallographic data mining; still others, such as DICOM (Digital Imaging and Communications in Medicine),3 are primarily suited for the domain of clinical medical imaging.

What is needed is a common image framework able to interoperate with all of these disciplines, while providing high computational performance. HDF (Hierarchical Data Format)6 is such a framework, presenting a historic opportunity to establish a coin of the realm by coordinating the imagery of many biological communities. Overcoming the digital confusion of incoherent bio-imaging formats will result in better science and wider accessibility to knowledge.

Semantics: Formats, Frameworks, and Images
Digital imagery and computer technology serve a number of diverse biological communities with terminology differences that can result in very different perspectives. Consider the word format.


An x-ray diffraction image taken by Michael Soltis of LSAC on SSRL BL9-2 using an ADSC Q315 detector (SN901).

To the data-storage community the hard-drive format will play a major role in the computer performance of a community's image format, and to some extent, they are inseparable. A format can describe a standard, a framework, or a software tool; and formats can exist within other formats. Image is also a term with several uses. It may refer to transient electrical signals in a CCD (charge-coupled device), a passive dataset on a storage device, a location in RAM, or a data structure written in source code. Another example is framework. An image framework might implement an image standard, resulting in image files created by a software-imaging tool. The framework, the standard, the files, and the tool, as in the case of HDF,6 may be so interrelated that they represent different facets of the same specification. Because these terms are so ubiquitous and varied due to perspective, we shall use them interchangeably, with the emphasis on the storage and management of pixels throughout their lifetime, from acquisition through archiving.

Hierarchical Data Format Version 5
HDF5 is a generic scientific data format with supporting software. Introduced in 1998, it is the successor to the 1988 version, HDF4. NCSA (National Center for Supercomputing Applications) developed both formats for high-performance management of large heterogeneous scientific data. Designed to move data efficiently between secondary storage and memory, HDF5 translates across a variety of computing architectures. Through support from NASA (National Aeronautics and Space Administration), NSF (National Science Foundation), DOE (Department of Energy), and others, HDF5 continues to support international research. The HDF Group, a nonprofit spin-off from the University of Illinois, manages HDF5, reinforcing the long-term business commitment to maintain the format for purposes of archiving and performance.

Because an HDF5 file can contain almost any collection of data entities in a single file, it has become the format of choice for organizing heterogeneous collections consisting of very large and complex datasets.


HDF5 is used for some of the largest scientific data collections, such as the NASA Earth Observation System's petabyte repository of earth science data. In 2008, netCDF (network Common Data Form)10 began using HDF5, bringing in the atmospheric and climate communities. HDF5 also supports the neutron and X-ray communities for instrument data acquisition. Recently, MATLAB implemented HDF5 as its primary storage format. Soon HDF5 will formally be adopted by the International Organization for Standardization (ISO) as part of specification 10303 (STEP, Standard for the Exchange of Product model data). Also of note is the creation of BioHDF1 for organizing rapidly growing genomics data volumes.

The HDF Group's digital preservation efforts make HDF5 well suited for archival tasks. Examples include their involvement with NARA (National Archives and Records Administration), their familiarity with the ISO standard Reference Model for an Open Archival Information System (OAIS),13 and the HDF5 implementation of the Metadata Encoding and Transmission Standard (METS)8 developed by the Digital Library Federation and maintained by the Library of Congress.

Technical Features of HDF5
An HDF5 file is a data container, similar to a file system. Within it, user communities or software applications define their organization of data objects. The basic HDF5 data model is simple, yet extremely versatile in terms of the scope of data that it can store. It contains two primary objects: groups, which provide the organizing structures, and datasets, which are the basic storage structures. HDF5 groups and datasets may also have attributes attached, a third type of data object consisting of small textual or numeric metadata defined by user applications.

An HDF5 dataset is a uniform multidimensional array of elements. The elements might be common data types (for example, integers, floating-point numbers, text strings), n-dimensional memory chunks, or user-defined compound data structures consisting of floating-point vectors or an arbitrary bit-length encoding (for example, a 97-bit floating-point number). An HDF5 group is similar to a directory, or folder, in a computer file system. An HDF5 group contains links to groups or datasets, together with supporting metadata. The organization of an HDF5 file is a directed graph structure in which groups and datasets are nodes, and links are edges. Although the term HDF implies a hierarchical structuring, its topology allows for other arrangements such as meshes or rings.

HDF5 is a completely portable file format with no limit on the number or size of data objects in the collection. During I/O operations, HDF5 automatically takes care of data-type differences, such as byte ordering and data-type size. Its software library runs on Linux, Windows, Mac, and most other operating systems and architectures, from laptops to massively parallel systems. HDF5 implements a high-level API with C, C++, Fortran 90, Python, and Java interfaces. It includes many tools for manipulating and viewing HDF5 data, and a wide variety of third-party applications and tools are available.

The design of the HDF5 software provides a rich set of integrated performance features that allow for access-time and storage-space optimizations. For example, it supports efficient extraction of subsets of data, multiscale representation of images, generic dimensionality of datasets, parallel I/O, tiling (2D), bricking (3D), chunking (nD), regional compression, and the flexible management of user metadata that is interoperable with XML. HDF5 transparently manages byte ordering in its detection of hardware. Its software extensibility allows users to insert custom software "filters" between secondary storage and memory; such filters allow for encryption, compression, or image processing. The HDF5 data model, file format, API, library, and tools are open source and distributed without charge.
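To make the data model concrete, here is a minimal sketch against the HDF5 1.8 C API (the file, group, dataset, and attribute names are our own illustrations): it creates a container, one group, and a chunked, gzip-compressed 3D dataset carrying a small numeric attribute.

#include <hdf5.h>

int main(void) {
    hsize_t dims[3]  = {16, 256, 256};   /* z, y, x image stack */
    hsize_t chunk[3] = {8, 64, 64};      /* 3D bricks enable partial I/O */
    static float pixels[16][256][256];   /* image data, filled elsewhere */

    /* Container and an organizing group */
    hid_t file  = H5Fcreate("sample.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate2(file, "/tomogram", H5P_DEFAULT,
                             H5P_DEFAULT, H5P_DEFAULT);

    /* Bricked (chunked) layout with a gzip compression filter */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/tomogram/pixels", H5T_NATIVE_FLOAT,
                             space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL,
             H5P_DEFAULT, pixels);

    /* Attribute: small user-defined metadata attached to the dataset */
    double voxel = 1.9;                  /* e.g., angstroms per voxel */
    hid_t ascl = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate2(dset, "voxel_size", H5T_NATIVE_DOUBLE,
                            ascl, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &voxel);

    H5Aclose(attr);  H5Sclose(ascl);  H5Dclose(dset);
    H5Sclose(space); H5Pclose(dcpl);  H5Gclose(group); H5Fclose(file);
    return 0;
}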
MEDSBIO
X-ray crystallographers formed MEDSBIO (Consortium for Management of Experimental Data in Structural Biology)7 in 2005 to coordinate various research interests. Later the electron4 and optical14 microscopy communities began attending. During the past 10 years, each community considered HDF5 as a framework to create their independent next-generation image file formats. In the case of NeXus,11 the format developed by the neutron and synchrotron facilities, HDF5 has been the operational infrastructure in its design since 1998.

Ongoing discussions by MEDSBIO have led to the realization that common computational storage algorithms and formats for managing images would tremendously benefit the X-ray, neutron, electron, and optical acquisition communities. Significantly, the entire biological community would benefit from coherent imagery and better-integrated data models. With four bio-imaging communities concluding that HDF5 is essential to their future image strategy, this is a rare opportunity to establish comprehensive agreements on a common scientific image standard across biological disciplines.

Concerns Identified
The following deficiencies impede the immediate and long-term usefulness of digital images:

• The increase in pixels caused by improving digital acquisition resolutions, faster acquisition speeds, and expanding user expectations for "more and faster" is unmanageable. The solution requires technical analysis of the computational infrastructure. The image designer must analyze the context of computer hardware, application software, and the operating-system interactions. This is a moving target monitored over a period of decades. For example, today's biologists use computers having 2GB–16GB of RAM. What method should be used to access a four-dimensional, 1TB image having 30 hyperspectral values per pixel? Virtually all of the current biological image formats organize pixels as 2D XY image planes. A visualization program may require the entire set of pixels to be read into RAM or virtual memory. This, coupled with the poor performance of mass storage under random disk seeks, paging, and memory swaps, effectively makes the image unusable. For a very large image, it is desirable to store it in multiple resolutions (multiscale), allowing interactive access to regions of interest. Visualization software may intensively compute these intermediate data resolutions, later discarded upon exit from the software.


• The inflexibility of current biological image file designs prevents them from adapting to future modalities and dimensionality. Rapid advances in biological instrumentation and computational analysis are leading to complex imagery involving novel physical and statistical pixel specifications.

• The inability to assemble different communities' imagery into an overarching image model allows for ambiguity in the analysis. The integration of various coordinate systems can be an impassable obstacle if not properly organized. There is an increasing need to correlate images of different modalities in order to observe spatial continuity from millimeter to angstrom resolutions.

• The non-archival quality of images undermines their long-term value. The current designs usually do not provide basic archival features recommended by the Digital Library Federation, nor do they address issues of provenance. Frequently, the documentation of a community image format is incomplete, outdated, or unavailable, thus eroding the ability to interpret the digital artifact properly.

Consensus
It would be desirable to adopt an existing scientific, medical, or computer image format, and simply benefit from the consequences. All image formats have their strengths and weaknesses. They tend to fall into two categories: generic and specialized formats. Generic image formats usually have fixed dimensionality or pixel design. For example, MPEG2⁹ is suitable for many applications as long as it is 2D spatial plus 1D temporal using red-green-blue modality that is lossy compressed for the physiological response of the eye. Alternatively, the specialized image formats suffer the difficulties of the image formats we are already using. For example, DICOM3 (the medical imaging standard) and FITS5 (the astronomical imaging standard) store their pixels as 2D slices, although DICOM does incorporate MPEG2 for video-based imagery.

The ability to tile (2D), brick (3D), or chunk (nD) is required to access very large images. Although this is conceptually simple, the software is not, and it must be tested carefully, or subsequent datasets risk being corrupted. That risk would be unacceptable for operational software used in data repositories and research. This function and its certification testing are critical features of HDF software that are not readily available in any other format.

Common Objectives
The objectives of these acquisition communities are identical, requiring performance, interoperability, and archiving. There is a real need for the different bio-imaging communities to coordinate within the same HDF5 data file by using identical high-performance methods to manage pixels; avoiding namespace collisions between the biological communities; and adopting the same archival best practices. All of these would benefit downstream communities such as visualization developers and global repositories.

Performance. The design of an image file format and the subsequent organization of stored pixels determine the performance of computation because of various hardware and software data-path bottlenecks. For example, many specialized biological image formats use simple 2D pixel organizations, frequently without the benefit of compression. These 2D pixel organizations are ill suited for very large 3D images such as electron tomograms or 5D optical images. Those bio-imaging files have sizes that are orders of magnitude larger than the RAM of computers. Worse, widening gaps have formed between CPU/memory speeds, persistent-storage speeds, and network speeds. These gaps lead to significant delays in processing massive data sets. Any file format for massive data has to account for the complex behavior of software layers, all the way from the application, through middleware, down to operating-system device drivers. A generic n-dimensional multimodal image format will require new instantiation and infrastructure to implement new types of data buffers and caches to scale large datasets into much smaller RAM; much of this has been resolved within HDF5.
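The chunked layout is what makes such access patterns practical. As a sketch (the dataset path and region size are our own choices), reading one region of interest from a large 3D dataset with the HDF5 C API touches only the bricks that overlap the request:

#include <hdf5.h>

/* Read a 32x64x64 region of interest starting at (z0,y0,x0) from a
   large chunked dataset, leaving the rest of the image on disk. */
void read_roi(const char *path, hsize_t z0, hsize_t y0, hsize_t x0,
              float roi[32][64][64]) {
    hsize_t start[3] = {z0, y0, x0};
    hsize_t count[3] = {32, 64, 64};

    hid_t file = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/tomogram/pixels", H5P_DEFAULT);
    hid_t fspc = H5Dget_space(dset);            /* file-side selection */
    H5Sselect_hyperslab(fspc, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspc = H5Screate_simple(3, count, NULL);  /* in-memory shape */
    H5Dread(dset, H5T_NATIVE_FLOAT, mspc, fspc, H5P_DEFAULT, roi);
    H5Sclose(mspc); H5Sclose(fspc); H5Dclose(dset); H5Fclose(file);
}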
Interoperability. Historically the acquisition communities have defined custom image formats. Downstream communities, such as visualization and modeling, attempt to implement these formats, forcing the communities to confront design deficiencies. Basic image metadata definitions such as rank, dimension, and modality must be explicitly defined so the downstream communities can easily participate. Different research communities must be able to append new types of metadata to the image, enhancing the imagery as it progresses through the pipeline. Ongoing advances in the acquisition communities will continue to produce new and significant image modalities that feed this image pipeline. Enabling downstream users to easily access pixels and append their community metadata supports interoperability, ultimately leading to fundamental breakthroughs in biology. This is not to suggest that different communities' metadata can be or should be uniformly defined as a single biological metadata schema and ontology in order to achieve an effective image format.

Archiving. Scientific images have a general lack of archival design features. As the sophistication of bio-imagery improves, the demand for the placement of this imagery into long-term global repositories will be greater. This is being done by the Electron Microscopy Data Bank4 in joint development by the National Center for Macromolecular Imaging, the RCSB (Research Collaboratory for Structural Bioinformatics) at Rutgers University, and the European Bioinformatics Institute. Efforts such as the Open Microscopy Environment14 are also developing bio-image informatics tools for lab-based data sharing and data mining of biological images, tools that also require practical image formats for long-term storage and retrieval. Because of the evolving complexity of bio-imagery and the need to subscribe to archival best practices, an archive-ready image format must be self-describing. That is, there must be sufficient infrastructure within the image file design to properly document the content, context, and structure of the pixels and related community metadata, thereby minimizing the reliance on external documentation for interpretation.

The Inertia of Legacy Software
Implementing a new unified image format supporting legacy software across the biological disciplines is a Gordian knot. Convincing software developers to make this a high priority is a difficult proposition. Implementation occurring across hundreds of legacy packages and flawlessly fielded in thousands of laboratories is not a trivial task. Ideally, presenting images simultaneously in their legacy formats and in a new advanced format would mitigate the technical, social, and logistical obstacles. However, this must be accomplished without duplicating the pixels in secondary storage.

One proposal is to mount an HDF5 file as a VFS (virtual file system) so that HDF5 groups become directories and HDF5 datasets become regular files. Such a VFS using FUSE (Filesystem in Userspace) would execute simultaneously across the user-process space and the operating-system space. This hyperspace would manage all HDF-VFS file activity by interpreting, intercepting, and dynamically rearranging legacy image files. A single virtual file presented by the VFS could be composed of several concatenated HDF5 datasets, such as a metadata header dataset and a pixel dataset. Such a VFS file could have multiple simultaneous filenames and legacy formats depending on the virtual folder name that contains it, or on the software application attempting to open it.

The design and function of an HDF-VFS has several possibilities. First, non-HDF5 application software could interact transparently with HDF5 files. PDF files, spreadsheets, and MPEGs would be written and read as routine file-system byte streams. Second, this VFS, when combined with transparent on-the-fly compression, would act as an operationally usable compressed tarball. Third, the VFS could be designed with unique features such as interpreting incoming files as image files. Community-based legacy image format filters would rearrange legacy image files. For example, the pixels would be stored as HDF5 datasets in the appropriate dimensionality and modality, and the related metadata would be stored as a separate HDF5 1D byte dataset. When legacy application software opens the legacy image file, the virtual file is dynamically recombined and presented by the VFS to the legacy software in the same byte order as defined by the legacy image format. The fourth possibility is to endow the VFS with archival and performance-analysis tools that could transparently provide those services to legacy application software.
attempting to open it. byte datasets or attributes. HDF datasets as a single legacy-format
The design and function of an HDF- 4. The majority of the metadata is image file, byte-stream identical. Such
VFS has several possibilities. First, uniquely specific to the biological com- a file system could allow user legacy ap-
non-HDF5 application software could munity that designs it. The use of bina- plications to access and interact with
interact transparently with HDF5 files. ry or XML is an internal concern of the the images through standard file I/O
PDF files, spreadsheets, and MPEGs community creating the image design; calls, obviating the requirement and
would be written and read as routine however, universal image metadata burden of legacy software to include,
file-system byte streams. Second, this will overlap across disciplines, such compile, and link HDF5 API libraries
VFS, when combined with transparent as rank, dimensionality, and pixel mo- in order to access images. The duality
on-the-fly compression, would act as dality. Common image nomenclature of presenting an image as a file and
an operationally usable compressed should be defined to bridge metadata an HDF5 dataset offers a number of
tarball. Third, design the VFS with namespace conversions to legacy for- intriguing possibilities for managing
unique features such as interpreting mats. images and non-image datasets such
incoming files as image files. Commu- 5. Use RDF (Resource Description as spreadsheets or PDF files, or man-
nity-based legacy image format filters Framework)15 as the primary mecha- aging provenance without changes to
would rearrange legacy image files. For nism to manage the association of pix- legacy application software.
example, the pixels would be stored as el datasets and the community meta- 9. Make the image specification
HDF5 datasets in the appropriate di- data. A Subject-Predicate-Object-Time and software API freely accessible and
mensionality and modality, and the tuple stored as a dataset can benefit available without charge. Preferably,
related metadata would be stored as a from HDF5’s B-tree search features. such software should be available un-
separate HDF5 1D byte dataset. When Such an arrangement provides useful der an open source license that allows
legacy application software opens the time stamps for provenance and ge- a community of software developers to
legacy image file, the virtual file is dy- neric logging for administration and contribute to its development. Charg-
namically recombined and presented performance testing. The definition ing the individual biological imaging
by the VFS to the legacy software in the of RDF predicates and objects should communities and laboratories adds

46 COMM UNICATIO NS O F TH E ACM | O C TOBER 2009 | VO L . 5 2 | N O. 1 0


financial complexity to the pursuit of scientific efforts that are frequently underfunded.

10. Establish methods for verification and performance testing. A critical requirement is the ability to determine compliance. Not having compliance testing significantly weakens the archival value by undermining the reliability and integrity of the image data. Performance testing using prototypical test cases assists in the design process by flagging proposed community image designs that will have severe performance problems. Defining baseline test cases will quickly identify software problems in the API.

11. Establish ongoing administrative support. Formal design processes can take considerable time to complete, but some needs—such as technical support, consultation, publishing technical documentation, and managing registration of community image designs—require immediate attention. Establishing a mechanism for imaging communities to register their HDF5 root-level groups as community-specific data domains will provide an essential cornerstone for image design and avoid namespace collisions with other imaging communities.

12. Examine how other formal standards have evolved. Employ the successful strategies and avoid the pitfalls. Developing strategies and alliances with these standards groups will further strengthen the design and adoption of a scientific image standard.

13. Establish the correct forum. This is crucial and will require the guidance of a professional standards organization—or organizations—that perceives the development of such an image standard as part of its mission to serve the public and its membership. Broad consensus and commitment by the scientific, governmental, business, and professional communities is the best and perhaps only way to accomplish this.

…biologists will continue with incompatible methods for solving similar problems, such as not having a common image model. The failure to establish a scalable n-dimensional scientific image standard that is efficient, interoperable, and archival will result in a less-than-optimal research environment and a less-certain future capability for image repositories. The strategic danger of not having a comprehensive scientific image storage framework is the massive generation of unsustainable bio-images. Subsequently, the long-term risks and costs of comfortable inaction will likely be enormous and irreversible.

The challenge for the biosciences is to establish a world-class imaging specification that will endow these indispensable and nonreproducible observations with long-term maintenance and high-performance computational access. The issue is not whether the biosciences will adopt HDF5 as a useful imaging framework—that is already happening—but whether it is time to gather the many separate pieces of the currently highly fragmented patchwork of biological image formats and place them under HDF5 as a common framework. This is the time to unify the imagery of biology, and we encourage readers to contact the authors with their views.

Acknowledgments
This work was funded by the National Center for Research Resources (P41-RR-02250), National Institute of General Medical Sciences (5R01GM079429), Department of Energy (ER64212-1027708-0011962), National Science Foundation (DBI-0610407, CCF-0621463), National Institutes of Health (1R13RR023192-01A1, R03EB008516), The HDF Group R&D Fund, Center for Computation and Technology at Louisiana State University, Louisiana Information Technology Initiative, and NSF/EPS-

Related articles on queue.acm.org

Better Scripts, Better Games
http://queue.acm.org/detail.cfm?id=1483106

Concurrency's Shysters
http://blogs.sun.com/bmc/entry/concurrency_s_shysters

References
1. BioHDF; http://www.geospiza.com/research/biohdf/.
2. Crystallographic Information Framework. International Union of Crystallography; http://www.iucr.org/resources/cif/.
3. DICOM (Digital Imaging and Communications in Medicine); http://medical.nema.org.
4. EMDB (Electron Microscopy Data Bank); http://emdatabank.org/.
5. FITS (Flexible Image Transport System); http://fits.gsfc.nasa.gov/.
6. HDF (Hierarchical Data Format); http://www.hdfgroup.org.
7. MEDSBIO (Consortium for Management of Experimental Data in Structural Biology); http://www.medsbio.org.
8. METS (Metadata Encoding and Transmission Standard); http://www.loc.gov/standards/mets/.
9. MPEG (Moving Picture Experts Group); http://www.chiariglione.org/mpeg/.
10. netCDF (network Common Data Form); http://www.unidata.ucar.edu/software/netcdf/.
11. NeXus (neutron, x-ray and muon science); http://www.nexusformat.org.
12. NFS (Network File System); http://www.ietf.org/rfc/rfc3530.txt.
13. OAIS (Open Archival Information System); http://nost.gsfc.nasa.gov/isoas/overview.html.
14. OME (Open Microscopy Environment); http://www.openmicroscopy.org/.
15. RDF (Resource Description Framework); http://www.w3.org/RDF/.

Matthew T. Dougherty ([email protected]) is at the National Center for Macromolecular Imaging, specializing in cryo-electron microscopy, visualization, and animation.

Michael J. Folk ([email protected]) is president of The HDF Group.

Erez Zadok ([email protected]) is associate professor at Stony Brook University, specializing in computer storage systems performance and design.

Herbert J. Bernstein ([email protected]) is professor of computer science at Dowling College, active in the development of IUCr standards.

Frances C. Bernstein ([email protected]) is retired from Brookhaven National Laboratory after 24 years at the Protein Data Bank, active in macromolecular data representation and validation.

Kevin W. Eliceiri ([email protected]) is director at the Laboratory for Optical and Computational Instrumentation, University of Wisconsin-Madison, active in the development of tools for bio-image informatics.

Werner Benger ([email protected]) is visualization research scientist at Louisiana State University, specializing in astrophysics and computational fluid dynamics.

Christoph Best ([email protected]) is project leader at the
Summary CoR (EPS-0701491, CyberTools). European Bioinformatics Institute, specializing in electron
microscopy image informatics.
Out of necessity, bioscientists are inde-
pendently assessing and implement-
ing HDF5, but no overarching group is
responsible for establishing a compre- Related articles
on queue.acm.org
hensive bio-imaging format, and there
are few best practices to rely on. Thus, Catching disk latency in the act
there is a real possibility that biolo- http://queue.acm.org/detail.cfm?id=1483106 © 2009 ACM 0001-0782/09/1000 $10.00

O CTO B E R 2 0 0 9 | VO L. 52 | NO. 1 0 | C OM M U N IC AT ION S OF T HE ACM 47


practice

DOI:10.1145/1562764.1562782

Article development led by queue.acm.org

Stanford professor Pat Hanrahan sits down with the noted hedge fund founder, computational biochemist, and (above all) computer scientist.

A Conversation with David E. Shaw
DAVID SHAW CONSIDERS himself first and foremost a computer scientist. It's a fact that's sometimes overshadowed by the activities of his two highly successful, yet very different, ventures: the hedge fund D. E. Shaw & Co., which he founded 20 years ago, and the research lab D. E. Shaw Research, where he now conducts hands-on research in the field of computational biochemistry. The former makes money through rigorous quantitative and qualitative investment techniques, while the latter spends money simulating complex biochemical processes. But a key element to both organizations' success has been Shaw's background in computer science. Serving as interviewer, computer graphics researcher and Stanford professor Pat Hanrahan [a] points out that one of Shaw's unique gifts is his ability to effectively apply computer science techniques to diverse problem domains.

In this interview, Hanrahan and Shaw discuss Shaw's latest project at D. E. Shaw Research—Anton, a special-purpose supercomputer designed to speed up molecular dynamics simulations by several orders of magnitude. Four 512-processor machines are now active and already helping scientists to understand how proteins interact with each other and with other molecules at an atomic level of detail. Shaw's hope is that these "molecular microscopes" will help unravel some biochemical mysteries that could lead to the development of more effective drugs for cancer and other diseases. If his track record is any indication, the world has a lot to be hopeful for.

[a] http://graphics.stanford.edu/~hanrahan/
PAT HANRAHAN: What led you to form D. E. Shaw Research?

DAVID SHAW: Before starting the lab, I'd spent a number of years in the financial world applying quantitative and computational techniques to the process of investment management. During the early years of D. E. Shaw & Co., the financial firm, I'd been personally involved in research aimed at understanding various financial markets and phenomena from a mathematical viewpoint. But as the years went by and the company grew, I had to spend more time on general management, and I could feel myself getting stupider with each passing year. I didn't like that, so I started solving little theoretical problems at night just for fun—things I could tackle on my own, since I no longer had a research group like I did when I was on the faculty at Columbia. As time went by, I realized that I was enjoying that more and more, and that I missed doing research on a full-time basis.

I had a friend, Rich Friesner, who was a chemistry professor at Columbia, and he was working on problems like protein folding and protein dynamics, among other things. Rich is a computational chemist, so a lot of what he did involved algorithms, but his training was in chemistry rather than computer science, so when we got together socially, we often talked about some of the intersections between our fields. He'd say, "You know, we have this problem: the inner loop in this code does such-and-such, and we can't do the studies we want to do because it's too slow. Do you have any ideas?"

Although I didn't understand much at that point about the chemistry and biology involved, I'd sometimes take the problem home, work on it a little bit, and try to come up with a solution that would speed up his code. In some cases, the problem would turn out to be something that any computer scientist with a decent background in algorithms would have been able to solve. After thinking about it for a little while, you'd say, "Oh, that's really just a special case of this known problem, with the following wrinkle."

One time I managed to speed up the whole computation by about a factor of 100, which was very satisfying to me. It didn't require any brilliant insight; it was just a matter of bringing a bit of computer science into a research area where there hadn't yet been all that much of it.

At a certain point, when I was approaching my 50th birthday, I felt like it was a natural time to think about what I wanted to do over the coming years. Since my graduate work at Stanford and my research at Columbia were focused in part on parallel architectures and algorithms, one of the things I spent some time thinking about was whether there might be some way to apply those sorts of technologies to one of the areas Rich had been teaching me about. After a fair amount of reading and talking to people, I found one application—the simulation of molecular dynamics—where it seemed like a massive increase in speed could, in principle, have a major impact on our understanding of biological processes at the molecular level.

It wasn't immediately clear to me whether it would actually be possible to get that sort of speedup, but it smelled like the sort of problem where there might be some nonobvious way to make it happen. The time seemed ripe from a technological viewpoint, and I just couldn't resist the impulse to see if it could be done. At that point, I started working seriously on the problem, and I found that I loved being involved again in hands-on research. That was eight years ago, and I still feel the same way.

HANRAHAN: In terms of your goals at D. E. Shaw Research, are they particularly oriented toward computational chemistry or is there a broader mission?

SHAW: The problems I'm most interested in have a biochemical or biophysical focus. There are lots of other aspects of computational chemistry that are interesting and important—nanostructures and materials science and things like that—but the applications that really drive me are biological, especially things that might lead not only to fundamental insights into molecular biological processes, but also to tools that someone might use at some point to develop lifesaving drugs more effectively.

Our particular focus is on the structure and function of biological molecules at an atomic level of detail, and not so much at the level of systems biology, where you try to identify and understand networks of interacting proteins, figure out how genetic variations affect an individual's susceptibility to various human diseases, and so forth. There are a lot of computer scientists working in the area that's commonly referred to as bioinformatics, but not nearly as many who work on problems involving the three-dimensional structures and structural changes that underlie the physical behavior of individual biological molecules. I think there's still a lot of juicy, low-hanging fruit in this area, and maybe even some important unifying principles that haven't yet been discovered.

HANRAHAN: When you mention drug discovery, do you see certain applications like that in the near-term or are you mostly trying to do pure research at this point?

SHAW: Although my long-term hope is that at least some of the things we discover might someday play a role in curing people, that's not something I expect to happen overnight. Important work is being done at all stages in the drug development pipeline, but our own focus is on basic scientific research with a relatively long time horizon, but a large potential payoff. To put this in perspective, many of the medications we use today were discovered more or less by accident, or through a brute-force process that's not based on a detailed understanding of what's going on at the molecular level. In many areas, these approaches seem to be running out of steam, which is leading researchers to focus more on targeting drugs toward specific proteins and other biological macromolecules based on an atomic-level understanding of the structure and behavior of those targets.

The techniques and technologies we've been working on are providing new tools for understanding the biology and chemistry of pharmaceutically relevant molecular systems. Although developing a new drug can take as long as 15 years, our scientific progress is occurring over a much shorter timescale, and we're already discovering things that we hope might someday be useful in the process of drug design. But I also enjoy being involved in the unraveling of biological mysteries, some of which have puzzled researchers for 40 or 50 years.

HANRAHAN: This machine you've built, Anton, is now operational. Can you tell us a little bit about that machine and the key ideas behind it?


SHAW: Anton was designed to run molecular dynamics (MD) simulations a couple orders of magnitude faster than the world's fastest supercomputers. MD simulations are conceptually pretty simple. In biological applications like the ones that interest us, the idea is to simulate the motions of large biomolecules, such as proteins or DNA, at an atomic level of detail over a time period during which biologically interesting things happen in real life. The trajectories followed by these molecules are typically calculated by numerically integrating Newton's laws of motion, using a model based on classical mechanics that approximates the effects of the various types of forces that act on the atoms. During each "time step" in the integration process, the atoms move a very short distance, after which the forces are recomputed, the atoms are moved again, and so on.
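As a rough illustration of the time-stepping loop Shaw describes, here is a toy velocity-Verlet integrator in Python, with a simple Lennard-Jones pair potential standing in for a real force field (a minimal sketch for intuition, not Anton's production algorithm):

import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise forces from a classical pair potential (O(N^2) toy version)."""
    forces = np.zeros_like(pos)
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = pos[i] - pos[j]
            d2 = r @ r
            # Lennard-Jones force on atom i from atom j.
            f = 24 * eps * (2 * (sigma**2 / d2)**6 - (sigma**2 / d2)**3) / d2 * r
            forces[i] += f
            forces[j] -= f
    return forces

def time_step(pos, vel, mass, dt):
    """Move atoms a short distance, recompute forces, and repeat."""
    vel += 0.5 * dt * lj_forces(pos) / mass
    pos += dt * vel
    vel += 0.5 * dt * lj_forces(pos) / mass
    return pos, vel

pos, vel = np.random.rand(8, 3), np.zeros((8, 3))
pos, vel = time_step(pos, vel, mass=1.0, dt=1e-3)  # one "time step"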
HANRAHAN: And you’re doing this on a technique for parallelizing the range-
a femtosecond (10^-15 seconds) time limited version of the classic N-body
scale, correct? problem of physics, but the key thing
SHAW: That’s right. Each of those for our purposes is that it’s highly ef-
time steps can only cover something on ficient within the range of parameter
the order of a couple femtoseconds of values we typically encounter in simu-
biological time, which means it takes a lating biological systems.
really long time to observe anything of HANRAHAN: Is this the midpoint algo-
biological interest. rithm or the NT algorithm?
HANRAHAN: If you were to do that sim- SHAW: The one I’m referring to is
ulation with a conventional computer, the NT algorithm, which is what runs
such as a normal core or dual or quad on Anton. The midpoint algorithm is
core, how much biological time could used in Desmond, an MD code that
you simulate in a day? was designed to run on ordinary com-
SHAW: Depending on the number modity clusters. They’re both exam-
of cores and other characteristics of ples of what I’ve referred to as “neutral
the processor chip, you’d expect per- territory” methods. The NT method
formance on the order of a few nano- has better asymptotic properties, but
seconds per day, as measured using a the constants tend to favor the mid-
standard molecular system called the point method on a typical cluster and
Joint AMBER-CHARMM Benchmark the NT method at a higher degree of
System, which represents a particular parallelism.
protein surrounded by water. HANRAHAN: When I first saw your NT
HANRAHAN: And what’s the compa- algorithm I was stunned by it, just be-
rable figure for Anton? cause people have been thinking about
SHAW: Our latest benchmark measurement was 16,400 nanoseconds per day on a 512-node Anton configuration, so Anton would run three or four orders of magnitude faster, and roughly two orders of magnitude faster than the fastest that can be achieved under practical conditions on supercomputers or massively parallel clusters.
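A quick back-of-the-envelope check of what that figure implies, assuming roughly 2 fs per time step as Shaw suggests above (our arithmetic, not an official Anton specification):

ns_per_day = 16_400
fs_per_step = 2                                    # "a couple femtoseconds"
steps_per_day = ns_per_day * 1e6 / fs_per_step     # 1 ns = 1e6 fs
print(f"{steps_per_day:.1e} time steps/day")       # ~8.2e9
print(f"{steps_per_day / 86_400:,.0f} steps/s")    # ~95,000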
HANRAHAN: When I read the Anton paper in Communications, [b] it reminded me a lot of what I worked on, which were graphics chips—just in the way you're choreographing communication and keeping everything busy and all these really important tricks. Can you summarize some of the key innovations in Anton that make it so fast?

[b] "Anton, A Special-Purpose Machine for Molecular Dynamics Simulation" by D.E. Shaw et al. was published in the July 2008 issue of Communications of the ACM, 91–97.

SHAW: Each node of the Anton machine is based on a specialized ASIC that contains, among other things, 32 very long, arithmetically dense, nonprogrammable pipelines designed specifically to calculate certain forces between pairs of interacting particles. Those pipelines are the main source of Anton's speed. In addition, Anton uses an algorithm that I developed in conjunction with the machine's top-level architecture to reduce the amount of data that would have to pass from one Anton ASIC to another in the course of exchanging information about the current positions of the atoms and the forces that act on them. In essence, it's a technique for parallelizing the range-limited version of the classic N-body problem of physics, but the key thing for our purposes is that it's highly efficient within the range of parameter values we typically encounter in simulating biological systems.

HANRAHAN: Is this the midpoint algorithm or the NT algorithm?

SHAW: The one I'm referring to is the NT algorithm, which is what runs on Anton. The midpoint algorithm is used in Desmond, an MD code that was designed to run on ordinary commodity clusters. They're both examples of what I've referred to as "neutral territory" methods. The NT method has better asymptotic properties, but the constants tend to favor the midpoint method on a typical cluster and the NT method at a higher degree of parallelism.
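To make the "neutral territory" idea concrete, here is a toy 2D sketch (our illustration; Shaw's actual NT algorithm is 3D and range-limited). Space is cut into a p x p grid of boxes; node (i, j) owns box (i, j) and imports the boxes in row i and column j, so a pair of particles can be evaluated at a node that is home to neither of them:

p = 4  # a p x p grid of nodes/boxes

def home_box(xy, box_size=1.0):
    x, y = xy
    return int(x // box_size) % p, int(y // box_size) % p

def neutral_node(box_a, box_b):
    # Node (i_a, j_b) has imported row i_a (containing box_a) and
    # column j_b (containing box_b), so it sees both particles.
    # A real scheme also needs a symmetric tie-break so each pair is
    # evaluated exactly once (here or at (i_b, j_a)).
    return (box_a[0], box_b[1])

a, b = (0.3, 3.7), (2.1, 1.4)
print(home_box(a), home_box(b), neutral_node(home_box(a), home_box(b)))
# -> (0, 3) (2, 1) (0, 1): the pair "meets" away from both home boxes.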
HANRAHAN: When I first saw your NT algorithm I was stunned by it, just because people have been thinking about this problem for so long. Force calculations between particles are such an important part of computational science, and to have made the observation that the best way to compute the interactions of two particles involves sending them both somewhere else is just amazing. I understand your proof but it still boggles me because it seems so counterintuitive.

SHAW: It is kind of weird, but if you look back through the literature, you can find various pieces of the puzzle in a number of different contexts. Although I wasn't aware of this at the time, it turns out that the idea of "meeting on neutral territory" can be found in different forms in publications dating back as far as the early 1990s, although these early approaches didn't offer an asymptotic advantage over traditional spatial decomposition methods. Later, Marc Snir independently came up with an algorithm that achieved the same asymptotic properties as mine in a different way, and he described his method in a very nice paper that appeared shortly before my own. His paper included a proof that this is the best you can do from an asymptotic viewpoint. Although the constant factors were such that his method wouldn't have performed well in the sorts of applications we're interested in, it's clear with the benefit of hindsight that the straightforward addition of certain features from the NT method would have made his algorithm work nearly as well as NT itself for that class of applications.

But the important thing from my perspective was not that the NT algorithm had certain asymptotic advantages, but that with the right kind of machine architecture it removed a key bottleneck that would otherwise have kept me from exploiting the enormous amount of application-specific arithmetic horsepower I'd wanted to place on each chip.

HANRAHAN: I think it's a great example of computer science thinking because it's a combination of a new algorithm, which is a new way of organizing the computation, as well as new hardware. Some people think all advances in computing are due to software or hardware, but I think some of the most interesting ones are where those things coevolve in some sense.

SHAW: I agree completely. The history of special-purpose machines has been marked by more failures than successes, but I suspect that the creative codesign of new architectural and algorithmic approaches could have increased, at least to some extent, the number of applications in which a sufficient speedup was achieved that outweighed the very real economies of scale associated with the use of general-purpose commodity hardware. In our case, I don't think we could have reached our performance goals either by implementing standard computational chemistry algorithms on new hardware or by using standard high-performance architectures to run new algorithms.

HANRAHAN: I think a lot of people in computer science were very surprised by your Anton paper because they might have thought that normal computing paradigms are efficient and that there's not a lot to be gained. You read these papers where people get 5% improvements, and then you sort of blow them out of the water with this 100-fold improvement. Do you think this is just a freak thing or are there chances that we could come up with other revolutionary new ways of solving problems?

SHAW: I've been asked that before, but I'm really not sure what the answer is. I'd be surprised if this turned out to be the only computationally demanding application that could be addressed using the general approach we've taken to the design of a special-purpose machine. On the other hand, there are a lot of scientific problems for which our approach would clearly not be effective. For one thing, some problems are subject to unavoidable communication bottlenecks that would dominate any speedup that might be achieved using an application-specific chip. And in some other applications, flexibility may be important enough that a cluster-based solution, or maybe one based on the use of GPUs, would be a better choice.

One of the things I found attractive about our application from the start was that although it was nowhere close to "embarrassingly parallel," it had several characteristics that seemed to beg for the use of special-purpose hardware.


First, the inner loop that accounted for a substantial majority of the time required for a biologically oriented molecular dynamics simulation was highly regular, and could be mapped onto silicon in an extremely area- and power-efficient way. It also turned out that these inner-loop calculations could be structured in such a way that the same data was used a number of times, and with well-localized data transfers that minimized the need to send data across the chip, much less to and from off-chip memory.

There were some parts of our application where any function-specific hardware would have been grossly underutilized or unable to take advantage of certain types of new algorithms or biophysical models that might later be discovered. Fortunately, most of those calculations aren't executed all that frequently in a typical biomolecular simulation. That made it feasible to incorporate a set of programmable on-chip processors that could be used for various calculations that fell outside the inner loop. We were also able to make use of problem-specific knowledge to provide hardware support for a specific type of inter-chip communication that was especially important in our application.

Since my own interest is in the application of molecular dynamics simulations to biological problems, I haven't been forced to think very hard about what aspects of the approach we've followed might be applicable to the codesign of special-purpose machines and algorithms for other applications. If I had to guess, I'd say that at least some aspects of our general approach might wind up being relevant to researchers who are looking for an insane increase in speed for certain other applications, but that hunch isn't based on anything very solid.

[Photo: Anton, a massively parallel machine designed to perform molecular dynamics simulations.]

HANRAHAN: In the Communications paper, Anton was described as a computational microscope. I really liked that phrase and that the name Anton came from van Leeuwenhoek, who was one of the first microscopists.

SHAW: Part of the reason I like the metaphor of a computational microscope is that it emphasizes one of the key things that Anton isn't. Although we sometimes describe the machine as a "special-purpose supercomputer," its range of applicability is in practice so narrow that thinking of Anton as a computer is a bit like thinking of a microscope as a general-purpose laboratory instrument. Like the optical microscope, Anton is really no more than a specialized tool for looking at a particular class of objects that couldn't be seen before.

HANRAHAN: So now that you have this microscope, what do you want to point it at? I know you must be collaborating, and you have computational chemists and biologists at D. E. Shaw Research. Do you have some problems that you want to go after with it?

SHAW: There's a wide range of specific biological phenomena we'd like to know more about, but at this point, there's a lot we can learn by simply putting things under the microscope and seeing what's there.


When Anton van Leeuwenhoek first started examining pond water and various bodily fluids, there's no way he could have predicted that he'd see what we now know to be bacteria, protozoa, and blood and sperm cells, none of which had ever been seen before. Although we have no illusions about our machine having an impact comparable to the optical microscope, the fact is that nobody has ever seen a protein move over a period of time even remotely close to what we're seeing now, so in some ways, we're just as clueless as van Leeuwenhoek was when he started looking.

All that being said, there are some biological systems and processes that we've been interested in for a while, and we're beginning to learn more about some of them now that we're able to run extremely long simulations. The one that's probably most well known is the process of protein folding, which is when a string of amino acids folds up into a three-dimensional protein. We've already started to learn some interesting things related to folding that we wouldn't have known if it hadn't been for Anton, and we're hoping to learn more over time. We've also conducted studies of a class of molecules called kinases, which play a central role in the development and treatment of cancer. And we're looking at several proteins that transfer either ions or signals through membranes in the cell. We're also using Anton to develop new algorithms and methods, and to test and improve the quality of the physical models that are used in the simulation process itself.

HANRAHAN: It seems like you're almost at a tipping point. I worked at Pixar, and one of the singular events was when computer graphics were used in Jurassic Park. Even bigger was Toy Story. Once the graphics software reached a certain maturity, and once you showed that it could be used to make a blockbuster, then it wasn't that long afterward that almost every movie included computer-generated effects. How close are we to that tipping point in structural biology? If you're able to solve some important problems in structural biology, then people might begin considering molecular dynamics simulations as part of standard practice, and then routinely deploy this approach to solving biological problems.

SHAW: That's a great analogy. Although the evidence is still what I'd characterize as preliminary, I think there's enough of it at this point to predict that MD simulations will probably be playing a much larger role within the field of structural biology a few years from now. It's hard to tell in advance whether there's a Toy Story around the corner, since the biggest breakthroughs in biology often turn out to be ones that would have been difficult to even characterize before they happened. It may be that there are some deep principles out there that are waiting to be discovered in silico—ones that can tell us something fundamental about some of the mechanisms nature has evolved to do what it wants to get done. It's hard to plan for a discovery like that; you just have to be in the right territory with the tools and skills you think you might need in order to recognize it when you see it.

HANRAHAN: There's no place that's more fun to be when you have a new microscope. van Leeuwenhoek must have had a good time. I've read a bunch about Robert Hooke, too, who was a contemporary of van Leeuwenhoek. He was part of the Royal Society. Every week or two, they would get together and make all these discoveries because they were looking at the world in a different way.

SHAW: I've always thought it would be great to live during a period when a lot of fundamental discoveries were being made, and a single scientist could stay on top of most of the important advances being made across the full breadth of a given discipline. It's a bit sad that we can't do that anymore—victims of our success.

HANRAHAN: But one thing that amazes me about you is the number of fields you've had an impact on. I'm trying to figure out your secret sauce. You seem to be able to bring computation to bear in problem areas that other people haven't been as successful in. Obviously you've had a huge impact on the financial industry with the work you did on modeling stocks and portfolios. And now you're doing biochemistry. How do you approach these new areas? How do you bring computers to bear on these new problems?

SHAW: I'm not sure I deserve those very kind words, but for what it's worth, I tend to generate a plentiful supply of ideas, the vast majority of which turn out to be bad ones. In some cases, they involve transplanting computational techniques from one application to another, and there's usually a good reason why the destination field isn't already using that technique. I also have a remarkable capacity to delude myself into thinking that each idea has a higher probability of working than it really does, which provides me with the motivation I need to keep working on it. And, every once in a while, I stumble on an idea that actually works.

HANRAHAN: It sounds like you have this "gene" for computing. You know algorithms, you know architecture, but yet you still are fascinated with applying them to new problems. That's what's often missing in our field. People learn the techniques of the field but they don't know how to apply them in a new problem domain.

SHAW: I love learning about new fields, but in some ways I feel like a tourist whose citizenship is computer science. I think to myself, "I'm doing computational finance, but I am a computer scientist. I'm doing computational biology, but I am a computer scientist." When we computer scientists start infiltrating a new discipline, what we bring to the table is often more than just a bag of tricks. What's sometimes referred to as "computational thinking" is leaving its mark on one field after another—and the night is still young.

Related articles on queue.acm.org

A Conversation with Kurt Akeley and Pat Hanrahan
http://queue.acm.org/detail.cfm?id=1365496

Beyond Beowulf Clusters
Philip Papadopoulos, Greg Bruno, Mason Katz
http://queue.acm.org/detail.cfm?id=1242501

Databases of Discovery
James Ostell
http://queue.acm.org/detail.cfm?id=1059806

© 2009 ACM 0001-0782/09/1000 $10.00


contributed articles

DOI:10.1145/1562764.1562783

Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.

BY KRSTE ASANOVIC, RASTISLAV BODIK, JAMES DEMMEL, TONY KEAVENY, KURT KEUTZER, JOHN KUBIATOWICZ, NELSON MORGAN, DAVID PATTERSON, KOUSHIK SEN, JOHN WAWRZYNEK, DAVID WESSEL, AND KATHERINE YELICK

A View of the Parallel Computing Landscape

INDUSTRY NEEDS HELP from the research community to succeed in its recent dramatic shift to parallel computing. Failure could jeopardize both the IT industry and the portions of the economy that depend on rapidly improving information technology. Here, we review the issues and, as an example, describe an integrated approach we're developing at the Parallel Computing Laboratory, or Par Lab, to tackle the parallel challenge.

Over the past 60 years, the IT industry has improved the cost-performance of sequential computing by about 100 billion times overall.20 For most of the past 20 years, architects have used the rapidly increasing transistor speed and budget made possible by silicon technology advances to double performance every 18 months. The implicit hardware/software contract was that increased transistor count and power dissipation were OK as long as architects maintained the existing sequential programming model. This contract led to innovations that were inefficient in terms of transistors and power (such as multiple instruction issue, deep pipelines, out-of-order execution, speculative execution, and prefetching) but that increased performance while preserving the sequential programming model.

The contract worked fine until we hit the power limit a chip is able to dissipate. Figure 1 reflects this abrupt change, plotting the projected microprocessor clock rates of the International Technology Roadmap for Semiconductors in 2005 and then again just two years later.16 The 2005 prediction was that clock rates should have exceeded 10GHz in 2008, topping 15GHz in 2010. Note that Intel products are today far below even the conservative 2007 prediction.

After crashing into the power wall, architects were forced to find a new paradigm to sustain ever-increasing performance. The industry decided the only viable option was to replace the single power-inefficient processor with many efficient processors on the same chip. The whole microprocessor industry thus declared that its future was in parallel computing, with the number of processors, or cores, increasing with each technology generation, every two years. This style of chip was labeled a multicore microprocessor. Hence, the leap to multicore is not based on a breakthrough in programming or architecture and is actually a retreat from the more difficult task of building power-efficient, high-clock-rate, single-core chips.5

Many startups have sold parallel computers over the years, but all failed, as programmers accustomed to continuous improvement in sequential performance saw little need to explore parallelism. Convex, Encore, Floating Point Systems, Inmos, Kendall Square
Research, MasPar, nCUBE, Sequent, Silicon Graphics, and Thinking Machines are just the best-known members of the Dead Parallel Computer Society. Given this sad history, multicore pessimism abounds. Quoting computing pioneer John Hennessy, President of Stanford University:

"…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. …I would be panicked if I were in industry."19

Jeopardy for the IT industry means opportunity for the research community. If researchers meet the parallel challenge, the future of IT is rosy. If they don't, it's not. Hence, there are few restrictions on potential solutions. Given an excuse to reinvent the whole software/hardware stack, this opportunity is also a once-in-a-career chance to fix other weaknesses in computing that have accumulated over the decades like barnacles on the hull of an old ship.

Here, we lay out one view of the opportunities, then, as an example, describe in more depth the approach of the Berkeley Parallel Computing Lab, or Par Lab, updating two long technical reports4,5 that include more detail. Our goal is to recruit more parallel revolutionaries.

Parallel Bridge
The bridge in Figure 2 represents an analogy connecting computer users on the right to the IT industry on the left. The left tower is hardware, the right tower is applications, and the long span in between is software. We use the bridge analogy throughout this article.

[Figure 2. Bridge analogy connecting users to a parallel IT industry, inspired by the view of the Golden Gate Bridge from Berkeley, CA. Labels: IT Industry, Hardware, Software, Applications, Users.]

The aggressive goal of the parallel revolution is to make it as easy to write programs that are as efficient, portable, and correct (and that scale as the number of cores per microprocessor increases biennially) as it has been to write programs for sequential computers. Moreover, we can fail overall if we fail to deliver even one of these "parallel virtues." For example, if parallel programming is unproductive, this weakness will delay and reduce the number of programs that are able to exploit new multicore architectures.

Hardware tower. The power wall forces the change in the traditional programming model, but the question for parallel researchers is what kind of computing architecture should take its place. There is a technology sweet spot around a pipelined processor of five-to-eight stages that is most efficient in terms of performance per joule and silicon area.5 Using simple cores means there is room for hundreds of them on the same chip. Moreover, having many such simple cores on a chip simplifies hardware design and verification, since each core is simple, and replication of cores is nearly trivial. Just as it's easy to add spares to mask manufacturing defects, "manycore" computers can also have higher yield.

One example of a manycore computer is from the world of network processors, which has seen a great deal of innovation recently due to the growth of the networking market. The best-designed network processor is arguably the Cisco Silicon Packet Processor, also known as Metro, which has 188 five-stage RISC cores, plus four spares to help yield, and dissipates just 35 watts.

It may be reasonable to assume that manycore computers will be homogeneous, like the Metro, but there is an argument for heterogeneous manycores as well. For example, suppose 10% of the time a program gets no speedup on a 100-core computer. To run this sequential piece twice as fast, assume a single fat core would need 10 times as many resources as a thin core due to larger caches, a vector unit, and other features. Applying Amdahl's Law, here are the speedups (relative to one thin core) of 100 thin cores and 90 thin cores for the parallel code plus one fat core for the sequential code:

Speedup100 = 1 / (0.1 + 0.9/100) = 9.2 times faster
Speedup91 = 1 / (0.1/2 + 0.9/90) = 16.7 times faster

In this example of manycore processor speedup, a fat core needing 10 times as many resources would be more effective than the 10 thin cores it replaces.5,15
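These two figures follow directly from Amdahl's Law; a minimal check in Python (our illustration, using the article's numbers):

def amdahl_speedup(seq_fraction, seq_speedup, parallel_cores):
    """Speedup relative to one thin core."""
    return 1.0 / (seq_fraction / seq_speedup
                  + (1.0 - seq_fraction) / parallel_cores)

print(amdahl_speedup(0.1, 1, 100))  # 100 thin cores       -> ~9.2
print(amdahl_speedup(0.1, 2, 90))   # 90 thin + 1 fat core -> ~16.7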
One notable challenge for the hardware tower is that it takes four to five years to design and build chips and port software to evaluate them. Given this lengthy cycle, how could researchers innovate more quickly?

[Figure 1. Microprocessor clock rates of Intel products vs. projections from the International Technology Roadmap for Semiconductors in 2005 and 2007.16 Line chart: clock rate (GHz), 0 to 25, vs. year, 2001 to 2013; series: 2005 Roadmap, 2007 Roadmap, Intel single core, Intel multicore.]

Software span. Software is the main problem in bridging the gap between users and the parallel IT industry. Hence, the long distance of the span in Figure 2 reflects the daunting magnitude of the software challenge.

One especially vexing challenge for the parallel software span is that sequential programming accommodates the wide range of skills of today's programmers. Our experience teaching parallelism suggests that not every programmer is able to understand the nitty gritty of concurrent software and parallel hardware; difficult steps include locks, barriers, deadlocks, load balancing, scheduling, and memory consistency. How can researchers develop technology so all programmers benefit from the parallel revolution?

A second challenge is that two critical pieces of system software—compilers and operating systems—have grown large and unwieldy and hence


resistant to change. One estimate is that it takes a decade for a new compiler optimization to become part of production compilers. How can researchers innovate rapidly if compilers and operating systems evolve so glacially?

A final challenge is how to measure improvement in parallel programming languages. The history of these languages largely reflects researchers deciding what they think would be better and then building it for others to try. As humans write programs, we wonder whether human psychology and human-subject experiments shouldn't be allowed to play a larger role in this revolution.17

Applications tower. The goal of research into parallel computing should be to find compelling applications that thirst for more computing than is currently available and absorb the biennially increasing number of cores for the next decade or two. Success does not require improvement in the performance of all legacy software. Rather, we need to create compelling applications that effectively utilize the growing number of cores while providing software environments that ensure that legacy code still works with acceptable performance.

Note that the notion of "better" is not defined by only average performance; advances could be in, say, worst-case response time, battery life, reliability, or security. To save the IT industry, researchers must demonstrate greater end-user value from an increasing number of cores.

Par Lab
As a concrete example of the parallel landscape, we describe Berkeley's Par Lab project, [a] exploring one of many potential approaches, though we won't know for years which of our ideas will bear fruit. We hope it inspires more researchers to participate, increasing the chance of finding a solution before it's too late for the IT industry.

Given a five-year project, we project the state of the field in five to 10 years, anticipating that IT will be driven to extremes in size due to the increasing popularity of software as a service, or SaaS:

The datacenter is the server. Amazon, Google, Microsoft, and other major IT vendors are racing to construct buildings with 50,000 or more servers to run SaaS, inspiring the new catchphrase "cloud computing." [b] They have also begun renting thousands of machines by the hour to enable smaller companies to benefit from cloud computing. We expect these trends to accelerate; and

The mobile device (laptops and handhelds) is the client. In 2007, Hewlett-Packard, the largest maker of PCs, shipped more laptops than desktops. Millions of cellphones are shipped each day with ever-increasing functionality, a trend we expect to accelerate as well.

Surprisingly, these extremes in computing share many characteristics. Both concern power and energy—the datacenter due to the cost of power and cooling and the mobile client due to battery life. Both concern cost—the datacenter because server cost is replicated 50,000 times and mobile clients because of a lower unit-price target. Finally, the software stacks are becoming similar, with more layers for mobile clients and increasing concern about protection and security.

[a] In March 2007, Intel and Microsoft invited 25 universities to propose five-year centers for parallel computing research; the Berkeley and Illinois efforts were ranked first and second.
[b] See Armbrust, M. et al. Above the Clouds: A Berkeley View of Cloud Computing. University of California, Berkeley, Technical Report EECS-2009-28.
Many datacenter applications have ample parallelism across independent users, so the Par Lab focuses on parallelizing applications for clients. The multicore and manycore chips in the datacenter stand to benefit from the same tools and techniques developed for similar chips in mobile clients.

Given this projection, we decided to take a fresh approach: the Par Lab will be driven top-down, applications first, then software, and finally hardware.

Par Lab application tower. An unfortunate computer science tradition is we build research prototypes, then wonder why applications people don't use them. In the Par Lab, we instead selected applications up-front to drive research and provide concrete goals and metrics to evaluate progress. We selected each application based on five criteria: compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential; requiring significant speedup or a smaller, more efficient platform to work as intended; covering the possible platforms and markets likely to dominate usage; enabling technology for other applications; and involvement of a local committed expert application partner to help design, use, and evaluate our technology.

Here are the five initial applications we're developing:

Music/hearing. High-performance signal processing will permit: concert-quality sound-delivery systems for home sound systems and conference calls; composition and gesture-driven live-performance systems; and much improved hearing aids;

Speech understanding. Dramatically improved automatic speech recognition in moderately noisy and reverberant environments would greatly improve existing applications and enable new ones, like, say, a real-time meeting transcriber with rewind and search. Depending on acoustic conditions, current transcribers can generate many errors;

Content-based image retrieval. Consumer-image databases are growing so dramatically they require automated search instead of manual labeling. Low error rates require processing very high dimensional feature spaces. Current image classifiers are too slow to deliver adequate response times;

Intraoperative risk assessment for stroke patients. Advanced physiological blood-flow modeling based on computational analysis of 3D medical images of a patient's cerebral vasculature enables "virtual stress testing" to risk-stratify stroke victims intraoperatively. Patients thus identified at low risk of complications can then be treated to mitigate the effects of the stroke. This technology will ultimately lower complication rates in treating stroke victims, improve quality of life, reduce medical care expenditures, and save lives; and

Parallel browser. The browser will be the largest and most important application on many mobile devices. We will first parallelize sequential browser bottlenecks. Rather than parallelizing JavaScript programs, we are pursuing an actor language with implicit parallelism. Such a language may be accessible to Web programmers while allowing them to extract the parallelism in the browser's JIT compiler, thereby turning all Web-site developers unknowingly into parallel programmers.

Application-domain experts are first-class members of the Par Lab project. Rather than try to answer design questions abstractly, we ask our experts what they prefer in each case. Project success is judged by the user experience with the collective applications on our hardware-software prototypes. If successful, we imagine building on these five applications to create other applications that are even more compelling, as in the following two examples:

Name Whisperer. Imagine that your mobile client peeking out of your shirt pocket is able to recognize the person walking toward you to shake your hand. It would search a personal image database, then whisper in your ear, "This man is John Smith. He got an A– from you in CS152 in 1993"; and

Health Coach. As your mobile client is always with you, you could take pictures and weigh your dishes (assuming it has a built-in scale) before and after each meal. It would also record how much you exercise. Given calories consumed and burned and an image of your body, it could visualize what you're likely to look like in six months at this rate and what you'd look like if you ate less or exercised more.

Par Lab software span. Software is the major effort of the project, and we're taking a different path from previous parallel projects, emphasizing software architecture, autotuning, and separate support for productivity vs. performance programming.

Architecting parallel software with design patterns, not just parallel programming languages. Our situation is similar to that found in other engineering disciplines where a new challenge emerges that requires a top-to-bottom rethinking of the entire engineering process; for example, in civil architecture, Filippo Brunelleschi's solution in 1418 for how to construct the dome for the Cathedral of Florence required innovations in tools and building techniques, as well as rethinking the whole process of developing an architecture. All computer science faces a similar challenge; parallel programming is overdue for a fundamental rethinking of the process of designing software.

Programmers have been trying to craft parallel code for decades and learned a great deal about what works and what doesn't work. Automatic parallelism doesn't work. Compilers are great at low-level scheduling decisions but can't discover new algorithms to exploit concurrency. Programmers in high-performance computing have shown that explicit technologies (such as MPI and OpenMP) can be made to work but too often require heroic effort untenable for most commercial software vendors.

To engineer high-quality parallel software, we plan to rearchitect the software through a "design pattern language." As explored in his 1977 book, civil architect Christopher Alexander wrote that "design patterns" describe time-tested solutions to recurring problems within a well-defined context.3 An example is Alexander's "family of entrances" pattern, addressing how to simplify comprehension of multiple entrances for a first-time visitor to a site. He defined a "pattern language" as a collection of related and interlocking patterns, constructed such that the patterns flow into each other as the designer solves a design problem.

Computer scientists are trained to think in well-defined formalisms. Pattern languages encourage a less formal, more associative way of thinking about a problem. A pattern language does not impose a rigid methodology; rather, it fosters creative problem


solving by providing a common vocabulary to capture the problems encountered during design and identify potential solutions from among families of proven designs.

The observation that design patterns and pattern languages might be useful for software design is not new. An example is Gamma et al.'s 1994 book Design Patterns, which outlined patterns useful for object-oriented programming.12 In building our own pattern language, we found Shaw's and Garlan's report,23 which described a variety of architectural styles useful for organizing software, to be very effective. That these architectural styles may also be viewed as design patterns was noted earlier by Buschmann in his 1996 book Pattern-Oriented Software Architecture.7 In particular, we adopted Pipe-and-Filter, Agent-and-Repository, Process Control, and Event-Based architectural styles as structural patterns within our pattern language. To this list, we add MapReduce and Iterator as structural design patterns.
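As a flavor of what one such structural pattern looks like in code, here is a minimal Pipe-and-Filter sketch in Python (our illustration, not Par Lab code): each filter transforms a stream, and because the filters share no state, a framework would be free to run them in parallel:

def tokenize(lines):             # filter 1
    for line in lines:
        yield from line.split()

def lowercase(tokens):           # filter 2
    for tok in tokens:
        yield tok.lower()

def pipeline(source, *filters):  # the "pipes": stage composition
    stream = source
    for f in filters:
        stream = f(stream)
    return stream

words = pipeline(["A View of", "the Parallel Landscape"], tokenize, lowercase)
print(list(words))  # ['a', 'view', 'of', 'the', 'parallel', 'landscape']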
These patterns define the structure of a program but do not indicate what is actually computed. To address this blind spot, another key part of our pattern language is the set of "dwarfs" of the Berkeley View reports4,5 (see Figure 3). Dwarfs are best understood as computational patterns providing the computational interior of the structural patterns discussed earlier. By analogy, the structural patterns describe a factory's physical structure and general workflow. The computational patterns describe the factory's machinery, flow of resources, and work products. Structural and computational patterns can be combined to architect arbitrarily complex parallel software systems.

Convention holds that truly useful patterns are not invented but mined from successful software applications. To arrive at our list of useful computational patterns we began with the "seven dwarfs of high-performance computing" compiled by Phillip Colella of Lawrence Berkeley National Laboratory. Then, in 2006 and 2007 we worked with domain experts to broadly survey other application areas, including embedded systems, general-purpose computing (SPEC benchmarks), databases, games, artificial intelligence/machine learning, computer-aided design of integrated circuits, and high-performance computing. We then focused in depth on the patterns in the applications we described earlier. Figure 3 shows the results of our pattern mining.

Computational and structural patterns can be hierarchically composed to define an application's high-level software architecture, but a complete pattern language for application design must at least span the full range, from high-level architecture to detailed software implementation and tuning. Mattson et al.'s 2004 book Patterns for Parallel Programming18 was the first such attempt to systematize parallel programming using a complete pattern language. We combine the structural and computational patterns mentioned earlier in our pattern language to literally sit on top of the algorithmic structures and implementation structures in the pattern language in Mattson's book. The resulting pattern language is still under development but is already employed by the Par Lab to develop the software architectures and parallel implementations of such diverse applications as content-based image retrieval, large-vocabulary continuous speech recognition, and timing analysis for integrated circuit design.

Patterns are conceptual tools that help a programmer reason about a software project and develop an architecture but are not themselves implementation mechanisms for producing code.

Split productivity and efficiency layers, not just a single general-purpose layer. A key Par Lab research objective is to enable programmers to easily write programs that run as efficiently on manycore systems as on sequential ones. Productivity, efficiency, and correctness are inextricably linked and must be addressed together. These objectives cannot be accomplished with a single-point solution (such as a universal language). In our approach, productivity is addressed in a productivity layer that uses a common composition and coordination language to glue together the libraries and programming frameworks produced by the efficiency-layer programmer. Efficiency is principally handled through an efficiency layer that is targeted for use by expert parallel programmers.

The key to generating a successful


multicore software developer community is to maximally leverage the efforts of parallel programming experts by encapsulating their software for use by the programming masses. We use the term "programming framework" to mean a software environment that supports implementation of the solution proposed by the associated design pattern. The difference between a programming framework and a general programming model or language is that in a programming framework the customization is performed only at specified points that are harmonious with the style embodied in the original design pattern. An example of a successful sequential programming framework is the Ruby on Rails framework, which is based on the Model-View-Controller pattern.26 Users have ample opportunity to customize the framework but only in harmony with the core Model-View-Controller pattern.

Frameworks include libraries, code generators, and runtime systems that assist programmers with implementation by abstracting difficult portions of the computation and incorporating them into the framework itself. Historically successful parallel frameworks encode the collective experience of the programming community's solutions to recurring problems. Basing frameworks on pervasive design patterns will help make parallel frameworks broadly applicable.

Productivity-layer programmers will compose libraries and programming frameworks into applications with the help of a composition and coordination language.13 The language will be implicitly parallel; that is, its composition will have serial semantics, meaning the composed programs will be safe (such as race-free) and virtualized with respect to processor resources. It will document and check interface restrictions to avoid concurrency bugs resulting from incorrect composition, as in, say, instantiating a framework with a stateful function when a stateless one is required. Finally, it will support definition of domain-specific abstractions for constructing frameworks for specific applications, offering a programming experience similar to MATLAB and SQL.
generators, and runtime systems that Parallel programs in the efficiency our own emulated manycore design.
assist programmers with implementa- layer are written very close to the ma- To engineer parallel software, pro-
tion by abstracting difficult portions chine, with the goal of allowing the best grammers must be able to start with
of the computation and incorporating possible algorithm to be written in the effective software architectures, and
them into the framework itself. Histor- primitives of the layer. Unfortunately, the software engineer would describe
ically successful parallel frameworks existing multicore systems do not of- the solution to a problem in terms of a
encode the collective experience of the fer a common low-level programming design pattern language. Based on this
programming community’s solutions model for parallel code. We are thus language, the Par Lab is creating a fam-
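To make the notion of fixed customization points concrete, here is a toy sketch (our illustration, not Par Lab code) of a framework in the spirit of the MapReduce structural pattern named earlier: users customize only the two designated functions, while scheduling stays encapsulated inside the framework.

    from concurrent.futures import ThreadPoolExecutor

    class MapReduceFramework:
        """Toy framework: customization happens only at two designated
        points (the map and reduce functions); parallel scheduling is
        handled inside the framework itself."""
        def __init__(self, mapper, reducer):
            self.mapper = mapper      # customization point 1
            self.reducer = reducer    # customization point 2

        def run(self, items, workers=4):
            with ThreadPoolExecutor(max_workers=workers) as pool:
                mapped = list(pool.map(self.mapper, items))
            result = mapped[0]
            for m in mapped[1:]:
                result = self.reducer(result, m)
            return result

    # The user supplies only the two functions; parallelism is encapsulated.
    total = MapReduceFramework(lambda x: x * x, lambda a, b: a + b).run(range(10))
    print(total)  # 285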

Figure 3. The color of a cell (for 12 computational patterns in seven general application areas and five Par Lab applications) indicates the presence of that computational pattern in that application: red/high; orange/moderate; green/low; blue/rare. [Chart: columns cover the application areas Embed, SPEC, DB, Games, CAD, HPC, and ML, plus the Par Lab applications Health, Image, Speech, Music, and Browser; rows list the 12 computational patterns: 1. Finite State Machine; 2. Circuits; 3. Graph Algorithms; 4. Structured Grid; 5. Dense Matrix; 6. Sparse Matrix; 7. Spectral (FFT); 8. Dynamic Programming; 9. Particle Methods; 10. Backtrack/Branch-and-Bound; 11. Graphical Models; 12. Unstructured Grid.]


The general-purpose programmer will work largely with the frameworks and stay within what we call the productivity layer. Specialist programmers trained in the details of parallel programming technology will work within the efficiency layer to implement the frameworks and map them onto specific hardware platforms. This approach will help general-purpose programmers create parallel software without having to master the low-level details of parallel programming.

Generating code with search-based autotuners, not compilers. Compilers that automatically parallelize sequential code may have great commercial value as computers go from one to two to four cores, though, as described earlier, history suggests they will be unable to scale from 32 to 64 to 128 cores. Compiling will be even more difficult, as the switch to multicore means microprocessors are becoming more diverse, since conventional wisdom is not yet established for multicore architectures. For example, the table here shows the diversity in designs of x86 and SPARC multicore computers. In addition, as the number of cores increases, manufacturers will likely offer products with differing numbers of cores per chip to cover multiple price-performance points. They will also allow each core to vary its clock frequency to save power. Such diversity will make the goals of efficiency, scaling, and portability even more difficult for conventional compilers, at a time when innovation is desperately needed.

In recent years, autotuners have become popular for producing high-quality, portable scientific code for serial microprocessors,10 optimizing a set of library kernels by generating many variants of a kernel and measuring each variant by running it on the target platform. The search process effectively tries many or all optimization switches; hence, searching may take hours to complete on the target platform. However, search is performed only once, when the library is installed. The resulting code is often several times faster than naive implementations. A single autotuner can be used to generate high-quality code for a variety of machines. In many cases, the autotuned code is faster than vendor libraries that were specifically hand-tuned for the target machine. This surprising result is partly explained by the way the autotuner tirelessly tries many unusual variants of a particular routine. Unlike libraries, autotuners also allow tuning to the particular problem size. Autotuners also preserve clarity and support portability by reducing the temptation to mangle the source code to improve performance for a particular computer.

Autotuning also helps with production of parallel code. However, parallel architectures introduce many new optimization parameters; so far, there are few successful autotuners for parallel codes. For any given problem, there may be several parallel algorithms, each with alternative parallel data layouts. The optimal choice may depend not only on the processor architecture but also on the parallelism of the computer and its memory bandwidth. Consequently, in a parallel setting, the search space will be much larger than for traditional serial hardware.

The table lists the results of autotuning on three multicores for three kernels related to the sparse matrix, stencil for PDEs, and structured grid dwarfs9,30,31 mentioned earlier. This autotuned code is the fastest known for these kernels on all three computers. Performance increased by factors of two to four over standard code, much better than you would expect from an optimizing compiler.

Efficiency-layer programmers will be able to build autotuners for use by domain experts and other efficiency-layer programmers to help deliver on the goals of efficiency, portability, and scalability.

Synthesis with sketching. One challenge for autotuning is how to produce the high-performance implementations explored by the search. One approach is to synthesize these complex programs. In doing so, we rely on the search for performance tuning, as well as for programmer productivity. To address the main challenge of traditional synthesis—the need for experts to communicate their insight with a formal domain theory—we allow that insight to be communicated directly by programmers who write an incomplete program, or "sketch." In it, they provide an algorithmic skeleton, and the synthesizer supplies the low-level mechanics by filling in the holes in the sketch. The synthesized mechanics could be barrier synchronization expressions or tricky loop bounds in stencil loops. Our sketching-based synthesis is to traditional, deductive synthesis what model checking is to theorem proving; rather than interactively deriving a program, our system searches a space of candidate programs with constraint solving. Efficiency is achieved by reducing the problem to one solved with two communicating SAT solvers. In future work, we hope to synthesize parallel sparse matrix codes and data-parallel algorithms for additional problems (such as parsing).

Verification and testing, not one or the other. Correctness is addressed differently at the two layers. The productivity layer is free from concurrency problems because the parallelism models are restricted, and the restrictions are enforced. The efficiency-layer code is checked automatically for subtle concurrency errors.

A key challenge in verification is obtaining specifications for programs to verify. Modular verification and automated unit-test generation require the specification of high-level serial semantic constraints on the behavior of individual modules (such as parallel frameworks and parallel libraries). To simplify specification, we use executable sequential programs with the same behavior as a parallel component, augmented with atomicity constraints on a task,21 predicate abstractions of the interface of a module,14 or multiple ownership types.8

Programmers often find it difficult to specify such high-level contracts involving large modules; however, most find it convenient to specify local properties of programs using assert statements and type annotations. Local assertions and type annotations are often generated from a program's implicit correctness requirements (such as data-race freedom, deadlock freedom, and memory safety). The system propagates implications of these local assertions to the module boundaries through a combination of static verification and directed automated unit testing. These implications create serial contracts that specify how the modules (such as frameworks) are used correctly. When the contracts for the parallel modules are in place, programmers use static program verification to check if the client code composed with the contracts is correct.
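As a minimal illustration of checking an implementation against an executable sequential specification (our sketch; the function names are hypothetical, and no Par Lab tool is implied):

    import random
    from concurrent.futures import ThreadPoolExecutor

    def serial_spec(f, xs):
        # Executable sequential specification: the contract for parallel_map.
        return [f(x) for x in xs]

    def parallel_map(f, xs, workers=4):
        # Implementation under test; f must be stateless for the contract to hold.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(f, xs))

    def check_contract(f, trials=100):
        # Random testing: compare the implementation against the spec.
        for _ in range(trials):
            xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
            assert parallel_map(f, xs) == serial_spec(f, xs), "contract violated"

    check_contract(lambda x: x * x)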


Autotuned performance in GFLOPS/s on three kernels for dual-socket systems.

                 Intel e5345 Xeon         AMD 2356 Opteron X4      Sun 5140 UltraSPARC T2
MPU Type         4 out-of-order cores,    4 out-of-order cores,    8 multithreaded cores,
                 2.3GHz                   2.3GHz                   1.2GHz

Kernel:          SpMV  Stencil  LBMHD     SpMV  Stencil  LBMHD     SpMV  Stencil  LBMHD
Optimization
Standard         1.0   1.3      3.5       1.4   1.5      3.0       2.1   0.5      3.4
NUMA             1.0   —        3.5       2.4   2.6      3.7       3.5   0.5      3.8
Padding          —     1.3      4.5       —     3.1      5.8       —     0.5      3.8
Vectorization    —     —        4.6       —     —        7.7       —     —        9.7
Unrolling        —     1.7      4.6       —     3.6      8.0       —     0.5      9.7
Prefetching      1.1   1.7      4.6       2.9   3.8      8.1       3.6   0.5      10.5
Compression      1.5   —        —         3.6   —        —         4.1   —        —
$/TLB block      —     2.2      —         —     4.9      —         —     5.1      —
Collab Thread    —     —        —         —     —        —         —     6.7      —
SIMD             —     2.5      5.6       —     8.0      14.1      —     —        —
Final            1.5   2.5      5.6       3.6   8.0      14.1      4.1   6.7      10.5
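The search loop at the heart of the autotuners that produced results like those in the table can be sketched as follows (a toy illustration of the general technique; the kernel and its unroll-factor knob are hypothetical):

    import timeit

    def make_variant(unroll):
        # Hypothetical code generator: emits a dot-product kernel
        # specialized to one unroll factor.
        def kernel(xs, ys):
            total = 0.0
            n = len(xs) - len(xs) % unroll
            for i in range(0, n, unroll):
                for j in range(unroll):   # unrolled inner loop
                    total += xs[i + j] * ys[i + j]
            for i in range(n, len(xs)):   # remainder loop
                total += xs[i] * ys[i]
            return total
        return kernel

    def autotune(xs, ys, unrolls=(1, 2, 4, 8, 16)):
        # Benchmark every variant once on the target machine (as library
        # autotuners do at install time); keep the fastest configuration.
        timings = {}
        for u in unrolls:
            kernel = make_variant(u)
            timings[u] = min(timeit.repeat(lambda: kernel(xs, ys),
                                           number=20, repeat=3))
        return min(timings, key=timings.get)

    xs = ys = [0.5] * 4096
    print("best unroll factor:", autotune(xs, ys))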

Static program analysis in the presence of pointers and heap memory falsely reports many errors that cannot really occur. For restricted parallelism models with global synchronization, this analysis becomes more tractable, and a recently introduced technique called "directed automated testing," or concolic unit testing, has shown promise for improving software quality through automated test generation using a combination of static and dynamic analyses.21 The Par Lab combines directed testing with model-checking algorithms to unit-test parallel frameworks and libraries composed with serial contracts. Such techniques enable programmers to quickly test executions for data races and deadlocks directly, since a combination of directed test-input generation and model checking hijacks the underlying scheduler and controls the synchronization primitives. Our testing techniques will provide deterministic replay and debugging capabilities at low cost. We will also develop randomized extensions of our directed testing techniques to build a probabilistic model of path coverage. The probabilistic models will give a more realistic estimate of coverage of race and other concurrency errors in parallel programs.

Parallelism for energy efficiency. While the earlier computer classes—desktops and laptops—reused the software of their own earlier ancestors, the energy efficiency needed for handheld operation may have to come from data parallelism in tasks that are currently executed sequentially, possibly from three sources:

Efficiency. Completing a task on slow parallel cores will be more efficient than completing it in the same time sequentially on one fast core;

Energy amortization. Preferring data-parallel algorithms over other styles of parallelism, as SIMD and vector computers amortize the energy expended on instruction delivery; and

Energy savings. Message-passing programs may be able to save the energy used by cache coherence.

We apply these principles in our work on parallel Web browsers. In algorithm design, we observe that to save energy with parallelization, parallel algorithms must be close to "work efficient"; that is, they should perform no more total work than a sequential algorithm, or else parallelization is counterproductive. The same argument applies to optimistic parallelization. Work efficiency is a demanding requirement, since, for some "inherently sequential" problems, like finite-state machines, only work-inefficient algorithms are known. In this context, we developed a nearly work-efficient algorithm for lexical analysis. We are also working on data-parallel algorithms for Web-page layout and identifying parallelism in future Web-browser applications, attempting to implement them with efficient message passing.
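A back-of-the-envelope model suggests why the first source can pay off (our illustration, using the common approximation that dynamic power scales with V²f, that voltage scales roughly with frequency, and ignoring static power and practical voltage floors):

    # Toy model: dynamic power ~ C * V^2 * f, with V scaling ~ linearly with f.
    # Compare one fast core at full frequency against four slow cores at a
    # quarter of the frequency finishing the same, perfectly parallel task
    # in the same wall-clock time.
    def dynamic_power(v, f, c=1.0):
        return c * v * v * f

    fast = dynamic_power(v=1.0, f=1.0)           # one core at full speed
    slow = 4 * dynamic_power(v=0.25, f=0.25)     # four cores at quarter speed
    print(fast, slow, slow / fast)               # -> 1.0 0.0625 0.0625

Under these idealized assumptions, the parallel version uses a small fraction of the energy, which is why the comparison only holds when the parallel algorithm stays close to work efficient.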


Space-time partitioning for deconstructed operating systems. Space-time partitioning is crucial for manycore client operating systems. A spatial partition (partition for short) is an isolated unit containing a subset of physical machine resources (such as cores, cache partitions, guaranteed fractions of memory or network bandwidth, and energy budget). Space-time partitioning virtualizes spatial partitions by time-multiplexing whole partitions onto available hardware, but at a coarse-enough granularity to allow efficient programmer-level scheduling within a partition.

The presence of space-time partitioning leads to restructuring system services as a set of interacting distributed components. We propose a new "deconstructed OS" called Tessellation structured around space-time partitioning and two-level scheduling between the operating system and application runtimes. Tessellation implements scheduling and resource management at the partition granularity. Applications and OS services (such as file systems) run within their own partitions. Partitions are lightweight and can be resized or suspended with overheads similar to a process-context swap.

A key tenet of our approach is that resources given to a partition are either exclusive (such as cores or private caches) or guaranteed via a quality-of-service contract (such as a minimum fraction of network or memory bandwidth). During a scheduling quantum, the application runtime within a partition is given unrestricted "bare metal" access to its resources and may schedule tasks onto them in some way. Within a partition, our approach has much in common with the Exokernel.11 In the common case, we expect many application runtimes to be written as libraries (similar to libOS). Our Tessellation kernel is a thin layer responsible for only the coarse-grain scheduling and assignment of resources to partitions and the implementation of secure restricted communications among partitions. The Tessellation kernel is much thinner than traditional kernels or even hypervisors. It avoids many of the performance issues associated with traditional microkernels by providing OS services through secure messaging to spatially co-resident service partitions, rather than context-switching to time-multiplexed service processes.

"To save the IT industry, researchers must demonstrate greater end-user value from an increasing number of cores."

Par Lab hardware tower. Past parallel projects were often driven by the hardware determining the application and software environment. The Par Lab is driven top down from the applications, so the question this time is: What should architects do to help with the goals of productivity, efficiency, correctness, portability, and scalability? Here are four examples of this kind of help that illustrate our approach:

Supporting OS partitioning. Our hardware architecture enforces partitioning of not only the cores and on-chip/off-chip memory but also the communication bandwidth among these components, providing quality-of-service guarantees. The resulting performance predictability improves parallel program performance, simplifies code autotuning and dynamic load balancing, supports real-time applications, and simplifies scheduling.

Optional explicit control of the memory hierarchy. Caches were invented so hardware could manage a memory hierarchy without troubling the programmer. When it takes hundreds of clock cycles to go to memory, programmers and compilers try to reverse-engineer the hardware controllers to make better use of the hierarchy. This backward situation is especially apparent with hardware prefetchers, when programmers try to create a particular pattern that will invoke good prefetching. Our approach aims to allow programmers to quickly turn a cache into an explicitly managed local store and the prefetch engines into explicitly controlled Direct Memory Access engines. To make it easy for programmers to port software to our architecture, we also support a traditional memory hierarchy. The low-overhead mechanism we use allows programs to be composed of methods that rely on local stores and methods that rely on memory hierarchies.

Accurate, complete counters of performance and energy. Sadly, performance counters on current single-core computers often miss important measurements (such as prefetched data) or are unique to a computer and understandable only by the machine's designers. We will include performance enhancements in the Par Lab architecture only if they have counters to measure them accurately and coherently. Since energy is as important as performance, we also include energy counters so software can improve both. Moreover, these counters must be integrated with the software stack to provide insightful measurements to efficiency-layer and productivity-layer programmers. Ideally, this research will lead to a standard for performance counters so schedulers and software development kits can count on them on any multicore.
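To suggest how such standardized counters could feed the software stack, consider this sketch (the counter names and values are entirely hypothetical; no real hardware interface is implied) that derives arithmetic intensity, an energy-efficiency figure, and the bound used by the Roofline model discussed next:

    # Hypothetical raw counter values, as a standardized interface might expose them.
    counters = {
        "flops": 4.0e9,        # floating-point operations retired
        "dram_bytes": 8.0e9,   # bytes moved to/from DRAM, including prefetches
        "joules": 2.5,         # energy consumed over the measured interval
    }

    def arithmetic_intensity(c):
        # Operations per byte of DRAM traffic: the x-axis of a Roofline plot.
        return c["flops"] / c["dram_bytes"]

    def flops_per_joule(c):
        # Energy-efficiency figure of merit (a FLOPS analogue of MIPS/Joule).
        return c["flops"] / c["joules"]

    def attainable_gflops(peak_gflops, peak_bw_gbs, intensity):
        # The Roofline bound: min(compute roof, bandwidth roof * intensity).
        return min(peak_gflops, peak_bw_gbs * intensity)

    ai = arithmetic_intensity(counters)
    print("arithmetic intensity:", ai, "flops/byte")
    print("bound at peak 74 GFLOPS, 21GB/s:", attainable_gflops(74, 21, ai))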


Intuitive performance model. The multicore diversity mentioned earlier exacerbates the already difficult jobs performed by programmers, compiler writers, and architects. Hence, we developed an easy-to-understand visual model with built-in performance guidelines to identify bottlenecks in the dozen dwarfs in Figure 3.29 The Roofline model plots computational and memory-bandwidth limits, then determines the best possible performance of a kernel by examining the average number of operations per memory access. It also plots ceilings below the "roofline" to suggest the optimizations that might be useful for improving performance. One goal of the performance counters should be to provide everything needed to automatically create Roofline models.

A notable challenge from our earlier description of the hardware tower is how to rapidly innovate at the hardware/software interface when it can take four to five years to build chips and run the programs needed to evaluate them. Given the capacity of field-programmable gate arrays (FPGAs), researchers can prototype full hardware and software systems that run fast enough to investigate architectural innovations. This flexibility means researchers can "tape out" every day, rather than over years. We will leverage the Research Accelerator for Multiple Processors (RAMP) Project (http://ramp.eecs.berkeley.edu/) to build flexible prototypes fast enough to run full software stacks—including new operating systems and our five compelling applications—to enable rapid architecture innovation using future prototype software, rather than past benchmarks.28

Reasons for Optimism
Given the history of parallel computing, it's easy to be pessimistic about our chances. The good news is that there are plausible reasons researchers could succeed this time:

No killer microprocessor. Unlike in the past, no one is building the faster serial microprocessor; programmers needing more performance have no option other than parallel hardware;

New measures of success. Rather than the traditional goal of linear speedup for all software as the number of processors increases, success can reflect improved responsiveness or MIPS/Joule for a few new parallel killer apps;

All the wood behind one arrow. As there is no alternative, the whole IT industry is committed, meaning many more people and companies are working on the problem;

Manycore synergy with cloud computing. SaaS applications in data centers with millions of users are naturally parallel and thus aligned with manycore, even if client apps are not;

Vitality of open source software. The OSS community is a meritocracy, so it's likely to embrace technical advances rather than be limited by legacy code. Though OSS has existed for years, it is more important commercially today than it was;

Single-chip multiprocessors enable innovation. Having all processors on the same chip enables inventions that were impractical or uneconomical when spread across many chips; and

FPGA prototypes shorten the hardware/software cycle. Systems like RAMP help researchers explore designs of easy-to-program manycore architectures and build prototypes more quickly than they ever could with conventional hardware prototypes.

Given the importance of the challenges to our shared future in the IT industry, pessimism is not a sufficient excuse to sit on the sidelines. The sin is not lack of success but lack of effort.

Related Projects
Computer science hasn't solved the parallel challenge, though not because it hasn't tried. There could be a dozen conferences dedicated to parallelism, including Principles and Practice of Parallel Programming, Parallel Algorithms and Architectures, Parallel and Distributed Processing, and Supercomputing. All traditionally focus on high-performance computing; the target hardware is usually large-scale computers with thousands of microprocessors. Similarly, there are many high-performance computing research centers. Rather than review this material, here we highlight four academic centers focused on multicore computers and their approaches to the parallel challenge:

Illinois. The Universal Parallel Computing Research Center (http://www.upcrc.illinois.edu/) at the University of Illinois focuses on making it easy for domain experts to take advantage of parallelism, so the emphasis is more on productivity in specific domains than on generality or performance.1 It relies on advancing compiler technology to find opportunities for parallelism, whereas the Par Lab focuses on autotuning. The Center is pursuing deterministic models that allow programmers to reason with sequential semantics for testing while naturally exposing a parallel performance model for WYSIWYG performance. For reactive programs where parallelism is part of the problem, it is pursuing a shared-nothing approach that leverages actor-like models used in distributed systems. For application domains that allow greater specialization, it is developing a framework to generate domain-specific environments that either hide concurrency or expose only specialized forms of concurrency to the end user while exploiting domain-specific optimizations and performance measures. Initial applications and domains include teleimmersion via "virtual teleportation" (multimedia), dynamic real-time virtual environments (computer graphics), learning by reading, and authoring assistance (natural language processing).

Stanford. The Pervasive Parallelism Laboratory (http://ppl.stanford.edu/wiki/index.php/Pervasive_Parallelism_Laboratory) at Stanford University takes an application-driven approach toward parallel computing that extends from programming models down to hardware architecture. The key technical concepts are domain-specific languages for increasing programmer productivity and a common parallel runtime environment combining dynamic and static approaches for concurrency and locality management. There are domain-specific languages for artificial intelligence and robotics, business data analysis, and virtual worlds and gaming. The experimental platform is the Flexible Architecture Research Machine, or FARM, system, combining commercial processors with FPGAs in the memory fabric.

Georgia Tech. The Sony, Toshiba, IBM Center of Competence for the Cell Broadband Engine Processor (http://sti.cc.gatech.edu/) at Georgia Tech focuses on a single multicore computer, as its name suggests. Researchers explore versions of programs on Cell, including image compression6 and financial modeling.2 The Center also sponsors workshops and provides remote access to Cell hardware.


Rice University. The Habanero Multicore Software Project (http://habanero.rice.edu/Habanero_Home.html) at Rice University is developing languages, compilers, managed runtimes, concurrency libraries, and tools that support portable parallel abstractions with high productivity and high performance for multicores; examples include parallel language extensions25 and optimized synchronization primitives.24

Conclusion
We've provided a general view of the parallel landscape, suggesting that the goal of computer science should be making parallel computing productive, efficient, correct, portable, and scalable. We highlighted the importance of finding new compelling applications and the advantages of manycore and heterogeneous hardware, and we described the research of the Berkeley Par Lab. While it will take years to learn which of our ideas work well, we share them here as a concrete example of a coordinated attack on the problem.

Unlike the traditional approach of making hardware king, the Par Lab is application-driven, working with domain experts to create compelling applications in music, image- and speech-recognition, health, and parallel browsers.

The software span connecting applications to hardware relies more on parallel software architectures than on parallel programming languages. Instead of traditional optimizing compilers, we depend on autotuners, using a combination of empirical search and performance modeling to create highly optimized libraries tailored to specific machines. By splitting the software stack into a productivity layer and an efficiency layer and targeting them at domain experts and programming experts, respectively, we hope to bring parallel computing to all programmers while keeping domain experts productive and allowing expert programmers to achieve maximum efficiency. Our approach to correctness relies on verification where possible, then uses the same tools to reduce the amount of testing where verification is not possible.

The hardware tower of the Par Lab serves the software span and application tower. Examples of such service include support for OS partitioning, explicit control of the memory hierarchy, accurate measurement of performance and energy, and an intuitive, multicore performance model. We also plan to try to scrape off the barnacles that have accumulated on the hardware/software stack over the years.

This parallel challenge offers the worldwide research community an opportunity to help IT remain a growth industry, sustain the parts of the worldwide economy that depend on continuous improvement in IT cost-performance, and take a once-in-a-career chance to reinvent the whole software/hardware stack. Though there are reasons for optimism, the difficulty of the challenge is reflected in the numerous parallel failures of the past. Combining upside and downside, this research challenge represents the most significant of all IT challenges over the past 50 years. We hope many more innovators will join this quest to build a parallel bridge.

Acknowledgments
This research is sponsored by the Universal Parallel Computing Research Center, which is funded by Intel and Microsoft (Award #20080469) and by matching funds from U.C. Discovery (Award #DIG07-10227). Additional support comes from the Par Lab affiliate companies: National Instruments, NEC, Nokia, NVIDIA, and Samsung. We wish to thank our colleagues in the Par Lab and the Lawrence Berkeley National Laboratory collaborations who shaped these ideas.

References
1. Adve, S. et al. Parallel Computing Research at Illinois: The UPCRC Agenda. White Paper. University of Illinois, Urbana-Champaign, IL, Nov. 2008.
2. Agarwal, V., Liu, L.-K., and Bader, D. Financial modeling on the Cell broadband engine. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (Miami, FL, Apr. 14–18, 2008).
3. Alexander, C. et al. A Pattern Language: Towns, Buildings, Construction. Oxford University Press, 1977.
4. Asanovic, K. et al. The Parallel Computing Laboratory at U.C. Berkeley: A Research Agenda Based on the Berkeley View. Technical Report UCB/EECS-2008-23, University of California, Berkeley, Mar. 21, 2008.
5. Asanovic, K. et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, University of California, Berkeley, Dec. 18, 2006.
6. Bader, D.A. and Patel, S. High-performance MPEG-2 software decoder on the Cell broadband engine. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (Miami, FL, Apr. 14–18, 2008).
7. Buschmann, F. et al. Pattern-Oriented Software Architecture: A System of Patterns. John Wiley & Sons, Inc., New York, 1996.
8. Clarke, D.G. et al. Ownership types for flexible alias protection. In Proceedings of the OOPSLA Conference (Vancouver, BC, Canada, 1998), 48–64.
9. Datta, K. et al. Stencil computation optimization and autotuning on state-of-the-art multicore architectures. In Proceedings of the ACM/IEEE Supercomputing (SC) 2008 Conference (Austin, TX, Nov. 15–21). IEEE Press, Piscataway, NJ, 2008.
10. Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E., Petitet, A., Vuduc, R., Whaley, R., and Yelick, K. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation 93, 2 (Feb. 2005), 293–312.
11. Engler, D.R. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th Symposium on Operating Systems Principles (Copper Mountain, CO, Dec. 3–6, 1995), 251–266.
12. Gamma, E. et al. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, Reading, MA, 1994.
13. Gelernter, D. and Carriero, N. Coordination languages and their significance. Commun. ACM 35, 2 (Feb. 1992), 97–107.
14. Henzinger, T.A. et al. Permissive interfaces. In Proceedings of the 10th European Software Engineering Conference (Lisbon, Portugal, Sept. 5–9). ACM Press, New York, 2005, 31–40.
15. Hill, M. and Marty, M. Amdahl's Law in the multicore era. IEEE Computer 41, 7 (July 2008), 33–38.
16. International Technology Roadmap for Semiconductors. Executive Summary, 2005 and 2007; http://public.itrs.net/.
17. Kantowitz, B. and Sorkin, R. Human Factors: Understanding People-System Relationships. John Wiley & Sons, Inc., New York, 1983.
18. Mattson, T., Sanders, B., and Massingill, B. Patterns for Parallel Programming. Addison-Wesley Professional, Reading, MA, 2004.
19. O'Hanlon, C. A conversation with John Hennessy and David Patterson. Queue 4, 10 (Dec. 2005/Jan. 2006), 14–22.
20. Patterson, D. and Hennessy, J. Computer Organization and Design: The Hardware/Software Interface, Fourth Edition. Morgan Kaufmann Publishers, Boston, MA, Nov. 2008.
21. Sen, K. and Viswanathan, M. Model checking multithreaded programs with asynchronous atomic methods. In Proceedings of the 18th International Conference on Computer-Aided Verification (Seattle, WA, Aug. 17–20, 2006).
22. Sen, K. et al. CUTE: A concolic unit testing engine for C. In Proceedings of the Fifth Joint Meeting European Software Engineering Conference (Lisbon, Portugal, Sept. 5–9). ACM Press, New York, 2005, 263–272.
23. Shaw, M. and Garlan, D. An Introduction to Software Architecture. Technical Report CMU/SEI-94-TR-21, ESC-TR-94-21. CMU Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 1994.
24. Shirako, J., Peixotto, D., Sarkar, V., and Scherer, W. Phasers: A unified deadlock-free construct for collective and point-to-point synchronization. In Proceedings of the 22nd ACM International Conference on Supercomputing (Island of Kos, Greece, June 7–12). ACM Press, New York, 2008, 277–288.
25. Shirako, J., Kasahara, H., and Sarkar, V. Language extensions in support of compiler parallelization. In Proceedings of the 20th Workshop on Languages and Compilers for Parallel Computing (Urbana, IL, Oct. 11–13). Springer-Verlag, Berlin, 2007, 78–94.
26. Thomas, D. et al. Agile Web Development with Rails, Second Edition. The Pragmatic Bookshelf, Raleigh, NC, 2008.
27. UPC Language Specifications, Version 1.2. Technical Report LBNL-59208. Lawrence Berkeley National Laboratory, Berkeley, CA, 2005.
28. Wawrzynek, J. et al. RAMP: Research Accelerator for Multiple Processors. IEEE Micro 27, 2 (Mar. 2007), 46–57.
29. Williams, S., Waterman, A., and Patterson, D. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65–76.
30. Williams, S. et al. Lattice Boltzmann simulation optimization on leading multicore platforms. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (Miami, FL, Apr. 14–18, 2008).
31. Williams, S. et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the Supercomputing (SC07) Conference (Reno, NV, Nov. 10–16). ACM Press, New York, 2007.

The authors are all affiliated with the Par Lab (http://parlab.eecs.berkeley.edu/) at the University of California, Berkeley.

© 2009 ACM 0001-0782/09/1000 $10.00



DOI:10.1145/1562764.1562784

Automated Support for Managing Feature Requests in Open Forums

The result is stable, focused, dynamic discussion threads that avoid redundant ideas and engage thousands of stakeholders.

BY JANE CLELAND-HUANG, HORATIU DUMITRU, CHUAN DUAN, AND CARLOS CASTRO-HERRERA

AS SOFTWARE PROJECTS grow larger and more complex and involve more stakeholders across geographical and organizational boundaries, project managers increasingly rely on open discussion forums to elicit requirements and otherwise communicate with other stakeholders. Unfortunately, open forums generally don't support all aspects of the requirements-elicitation process; for example, in most forums, stakeholders create their own discussion threads, introducing significant redundancy of ideas and possibly causing them to miss important discussions.

Here, we describe an automated forum management (AFM) system we designed to organize discussion threads in open forums, exploring it through a series of case studies and experiments using feature requests mined from open source forums.

Using Web-based forums to gather requirements from stakeholders is common in open source projects and increasingly common across a number of enterprise-level projects. Such forums can grow quite large, with many thousands of stakeholders. Most use discussion threads to help focus the conversation; however, user-defined threads tend to result in redundant ideas, while predefined categories might be overly rigid, possibly leading to coarse-grain topics that fail to facilitate focused discussions.

To better understand the problems and challenges of capturing requirements in open forums, we surveyed several popular open source projects, evaluating how the structure of their forums and the organization of their feature requests help stakeholders work collaboratively toward their goals. The survey covered a number of tools and frameworks: a customer relationship management system called SugarCRM11; a UML modeling tool called Poseidon10; an enterprise resource planning tool called Open Bravo10; a groupware tool called ZIMBRA10; a Web tool for administrating the MySQL server called PHPMyAdmin10; an open .NET architecture for Linux called Mono10; and the large Web-based immersive game world Second Life.9 All exhibited a significantly high percentage of discussion threads consisting of only one or two feature requests. For example, as shown in Figure 1, 59% of Poseidon threads, 70% of SugarCRM threads, 48% of Open Bravo threads, and 42% of Zimbra threads included only one or two requests. The presence of so many small threads suggests either a significant number of distinct discussion topics or that users had created unnecessary new threads. An initial analysis found the second case—unnecessary new threads—held for all forums we surveyed.


For example, in the SugarCRM forum, stakeholders' requests related to email lists were found across 20 different clusters, 13 of which included only one or two comments.

Figure 1. Forums are characterized by numerous small threads, many with only one or two feature requests. [Four histograms of number of threads vs. thread size: Poseidon (1,423 threads; 377 stakeholders; 5,764 feature requests; largest thread 48), Zimbra (1,334 threads; 506 stakeholders; 6,194 feature requests; largest thread 148), Open Bravo ERP (858 threads; 337 stakeholders; 3,041 feature requests; largest thread 31), and SugarCRM (309 threads; 523 stakeholders; 1,000 feature requests; largest thread 74).]

We designed an experiment to determine quantitatively if user feature requests and comments (placed in new threads) should instead have been placed in larger preexisting threads. We conducted it using data from SugarCRM, an open source customer relationship management system supporting campaign management, email marketing, lead management, marketing analysis, forecasting, quote management, and case management. We mined 1,000 feature requests contributed by 523 different stakeholders over a two-year period (2006–2008), distributed across 309 threads from the open discussion forum. Of the requests, 272 had been placed (by users) into discussion threads with no more than two feature requests, 309 had been placed in groups with nine or more requests, and the rest were placed in intermediate-size groups.

We clustered all the feature requests using a consensus Spherical K-Means algorithm (described in the following sections), then estimated the degree to which each feature request belonged to a topic by computing the proximity of the request to the centroid, or center, of its assigned cluster. We hypothesized that feature requests representing unique topics would have low proximity scores, while those fitting well into a topic would have higher scores. Figure 2 shows the results for feature requests assigned to small groups with only one or two feature requests vs. those assigned to larger groups of nine or more requests. We performed a T-test that returned a p-value of 0.0005, showing the distributions exhibited a significant difference; nevertheless, there was a high degree of overlap between the two distributions, suggesting that many of the features assigned to individual or small threads could have been placed with other feature requests in larger threads. It was also reasonable to assume that once stakeholders found a relevant thread and added a feature request to it, their choice of words would have been influenced by words in the existing thread, thereby increasing the similarity of feature requests placed in shared threads. We observed that users often responded directly to earlier entries in a thread, possibly accounting for some of the differences in proximity scores between the two groupings. The results from the experiment and our subjective observations generally suggest that in many cases users incorrectly decided to create new threads.

Automated Approach
Many such problems can be alleviated through AFM tools that employ data clustering techniques to create distinct, highly focused discussion threads. However, clustering is hindered by significant background noise in feature requests (such as spelling errors, poor grammar, slang, long-winded comments, nonstandard abbreviations, redundancies, inconsistent use of terms, and off-topic discussions).
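A sketch of the proximity analysis behind this experiment (our illustration with synthetic scores; in the real experiment the scores come from the clustering described in the next section):

    import numpy as np
    from scipy import stats

    def proximity_scores(vectors, labels, centroids):
        # Cosine proximity of each unit-normalized request vector to the
        # centroid of its assigned cluster (dot product = cosine here).
        return np.array([float(vectors[i] @ centroids[labels[i]])
                         for i in range(len(vectors))])

    # Synthetic stand-ins for the scores of requests placed in threads of
    # <= 2 requests vs. threads of >= 9 requests.
    small = np.random.default_rng(1).normal(0.45, 0.12, 272).clip(0, 1)
    large = np.random.default_rng(2).normal(0.52, 0.12, 309).clip(0, 1)
    t, p = stats.ttest_ind(small, large)
    print("p-value:", p)  # a small p-value indicates the groups differ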


Figure 2. The extent to which feature requests assigned to individual vs. larger threads fit into global topics. [Histogram of number of feature requests (0–70) vs. proximity score (0.15–0.8), plotted separately for small threads and large threads.]

To be effective, an AFM must deliver high-quality clusters, each representing a focused theme and distinct from other clusters, in order to minimize the redundancy of ideas across discussions. Clustering algorithms must also execute quickly, so clustering occurs inconspicuously in the background. Finally, as open-discussion forums are characterized by a steady influx of feature requests, clusters must be relatively stable, so requests are not constantly moved from cluster to cluster.

We designed the AFM process to meet these goals. Once an initial set of feature requests is collected, the project manager places the forum under the management of the AFM system, and existing requests are analyzed to identify themes and generate discussion threads. Beyond this point, arriving feature requests are classified into existing threads, unless a new topic is detected, in which case a new thread is created. Here, we describe these processes and the underlying algorithms we developed as the result of extensive experimentation with requirements-clustering techniques.6,7

In preparation for clustering, feature requests are preprocessed to remove common words (such as "this" and "because") not useful for identifying underlying themes. The remaining terms are then stemmed to their grammatical roots. Each feature request x is represented as a vector of terms (t_x,1, t_x,2, ..., t_x,W) that is then normalized through a technique called term frequency-inverse document frequency (tf-idf), where tf represents the original frequency of term t in the feature request, and idf represents the inverse document frequency; idf is often computed as log2(N/df_t), where N represents the number of feature requests in the entire forum, and df_t represents the number of feature requests containing t. The similarity between each normalized request vector a = (a_1, a_2, ..., a_W) and each centroid b = (b_1, b_2, ..., b_W) is then computed as

    sim(a, b) = Σ_{i=1..W} a_i·b_i / ( √(Σ_{i=1..W} a_i²) · √(Σ_{i=1..W} b_i²) ),

where W represents the total number of terms in the entire set of feature requests, and a_i is computed as the number of times term t_i occurs in feature request a, weighted according to the idf value. Intuitively, this formula returns higher similarity scores between two feature requests if they share a significant number of relatively rare terms.

We then determine an appropriate cluster granularity through a technique devised by F. Can and E.A. Ozkarahan2 in 1990 that predicts the ideal number of clusters by considering the degree to which each feature request differentiates itself from other such requests. The ideal number of clusters K is computed as

    K = Σ_{x∈D} Σ_{i=1..W} (f_x,i / |x|) · (f_x,i / N_i) = Σ_{x∈D} (1 / |x|) · Σ_{i=1..W} f_x,i² / N_i,

where f_x,i is the frequency of term t_i in artifact x, |x| is the length of the artifact, and N_i is the total occurrence of term t_i. This approach has been shown to be effective across multiple data sets2,6 and can be calculated quickly, so the ideal number of clusters is recomputed frequently.

Our approach uses a clustering algorithm called Spherical K-Means (SPK)5 that exhibits fast running times and returns relatively high-quality results.6 As described more formally in Figure 3, a set of K centroids, or seeds, is initially selected by the algorithm, with the objective that they be as dissimilar from each other as possible.

Figure 3. Two-stage spherical K-means.

Algorithm: Two-stage spherical K-means clustering
Input: unlabeled instances X = {x_i}, i = 1..N; number of clusters K; initial centroids I; convergence condition
Output: crisp K-partition P = {c_i}, i = 1..K
Steps:
1. Initialization: initialize centroids using I: {μ_i^0} = I, i = 1..K; t = 0.
2. Assign instances to centroids and update centroids until convergence:
   2a. assign each instance x to the nearest cluster;
   2b. update each centroid to the average of its assigned instances:
       μ_i^(t+1) = ( Σ_{x ∈ c_i^(t+1)} x ) / | c_i^(t+1) |;
   2c. t = t + 1.
3. Incremental optimization until convergence:
   3a. randomly select an instance x and move it to the cluster that maximizes the objective function
       Q_sp = Σ_{i=1..K} Σ_{x ∈ c_i} xᵀ μ_i^t;
   3b. update each centroid as in step 2b;
   3c. t = t + 1.
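The preprocessing, similarity, and granularity computations described above can be sketched as follows (our simplification: naive tokenization stands in for the stopword removal and stemming described in the text):

    import math
    import re
    from collections import Counter

    def vectorize(requests):
        # tf-idf vectors, unit-normalized so a dot product is cosine similarity.
        docs = [Counter(re.findall(r"[a-z]+", r.lower())) for r in requests]
        n = len(docs)
        df = Counter(t for d in docs for t in d)
        idf = {t: math.log2(n / df[t]) for t in df}
        vectors = []
        for d in docs:
            v = {t: tf * idf[t] for t, tf in d.items()}
            norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
            vectors.append({t: w / norm for t, w in v.items()})
        return vectors

    def sim(a, b):
        # Cosine similarity between two sparse, unit-normalized vectors.
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    def ideal_k(requests):
        # Can & Ozkarahan's cluster-count estimate:
        # K = sum over artifacts x of (1/|x|) * sum_i f_xi^2 / N_i.
        docs = [Counter(re.findall(r"[a-z]+", r.lower())) for r in requests]
        totals = Counter(t for d in docs for t in d.elements())  # N_i
        k = 0.0
        for d in docs:
            length = sum(d.values())                             # |x|
            k += sum(f * f / totals[t] for t, f in d.items()) / length
        return max(1, round(k))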


The distance from each feature request to each centroid is computed, and each feature request is placed in the cluster associated with its closest centroid. Once all feature requests are assigned to clusters, the centroids are repositioned so as to increase their average proximity to all feature requests in the cluster. This is followed by a series of incremental optimizations in which an attempt is made to move a randomly selected feature request to the cluster for which it maximizes the overall cohesion gain. The process continues until no further optimization is possible. This simple post-processing step has been shown to significantly improve the quality of clustering results.6

Clustering results are strongly influenced by initial centroid selection, meaning poor selection can lead to low-quality clusterings. However, this problem can be alleviated through a consensus technique for performing the initial clustering, an approach that generates multiple individual clusterings, then uses a voting mechanism to create a final result.8 Though it has a much longer running time than SPKMeans, consensus clustering has been shown to consistently deliver clusterings of higher-than-average quality compared to standalone SPK clusterings.6 In the AFM system, consensus clustering is used only for the initial clustering, in order to create the best possible set of start-up threads.

Following the arrival of each new feature request, the algorithm recomputes the ideal granularity to determine if a new cluster should be added. To add a new cluster in a way that preserves the stability of existing clusters and minimizes clustering time, the AFM approach identifies the least-cohesive cluster, then bisects it using SPK with K = 2. Feature requests from neighboring clusters are then reevaluated to determine if they exhibit closer proximity to one of the two new centroids than they do to their currently assigned centroids. If so, they are reassigned to the relevant cluster. To ensure continued cluster quality, the entire data set is re-clustered periodically following the arrival of a fixed number of new feature requests. Re-clustering is performed through a modified SPKMeans algorithm (we call it Stable SPKMeans) designed to minimize the movement of feature requests between clusters through reuse of the current set of centroids as seeds for the new clustering.

Cluster quality is also improved through user feedback specifying whether a pair of feature requests belongs together. For example, users not happy with the quality of a cluster can specify that a given feature request does not fit the cluster. They might also provide additional tags to help place the feature request in a more appropriate cluster. These user constraints, along with the tag information, are then incorporated into the SPK algorithm in future clusterings. This reassignment maximizes the quality of the individual clusters and optimizes conformance to user constraints. Our prior work in this area demonstrated significant improvement in cluster quality when constraints are considered.13

Figure 4. A partial cluster with security-related requirements generated after gathering 1,000 constraints from the Student data set.
(1) The system must protect stored confidential information.
(2) The system must encrypt purchase/transaction information.
(3) A privacy policy must describe in detail to the users how their information is stored and used.
(4) Transmission of personal information must be encrypted.
(5) Transmission of financial transactions must be encrypted.
(6) The system must use both encrypt and decrypt in some fields.
(7) The system must allow users to view their previous transactions.
(8) Databases must use the TripleDES encryption standard for database security. AES is still new and has had compatibility issues with certain types of databases, including SQL Server express edition.
(9) The site must ensure that payment information is confidential and credit card transactions are encrypted to prevent hackers from retrieving information.
(10) Because the system will be used to buy books, we must focus on security and consider transaction control in the architecture used to build it.
(11) Correct use of cryptography techniques must be applied in the Amazon portal system to protect student information from outsiders and staff who might potentially acquire the information if left unprotected.
(12) Sessions that handle payment transactions must be encrypted.
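A compact sketch of the two-stage algorithm in Figure 3 (our simplification; the incremental stage greedily moves a point to its nearest centroid, which approximates maximizing the objective gain):

    import numpy as np

    def spherical_kmeans(X, seeds, batch_iters=20, moves=1000, seed=0):
        # X: n x d array of unit-length tf-idf vectors; seeds: K x d centroids.
        rng = np.random.default_rng(seed)
        C = seeds.copy()
        # Stage 1: batch assignment and centroid updates.
        for _ in range(batch_iters):
            labels = np.argmax(X @ C.T, axis=1)   # cosine = dot for unit vectors
            for k in range(len(C)):
                members = X[labels == k]
                if len(members):
                    m = members.sum(axis=0)
                    C[k] = m / np.linalg.norm(m)  # renormalize centroid
        # Stage 2: incremental single-point moves that raise total cohesion,
        # refreshing the two affected centroids after each move.
        for _ in range(moves):
            i = int(rng.integers(len(X)))
            k_new = int(np.argmax(X[i] @ C.T))
            if k_new != labels[i]:
                k_old, labels[i] = labels[i], k_new
                for k in (k_old, k_new):
                    members = X[labels == k]
                    if len(members):
                        m = members.sum(axis=0)
                        C[k] = m / np.linalg.norm(m)
        return labels, C

Passing the previous clustering's centroids as the seeds approximates the Stable SPKMeans behavior described above.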


Evaluating AFM
We conducted a series of experiments designed to evaluate the AFM's ability to quickly deliver cohesive, distinct, and stable clusters. They utilized three data sets: the first was the SugarCRM data set discussed earlier; the second included 4,205 feature requests we mined from Second Life, an Internet-based virtual-world video game in which stakeholders are represented by interacting avatars; and the third described the features of an online Amazon-like portal designed specifically for students. In spring 2008 we asked 36 master's-level students enrolled in two different advanced software-engineering classes at DePaul University to consider the needs of a typical student, textbook reseller, textbook author, textbook publisher, instructor, portal administrator, architect, project manager, and developer and to create relevant feature requests. The result was 366 feature requests.

To evaluate cluster quality, we constructed an "ideal" clustering for the SugarCRM data by reviewing and modifying the natural discussion threads created by SugarCRM users. Modifications included merging closely related singleton threads, decomposing large megathreads into smaller, more cohesive ones, and reassigning misfits to new clusters. The answer set enabled us to compare the quality of the generated clusters using a metric called Normalized Mutual Information (NMI), which measures the extent to which knowledge of one cluster reduces uncertainty about other clusters.9 On a scale of 0 (no similarity) to 1 (identical clusterings), the SugarCRM clusterings scored 0.57, indicating some degree of similarity between the generated clusters and the answer set. In experiments using the answer set to simulate the gathering of nonconflicting user feedback, NMI scores increased to 0.62 after 1,000 pairwise constraints were randomly selected from the pairs of requests exhibiting borderline (neither very strong nor very weak) proximity scores.6 We also created an answer set for the Student Portal data set; initial NMI clustering scores of 0.58 increased to 0.7 following 1,000 pairwise constraints. Figure 4 outlines a cohesive cluster of feature requests for the Student data set generated using 1,000 system-wide constraints. Note that 1,000 constraints represent only 1.5% of possible constraints for the Student Portal data set and only 0.2% for SugarCRM.

These results demonstrate that many of the themes we identified in our two answer sets were also detected by the unsupervised clustering algorithms. However, because more than one clustering is possible for a given data set, these metrics alone lack sufficient insight to judge whether the cluster quality is better or worse than the user-defined threads. NMI metrics could not be used to evaluate user-defined threads due to the significant disparity between the number of user-defined threads and the number of threads in the answer set.

We thus conducted a case study using the SugarCRM data set to compare the treatment of four topics in the original SugarCRM user-managed forum versus the AFM approach. We selected the topics by examining an unstructured list of feature requests, including distribution lists, appointment scheduling, language support, and document attachments. For each topic, we first identified the set of related feature requests, then compared the way they had been allocated to both user-defined and AFM-generated threads (see Table 1). We did not solicit user feedback for these experiments.

We identified 14 feature requests for the first topic, distribution lists. Human users placed them in a thread called "email management," including a total of 74 feature requests. The AFM system similarly placed all requests in the thread called "send email to sugar contacts." In this case, both approaches were relatively successful.

For the second topic, appointment scheduling, we identified 32 feature requests. Human stakeholders placed them across six different threads, with 15 in a thread called "scheduling," 12 in a thread called "how to improve calendering [sic] and activities in sugar sales," two in a thread called "calendar," and one each in threads related to mass updates, printing, and email management. The AFM placed 18 feature requests in a discussion related to daily appointment scheduling, another 10 in a discussion related to meeting scheduling, and the remaining four in two clusters related to email, Web-enabled calendars, and integration with other tools. In this case—appointment scheduling—the AFM performed marginally better than the user-defined threads, as the feature requests were slightly less dispersed.

For the third topic, language support, we identified 11 feature requests.

Figure 5. Stability of threads in the AFM clustering process. [Three panels, each plotting the Sugar, Second Life, and Student data sets: (a) number of moves per feature request from beginning to end of the incremental clustering process, as a percentage of feature requests vs. number of moves (0–10); movements of < 0.1% are not shown; (b) percentage of feature requests moved during the incremental clustering process vs. iteration number; (c) number of feature requests moved during the incremental clustering process vs. iteration number.]


seven relatively small threads epitomizing the kinds of problems we found in the open source forums we studied. The AFM created a focused discussion forum on the topic of languages in which it placed nine of the 11 requests. The AFM approach excelled here.

The fourth and final topic, document attachments, represents a case in which a fairly dominant concern was dispersed by human users across 13 different threads, while the AFM placed all related requests in a single highly focused discussion thread.

It was evident that in each of the four cases the AFM approach either matched or improved on the results of the user-created threads.

AFM was designed to minimize the movement of feature requests among clusters in order to preserve stability and quality. In a series of experiments against the SugarCRM, Second Life, and Student data sets, we found that the Stable SPKMeans clustering algorithm had no significant negative effect on the quality of the final clusters. In fact, the NMI scores for the two data sets—SugarCRM and Student, for which "target clusterings" were available—showed a slight improvement when we used the stable algorithm.

To evaluate the stability of the modified SPKMeans algorithm, we tracked the movement of feature requests between clusters for re-clustering intervals of 25 feature requests. Figure 5a shows the number of moves per feature request, reported in terms of percentage. Approximately 62%–65% of feature requests were never moved, 20%–30% of feature requests were moved once or twice, and only about 5%–8% were moved more frequently. A significant amount of this movement is accounted for by the arrival of new topics and the subsequent creation of new threads, causing reclassification of some existing feature requests. Figure 5b shows that the percentage of feature requests moved across clusters gradually decreases across subsequent iterations. However, as shown in Figure 5c, the actual number of movements increases in early iterations, then becomes relatively stable. Though not discussed here, we also conducted extensive experiments that found the traditional unmodified SPKMeans algorithm resulted in approximately 1.6–2.6 times more volatility of feature requests than our modified version.
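The bookkeeping behind such move counts can be sketched in a few lines of Matlab. Mapping each old cluster to the new cluster it overlaps most is our own simplification for illustration, not necessarily the exact tracking rule used in our experiments.

% Sketch: count feature requests that moved between two consecutive
% clusterings, given as label vectors old and new. Each old cluster is
% mapped to the new cluster it overlaps most; members landing elsewhere
% are counted as moves.
function moved = count_moves(old, new)
  moved = 0;
  for c = unique(old(:)')                % visit each old cluster
    members = find(old == c);
    target = mode(new(members));         % best-overlap new cluster
    moved = moved + sum(new(members) ~= target);
  end
end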

Table 1. Four topics in human vs. automated thread management. (Each thread entry is followed by the number of feature requests assigned to it and the thread's total cluster size.)

Topic 1: Distribution lists
  User-defined threads: Email management (14, 74).
  AFM-defined threads: Send email to sugar contacts (14, 62).

Topic 2: Scheduling appointments
  User-defined threads: Scheduling (15, 37); How to improve calendaring and activities in sugar sales (12, 42); Calendar (2, 32); Mass updates (1, 14); Printing (1, 5); Email management (1, 74).
  AFM-defined threads: View daily appointments in calendar (18, 58); Schedule meetings (10, 39); Calendar support features (3, 27); Send email to sugar contacts (1, 62).

Topic 3: Language support
  User-defined threads: International salutations using one language pack (2, 3); Customer salutation in email template (2, 2); Outlook category synchronization (2, 7); Project: Translation system (2, 4); Fallback option in language files (1, 1); Language Preference option (1, 1); A simplified workflow for small business (1, 7).
  AFM-defined threads: Contact languages (9, 11); Users roles and defaults (1, 32); Contact categories and fields (1, 13).

Topic 4: Document attachments
  User-defined threads: Attaching documents to cases, projects and sugar documents (6, 7); Attachments for bug tracker (1, 1); Display attachment (3, 3); Display size of attachments (1, 1); Documents to a customer (2, 3); Email management (2, 74); File upload field (2, 5); Koral document management integration (1, 1); Link documents with project (2, 2); Module builder (1, 6); New feature request: Attach file to an opportunities (1, 1); View all in all lists (2, 3); WebDAV access to "documents" (1, 3).
  AFM-defined threads: Attach and manage documents (25, 32).

Note: AFM-generated threads are initially labeled according to the most common terms and then renamed by stakeholders. For example, the thread "Send email to sugar contacts" was first generated as "email, send, list, sugar, contact." The threads here have been renamed. Spelling and grammar are maintained from the original user forums.


We analyzed the total execution time of the Stable SPKMeans algorithm for each of the three data sets using MATLAB on a 2.33GHz machine; Table 2 outlines the total time needed to perform the clusterings at various increments. We excluded initial consensus clustering times that, with typical parameters of 50–100 requests, could take up to two minutes. As depicted in this example, the SugarCRM data set took a total of 57.83 seconds to complete all 29 clusterings at increments of 50 feature requests. Individual clusterings are notably fast; for example, the complete Second Life data set, consisting of 4,205 requests, was clustered in 75.18 seconds using standard SPKMeans and in only 1.02 seconds using our stable approach.

The Stable SPKMeans algorithm significantly improves the performance of the AFM, mainly because it takes less time to converge on a solution when quality seeds are passed forward from the previous clustering. Smaller increments require more clusterings, so the overall clustering time increases as the increment size decreases. However, our experiments found that increasing the increment size to 25 or even 50 feature requests has negligible effect on the quality and stability of the clusters.
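To illustrate why seeding matters, the following Matlab sketch shows a spherical k-means iteration that starts from centroids carried over from the previous clustering. It captures only the seeding idea; it is not the Stable SPKMeans algorithm itself, which adds further stability machinery.

% Sketch of seeded spherical k-means. Rows of X are unit-length term
% vectors; C (k-by-d) holds seed centroids, e.g., the concept vectors
% produced by the previous incremental clustering.
function [labels, C] = spkmeans_seeded(X, C, iters)
  for t = 1:iters
    S = X * C';                            % cosine similarities
    [~, labels] = max(S, [], 2);           % assign to nearest centroid
    for j = 1:size(C,1)
      m = sum(X(labels == j, :), 1);       % cluster concept vector
      if norm(m) > 0
        C(j,:) = m / norm(m);              % renormalize to unit length
      end
    end
  end
end

When the seeds already sit near the final concept vectors, the first assignment step changes few labels, which is consistent with the convergence times reported in Table 2.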
Conclusion
We have identified some of the problems experienced in organizing discussion threads in open forums. The survey we conducted in summer 2008 of several open source forums suggests that expecting users to manually create and manage threads may not be the most effective approach. In contrast, we described an automated technique involving our own AFM for creating stable, high-quality clusters to anchor related discussion groups. Though no automated technique always delivers clusters that are cohesive and distinct from other clusters, our reported experiments and case studies demonstrate the advantages of using data-mining techniques to help manage discussion threads in open discussion forums. Our ongoing work aims to improve techniques for incorporating user feedback into the clustering process so clusters that appear ad hoc to users or contain multiple themes can be improved.

These findings are applicable across a range of applications, including those designed to gather comments from a product's user base, support activities (such as event planning), and capture requirements in large projects when stakeholders are dispersed geographically. Our ongoing work focuses on the use of forums to support the gathering and prioritizing of requirements where automated forum managers improve the allocation of feature requests to threads and use recommender systems to help include stakeholders in relevant discussion groups.3 They also improve the precision of forum search and enhance browsing capabilities by predicting and displaying stakeholders' interest in a given discussion thread.

From the user's perspective, AFM facilitates the process of entering feature requests. Enhanced search features help users decide where to place new feature requests more accurately. Underlying data-mining functions then test the validity of the choice and (when placement is deemed incorrect) recommend moving the feature request to another existing discussion group or sometimes to an entirely new thread.

All techniques described here are being implemented in the prototype AFM tool we are developing to test and evaluate the AFM as an integral component of large-scale, distributed-requirements processes.

Acknowledgments
This work was partially funded by National Science Foundation grant CCR-0447594, including a Research Experiences for Undergraduates summer supplement to support the work of Horatiu Dumitru. We would also like to acknowledge Brenton Bade, Phik Shan Foo, and Adam Czauderna for their work developing the prototype.

References
1. Basu, C., Hirsh, H., and Cohen, W. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the 15th National Conference on Artificial Intelligence (Madison, WI, July 26–30). MIT Press, Cambridge, MA, 1998, 714–720.
2. Can, F. and Ozkarahan, E.A. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems 15, 4 (Dec. 1990), 483–517.
3. Castro-Herrera, C., Duan, C., Cleland-Huang, J., and Mobasher, B. A recommender system for requirements elicitation in large-scale software projects. In Proceedings of the 2009 ACM Symposium on Applied Computing (Honolulu, HI, Mar. 9–12). ACM Press, New York, 2008, 1419–1426.
4. Davis, A., Dieste, O., Hickey, A., Juristo, N., and Moreno, A. Effectiveness of requirements elicitation techniques. In Proceedings of the 14th IEEE International Requirements Engineering Conference (Minneapolis, MN, Sept.). IEEE Computer Society, 2006, 179–188.
5. Dhillon, I.S. and Modha, D.S. Concept decompositions for large sparse text data using clustering. Machine Learning 42, 1–2 (Jan. 2001), 143–175.
6. Duan, C., Cleland-Huang, J., and Mobasher, B. A consensus-based approach to constrained clustering of software requirements. In Proceedings of the 17th ACM International Conference on Information and Knowledge Management (Napa, CA, Oct. 26–30). ACM Press, New York, 2008, 1073–1082.
7. Frakes, W.B. and Baeza-Yates, R. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
8. Fred, A.L. and Jain, A.K. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 6 (June 2005), 835–850.
9. Second Life virtual 3D world; http://secondlife.com; feature requests downloaded from the Second Life issue tracker https://jira.secondlife.com/secure/Dashboard.jspa.
10. SourceForge. Repository of open source code and applications; feature requests for Open Bravo, ZIMBRA, PHPMyAdmin, and Mono downloaded from SourceForge forums http://sourceforge.net/.
11. Sugar CRM. Commercial open source customer relationship management software; http://www.sugarcrm.com/crm/; feature requests mined from http://www.sugarcrm.com/forums/.
12. Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. Constrained K-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (June 28–July 1). Morgan Kaufmann Publishers, San Francisco, 2001, 577–584.
Table 2. Performance measured by total time spent clustering (in seconds).

                                 Increment size                    Entire set,
Data set                        1         10        25       50     once
Student (366 feature requests)
  Stable SPKMeans             7.49       0.98      0.62     0.41     0.04
  Standard SPKMeans         101.82      10.78      4.74     2.92     0.73
Sugar (1,000 feature requests)
  Stable SPKMeans            84.54      13.24      6.66     4.27     0.22
  Standard SPKMeans       2,374.31     249.92    108.62    57.83     6.69
Second Life (4,205 feature requests)
  Stable SPKMeans          1,880.69    268.15    146.75    96.11     1.02
  Standard SPKMeans      11,409.57  12,748.63  5,917.94  2,619.90   75.18

The final column gives the time to cluster the entire set of feature requests one time.

Jane Cleland-Huang ([email protected]) is an associate professor in the School of Computing at DePaul University, Chicago, IL.

Horatiu Dumitru ([email protected]) is an undergraduate student studying computer science at the University of Chicago, Chicago, IL.

Chuan Duan ([email protected]) is a post-doctoral researcher in the School of Computing at DePaul University, Chicago, IL.

Carlos Castro-Herrera ([email protected]) is a Ph.D. student in the School of Computing at DePaul University, Chicago, IL.

© 2009 ACM 0001-0782/09/1000 $10.00



review articles

DOI:10.1145/1562764.1562785

This Gödel Prize-winning work traces the steps toward modeling real data.

BY DANIEL A. SPIELMAN AND SHANG-HUA TENG

Smoothed Analysis: An Attempt to Explain the Behavior of Algorithms in Practice

    "My experiences also strongly confirmed my previous opinion that the best theory is inspired by practice and the best practice is inspired by theory."
    Donald E. Knuth, "Theory and Practice," Theoretical Computer Science, 1991.

ALGORITHMS ARE HIGH-LEVEL descriptions of how computational tasks are performed. Engineers and experimentalists design and implement algorithms, and generally consider them a success if they work in practice. However, an algorithm that works well in one practical domain might perform poorly in another. Theorists also design and analyze algorithms, with the goal of providing provable guarantees about their performance. The traditional goal of theoretical computer science is to prove that an algorithm performs well in the worst case: if one can prove that an algorithm performs well in the worst case, then one can be confident that it will work well in every domain. However, there are many algorithms that work well in practice that do not work well in the worst case. Smoothed analysis provides a theoretical framework for explaining why some of these algorithms do work well in practice.

The performance of an algorithm is usually measured by its running time, expressed as a function of the input size of the problem it solves. The performance profiles of algorithms across the landscape of input instances can differ greatly and can be quite irregular. Some algorithms run in time linear in the input size on all instances, some take quadratic or higher order polynomial time, while some may take an exponential amount of time on some instances.

Traditionally, the complexity of an algorithm is measured by its worst-case performance. If a single input instance triggers an exponential runtime, the algorithm is called an exponential-time algorithm. A polynomial-time algorithm is one that takes polynomial time on all instances. While polynomial-time algorithms are usually viewed as being efficient, we clearly prefer those whose runtime is a polynomial of low degree, especially those that run in nearly linear time.

It would be wonderful if every algorithm that ran quickly in practice was a polynomial-time algorithm. As this is not always the case, the worst-case framework is often the source of discrepancy between the theoretical evaluation of an algorithm and its practical performance.

It is commonly believed that practical inputs are usually more favorable than worst-case instances. For example, it is known that the special case of the Knapsack problem in which one must determine whether a set of n numbers can be divided into two groups of equal sum does not have a polynomial-time algorithm, unless NP


is equal to P. Shortly before he passed away, Tim Russert of NBC's "Meet the Press" commented that the 2008 election could end in a tie between the Democratic and the Republican candidates. In other words, he solved a 51-item Knapsack problem^a by hand within a reasonable amount of time, and most likely without using the pseudo-polynomial-time dynamic-programming algorithm for Knapsack!

In our field, the simplex algorithm is the classic example of an algorithm that is known to perform well in practice but has poor worst-case complexity. The simplex algorithm solves a linear program, for example, of the form,

    max c^T x subject to Ax ≤ b,    (1)

where A is an m × n matrix, b is an m-place vector, and c is an n-place vector. In the worst case, the simplex algorithm takes exponential time.25

Developing rigorous mathematical theories that explain the observed performance of practical algorithms and heuristics has become an increasingly important task in Theoretical Computer Science. However, modeling observed data and practical problem instances is a challenging task, as insightfully pointed out in the 1999 "Challenges for Theory of Computing" Report for an NSF-Sponsored Workshop on Research in Theoretical Computer Science:^b

    While theoretical work on models of computation and methods for analyzing algorithms has had enormous payoff, we are not done. In many situations, simple algorithms do well. Take for example the Simplex algorithm for linear programming, or the success of simulated annealing of certain supposedly intractable problems. We don't understand why! It is apparent that worst-case analysis does not provide useful insights on the performance of algorithms and heuristics and our models of computation need to be further developed and refined. Theoreticians are investing increasingly in careful experimental work leading to identification of important new questions in algorithms area. Developing means for predicting the performance of algorithms and heuristics on real data and on real computers is a grand challenge in algorithms.

Needless to say, there are a multitude of algorithms beyond simplex and simulated annealing whose performance in practice is not well explained by worst-case analysis. We hope that theoretical explanations will be found for the success in practice of many of these algorithms, and that these theories will catalyze better algorithm design.

a In presidential elections in the United States, each of the 50 states and the District of Columbia is allocated a number of electors. All but the states of Maine and Nebraska use a winner-take-all system, with the candidate winning the majority votes in each state being awarded all of that state's electors. The winner of the election is the candidate who is awarded the most electors.
b Available at http://sigact.acm.org/

The Behavior of Algorithms
When A is an algorithm for solving problem P, we let T_A[x] denote the running time of algorithm A on an input instance x. If the input domain Ω has only one input instance x, then we can use the instance-based measures T_A1[x] and T_A2[x] to decide which of the two algorithms A1 and A2 more efficiently solves P. If Ω has two instances x and y, then the instance-based measure of an algorithm A defines a two-dimensional vector (T_A[x], T_A[y]). It could be the case that T_A1[x] < T_A2[x] but T_A1[y] > T_A2[y]. Then, strictly speaking, these two algorithms are not comparable. Usually, the input domain is much more complex, both in theory and in practice. The instance-based complexity measure T_A[·] defines an |Ω|-dimensional vector when Ω is finite. In general, it can be viewed as a function from Ω to the nonnegative reals, but it is unwieldy. To compare two algorithms, we require a more concise complexity measure.

An input domain Ω is usually viewed as the union of a family of subdomains {Ω_1, …, Ω_n, …}, where Ω_n represents all instances in Ω of size n. For example, in sorting, Ω_n is the set of all tuples of n elements; in graph algorithms, Ω_n is the set of all graphs with n vertices; and in computational geometry, we often have Ω_n = ℝ^n. In order to succinctly express the performance of an algorithm A, for each Ω_n one defines a scalar T_A(n) that summarizes the instance-based complexity measure, T_A[·], of A over Ω_n. One often further simplifies this expression by using big-O or big-Θ notation to express T_A(n) asymptotically.

Traditional Analyses. It is understandable that different approaches to summarizing the performance of an algorithm over Ω_n can lead to very different evaluations of that algorithm. In Theoretical Computer Science, the most commonly used measures are the worst-case measure and the average-case measures.

The worst-case measure is defined as

    WC_A(n) = max_{x ∈ Ω_n} T_A[x].

The average-case measures have more parameters. In each average-case measure, one first determines a distribution of inputs and then measures the expected performance of an algorithm assuming inputs are drawn from this distribution. Supposing D provides a distribution over each Ω_n, the average-case measure according to D is

    Ave_A^D(n) = E_{x ←_D Ω_n} [T_A[x]],

where we use x ←_D Ω_n to indicate that x is randomly chosen from Ω_n according to distribution D.

Critique of Traditional Analyses. Low worst-case complexity is the gold standard for an algorithm. When low, the worst-case complexity provides an absolute guarantee on the performance of an algorithm no matter which input it is given. Algorithms with good worst-case performance have been developed for a great number of problems.

However, there are many problems that need to be solved in practice for which we do not know algorithms with good worst-case performance. Instead, scientists and engineers typically use heuristic algorithms to solve these problems. Many of these algorithms work well in practice, in spite of having a poor, sometimes exponential, worst-case running time.
Practitioners justify the use of these heuristics by observing that worst-case instances are usually not "typical" and rarely occur in practice. The worst-case analysis can be too pessimistic. This theory-practice gap is not limited to heuristics with exponential complexity. Many polynomial-time algorithms, such as interior-point methods for linear programming and the conjugate gradient algorithm for solving linear equations, are often much faster than their worst-case bounds would suggest. In addition, heuristics are often used to speed up the practical performance of implementations that are based on algorithms with polynomial worst-case complexity. These heuristics might in fact worsen the worst-case performance, or make the worst-case complexity difficult to analyze.

Average-case analysis was introduced to overcome this difficulty. In average-case analysis, one measures the expected running time of an algorithm on some distribution of inputs. While one would ideally choose the distribution of inputs that occurs in practice, this is difficult as it is rare that one can determine or cleanly express these distributions, and the distributions can vary greatly between one application and another. Instead, average-case analyses have employed distributions with concise mathematical descriptions, such as Gaussian random vectors, uniform {0, 1} vectors, and Erdös–Rényi random graphs.

The drawback of using such distributions is that the inputs actually encountered in practice may bear very little resemblance to the inputs that are likely to be generated by such distributions. For example, one can see what a random image looks like by disconnecting most TV sets from their antennas, at which point they display "static." These random images do not resemble actual television shows. More abstractly, Erdös–Rényi random graph models are often used in average-case analyses of graph algorithms. The Erdös–Rényi distribution G(n, p) produces a random graph by including every possible edge in the graph independently with probability p. While the average degree of a graph chosen from G(n, 6/(n − 1)) is approximately six, such a graph will be very different from the graph of a triangulation of a point set in two dimensions, which will also have average degree approximately six.

In fact, random objects such as random graphs and random matrices have special properties with exponentially high probability, and these special properties might dominate the average-case analysis. Edelman14 writes of random matrices:

    What is a mistake is to psychologically link a random matrix with the intuitive notion of a "typical" matrix or the vague concept of "any old matrix." In contrast, we argue that "random matrices" are very special matrices.

Smoothed Analysis: A Step Toward Modeling Real Data. Because of the intrinsic difficulty in defining practical distributions, we consider an alternative approach to modeling real data. The basic idea is to identify typical properties of practical data, define an input model that captures these properties, and then rigorously analyze the performance of algorithms assuming their inputs have these properties.

Smoothed analysis is a step in this direction. It is motivated by the observation that practical data is often subject to some small degree of random noise. For example,
• In industrial optimization and economic prediction, the input parameters could be obtained by physical measurements, and measurements usually have some low-magnitude uncertainty.
• In the social sciences, data often comes from surveys in which subjects provide integer scores in a small range (say between 1 and 5) and select their scores with some arbitrariness.
• Even in applications where inputs are discrete, there might be randomness in the formation of inputs. For instance, the network structure of the Internet may very well be governed by some "blueprints" of the government and industrial giants, but it is still "perturbed" by the involvements of smaller Internet service providers.

In these examples, the inputs usually are neither completely random nor completely arbitrary. At a high level, each input is generated from a two-stage model: In the first stage, an instance is generated and in the second stage, the instance from the first stage is slightly perturbed. The perturbed instance is the input to the algorithm.

In smoothed analysis, we assume that an input to an algorithm is subject to a slight random perturbation. The smoothed measure of an algorithm on an input instance is its expected performance over the perturbations of that instance. We define the smoothed complexity of an algorithm to be the maximum smoothed measure over input instances.

For concreteness, consider the case Ω_n = ℝ^n, which is a common input domain in computational geometry, scientific computing, and optimization. For these continuous inputs and applications, the family of Gaussian distributions provides a natural model of noise or perturbation.

Recall that a univariate Gaussian distribution with mean 0 and standard deviation σ has density

    (1/(√(2π) σ)) · e^(−x^2/(2σ^2)).

The standard deviation measures the magnitude of the perturbation. A Gaussian random vector of variance σ^2 centered at the origin in Ω_n = ℝ^n is a vector in which each entry is an independent Gaussian random variable of standard deviation σ and mean 0. For a vector x̄ ∈ ℝ^n, a σ-Gaussian perturbation of x̄ is a random vector x = x̄ + g, where g is a Gaussian random vector of variance σ^2. The standard deviation of the perturbation we apply should be related to the norm of the vector it perturbs. For the purposes of this article, we relate the two by restricting the unperturbed inputs to lie in [−1, 1]^n. Other reasonable approaches are taken elsewhere.

Definition 1 (Smoothed Complexity). Suppose A is an algorithm with Ω_n = ℝ^n. Then, the smoothed complexity of A with σ-Gaussian perturbations is given by

    Smoothed_A^σ(n) = max_{x̄ ∈ [−1,1]^n} E_g[T_A(x̄ + g)],
where g is a Gaussian random vector of variance σ^2.

In this definition, the "original" input x̄ is perturbed to obtain the input x̄ + g, which is then fed to the algorithm. For each original input, this measures the expected running time of algorithm A on random perturbations of that input. The maximum out front tells us to measure the smoothed analysis by the expectation under the worst possible original input.

The smoothed complexity of an algorithm measures the performance of the algorithm both in terms of the input size n and in terms of the magnitude σ of the perturbation. By varying σ between zero and infinity, one can use smoothed analysis to interpolate between worst-case and average-case analysis. When σ = 0, one recovers the ordinary worst-case analysis. As σ grows large, the random perturbation g dominates the original x̄, and one obtains an average-case analysis. We are most interested in the situation in which σ is small relative to ‖x̄‖, in which case x̄ + g may be interpreted as a slight perturbation of x̄. The dependence on the magnitude σ is essential, and much of the work in smoothed analysis demonstrates that noise often makes a problem easier to solve.
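Although smoothed complexity is an analytical notion, the smoothed measure at a fixed input can be estimated empirically. The Matlab sketch below averages measured running times over σ-Gaussian perturbations; the function name, the number of trials, and the use of wall-clock time are our choices for illustration.

% Sketch: Monte Carlo estimate of the smoothed measure of an algorithm
% at a single input xbar in [-1,1]^n, following Definition 1.
function t = smoothed_measure(alg, xbar, sigma, trials)
  t = 0;
  for i = 1:trials
    x = xbar + sigma * randn(size(xbar));  % sigma-Gaussian perturbation
    tic; alg(x); t = t + toc / trials;     % average running time
  end
end

For example, smoothed_measure(@sort, 2*rand(1000,1)-1, 0.01, 50) estimates the smoothed running time of sorting at one input; taking the maximum of such estimates over many choices of xbar approximates the smoothed complexity itself.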
Definition 2. A has polynomial smoothed complexity if there exist positive constants n_0, σ_0, c, k_1, and k_2 such that for all n ≥ n_0 and 0 ≤ σ ≤ σ_0,

    Smoothed_A^σ(n) ≤ c · σ^(−k_2) · n^(k_1).    (2)

From Markov's inequality, we know that if an algorithm A has smoothed complexity T(n, σ), then

    max_{x̄ ∈ [−1,1]^n} Pr_g[T_A(x̄ + g) ≤ δ^(−1) · T(n, σ)] ≥ 1 − δ.    (3)

Thus, if A has polynomial smoothed complexity, then for any x̄, with probability at least (1 − δ), A can solve a random perturbation of x̄ in time polynomial in n, 1/σ, and 1/δ. However, the probabilistic upper bound given in (3) does not necessarily imply that the smoothed complexity of A is O(T(n, σ)). Blum and Dunagan7 and subsequently Beier and Vöcking6 introduced a relaxation of polynomial smoothed complexity.

Definition 3. A has probably polynomial smoothed complexity if there exist constants n_0, σ_0, c, and α, such that for all n ≥ n_0 and 0 ≤ σ ≤ σ_0,

    max_{x̄ ∈ [−1,1]^n} E_g[T_A(x̄ + g)^α] ≤ c · σ^(−1) · n.    (4)

They show that some algorithms have probably polynomial smoothed complexity, in spite of the fact that their smoothed complexity according to Definition 1 is unbounded.

Examples of Smoothed Analysis
In this section, we give a few examples of smoothed analysis. We organize them in five categories: mathematical programming, machine learning, numerical analysis, discrete mathematics, and combinatorial optimization. For each example, we will give the definition of the problem, state the worst-case complexity, explain the perturbation model, and state the smoothed complexity under the perturbation model.

Mathematical Programming. The typical problem in mathematical programming is the optimization of an objective function subject to a set of constraints. Because of its importance to economics, management science, industry, and military planning, many optimization algorithms and heuristics have been developed, implemented, and applied to practical problems. Thus, this field provides a great collection of algorithms for smoothed analysis.

Linear Programming. Linear programming is the most fundamental optimization problem. A typical linear program is given in Equation 1. The most commonly used linear programming algorithms are the simplex algorithm12 and the interior-point algorithms.

The simplex algorithm, first developed by Dantzig in 1951,12 is a family of iterative algorithms. Most of them are two-phase algorithms: Phase I determines whether a given linear program is infeasible, unbounded in the objective direction, or feasible with a bounded solution, in which case a vertex v_0 of the feasible region is also computed. Phase II is iterative: in the ith iteration, the algorithm finds a neighboring vertex v_i of v_{i−1} with better objective value or terminates by returning v_{i−1} when no such neighboring vertex exists. The simplex algorithms differ in their pivot rules, which determine which vertex v_i to choose when there are multiple choices. Several pivot rules have been proposed; however, almost all existing pivot rules are known to have exponential worst-case complexity.25

Spielman and Teng36 considered the smoothed complexity of the simplex algorithm with the shadow-vertex pivot rule, developed by Gass and Saaty.18 They used Gaussian perturbations to model noise in the input data and proved that the smoothed complexity of this algorithm is polynomial. Vershynin38 improved their result to obtain a smoothed complexity of

    O(max(n^5 log^2 m, n^9 log^4 n, n^3 σ^(−4))).

See Blum and Dunagan, and Dunagan et al.7,13 for smoothed analyses of other linear programming algorithms such as the interior-point algorithms and the perceptron algorithm.

Quasi-Concave Minimization. Another fundamental optimization problem is quasi-concave minimization. Recall that a function f : ℝ^n → ℝ is quasi-concave if all of its upper level sets L_γ = {x | f(x) ≥ γ} are convex. In quasi-concave minimization, one is asked to find the minimum of a quasi-concave function subject to a set of linear constraints. Even when restricted to concave quadratic functions over the hypercube, concave minimization is NP-hard.

In applications such as stochastic and multiobjective optimization, one often deals with data from low-dimensional subspaces. In other words, one needs to solve a quasi-concave minimization problem with a low-rank quasi-concave function.23 Recall that a function f : ℝ^n → ℝ has rank k if it can be written in the form

    f(x) = g(a_1^T x, a_2^T x, …, a_k^T x),

for a function g : ℝ^k → ℝ and linearly independent vectors a_1, a_2, …, a_k.

Kelner and Nikolova23 proved that, under some mild assumptions on the feasible convex region, if k is a constant


then the smoothed complexity of quasi-concave minimization is polynomial when f is perturbed by noise. Key to their analysis is a smoothed bound on the size of the k-dimensional shadow of the high-dimensional polytope that defines the feasible convex region. Their result is a nontrivial extension of the analysis of two-dimensional shadows of Kelner and Spielman, and Spielman and Teng.24,36

Machine Learning. Machine learning provides many natural problems for smoothed analysis. The field has many heuristics that work in practice, but not in the worst case, and the data defining most machine learning problems is inherently noisy.

K-means. One of the fundamental problems in machine learning is that of k-means clustering: the partitioning of a set of d-dimensional vectors Q = {q_1, …, q_n} into k clusters {Q_1, …, Q_k} so that the intracluster variance

    V = Σ_{i=1}^{k} Σ_{q_j ∈ Q_i} ‖q_j − μ(Q_i)‖^2

is minimized, where μ(Q_i) = (Σ_{q_j ∈ Q_i} q_j)/|Q_i| is the centroid of Q_i.

One of the most widely used clustering algorithms is Lloyd's algorithm (IEEE Transactions on Information Theory, 1982). It first chooses an arbitrary set of k centers and then uses the Voronoi diagram of these centers to partition Q into k clusters. It then repeats the following process until it stabilizes: use the centroids of the current clusters as the new centers, and then repartition Q accordingly.

Two important questions about Lloyd's algorithm are how many iterations it takes to converge, and how close to optimal is the solution it finds? Smoothed analysis has been used to address the first question. Arthur and Vassilvitskii proved that in the worst case, Lloyd's algorithm requires 2^Ω(√n) iterations to converge,3 but that it has smoothed complexity polynomial in n^k and σ^(−1).4 Manthey and Röglin28 recently reduced this bound to polynomial in n^√k and σ^(−1).
Perceptrons, Margins, and Support Vector Machines. Blum and Dunagan's analysis of the perceptron algorithm7 for linear programming implicitly contains results of interest in machine learning. The ordinary perceptron algorithm solves a fundamental problem: given a collection of points x_1, …, x_n ∈ ℝ^d and labels b_1, …, b_n ∈ {±1}, find a hyperplane separating the positively labeled examples from the negatively labeled ones, or determine that no such plane exists. Under a smoothed model in which the points x_1, …, x_n are subject to a σ-Gaussian perturbation, Blum and Dunagan show that the perceptron algorithm has probably polynomial smoothed complexity, with exponent α = 1. Their proof follows from a demonstration that if the positive points can be separated from the negative points, then they can probably be separated by a large margin. That is, there probably exists a plane separating the points for which no point is too close to that separating plane.

It is known that the perceptron algorithm converges quickly in this case. Moreover, this margin is exactly what is maximized by support vector machines.

PAC Learning. Probably approximately correct learning (PAC learning) is a framework in machine learning introduced by Valiant in which a learner is provided with a polynomial number of labeled examples from a given distribution, and must produce a classifier that is usually correct with reasonably high probability. In standard PAC learning, the distribution from which the examples are drawn is fixed. Recently, Kalai and Teng21 applied smoothed analysis to this problem by perturbing the input distribution. They prove that polynomial-sized decision trees are PAC-learnable in polynomial time under perturbed product distributions. In contrast, under the uniform distribution even super-constant size decision trees are not known to be PAC-learnable in polynomial time.

Numerical Analysis. One of the foci of numerical analysis is the determination of how much precision is required by numerical methods. For example, consider the most fundamental problem in computational science—that of solving systems of linear equations. Because of the round-off errors in computation, it is crucial to know how many bits of precision a linear solver should maintain so that its solution is meaningful.

For example, Wilkinson (JACM 1961) demonstrated a family of linear systems^c of n variables and {0, −1, 1} coefficients for which Gaussian elimination with partial pivoting—the most popular variant in practice—requires n bits of precision.

Precision Requirements of Gaussian Elimination. In practice one almost always obtains accurate answers using much less precision. High-precision solvers are rarely used or needed. For example, Matlab uses 64 bits.

Building on the smoothed analysis of condition numbers (discussed below), Sankar et al.33,34 proved that it is sufficient to use O(log^2(n/σ)) bits of precision to run Gaussian elimination with partial pivoting when the matrices of the linear systems are subject to σ-Gaussian perturbations.

The Condition Number. The smoothed analysis of the condition number of a matrix is a key step toward understanding the numerical precision required in practice. For a square matrix A, its condition number κ(A) is given by κ(A) = ‖A‖_2 ‖A^(−1)‖_2, where ‖A‖_2 = max_x ‖Ax‖_2/‖x‖_2. The condition number of A measures how much the solution to a system Ax = b changes as one makes slight changes to A and b: If one solves the linear system using fewer than log(κ(A)) bits of precision, then one is likely to obtain a result far from a solution.

The quantity 1/‖A^(−1)‖_2 = min_x ‖Ax‖_2/‖x‖_2 is known as the smallest singular value of A. Sankar et al.34 proved the following statement: For any square matrix Ā in ℝ^{n×n} satisfying ‖Ā‖_2 ≤ √n, and for any x > 1,

    Pr_A[‖A^(−1)‖_2 ≥ x] ≤ 2.35 · √n/(xσ),

where A is a σ-Gaussian perturbation of Ā. Wschebor40 improved this bound to show that for σ ≤ 1 and ‖Ā‖_2 ≤ 1,

    Pr[κ(A) ≥ x] ≤ O(√n/(xσ)).

See Bürgisser et al. and Dunagan et al.10,13 for smoothed analysis of the condition numbers of other problems.

c See the second line of the Matlab code at the end of the Discussion section for an example.

Discrete Mathematics. For problems in discrete mathematics, it is

more natural to use Boolean perturbations: Let x̄ = (x̄_1, …, x̄_n) ∈ {0, 1}^n or {−1, 1}^n. The σ-Boolean perturbation of x̄ is a random string x = (x_1, …, x_n) ∈ {0, 1}^n or {−1, 1}^n, where x_i = x̄_i with probability 1 − σ and x_i ≠ x̄_i with probability σ. That is, each bit is flipped independently with probability σ.
Believing that σ-perturbations of Boolean matrices should behave like Gaussian perturbations of real matrices, Spielman and Teng35 made the following conjecture: Let A be an n by n matrix of independently and uniformly chosen ±1 entries. Then

    Pr_A[‖A^(−1)‖_2 ≥ x] ≤ √n/x + α^n,

for some constant α. Statements close to this conjecture and its generalizations were recently proved by Vu and Tao39 and Rudelson and Vershynin.32

The σ-Boolean perturbation of a graph can be viewed as a smoothed extension of the classic Erdös–Rényi random graph model. The σ-perturbation of a graph Ḡ, which we denote by G_Ḡ(n, σ), is a distribution of random graphs in which every edge is removed with probability σ and every nonedge is included with probability σ. Clearly for p ∈ [0, 1], G(n, p) = G_∅(n, p), i.e., the p-Boolean perturbation of the empty graph. One can define a smoothed extension of other random graph models. For example, for any m and Ḡ = (V, E), Bohman et al.9 define G(Ḡ, m) to be the distribution of the random graphs (V, E ∪ T), where T is a set of m edges chosen uniformly at random from the complement of E, i.e., chosen from Ē = {(i, j) ∉ E}.

A popular subject of study in the traditional Erdös–Rényi model is the phenomenon of phase transition: for many properties such as being connected or being Hamiltonian, there is a critical p below which a graph is unlikely to have the property and above which it probably does have the property. Related phase transitions have also been found in the smoothed Erdös–Rényi models G_Ḡ(n, σ).17,26

Smoothed analysis based on Boolean perturbations can be applied to other discrete problems. For example, Feige16 used the following smoothed model for 3CNF formulas. First, an adversary picks an arbitrary formula with n variables and m clauses. Then, the formula is perturbed at random by flipping the polarity of each occurrence of each variable independently with probability σ. Feige gave a randomized polynomial-time refutation algorithm for this problem.

Combinatorial Optimization. Beier and Vöcking6 and Röglin and Vöcking31 considered the smoothed complexity of integer linear programming. They studied programs of the form

    max c^T x subject to Ax ≤ b and x ∈ D^n,    (5)

where A is an m × n real matrix, b ∈ ℝ^m, and D ⊂ ℤ.

Recall that ZPP denotes the class of decision problems solvable by a randomized algorithm that always returns the correct answer, and whose expected running time (on every input) is polynomial. Beier, Röglin, and Vöcking6,31 proved the following statement: For any constant c, let Π be a class of integer linear programs of form (5) with |D| = O(n^c). Then, Π has an algorithm of probably polynomial smoothed complexity if and only if Π_u ∈ ZPP, where Π_u is the "unary" representation of Π. Consequently, the 0/1-knapsack problem, the constrained shortest path problem, the constrained minimum spanning tree problem, and the constrained minimum weighted matching problem have probably polynomial smoothed complexity.

Smoothed analysis has been applied to several other optimization problems such as local search and TSP,15 scheduling,5 motion planning,11 superstring approximation,27 multiobjective optimization,31 string alignment,2 and multidimensional packing.22

Discussion
A Theme in Smoothed Analyses. One idea is present in many of these smoothed analyses: perturbed inputs are rarely ill-conditioned. Informally speaking, the condition number of a problem measures how much its answer can be made to change by slight modifications of its input. Many numerical algorithms are known to run faster when their inputs are well conditioned. Some smoothed analyses, such as that of Blum and Dunagan,7 exploit this connection explicitly. Others rely on similar ideas.

For many problems, the condition number of an input is approximately the inverse of the distance of that input to the set of degenerate or ill-posed problems. As these degenerate sets typically have measure zero, it is intuitively reasonable that a perturbed instance should be unlikely to be too close to a degenerate one.

Other Performance Measures. Although we normally evaluate the performance of an algorithm by its running time, other performance parameters are often important. These performance parameters include the amount of space required, the number of bits of precision required to achieve a given output accuracy, the number of cache misses, the error probability of a decision algorithm, the number of random bits needed in a randomized algorithm, the number of calls to a particular subroutine, and the number of examples needed in a learning algorithm. The quality of an approximation algorithm could be its approximation ratio; the quality of an online algorithm could be its competitive ratio; and the parameter of a game could be its price of anarchy or the rate of convergence of its best-response dynamics. We anticipate future results on the smoothed analysis of these performance measures.

Precursors to Smoothed Complexity. Several previous probabilistic models have also combined features of worst-case and average-case analyses.

Haimovich19 and Adler1 considered the following probabilistic analysis: Given a linear program L = (A, b, c) of form (1), they defined the expected complexity of L to be the expected complexity of the simplex algorithm when the inequality sign of each constraint is uniformly flipped. They proved that the expected complexity of the worst possible L is polynomial.

Blum and Spencer8 studied the design of polynomial-time algorithms for the semi-random model, which combines the features of the semi-random source with the random graph


review articles

model that has a “planted solution.” For example, the worst-case com- A simpler way to strengthen
This model can be illustrated with the plexity and the smoothed complexity smoothed analysis is to restrict the
k-Coloring Problem: An adversary of the problem of computing a market family of perturbations considered.
plants a solution by partitioning the set equilibrium are essentially the same.20 For example, one could employ zero-
V of n vertices into k subsets V1,…,Vk. Let So far, no polynomial-time pricing preserving perturbations, which only
algorithm is known for general mar- apply to nonzero entries. Or, one could
F = {(u, v)| u and v are in different kets. On the other hand, pricing seems use relative perturbations, which per-
subsets} to be a practically solvable problem, as turb every real number individually by
Kamal Jain put it “If a Turing machine a small multiplicative factor.
be the set of potential inter-subset can’t compute then an economic sys- Algorithm Design Based on Per-
edges. A graph is then constructed by tem can’t compute either.” turbations and Smoothed Analysis.
the following semi-random process A key step to understanding the Finally, we hope insights gained from
that perturbs the decisions of the behaviors of algorithms in practice is smoothed analysis will lead to new ideas
adversary: In a sequential order, the the construction of analyzable models in algorithm design. On a theoretical
adversary decides whether to include that are able to capture some essential front, Kelner and Spielman24 exploited
each edge of F in the graph, and then aspects of practical input instances. ideas from the smoothed analysis
a random process reverses the deci- For practical inputs, there may often of the simplex method to design a
sion with probability s. Note that every be multiple parameters that govern (weakly) polynomial-time simplex
graph generated by this semi-random the process of their formation. method that functions by systemati-
process has the planted coloring: c(v) = One way to strengthen the cally perturbing its input program. On
i for all v  Vi, as both the adversary and smoothed analysis framework is to a more practical level, we suggest that
the random process preserve this solu- improve the model of the formation it might be possible to solve some
tion by only considering edges from F. of input instances. For example, if problems more efficiently by perturb-
As with the smoothed model, one the input instances to an algorithm A ing their inputs. For example, some
can work with the semi-random model come from the output of another algo- algorithms in computational geometry
by varying s between 0 and 1 to inter- rithm B, then algorithm B, together implement variable-precision arith-
polate between worst-case and aver- with a model of B’s input instances, metic to correctly handle exceptions
age-case complexity for k-coloring. provide a description of A’s inputs. that arise from geometric degener-
Algorithm Design and Analysis For example, in finite-element calcu- acy.29 However, degeneracies and near-
for Special Families of Inputs. lations, the inputs to the linear solver degeneracies occur with exceedingly
Probabilistic approaches are not the A are stiffness matrices that are pro- small probability under perturbations
only means of characterizing practical duced by a meshing algorithm B. The of inputs. To prevent perturbations
inputs. Much work has been spent on meshing algorithm B, which could be from changing answers, one could
designing and analyzing inputs that a randomized algorithm, generates employ quad-precision arithmetic,
satisfy certain deterministic but prac- a stiffness matrix from a geometric placing the perturbations into the
tical input conditions. We mention a domain 7 and a partial differential least-significant half of the digits.
few examples that excite us. equation F. So, the distribution of the Our smoothed analysis of Gaussian
In parallel scientific computing, stiffness matrices input to algorithm elimination suggests a more stable
one may often assume that the input A is determined by the distribution  solver for linear systems: When given
graph is a well-shaped finite element of the geometric domains 7 and the a linear system Ax = b, we first use the
mesh. In VLSI layout, one often only set F of partial differential equations, standard Gaussian elimination with
considers graphs that are planar or and the randomness in algorithm B. partial pivoting algorithm to solve

nearly planar. In geometric model- If, for example, 7 is the design of an Ax = b. Suppose x* is the solution com-
ing, one may assume that there is an advanced rocket from a set * of “blue- puted. If ||b – Ax*|| is small enough,
upper bound on the ratio among the prints” and F is from a set  of PDEs then we simply return x*. Otherwise,
distances between points. In Web describing physical parameters such we can determine a parameter e and
analysis, one may assume that the as pressure, speed, and temperature, generate a new linear system (A + e G)y
input graph satisfies some powerlaw and 7 is generated by a perturbation = b, where G is a Gaussian matrix with
degree distribution or some small- model ( of the blueprints, then one mean 0 and variance 1. Instead of solv-
world properties. When analyzing may further measure the performance ing Ax = b, we solve a perturbed linear
hash functions, one may assume that of A by the smoothed value: system (A + e G)y = b. It follows from
the data being hashed has some non- standard analysis that if e is sufficiently
PHOTOGRA PH BY H ENNING M ÜHLINGH AU S

negligible entropy.30 max E E [Q(A, X)] , smaller than k (A), then the solution to

Limits of Smoothed Analysis. The F, 7 * 7 k((7) X k B7, F the perturbed linear system is a good
goal of smoothed analysis is to explain approximation to the original one.
why some algorithms have much bet- One could use practical experience or

ter performance in practice than pre- where 7 k ( (7) indicates that 7 is binary search to set e.

dicted by the traditional worst-case obtained from a perturbation of 7 and The new algorithm has the property
analysis. However, for many problems, X k B(7, F) indicates that X is the out- that its success depends only on the
there may be better explanations. put of the randomized algorithm B. machine precision and the condition

O C TO BE R 2 0 0 9 | VO L. 52 | NO. 1 0 | C OM M U N IC AT ION S OF T HE ACM 83
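As a sketch of how ε might be set automatically, the following Matlab fragment simply tries successively smaller perturbations until the residual against the original system is acceptable; the tolerance, the search range, and the use of a simple scan in place of a true binary search are all our own simplifications.

% Sketch: perturbation-based solve with automatic choice of epsilon.
function y = perturbed_solve(A, b, tol)
  n = size(A,1);
  for e = 10.^(-4:-1:-14)         % try successively smaller epsilons
    y = (A + e*randn(n)) \ b;     % Gaussian perturbation of A
    if norm(A*y - b) <= tol
      return;                     % residual of the ORIGINAL system is small
    end
  end
end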


The new algorithm has the property that its success depends only on the machine precision and the condition number of A, while the original algorithm may fail due to large growth factors. For example, the following is a fragment of Matlab code that first solves a linear system whose matrix is the 70 × 70 matrix Wilkinson designed to trip up partial pivoting, using the Matlab linear solver. We then perturb the system, and apply the Matlab solver again.

>> % Using the Matlab solver
>> n = 70; A = 2*eye(n) - tril(ones(n)); A(:,n) = 1;
>> b = randn(70,1); x = A\b;
>> norm(A*x-b)
2.762797463910437e+004
>> % FAILED because of large growth factor
>> % Using the new solver
>> Ap = A + randn(n)/10^9; y = Ap\b;
>> norm(Ap*y-b)
6.343500222435404e-015
>> norm(A*y-b)
4.434147778553908e-008

Note that while the Matlab linear solver fails to find a good solution to the linear system, our new perturbation-based algorithm finds a good solution. While there are standard algorithms for solving linear equations that do not have the poor worst-case performance of partial pivoting, they are rarely used as they are less efficient.

For more examples of algorithm design inspired by smoothed analysis and perturbation theory, see Teng.37

Acknowledgments
We would like to thank Alan Edelman for suggesting the name "Smoothed Analysis" and thank Heiko Röglin and Don Knuth for helpful comments on

References
2. Andoni, A. and Krauthgamer, R. The smoothed complexity of edit distance. In Automata, Languages and Programming (ICALP), volume 5125 of Lecture Notes in Computer Science (Springer, 2008), 357–369.
3. Arthur, D. and Vassilvitskii, S. How slow is the k-means method? In SOCG '06, the 22nd Annual ACM Symposium on Computational Geometry (2006), 144–153.
4. Arthur, D. and Vassilvitskii, S. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In FOCS '06, the 47th Annual IEEE Symposium on Foundations of Computer Science (2006), 153–164.
5. Becchetti, L., Leonardi, S., Marchetti-Spaccamela, A., Schäfer, G., and Vredeveld, T. Average-case and smoothed competitive analysis of the multilevel feedback algorithm. Math. Oper. Res. 31, 1 (2006), 85–108.
6. Beier, R. and Vöcking, B. Typical properties of winners and losers in discrete optimization. In STOC '04, the 36th Annual ACM Symposium on Theory of Computing (2004), 343–352.
7. Blum, A. and Dunagan, J. Smoothed analysis of the perceptron algorithm for linear programming. In SODA '02 (2002), 905–914.
8. Blum, A. and Spencer, J. Coloring random and semi-random k-colorable graphs. J. Algorithms 19, 2 (1995), 204–234.
9. Bohman, T., Frieze, A., and Martin, R. How many random edges make a dense graph Hamiltonian? Random Struct. Algorithms 22, 1 (2003), 33–42.
10. Bürgisser, P., Cucker, F., and Lotz, M. Smoothed analysis of complex conic condition numbers. J. de Mathématiques Pures et Appliquées 86, 4 (2006), 293–309.
11. Damerow, V., Meyer auf der Heide, F., Räcke, H., Scheideler, C., and Sohler, C. Smoothed motion complexity. In Proceedings of the 11th Annual European Symposium on Algorithms (2003), 161–171.
12. Dantzig, G.B. Maximization of linear function of variables subject to linear inequalities. Activity Analysis of Production and Allocation. T.C. Koopmans, ed. 1951, 339–347.
13. Dunagan, J., Spielman, D.A., and Teng, S.-H. Smoothed analysis of condition numbers and complexity implications for linear programming. Mathematical Programming, Series A, 2009. To appear. Preliminary version available at http://arxiv.org/abs/cs/0302011v2.
14. Edelman, A. Eigenvalue roulette and random test matrices. Linear Algebra for Large Scale and Real-Time Applications. Marc S. Moonen, Gene H. Golub, and Bart L.R. De Moor, eds. NATO ASI Series, 1992, 365–368.
15. Englert, M., Röglin, H., and Vöcking, B. Worst case and probabilistic analysis of the 2-opt algorithm for the TSP: extended abstract. In SODA '07, the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (2007), 1295–1304.
16. Feige, U. Refuting smoothed 3CNF formulas. In the 48th Annual IEEE Symposium on Foundations of Computer Science (2007), 407–417.
17. Flaxman, A. and Frieze, A.M. The diameter of randomly perturbed digraphs and some applications. In APPROX-RANDOM (2004), 345–356.
18. Gass, S. and Saaty, T. The computational algorithm for the parametric objective function. Naval Res. Logist. Quart. 2 (1955), 39–45.
19. Haimovich, M. The simplex algorithm is very good!: On the expected number of pivot steps and related properties of random linear programs. Technical report, Columbia University (April 1983).
20. Huang, L.-S. and Teng, S.-H. On the approximation and smoothed complexity of Leontief market equilibria. In Frontiers in Algorithmics (2007).
25. Klee, V. and Minty, G.J. How good is the simplex algorithm? Inequalities – III. O. Shisha, ed. Academic Press, 1972, 159–175.
26. Krivelevich, M., Sudakov, B., and Tetali, P. On smoothed analysis in dense graphs and formulas. Random Struct. Algorithms 29 (2005), 180–193.
27. Ma, B. Why greed works for shortest common superstring problem. In CPM '08: Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (2008), 244–254.
28. Manthey, B. and Röglin, H. Improved smoothed analysis of the k-means method. In SODA '09 (2009).
29. Mehlhorn, K. and Näher, S. The LEDA Platform of Combinatorial and Geometric Computing. Cambridge University Press, New York, 1999.
30. Mitzenmacher, M. and Vadhan, S. Why simple hash functions work: exploiting the entropy in a data stream. In SODA '08: Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (2008), 746–755.
31. Röglin, H. and Vöcking, B. Smoothed analysis of integer programming. Proceedings of the 11th International Conference on Integer Programming and Combinatorial Optimization. M. Jünger and V. Kaibel, eds. Volume 3509 of Lecture Notes in Computer Science, Springer, 2005, 276–290.
32. Rudelson, M. and Vershynin, R. The Littlewood–Offord problem and invertibility of random matrices. Adv. Math. 218 (June 2008), 600–633.
33. Sankar, A. Smoothed analysis of Gaussian elimination. Ph.D. thesis, MIT, 2004.
34. Sankar, A., Spielman, D.A., and Teng, S.-H. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. Matrix Anal. Appl. 28, 2 (2006), 446–476.
35. Spielman, D.A. and Teng, S.-H. Smoothed analysis of algorithms. In Proceedings of the International Congress of Mathematicians (2002), 597–606.
36. Spielman, D.A. and Teng, S.-H. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM 51, 3 (2004), 385–463.
37. Teng, S.-H. Algorithm design and analysis with perturbations. In Fourth International Congress of Chinese Mathematicians (2007).
38. Vershynin, R. Beyond Hirsch conjecture: Walks on random polytopes and smoothed complexity of the simplex method. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (2006), 133–142.
39. Vu, V.H. and Tao, T. The condition number of a randomly perturbed matrix. In STOC '07: the 39th Annual ACM Symposium on Theory of Computing (2007), 248–255.
40. Wschebor, M. Smoothed analysis of κ(A). J. Complexity 20, 1 (February 2004), 97–107.

Daniel A. Spielman ([email protected]) is a professor of Applied Mathematics and Computer Science at Yale University, New Haven, CT.

Shang-Hua Teng ([email protected]) is a professor in the Department of Computer Science at Boston University and a senior research scientist at Akamai Technologies, Inc.
and smoothed complexity of Leontief market
this writing. equilibria. In Frontiers of Algorithms Workshop
Due to Communications space con- (2007), 96–107.
21. Kalai, A. and Teng, S.-H. Decision trees are PAC-
strainsts, we have had to restrict our learnable from most product distributions: a
bibliography to 40 references. We smoothed analysis. ArXiv e-prints (December 2008).
22. Karger, D. and Onak, K. Polynomial approximation
apologize to those whose work we have schemes for smoothed and random instances of
been forced not to reference. multidimensional packing problems. In SODA ’07:
the 18th Annual ACM-SIAM Symposium on Discrete
Algorithms (2007), 1207–1216.
23. Kelner, J.A. and Nikolova, E. On the hardness and
References smoothed complexity of quasi-concave minimization.
1. Adler, I. The expected number of pivots needed In The 48th Annual IEEE Symposium on Foundations
to solve parametric linear programs and the of Computer Science (2007), 472–482.
efficiency of the self-dual simplex method. 24. Kelner, J.A. and Spielman, D.A. A randomized
Technical report, University of California at Berkeley polynomial-time simplex algorithm for linear
(May 1983). programming. In The 38th Annual ACM Symposium
2. Andoni, A. and Krauthgamer, R. The smoothed on Theory of Computing (2006), 51–60.
complexity of edit distance. In Proceedings of ICALP, 25. Klee, V. and Minty, G.J. How good is the simplex © ACM 0001-0782/09/1000 $10.00.



research highlights

P. 86 Technical Perspective: Relational Query Optimization—Data Management Meets Statistical Estimation
By Surajit Chaudhuri

P. 87 Distinct-Value Synopses for Multiset Operations
By Kevin Beyer, Rainer Gemulla, Peter J. Haas, Berthold Reinwald, and Yannis Sismanis

P. 96 Technical Perspective: Data Stream Processing—When You Only Get One Look
By Johannes Gehrke

P. 97 Finding the Frequent Items in Streams of Data
By Graham Cormode and Marios Hadjieleftheriou


research highlights
DOI:10.1145/1562764.1562786

Technical Perspective
Relational Query Optimization—Data Management Meets Statistical Estimation
By Surajit Chaudhuri

RELATIONAL SYSTEMS HAVE made it possible to query large collections of data in a declarative style through languages such as SQL. The queries are translated into expressions consisting of relational operations but do not refer to any implementation details. There is a key component that is needed to support this declarative style of programming, and that is the query optimizer. The optimizer takes the query expression as input and determines how best to execute that query. This amounts to a combinatorial optimization over a complex search space: finding a low-cost execution plan among all plans that are equivalent to the given query expression (considering possible orderings of operators, alternative implementations of logical operators, and different uses of physical structures such as indexes). The success that relational databases enjoy today in supporting complex decision-support queries would not have been a reality without innovation in query optimization technology.

In trying to identify a good execution plan, the query optimizer must be aware of statistical properties of the data over which the query is defined, because these statistical properties strongly influence the cost of executing the query. Examples of such statistical properties are the total number of rows in the relation, the distribution of values of attributes of a relation, and the number of distinct values of an attribute. Because the optimizer needs to search among many alternative execution plans for the given query and tries to pick one with low cost, it needs such statistical estimation not only for the input relations, but also for the many sub-expressions that it considers as part of its combinatorial search. Indeed, statistical properties of the sub-expressions guide the exploration of alternatives considered by the query optimizer. Since access to a large data set can be costly, it is not feasible to determine statistical properties of these sub-expressions by executing each of them. It is also important to be able to maintain these statistics efficiently in the face of data updates. These requirements demand a judicious trade-off between the quality of estimation and the overhead of doing this estimation.

The early commercial relational systems used estimation techniques based on summary structures such as simple histograms on attributes of the relation. Each bucket of the histogram represented a range of values. The histogram captured the total number of rows and the number of distinct values for each bucket. For larger query expressions, histograms were derived from histograms on their sub-expressions in an ad hoc manner. Since the mid-1990s, the unique challenges of statistical estimation in the context of query optimization have attracted many researchers with backgrounds and interests in algorithms and statistics. We saw the development of principled approaches to these statistical estimation problems that leveraged randomized algorithms such as probabilistic counting. In keeping with the long-standing tradition of close relations between database research and the database industry, some of these solutions have been adopted in commercial database products.
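To make the bucket idea concrete, here is a small, purely illustrative Python sketch, not drawn from any particular commercial system, of how such a histogram supports row-count estimates for a range predicate; the Bucket layout and the assumption that values are spread uniformly within a bucket are assumptions of the sketch.

from dataclasses import dataclass

@dataclass
class Bucket:
    lo: float      # inclusive lower bound of the bucket's value range
    hi: float      # exclusive upper bound
    rows: int      # total number of rows falling in the bucket
    distinct: int  # number of distinct values in the bucket

def estimate_rows(hist, lo, hi):
    """Estimate the number of rows with lo <= value < hi, assuming
    values are uniformly spread within each bucket."""
    total = 0.0
    for b in hist:
        overlap = max(0.0, min(hi, b.hi) - max(lo, b.lo))
        width = b.hi - b.lo
        if width > 0:
            total += b.rows * (overlap / width)
    return total

hist = [Bucket(0, 100, 5000, 90), Bucket(100, 200, 1000, 40)]
print(estimate_rows(hist, 50, 150))  # 3000.0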
The following paper by Beyer et al. showcases recent progress in statistical estimation in the context of query optimization. It revisits the difficult problem of efficient estimation of the number of distinct values in an attribute and makes a number of contributions by building upon past work that leverages randomized algorithms. It suggests an unbiased estimator for distinct values, based on a single scan of the data, that has a lower mean squared error than previously proposed estimators. The authors propose a summary structure (synopsis) for a relation such that the number of distinct values in a query using multiset union, multiset intersection, and multiset difference operations over a set of relations can be estimated from the synopses of the base relations. Furthermore, if only one of the many partitions of a relation is updated, only the synopsis for that partition needs to be rebuilt in order to derive distinct-value estimates for the entire relation.

It has been 30 years since the framework of query optimization was defined by System R, and relational query optimization has been a great commercial success story. Yet statistical estimation problems for query expressions remain one area where significant advances are needed to take the next big leap in the state of the art for query optimization. Recently, researchers have been trying to understand how additional knowledge of the statistical properties of data and queries can best be gleaned from past executions to enhance the core statistical estimation abilities. Although I have highlighted query optimization, such statistical estimation techniques also have potential applications in other areas, such as data profiling and approximate query processing. I invite you to read the following paper to sample a subfield that lies at the intersection of database management systems, statistics, and algorithms.

Surajit Chaudhuri ([email protected]) is a principal researcher and research area manager at Microsoft. He is an ACM Fellow and the recipient of the 2004 ACM SIGMOD Contributions Award.

© 2009 ACM 0001-0782/09/1000 $10.00



DOI:10.1145/1562764.1562787

Distinct-Value Synopses for Multiset Operations
By Kevin Beyer, Rainer Gemulla, Peter J. Haas, Berthold Reinwald, and Yannis Sismanis

A previous version of this research paper was published in Proceedings of the 2007 ACM SIGMOD Conference.

Abstract
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.

1. INTRODUCTION
The task of determining the number of distinct values (DVs) in a large dataset arises in a wide variety of settings. One classical application is population biology, where the goal is to determine the number of distinct species, based on observations of many individual animals. In computer science, applications include network monitoring, document search, predicate-selectivity estimation for database query optimization, storage-size estimation for physical database design, and discovery of metadata features such as keys and duplicates.

The number of DVs can be computed exactly by sorting the dataset and then executing a straightforward scan-and-count pass over the data; alternatively, a hash table can be constructed and used to compute the number of DVs. Neither of these approaches scales well to the massive datasets often encountered in practice, because of heavy time and memory requirements. A great deal of research over the past 25 years has therefore focused on scalable approximate methods. These methods work either by drawing a random sample of the data items and statistically extrapolating the number of DVs, or by taking a single pass through the data and using hashing techniques to compute an estimate using a small, bounded amount of memory.

Almost all of this work has focused on producing a given synopsis of the dataset, such as a random sample or bit vector, and then using the synopsis to obtain a DV estimate. Issues related to combining and exploiting synopses in the presence of union, intersection, and difference operations on multiple datasets have been largely unexplored, as has the problem of handling deletions of items from the dataset. Such issues are the focus of this paper, which is about DV estimation methods when the dataset of interest is split into disjoint partitions, i.e., disjoint multisets.(a) The idea is to create a synopsis for each partition so that (i) the synopsis can be used to estimate the number of DVs in the partition and (ii) the synopses can be combined to create synopses for "compound" partitions that are created from the base partitions using multiset union, intersection, or difference operations.

(a) Recall that a multiset, also called a bag, is an unordered collection of values, where a given value may appear multiple times, for example, {3,3,2,3,7,3,2}. Multiset union, intersection, and difference are defined in a natural way: if nA(v) and nB(v) denote the multiplicities of value v in multisets A and B, respectively, then the multiplicities of v in A ∪ B, A ∩ B, and A \ B are given respectively by nA(v) + nB(v), min(nA(v), nB(v)), and max(nA(v) − nB(v), 0).
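As a concrete illustration of footnote (a), these three multiset operations are realized directly by Python's collections.Counter; the fragment below is purely illustrative and not part of the paper's algorithms.

from collections import Counter

# Multisets represented as value -> multiplicity maps.
A = Counter([3, 3, 2, 3, 7, 3, 2])   # the example multiset from footnote (a)
B = Counter([3, 2, 5, 5])

print(A + B)   # multiset union:        n_A(v) + n_B(v)
print(A & B)   # multiset intersection: min(n_A(v), n_B(v))
print(A - B)   # multiset difference:   max(n_A(v) - n_B(v), 0)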
This approach permits parallel processing, and hence scalability of the DV-estimation task to massive datasets, as well as flexible processing of DV-estimation queries and graceful handling of fluctuating data-arrival rates. The partitioning approach can also be used to support automated data integration by discovering relationships between partitions. For example, suppose that the data is partitioned by its source: Amazon customers versus YouTube downloaders. Then DV estimates can be used to help discover subset-inclusion and functional-dependency relationships, as well as to approximate the Jaccard distance or other similarity metrics between the domains of two partitions; see Brown and Haas4 and Dasu et al.6

Our goal is therefore to provide "partition-aware" synopses for DV estimation, as well as corresponding DV estimators that exploit these synopses. We also strive to maintain the best possible accuracy in our DV estimates, especially when the size of the synopsis is small: as discussed in the sequel, the size of the synopsis for a compound partition is limited by the size of the smallest input synopsis.

We bring together a variety of ideas from the literature to obtain a solution to our problem, resulting in best-of-breed DV estimation methods that can also handle multiset operations and deletions.


We first consider, in Section 2, a simple "KMV" (K Minimum hash Values) synopsis2 for a single partition—obtained by hashing the DVs in the partition and recording the K smallest hash values—along with a "basic" DV estimator based on the synopsis; see Equation 1. In Section 3, we briefly review prior work and show that virtually all prior DV estimators can be viewed as versions of, or approximations to, the basic estimator. In Section 4, we propose a new DV estimator—see Equation 2—that improves upon the basic estimator. The new estimator also uses the KMV synopsis and is a deceptively simple modification of the basic estimator. Under a probabilistic model of hashing, we show that the new estimator is unbiased and has lower mean-squared error than the basic estimator. Moreover, when there are many DVs and the synopsis size is large, we show that the new unbiased estimator has essentially the minimal possible variance of any DV estimator. To help users assess the precision of specific DV estimates produced by the unbiased estimator, we provide probabilistic error bounds. We also show how to determine appropriate synopsis sizes for achieving a desired error level.

In Section 5, we augment the KMV synopsis with counters—in the spirit of Ganguly et al.10 and Shukla et al.18—to obtain an "AKMV synopsis." We then provide methods for combining AKMV synopses such that the collection of these synopses is "closed" under multiset operations on the parent partitions. The AKMV synopsis can also handle deletions of individual partition elements. We also show how to extend our simple unbiased estimator to exploit the AKMV synopsis and provide unbiased estimates in the presence of multiset operations, obtaining an unbiased estimate of Jaccard distance in the process; see Equations 7 and 8. Section 6 concerns some recent complements to, and extensions of, our original results in Beyer et al.3

2. A BASIC ESTIMATOR AND SYNOPSIS
The idea behind virtually all DV estimators can be viewed as follows. Each of the D DVs in a dataset is mapped to a random location on the unit interval, and we look at the position U(k) of the kth point from the left, for some fixed value of k; see Figure 1. The larger the value of D, i.e., the greater the number of points on the unit interval, the smaller the value of U(k). Thus D can plausibly be estimated by a decreasing function of U(k).

Figure 1. 50 random points on the unit interval (D = 50, k = 20). [The original figure plots the density fk,D of U(k), marking E[U(k)] and the probability P{U(k) ≤ x}.]

Specifically, if D >> 1 points are placed randomly and uniformly on the unit interval, then, by symmetry, the expected distance between any two neighboring points is 1/(D + 1) ≈ 1/D, so that the expected value of U(k), the kth smallest point, is E[U(k)] ≈ Σj=1..k (1/D) = k/D. Thus D ≈ k/E[U(k)]. The simplest estimator of E[U(k)] is simply U(k) itself—the so-called "method-of-moments" estimator—and yields the basic estimator

D̂kBE = k/U(k)    (1)

The above 1-to-1 mapping from the D DVs to a set of D uniform random numbers can be constructed perfectly using O(D log D) memory, but this memory requirement is clearly infeasible for very large datasets. Fortunately, a hash function—which typically only requires an amount of memory logarithmic in D—often "looks like" a uniform random number generator. In particular, let 𝒟(A) = {v1, v2, …, vD} be the domain of multiset A, i.e., the set of DVs in A, and let h be a hash function from 𝒟(A) to {0, 1, …, M}, where M is a large positive integer. For many hash functions, the sequence h(v1), h(v2), …, h(vD) looks like the realization of a sequence of independent and identically distributed (i.i.d.) samples from the discrete uniform distribution on {0, 1, …, M}. Provided that M is sufficiently greater than D, the sequence U1 = h(v1)/M, U2 = h(v2)/M, …, UD = h(vD)/M will approximate the realization of a sequence of i.i.d. samples from the continuous uniform distribution on [0, 1]. This assertion requires that M be much larger than D to avoid collisions, i.e., to ensure that, with high probability, h(vi) ≠ h(vj) for all i ≠ j. A "birthday problem" argument16, p. 45 shows that collisions will be avoided when M = Θ(D²). We assume henceforth that, for all practical purposes, any hash function that arises in our discussion avoids collisions. We use the term "looks like" in an empirical sense, which suffices for applications. Thus, in practice, the estimator D̂kBE can be applied with U(k) taken as the kth smallest hash value (normalized by a factor of 1/M). In general, E[1/X] ≥ 1/E[X] for a non-negative random variable X,17, p. 351 and hence

E[D̂kBE] = E[k/U(k)] ≥ k/E[U(k)] ≈ D,

i.e., the estimator D̂kBE is biased upwards for each possible value of D, so that it overestimates D on average. Indeed, it follows from the results in Section 4.1 that E[D̂kBE] = ∞ for k = 1. In Section 4, we provide an unbiased estimator that also has lower mean-squared error than D̂kBE.

Note that, in a certain sense, the foregoing view of hash functions—as algorithms that effectively place points on the unit interval according to a uniform distribution—represents a worst-case scenario with respect to the basic estimator. To the extent that a hash function spreads points evenly on [0, 1], i.e., without the clumping that is a byproduct of randomness, the estimator D̂kBE will yield more accurate estimates. We have observed this phenomenon experimentally.3
Algorithm 1 (KMV Computation).
1: h: hash function from domain of dataset to {0, 1, …, M}
2: L: list of k smallest hash values seen so far
3: maxVal(L): returns the largest value in L
4:
5: for each item x in the dataset do
6:   v = h(x)
7:   if v ∉ L then
8:     if |L| < k then
9:       insert v into L
10:    else if v < maxVal(L) then
11:      insert v into L
12:      remove largest element of L
13:    end if
14:  end if
15: end for
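As a concrete companion to Algorithm 1, here is a minimal self-contained Python sketch (illustrative only, not the implementation used in our experiments); the use of SHA-1 as the hash function h and the choice M = 2^64 − 1 are assumptions of the sketch.

import hashlib
import heapq

M = 2**64 - 1  # hash range {0, ..., M}; recall we want M = Theta(D^2) or larger

def h(x):
    # Illustrative stand-in for the hash function h of Algorithm 1.
    return int.from_bytes(hashlib.sha1(repr(x).encode()).digest()[:8], "big")

def kmv_synopsis(items, k):
    # Algorithm 1: retain the k smallest distinct hash values.
    in_heap = set()  # temporary hash table backing the membership check (line 7)
    heap = []        # max-heap via negation; holds the k smallest hashes (L)
    for x in items:
        v = h(x)
        if v in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            in_heap.add(v)
        elif v < -heap[0]:                        # v < maxVal(L)
            evicted = -heapq.heappushpop(heap, -v)
            in_heap.discard(evicted)
            in_heap.add(v)
    return sorted(-u for u in heap)

def basic_estimate(synopsis):
    # Equation (1): D-hat = k / U_(k), with U_(k) the largest retained
    # hash value, normalized by 1/M.
    return len(synopsis) / (synopsis[-1] / M)

data = (i % 10_000 for i in range(100_000))       # 10,000 distinct values
print(basic_estimate(kmv_synopsis(data, k=256)))  # roughly 10,000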



The foregoing discussion of the basic estimator immediately implies a choice of synopsis for a partition A. Using a hash function as above, hash all of the DVs in A and then record the k smallest hash values. We call this synopsis a KMV synopsis (for k minimum values). The KMV synopsis can be viewed as originating in Bar-Yossef et al.,2 but there is no discussion in Bar-Yossef et al.2 about implementing, constructing, or combining such synopses.

As discussed previously, we need to have M = Θ(D²) to avoid collisions. Thus each of the k hash values requires O(log M) = O(log D) bits of storage, and the required size of the KMV synopsis is O(k log D).

A KMV synopsis can be computed from a single scan of the data partition, using Algorithm 1. The algorithm uses a sorted list of k hash values, which can be implemented using, e.g., a priority queue. The membership check in line 7 avoids unnecessary processing of duplicate values in the input data partition, and can be implemented using a temporary hash table that is discarded after the synopsis has been built.

Assuming that the scan order of the items in a partition is independent of the items' hash values, we obtain the following result.

Theorem 1. The expected cost to construct a KMV synopsis of size k from a partition A comprising N data items having D distinct values is O(N + k log k log D).

Proof. The hashing step and membership check in lines 6 and 7 incur a cost of O(1) for each of the N items in A, for a total cost of O(N). To compute the expected cost of executing the remaining steps of the algorithm, observe that the first k DVs encountered are inserted into the priority queue (line 9), and each such insertion has a cost of at most O(log k), for an overall cost of O(k log k). Each subsequent new DV encountered will incur an O(log k) cost if it is inserted (line 11), or an O(1) cost otherwise. (Note that a given DV will be inserted at most once, at the time it is first encountered, regardless of the number of times that it appears in A.) The ith new DV encountered is inserted only if its normalized hash value Ui is less than Mi, the largest normalized hash value currently in the synopsis. Because points are placed uniformly, the conditional probability of this event, given the value of Mi, is P{Ui < Mi | Mi} = Mi. By the law of total expectation, P{Ui < Mi} = E[P{Ui < Mi | Mi}] = E[Mi] = k/i. Thus the expected cost for handling the remaining D − k DVs is

E[Cost] = Σi=k+1..D [(k/i)·O(log k) + (1 − k/i)·O(1)]
        ≤ Σi=1..D (k/i)·O(log k) + O(D)
        = O(D) + O(k log k)·Σi=1..D (1/i) = O(D + k log k log D),

since Σi=1..D (1/i) = O(log D). The overall expected cost is thus O(N + D + k log k log D) = O(N + k log k log D). □

We show in Section 5 that adding counters to the KMV synopsis has a negligible effect on the construction cost, and results in a desirable "closure" property that permits efficient DV estimation under multiset operations.

3. PRIOR WORK
We now give a unified view of prior synopses and DV estimators, and discuss prior methods for handling compound partitions.

3.1. Synopses for DV estimation
In general, the literature on DV estimation does not discuss synopses explicitly, and hence does not discuss issues related to combining synopses in the presence of set operations on the corresponding partitions. We can, however, infer potential candidate synopses from the various algorithm descriptions. The literature on DV estimation is enormous, so we content ourselves with giving highlights and pointers to further references; for some helpful literature reviews, see Beyer et al.,3 Gemulla,11 Gibbons,13 and Haas and Stokes.14

Random Samples: Historically, the problem of DV estimation was first considered for the case in which the synopsis comprises a random sample of the data items. Applications included estimation of species diversity (as discussed in the introduction), determining the number of distinct Roman coins based on the sample of coins that have survived, estimating the size of Shakespeare's vocabulary based on his extant writings, estimating the number of distinct individuals in a set of merged census lists, and so forth; see Haas and Stokes14 and references therein. The key drawback is that DV estimates computed from such a synopsis can be very inaccurate, especially when the dataset is dominated by a few highly frequent values, or when there are many DVs, each having a low frequency (but not all unique). With high probability, a sample of size k will have only one or two DVs in the former case—leading to severe underestimates—and k DVs in the latter case—leading to severe overestimates. For this reason, especially in the computer science literature, the emphasis has been on algorithms that take a complete pass through the dataset, but use limited memory. When the dataset is massive, our results are of particular interest, since we can parallelize the DV-estimate computation.

Note that if we modify the KMV synopsis to record not the hash of a value, but the value itself, then we are in effect maintaining a uniform sample of the DVs in the dataset, thereby avoiding the problems mentioned above. See Gemulla11 for a thorough discussion of "distinct-value sampling" and its relationship to the DV-estimation problem.




Bit-Vector Synopses: The oldest class of synopses based on single-pass, limited-memory processing comprises various types of bit vectors. The "linear counting" technique1, 21 hashes each DV to a position in a bit vector V of length M = O(D), and uses the number of 1-bits to estimate the DV count. Its O(D) storage requirement is typically unacceptable for modern datasets.

The "logarithmic counting" method of Flajolet and Martin1, 9 uses a bit vector of length L = O(log D). The idea is to hash each of the DVs in A to the set {0, 1}^L of binary strings of length L, and look for patterns of the form 0^j 1 in the leftmost bits of the hash values. For a given value of j, the probability of such a pattern is 1/2^(j+1), so the expected number of such patterns observed after D DVs have been hashed is D/2^(j+1). Assuming that the longest observed pattern, say with j* leading 0's, is expected to occur once, we set D/2^(j*+1) = 1, so that D = 2^(j*+1); see Figure 2, which has been adapted from Astrahan et al.1 The value of j* is determined approximately, by taking each hash value, transforming the value by zeroing out all but the leftmost 1, and computing the logarithmic-counting synopsis as the bitwise-OR of the transformed values. Let r denote the position (counting from the left, starting at 0) of the leftmost 0 bit in the synopsis. Then r is an upper bound for j*, and typically a lower bound for j* + 1, leading to a crude (under)estimate of 2^r. For example, if r = 2, so that the leftmost bits of the synopsis are 110 (as in Figure 2), we know that the pattern 001 did not occur in any of the hash values, so that j* < 2.

Figure 2. Logarithmic counting.
Color | Hash value | Bit string with only the leftmost "1"
Red   | 111        | 100
Green | 101        | 100
Red   | 111        | 100
Blue  | 011        | 010
Pink  | 010        | 010
Bitwise OR of the transformed strings (bit positions 0, 1, 2): 110
j* = 1; r = position of leftmost "0" = 2; 2^r = 4

The actual DV estimate is obtained by multiplying 2^r by a factor that corrects for the downward bias, as well as for hash collisions. In the complete algorithm, several independent values of r are, in effect, averaged together (using a technique called "stochastic averaging") and then exponentiated. Subsequent work by Durand and Flajolet8 improves on the storage requirement of the logarithmic counting algorithm by tracking and maintaining j* directly. The number of bits needed to encode j* is O(log log D), and hence the technique is called LogLog counting.
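To make the preceding description concrete, here is a toy Python sketch of logarithmic counting under our own simplifications (a single bit vector, no stochastic averaging, and the crude, uncorrected estimate 2^r); the hash function is again an illustrative stand-in.

import hashlib

L = 32  # length of the hash bit strings

def h(x):
    return int.from_bytes(hashlib.sha1(repr(x).encode()).digest()[:4], "big")

def log_counting_synopsis(items):
    # Bitwise-OR of the hash values after zeroing all but the leftmost 1.
    synopsis = 0
    for x in items:
        v = h(x)
        if v > 0:
            synopsis |= 1 << (v.bit_length() - 1)  # keep only the leading 1
    return synopsis

def crude_estimate(synopsis):
    # r = position (from the left, starting at 0) of the leftmost 0 bit.
    r = 0
    while r < L and synopsis & (1 << (L - 1 - r)):
        r += 1
    return 2**r

print(crude_estimate(log_counting_synopsis(i % 5000 for i in range(50_000))))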
The main drawback of the above bit-vector data structures, when used as synopses in our partitioned-data setting, is that union is the only supported set operation. One must, e.g., resort to the inclusion/exclusion formula to handle set intersections. As the number of set operations increases, this approach becomes extremely cumbersome, expensive, and inaccurate.

Several authors10, 18 have proposed replacing each bit in the logarithmic-counting bit vector by an exact or approximate counter, in order to permit DV estimation in the presence of both insertions and deletions to the dataset. This modification does not ameliorate the inclusion/exclusion problem, however.

Sample-Counting Synopsis: Another type of synopsis arises from the "sample counting" DV-estimation method—also called "adaptive sampling"—credited to Wegman.1 Here the synopsis for partition A comprises a subset of {h(v): v ∈ 𝒟(A)}, where h: 𝒟(A) → {0, 1, …, M} is a hash function as before. In more detail, the synopsis comprises a fixed-size buffer that holds binary strings of length L = log(M), together with a "reference" binary string s, also of length L. The idea is to hash the DVs in the partition, as in logarithmic counting, and insert the hash values into a buffer that can hold up to k > 0 hash values; the buffer tracks only the distinct hash values inserted into it. When the buffer fills up, it is purged by removing all hash values whose leftmost bit is not equal to the leftmost bit of s; this operation removes roughly half of the hash values in the buffer. From this point on, a hash value is inserted into the buffer if and only if its first bit matches the first bit of s. The next time the buffer fills up, a purge step (with subsequent filtering) is performed by requiring that the two leftmost bits of each hash value in the buffer match the two leftmost bits of the reference string. This process continues until all the values in the partition have been hashed. The final DV estimate is roughly equal to K2^r, where r is the total number of purges that have occurred and K is the final number of values in the buffer. For sample-counting algorithms with reference string equal to 00…0, the synopsis holds the K smallest hash values encountered, where K lies roughly between k/2 and k.

The Bellman Synopsis: In the context of the Bellman system, the authors in Dasu et al.6 propose a synopsis related to DV estimation. This synopsis comprises k entries and uses independent hash functions h1, h2, …, hk; the ith synopsis entry is given by the ith minHash value mi = min{hi(v): v ∈ 𝒟(A)}. The synopsis for a partition is not actually used to directly compute the number of DVs in the partition, but rather to compute the Jaccard distance between partition domains; see Section 3.3. (The Jaccard distance between ordinary sets A and B is defined as J(A, B) = |A ∩ B|/|A ∪ B|. If J(A, B) = 1, then A = B; if J(A, B) = 0, then A and B are disjoint.) Indeed, this synopsis cannot be used directly for DV estimation, because the associated DV estimator is basically D̂1BE, which has infinite expectation; see Section 2. When constructing the synopsis, each scanned data item must be hashed k times for comparison to the k current minHash values; for the KMV synopsis, each scanned item need only be hashed once.
representation of the normalized hash values h(v)/M, where



3.2. DV estimators
The basic estimator D̂kBE was proposed in Bar-Yossef et al.,2 along with conservative error bounds based on Chebyshev's inequality. Interestingly, both the logarithmic and sample-counting estimators can be viewed as approximations to the basic estimator. For logarithmic counting—specifically the Flajolet–Martin algorithm—consider the binary decimal representation of the normalized hash values h(v)/M, where M = 2^L; e.g., a hash value h(v) = 00100110, after normalization, will have the binary decimal representation 0.00100110. It can be seen that the smallest normalized hash value is approximately equal to 2^−r, so that, modulo the correction factor, the Flajolet–Martin estimator (without stochastic averaging) is 1/2^−r, which roughly corresponds to D̂1BE. The final F-M estimator uses stochastic averaging to average independent values of r and hence compute an estimator Ê of E[log2 D̂1BE], leading to a final estimate of D̂ = c·2^Ê, where the constant c approximately unbiases the estimator. (Our new estimators are exactly unbiased.) For sample counting, suppose, without loss of generality, that the reference string is 00…0 and, as before, consider the normalized binary decimal representation of the hash values. Thus the first purge leaves behind normalized values of the form 0.0…, the second purge leaves behind values of the form 0.00…, and so forth, the last (rth) purge leaving behind only normalized hash values with r leading 0's. Thus the number 2^−r (which has r − 1 leading 0's) is roughly equal to the largest of the K normalized hash values in the size-k buffer, so that the estimate K/2^−r is roughly equal to D̂kBE.

3.3. Estimates for compound partitions
To our knowledge, the only prior discussion of how to construct DV-related estimates for compound partitions is found in Dasu et al.6 DV estimation for the intersection of partitions A and B is not computed directly. Instead, the Jaccard distance ρ = J(𝒟(A), 𝒟(B)) is estimated first by an estimator ρ̂, and then the number of values in the intersection of 𝒟(A) and 𝒟(B) is estimated as

D̂ = (|𝒟(A)| + |𝒟(B)|) · ρ̂/(ρ̂ + 1).

The quantities |𝒟(A)| and |𝒟(B)| are computed exactly, by means of GROUP BY relational queries; our proposed estimators avoid the need to compute or estimate these quantities. There is no discussion in Dasu et al.6 of how to handle any set operations other than the intersection of two partitions. If one uses the principle of inclusion/exclusion to handle other set operations, the resulting estimation procedure will not scale well as the number of operations increases.

4. AN IMPROVED DV ESTIMATOR
As discussed previously, the basic estimator D̂kBE is biased upwards for the true number of DVs D, and so somehow needs to be adjusted downward. We therefore consider the estimator

D̂kUB = (k − 1)/U(k)    (2)

and show that, both compared to the basic estimator and in a certain absolute sense, D̂kUB has superior statistical properties, including unbiasedness. The D̂kUB estimator forms the basis for the extended DV estimators, discussed in Section 5, used to estimate the number of DVs in a compound partition. Henceforth, we assume without further comment that D > k; if D ≤ k, then we can easily detect this situation and compute the exact value of D from the synopsis.

4.1. Moments and error bounds
Let U1, U2, …, UD be the normalized hash values of the D distinct items in the dataset; for our analysis, we model these values as a sequence of i.i.d. random variables from the uniform [0, 1] distribution—see the discussion in Section 2. As before, denote by U(k) the kth smallest of U1, U2, …, UD, that is, U(k) is the kth uniform order statistic. We can now apply results from the classical theory of order statistics to establish properties of the estimator D̂kUB = (k − 1)/U(k). We focus on moments and error bounds; additional analysis of D̂kUB can be found in Beyer et al.3

Our analysis rests on the fact that the probability density function (pdf) of U(k) is given by

fk,D(t) = t^(k−1) (1 − t)^(D−k) / B(k, D − k + 1)    (3)

where B(a, b) = ∫₀¹ t^(a−1) (1 − t)^(b−1) dt denotes the standard beta function; see Figure 1. To verify (3), fix x ∈ [0, 1] and observe that, for each 1 ≤ i ≤ D, we have P{Ui ≤ x} = x by definition of the uniform distribution. The probability that exactly k of the D uniform random variables are less than or equal to x is a binomial probability: C(D, k) x^k (1 − x)^(D−k), where C(a, b) denotes a binomial coefficient. Thus the probability that U(k) ≤ x is equal to the probability that at least k random variables are less than or equal to x, which is a sum of binomial probabilities:

P{U(k) ≤ x} = Σj=k..D C(D, j) x^j (1 − x)^(D−j) = ∫₀ˣ fk,D(t) dt.

The second equality can be established by integrating the rightmost term by parts repeatedly and using the identity

B(a, b) = Γ(a)Γ(b)/Γ(a + b) = (a − 1)!(b − 1)!/(a + b − 1)!    (4)

where Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt is the standard gamma function and the rightmost equality is valid for integer a and b. The result in (3) now follows by differentiation.

With (3) in hand, we can now determine the moments of D̂kUB, in particular the mean and variance. Denote by a_(b) the falling power a(a − 1)⋯(a − b + 1).

Theorem 2. Suppose that r ≥ 0 is an integer with r < k. Then

E[(D̂kUB)^r] = (k − 1)^r D_(r) / (k − 1)_(r)    (5)

In particular, E[D̂kUB] = D provided that k > 1, and Var[D̂kUB] = D(D − k + 1)/(k − 2) provided that k > 2.

Proof. For k > r ≥ 0, it follows from (3) that

E[(D̂kUB)^r] = (k − 1)^r E[U(k)^(−r)] = (k − 1)^r B(k − r, D − k + 1)/B(k, D − k + 1),

and the first assertion of the theorem follows directly from (4). Setting r = 1 in (5) yields the next assertion, and the final assertion follows from (5) and the relation Var[D̂kUB] = E[(D̂kUB)²] − E²[D̂kUB]. □
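Theorem 2 is easy to check empirically. The following Monte Carlo fragment is our own illustration; it substitutes true uniform random numbers for the normalized hash values, exactly as in the analysis, and averages the estimator (2) over repeated trials.

import heapq
import random

def simulate(D=10_000, k=64, trials=2_000):
    total = 0.0
    for _ in range(trials):
        # kth smallest of D i.i.d. uniforms, i.e., the order statistic U_(k)
        u_k = heapq.nsmallest(k, (random.random() for _ in range(D)))[-1]
        total += (k - 1) / u_k        # the unbiased estimator (2)
    return total / trials

print(simulate())  # close to D = 10,000, as Theorem 2 predicts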




Recall that the mean squared error (MSE) of a statistical estimator X of an unknown quantity μ is defined as MSE[X] = E[(X − μ)²] = Var[X] + Bias²[X]. To compare the MSE of the basic and unbiased estimators, note that, by (5), E[D̂kBE] = kD/(k − 1) and

MSE[D̂kBE] = Var[D̂kBE] + Bias²[D̂kBE] = (k/(k − 1))² Var[D̂kUB] + D²/(k − 1)² > Var[D̂kUB] = MSE[D̂kUB].

Thus D̂kBE is biased high for D, as discussed earlier, and has higher MSE than D̂kUB.

We can also use the result in (3) to obtain probabilistic (relative) error bounds for the estimator D̂kUB. Specifically, set Ix(a, b) = ∫₀ˣ t^(a−1) (1 − t)^(b−1) dt / B(a, b), so that P{U(k) ≤ x} = Ix(k, D − k + 1). Then, for 0 < ε < 1 and k ≥ 1, we have

P{|D̂kUB − D| ≤ εD} = Iy(k, D − k + 1) − Iz(k, D − k + 1),
where y = (k − 1)/((1 − ε)D) and z = (k − 1)/((1 + ε)D).    (6)

For example, if D = 10^6 and the KMV synopsis size is k = 1024, then, with probability δ = 0.95, the estimator D̂kUB will be within ±4% of the true value; this result is obtained by equating the right side of (6) to δ and solving for ε numerically. In practice, of course, D will be unknown, but we can assess the precision of D̂kUB approximately by replacing D with D̂kUB in the right side of (6) prior to solving for ε.

Error bounds can also be derived for D̂kBE. As discussed in Beyer et al.,3 D̂kUB is noticeably superior to D̂kBE when k is small; for example, when k = 16 and δ = 0.95, use of the unbiased estimator yields close to a 20% reduction in ε. As k increases, k − 1 ≈ k and both estimators perform similarly.

The foregoing development assumes that the hash function behaves as a 1-to-1 mapping from the D DVs in the dataset to a set of D uniform random numbers. In practice, we must use real-world hash functions that only approximate this ideal; see Section 2. Figure 3 shows the effect of using real hash functions on real data. The RDW database was obtained from the data warehouse of a large financial company, and consists of 24 relational tables, with a total of 504 attributes and roughly 2.6 million tuples. For several different hash functions, we computed the average value of the relative error RE(D̂kUB) = |D̂kUB − D|/D over multiple datasets in the database. The hash functions are described in detail in Beyer et al.;3 for example, the Advanced Encryption Standard (AES) hash function is a well-established cipher function that has been studied extensively. The "baseline" curve corresponds to an idealized hash function as used in our analysis. As can be seen, the real-world accuracies are consistent with the idealized results, and reasonable accuracy can be obtained even for synopsis sizes of k < 100. In Beyer et al.,3 we found that the relative performance of different hash functions was sensitive to the degree of regularity in the data; the AES hash function is relatively robust to such data properties, and is our recommended hash function.

Figure 3. Hashing effect on the RDW dataset. [The original figure plots average relative error (0 to 0.4) against synopsis size k (10 to 1000, log scale) for the AES, FLH, and GRM hash functions and for an idealized baseline.]

4.2. Analysis with many DVs
When the number of DVs is known to be large, we can establish a minimum-variance property of the D̂kUB estimator, and also develop useful approximate error bounds that can be used to select a synopsis size prior to data processing.

Minimum Variance Property: The classical statistical approach to estimating unknown parameters based on a data sample is the method of maximum likelihood.17, Sec. 4.2 A basic result for maximum-likelihood estimators17, Sec. 4.2.2 asserts that an MLE of an unknown parameter has the minimum variance over all possible parameter estimates as the sample size becomes large. We show that D̂kUB is asymptotically equivalent to the maximum-likelihood estimator (MLE) as D and k become large. Thus, for D >> k >> 1, the estimator D̂kUB has, to a good approximation, the minimal possible variance for any estimator of D.

To find the MLE, we cast our DV-estimation problem as a parameter-estimation problem. Specifically, recall that U(k) has the pdf fk,D given in (3). The MLE estimate of D is defined as the value D̂ that maximizes the likelihood L(D; U(k)) of the observation U(k), defined as L(D; U(k)) = fk,D(U(k)). That is, roughly speaking, D̂ maximizes the probability of seeing the value of U(k) that was actually observed. We find this maximizing value by solving the equation L′(D; U(k)) = 0, where the prime denotes differentiation with respect to D. We have L′(D; U(k)) = ln(1 − U(k)) − Ψ(D − k + 1) + Ψ(D + 1), where Ψ denotes the digamma function. If x is sufficiently large, then Ψ(x) ≈ ln(x − 1) + γ, where γ ≈ 0.5772 denotes Euler's constant. Applying this approximation (the γ terms cancel), we obtain D̂kMLE ≈ k/U(k), so that the MLE estimator roughly resembles the basic estimator D̂kBE provided that D >> k >> 1. In fact, our experiments indicated that D̂kMLE and D̂kBE are indistinguishable from a practical point of view. It follows that D̂kMLE ≈ (k/(k − 1)) D̂kUB is asymptotically equivalent to D̂kUB as k → ∞.

Approximate Error Bounds: To obtain approximate probabilistic error bounds when the number of DVs is large, we simply let D → ∞ in (6), yielding

P{|D̂kUB − D| ≤ εD} ≈ Gk((k − 1)/(1 − ε)) − Gk((k − 1)/(1 + ε)),

where Gk denotes the cumulative distribution function of the gamma distribution with shape parameter k and unit scale; this limit holds because D·U(k) converges in distribution to a sum of k independent unit exponentials.
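The bound (6) is easy to evaluate numerically: Ix(a, b) is the regularized incomplete beta function, i.e., the CDF of the Beta(a, b) distribution. The following sketch is our own illustration (it assumes SciPy is available) and finds, by bisection, the smallest ε achieving a target confidence δ, which supports the synopsis-sizing calculations described above.

from scipy.stats import beta

def coverage(eps, D, k):
    # Right side of (6): P{ |D-hat - D| <= eps * D } for the estimator (2).
    F = beta(k, D - k + 1).cdf   # F(x) = I_x(k, D - k + 1)
    return F((k - 1) / ((1 - eps) * D)) - F((k - 1) / ((1 + eps) * D))

def solve_eps(D, k, delta=0.95, tol=1e-6):
    # coverage() is increasing in eps, so bisection applies.
    lo, hi = 0.0, 0.999999
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage(mid, D, k) >= delta:
            hi = mid
        else:
            lo = mid
    return hi

print(solve_eps(D=10**6, k=1024))  # about 0.04, matching the example above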



An alternative derivation is given in Beyer et al.3 using a powerful proof technique that exploits well-known properties of the exponential distribution. The approximate error bounds have the advantageous property that, unlike the exact bounds, they do not involve the unknown quantity D. Thus, given desired values of ε and δ, they can be used to help determine target synopsis sizes prior to processing the data. When D is not large, the resulting recommended synopsis sizes are slightly larger than necessary, but not too wasteful. It is also shown in Beyer et al.3 that E[RE(D̂kUB)] ≈ (π(k − 2)/2)^(−1/2) for large D, further clarifying the relationship between synopsis size and accuracy.

5. FLEXIBLE AND SCALABLE ESTIMATION
The discussion up until now has focused on improving the process of creating and using a synopsis to estimate the number of DVs in a single base partition, i.e., in a single dataset. As discussed in the introduction, however, the true power of our techniques lies in the ability to split up a massive dataset into partitions, create synopses for the partitions in parallel, and then compose the synopses to obtain DV estimates for arbitrary combinations of partitions. E.g., if the combination of interest is in fact the union of all the partitions, then, as in the classical setting, we can estimate the number of DVs in the entire dataset, while parallelizing the traditional sequential scan of the data. On the other hand, estimates of the number of DVs in the intersection of partitions can be used to estimate similarity measures such as Jaccard distance.

We therefore shift our focus to DV estimation for a compound partition that is created from a set of base partitions using the multiset operations of intersection, union, and difference. To handle compound partitions, we augment our KMV synopses with counters; we show that the resulting AKMV synopses are "closed" under multiset operations on the parent partitions. The closure property implies that if A and B are compound partitions and E is obtained from A and B via a multiset operation, then we can compute an AKMV synopsis for E from the corresponding AKMV synopses for A and B, and unbiasedly estimate the number of DVs in E from this resulting synopsis. This procedure avoids the need to access the synopsis for each of the base partitions that were used to create A and B. The AKMV synopsis can also handle deletions of individual items from the dataset. As discussed below, the actual DV estimators that we use for compound partitions are, in general, extensions of the simple D̂kUB estimator developed in Section 4.

5.1. AKMV synopses
We assume throughout that all synopses are created using the same hash function h: 𝒟 → {0, 1, …, M}, where 𝒟 denotes the domain of the data values that appear in the partitions and M = Θ(|𝒟|²) as discussed in Section 2. We start by focusing on insertion-only environments, and discuss deletions in Section 5.3.

We first define the AKMV synopsis of a base partition A, where A has a KMV synopsis L = (h(v1), h(v2), …, h(vk)) of size k, with h(v1) < h(v2) < ⋯ < h(vk). The AKMV synopsis of A is defined as L+ = (L, c), where c = (c(v1), c(v2), …, c(vk)) is a set of k non-negative counters. The quantity c(v) is the multiplicity in A of the value v. The first two lines in Figure 4 show the normalized hash values and corresponding counter values in the AKMV synopses L+A and L+B of two base partitions A and B, respectively. (Circles and squares represent normalized hash values; inscribed numbers represent counter values.)

Figure 4. Combining two AKMV synopses of size k = 6 (synopsis elements are colored). [The original figure shows, on the unit interval [0, 1]: line 1, the synopsis L+A with counters 1 1 3 1 2 3; line 2, the synopsis L+B with counters 1 1 2 1 1 1; line 3, the combined list LA ⊕ LB; and line 4, the synopsis L+(A∩B) with counters 1 0 0 2 0 0, for which K(A∩B) = 2.]

The size of the AKMV synopsis is O(k log D + k log N), where N is the number of data items in A. It is easy to modify Algorithm 1 to create and maintain counters via O(1) operations. The modified synopsis retains the original construction complexity of O(N + k log k log D).

To define an AKMV synopsis for compound partitions, we first define an operator ⊕ for combining KMV synopses. Consider two partitions A and B, along with their KMV synopses LA and LB of sizes kA and kB, respectively. Define LA ⊕ LB to be the ordered list comprising the k smallest values in LA ∪ LB, where k = min(kA, kB) and we temporarily treat LA and LB as sets rather than ordered lists. (See line 3 of Figure 4; the yellow points correspond to values that occur in both LA and LB.) Observe that the ⊕ operator is symmetric and associative.

Theorem 3. The ordered list L = LA ⊕ LB is the size-k KMV synopsis of A ∪ B, where k = min(kA, kB).

Proof. We again temporarily treat LA and LB as sets. For a multiset S with 𝒟(S) ⊆ 𝒟, write h(S) = {h(v): v ∈ 𝒟(S)}, and denote by G the set of k smallest values in h(A ∪ B). Observe that G contains the k′ smallest values in h(A) for some k′ ≤ k, and these k′ values therefore are also contained in LA, i.e., G ∩ h(A) ⊆ LA. Similarly, G ∩ h(B) ⊆ LB, so that G ⊆ LA ∪ LB. For any h ∈ (LA ∪ LB) \ G, we have that h > max{h′: h′ ∈ G} by definition of G, because h ∈ h(A ∪ B). Thus G in fact comprises the k smallest values in LA ∪ LB, so that L = G. Now observe that, by definition, G is precisely the size-k KMV synopsis of A ∪ B. □
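In code, the ⊕ operator of Theorem 3 is a one-line merge; this sketch (our illustration) represents a KMV synopsis as a sorted Python list of distinct hash values.

def combine(LA, LB):
    # L_A (+) L_B: the k = min(|L_A|, |L_B|) smallest values in L_A union L_B.
    k = min(len(LA), len(LB))
    return sorted(set(LA) | set(LB))[:k]

LA = [3, 8, 15, 21, 36, 50]
LB = [3, 5, 15, 40, 44, 61]
print(combine(LA, LB))  # [3, 5, 8, 15, 21, 36]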


We next define the AKMV synopsis for a compound partition E created from n ≥ 2 base partitions A1, A2, …, An using the multiset union, intersection, and set-difference operators. Some examples are E = A1 \ A2 and E = ((A1 ∪ A2) ∩ (A3 ∪ A4)) \ A5. The AKMV synopsis for E is defined as L+E = (LE, cE), where LE = LA1 ⊕ LA2 ⊕ ⋯ ⊕ LAn is of size k = min(kA1, kA2, …, kAn), and, for v ∈ 𝒟(A1) ∪ ⋯ ∪ 𝒟(An), the counter cE(v) is the multiplicity of v in E; observe that cE(v) = 0 if v is not present in E, so that there may be one or more zero-valued counters; see line 4 of Figure 4 for the case E = A ∩ B.

With these definitions, the collection of AKMV synopses over compound partitions is closed under multiset operations, and an AKMV synopsis can be built up incrementally from the synopses for the base partitions. Specifically, suppose that we combine (base or compound) partitions A and B—having respective AKMV synopses L+A = (LA, cA) and L+B = (LB, cB)—to create E = A ◊ B, where ◊ ∈ {∪, ∩, \}. Then the AKMV synopsis for E is (LA ⊕ LB, cE), where

cE(v) = cA(v) + cB(v)          if ◊ = ∪,
cE(v) = min(cA(v), cB(v))      if ◊ = ∩,
cE(v) = max(cA(v) − cB(v), 0)  if ◊ = \.

As discussed in Beyer et al.3 and Gemulla,11 the AKMV synopsis can sometimes be simplified. If, for example, all partitions are ordinary sets (not multisets) and ordinary set operations are used to create compound partitions, then the AKMV counters can be replaced by a compact bit vector, yielding an O(k log D) synopsis size.
yielding an O(k log D) synopsis size. sis of size k = 8192 for each dataset in the RDW database.
Then, for every possible pair of synopses, we used DˆE to
5.2. DV estimator for AKMV synopsis estimate the DV count for the union and intersection, and
We now show how to estimate the number of DVs for a also estimated the Jaccard distance between set domains
compound partition using the partition’s AKMV synop- using our new unbiased estimator rˆ defined in (7). We
sis; to this end, we need to generalize the unbiased DV also estimated these quantities using a best-of-breed esti-
estimator of Section 4. Consider a compound partition E mator: the SDLogLog estimator, which is a highly tuned
created from n r 2 base partitions A1, A2, …, An using mul- implementation of the LogLog estimator given in Durand
tiset operations, along with the AKMV synopsis L E = (LE, cE), and Flajolet.8 For this latter estimator, we estimated the
where LE = (h(v1), h(v2), …, h(vk) ). Denote by VL = {v1, v2, …, vk} number of DVs in the union directly, and then used the
the set of data values corresponding to the elements of LE. inclusion/exclusion rule to estimate the DV count for the
(Recall our assumption that there are no hash collisions.) intersection and then for the Jaccard distance. The figure
Set KE = |{v Œ VL: cE(v) > 0}|; for the example E = A w  B in displays, for each estimator, a relative-error histogram for
Figure 4, KE = 2. It follows from Theorem 3 that LE is a size- each of these three multiset operations. (The histogram
k KMV synopsis of the multiset Av = A1v A v  ...v A . The shows, for each possible RE value, the number of dataset
2 n
key observation is that, under our random hashing model, pairs for which the DV estimate yielded that value.) For the
VL can be viewed as a random sample of size k drawn uni-
formly and without replacement from # (Av ); denote by
Dv = |# (Av )| the number of DVs in Av . The quantity KE is a Figure 5. Accuracy Comparison for Union, Intersection, and Jaccard
random variable that represents the number of elements Distance on the RDW Dataset.
in VL that also belong to the set # (E). It follows that KE has
500
a hypergeometric distribution: setting DE = |# (E)|, we have
P{KE = j} = Dj E Dvk jDE / Dkv  Unbiased-KMV/Intersections
Unbiased-KMV/Unions
We now estimate DE using KE, as defined above, and U(k), the 400 Unbiased-KMV/Jaccard
SDLogLog/Intersections
largest hash value in LE. From Section 4.1 and Theorem 3, SDLogLog/Unions
we know that D̂v = (k − 1)/U(k) is an unbiased estimator of Dv ; SDLogLog/Jaccard
we would like to “correct” this estimator via multiplication 300
Frequency

by the ratio r = DE/Dv . We do not know r, but a reasonable


estimate is 200

r̂  KE/k (7)
100
the fraction of elements in the sample VL Œ # (Av ) that
belong to # (E). This leads to our proposed estimator
0
0 0.05 0.10 0.15
KE k  1 Relative error
D̂E (8)
k U(k)
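With the dict representation used in the sketches above, the estimators (7) and (8) read as follows (our illustration; M is the hash range used when the synopses were built).

M = 2**64 - 1  # hash range used during synopsis construction

def estimate_dv(E):
    # Equation (8): D-hat_E = (K_E / k) * (k - 1) / U_(k).
    k = len(E)
    K_E = sum(1 for c in E.values() if c > 0)
    U_k = max(E) / M               # largest hash value in L_E, normalized
    return (K_E / k) * (k - 1) / U_k

def estimate_jaccard(E):
    # Equation (7), for E = A intersect B over ordinary sets:
    # rho-hat = K_E / k is an unbiased estimator of J(A, B).
    return sum(1 for c in E.values() if c > 0) / len(E)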



majority of the datasets, the unbiased estimator based on 7. SUMMARY AND CONCLUSION
the KMV synopsis provides estimates that are almost ten We have revisited the classical problem of DV estimation,
times more accurate than the SDLogLog estimates, even but from a synopsis-oriented point of view. By combining
though both methods used exactly the same amount of and extending previous results on DV estimation, we have
available memory. obtained the AKMV synopsis, along with corresponding
5.3. Deletions
We now show how AKMV synopses can easily support deletions of individual items. Consider a partition A that receives a stream of transactions of the form +v or −v, corresponding to the insertion or deletion, respectively, of value v. A naive approach maintains two AKMV synopses: a synopsis Li for the multiset Ai of inserted items and a synopsis Ld for the multiset Ad of deleted items. Computing the AKMV synopsis of the multiset difference Ai \ Ad yields the AKMV synopsis LA of the true multiset A. We do not actually need two synopses: simply maintain the counters in a single AKMV synopsis L by incrementing the counter at each insertion and decrementing at each deletion. If we retain synopsis entries having counter values equal to 0, we produce precisely the synopsis LA described above. If too many counters become equal to 0, the quality of synopsis-based DV estimates will deteriorate. Whenever the number of deletions causes the error bounds to become unacceptable, the data can be scanned to compute a fresh synopsis.
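A minimal C++ sketch of this counter maintenance (our own illustration, not the paper's code; for brevity it omits the trimming that keeps only the k smallest hash values, and hv stands for the already-computed hash of a value):

#include <cstdint>
#include <map>

// AKMV counter maintenance under a stream of insertions (+v) and
// deletions (-v). Entries whose counter reaches 0 are retained on
// purpose: dropping them would break the "k smallest hashes" invariant.
class AKMVCounters {
    std::map<uint64_t, int64_t> counters;   // hash value -> net count
public:
    void insert(uint64_t hv) { ++counters[hv]; }
    void erase(uint64_t hv)  { --counters[hv]; }
    // K_E for the simple case E = A: entries with positive net count.
    size_t positiveEntries() const {
        size_t k = 0;
        for (const auto& [hv, c] : counters) if (c > 0) ++k;
        return k;
    }
};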
6. RECENT WORK
A number of developments have occurred both in parallel with, and subsequent to, the work described in Beyer et al.3 Duffield et al.7 devised a sampling scheme for weighted items, called "priority sampling," for the purpose of estimating "subset sums," i.e., sums of weights over subsets of items. Priority sampling can be viewed as assigning a priority to an item i with weight wi by generating a random number Ui uniformly on [0, 1/wi], and storing the k smallest-priority items. In the special case of unit weights, if we use a hash function instead of generating random numbers and if we eliminate duplicate priorities, the synopsis reduces to the KMV synopsis and, when the subset in question corresponds to the entire population, the subset-sum estimator coincides with D̂kUB. If duplicate priorities are counted rather than eliminated, the AKMV synopsis is obtained (but we then use different estimators). An "almost minimum" variance property of the priority-sampling-based estimate has been established in Szegedy20; this result carries over to the KMV synopsis, and strengthens our asymptotic minimum-variance result. In Cohen et al.,5 the priority-sampling approach has recently been generalized to a variety of priority-assignment schemes.

Since the publication of Beyer et al.,3 there have been several applications of KMV-type synopses and corresponding estimators to estimate the number of items in a time-based sliding window,12 the selectivity of set-similarity queries,15 the intersection sizes of posting lists for purposes of OLAP-style querying of content-management systems,19 and document-key statistics that are useful for identifying promising publishers in a publish-subscribe system.22

7. SUMMARY AND CONCLUSION
We have revisited the classical problem of DV estimation, but from a synopsis-oriented point of view. By combining and extending previous results on DV estimation, we have obtained the AKMV synopsis, along with corresponding unbiased DV estimators. Our synopses are relatively inexpensive to construct, yield superior accuracy, and can be combined to easily handle compound partitions, permitting flexible and scalable DV estimation.

References
1. Astrahan, M., Schkolnick, M., Whang, K. Approximating the number of unique values of an attribute without sorting. Inf. Sys. 12 (1987), 11–15.
2. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L. Counting distinct elements in a data stream. In Proc. RANDOM (2002), 1–10.
3. Beyer, K.S., Haas, P.J., Reinwald, B., Sismanis, Y., Gemulla, R. On synopses for distinct-value estimation under multiset operations. In Proc. ACM SIGMOD (2007), 199–210.
4. Brown, P.G., Haas, P.J. Techniques for warehousing of sample data. In Proc. ICDE (2006).
5. Cohen, E., Kaplan, H. Tighter estimation using bottom k sketches. Proc. VLDB Endow. 1, 1 (2008), 213–224.
6. Dasu, T., Johnson, T., Muthukrishnan, S., Shkapenyuk, V. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD (2002), 240–251.
7. Duffield, N., Lund, C., Thorup, M. Priority sampling for estimation of arbitrary subset sums. J. ACM 54, 6 (2007), 32.
8. Durand, M., Flajolet, P. Loglog counting of large cardinalities. In Proc. ESA (2003), 605–617.
9. Flajolet, P., Martin, G.N. Probabilistic counting algorithms for data base applications. J. Comp. Sys. Sci. 31 (1985), 182–209.
10. Ganguly, S., Garofalakis, M., Rastogi, R. Tracking set-expression cardinalities over continuous update streams. VLDB J. 13 (2004), 354–369.
11. Gemulla, R. Sampling Algorithms for Evolving Datasets. Ph.D. thesis, TU Dresden, Dept. of CS, 2008. http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644.
12. Gemulla, R., Lehner, W. Sampling time-based sliding windows in bounded space. In Proc. SIGMOD (2008), 379–392.
13. Gibbons, P. Distinct-values estimation over data streams. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, 2009. To appear.
14. Haas, P.J., Stokes, L. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc. 93 (1998), 1475–1487.
15. Hadjieleftheriou, M., Yu, X., Koudas, N., Srivastava, D. Hashed samples: selectivity estimators for set similarity selection queries. Proc. VLDB Endow. 1, 1 (2008), 201–212.
16. Motwani, R., Raghavan, P. Randomized Algorithms. Cambridge University Press (1995).
17. Serfling, R.J. Approximation Theorems of Mathematical Statistics. Wiley, New York (1980).
18. Shukla, A., Deshpande, P., Naughton, J.F., Ramasamy, K. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB (1996), 522–531.
19. Simitsis, A., Baid, A., Sismanis, Y., Reinwald, B. Multidimensional content eXploration. Proc. VLDB Endow. 1, 1 (2008), 660–671.
20. Szegedy, M. The DLT priority sampling is essentially optimal. In Proc. STOC (2006), 150–158.
21. Whang, K., Vander-Zanden, B.T., Taylor, H.M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys. 15 (1990), 208–229.
22. Zimmer, C., Tryfonopoulos, C., Weikum, G. Exploiting correlated keywords to improve approximate information filtering. In Proc. SIGIR (2008), 323–330.

Kevin Beyer, Rainer Gemulla, Peter J. Haas, Berthold Reinwald, and Yannis Sismanis ({kbeyer,rgemull,phaas,reinwald,syannis}@us.ibm.com), IBM Almaden Research Center, San Jose, CA.

© 2009 ACM 0001-0782/09/1000 $10.00

research highlights
DOI:10.1145/1562764.1562788

Technical Perspective
Data Stream Processing—
When You Only Get One Look
By Johannes Gehrke

THE DATABASE AND systems communities have made great progress in developing database systems that allow us to store and query huge amounts of data. My first computer cost about $1,000; it was a Commodore 64 with a 170KB floppy disk drive. Today (September 2009), I can configure a 1.7TB file server for the same price. Companies are responding to this explosion in storage availability by building bigger and bigger data warehouses, where every digital trail we leave is stored for later analysis. Data arrives 24/7, and real-time "on-the-fly" analysis—where answers are always available—is becoming mandatory. Here is where data stream processing comes to the rescue.

In data stream processing scenarios, data arrives at high speeds and must be analyzed in the order it is received using a limited amount of memory. The area has two main directions: a systems side and an algorithmic side. On the systems side, researchers have developed data stream processing systems that work like database systems turned upside down: Long-running queries are registered with the system and data is streamed through the system. Startups now sell systems that analyze streaming data for solutions in areas such as fraud detection, algorithmic trading, and network monitoring. They often offer at least an order of magnitude performance improvement over traditional database systems.

On the algorithmic side, there has been much research on novel one-pass algorithms. These algorithms have no need for secondary index structures, and they do not require expensive sorting operations. They are online—an answer of the query over the current prefix of the stream is available at any time. These so-called data stream algorithms achieve these properties by trading exact query answers against approximate answers, but these approximations come with provable quality guarantees.

The following paper by Graham Cormode and Marios Hadjieleftheriou gives an overview of recent progress for the important primitive of finding frequent items in a data stream. Informally, an item is frequent in a prefix of the stream if its relative frequency exceeds a user-defined threshold. Another formulation of the problem just looks for the most frequently occurring items. The authors present an algorithmic framework that encompasses previous work and shows the results of a thorough experimental comparison of the different approaches.

This paper is especially timely since some of these algorithms are already in use (and those that are in use are not necessarily the best, according to the authors). For example, inside Google's analysis infrastructure, in the map-reduce framework, there exist several prepackaged "aggregators" that in one pass collect statistics over a huge dataset. The "quantile" aggregate, which collects a value at each quantile of the data, uses a previously developed algorithm that is covered in the paper, and the "top" aggregate estimates the most popular values in a dataset, again using an algorithm captured by the framework in the paper.

Within AT&T (the home institution of the authors) a variety of streaming algorithms are deployed today for network monitoring, based on real-time analysis of packet header data. For example, quantile aggregates are used to track the distribution of round-trip delays between different points in the network over time. Similarly, heavy-hitter aggregates are used to find sources that send the most traffic to a given destination over time.

Although the paper surveys about 30 years of research, there is still much progress in the area. Moreover, finding frequent items is just one statistic. In practice, much more sophisticated queries such as frequent combinations of items, mining clusters, or other statistical models require data stream algorithms with quality guarantees. For example, recent work from Yahoo! for content optimization shows how to use time series models that are built online to predict the click-through rate of an article based on the stream of user clicks.

I think we have only scratched the surface both for applications and in novel algorithms, and I am looking forward to another 30 years of innovation. I recommend this paper to learn about the types of techniques that have been developed over the years and see how ideas from algorithms, statistics, and databases have come together in this problem.

Johannes Gehrke ([email protected]) is an associate professor at Cornell University, Ithaca, NY.

© 2009 ACM 0001-0782/09/1000 $10.00



DOI:10.1145/1562764.1562789

Finding the Frequent Items in Streams of Data
By Graham Cormode and Marios Hadjieleftheriou

Abstract
The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context, and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties.

To further illustrate the different properties of the algorithms, we provide baseline implementations. This allows us to give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.

1. INTRODUCTION
Many data generation processes can be modeled as data streams. They produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. For example, the sequence of queries posed to an Internet search engine can be thought of as a stream, as can the collection of transactions across all branches of a supermarket chain. In aggregate, this data can arrive at enormous rates, easily in the realm of hundreds of gigabytes per day or higher. While this data may be archived and indexed within a data warehouse, it is also important to process the data "as it happens," to provide up to the minute analysis and statistics on current trends. Methods to achieve this must be quick to respond to each new piece of information, and use resources which are very small when compared to the total quantity of data.

These applications and others like them have led to the formulation of the so-called "streaming model." In this abstraction, algorithms take only a single pass over their input, and must accurately compute various functions while using resources (space and time per item) that are strictly sublinear in the size of the input—ideally, polynomial in the logarithm of the input size. The output must be produced at the end of the stream, or when queried on the prefix of the stream that has been observed so far. (Other variations ask for the output to be maintained continuously in the presence of updates, or on a "sliding window" of only the most recent updates.) Some problems are simple in this model: for example, given a stream of transactions, finding the mean and standard deviation of the bill totals can be accomplished by retaining a few "sufficient statistics" (sum of all values, sum of squared values, etc.). Others can be shown to require a large amount of information to be stored, such as determining whether a particular search query has already appeared anywhere within a large stream of queries. Determining which problems can be solved effectively within this model remains an active research area.
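As a small illustration of such sufficient statistics (a sketch we add here, not code from the paper), the following C++ fragment maintains only a count, a sum, and a sum of squares, yet can report the mean and standard deviation of the stream at any time:

#include <cmath>
#include <cstdint>

// One-pass sufficient statistics for mean and standard deviation:
// only n, sum(x), and sum(x^2) are kept, regardless of stream length.
struct RunningStats {
    uint64_t n = 0;
    double sum = 0.0, sumsq = 0.0;
    void update(double x) { ++n; sum += x; sumsq += x * x; }
    double mean() const { return sum / n; }
    double stddev() const {                  // population standard deviation
        double m = mean();
        return std::sqrt(sumsq / n - m * m);
    }
};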
The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams. The problem is popular due to its simplicity to state, and its intuitive interest and value. It is important both in itself, and as a subroutine within more advanced data stream computations. Informally, given a sequence of items, the problem is simply to find those items which occur most frequently. Typically, this is formalized as finding all items whose frequency exceeds a specified fraction of the total number of items. This is shown in Figure 1. Variations arise when the items are given weights, and further when these weights can also be negative.

Figure 1. A stream of items defines a frequency distribution over items. In this example, with a threshold of f = 20% over the 19 items grouped in bins, the problem is to find all items with frequency at least 3.8—in this case, the green and red items (middle two bins).

A previous version of this paper was published in Proceedings of the International Conference on Very Large Data Bases (Aug. 2008).



This abstract problem captures a wide variety of settings. The items can represent packets on the Internet, and the weights are the size of the packets. Then the frequent items represent the most popular destinations, or the heaviest bandwidth users (depending on how the items are extracted from the flow identifiers). This knowledge can help in optimizing routing decisions, for in-network caching, and for planning where to add new capacity. Or, the items can represent queries made to an Internet search engine, and the frequent items are now the (currently) popular terms. These are not simply hypothetical examples, but genuine cases where algorithms for this problem have been applied by large corporations: AT&T11 and Google,23 respectively. Given the size of the data (which is being generated at high speed), it is important to find algorithms which are capable of processing each new update very quickly, without blocking. It also helps if the working space of the algorithm is very small, so that the analysis can happen over many different groups in parallel, and because small structures are likely to have better cache behavior and hence further help increase the throughput.

Obtaining efficient and scalable solutions to the frequent items problem is also important since many streaming applications need to find frequent items as a subroutine of another, more complex computation. Most directly, mining frequent itemsets inherently builds on finding frequent items as a basic building block. Finding the entropy of a stream requires learning the most frequent items in order to directly compute their contribution to the entropy, and remove their contribution before approximating the entropy of the residual stream.8 The HSS (Hierarchical Sampling from Sketches) technique uses hashing to derive multiple substreams, the frequent elements of which are extracted to estimate the frequency moments of the stream.4 The frequent items problem is also related to the recently popular area of Compressed Sensing.

Other work solves generalized versions of frequent items problems by building on algorithms for the "vanilla" version of the problem. Several techniques for finding the frequent items in a "sliding window" of recent updates (instead of all updates) operate by keeping track of the frequent items in many sub-windows.2, 13 In the "heavy hitters distinct" problem, with applications to detecting network scanning attacks, the count of an item is the number of distinct pairs containing that item paired with a secondary item. It is typically solved extending a frequent items algorithm with distinct counting algorithms.25 Frequent items have also been applied to models of probabilistic streaming data,17 and within faster "skipping" techniques.3

Thus the problem is an important one to understand and study in order to produce efficient streaming implementations. It remains an active area, with many new research contributions produced every year on the core problem and its variations. Due to the amount of work on this problem, it is easy to miss out some important references or fail to appreciate the properties of certain algorithms. There are several cases where algorithms first published in the 1980s have been "rediscovered" two decades later; existing work is sometimes claimed to be incapable of a certain guarantee, which in truth it can provide with only minor modifications; and experimental evaluations do not always compare against the most suitable methods.

In this paper, we present the main ideas in this area, by describing some of the most significant algorithms for the core problem of finding frequent items using common notation and terminology. In doing so, we also present the historical development of these algorithms. Studying these algorithms is instructive, as they are relatively simple, but can be shown to provide formal guarantees on the quality of their output as a function of an accuracy parameter e. We also provide baseline implementations of many of these algorithms against which future algorithms can be compared, and on top of which algorithms for different problems can be built. We perform experimental evaluation of the algorithms over a variety of data sets to indicate their performance in practice. From this, we are able to identify clear distinctions among the algorithms that are not apparent from their theoretical analysis alone.

2. DEFINITIONS
We first provide formal definition of the stream and the frequencies fi of the items within the stream as the number of times item i is seen in the stream.

Definition 1. Given a stream S of n items t1 … tn, the frequency of an item i is fi = |{j | tj = i}|. The exact f-frequent items comprise the set {i | fi > f·n}.

Example. The stream S = (a, b, a, c, c, a, b, d) has fa = 3, fb = 2, fc = 2, fd = 1. For f = 0.2, the frequent items are a, b, and c.

A streaming algorithm which finds the exact f-frequent items must use a lot of space, even for large values of f, based on the following information-theoretic argument. Given an algorithm that claims to solve this problem for f = 50%, we could insert a set S of N items, where every item has frequency 1. Then, we could also insert N − 1 copies of item i. If i is now reported as a frequent item (occurring more than 50% of the time) then i ∈ S, else i ∉ S. Consequently, since correctly storing a set of size N requires Ω(N) space, Ω(N) space is also required to solve the frequent items problem. That is, any algorithm which promises to solve the exact problem on a stream of length n must (in the worst case) store an amount of information that is proportional to the length of the stream, which is impractical for the large stream sizes we consider.

Because of this fundamental difficulty in solving the exact problem, an approximate version is defined based on a tolerance for error, which is parametrized by e.

Definition 2. Given a stream S of n items, the e-approximate frequent items problem is to return a set of items F so that for all items i ∈ F, fi > (f − e)n, and there is no i ∉ F such that fi > f·n.

Since the exact (e = 0) frequent items problem is hard in general, we use "frequent items" or "the frequent items problem" to refer to the e-approximate frequent items problem. A closely related problem is to estimate the frequency of items on demand.

Definition 3. Given a stream S of n items defining frequencies fi as above, the frequency estimation problem is to process a stream so that, given any i, an f̂i is returned satisfying fi ≤ f̂i ≤ fi + e·n.


A solution to the frequency estimation problem allows the frequent items problem to be solved (slowly): one can estimate the frequency of every possible item i, and report those i's whose frequency is estimated above (f − e)n. Exhaustively enumerating all items can be very time consuming (or infeasible, e.g., when the items can be arbitrary strings). However, all the algorithms we study here solve both the approximate frequent items problem and the frequency estimation problem at the same time. Most solutions are deterministic, but we also discuss randomized solutions, which allow a small user-specified probability of making a mistake.

3. FREQUENT ITEMS ALGORITHMS
We discuss two main classes of algorithms for finding the frequent items. Counter-based algorithms track a subset of items from the input, and monitor counts associated with these items. We also discuss sketch algorithms, which are (randomized) linear projections of the input viewed as a vector, and solve the frequency estimation problem. They therefore do not explicitly store items from the input. Furthermore, sketch algorithms can support deletion of items (corresponding to updates with a negative weight, discussed in more detail below), in contrast with counter-based schemes, at the cost of increased space usage and update time.

These are by no means the only solutions possible for this problem. Other solutions are based on various notions of randomly sampling items from the input, and of summarizing the distribution of items in order to find quantiles, from which the frequent items can be discovered. These solution types have attracted less interest for the frequent items problem, and are less effective based on our full experimental evaluations.10

3.1. Counter-based algorithms
Counter-based algorithms decide for each new arrival whether to store this item or not, and if so, what counts to associate with it. A common feature of these algorithms is that when given a new item, they test whether it is one of k being stored by the algorithm, and if so, increment its count. The cost of supporting this "dictionary" operation depends on the model of computation assumed. A simple solution is to use a hash table storing the current set of items, but this means that an otherwise deterministic solution becomes randomized in its time cost, since it takes expected O(1) operations to perform this step. Other models assume that there is hardware support for these operations (such as Content Addressable Memory), or else that deterministic "dynamic dictionary algorithms" are used. We sidestep this issue in this presentation by just counting the number of "dictionary" operations in the algorithms.

Majority Algorithm: The problem of frequent items dates back at least to a problem proposed by Moore in 1980. It was published as a "problem" in the Journal of Algorithms in the June 1981 issue, as

[J.Alg 2, P208–209] Suppose we have a list of n numbers, representing the "votes" of n processors on the result of some computation. We wish to decide if there is a majority vote and what the vote is.

Moore, with Boyer, also invented the Majority algorithm in 1980, described in a technical report from early 1981.6 To them, this was mostly of interest from the perspective of automatically proving the correctness of the solution (the details of this were published in 1991, along with a partial history7). In the December 1982 issue of the Journal of Algorithms, a solution provided by Fischer and Salzburg was published.15 Their proposed algorithm, although presented differently, was essentially identical to Majority, and was accompanied by an analysis of the number of comparisons needed to solve the problem. Majority can be stated as follows: store the first item and a counter, initialized to 1. For each subsequent item, if it is the same as the currently stored item, increment the counter. If it differs, and the counter is zero, then store the new item and set the counter to 1; else, decrement the counter. After processing all items, the algorithm guarantees that if there is a majority vote, then it must be the item stored by the algorithm. The correctness of this algorithm is based on a pairing argument: if every nonmajority item is paired with a majority item, then there should still remain an excess of majority items. Although not posed as a streaming problem, the algorithm has a streaming flavor: it takes only one pass through the input (which can be ordered arbitrarily) to find a majority item. To verify that the stored item really is a majority, a second pass is needed to simply count the true number of occurrences of the stored item. Without this second pass, the algorithm has a partial guarantee: if there is an exact majority item, it is found at the end of the first pass, but the algorithm is unable to determine whether this is the case. Note that as the hardness results for Definition 1 show, no algorithm can correctly distinguish the cases when an item is just above or just below the threshold in a single pass without using a large amount of space.
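As an illustration (our own sketch, not code from the paper), Majority fits in a few lines of C++; a second pass would still be needed to verify that the returned candidate really is a majority item:

#include <cstdint>
#include <vector>

// Boyer-Moore Majority: one pass, one stored item and one counter.
// Returns the only possible majority candidate of the sequence.
template <typename T>
T majorityCandidate(const std::vector<T>& stream) {
    T candidate{};
    uint64_t count = 0;
    for (const T& item : stream) {
        if (count == 0) { candidate = item; count = 1; }  // adopt new item
        else if (item == candidate) ++count;              // vote for candidate
        else --count;                                     // pair off a vote
    }
    return candidate;  // must be verified with a second pass
}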


The "Frequent" Algorithm: Twenty years later, the problem of streaming algorithms was an active research area, and a generalization of the Majority algorithm was shown to solve the problem of finding all items in a sequence whose frequency exceeds a 1/k fraction of the total count.14, 18 Instead of keeping a single counter and item from the input, the Frequent algorithm stores k − 1 (item, counter) pairs. The natural generalization of the Majority algorithm is to compare each new item against the stored items T, and increment the corresponding counter if it is among them. Else, if there is some counter with a zero count, it is allocated to the new item, and the counter set to 1. If all k − 1 counters are allocated to distinct items, then all are decremented by 1. A grouping argument is used to argue that any item which occurs more than n/k times must be stored by the algorithm when it terminates. Figure 2 illustrates some of the operations on this data structure. Pseudocode to illustrate this algorithm is given in Algorithm 1, making use of set notation to represent the dictionary operations on the set of stored items T: items are added and removed from this set using set union and set subtraction, respectively, and we allow ranging over the members of this set (any implementation will have to choose how to support these operations). We also assume that each item j stored in T has an associated counter cj. For items not stored in T, then cj is implicitly defined as 0 and does not need to be explicitly stored.

Figure 2. Counter-based data structure: the blue (top) item is already stored, so its count is incremented when it is seen. The green (middle) item takes up an unused counter, then a second occurrence increments it.

Algorithm 1: FREQUENT(k)
  n ← 0; T ← ∅;
  foreach i do
    n ← n + 1;
    if i ∈ T then
      ci ← ci + 1;
    else if |T| < k − 1 then
      T ← T ∪ {i};
      ci ← 1;
    else
      forall j ∈ T do
        cj ← cj − 1;
        if cj = 0 then T ← T \ {j};

In fact, this generalization was first proposed by Misra and Gries as "Algorithm 3"22 in 1982: the papers published in 2002 (which cite Fischer15 but not Misra22) were actually rediscoveries of their algorithm. In deference to its initial discovery, this algorithm has been referred to as the "Misra–Gries" algorithm in more recent work on streaming algorithms. In the same paper, an "Algorithm 2" correctly solves the problem but has only speculated worst-case space bounds. Some works have asserted that the Frequent algorithm does not solve the frequency estimation problem accurately, but this is erroneous. As observed by Bose et al.,5 executing this algorithm with k = 1/e ensures that the count associated with each item on termination is at most e·n below the true value.

The time cost of the algorithm is dominated by the O(1) dictionary operations per update, and the cost of decrementing counts. Misra and Gries use a balanced search tree, and argue that the decrement cost is amortized O(1); Karp et al. propose a hash table to implement the dictionary18; and Demaine et al. show how the cost of decrementing can be made worst-case O(1) by representing the counts using offsets and maintaining multiple linked lists.14
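For concreteness, here is a compact C++ rendering of Algorithm 1 (a baseline sketch of ours, using std::map for T rather than the linked-list structure of Demaine et al., so the decrement step costs O(|T|) rather than amortized O(1)):

#include <cstdint>
#include <map>

// Frequent (Misra-Gries): keeps at most k-1 (item, counter) pairs.
// Any item occurring more than n/k times is guaranteed to be in T.
class Frequent {
    std::map<uint32_t, uint64_t> T;  // stored items and their counters
    size_t k;
public:
    explicit Frequent(size_t k) : k(k) {}
    void update(uint32_t i) {
        auto it = T.find(i);
        if (it != T.end()) {
            ++it->second;                 // i is monitored: increment
        } else if (T.size() < k - 1) {
            T[i] = 1;                     // free counter: allocate to i
        } else {
            for (auto j = T.begin(); j != T.end(); ) {   // decrement all
                if (--j->second == 0) j = T.erase(j);
                else ++j;
            }
        }
    }
    const std::map<uint32_t, uint64_t>& items() const { return T; }
};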
LossyCounting: The LossyCounting algorithm was proposed by Manku and Motwani in 2002,19 in addition to a randomized sampling-based algorithm and techniques for extending from frequent items to frequent itemsets. The algorithm stores tuples which comprise an item, a lower bound on its count, and a "delta" (Δ) value which records the difference between the upper bound and the lower bound. When processing the ith item in the stream, if information is currently stored about the item then its lower bound is increased by one; else, a new tuple for the item is created with the lower bound set to one, and Δ set to ⌊i/k⌋. Periodically, all tuples whose upper bound is less than ⌊i/k⌋ are deleted. These are correct upper and lower bounds on the count of each item, so at the end of the stream, all items whose count exceeds n/k must be stored. As with Frequent, setting k = 1/e ensures that the error in any approximate count is at most e·n. A careful argument demonstrates that the worst-case space used by this algorithm is O(1/e log(en)), and for certain time-invariant input distributions it is O(1/e).

Storing the delta values ensures that highly frequent items which first appear early on in the stream have very accurate approximated counts. But this adds to the storage cost. A variant of this algorithm is presented by Manku in a presentation of the paper,20 which dispenses with explicitly storing the delta values, and instead has all items sharing an implicit value of Δ(i) = ⌊i/k⌋. The modified algorithm stores (item, count) pairs. For each item in the stream, if it is stored, then the count is incremented; otherwise, it is initialized with a count of 1. Every time Δ(i) increases, all counts are decremented by 1, and all items with zero count are removed from the data structure. The same proof suffices to show that the space bound is O(1/e log(en)). This version of the algorithm is quite similar to Algorithm 2 presented in Misra22; but in Manku,20 a space bound is proven. The time cost is O(1) dictionary operations, plus the periodic compress operations which require a linear scan of the stored items. This can be performed once every O(1/e) updates, in which time the number of items stored has at most doubled, meaning that the amortized cost of compressing is O(1). We give pseudocode for this version of the algorithm in Algorithm 2, where again T represents the set of currently monitored items, updated by set operations, and cj are corresponding counts.

Algorithm 2: LOSSYCOUNTING(k)
  n ← 0; Δ ← 0; T ← ∅;
  foreach i do
    n ← n + 1;
    if i ∈ T then
      ci ← ci + 1;
    else
      T ← T ∪ {i};
      ci ← 1 + Δ;
    if ⌊n/k⌋ ≠ Δ then
      Δ ← ⌊n/k⌋;
      forall j ∈ T do
        if cj < Δ then T ← T \ {j};
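The variant without explicit deltas admits an equally short implementation. The following C++ sketch (our own, for illustration) follows Algorithm 2, compressing whenever ⌊n/k⌋ increases:

#include <cstdint>
#include <map>

// LossyCounting (simplified variant of Algorithm 2): all stored items
// implicitly share the delta value floor(n/k).
class LossyCounting {
    std::map<uint32_t, uint64_t> T;
    uint64_t n = 0, delta = 0;
    size_t k;
public:
    explicit LossyCounting(size_t k) : k(k) {}
    void update(uint32_t i) {
        ++n;
        auto it = T.find(i);
        if (it != T.end()) ++it->second;
        else T[i] = 1 + delta;          // new item enters with count 1 + delta
        if (n / k != delta) {           // delta increased: compress
            delta = n / k;
            for (auto j = T.begin(); j != T.end(); ) {
                if (j->second < delta) j = T.erase(j);
                else ++j;
            }
        }
    }
    const std::map<uint32_t, uint64_t>& items() const { return T; }
};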


SpaceSaving: All the deterministic algorithms presented so far have a similar flavor: a set of items and counters are kept, and various simple rules are applied when a new item arrives (as illustrated in Figure 2). The SpaceSaving algorithm introduced in 2005 by Metwally et al. also fits this template.21 Here, k (item, count) pairs are stored, initialized by the first k distinct items and their exact counts. As usual, when the next item in the sequence corresponds to a monitored item, its count is incremented; but when the next item does not match a monitored item, the (item, count) pair with the smallest count has its item value replaced with the new item, and the count incremented. So the space required is O(k) (respectively O(1/e)), and a short proof demonstrates that the counts of all stored items solve the frequency estimation problem with error n/k (respectively e·n). It also shares a useful property with LossyCounting, that items which are stored by the algorithm early in the stream and are not removed have very accurate estimated counts. The algorithm appears in Algorithm 3. The time cost is bounded by the dictionary operation of finding if an item is stored, and of finding and maintaining the item with minimum count. Simple heap implementations can track the smallest count item in O(log 1/e) time per update. When all updates are unitary (+1), a faster approach is to borrow ideas from the Demaine et al. implementation of Frequent, and keep the items in groups with equal counts. By tracking a pointer to the group with smallest count, the find minimum operation takes constant time, while incrementing counts takes O(1) pointer operations (the "Stream-Summary" data structure described by Metwally et al.21).

Algorithm 3: SPACESAVING(k)
  n ← 0; T ← ∅;
  foreach i do
    n ← n + 1;
    if i ∈ T then
      ci ← ci + 1;
    else if |T| < k then
      T ← T ∪ {i};
      ci ← 1;
    else
      j ← arg minj∈T cj;
      ci ← cj + 1;
      T ← T ∪ {i} \ {j};
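For illustration, a minimal C++ version of Algorithm 3 (our own sketch, using a map rather than a heap or the Stream-Summary structure, so the find-minimum step here is a linear scan):

#include <algorithm>
#include <cstdint>
#include <map>

// SpaceSaving: exactly k monitored (item, count) pairs once the
// structure is full; the minimum-count item is evicted on a miss.
class SpaceSaving {
    std::map<uint32_t, uint64_t> T;
    size_t k;
public:
    explicit SpaceSaving(size_t k) : k(k) {}
    void update(uint32_t i) {
        auto it = T.find(i);
        if (it != T.end()) { ++it->second; return; }   // monitored: increment
        if (T.size() < k) { T[i] = 1; return; }        // free slot: take it
        auto j = std::min_element(T.begin(), T.end(),  // smallest counter
            [](const auto& a, const auto& b) { return a.second < b.second; });
        uint64_t c = j->second;
        T.erase(j);
        T[i] = c + 1;        // new item inherits the evicted count, plus one
    }
    const std::map<uint32_t, uint64_t>& items() const { return T; }
};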
3.2. Sketch algorithms
Here, we use the term "sketch" to denote a data structure which can be thought of as a linear projection of the input. That is, if we imagine the stream as implicitly defining a vector whose ith entry is fi, the sketch is the product of this vector with a matrix. For the algorithm to use small space, this matrix will be implicitly defined by a small number of bits. The sketch algorithms described here use hash functions to define a (very sparse) linear projection. Both views (hashing or linear projection) can be helpful in explaining the methods, and it is usually possible to alternate between the two without confusion. Because of their linearity, it follows immediately that updates with negative values can easily be accommodated by such sketching methods. This allows us to model the removal of items (to denote the conclusion of a packet flow; or the return of a previously bought item, say) as an update with negative weight.

The two sketch algorithms outlined below solve the frequency estimation problem. They need additional information to solve the frequent items problem, so we also describe algorithms which augment the stored sketch to find frequent items quickly. The algorithms are randomized, which means that in addition to the accuracy parameter e, they also take a failure probability d so that (over the random choices made in choosing the hash functions) the probability of failure is at most d. Typically, d can be chosen to be very small (e.g., 10^−6) while keeping the space used by the sketch low.

CountSketch: The first sketch in the sense that we use the term was the AMS or Tug-of-war sketch due to Alon et al.1 This was used to estimate the second frequency moment, F2 = Σi fi². It was subsequently observed that the same data structure could be used to estimate the inner product of two frequency distributions, i.e., Σi fi fi′ for two distributions given (in a stream) by fi and fi′. But this means that if fi is defined by a stream, at query time we can find the product with f′ defined by fi′ = 1 and fj′ = 0 for all j ≠ i. Then, the true answer to the inner product should be exactly fi. The error guaranteed by the sketch turns out to be e√F2 with probability of at least 1 − d for a sketch of size O(1/e² log(1/d)). The ostensibly dissimilar technique of "Random Subset Sums"16 (on close inspection) turns out to be isomorphic to this instance of the algorithm. Maintaining the AMS data structure is slow, since it requires updating the whole sketch for every new item in the stream. The CountSketch algorithm of Charikar et al.9 dramatically improves the speed by showing that the same underlying technique works if each update only affects a small subset of the sketch, instead of the entire summary. The sketch consists of a two-dimensional array C with d rows of w counters each. There are two hash functions for each row, hj which maps input items onto [w], and gj which maps input items onto {−1, +1}. Each input item i causes gj(i) to be added on to entry C[j, hj(i)] in row j, for 1 ≤ j ≤ d. For any row j, the value gj(i)·C[j, hj(i)] is an unbiased estimator for fi. The estimate f̂i is the median of these estimates over the d rows. Setting w = O(1/e²) and d = O(log(1/d)) ensures that f̂i has error at most e√F2 with probability of at least 1 − d. This guarantee requires that the hash functions are chosen randomly from a family of "four-wise independent" hash functions.24 The total space used is O(1/e² log(1/d)), and the time per update is O(log(1/d)) worst case.

Figure 3. Sketch data structure: each new item is mapped to a set of counters, which are incremented.

Figure 3 shows a schematic of the data structure under the update procedure: the new item i gets mapped to a different location in each row, where gj(i) is added on to the current counter value in that location. Pseudocode for the core of the update algorithm is shown in Algorithm 4.

Algorithm 4: COUNTSKETCH(w, d)
  C[1, 1] … C[d, w] ← 0;
  for j ← 1 to d do
    Initialize gj, hj;
  foreach i do
    n ← n + 1;
    for j ← 1 to d do
      C[j, hj(i)] ← C[j, hj(i)] + gj(i);

CountMin Sketch: The CountMin sketch algorithm of Cormode and Muthukrishnan12 can be described in similar terms to CountSketch. An array of d × w counters is maintained, along with d hash functions hj. Each update is mapped onto d entries in the array, each of which is incremented. Now f̂i = min1≤j≤d C[j, hj(i)]. The Markov inequality is used to show that the estimate for each j overestimates by less than n/w, and repeating d times reduces the probability of error exponentially. So setting w = O(1/e) and d = O(log(1/d)) ensures that f̂i has error at most e·n with probability of at least 1 − d. Consequently, the space is O(1/e log(1/d)) and the time per update is O(log(1/d)). The data structure and update procedure is consequently much like that illustrated for the CountSketch in Figure 3, with gj(i) always equal to 1. The update algorithm is shown in Algorithm 5.

Algorithm 5: COUNTMIN(w, d)
  C[1, 1] … C[d, w] ← 0;
  for j ← 1 to d do
    Initialize hj;
  foreach i do
    n ← n + 1;
    for j ← 1 to d do
      C[j, hj(i)] ← C[j, hj(i)] + 1;
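To make the update and point query concrete, here is a small self-contained C++ CountMin sketch (our own illustration; the simple multiply-shift style hash below stands in for the pairwise-independent hash functions assumed by the analysis):

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// CountMin sketch: d rows of w counters; each update touches one
// counter per row, and a point query returns the minimum over rows.
class CountMin {
    size_t w, d;
    std::vector<std::vector<uint64_t>> C;
    std::vector<uint64_t> a, b;              // per-row hash parameters
    size_t h(size_t j, uint64_t i) const {   // row-j hash onto [w]
        return ((a[j] * i + b[j]) >> 33) % w;
    }
public:
    CountMin(size_t w, size_t d) : w(w), d(d), C(d, std::vector<uint64_t>(w)) {
        std::mt19937_64 g(42);
        for (size_t j = 0; j < d; ++j) { a.push_back(g() | 1); b.push_back(g()); }
    }
    void update(uint64_t i, uint64_t count = 1) {
        for (size_t j = 0; j < d; ++j) C[j][h(j, i)] += count;
    }
    uint64_t estimate(uint64_t i) const {    // never underestimates fi
        uint64_t est = UINT64_MAX;
        for (size_t j = 0; j < d; ++j) est = std::min(est, C[j][h(j, i)]);
        return est;
    }
};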



Finding Frequent Items Using a Hierarchy: Since sketches solve the case when item frequencies can decrease, more complex algorithms are needed to find the frequent items. Here, we assume a "strict" case, where negative updates are possible but no negative frequencies are allowed. In this strict case, an approach based on divide-and-conquer will work: additional sketches are used to determine which ranges of items are frequent.12 If a range is frequent, then it can be split into b nonoverlapping subranges and the frequency of each subrange estimated from an appropriate sketch, until a single item is returned. The choice of b trades off update time against query time: if all items i ∈ {1 … U}, then ⌈logb U⌉ sketches suffice, but each potential range is split into b > 1 subranges when answering queries. Thus, updates take O(logb U) hashing operations, and O(1) counter updates for each hash. Typically, moderate constant values of b are used (between 2 and 256, say); choosing b to be a power of two allows fast bit-shifts to be used in query and update operations instead of slower divide and mod operations. This results in the CountMin sketch Hierarchical and CountSketch Hierarchical algorithms.
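As a rough illustration of the divide-and-conquer search (our own sketch, with b = 2 and assuming a CountMin class like the one above, one sketch per level of a binary hierarchy over the universe):

#include <cstdint>
#include <vector>

// Hierarchical heavy-hitter search with branching factor b = 2.
// levels[l] summarizes the stream with item identifiers truncated to
// their top l bits (on an update of item x, levels[l] is updated with
// x >> (logU - l)); level 0 is a single range covering the universe.
std::vector<uint64_t> frequentItems(const std::vector<CountMin>& levels,
                                    int logU, uint64_t threshold) {
    std::vector<uint64_t> ranges = {0};   // bit-prefixes that may be frequent
    for (int l = 1; l <= logU; ++l) {
        std::vector<uint64_t> next;
        for (uint64_t r : ranges)
            for (uint64_t child : {2 * r, 2 * r + 1})      // split range in two
                if (levels[l].estimate(child) >= threshold)
                    next.push_back(child);
        ranges = std::move(next);
    }
    return ranges;   // full item identifiers whose estimate passes the threshold
}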
Finding Frequent Items Using Group Testing: An alternate approach is based on "combinatorial group testing" (CGT), which randomly divides the input into buckets so that we expect at most one frequent item in each group. Within each bucket, the items are divided into subgroups so that the "weight" of each group indicates the identity of the frequent item. For example, separating the counts of the items with odd identifiers and even identifiers will indicate whether the heavy item is odd or even; repeating this for all bit positions reveals the full identity of the item. This can be seen as an extension of the CountMin sketch, since the structure resembles the buckets of the sketch, with additional information on subgroups of each bucket (based on the binary representation of items falling in the bucket); further, the analysis and properties are quite close to those of a Hierarchical CountMin sketch. This increases the space to O(1/e log U log(1/d)) when the binary representation takes log U bits. Each update requires O(log(1/d)) hashes as before, and updating O(log U) counters per hash.

4. EXPERIMENTAL COMPARISON

4.1. Setup
We compared these algorithms under a common implementation framework to test as accurately as possible their relative performance. All algorithms were implemented using C++, and used common subroutines for similar tasks (e.g., hash tables) to increase comparability. We ran experiments on a machine with four dual-core Intel(R) Xeon(R) 2.66 GHz CPUs and 16GB of RAM, running Windows 2003 Server. The code was compiled using Microsoft's Visual C++ 2005 compiler and g++ 3.4.4 on cygwin. We did not observe significant differences between the two compilers. We report here results obtained using Visual C++ 2005. The code is available from http://www.research.att.com/~marioh/frequent-items/.

For every algorithm we tested a number of implementations, using different data structures to implement the basic set operations. For some algorithms the most robust implementation choice was clear; for others we present results of competing solutions. For counter-based algorithms we examine: Frequent using the Demaine et al. implementation technique of linked lists (F), LossyCounting keeping separate delta values for each item (LCD), LossyCounting without deltas (LC), SpaceSaving using a heap (SSH), and SpaceSaving using linked lists (SSL). We also examine sketch-based methods: hierarchical CountSketch (CS), hierarchical CountMin sketch (CMH), and the CGT variant of CountMin.

We ran experiments using 10 million packets of HTTP traffic, representing 24 hours of traffic from a backbone router in a major network. Experiments on other real and synthetic datasets are shown in an extended version of this article.10 We varied the frequency threshold f from 0.0001 to 0.01. In our experiments, we set the error guarantee e = f, since our results showed that this was sufficient to give high accuracy in practice.

We compare the efficiency of the algorithms with respect to:

• Update throughput, measured in number of updates per millisecond.
• Space consumed, measured in bytes.
• Recall, measured as the total number of true heavy hitters reported over the number of true heavy hitters given by an exact algorithm.
• Precision, measured as the total number of true heavy hitters reported over the total number of answers reported. Precision quantifies the number of false positives reported.
• Average relative error of the reported frequencies: We measure separately the average relative error of the frequencies of the true heavy hitters, and the average relative error of the frequencies of the false positive answers. Let the true frequency of an item be f and the measured frequency f̃. The absolute relative error is defined by |f − f̃|/f. We average the absolute relative errors over all measured frequencies.

For all of the above, we perform 20 runs per experiment (by dividing the input data into 20 chunks and querying the algorithms once at the end of each run). We report averages on all graphs, along with the 5th and 95th percentiles as error bars.


4.2. Counter-based algorithms
Space and Time Costs: Figure 4(a) shows the update throughput of the algorithms as a function of increasing frequency threshold (f). The SpaceSaving and Frequent algorithms are fastest, while the two variations of LossyCounting are appreciably slower. On this data set, SSL and SSH are equally very fast, but on some other data sets SSL was significantly faster than SSH, showing how data structure choices can affect performance. The range of frequency thresholds (f) considered did not affect update throughput (notice the log scale on the horizontal axis). The space used by these algorithms at the finest accuracy level was less than 1MB. SSL used 700KB for f = 0.0001, while the other algorithms all required approximately 400KB. Since the space cost varies with 1/f, for f = 0.01, the cost was 100 times less, i.e., a matter of kilobytes. This range of sizes is small enough to fit within a modern second level cache, so there is no obvious effect due to crossing memory boundaries on the architectures tested on. A naive solution that maintains one counter per input item would consume many megabytes (and this grows linearly with the input size). This is at least 12 times larger than SSH for f = 0.0001 (which is the most robust algorithm in terms of space), and over a thousand times larger than all algorithms for f = 0.01. Clearly, the space benefit of these algorithms, even for small frequency thresholds, is substantial in practice.

Precision, Recall, and Error: All algorithms tested guarantee perfect recall (they will recover every item that is frequent). Figure 4(b) plots the precision. We also show the 5th and 95th percentiles in the graphs as error bars. Precision is the total number of true answers returned over the total number of answers. Precision is an indication of the number of false positives returned. Higher precision means a smaller number of false positive answers. There is a clear distinction between different algorithms in this case. When using e = f, F results in a very large number of false positive answers, while LC and LCD result in approximately 50% false positives for small f parameters, but their precision improves as skewness increases. Decreasing e relative to f would improve this at the cost of increasing the space used. However, SSL and SSH yield 100% accuracy in all cases (i.e., no false positives), with about the same or better space usage. Note that these implement the same algorithm and so have the same output, only differing in the underlying implementation of certain data structures. Finally, notice that by keeping additional per-item information, LCD can sometimes distinguish between truly frequent and potentially frequent items better than LC.

Figure 4(c) plots the average relative error in the frequency estimation of the truly frequent items. The graph also plots the 5th and 95th percentiles as error bars. The relative error of F decreases with f, while the error of LossyCounting increases with f. Note that F always returns an underestimate of the true count of any item; LC and LCD always return overestimates based on a Δ value, and so yield inflated estimates of the frequencies of infrequent items.

Conclusions: Overall, the SpaceSaving algorithm appears conclusively better than other counter-based algorithms, across a wide range of data types and parameters. Of the two implementations compared, SSH exhibits very good performance in practice. It yields very good estimates, typically achieving 100% recall and precision, consumes very small space, and is fairly fast to update (faster than LC and LCD). Alternatively, SSL is the fastest algorithm with all the good characteristics of SSH, but consumes twice as much space on average. If space is not a critical issue, SSL is the implementation of choice.

Figure 4. Performance of counter-based algorithms on real network data: (a) update throughput vs. f, (b) precision vs. f, and (c) average relative error vs. f of the frequent items (algorithms F, LC, LCD, SSL, and SSH; log scale on f).

4.3. Sketch algorithms
The advantage of sketches is that they support deletions, and hence are the only alternative in fully dynamic environments. This comes at the cost of increased space consumption and slower update performance. We used a hierarchy with branching factor b = 16 for all algorithms, after running experiments with several values and choosing the best trade-off between speed, size, and precision. The sketch depth is set to d = 4 throughout, and the width to w = 2/f, based on the analysis of the CountMin sketch. This keeps the space usage of CS and CMH relatively similar, and CGT is larger by constant factors.


Figure 5. Performance of sketch algorithms on real network data: (a) update throughput vs. f, (b) size vs. f, and (c) precision vs. f (algorithms CS, CMH, and CGT; log scale on f).

Space and Time Cost: Figure 5(a) shows the update throughput of the algorithms. Update throughput is mostly unaffected by variations in f, though CMH does seem to become slower for larger values of f. CS has the slowest update rate among all algorithms, due to the larger number of hashing operations needed. Still, the fastest sketch algorithm is from 5 up to 10 times slower than the fastest counter-based algorithm. Figure 5(b) plots the space consumed. The size of the sketches is fairly large compared to counter-based algorithms: of the order of several megabytes for small values of f. CMH is the most space efficient sketch and still consumes space three times as large as the least space efficient counter-based algorithm.

Precision, Recall, and Error: The sketch algorithms all have near perfect recall, as is the case with the counter-based algorithms. Figure 5(c) shows that they also have good precision, with CS reporting the largest number of false positives. Nevertheless, on some other datasets we tested (not shown here), the results were reversed. We also tested the average relative error in the frequency estimation of the truly frequent items. For sufficiently skewed distributions all algorithms can estimate item frequencies very accurately, and the results from all sketches were similar since all hierarchical sketch algorithms essentially correspond to a single instance of a CountSketch or CountMin sketch of equal size.

Conclusions: There is no clear winner among the sketch algorithms. CMH has small size and high update throughput, but is only accurate for highly skewed distributions. CGT consumes a lot of space but it is the fastest sketch and is very accurate in all cases, with high precision and good frequency estimation accuracy. CS has low space consumption and is very accurate in most cases, but has slow update rate and exhibits some random behavior.

5. CONCLUDING REMARKS
We have attempted to present algorithms for finding frequent items in streams, and give an experimental comparison of their behavior to serve as a baseline for comparison. For insert-only streams, the clear conclusion of our experiments is that the SpaceSaving algorithm, a relative newcomer, has surprisingly clear benefits over others. We observed that implementation choices, such as whether to use a heap or lists of items grouped by frequencies, trade off speed and space. For sketches to find frequent items over streams including negative updates, there is not such a clear answer, with different algorithms excelling at different aspects of the problem. We do not consider this the end of the story, and continue to experiment with other implementation choices. Our source code and experimental test scripts are available from http://www.research.att.com/~marioh/frequent-items/ so that others can use these as baseline implementations.

We conclude by outlining some of the many variations of the problem:

• In the weighted input case, each update comes with an associated weight (such as a number of bytes, or number of units sold). Here, sketching algorithms directly handle weighted updates because of their linearity. The SpaceSaving algorithm also extends to the weighted case, but this is not known to be the case for the other counter-based algorithms discussed.
• In the distributed data case, different parts of the input are seen by different parties (different routers in a network, or different stores making sales). The problem is then to find items which are frequent over the union of all the inputs. Again due to their linearity properties, sketches can easily solve such problems. It is less clear whether one can merge together multiple counter-based summaries to obtain a summary with the same accuracy and worst-case space bounds.
• Often, the item frequencies are known to follow some statistical distribution, such as the Zipfian distribution. With this assumption, it is sometimes possible to prove a smaller space requirement on the algorithm, as a function of the amount of "skewness" in the distribution.9, 21
• In some applications, it is important to find how many distinct observations there have been, leading to a distinct heavy hitters problem. Now the input stream S is of the form (i, j), and fi is defined as |{j | (i, j) ∈ S}|. Multiple occurrences of (i, j) only count once towards fi. Techniques for "distinct frequent items" rely on combining frequent items algorithms with "count distinct" algorithms.25
• While processing a long stream, it may be desirable to weight more recent items more heavily than older ones. Various models of time decay have been proposed to achieve this. In a sliding window, only the most recent items are considered to define the frequent items.2 More generally, time decay can be formalized via a function which assigns a weight to each item in the stream as a function of its (current) age, and the frequency of an item is the sum of its decayed weights.

Each of these problems has also led to considerable effort from the research community to propose and analyze algorithms. This research is ongoing, cementing the position of the frequent items problem as one of the most enduring and intriguing in the realm of algorithms for data streams.


References
1. Alon, N., Matias, Y., Szegedy, M. The space complexity of approximating the frequency moments. In ACM Symposium on Theory of Computing (1996), 20–29. Journal version in J. Comp. Syst. Sci. 58 (1999), 137–147.
2. Arasu, A., Manku, G.S. Approximate counts and quantiles over sliding windows. In ACM Principles of Database Systems (2004).
3. Bhattacharrya, S., Madeira, A., Muthukrishnan, S., Ye, T. How to scalably skip past streams. In Scalable Stream Processing Systems (SSPS) Workshop with ICDE 2007 (2007).
4. Bhuvanagiri, L., Ganguly, S., Kesh, D., Saha, C. Simpler algorithm for estimating frequency moments of data streams. In ACM-SIAM Symposium on Discrete Algorithms (2006).
5. Bose, P., Kranakis, E., Morin, P., Tang, Y. Bounds for frequency estimation of packet streams. In SIROCCO (2003).
6. Boyer, R.S., Moore, J.S. A fast majority vote algorithm. Technical Report ICSCA-CMP-32, Institute for Computer Science, University of Texas (Feb. 1981).
7. Boyer, R.S., Moore, J.S. MJRTY—a fast majority vote algorithm. In Automated Reasoning: Essays in Honor of Woody Bledsoe, Automated Reasoning Series. Kluwer Academic Publishers, 1991, 105–117.
8. Chakrabarti, A., Cormode, G., McGregor, A. A near-optimal algorithm for computing the entropy of a stream. In ACM-SIAM Symposium on Discrete Algorithms (2007).
9. Charikar, M., Chen, K., Farach-Colton, M. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002).
10. Cormode, G., Hadjieleftheriou, M. Finding frequent items in data streams. In International Conference on Very Large Data Bases (2008).
11. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D. Holistic UDAFs at streaming speeds. In ACM SIGMOD International Conference on Management of Data (2004), 35–46.
12. Cormode, G., Muthukrishnan, S. An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55, 1 (2005), 58–75.
13. Datar, M., Gionis, A., Indyk, P., Motwani, R. Maintaining stream statistics over sliding windows. In ACM-SIAM Symposium on Discrete Algorithms (2002).
14. Demaine, E., López-Ortiz, A., Munro, J.I. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms (ESA) (2002).
15. Fischer, M., Salzburg, S. Finding a majority among n votes: Solution to problem 81-5. J. Algorithms 3, 4 (1982), 376–379.
16. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M. How to summarize the universe: Dynamic maintenance of quantiles. In International Conference on Very Large Data Bases (2002), 454–465.
17. Jayram, T.S., McGregor, A., Muthukrishnan, S., Vee, E. Estimating statistical aggregates on probabilistic data streams. In ACM Principles of Database Systems (2007).
18. Karp, R., Papadimitriou, C., Shenker, S. A simple algorithm for finding frequent elements in sets and bags. ACM Trans. Database Syst. 28 (2003), 51–55.
19. Manku, G., Motwani, R. Approximate frequency counts over data streams. In International Conference on Very Large Data Bases (2002).
20. Manku, G.S. Frequency counts over data streams. http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S10P03slides.pdf (2002).
21. Metwally, A., Agrawal, D., Abbadi, A.E. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory (2005).
22. Misra, J., Gries, D. Finding repeated elements. Sci. Comput. Programming 2 (1982), 143–152.
23. Pike, D., Dorward, S., Griesemer, R., Quinlan, S. Interpreting the data: Parallel analysis with sawzall. Dyn. Grids Worldwide Comput. 13, 4 (2005), 277–298.
24. Thorup, M., Zhang, Y. Tabulation-based 4-universal hashing with applications to second moment estimation. In ACM-SIAM Symposium on Discrete Algorithms (2004).
25. Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A. New streaming algorithms for fast detection of superspreaders. In Network and Distributed System Security Symposium NDSS (2005).

Graham Cormode and Marios Hadjieleftheriou ({graham,marioh}@research.att.com), AT&T Labs—Research, Florham Park, NJ.

© 2009 ACM 0001-0782/09/1000 $10.00
CAREERS

Open Positions at INRIA
For Faculty and Research Scientists

INRIA is a French public research institute in Computer Science and Applied Mathematics. It is an outstanding and highly visible scientific organization, a major player in European research and development programs. INRIA has eight research centers in Paris, Bordeaux, Grenoble, Lille, Nancy, Nice - Sophia Antipolis, Rennes and Saclay. These centers host more than 170 groups in partnership with universities and other research organizations. INRIA focuses the activity of over 1200 researchers and faculty members, 1200 PhD students and 500 post-docs and engineers, on fundamental research at the best international level, as well as on development and transfer activities in the following areas:
• Modeling, simulation and optimization of complex dynamic systems
• Formal methods in programming secure and reliable computing systems
• Networks and ubiquitous information, computation and communication systems
• Vision and human-computer interaction modalities, virtual worlds and robotics
• Computational Engineering, Computational Sciences and Computational Medicine

In 2010, INRIA will be opening several positions in its 8 research centers across France:
• Junior and senior level positions,
• Tenured and tenure-track positions,
• Research and joint faculty positions with universities

These positions cover all the above research areas. INRIA centers provide outstanding scientific environments and excellent working conditions. The institute welcomes applications from all nationalities. It offers competitive salaries and social benefit programs. French schooling, medical coverage and social programs are highly regarded. Visa and working permits for the applicant and the spouse will be provided.
Calendar and detailed application information at: http://www.inria.fr/travailler/index.en.html
More about our research groups: http://www.inria.fr/recherche/equipes/index.en.html
More about INRIA: www.inria.fr

Japan Advanced Institute of Science and Technology (JAIST)
Assistant Professor

Japan Advanced Institute of Science and Technology invites applicants for an Assistant Professor position in the field of knowledge media in the SCHOOL OF KNOWLEDGE SCIENCE. The appointee is expected to start her/his academic and educational activities at JAIST as early as possible after 1st April, 2010. Candidates have to be highly competent in conducting research and education in the areas of human science, i.e., anthropometry, ergonomics, human factors, brain science and/or skill science.
1. Mission: A primary mission of this position is to promote international research and education in the field of knowledge science based on the human science.
2. Qualification: Applicants have to hold a Ph.D. degree, and be qualified for highly scientific activities by participating in domestic and international research initiatives. The search committee may ask candidates to present research activities together with a research plan in Japanese and/or English.
3. Contract: 5 years (extendable)
4. Selection: The search committee shall evaluate the candidates' expertise, research activities, and teaching skills. The evaluation result of each applicant will not be released after the selection. The evaluation procedure is strictly impartial, unbiased, and fair. An Equal Opportunity Employer, JAIST values diversity and strongly encourages applications from foreigners and women.
5. Applications: Details can be found at http://www.jaist.ac.jp/jimu/syomu/koubo/knowledge_media-e.htm
6. Deadline: Applications must be received no later than November 30, 2009.
For more information, please contact: http://www.jaist.ac.jp/index-e.html

KAIST
Professor

KAIST, the top Science & Engineering University in Korea, invites applications for tenure-track positions at all levels in Computer Science. We welcome applications from outstanding candidates in all major fields of computer science and its interdisciplinary areas.
KAIST offers a competitive start-up research fund and joint appointment with KAIST Institutes, which will expand opportunities in interdisciplinary research and funding. KAIST also provides housing for five years. KAIST is committed to increasing the number of female and non-Korean faculty members.
The required qualification is a Ph.D. or an equivalent degree in computer science or a closely related field by the time of appointment. Strong candidates who are expected to receive the Ph.D. degree within a year can also be offered an appointment. Applicants must demonstrate strong research potential and commitment to teaching. KAIST attracts nationwide top students pursuing B.S., M.S., and Ph.D. degrees. The teaching load is three hours per semester.
For more information on available positions, please visit our website: http://cs.kaist.ac.kr/service/employment.cs

National Taiwan University
Professor / Associate Professor / Assistant Professor

The Department of Computer Science and Information Engineering, the Graduate Institute of Networking and Multimedia, and the Graduate Institute of Biomedical Electronics and Bioinformatics, all of National Taiwan University, have faculty openings at all ranks beginning in February 2010. Highly qualified candidates in all areas of computer science/engineering and bioinformatics are invited to apply. A Ph.D. or its equivalent is required. Applicants are expected to conduct outstanding research and be committed to teaching. Candidates should send a curriculum vitae, three letters of reference, and supporting materials before October 15, 2009, to Prof Kun-Mao Chao, Department of Computer Science and Information Engineering, National Taiwan University, No 1, Sec 4, Roosevelt Rd., Taipei 106, Taiwan.

NEC Laboratories America, Inc.
Research Staff Members - Data Management

NEC Laboratories America, Inc. (www.nec-labs.com) is a vibrant industrial research center renowned for technical excellence and high-impact innovations that conducts research and development in support of global businesses by building upon NEC's 100-year history of innovation. Our research programs cover a wide range of technology areas and maintain a balanced mix of fundamental and applied research as they focus on innovations which are ripe for technical breakthrough. Our progressive environment provides exposure to industry-leading technologies and nurtures close collaborations with leading research universities and institutions. Our collaborative atmosphere, commitment to developing talent, and extremely competitive benefits ensure that we attract the sharpest minds in their respective fields. NEC Labs is headquartered in Princeton, NJ and has a second location in Cupertino, CA.
We are seeking researchers to join our Data Management group in Cupertino, CA. The current research focus of the group is to create cutting-edge technologies for Data Management in the Cloud. Candidates must have a Ph.D. in Computer Science (or related fields) with a solid data management background and a strong publication record in related areas, must be proactive with a "can-do" attitude, and able to conduct research independently. Experience in Cloud Computing, SaaS, and Service Oriented Computing areas is a major plus.
The requirements for one position are:
• Deep understanding of data management systems and database internals
• Strong hands-on system building and prototyping skills
• Experience in distributed data management
• Good knowledge of emerging data models and data processing techniques (e.g., Key/Value Stores, Column-Oriented Databases, MapReduce, etc.)
• Knowledge of middleware technologies
For consideration, please forward a resume and research statement to [email protected] and reference "DM-R1" in the subject line.
For another position, the researcher will create new models to capture, analyze, and predict the state of the data management systems deployed in a cloud environment, and combine the insights provided by those models with the database internals to deliver leading-edge data management technologies for unparalleled efficiency gains. The requirements are:
• Demonstrated knowledge of statistical and probabilistic models in large scale data and system analysis
• Strong experience in data mining and data analytics
• Good hands-on system building and prototyping skills
• Experience in data warehousing
For consideration, please forward a resume and research statement to [email protected] and reference "DM-R2" in the subject line.
EOE/AA/MFDV

Nova Southeastern University
Graduate School of Computer and Information Sciences Dean

Nova Southeastern University (NSU) invites applications for Dean of its Graduate School of Computer and Information Sciences. The School offers a unique mix of innovative M.S. and Ph.D. programs in computer science, information systems, information security, and educational technology.
As the chief academic and administrative officer of the Graduate School of Computer and Information Sciences (GSCIS), the Dean will be responsible for leadership of the school's academic and administrative affairs. The Dean will provide innovative vision and leadership in order to maintain and advance the stature of the GSCIS. The Dean will foster and enhance the multidisciplinary structure of the GSCIS that includes the Computer Science, Information Systems, and Information Technology in Education disciplines. The Dean will ensure that quality educational services are provided to students.
Qualifications include a doctoral degree in computer science, information systems, or a related field. Candidates should have the ability to work with faculty in their continued pursuit of academic excellence and a shared vision towards preeminence in research and scholarship. Candidates should have experience related to graduate education, a demonstrated record of developing and facilitating research, and senior administrative experience. Candidates should have a sophisticated knowledge of the use of technology in the delivery of education and distance learning and/or hybrid curricula.
Located on a beautiful 330-acre campus in Fort Lauderdale, Florida, NSU has more than 28,000 students and is the sixth largest independent, not-for-profit university in the United States. NSU awards associate's, bachelor's, master's, educational specialist, doctoral, and first-professional degrees in more than 100 disciplines. It has a college of arts and sciences and schools of medicine, dentistry, pharmacy, allied health and nursing, optometry, law, computer and information sciences, psychology, education, business, oceanography, and humanities and social sciences.
Applications should be submitted online at www.nsujobs.com for position #997648.
Visit our websites: www.nova.edu & http://scis.nova.edu
Nova Southeastern University is an Equal Opportunity/Affirmative Action Employer.

Swarthmore College
Assistant or Associate Professor of Engineering

Swarthmore College invites applications for a tenure-track appointment as Assistant or Associate Professor of Engineering in the area of Computer Engineering, to begin in September 2010. A doctorate in Computer or Electrical Engineering or a related field is required, along with strong interests in undergraduate teaching and in developing a laboratory research program involving undergraduates. Teaching responsibilities include robotics and digital design and elective courses in the candidate's area of specialization, examples of which could include image processing/vision, embedded systems, graphics, and other areas related to computer hardware. Supervision of student research and senior design projects as well as student advising is required. Sabbatical leave with support is available every fourth year.
Swarthmore College is an undergraduate liberal arts institution with 1,500 students on a suburban arboretum campus 11 miles southwest of Philadelphia. Eight faculty in the Department of Engineering offer a rigorous ABET-accredited program for the BS in Engineering. The department has an endowed equipment budget and there is support for faculty-student collaborative research. For program details see http://engin.swarthmore.edu. Interested candidates should submit a cv, brief statements describing teaching philosophy and research interests, and undergraduate and graduate transcripts, along with three letters of reference, to: Chair, Department of Engineering, Swarthmore College, 500 College Avenue, Swarthmore, PA 19081, or to [email protected] with the word "candidate" in the subject line, by December 2009. Swarthmore College is an equal opportunity employer; women and minority candidates are strongly encouraged to apply.

Texas State University-San Marcos
Department of Computer Science

Applications are invited for a tenure-track position at the rank of Assistant, Associate or Professor. Consult the department recruiting page at http://www.cs.txstate.edu/recruitment/ for job duties, required and preferred qualifications, application procedures, and information about the university and the department.
Texas State University-San Marcos is an equal opportunity educational institution and as such does not discriminate on grounds of race, religion, sex, national origin, age, physical or mental disabilities, or status as a disabled or Vietnam era veteran. Texas State is committed to increasing the number of women and minorities in faculty and senior administrative positions. Texas State University-San Marcos is a member of the Texas State University System.

Toyota Technological Institute at Chicago (TTI-C)
Computer Science at TTI-Chicago
Faculty Positions at All Levels

Toyota Technological Institute at Chicago (TTI-C) is a philanthropically endowed degree-granting institute for computer science located on the University of Chicago campus. The Institute is expected to soon reach a steady-state of 12 traditional faculty (tenure and tenure track), and 12 limited term faculty. Applications are being accepted in all areas, but we are particularly interested in:
• Theoretical computer science
• Speech processing
• Machine learning
• Computational linguistics
• Computer vision
• Scientific computing
• Programming languages
Positions are available at all ranks, and we have a large number of limited term positions currently available.
For all positions we require a Ph.D. degree or Ph.D. candidacy, with the degree conferred prior to date of hire. Submit your application electronically at: http://ttic.uchicago.edu/facapp/
Toyota Technological Institute at Chicago is an Equal Opportunity Employer.

University of Chicago
Professor, Associate Professor, Assistant Professor, and Instructor

The Department of Computer Science at the University of Chicago invites applications from exceptionally qualified candidates in all areas of Computer Science for faculty positions at the ranks of Professor, Associate Professor, Assistant Professor, and Instructor. The University of Chicago has the highest standards for scholarship and faculty quality, and encourages collaboration across disciplines.
The Chicago metropolitan area provides a diverse and exciting environment. The local economy is vigorous, with international stature in banking, trade, commerce, manufacturing, and transportation, while the cultural scene includes diverse cultures, vibrant theater, world-renowned symphony, opera, jazz, and blues. The University is located in Hyde Park, a pleasant Chicago neighborhood on the Lake Michigan shore.
All applicants must apply through the University's Academic Jobs website, http://academiccareers.uchicago.edu/applicants/Central?quickFind=50533. A cover letter, curriculum vitae including a list of publications, a statement describing past and current research accomplishments and outlining future research plans, a description of teaching experience, and a list of references must be uploaded to be considered as an applicant. Candidates may also post a representative set of publications to this website. The reference letters can be sent by mail or e-mail to:
Chair, Department of Computer Science
The University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637-1581
Or to: [email protected] (attachments can be in pdf, postscript or Microsoft Word).
Please note that at least three reference letters need to be mailed or e-mailed to the above addresses and one of them must address the candidate's teaching ability. Applicants must have completed all requirements for the PhD except the dissertation at time of application, and must have completed all requirements for the PhD at time of appointment. The PhD should be in Computer Science or a related field such as Mathematics or Statistics. To ensure full consideration of your application all materials [and letters] must be received by November 15. Screening will continue until all available positions are filled. The University of Chicago is an Affirmative Action/Equal Opportunity Employer.

University of Detroit Mercy
Computer Science / Software Engineering

University of Detroit Mercy is seeking a faculty member for a tenure track position in Computer Science or Software Engineering at the Assistant Professor level. A Ph.D. is required. Deadline Sep 30, 2009. More information is available at http://eng-sci.udmercy.edu/opportunities/index.htm

University of Geneva – CS Dept.
Associate/Assistant Professor

The CS Dept. of the University of Geneva announces the opening of one Associate/Assistant Professor position.
The candidate must have a research and teaching record in one or more of the following areas: management and modelling of trust, security and privacy when handling multimedia data, knowledge management and information search, massively distributed information retrieval.
Documents to apply: The cv and the list of publications, at the top of which the candidate indicates 5 or 6 publications which are the most representative of his/her work (two copies of these publications must be included). A certified copy of the highest diploma obtained by the candidate, or a certificate of the institution that delivered it, must be joined to the application. The candidate must provide 3 letters of reference. The application with the required documents must be sent before the 30th of September 2009 by email as one file only to [email protected], to whom any enquiry regarding this position should be addressed.

University of Illinois at Chicago
Assistant Professor/Associate Professor

University of Illinois at Chicago, Department of Mathematics, Statistics, and Computer Science. The Department has active research programs in a broad spectrum of centrally important areas of pure mathematics, computational and applied mathematics, combinatorics, mathematical computer science and scientific computing, probability and statistics, and mathematics education. See http://www.math.uic.edu for more information.
Applications are invited for tenure track Assistant Professor or tenured Associate Professor positions, effective August 16, 2010. Preference will be given to applicants in statistics and related areas, but outstanding applicants in all specialties will be considered. Final authorization of the position is subject to the availability of state funding.
Applicants must have a Ph.D. or equivalent degree in mathematics, computer science, statistics, mathematics education or related field, an outstanding research record, and evidence of strong teaching ability. The salary is negotiable.
Send vita and at least three (3) letters of recommendation, clearly indicating the position being applied for, to: Appointments Committee; Dept. of Mathematics, Statistics, and Computer Science; University of Illinois at Chicago; 851 S. Morgan (m/c 249); Box T; Chicago, IL 60607. Applications through mathjobs.org are encouraged. No e-mail applications will be accepted. To ensure full consideration, materials must be received by November 16, 2009. However, we will continue considering candidates until all positions have been filled. Minorities, persons with disabilities, and women are particularly encouraged to apply. UIC is an AA/EOE.

University of Illinois at Chicago
Research Assistant Professor

University of Illinois at Chicago, Department of Mathematics, Statistics, and Computer Science. The Department has active research programs in a broad spectrum of centrally important areas of pure mathematics, computational and applied mathematics, combinatorics, mathematical computer science and scientific computing, probability and statistics, and mathematics education. See http://www.math.uic.edu for more information.
Applications are invited for the following position, effective August 16, 2010.
Research Assistant Professorship. This is a non-tenure track position, normally renewable annually to a maximum of three years. This position carries a teaching responsibility of three courses per year, and the expectation that the incumbent play a significant role in the research life of the Department. The salary for AY 2009-2010 for this position is $54,500. Applicants must have a Ph.D. or equivalent degree in mathematics, computer science, statistics, mathematics education or related field, and evidence of outstanding research potential. Preference will be given to candidates in areas related to number theory or dynamical systems.
Send vita and at least three (3) letters of recommendation, clearly indicating the position being applied for, to: Appointments Committee; Dept. of Mathematics, Statistics, and Computer Science; University of Illinois at Chicago; 851 S. Morgan (m/c 249); Box R; Chicago, IL 60607. Applications through mathjobs.org are encouraged. No e-mail applications will be accepted. To ensure full consideration, materials must be received by December 31, 2009. However, we will continue considering candidates until all positions have been filled. Minorities, persons with disabilities, and women are particularly encouraged to apply. UIC is an AA/EOE.

University of Oregon
Department of Computer and Information Science
Faculty Position

The Computer and Information Science (CIS) Department at the University of Oregon seeks applicants for one or more full-time, tenure-track faculty positions beginning fall 2010, at the rank of Assistant Professor. The University of Oregon is an AAU research university located in Eugene, two hours south of Portland, and within one hour's drive of both the Pacific Ocean and the snow-capped Cascade Mountains.
The CIS Department is part of the College of Arts and Sciences and is housed within the Lorry Lokey Science Complex. The department offers B.S., M.S. and Ph.D. degrees. More information about the department, its programs and faculty can be found at http://www.cs.uoregon.edu, or by contacting the search committee at [email protected].
We offer a stimulating, friendly environment for collaborative research both within the department and with other departments on campus. Faculty in the department are affiliated with the Cognitive and Decision Sciences Institute, the Computational Science Institute, and the Neuro-Informatics Center.
Computer science is a rapidly evolving academic discipline. The department accordingly seeks to hire faculty in established areas as well as emerging directions in computer science. Applicants interested in interdisciplinary research are encouraged to apply. Applicants must have a Ph.D. in computer science or closely related field, a demonstrated record of excellence in research, and a strong commitment to teaching. A successful candidate will be expected to conduct a vigorous research program and to teach at both the undergraduate and graduate levels.
Applications will be accepted electronically through the department's web site (only). Application information can be found at http://www.cs.uoregon.edu/Employment/. Review of applications will begin January 4, 2010 and continue until the position is filled. Please address any questions to [email protected].
The University of Oregon is an equal opportunity/affirmative action institution committed to cultural diversity and is compliant with the Americans with Disabilities Act. We are committed to creating a more inclusive and diverse institution and seek candidates with demonstrated potential to contribute positively to its diverse community.

University of Pennsylvania
Faculty

The University of Pennsylvania invites applicants for tenure-track appointments in computer science to start July 1, 2010. Tenured appointments will also be considered.
The Department of Computer and Information Science seeks individuals with exceptional promise for, or a proven record of, research achievement who will excel in teaching undergraduate and graduate courses and take a position of international leadership in defining their field of study. While exceptional candidates in all areas of core computer science may apply, of particular interest this year are candidates who are working on the foundations of Market and Social Systems Engineering - the formalization, analysis, optimization, and realization of systems that increasingly integrate engineering, computational, and economic systems and methods. Candidates should have a vision and interest in defining the research and educational frontiers of this rapidly growing field.
The University of Pennsylvania is an Equal Opportunity/Affirmative Action Employer. The Penn CIS Faculty is sensitive to "two-body problems" and would be pleased to assist with opportunities in the Philadelphia region.
For more detailed information regarding this position and application link please visit: http://www.cis.upenn.edu/departmental/facultyRecruiting.shtml

University of Pennsylvania
Lecturer

The University of Pennsylvania invites applicants for the position of Lecturer in Computer Science to start July 1, 2010. Applicants should hold a graduate degree (preferably a Ph.D.) in Computer Science or Computer Engineering, and have a strong interest in teaching with practical application. Lecturer duties include undergraduate and graduate level courses within the Master of Computer and Information Technology program (www.cis.upenn.edu/grad/mcit/). Of particular interest are applicants with expertise and/or interest in teaching computer hardware and architecture. The position is for one year and is renewable annually up to three years. Successful applicants will find Penn to be a stimulating environment conducive to professional growth in both teaching and research.
The University of Pennsylvania is an Equal Opportunity/Affirmative Action Employer. The Penn CIS Faculty is sensitive to "two-body problems" and would be pleased to assist with opportunities in the Philadelphia region.
For more detailed information regarding this position and application link please visit: http://www.cis.upenn.edu/departmental/facultyRecruiting.shtml

University of San Francisco
Tenure-track Position

The Department of Computer Science at the University of San Francisco invites applications for a tenure-track position beginning in August, 2010. While special interest will be given to candidates in bioinformatics, game engineering, systems, and networking, qualified applicants from all areas of Computer Science are encouraged to apply. For full consideration, applications should be submitted by December 1, 2009.
More details, including how to apply, can be found here: http://www.cs.usfca.edu/job

University of Tartu
Senior Research Fellow

The successful candidate will lead an industry-driven research project in the area of agile software development environments. The position is for 4 years with a monthly salary of 2500-4000 euro. For details see: http://tinyurl.com/pnn5qn

The University of Washington Bothell
Assistant Professor — Software Engineering

The University of Washington Bothell Computing and Software Systems Program invites applications for a tenure track position in Software Engineering to begin fall 2010. Areas of research and teaching interest include: Requirements Engineering, Quality Assurance, Testing Methodologies, Software Development Processes, Software Design Methodologies, Software Project Management, and Collaborative and Team Development.
The Bothell campus of the University of Washington was founded in 1990 as an innovative, interdisciplinary campus within the University of Washington system – one of the premier institutions of higher education in the US. Faculty members have full access to the resources of a major research university, with the culture and close relationships with students of a small liberal arts college.
For additional information, including application procedures, please see our website at http://www.uwb.edu/CSS/. All University faculty engage in teaching, research, and service. The University of Washington, Bothell is an affirmative action, equal opportunity employer.

US Air Force Academy
Visiting Professor of Computer Science

The United States Air Force Academy Department of Computer Science is accepting applications for a Visiting Professor position for the 2010-11 academic year. Interested candidates should contact the Search Committee Chair, Dr. Barry Fagin, at [email protected] or 719-333-7377. Detailed information is available at http://www.usafa.edu/df/dfcs/visiting/index.cfm.

Wake Forest University
Faculty Position in Computational Biophysics

Wake Forest University invites applications for a tenure track faculty position at the level of Assistant Professor with a joint appointment in the Departments of Computer Science and Physics to begin in the fall semester of 2010. Applicants should have completed a PhD in an appropriate field by the time of appointment. Wake Forest University is a highly ranked, private university with about 4500 undergraduates, 750 graduate students, and 1700 students in the professional schools of medicine, law, divinity and business. The Physics Department has a major concentration in biophysics with approximately one third of the departmental faculty working in that field. Several computer science faculty are actively engaged in scientific computing, computational systems biology, biological modeling and bioinformatics. Interdisciplinary research is highly valued and encouraged by the departments and University.
The successful candidate will have a strong research record in computational biophysics. The candidate should also have demonstrated ability to teach courses relating to topics in physics, biophysics, or computer science. The successful candidate will be expected to teach in both departments at the undergraduate and graduate levels. Excellence in research, teaching, and obtaining external funding will be expected.
Applicants should send a copy of their CV, statements regarding their research interests and teaching philosophy, and the names of three references to the Computational Biophysics Search Committee, Box 7507, Wake Forest University, Winston-Salem, NC 27109-7507. Application materials can be sent electronically in the form of a single PDF document to [email protected]. Review of applications will begin November 1, 2009 and will continue through January 15, 2010. More information is available at http://www.wfu.edu/csphy/recruiting/. Wake Forest University is an equal opportunity/affirmative action employer.

Williams College
Visiting Faculty Position

The Department of Computer Science at Williams College invites applications for an anticipated one-semester, half-time visiting faculty position in the spring of 2010.
We are particularly interested in candidates who can teach an undergraduate course in artificial intelligence or a related field. Candidates should either hold or be pursuing a Ph.D. in computer science or a closely related discipline. This position might be particularly attractive to candidates who are pursuing an advanced degree and seek the opportunity to incorporate additional classroom experience in their professional preparation.
Applications in the form of a vita, a teaching statement, and three letters of reference, at least one of which speaks to the candidate's promise as a teacher, may be sent to:
Professor Thomas Murtagh, Chair
Department of Computer Science
TCL, 47 Lab Campus Drive
Williams College
Williamstown, MA 01267
Electronic mail may be sent to [email protected]. Applications should be submitted by October 15, 2009 and will be considered until the position is filled.
The Department of Computer Science consists of eight faculty members supporting a thriving undergraduate computer science major in a congenial working environment with small classes, excellent students, and state-of-the-art facilities. Williams College is a highly selective, coeducational, liberal arts college of 2100 students located in the scenic Berkshires of Western Massachusetts. Beyond meeting fully its legal obligations for non-discrimination, Williams College is committed to building a diverse and inclusive community where members from all backgrounds can live, learn, and thrive.


CALL FOR NOMINATIONS
EDITOR-IN-CHIEF
ACM INTERNATIONAL CONFERENCE PROCEEDINGS SERIES

ACM is looking for an Editor-in-Chief (EIC) for its International Conference Proceedings Series (ICPS). ACM has initiated ICPS to publish proceedings of high quality conferences, symposia, and workshops that are not sponsored by ACM or its SIGs. ICPS has been in existence since 2002, providing conference organizers a means of electronically publishing proceedings that ensures high visibility and wide distribution (ICPS proceedings are made available in the ACM Digital Library).
As a result of the sharp growth of the ICPS, the ACM Publications Board has found it necessary to initiate a more formal editorial management structure for the series to continue to ensure ACM-level quality. The series will be managed similarly to other ACM publications and will have an EIC who, along with the Editorial Board, will control the content of the series. The EIC will be further assisted by an Advisory Board, which will include representation from ACM SIGs.
The ACM Publications Board has set up a nominating committee to assist the Board in selecting the ICPS EIC. The nominating committee members (in alphabetical order) are: Michel Beaudouin-Lafon, Univ. Paris-Sud; Beng Chin Ooi, National University of Singapore; Bashar Nuseibeh, The Open University, UK; Tamer Ozsu (Committee Chair), University of Waterloo; Dan R. Olsen, Brigham Young University; Moshe Vardi, Rice University.
The nominating committee would like to receive nominations for the EIC position (self nominations are welcome). Please send nominations, accompanied by a short statement (up to one page) justifying why the nominee is suitable for this position, to Tamer Ozsu ([email protected]) by October 15, 2009.

last byte

[CONTINUED FROM P. 112] …Milgram's experiments were. Essentially, what he'd done was create something that a computer scientist would recognize as a decentralized algorithm.

You're referring to an experiment in which Milgram asked a group of people in the Midwest to forward a letter to a friend of his near Boston.
Right. He gave them the man's name and mailing address and some basic personal details, but the rules were they had to mail it to someone they knew on a first-name basis. And no one had a bird's-eye view of the network, but collectively people were able to pass these messages to the target very effectively.

How does your own work fit into that picture?
What I did was use the 'algorithm' almost as a scientific probe of the structure of the network. I proposed a model for social networks that allows people to succeed at this kind of search, to land as close to the target as possible. There was a mathematical definition and linking trees and theorems. But the key feature is that we should create links at many different scales of resolution and in equal proportions.

Meaning?
There are levels of resolution we can use to think about the world. Geographically, there are people who are close neighbors, people who are nearby, people who are in the same region and country, and so on. It's important to have links across all these scales. If we only link locally, we can't get messages far away very quickly, but if we only create links at long ranges, then, for example, it would be very easy to get a message to the Boston area, but there would be no local structure to go the final few steps.

Have you been able to apply those findings to more technical topics?
One area where they prove to be useful is in the design of peer-to-peer networks. If you think about what a network is trying to do, it's trying to get information between pairs of computers that need to communicate, and it's trying to do this without any global knowledge of the network.

"A lot of the way we experience information online now is in a different form. It's coming to us continuously, in bite-size pieces, through our social networks—blogs we read, things we see on Twitter, things our friends email us."
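Kleinberg's prescription of links at many scales, in equal proportions, can be watched in action with a toy simulation. The sketch below is our illustration, not his published model (his analysis uses a grid with inverse-square link probabilities; here a ring with weights proportional to 1/distance, the one-dimensional analogue). Each node knows only its two ring neighbors and one long-range contact, yet greedy forwarding finds short paths.

import random

def ring_distance(a, b, n):
    return min((a - b) % n, (b - a) % n)

def build_contacts(n, rng):
    contacts = []
    for u in range(n):
        # harmonic weights ~ 1/d put roughly equal total probability on
        # each scale of distances (1..2, 2..4, 4..8, ...)
        others = [v for v in range(n) if v != u]
        weights = [1.0 / ring_distance(u, v, n) for v in others]
        long_range = rng.choices(others, weights=weights, k=1)[0]
        contacts.append([(u - 1) % n, (u + 1) % n, long_range])
    return contacts

def greedy_route(src, dst, contacts, n):
    # forward to whichever known contact is closest to the target;
    # the ring neighbor guarantees progress, so this always terminates
    hops, cur = 0, src
    while cur != dst:
        cur = min(contacts[cur], key=lambda v: ring_distance(v, dst, n))
        hops += 1
    return hops

rng = random.Random(0)
n = 1024
contacts = build_contacts(n, rng)
trials = [greedy_route(rng.randrange(n), rng.randrange(n), contacts, n)
          for _ in range(200)]
print(sum(trials) / len(trials))   # average hop count over random pairs

The printed average is typically far below the roughly n/4 hops that nearest-neighbor forwarding alone would need, which is the phenomenon the model is built to explain: no node has a bird's-eye view, yet local decisions find short routes.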


last byte

DOI:10.1145/1562764.1562790   Leah Hoffmann

Q&A
The Networker
Jon Kleinberg talks about algorithms, information flow, and the connections between Web search and social networks.

A professor of computer science at Cornell University, Jon Kleinberg has been studying how people maintain social connections—both on- and offline—for more than a decade. Kleinberg has received numerous honors and awards, most recently the 2008 ACM-Infosys Foundation Award in Computing Sciences.

How did you first come to think about social networks?
It started with Web search, and appreciating that in order to understand Web search—which is about people looking for information—we have to understand networks, and in particular networks that are created by people and reflect the social structure. You might look at the network within an organization to try to find the most central or important people. You can ask a similar question about the links among Web pages. And that becomes a way of finding the information that's been most endorsed by the community.

This was the motivation behind the hubs and authorities algorithm, which you developed in the mid-1990s and which uses the structure of the Internet to try to find the most authoritative Web pages.
In a sense, you can think about it as an attempt to automate something you could carry out manually. Suppose I were trying to buy a new laptop, for example. I'd find lots of people blogging, writing product reviews, and so on. And I'd see certain things being mentioned over and over, certain brands and laptops, and I might get some sense that here are the most popular ones. Those become the authorities, and the people who are best at mentioning them are the hubs. The best hubs reflect the consensus of the community, while the best authorities are that consensus, and the two reinforce each other.

Where did your work go from there?
Once I had created the algorithm, I realized there's something very basic about using these links and endorsements to evaluate things in a network. And that led to other areas, like citation analysis and, more broadly, the field of social networks.

Some of the most famous research in that field was done by Stanley Milgram, whose small world experiments in the 1960s established that we're all linked by short paths—the proverbial "six degrees of separation."
The thing that intrigued me was how creative [CONTINUED ON P. 111]
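The mutual reinforcement Kleinberg describes (a page is a good authority if good hubs point to it, and a good hub if it points to good authorities) can be computed by simple alternating updates. A minimal sketch follows, on a made-up blog-and-laptop graph; the names are invented, and the published algorithm also assembles its graph from search-engine results rather than taking the links as given.

# alternately update hub and authority scores until they stabilize
links = {                      # hypothetical pages: who links to whom
    "blogA": ["laptopX", "laptopY"],
    "blogB": ["laptopX"],
    "blogC": ["laptopX", "laptopY", "laptopZ"],
    "laptopX": [], "laptopY": [], "laptopZ": [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(50):
    # an authority accumulates the scores of the hubs that cite it...
    for p in auth:
        auth[p] = sum(hub[q] for q in links if p in links[q])
    # ...and a hub accumulates the scores of the authorities it cites
    for p in hub:
        hub[p] = sum(auth[q] for q in links[p])
    # normalize so the scores converge instead of growing unboundedly
    for d in (auth, hub):
        norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
        for p in d:
            d[p] /= norm

print(max(auth, key=auth.get))   # -> "laptopX", the consensus pick

Run to a fixed point, the two score vectors settle into the principal eigenvectors of the link matrix products, which is why the alternation stabilizes rather than oscillating.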

