2020-May - Top Picks
Top Picks
www.computer.org/micro
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Author and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or [email protected]. ©2020 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
May/June 2020
Volume 40 Number 3
Special Issue
Guest Editor’s Introduction
Energy-Efficient Video Processing for Virtual Reality
Yue Leng, Jian Huang, Chi-Chun Chen, Qiuyue Sun, and Yuhao Zhu
Towards General-Purpose Acceleration: Finding Structure in Irregularity
Vidushi Dadu, Jian Weng, Sihao Liu, and Tony Nowatzki
AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
MicroScope: Enabling Microarchitectural Replay Attacks
Dimitrios Skarlatos, Mengjia Yan, Bhargava Gopireddy, Read Sprabery, Josep Torrellas, and Christopher W. Fletcher
Creating Foundations for Secure Microarchitectures With Data-Oblivious ISA Extensions
Micro Economics
Pandemics and the Dismal Technology Economy
Shane Greenstein
From the Editor-in-Chief
Digital Object Identifier 10.1109/MM.2020.2993184
Date of current version 22 May 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.

THE NOVEL CORONAVIRUS has taken all of us to uncharted territories in our lives. However, as computer engineers and scientists, we can be proud of the integral role computers play in getting everybody connected as lockdowns and shelter-in-place isolations are taking place around the world! The importance of secure, low-power, and high-performance chips and systems cannot be overemphasized at this time. Microprocessors and microsystems are increasingly relevant in everyday lives as well as computing for medical discoveries.

While the pandemic has taken us to unfamiliar territories, IEEE Micro is presenting to you the very familiar IEEE Micro Top Picks issue. For more than a decade, IEEE Micro has had this tradition of evaluating articles from the previous year’s architecture conferences and selecting those with the most novelty and potential for long-term impact. IEEE Micro is upholding the tradition this year as well. Any article published in top architecture conferences during 2019 was eligible to compete for the Top Picks honor. In total, 96 submissions were received, from which 12 articles were chosen to represent the cream of the crop of 2019.

Professor Hyesoon Kim of Georgia Tech chaired this year’s selection committee. Hyesoon and 28 experts from academia and industry worked hard to identify 12 Top Picks and 14 Honorable Mention articles. The authors of each article recognized as a Top Pick were invited to prepare a submission for inclusion in this special issue. The articles in this special issue are intended for a broader audience than the original conference articles. These articles also focus more on their potential impact. The honorable mentions are high-quality articles that unfortunately could not be included in the special issue due to space constraints. They are listed in the Guest Editor’s Introduction. Interested readers can locate them in the original conference proceedings or the IEEE/ACM Digital Library.

The purpose of the Top Picks issue has been multifold. First and foremost, Top Picks was originally instituted to present the “best of the best” of the preceding year’s architecture research contributions to a broader audience, including industry and other fields. A second goal of Top Picks is to recognize excellent research in the field and bestow this honor on researchers who conducted the outstanding research that resulted in these articles. It is critically important for our field to honor our budding researchers and help them to shape their careers. The Top Picks honor has been seen to be instrumental in achieving faculty positions, leading research positions in industry, and prestigious research grants. Third, the writing style intended for a broader audience makes it easy for beginner graduate students to understand the state of the art of the field as they are pondering topics to work on for their doctoral degrees. Above all, I expect these articles to be enjoyable reads for all the readers of IEEE Micro.

I take this opportunity to express my gratitude to Hyesoon and the selection committee members, who spent countless hours during the Christmas and New Year season evaluating the submissions, and who conducted a face-to-face meeting and deliberated for a whole day to perform this important selection. Hyesoon and the committee conducted a multistep selection process that tried to reduce the impact of the discussion order of the articles. I wish to express my special thanks to Hyesoon and the selection committee for the thoughtful process and the hard work.

The Top Picks articles belong to four themes: first, cloud and accelerators; second, acceleration from understanding applications; third, quantum computing; and fourth, security and privacy. One may recognize that these are highly relevant topics with or without the pandemic. A comprehensive article written by Hyesoon Kim serves as an excellent introduction to the compendium.

I want to personally congratulate all the Top Picks authors for their fantastic work. It is a significant honor to rise to the top in a competition where each candidate article is already recognized as an excellent piece of work. I hope that these works have significant impact on future computer systems.

In addition to the Top Picks articles, this issue also features a Micro Economics column by Shane Greenstein of Harvard Business School, titled “Pandemics and the Dismal Technology Economy.” This article discusses the economic crisis brought by the pandemic. The author talks about the predictable features of such a crisis even when the pandemic itself is unprecedented, but recognizes that it is difficult to predict when the economy might turn around. He also discusses the increased appeal of streaming services and online communication tools.

To all the computer architects and chip designers: while COVID-19 may have altered your daily routines, the demand for microchips and systems is only increasing. Medical research, data analytics to predict outbreaks, online communication and conferencing tools, and security and privacy for various applications all point to the need for increased research to design hardware and software that efficiently meet the demands of the emerging era.

I hope this special issue is thought provoking for our readers and helps shape the field for many years to come. I also hope that researchers in the field intensify their efforts to design better chips and systems to help medical research and ordinary daily lives. Additionally, I encourage readers to submit to IEEE Micro. IEEE Micro is interested in submissions on any aspect of chip/system design or architecture.

May the Top Picks articles bring some happy reading to you amidst this coronavirus pandemic!

Lizy Kurian John is a Cullen Trust for Higher Education Endowed Professor in the Electrical and Computer Engineering Department, University of Texas at Austin. Contact her at [email protected].
Guest Editor’s Introduction
Digital Object Identifier 10.1109/MM.2020.2992834
Date of current version 22 May 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.

IT IS MY pleasure to introduce the “2019 Top Picks in Computer Architecture.” This annual publication presents 12 articles selected from major computer architecture conferences of the year. The 12 papers are recognized for their importance, mainly the long-term impact and influence on the industry and other researchers. The selection committee members put enormous effort into picking the papers. We asked what the criteria should be for the top picks, and then we tried to answer that question by looking for significant improvement over previous work or the establishment of a new area.

As in prior years, only 12 articles could be selected to appear in this special issue. The selection committee chose 14 additional high-quality articles to be recognized as honorable mentions. I strongly encourage you to read these articles (see the “Honorable Mentions” sidebar).

REVIEW PROCESS
This year’s review process built on previous years’ selection processes. Authors submitted a three-page document that contained a two-page summary of the article and one page of supporting arguments for long-term impact and influence on other researchers and industry. Twenty-eight selection committee members (see the “Selection Committee” sidebar) read the three-page documents along with the original conference papers (a single-blind review process by nature). In keeping with the successful two-round ranking-based review process of the past several years, the PC members first categorized each article as either a top pick, an honorable mention, or not a top pick. They also ranked the articles. After the first round of reviews, all PC members participated in online discussions to decide which articles should move to the second round. In the first round, all the articles were assigned at least four reviewers, and in the second round, the articles had at least four additional reviewers.

This year, as we expanded our research areas into special accelerators that rely on emerging technologies, we found it particularly challenging to ensure all reviewers understood the underlying technologies. Because the selection process is concerned more with the impact of the work than with evaluating its technical accuracy, technical expertise is less critical than for main conference reviews. Nonetheless, when several papers cover similar topics, it is also important to identify those worthy of nomination based on technical merits. To overcome the limitations on available expertise, we increased the number of reviewers for such emerging-technology-based papers.
SELECTION COMMITTEE
Arka Basu, Indian Institute of Science (IISc)
Babak Falsafi, École Polytechnique Fédérale de Lausanne (EPFL)
Boris Grot, University of Edinburgh
Christopher Fletcher, University of Illinois Urbana-Champaign (UIUC)
Daniel Jiménez, Texas A&M University
Dmitry Ponomarev, SUNY Binghamton
Edward Suh, Cornell University
Gennady Pekhimenko, University of Toronto
Jangwoo Kim, Seoul National University
Jayasena Nuwan, AMD
Jishen Zhao, University of California San Diego
John Kim, KAIST
Jose Joao, Arm Research
Josep Torrellas, University of Illinois Urbana-Champaign (UIUC)
Lisa Wu Wills, Duke University
Mike O’Connor, NVIDIA/UT-Austin
Mohit Tiwari, UT Austin
Onur Mutlu, ETH Zurich/CMU
Parthasarathy Ranganathan, Google
Rajeev Balasubramonian, University of Utah
Ravi Iyer, Intel
Reetuparna Das, University of Michigan
Thomas Wenisch, University of Michigan/Google
Tor Aamodt, The University of British Columbia
Tushar Krishna, Georgia Institute of Technology
Ulya Karpuzcu, University of Minnesota
Vijay Janapa Reddi, Harvard University
Yunji Chen, Institute of Computing Technology, Chinese Academy of Sciences (ICT–CAS)
[...] analysis tool to allow identification of critical code segments and then proposing solutions to improve the performance of WSC is presented in “AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers” by Nagendra et al.

[...] for Program Trace Privacy” by Dangwal et al. presents a compression method to generate traces that can limit the information leak, and memory trace writing for cache simulation is shown as an example.
Theme Article: Top Picks
Digital Object Identifier 10.1109/MM.2020.2985960
Date of publication 22 April 2020; date of current version 22 May 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.

CLOUD COMPUTING NOW powers applications from every domain of human endeavor, which require ever-improving performance, responsiveness, and scalability [2], [5], [6], [8]. Many of these applications are interactive, latency-critical services that must meet strict performance (throughput and tail latency) and availability constraints, while also handling frequent software updates [4]–[7], [12]. The past five years have seen a significant shift in the way cloud services are designed, from large monolithic implementations, where the entire functionality of a
service is implemented in a single binary, to large graphs of single-concerned and loosely-coupled microservices [1], [10]. This shift is becoming increasingly pervasive, with large cloud providers, such as Amazon, Twitter, Netflix, Apple, and eBay having already adopted the microservices application model, and Netflix reporting more than 200 unique microservices in their ecosystem, as of the end of 2016 [1].

The increasing popularity of microservices is justified by several reasons. First, they promote composable software design, simplifying and accelerating development, with each microservice being responsible for a small subset of the application’s functionality. The richer the functionality of cloud services becomes, the more the modular design of microservices helps manage system complexity. They similarly facilitate deploying, scaling, and updating individual microservices independently, avoiding long development cycles, and improving elasticity. For applications that are updated on a daily basis, modifying, recompiling, and testing a large monolith is both cumbersome and prone to bugs.

Figure 1 shows the deployment differences between a traditional monolithic service and an application built with microservices. While the entire monolith is scaled out on multiple servers, microservices allow individual components of the end-to-end application to be elastically scaled, with microservices of complementary resources bin-packed on the same physical server. Even though modularity in cloud services was already part of the service-oriented architecture (SOA) design approach, the fine granularity of microservices and their independent deployment create hardware and software challenges different from those in traditional SOA workloads.

Figure 1. Differences in the deployment of monoliths and microservices.

Second, microservices enable programming language and framework heterogeneity, with each tier developed in the most suitable language, only requiring a common API for microservices to communicate with each other, typically over remote procedure calls (RPC) or a RESTful API. In contrast, monoliths limit the languages used for development, and make frequent updates cumbersome and error-prone.

Finally, microservices separate failure domains across application tiers, allowing cleaner error isolation, and simplifying correctness and performance debugging, unlike in monoliths, where resolving bugs often involves troubleshooting the entire service. This also makes them applicable to Internet-of-Things (IoT) applications that often host mission-critical computation.

Despite their advantages, microservices represent a significant departure from the way cloud services are traditionally designed, and have broad implications in both hardware and software, changing many of the assumptions with which current warehouse-scale systems are designed. For example, since dependent microservices are typically placed on different physical machines, they put far more pressure on high-bandwidth and low-latency networking than traditional applications. Furthermore, the dependencies between microservices introduce backpressure effects between dependent tiers, leading to cascading QoS violations that propagate and amplify through the system, making performance debugging expensive in both resources and time [11].

Given the increasing prevalence of microservices in both cloud and IoT settings, it is imperative to study both their opportunities and challenges. Unfortunately, most academic work on cloud systems is limited to the available open-source applications, which are mostly monolithic designs. This not only prevents a wealth of interesting research questions from being explored, but can also lead to misdirected research efforts whose results do not translate to the way real cloud services are implemented.

DeathStarBench SUITE
Our article [10], presented at ASPLOS’19, addresses the lack of representative and open-source benchmarks built with microservices, and
quantifies the opportunities and challenges of this new application model across the system stack.

Benchmark Suite Design: We have designed, implemented, and open-sourced a set of end-to-end applications built with interactive microservices, representative of popular production online services using this application model. Specifically, the benchmark suite includes a social network, a media service, an e-commerce shop, a hotel reservation site, a secure banking system, and a coordination control platform for UAV swarms. Across all applications, we adhere to the design principles of representativeness, modularity, extensibility, software heterogeneity, and end-to-end operation.

Each service includes tens of microservices in different languages and programming models, including node.js, Python, C/C++, Java, JavaScript, Scala, and Go, and leverages open-source applications, such as NGINX, memcached, MongoDB, Cylon, and Xapian. To create the end-to-end services, we built custom RPC and RESTful APIs using popular open-source frameworks like Apache Thrift and gRPC. Finally, to track how user requests progress through microservices, we have developed a lightweight distributed tracing system, transparent to the user and similar to Dapper and Zipkin, that tracks requests at RPC granularity, associates RPCs belonging to the same end-to-end request, and records traces in a centralized database. We study both traffic generated by real users of the services, and synthetic loads generated by open-loop workload generators.

Applications in DeathStarBench
Social Network: The end-to-end service implements a broadcast-style social network with unidirectional follow relationships. Figure 2 shows the architecture of the end-to-end service. Users (clients) send requests over HTTP, which first reach a load balancer, implemented with nginx. Once a specific webserver is selected, also in nginx, the latter uses a php-fpm module to talk to the microservices responsible for composing and displaying posts, as well as microservices for advertisements and search engines. All messages downstream of php-fpm are Apache Thrift RPCs. Users can create posts embedded with text, media, links, and tags to other users. Their posts are then broadcast to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly, or send a direct message to another user. The application also includes machine learning plugins, such as user recommender engines, a search service using Xapian, and microservices to record and display user statistics, e.g., number of followers, and to allow users to follow, unfollow, or block other accounts. The service’s backend uses memcached for caching, and MongoDB for persistent storage for posts, profiles, media, and recommendations. The service is broadly deployed at our institution, currently servicing several hundred users. We also use this deployment to quantify the tail-at-scale effects of microservices.

Figure 2. Graph of microservices in Social Network.
Figure 3. Graph of microservices in Media Service.

Media Service: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies. Figure 3 shows the architecture of the end-to-end service. As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. Users can search and browse information about movies, including their plot,
photos, videos, cast, and review information, as well as insert new reviews in the system for a specific movie by logging into their account. Users can also select to rent a movie, which involves a payment authentication module to verify that the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. The actual movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from nonrelational databases, while movie reviews are kept in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL database. The application also includes movie and advertisement recommenders, as well as a couple of auxiliary services for maintenance and service discovery, which are not shown in the figure.

E-Commerce Site: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application. The application front-end in this case is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Finally, the service includes a recommender engine for suggested products, and microservices for creating an item wishlist (Java), and displaying current discounts.

Hotel Reservation: The service implements a hotel reservation site, where users can browse information about hotels and complete reservations. The service is primarily written in Go, with the backend tiers implemented using memcached and MongoDB. Users can filter hotels according to ratings, price, location, and availability. They also receive recommendations on hotels they may be interested in.

Banking System: The service implements a secure banking system that processes payments, loan requests, and credit card transactions. Users interface with a node.js front-end, similar to the one in E-Commerce, to log in to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment, pay their credit card bill, browse information about loans or request one, and obtain information about wealth management options. Most microservices are written in Java and JavaScript. The back-end databases use memcached and MongoDB instances.

IoT Swarm Coordination: Finally, we explore an environment where applications run both on the cloud and on edge devices. The service coordinates the routing of a swarm of programmable drones, which perform image recognition and obstacle avoidance. We have designed two versions of this service. In the first, the majority of the computation happens on the drones, including the motion planning, image recognition, and obstacle avoidance, with the cloud only constructing the initial route per drone, and holding persistent copies of sensor data. This architecture avoids the high network latency between cloud and edge, but it is limited by the on-board resources. In the second version, the cloud is responsible for most of the computation. It performs motion control, image recognition, and obstacle avoidance for all drones, using the ardrone-autonomy and Cylon libraries, in OpenCV and JavaScript, respectively. The edge devices are only responsible for collecting sensor data and transmitting them to the cloud, as well as recording some diagnostics using a local node.js logging service. In this case, almost every action suffers the cloud-edge network latency, although services benefit from the additional cloud resources. We use 24 programmable Parrot AR2.0 drones, together with a backend cluster of 20 two-socket, 40-core servers.

Adoption
DeathStarBench is open-source software under a GPL license. The project is currently in use by several tens of research groups both in academia and industry. In addition to the open-source project, we have also deployed the social network as an internal social network at Cornell University, currently used by over 500 students, and have used execution traces for several

https://github.com/delimitrou/DeathStarBench
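The tracing system described under Benchmark Suite Design tracks requests at RPC granularity and associates RPCs that belong to the same end-to-end request. The following is a minimal sketch of that grouping step only, under stated assumptions: the `RpcSpan` record, its field names, the microsecond timestamps, and the service names are hypothetical and are not taken from the DeathStarBench tracing code, which may differ substantially.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RpcSpan:
    """One RPC between two microservices (hypothetical record layout)."""
    trace_id: str   # identifier shared by all RPCs of one end-to-end request
    caller: str
    callee: str
    start_us: int   # timestamps in microseconds (assumed unit)
    end_us: int


def group_by_request(spans: List[RpcSpan]) -> Dict[str, List[RpcSpan]]:
    """Associate RPCs with their end-to-end request, ordered by start time."""
    traces: Dict[str, List[RpcSpan]] = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    for rpcs in traces.values():
        rpcs.sort(key=lambda s: s.start_us)
    return dict(traces)


def end_to_end_latency_us(rpcs: List[RpcSpan]) -> int:
    """Latency of a request: earliest RPC start to latest RPC end."""
    return max(s.end_us for s in rpcs) - min(s.start_us for s in rpcs)


# Illustrative spans; the service names are placeholders.
spans = [
    RpcSpan("req-1", "compose-post", "post-storage", 100, 700),
    RpcSpan("req-2", "nginx", "read-timeline", 50, 400),
    RpcSpan("req-1", "nginx", "compose-post", 0, 900),
]
traces = group_by_request(spans)
for trace_id, rpcs in sorted(traces.items()):
    print(trace_id, [s.callee for s in rpcs], end_to_end_latency_us(rpcs))
```

Keying every span on a shared request identifier is what lets per-RPC measurements be rolled up into per-request latency, which is the quantity the tail-latency analysis depends on.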
from acceleration on network processing latency alone, and on the end-to-end latency of each of the services. Network processing latency improves by 10–68x over native TCP, whereas end-to-end tail latency improves by 43% and up to 2.2x. For interactive, latency-critical services, where even a small improvement in tail latency is significant, network acceleration provides a major boost in performance.

Figure 5. (a) Overview of the FPGA configuration for RPC acceleration, and (b) the performance benefits of acceleration in terms of network and end-to-end tail latency.

Cluster Management
A major challenge with microservices has to do with cluster management. Even though the cluster manager can elastically scale out individual microservices on demand instead of the entire monolith, dependencies between microservices introduce backpressure effects and cascading QoS violations that propagate through the system, hurting quality of service (QoS). Backpressure can additionally trick the cluster manager into penalizing or upsizing a highly utilized microservice, even though its saturation is the result of backpressure from another, potentially not-saturated service. Not only does this not solve the performance issue, but it can on occasion make it worse, by admitting more traffic into the system.

The more complex the dependence graph between microservices, the more pronounced such issues become. Figure 6 shows the microservices dependence graphs for three major cloud service providers, and for one of our applications (Social Network). The perimeter of the circle (or sphere surface) shows the different microservices, and edges show dependencies between them. Such dependencies are difficult for developers or users to describe, and furthermore, they change frequently, as old microservices are swapped out and replaced by newer services.

Figure 7 shows the impact of cascading QoS violations in the Social Network service. Darker colors show tail latency closer to nominal operation for a given microservice in Figure 7(a), and low utilization in Figure 7(b). Brighter colors signify high per-microservice tail latency and high CPU utilization. Microservices are ordered based on the service architecture, from the back-end services at the top, to the front-end at the bottom. Figure 7(a) shows that once the back-end service at the top experiences high tail latency, the hotspot propagates to its upstream services, and all the way to the front-end. Utilization in this case can be misleading. Even though the saturated back-end services have high utilization in Figure 7(b), microservices in the middle of the figure have even higher utilization, without this translating to QoS violations.

Conversely, there are microservices with relatively low utilization and degraded performance, for example, due to waiting on a blocking/synchronous request from another, saturated tier. This highlights the need for cluster managers that account for the impact dependencies

Figure 6. Microservices graphs for three production clouds, and our Social Network.
Figure 7. Cascading QoS violations in Social Network compared to per-microservice CPU utilization.
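To make the point about dependency-driven backpressure concrete, here is a minimal sketch of one way a cluster manager could use the dependence graph rather than utilization alone. It is an illustrative heuristic, not the method used in the article: it flags a microservice as a likely culprit only if it violates QoS while none of the services it calls do. The function name, the graph encoding, and the service names are all hypothetical.

```python
from typing import Dict, List, Set


def likely_culprits(deps: Dict[str, List[str]], violating: Set[str]) -> Set[str]:
    """Return violating services whose downstream dependencies are all healthy.

    deps maps each service to the services it calls (its downstream tiers).
    A violating service with a violating dependency is assumed to be a victim
    of backpressure rather than the source of the problem.
    """
    return {
        svc
        for svc in violating
        if not any(dep in violating for dep in deps.get(svc, []))
    }


# Hypothetical front-end-to-back-end chain; names are placeholders.
deps = {
    "nginx": ["compose-post"],
    "compose-post": ["post-storage"],
    "post-storage": ["mongodb"],
    "mongodb": [],
}

# A saturated back-end makes every upstream tier miss its latency target.
cascade = {"nginx", "compose-post", "post-storage", "mongodb"}
print(likely_culprits(deps, cascade))
```

In the cascade above, only the deepest violating tier is reported, so scaling decisions would target it instead of the upstream tiers that merely inherit its latency. A real cluster manager would also have to handle cycles, partial observability, and frequently changing graphs, which this sketch ignores.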
As with the hardware and cluster management
implications above, these results again emphasize
the need for hardware and software techniques
that improve performance predictability at scale
without hurting latency and resource efficiency.
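One way to see why predictability becomes harder at scale is the standard fan-out argument: if a request must wait for every microservice it touches, a rare slow response from any one of them delays the whole request. The sketch below is an illustrative calculation rather than a result from the article. It assumes each microservice independently exceeds its latency target with the same probability; the function name and the sample values are hypothetical.

```python
def prob_request_hits_tail(per_service_tail_prob: float, num_services: int) -> float:
    """Probability that at least one of num_services independent calls is slow."""
    if not 0.0 <= per_service_tail_prob <= 1.0:
        raise ValueError("per_service_tail_prob must be in [0, 1]")
    if num_services < 0:
        raise ValueError("num_services must be non-negative")
    # Complement of "every call meets its latency target".
    return 1.0 - (1.0 - per_service_tail_prob) ** num_services


# Each service is slow 1% of the time; vary how many services a request touches.
for n in (1, 10, 30, 100):
    print(n, round(prob_request_hits_tail(0.01, n), 3))
```

Under these assumptions, a request that touches 100 microservices is delayed by at least one slow response roughly 63% of the time, even though each individual service misses its target only 1% of the time.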
showed that programmable acceleration can greatly reduce one of the primary overheads of multitier services: network processing.
As microservices continue to evolve, it is essential for datacenter hardware, operating and networking systems, cluster managers, and programming frameworks to also evolve with them, to ensure that their prevalence does not come at a performance and/or efficiency loss. Both DeathStarBench and the resulting study of the system implications of microservices are a call to action for the research community to further explore the opportunities and challenges of this emerging application model.

ACKNOWLEDGMENTS
We sincerely thank C. Kozyrakis, D. Sanchez, D. Lo, as well as the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this article. This work was supported in part by an NSF CAREER award, in part by NSF grant CNS-1422088, in part by a Google Faculty Research Award, in part by an Alfred P. Sloan Foundation Fellowship, in part by a Facebook Faculty Research Award, in part by a John and Norma Balen Sesquicentennial Faculty Fellowship, and in part by generous donations from Google Compute Engine, Windows Azure, and Amazon EC2.

REFERENCES
1. “The evolution of microservices,” 2016. [Online]. Available: https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
2. L. Barroso, U. Hoelzle, and P. Ranganathan, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool: San Rafael, CA, USA, 2018.
3. S. Chen, S. Galon, C. Delimitrou, S. Manne, and J. F. Martinez, “Workload characterization of interactive cloud services on big and small server platforms,” in Proc. Int. Symp. Workload Characterization, Oct. 2017, pp. 125–134.
4. J. Dean and L. A. Barroso, “The tail at scale,” Commun. ACM, vol. 56, no. 2, pp. 74–80, 2013.
5. C. Delimitrou and C. Kozyrakis, “Paragon: QoS-aware scheduling for heterogeneous datacenters,” in Proc. 18th Int. Conf. Archit. Support Program. Lang. Oper. Syst., Houston, TX, USA, 2013, pp. 77–88.
6. C. Delimitrou and C. Kozyrakis, “Quasar: Resource-efficient and QoS-aware cluster management,” in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., Salt Lake City, UT, USA, 2014, pp. 127–144.
7. C. Delimitrou and C. Kozyrakis, “HCloud: Resource-efficient provisioning in shared cloud systems,” in Proc. 21st Int. Conf. Archit. Support Program. Lang. Oper. Syst., Apr. 2016, pp. 473–488.
8. C. Delimitrou and C. Kozyrakis, “Bolt: I know what you did last summer... In the cloud,” in Proc. 22nd Int. Conf. Archit. Support Program. Lang. Oper. Syst., Apr. 2017, pp. 599–613.
9. D. Firestone et al., “Azure accelerated networking: Smartnics in the public cloud,” in Proc. 15th USENIX Symp. Netw. Syst. Design Implementation, 2018, pp. 51–66.
10. Y. Gan et al., “An open-source benchmark suite for microservices and their hardware-software implications for cloud and edge systems,” in Proc. 24th Int. Conf. Archit. Support Program. Lang. Oper. Syst., Apr. 2019, pp. 3–18.
11. Y. Gan et al., “Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices,” in Proc. 24th Int. Conf. Archit. Support Program. Lang. Oper. Syst., Apr. 2019, pp. 19–33.
12. D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, “Heracles: Improving resource efficiency at scale,” in Proc. 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 450–462.

Yu Gan is currently working toward the Ph.D. degree with the School of Electrical and Computer Engineering, Cornell University, where he works on cloud computing and root cause analysis for interactive microservices. He is a student member of IEEE and ACM. Contact him at [email protected].

Yanqi Zhang is currently working toward the Ph.D. degree with the School of Electrical and Computer Engineering, Cornell University, where he works on cloud systems and resource management for interactive microservices. He is a student member of IEEE and ACM. Contact him at [email protected].

Dailun Cheng is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].
Ankitha Shetty is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Priyal Rathi is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Nayan Katarki is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Ariana Bruno is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Justin Hu is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Brian Ritchken is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Brendon Jackson is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Kelvin Hu is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Meghna Pancholi is currently working toward the B.S. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Yuan He is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Brett Clancy is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Chris Colen is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Fukang Wen is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Catherine Leung is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Siyuan Wang is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Leon Zaruvinsky is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Mateo Espinosa is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Rick Lin is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Zhongling Liu is currently working toward the M.Eng. degree with the School of Electrical and Computer Engineering, Cornell University. Contact him at [email protected].

Jake Padilla is currently working toward the M.Eng. degree with the School of Computer Science, Cornell University. Contact him at [email protected].

Christina Delimitrou is currently an Assistant Professor with the School of Electrical and Computer Engineering, Cornell University, where she works on computer architecture and distributed systems. Her research interests include resource-efficient datacenters, scheduling and resource management with quality-of-service guarantees, emerging cloud and IoT application models, and cloud security. Delimitrou received the Ph.D. degree in electrical engineering from Stanford University. She is a member of IEEE and ACM. Contact her at [email protected].
Theme Article: Top Picks

MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings

Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, and Tushar Krishna
Georgia Tech

Michael Pellauer and Angshuman Parashar
NVIDIA Corp

0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
DEEP NEURAL NETWORK (DNN) inference accelerators achieve high performance by exploiting parallelism over hundreds of processing elements (PEs) and high energy efficiency by maximizing data reuse within PEs and on-chip scratchpads [1]–[4]. The efficiency (performance and energy efficiency) of a DNN accelerator depends on three factors depicted in Figure 1: 1) the workload (DNN layers), 2) the amount and type of available hardware resources (hardware), and 3) the mapping strategy of a DNN layer on the target hardware (mapping). That is, we can predict the efficiency (latency, energy, buffer requirement, etc.) of an accelerator when we have full parameters for those three factors, which can guide the DNN accelerator design for better efficiency. One critical requirement on the efficiency estimation is that it needs to be fast: the design space is huge (e.g., 480 million valid designs in our hardware DSE even if we fix the target mapping and layer), and we need to query the efficiency of candidate designs in the search space when we search for an optimal design. How do we implement such a fast efficiency estimation framework that thoroughly considers all the parameters of the three factors that determine the efficiency of DNN accelerators?

Figure 1. High-level overview of mapping a high-dimensional DNN layer (CONV2D in this figure) to an accelerator with 2-D PE array. Note that tile scheduling also needs to be done within spatial partitioning; we omit it for simplicity. (a) An Overview of Mapping CONV2D to an Accelerator. (b) High-level Tool flow of MAESTRO.

Such demands led to the development of an analytical cost model instead of cycle-accurate simulators. Analytically modeling the complex high-dimensional DNN accelerator design space over the three factors (DNN layer, hardware, and mapping) is challenging because it requires a deep understanding of the complex interaction of hardware components, mapping, and DNN layers. In particular, data reuse in the scratchpad memory hierarchy of DNN accelerators is one of the key behaviors; it is critical for energy efficiency and is thus the prime optimization target of DNN accelerators. The data reuse pattern is dictated by the dataflow [1], which consists of the data/computation tile scheduling and spatial partitioning strategies without actual tile sizes, as described in Figure 1(a). To systematically and analytically model data reuse for estimating the efficiency of DNN accelerators, we need a precise and thorough description of mapping and a framework to analyze the data reuse of a mapping on the target hardware and DNN layer.

Therefore, we propose a data-centric representation of mapping that enables precise descriptions of all the possible mappings in a concise and compiler-friendly manner. Leveraging the compiler-friendly format, we develop MAESTRO, a comprehensive cost-benefit analysis framework based on systematic data reuse analysis. As shown in Figure 1(b), MAESTRO receives the three factors (DNN layer, hardware, and mapping) as inputs and generates more than 20 estimated statistics including latency, energy, the number of buffer accesses, buffer size requirement, etc. We validated the performance statistics of MAESTRO against cycle-accurate RTL simulation results [5] and reported performance in a previous work [6] with
the accuracy of 96.1% on average. MAESTRO provides fast cost-benefit estimation based on an analytical model, which took 493 ms to analyze the entire ResNet50 [7] layers on a 256-PE NVDLA-style [2] accelerator on a laptop with an i9-9980H CPU and 16 GB of memory. MAESTRO supports arbitrary layer sizes and a variety of layer operations from state-of-the-art DNN models, including CONV1D, CONV2D, fully connected (FC) layer, depthwise separable convolution, up-scale convolution, etc.

DATA REUSE IN DNN ACCELERATORS
Data reuse, which is determined by the dataflow, is the key behavior in DNN accelerators that improves both latency and energy by reducing the number of remote buffer accesses (i.e., global buffer) [1], [8]. Data reuse opportunities exist when the dataflow assigns the same set of data tiles over consecutive time on the same PE (i.e., reuse in time) or across multiple PEs but not over consecutive time (i.e., reuse in space). We define those opportunities as temporal and spatial reuse opportunities. For example, in the example dataflow in Figure 1, output tiles (orange tiles) remain the same in time 0 and 1, which implies the temporal reuse opportunities. Within time 1, as the spatial partitioning example in Figure 1 shows, input tile 3 is mapped on all the PEs, which implies the spatial reuse opportunities.

Table 1. The taxonomy of data reuse in DNN accelerators and implementation choices for each. We highlight the implementation used in the example reuse patterns in red text.

Dataflow implies data reuse opportunities, and we can categorize data reuse in DNN accelerators into four types (data reuse taxonomy), which we summarize in Table 1. Each data reuse type requires proper hardware support to exploit the data reuse opportunity as actual data reuse. We discuss those four reuse types grouped by communication type as follows:

Spatial/Temporal Multicast. When the spatial/temporal reuse opportunities are in input tensors (i.e., filter and input activation), the reused data can be multicast to multiple PEs (spatial reuse) or over time (temporal reuse). The examples in Table 1 show such a pattern based on a fanout NoC (spatial multicast), which delivers data to multiple PEs at the same time, and a buffer (temporal multicast).
In the spatial multicast example, tiles 1 and 2 are delivered to PE1 and PE2 at the same time, leveraging the multicast capability of fanout hardware. Alternatively, a store-and-forward style implementation such as a systolic array is available, with a tradeoff between hardware cost and latency. In the temporal multicast example, the same data tile appears over time in the same PE (PE1). That is, we send the data to the future for reuse (i.e., store the data in a buffer and read it later). Therefore, temporal multicast, which is reading the same stored data over time, requires a buffer, as shown in Table 1.

Spatial/Temporal Reduction. When the spatial reuse opportunities are in the output activation tensor, the reuse pattern in hardware is spatial reduction, which accumulates partial outputs (or partial sums) for an output across multiple PEs. The example in Table 1 shows a reuse pattern based on store-and-forward hardware. We observe that the output tiles 1 and 2 move to the next PE over time, which illustrates pipelined accumulation toward the right, assuming that PEs receive new operands from above (i.e., a row of a systolic array). Alternatively, fanin hardware such as a reduction tree can support the spatial reduction. In contrast, the temporal reuse opportunities imply that we compute partial sums over time and accumulate them within the same location. This type of reuse requires a buffer since
intermediate results need to be stored and read again in the future, which effectively means multiple read-modify-write operations on a buffer. The example in Table 1 shows such a reuse pattern, where the output tile 1 appears at the same PE over time.

To identify the reuse opportunities in arbitrary mappings, we need a precise representation of mapping and a systematic way to infer data reuse from the description. For those two goals, we present a data-centric representation of mapping, which is concise and compiler friendly.

DESCRIBING MAPPINGS
We use the CONV1D operation described in Figure 2 as an example operation to introduce our mapping description. As described in Figure 2, the CONV1D operation can be understood as a sliding window operation of a filter vector on an input vector, where individual multiplication results within a filter window are accumulated to generate one output value in the output vector. When we project the loop indices in the loop nest in Figure 2(a), we obtain the computation space in Figure 2(b), where loop indices are on each axis and partial sums are projected in the plane. We also construct the data space of each vector as shown in Figure 2(b), where the corresponding data index is on the axis. Note that the data index is not the same as the loop index (e.g., the input data index x is computed from the loop indices as x' + s). Therefore, we denote data indices using underlined indices in this example. Note that the output and filter data indices x' and s are identical to the loop indices x' and s in this simple example operation.

Figure 2. Example CONV1D operation and mapping of the example on an accelerator. We represent the mapping in both computation and data space, where each point corresponds to a partial sum and a data element, respectively. We use 1-based indices in this example.

We show an example of mapping on a three-PE accelerator in computation and data space in Figure 2(b). In this example mapping, we map three partial sum computations to each PE, and the PEs collaboratively compute partial outputs (accumulated partial sums) on the same set of outputs. When the PE array finishes computation in a tile (time=0 in the example), the PE array receives the next computation tile (time=1 in the example). The next computation tile is in the direction of loop index x'. We project the same mapping on the data space as shown in Figure 2, using the array subscripts in the loop nest of the CONV1D operation in Figure 2(a). That is, the partial sum at (x', s) requires the weight at s, the input at x' + s, and the output at x', as shown in the loop body in Figure 2(a). In the example, we observe that the data space explicitly shows data reuse behavior: mapped filter values do not move over time, which implies that the example mapping is based on a weight-stationary style dataflow. This implies that inferring data reuse can be significantly simplified when we describe mapping in the data space, which can facilitate a fast analysis framework for the efficiency of DNN accelerators.

Motivated by the observation, we introduce data-centric mapping directives that directly describe the mapping in data space.

Data-centric mapping directives
We introduce three data-centric mapping directives in Figure 3(a). Temporal and spatial map directives describe data mapping that changes in time and space (PEs), respectively. That is, a temporal map corresponds to a normal for loop in a loop nest while a spatial map corresponds to a parallel for loop. Those two mapping
directives take three parameters: mapping size, offset, and dimension. The mapping size specifies the number of data points (in tensors, the mapping size in the target dimension, since a mapping constructs a high-dimensional volume) mapped on each PE. The offset describes how the mapping is updated over time for a temporal map and over space for a spatial map. The cluster directive specifies the hierarchical organization of PEs, which enables us to explore multiple parallel dimensions in a mapping.

Figure 3. Introductory example of data-centric directives. (a) Syntax of data-centric directives. (b) Semantics of two mapping directives based on an example description process on the example CONV1D mapping in Figure 2. (c) Capability of data-centric mapping directives that can describe a variety of mapping styles.

To understand the syntax and semantics of data-centric mapping directives, in Figure 3, we provide an example process to determine a corresponding data mapping description of the example mapping in Figure 2(b). We omit the input tensor because the input tensor data mapping can be easily inferred from the mapping of output and filter. We first determine if the mapping is in time or space by checking whether the mapped data are the same or different (i.e., parallelization) across PEs. Next, we check the number of data points mapped on each PE to determine the mapping size, which is three for the output and one for the filter in the example. To determine the offset parameter, we check the temporal and spatial offset on the temporal and spatial map, respectively. For example, for the output vector in Figure 3(b), we observe that the starting index of the mapping changes by 3 over time, which implies that the temporal offset is 3. For the filter vector, we observe that the starting index of the mapping for each PE changes by 1, which implies that the spatial offset is 1. Note that a spatial map can also involve a temporal aspect, as in the mapping on the filter vector in Figure 3(b); after processing all the computation that involves the first data tile on the filter, the data tile moves on to the next position. This happens when the number of PEs is not sufficient to cover the entire spatially mapped dimension (also known as spatial folding), and an implicit temporal offset of (spatial offset) × (number of PEs) is applied. Finally, we write the dimension on which we describe the data mapping, and we obtain the data-centric mapping description of each data mapping, as shown in the resulting data mapping description column in Figure 3(b). To specify the entire example mapping, we need to specify the order of changes in data tile between the output and filter vectors. Since the filter is updated more slowly, we place the data mapping description of the filter above and that of the output below, just as we specify the update order in a loop nest (the outermost loop index changes slowest).
Capability of Mapping Directives
Using the data-centric directives, we can describe a variety of mappings as long as they map consecutive data points in a regular manner (i.e., affine loop subscripts when described in a loop nest representation). Figure 3(c) shows the capability of the data-centric directives by showing the changes in the resulting mapping when we update the base representation we obtained in Figure 3(b). When we change the directive order, we describe a different order of data tile update in dimensions. This effectively changes the stationary vector from weight to output, which changes the temporal data reuse opportunities. When we change the spatial dimension, we exploit the parallelism in a different dimension, as the third example in Figure 3(b) and (c) shows. Finally, if we change the mapping size (we accordingly update the offset to keep the description legal), we change the amount of mapped filter and output, as shown in Figure 3(c) and (d).

Based on the fact that data reuse is explicit in the data dimension and on the capability of data-centric directives, we implement an analytical cost-benefit analysis framework for DNN accelerators, MAESTRO. We give a high-level overview of MAESTRO next and then discuss insights from the case studies we performed based on MAESTRO.

ANALYTICAL COST MODEL
Based on the data-centric directives we discussed, we built a cost-benefit analysis framework that considers all of the three factors (DNN layers, hardware, and mapping) with precise modeling of data reuse. MAESTRO consists of five preliminary engines: tensor, cluster, reuse, performance analysis, and cost analysis. In this article, we focus on the high-level idea without details such as edge case handling, multiple layers, and multiple-level hierarchy. We present implementation details on our web page (https://maestro.ece.gatech.edu/) and in our open-source repository. We validated MAESTRO's performance model against RTL simulation and reported processing delay of two accelerators, MAERI [5] and Eyeriss [6], when running VGG16 and AlexNet, respectively. The latency estimated by MAESTRO is within 3.9% absolute error of the cycle-accurate RTL simulation and reported processing delay [6] on average.

CASE STUDIES
With MAESTRO, we perform deeper case studies about the cost-benefit tradeoff of various mappings when applied to different DNN operations. We evaluate five distinct mapping styles listed in Figure 4(a) in the “Case Study I: The Impact of Mapping Choices” section and the preference of each mapping to different DNN operators. For energy estimation, we multiply activity counts with base energy values from Cacti [13] simulation (28 nm, 2 kB L1 scratchpad, and 1 MB shared L2 buffer). We also present the distinct design spaces of an early layer (wide and shallow) and a late layer (narrow and deep) to show the dramatically different hardware preference of different DNN layers and mappings in the “Case Study II: Hardware Design-Parameters and Implementation Analysis” section.

Case Study I: The Impact of Mapping Choices
Figure 4(b) shows the DNN-operator granularity estimation of latency and energy of each mapping across five state-of-the-art DNN models listed in the “Case Studies” section. Note that this should be considered a comparison of mappings, not of actual designs, which can contain several low-level implementation differences, e.g., custom implementations of logic/memory blocks, process technology, etc. We observe that KC-P style mapping provides overall low latency and energy. However, the energy efficiency in VGG16 is worse than YR-P (Eyeriss [1] style) mapping, and the latency is worse than YX-P (Shidiannao [14] style) mapping in UNet. This is based on the different preference toward mapping of each DNN operator. YX-P provides short latency to segmentation networks like UNet, which has wide activation (e.g., 572 × 572 in the input layer) and recovers the original activation dimension at the end via up-scale convolution (e.g., transposed convolutions). Such a preference to the YX-P style is mainly based on its parallelization strategy: it exploits parallelism over both the row and column dimensions in
activation. The energy efficiency of YR-P mapping in VGG16 is based on its high reuse factor (the number of local accesses per fetch) in early layers. The YR-P mapping has 5.8× and 15.17× higher activation and filter reuse factors, respectively, in early layers. However, in late layers, the reuse factors of YR-P and KC-P mapping are almost similar (difference < 11%), so the KC-P mapping provides similar energy efficiency as YR-P in these cases. This can also be observed in the late layer (blue) bars in the bottom-row plots of Figure 4(b).

Figure 4. Summary of case studies. (a) List of mappings used in case study I. (b) Results of the case study I. Top and bottom rows present latency and energy, respectively. We apply 256 PEs and 32 GBps NoC bandwidth. We use five different DNN models: Resnet50 [7], VGG16 [9], ResNeXt50 [10], MobileNetV2 [11], and UNet [12]. The right-most column presents the average results across models for each DNN operator type and the adaptive mapping case. We compare the number of input channels and the input activation height to identify early and late layers (if C > Y, late layer; else, early layer). (c) Design space of KC-P and YR-P-based accelerators. We highlight the design space of an early and a late layer to show their significantly different hardware preference. We apply area/power constraints based on Eyeriss [6] to the DSE. The color of each data point indicates the number of PEs. We mark the throughput- and energy-optimized designs using stars and crosses. (d) The impact of multicast capability, bandwidth, and buffer size. Design points are selected from the upper-most design space in (c). The names of design points refer to the differences from the throughput-optimal reference point. Dark rows represent the efficiency of the selected design point.

The diverse preference to mappings of different DNN operators motivates us to employ
optimal mapping for each DNN operator type. We refer to such an approach as adaptive mapping and present the benefits in the right-most column of Figure 4(b), the average case analysis across entire models in the DNN operator granularity. By employing the adaptive approach, we could observe a potential 37% latency and 10% energy reduction. Such an optimization opportunity can be exploited by flexible accelerators like Flexflow [15] and MAERI [5] or via heterogeneous accelerators that employ multiple subaccelerators with various mapping styles in a single DNN accelerator chip.

Case Study II: Hardware Design-Parameters and Implementation Analysis
Using MAESTRO, we implement a hardware design space exploration (DSE) tool that searches four hardware parameters (the number of PEs, L1 buffer size, L2 buffer size, and NoC bandwidth) optimized for either energy efficiency, throughput, or energy-delay-product (EDP) within given hardware area and power constraints. The DSE tool receives the same set of inputs as MAESTRO, together with hardware area/power constraints and the area/power of building blocks synthesized with the target technology. For the cost of building blocks, we implement float/fixed point multiplier and adder, bus, bus arbiter, and global/local scratchpad in RTL and synthesize them using 28-nm technology. For bus and arbiter cost, we fit the costs into a linear and a quadratic model, respectively, using regression, because bus cost increases linearly and arbiter cost increases quadratically (e.g., matrix arbiter).

Using the DSE tool, we explore the design space of KC-P and YR-P mapping accelerators. We set the area and power constraints to 16 mm² and 450 mW, which are the reported chip area and power of Eyeriss [6]. We plot the entire design space we explored in Figure 4(c). Whether an accelerator can achieve peak throughput depends not only on the number of PEs but also on the NoC bandwidth. In particular, even if an accelerator has a sufficient number of PEs to exploit the maximum degree of parallelism a mapping allows, the accelerator suffers a communication bottleneck in the NoC if the NoC does not provide sufficient bandwidth. Such design points can be observed in the area-throughput plot in Figure 4(c). YR-P mapping requires low NoC bandwidth, so it does not show the same behavior as KC-P mapping. However, with more stringent area and power constraints, YR-P mapping will show the same behavior.

During DSE runs, MAESTRO reports buffer requirements for each mapping, and the DSE tool places exactly the amount of buffer MAESTRO reported. Contrary to intuition, larger buffer sizes do not always provide high throughput, as shown in the buffer-throughput plots in Figure 4 (plots in the second column). The optimal points regarding the throughput per buffer size are in the top-left region of the buffer-throughput plots. The existence of such points indicates that the tiling strategy of the mapping (mapping sizes in our directive representation) significantly affects the efficiency of buffer use. We observe that the throughput-optimized designs have a moderate number of PEs and buffer sizes, implying that hardware resources need to be distributed not only to PEs but also to NoC and buffers for high PE utilization. Likewise, we observe that the buffer amount does not directly increase throughput and energy efficiency. These results imply that all the components are intertwined, and they need to be well-balanced to obtain a highly efficient accelerator.

We also observe the impact of hardware support for each data reuse type, discussed in Table 1. Figure 4(d) shows such design points found in the design space of KC-P mapping on the VGG16-conv2 layer presented in the first row of Figure 4(c). The reference design point is the throughput-optimized design represented as a star in the first row of Figure 4(c). When bandwidth gets smaller, the throughput significantly drops, but energy remains similar. However, the lack of spatial multicast or reduction support resulted in an approximately 47% energy increase, as the third and fourth design points show.

CONCLUSION
Fast modeling of the cost-benefit space of DNN accelerators is critical for automated optimization tools since the design space is huge and high dimensional, based on hundreds of DNN model, hardware, and mapping parameters. In
this article, we presented a methodology to enable fast cost-benefit estimation of a DNN accelerator on a given DNN model and mapping, which consists of a compiler-friendly data-centric representation of mappings and an analytical cost-benefit estimation framework that exploits the explicit data reuse in data space in data-centric representations. To analytically estimate the costs and benefits, we demystify data reuse in hardware and required hardware support and apply the observation into the analytical cost-benefit estimation framework, MAESTRO.
Using MAESTRO, we show that no single mapping and no single hardware is ideal for all the DNN layers, which implies the complexity of the DNN accelerator design space. Using hardware design space exploration framework we implemented using MAESTRO, we also show

2. “Nvdla deep learning accelerator,” 2017. [Online]. Available: http://nvdla.org.
3. A. Parashar et al., “Scnn: An accelerator for compressed-sparse convolutional neural networks,” in Proc. Int. Symp. Comput. Archit., 2017, pp. 27–40.
4. N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. IEEE Int. Symp. Comput. Archit., 2017, pp. 1–12.
5. H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” in Proc. Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 461–475.
6. Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
7. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
8. A. Parashar et al., “Timeloop: A systematic approach to DNN accelerator evaluation,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., Mar. 2019, pp. 304–315.
9. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015. [Online]. Available: https://iclr.
that hardware features can significantly impact cc/archive/www/doku.php%3Fid=iclr2015:accepted-
the throughput and energy. Those cases show main.html
that the capability of MAESTRO for various anal- r, Z. Tu, and K. He,
10. S. Xie, R. Girshick, P. Dolla
ysis problems on DNN accelerator design space. “Aggregated residual transformations for deep
In addition to the case studies we performed, neural networks,” in Proc. IEEE Conf. Comput. Vis.
MAESTRO also facilitates many other optimiza- Pattern Recognit., 2017, pp. 1492–1500.
tion (e.g., neural architecture search specialized 11. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.
for a target accelerator, mapping search for a tar- Chen, “MobileNetV2: Inverted Residuals and Linear
get accelerator, etc.) frameworks based on its Bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern
speed and accuracy, which will lead to broad Recognit., 2018, pp. 4510–4520.
impact on various areas (DNN model design, 12. O. Ronneberger, P. Fischer, and T. Brox, “U-net:
compiler, architecture, etc.) in the DNN accelera- Convolutional networks for biomedical image
tor domain. segmentation,” in Proc. Int. Conf. Med. Image Comput.
Comput.-Assisted Intervention, 2015, pp. 234–241.
13. N. Muralimanohar, R. Balasubramonian, and N. P.
& REFERENCES Jouppi, “Cacti 6.0: A tool to model large caches,” HP
1. Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial Laboratories, vol. 27, p. 28, 2009.
architecture for energy-efficient dataflow for 14. Z. Du et al., “Shidiannao: Shifting vision processing
convolutional neural networks,” in Proc. Int. Symp. closer to the sensor,” in Proc. Int. Symp. Comput.
Comput. Archit., 2016, pp. 367–379. Archit, 2015, pp. 92–104.
15. W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible dataflow
accelerator architecture for convolutional neural networks,” in Proc. Int. Symp. High Perform.
Comput. Archit., 2017, pp. 553–564.

Hyoukjun Kwon is currently working toward the Ph.D. degree in the College of Computing, Georgia
Institute of Technology. His research interest includes communication-centric and flexible
accelerator design and modeling mappings on spatial accelerators. Kwon received B.S. degrees in
environmental materials science and in computer science and engineering from Seoul National
University. He is a student member of IEEE. Contact him at [email protected].

Prasanth Chatarasi is a senior Ph.D. student advised by Prof. Vivek Sarkar and Dr. Jun Shirako
in the School of Computer Science, Georgia Institute of Technology. His research focuses on
advancing compiler optimizations for high-performance applications on general-purpose and
domain-specific parallel architectures. In the past, he focused on enhancing traditional
compilation techniques for both sequential and explicitly parallel programs for performance
optimizations and debugging on general-purpose architectures. Contact him at [email protected].

Vivek Sarkar is a Professor and the Stephen Fleming Chair for Telecommunications in the College
of Computing at Georgia Institute of Technology, where he conducts research in multiple aspects
of software for parallel computing. He is a Fellow of ACM and IEEE. Contact him at
[email protected].

Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering,
Georgia Institute of Technology, where he also holds the ON Semiconductor Junior Professorship.
His research interests include computer architecture, on-chip interconnection networks, and
deep learning accelerators. Krishna received the Ph.D. degree in electrical engineering and
computer science from Massachusetts Institute of Technology. He received the NSF CRII Award in
2018. He is a member of IEEE and ACM. Contact him at [email protected].

Michael Pellauer is a Senior Research Scientist at NVIDIA. His research interests are building
domain specific accelerators, with a special emphasis on deep learning and sparse tensor
algebra. Pellauer received the Ph.D. degree from Massachusetts Institute of Technology, the
Masters degree from Chalmers University of Technology, and the Bachelor’s degree from Brown
University. Contact him at [email protected].

Angshuman Parashar is a Senior Research Scientist at NVIDIA. His research interests are in
building, evaluating, and programming spatial and data-parallel architectures, with a present
focus on automated mapping of machine learning algorithms onto architectures based on explicit
decoupled data orchestration. Parashar received the Ph.D. degree in computer science and
engineering from the Pennsylvania State University (2007), and the B.Tech. degree in computer
science and engineering from the Indian Institute of Technology, Delhi, India (2002). Contact
him at [email protected].
Theme Article: Top Picks
Energy-Efficient Video
Processing for Virtual
Reality
Yue Leng and Jian Huang
University of Illinois at Urbana–Champaign
Chi-Chun Chen, Qiuyue Sun, and Yuhao Zhu
University of Rochester
Abstract—Virtual reality (VR) has huge potential to enable radically new applications,
behind which spherical panoramic video processing is one of the backbone techniques.
However, current VR systems reuse the techniques designed for processing conventional
planar videos, resulting in significant energy inefficiencies. Our characterizations show
that operations that are unique to processing 360° VR content constitute 40% of the
total processing energy consumption. We present EVR, an end-to-end system for
energy-efficient VR video processing. EVR recognizes that the major contributor to the VR
tax is the projective transformation (PT) operations. EVR mitigates the overhead of PT
through two key techniques: semantic-aware streaming on the server and hardware-
accelerated rendering on the client device. Real system measurements show that EVR
reduces the energy of VR rendering by up to 58%, which translates to up to 42% energy
saving for VR devices.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
A major challenge in VR video processing today is the excessive power consumption of VR
devices. Our measurements show that rendering 720p VR videos in 30 frames per second (FPS)
consistently consumes about 5 W of power, which is twice as much power as rendering
conventional planar videos and exceeds the thermal design point (TDP) of typical mobile
devices.4 The device power requirement will only grow as users demand higher frame rates and
resolutions, presenting a practical challenge to the energy- and thermal-constrained mobile VR
devices.

The excessive device power is mainly attributed to the fundamental mismatch between today’s VR
system design philosophy and the nature of VR videos. Today’s VR video systems are designed to
reuse well-established techniques designed for conventional planar videos.1 This strategy
accelerates the deployment of VR videos, but causes significant energy overhead. More
specifically, VR videos are streamed and processed as conventional planar videos. As a result,
once on-device, each VR frame goes through a sequence of spherical–planar projective
transformations (PT) that correctly render a user’s current viewing area on the display. The PT
operations are pure overhead uniquely associated with processing VR videos, operations that we
dub “VR tax.” Our characterizations show that “VR tax” is responsible for about 40% of the
processing energy consumption, a lucrative target for optimizations.

We present EVR, an end-to-end system for energy-efficient VR video processing. EVR recognizes
that the major contributor to the VR tax is the PT operations. EVR mitigates the overhead of PT
through two key techniques: semantic-aware streaming (SAS) on the server and
hardware-accelerated rendering (HAR) on the client device. EVR uses SAS to reduce the chances
of executing PT on VR devices by prerendering 360° frames in the cloud. Different from
conventional prerendering techniques, SAS exploits the key semantic information inherent in VR
content that was previously ignored. Complementary to SAS, HAR mitigates the energy overhead of
on-device rendering through a new hardware accelerator that is specialized for PT. We implement
an EVR prototype on an Amazon AWS server instance and an NVIDIA Jetson TX2 board combined with
a Xilinx Zynq-7000 FPGA. Real system measurements show that EVR reduces the energy of VR
rendering by up to 58%, which translates to up to 42% energy saving for VR devices.

ENERGY CHARACTERIZATIONS
A VR system involves two distinct stages: capture and rendering. VR videos are captured by
special cameras, which generate 360° images that are best presented in the spherical format.
The spherical images are then projected to planar frames through one of the spherical-to-planar
projections, such as the equirectangular projection. The planar video is either directly
live-streamed to client devices for rendering (e.g., broadcasting a sports event), or published
to a content provider, such as YouTube or Facebook, and then streamed to client devices upon
request. Alternatively, the streamed videos can also be persisted in the local storage on a
client device for future playback. This article focuses on client-side VR content rendering,
i.e., after a VR video is captured, because rendering directly impacts VR devices’ energy
efficiency.

Rendering VR videos consumes excessive power on the VR device, which is particularly
problematic as VR devices are energy and thermal constrained. This section characterizes the
energy consumption of VR devices. Although there are many prior studies that focused on energy
measurement of mobile devices such as smartphones and smartwatches, this is the first such
study that specifically focuses on VR devices. We show that the energy profiles of VR devices
and traditional mobile devices are different.

We conduct studies on a recently published VR video dataset, which consists of head movement
traces from 59 real users viewing different 360° VR videos on YouTube.3 We replay the traces
tracking the movement of the same object, and show the results in Figure 2 as a cumulative
distribution plot. On average, users spend about 47% of time tracking an object for at least
5 s.

Figure 2. Cumulative distribution of tracking durations.

The near 100% frame coverage in many videos as the number of identified objects increases
indicates that the server can effectively predict user viewing area solely based on the visual
objects without sophisticated client-side mechanisms such as using machine learning models to
predict users’ head movement.5,10 This observation frees the resource-constrained VR from
performing additional work and simplifies the client design.

SAS has two major components: First, a static and offline analysis component that extracts
objects from the VR video upon ingestion and generates a set of FOV videos that could be
directly visualized once on a VR device; second, a dynamic and runtime serving component that
streams FOV videos on demand to the VR device. We augment the new FOV video with metadata that
corresponds to the head orientation for each frame. Once the FOV video together with its
associated metadata is on the client side and before a FOV frame is sent to the display, the VR
client compares the desired viewing area indicated by the head motion sensor with the metadata
associated with the frame. If the two match, the client directly visualizes the frame on the
display, bypassing the PT operations. Otherwise, the client system requests the original video
segment from the cloud, essentially falling back to the normal VR rendering mode.

Note that in our current design, we make the simplification that the client will always fall
back to the regular VR processing flow upon FOV-miss and restart streaming FOV videos based on
user’s current focus. One could also imagine other design alternatives. For instance, the
client could send the current (desired) FOV to the cloud service, which returns another FOV
video if there happens to be one that matches the desired FOV. We leave it as future work to
explore the full design space of the dynamic component.

Hardware-Accelerated Rendering
We propose a new hardware accelerator, PT engine (PTE), that performs efficient PTs. We design
the PTE as an SoC IP block that replaces the GPU and collaborates with other IPs such as the
Video Codec and Display Processor for VR video rendering. Figure 3 shows how PTE fits into a
complete VR hardware architecture. The PTE takes in frames that are decoded from the video
codec, and produces FOV frames to the frame buffer for display. If a frame is already prepared
by the cloud server as a projected FOV frame, the PTE sends it to the frame buffer directly;
otherwise the input frame goes through the PTE’s datapath to generate the FOV frame. The GPU
can remain idle during VR video playback to save power.

Figure 3. Overview of the augmented hardware architecture.

The bulk of the PTE is a set of PT units (PTU) that exploits the pixel-level parallelism. The
pixel memory (P-MEM) holds the pixel data for the incoming input frame, and the sample memory
(S-MEM) holds the pixel data for the FOV frame that is to be sent to the frame buffer. The PTE
uses DMA to transfer the input and FOV frame data. The PTE also provides a set of
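
The client-side decision that SAS relies on (compare the head orientation reported by the motion
sensor with the per-frame metadata, display the FOV frame on a match, otherwise fall back to
full rendering) can be sketched in a few lines of Python. This is a minimal illustration under
stated assumptions, not the EVR implementation: the `Orientation` type, the function names, the
5-degree tolerance, and the yaw/pitch conventions are all hypothetical, since the article only
says that the two orientations are compared for a match. The last function shows, for the
fallback path, the standard equirectangular mapping from a viewing direction to a pixel in the
full 360° frame.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Orientation:
    """Head orientation (hypothetical convention, not from the article)."""
    yaw_deg: float    # rotation about the vertical axis, in degrees
    pitch_deg: float  # elevation above the horizon, in [-90, 90] degrees


def angular_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, wrapping at 360."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def fov_hit(desired: Orientation, meta: Orientation, tol_deg: float = 5.0) -> bool:
    """True if the sensor orientation matches the frame metadata.

    The tolerance is an assumption made for this sketch.
    """
    return (angular_diff_deg(desired.yaw_deg, meta.yaw_deg) <= tol_deg
            and abs(desired.pitch_deg - meta.pitch_deg) <= tol_deg)


def choose_path(desired: Orientation, meta: Orientation) -> str:
    """FOV hit: show the prerendered frame. FOV miss: fall back to PT."""
    if fov_hit(desired, meta):
        return "display_fov_frame"
    return "fallback_projective_transform"


def equirect_pixel(o: Orientation, width: int, height: int) -> Tuple[int, int]:
    """Pixel of an equirectangular frame that lies in viewing direction o."""
    u = ((o.yaw_deg + 180.0) % 360.0) / 360.0  # longitude mapped to [0, 1)
    v = (90.0 - o.pitch_deg) / 180.0           # latitude mapped to [0, 1]
    x = min(int(u * width), width - 1)
    y = min(max(int(v * height), 0), height - 1)
    return x, y
```

For example, with a 3840 × 2160 frame, looking straight ahead (`Orientation(0.0, 0.0)`) maps to
the frame center. A real PT would repeat a mapping of this kind for every output pixel, which
is the per-pixel work the FOV-hit path avoids.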
EVR Implementation
Building on top of the two optimizing primitives, SAS and HAR, we design EVR. EVR includes a
cloud component and a client component. The cloud component extracts object semantics from VR
videos upon ingestion, and prerenders a set of miniature videos that contain only the user
viewing areas and that could be directly rendered as planar videos by leveraging the powerful
computing resources on the cloud. The client component retrieves the miniature video with
object semantics, and leverages the specialized accelerator for energy-efficient on-device
rendering if the original full video is required. For VR applications whose content comes from
panoramic videos available on the VR devices, the HAR can accelerate the video rendering with
lower energy overhead. We implement EVR in a prototype system, where the cloud service is
hosted on an AWS instance while the client is deployed on a customized platform that combines
the NVIDIA TX2 and Xilinx Zynq-7000 development boards, which can represent a typical VR client
device.

Baseline: We compare against a baseline that is implemented on the TX2 board and that does not
use SAS and HAR. The baseline is able to deliver a real-time (30 FPS basis) user experience.
Our goal is to show that EVR can effectively reduce the energy consumption with little loss of
user experience.

Benchmark: To faithfully represent real VR user behaviors, we use a recently published VR video
data set,3 which consists of head movement traces from 59 real users viewing different 360° VR
videos on YouTube. The videos have a 4K (3840 × 2160) resolution, which is regarded as
providing an immersive VR experience. The data set is collected using the Razer Open Source
Virtual Reality HDK2 HMD with an FOV of 110° × 110°, and records users’ real-time head movement
traces. We replay the traces to emulate readings from the IMU sensor and thereby mimic
realistic VR viewing behaviors. This trace-driven methodology ensures the reproducibility of
our results.
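
As a rough illustration of the trace-driven methodology, the sketch below replays a recorded
head-movement trace at a fixed frame rate, returning for each frame the most recent recorded
sample as the emulated sensor reading. The trace layout (a time-sorted list of
`(time_s, yaw_deg, pitch_deg)` tuples) and the function name are assumptions made for this
example; the article does not describe the data set's file format or the replay code.

```python
from bisect import bisect_right


def replay_trace(trace, fps=30.0):
    """Yield (frame_index, yaw_deg, pitch_deg) for each rendered frame.

    `trace` is a non-empty list of (time_s, yaw_deg, pitch_deg) tuples sorted
    by time (hypothetical layout). Each frame reads the latest sample whose
    timestamp is not after the frame time, as a sensor poll would.
    """
    times = [t for t, _, _ in trace]
    n_frames = int(times[-1] * fps) + 1
    for k in range(n_frames):
        frame_time = k / fps
        # Index of the last sample with timestamp <= frame_time.
        i = max(bisect_right(times, frame_time) - 1, 0)
        _, yaw, pitch = trace[i]
        yield k, yaw, pitch
```

Because the same trace always produces the same sequence of readings, two system configurations
can be compared on identical inputs.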
Results
Energy Reductions: On average, S and H achieve 22% and 38% compute energy savings,
respectively. S+H combines SAS and HAR and delivers an average 41%, and up to 58%, energy
saving. The compute energy savings across applications are directly proportional to the PT
operation’s contributions to the processing energy, as shown in Figure 1(b). For instance,
Paris and Elephant have lower energy savings because their PT operations contribute less to the
total compute energy consumption.

The trend is similar for the total device energy savings. S+H achieves on average 29% and up to
42% energy reduction. The energy reduction increases the VR viewing time, and also reduces the
heat dissipation and, thus, provides a better viewing experience.

User Experience Impact: We also evaluate user experience both quantitatively and qualitatively.
Quantitatively, we evaluate the percentage of FPS degradation introduced by EVR compared to the
baseline. We show that the FPS drop rate averaged across 59 users is only about 1%. Lee et al.
reported that a 5% FPS drop is unlikely to affect user perception.8 We assessed qualitative
user experience and confirmed that the FPS drop is visually indistinguishable and that EVR
delivers smooth user experiences. Although the goal of EVR is not to save bandwidth, EVR does
reduce the network bandwidth requirement through SAS, which transmits only the pixels that fall
within the user’s sight.

the energy efficiency of VR applications with cloud/client codesign.

Energy Characterization of VR Devices
Although there are many prior studies that focused on energy measurement of mobile devices,
such as smartphones and smartwatches, this is the first such study that specifically focuses on
VR devices. We show that the energy profiles of VR devices are significantly different from
those of traditional mobile devices. Our results suggest that we must rethink the conventional
system-level power/energy optimizations in the context of VR processing.

Implication on Hardware IP Block for VR
This article provides a case in point for future mobile SoCs to integrate VR-specific and
VR-optimized IP blocks, and our principal idea of bypassing the GPU will be critical to those
designs. We design the PTE as a standalone IP block in order to enable modularity and ease
distribution. Alternatively, the PTE logic could be tightly integrated into either the video
codec or display processor. Indeed, many new designs of the display processor have started
integrating functionalities that used to be executed in GPUs, such as color space conversion.
Such a tight integration would let the display processor directly perform PT operations before
scanning out the frame to the display, and thus reduce the memory traffic induced by writing
the FOV frames from the PTE to the frame buffer.
Available: https://community.arm.com/graphics/b/blog/posts/white-paper-360-degreevideo-rendering
2. X. Chen, N. Ding, A. Jindal, C. Hu, M. Gupta, and R. Vannithamby, “Smartphone energy drain
in the wild: Analysis and implications,” ACM SIGMETRICS Performance Eval. Rev., vol. 43, no. 1,
pp. 151–164, 2015.
3. X. Corbillon, F. Simone, and G. Simon, “360-degree video head movement dataset,” in Proc.
8th ACM Multimedia Syst. Conf., 2017, pp. 199–204.
4. M. Halpern, Y. Zhu, and V. Reddi, “Mobile CPU’s rise to power: Quantifying the impact of
generational mobile CPU design trends on performance, energy, and user satisfaction,” in Proc.
Int. Symp. High-Performance Comput. Archit., 2016, pp. 64–76.
5. B. Haynes, A. Minyaylov, M. Balazinska, L. Ceze, and A. Cheung, “VisualCloud demonstration:
A DBMS for virtual reality,” in Proc. ACM Int. Conf. Manage. Data, 2017, pp. 1615–1618.
6. J. Hegarty et al., “Darkroom: Compiling high-level image processing code into hardware
pipelines,” in Proc. SIGGRAPH, 2014.
7. J. Huang, A. Badam, R. Chandra, and E. Nightingale, “WearDrive: Fast and energy-efficient
storage for wearables,” in Proc. USENIX Annu. Tech. Conf., 2015, pp. 613–625.
8. K. Lee et al., “Outatime: Using speculation to enable low-latency continuous interaction for
mobile cloud gaming,” in Proc. 13th Annu. Int. Conf. Mobile Syst., Appl., Services, 2015,
pp. 151–165.
9. J. Li, A. Badam, R. Chandra, S. Swanson, B. Worthington, and Q. Zhang, “On the energy
overhead of mobile storage systems,” in Proc. File Storage Technol., 2014, pp. 105–118.
in Proc. 5th Workshop All Things Cellular: Oper., Appl. Challenges, 2016, pp. 1–6.

Yue Leng is currently a Software Engineer with Airbnb, San Francisco, CA, USA. Leng received
the M.S. degree in computer engineering from the University of Illinois at Urbana–Champaign in
2019. Contact her at [email protected].

Jian Huang is currently an Assistant Professor with the Electrical and Computer Engineering
Department, University of Illinois at Urbana-Champaign. Huang received the Ph.D. degree from
Georgia Institute of Technology in 2017. Contact him at [email protected].

Chi-Chun Chen is currently a Compiler Engineer with Cray, Inc., Seattle, WA, USA. Chen received
the M.S. degree in computer science from the University of Rochester in 2019. Contact him at
cchen120@ur.rochester.edu.

Qiuyue Sun is currently a senior undergraduate student with the Computer Science Department,
University of Rochester. Contact her at qsun15@u.rochester.edu.

Yuhao Zhu is currently an Assistant Professor with the Computer Science Department, University
of Rochester. Zhu received the Ph.D. degree from The University of Texas at Austin in 2017. He
is the corresponding author of this article. Contact him at [email protected].
Theme Article: Top Picks
Towards General-Purpose
Acceleration: Finding
Structure in Irregularity
Vidushi Dadu, Jian Weng, Sihao Liu, and
Tony Nowatzki
University of California Los Angeles
THE SLOWING IMPROVEMENTS of technology scaling are raising the demand for specialized hardware
accelerators, especially for increasingly difficult problems. While general-purpose
data-processing hardware, like GPUs or other vector architectures, is effective on regular
algorithms, algorithms with irregularity in their control flow or memory access patterns suffer
in performance. As evidence, many domain-specific accelerators have been proposed for
“irregular” domains like graph processing,3,9 compressed neural networks,4,6,10 databases12 and
genomics. Compared to such architectures, GPUs lose in performance and/or energy efficiency by
orders of magnitude. On the other hand, domain-agnostic architectures are widely applicable,
which is valuable for economies of scale and robustness to algorithm change. An important
question then is whether it is possible to build a programmable accelerator that is as capable
as GPUs and vector processors, but better suited to irregular algorithms.

Digital Object Identifier 10.1109/MM.2020.2986199
Date of publication 16 April 2020; date of current version 22 May 2020.
and wide memory access into power-of-two finer grain resources while maintaining
data-dependence semantics.

Evaluation and contribution: We study machine learning (ML) as our primary domain, and graph
processing and databases to demonstrate generality. SPU achieves between 1.8–7× speedup on
artificial intelligence (AI)/ML applications, and SPU’s ability to retain performance on dense
algorithms led to 4.5× speedup. On graph and database applications, SPU achieves similar
performance to domain-specific accelerators with modest performance and power overheads.

Our primary contributions in this work are the identification of the two common exploitable
data-dependence forms, and an ISA and hardware mechanisms to support them. More broadly, we
believe that taking a domain-agnostic approach can lead to novel insights and foster knowledge
transfer across domains.

EXPLOITABLE DATA-DEPENDENCE FORMS
We observe that two restricted forms of data dependence are sufficient to cover many
algorithms: stream join and AF-Indirect. In this section, we first define these forms and give
intuition on their performance challenges for existing architectures, and then overview our
proposal.

Preliminary Term (“Streams”): Both of the dependence forms rely on the concept of stream
abstractions, so we briefly explain. A stream is simply an ordered sequence of values. Relevant
to this work are memory streams, which are sequences of loads or stores with a well-defined
pattern.7,12 Streams are similar to vector accesses, but have no fixed length.

Stream Join
An interesting class of algorithms iterates over each input (each stream) in order, but the
total order of operations (and perhaps whether an output is produced) is data dependent. Two
relevant kernels are shown in Figure 2. Sparse vector multiplication (a) iterates over two
sparse lists (in CSR format) where indices are stored in sorted order, and performs the
multiplication if there is a match. The core of the merge kernel (b) iterates over two sorted
lists, and at each step outputs the smaller item. Even though the data structures, data types,
and purpose are very different, their relationship to data dependence is the same: they both
have stream access, but the relative ordering of stream consumption is data dependent (they
reuse data from some stream multiple times).

Stream-join definition: A program region that is regular except that the reuse of stream data
and the production of outputs may depend on the data.

Problem with CPUs/GPUs and motivation: Because of their data-dependent nature, stream-joins
introduce branch mispredictions for CPUs. For GPGPUs, vectorization becomes difficult due to
control divergence of single-instruction, multiple-thread (SIMT) lanes; also, the memory
pattern can diverge between lanes, causing bank conflicts.

To visualize the problem for CPUs, see Figure 2, which shows both the traditional dataflow and
proposed stream-join dataflow representation for the examples above. Here, black arrows
represent data dependence, and green arrows indicate control.

Figure 2(a) shows that the inner product dataflow can be mapped to a dataflow-based processor
like an out-of-order core, but only at low throughput. To explain, note that there is a
loop-carried dependence through the control-dependent increment and memory access. This
prevents perfect pipelining, and the throughput is limited to one instance of this computation
every n cycles, where n is the total latency of these instructions.

Insight: Our insight is that from the perspective of the memory, the control dependence is
mostly unnecessary, as most loads at the line granularity will be performed anyway. Therefore,
to break the dependence, we need to separate the loads from computation (this is what memory
streams do), then expose a pipelined mechanism for controlling the order of data consumption.
In the sparse vector example, we would like to reuse the larger of the two index values for
consideration (data-dependent reuse). If the comparison instruction can treat its inputs like a
queue, and specify the reuse behavior (i.e., pop the smaller element), this can be accomplished
in a pipelined fashion.
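
The two stream-join kernels described above can be written down as a short software sketch. The
Python below is only an illustration of the access pattern, not the SPU ISA or the code in
Figure 2: the function names and the separate index/value lists are assumptions made here. Both
loops read their inputs strictly in order, and the only data-dependent decision is which stream
head to pop (the smaller one) and whether an output is produced.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Inner product of two sparse vectors with sorted index lists."""
    i = j = 0
    acc = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            # Indices match: multiply and pop both stream heads.
            acc += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            # Pop the smaller index; the larger one is reused next step.
            i += 1
        else:
            j += 1
    return acc


def merge_sorted(a, b):
    """Core of merge: at each step output the smaller head and pop it."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One stream is exhausted; the rest of the other is already sorted.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

In both functions the loads `idx_a[i]` and `idx_b[j]` depend on the outcome of the previous
comparison, which is the loop-carried control dependence the text identifies as the throughput
limiter on conventional cores.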
Figure 3. SPU microarchitecture. (a) Sparse Processing Unit (SPU). (b) SPU Core. (c) Scratchpad Controller. (d) DGRA
Processing Element.
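
The alias-free indirection (AF-Indirect) form that the following paragraphs define and motivate
can be pictured with a small software model. The sketch below applies a stream of indirect
read-modify-write updates (`table[idx] += delta`) and counts issue cycles for a pipeline that
remembers only the addresses of the most recent in-flight updates and stalls on a true address
conflict. It is an illustration under assumptions, not the SPU hardware: the function names,
the three-cycle update latency, and the one-request-per-cycle issue model are chosen here to
mirror the figures quoted in the text (at most two pending addresses, at most a two-cycle
bubble).

```python
from collections import deque


def af_indirect_update(table, indices, deltas, latency=3):
    """Apply table[idx] += delta for each request and return issue cycles.

    At most (latency - 1) earlier updates can still be in flight, so only
    that many absolute addresses are remembered. A request stalls only if
    its address equals one of them (a real dependence).
    """
    pending = deque(maxlen=latency - 1)  # in-flight addresses; None = bubble
    cycles = 0
    for idx, delta in zip(indices, deltas):
        while idx in pending:
            # Conflict: insert a bubble, which ages the in-flight addresses.
            pending.append(None)
            cycles += 1
        table[idx] += delta
        pending.append(idx)
        cycles += 1
    return cycles


def serialized_cycles(n_updates, latency=3):
    """Cycles if every update holds a shared lock for the full latency."""
    return n_updates * latency
```

With distinct addresses the model issues one update per cycle, whereas the serialized variant
(a stand-in for coarse lock-bit behavior within one bank) pays the full latency for every
update.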
AF-Indirect Definition: A program region that is regular (including no implicit dependencies)
except that the address of one memory stream may depend on another, and a stream can encode a
read–modify–write operation.

Problem for CPUs/GPUs and our motivation: On CPUs, indirect memory is possible with
scatter/gather; however, the throughput is limited given the limited ports to read/write a
vector-length number of cache lines simultaneously. Also, not leveraging alias-freedom means a
reliance on expensive load-store queues.

Although GPUs can use their banked scratchpads for faster indirect access, the following two
reasons limit the indirect throughput. 1) Requests are not reordered across subsequent vector
warp accesses.11 Doing so in a GPU would require dependence checking of in-flight accesses, as
they cannot guarantee alias freedom. 2) Atomic updates to the same scratchpad bank are not
pipelined even though they access different memory locations. The reason is that the lock bits
for atomic operations are shared among multiple addressable locations.1 The coarse-granularity
locking is required to reduce the locking overhead.

To visualize the inefficiency of a typical GPU scratchpad, see Figure 2, which shows how
scratchpad vector requests (corresponding to indirect read and atomic update, respectively) are
served on a GPU. For simplicity, we assume a warp size of 8. As GPUs do not reorder requests
across warps, the update request vector is issued after the completion of all read requests.
For updates in GPU, we assume one lock-bit per scratchpad bank. Here, the three-cycle
nonpipelineable operation further worsens the overhead of bank conflicts.

Insight: Our insight is that the dependence check is not required between subsequent requests
(e.g., corresponding to different static loads) if alias freedom is known. Further, the atomic
operations required in these algorithms are often low-latency integer arithmetic logic unit
(ALU) operations. Therefore, for dependence check, the maximum number of possible conflicting
addresses is usually low (i.e., 1 less than the atomic update latency). Hence, we could compare
with absolute addresses instead of relying on a serializing lock bit mechanism.

Our AF-indirect proposal: We find that the desired behavior can be accomplished by:
1) exposing alias-freedom in the hardware–software interface to enable interleaving across
vectors; and 2) storing the absolute addresses of pending atomic updates (maximum 2) to enable
pipelining of nonconflicting addresses. Figure 2 shows how SPU is able to reorder requests, and
also to pipeline the initiation of atomic update requests with no bubbles. A stall is
introduced only in the presence of “real” dependencies, for example, see cycle-4 in AF-Indirect
reordering in Figure 2. This is limited to a maximum two-cycle bubble.

SPARSE PROCESSING UNIT
In this section, we first overview the primary aspects of the design, and then provide the
details of stream-join-enabled systolic-CGRA and the banked memory exposed to knowledge of
AF-Indirect.

Figure 3(a) shows the proposed SPU architecture. SPU cores are integrated into a mesh
network-on-chip (NoC). Each core is composed
of the specialized memory and compute fabric: decomposable granularity reconfigurable architecture (DGRA), together with a control core for coordination among streams.
Communication/synchronization: SPU provides two specialized mechanisms for communication. First, we include the multicast capability in the network. Data can be broadcast to a subset of cores, using the relative offset in the scratchpad. As a specialization for loading main memory, cores issue their load requests to a centralized memory stream engine, and data can be multicast from there to relevant cores. For synchronizing on data-readiness, SPU uses a dataflow-tracker-like mechanism to wait on a count of remote-scratchpad writes.
SPU Core
The basic operation of each core [see Figure 3(b)] is that the control core will first configure the DGRA for a particular dataflow computation, and then send stream commands to the scratchpad controller to read data or write to the DGRA, which itself has an input and output port interface to buffer data.
Stream-join compute fabric: DGRA: We augment a systolic CGRA to support stream-join control and dataflow computation with arbitrary data types. Figure 3(d) shows the microarchitecture of a DGRA processing element (PE) (green color represents control).
To implement control interpretation, we add a control lookup table (CLT) to each functional unit (FU), which determines a mapping between the control inputs and possible control operations. This mapping is configured along with the dataflow computation graph. During dataflow operation, the CLT consumes one of the dataflow inputs to produce control signals for the ALU (discard), associated registers (reset), and FIFOs connected to ALU inputs (reuse).
In the DGRA, we enable each coarse-grained resource to be decomposed into power-of-two fine-grain resources. For computation, the decomposable PE can split each coarse-grained input into multiple finer-grained inputs [16-b inputs in Figure 3(d)], which are used to feed two separate lower granularity ALUs. Correspondingly, the CLT and registers are also composable.
To route the data from PEs, the network of the DGRA is decomposable into multiple parallel finer-grain subnetworks (minimum 8 b). For flexible routing, we add the ability for incoming values to shift one subnetwork per switch hop.
Alias-freedom-exposed banked memory: Because our workloads often require a mix of linear and indirect arrays simultaneously, for example, streaming read of indices (direct) and associated values (indirect), we begin our design with two logical scratchpad memories, one highly banked and one linear. In this design, both exist within the same address space. Hence, memory streams may access locations in a remote core's scratchpad using a similar interface for linear and indirect streams.
The role of the scratchpad controller [see Figure 3(c)] is to generate requests for reads/writes to the linear scratchpad, and reads/writes/updates to the indirect scratchpad. A control unit assigns the scratchpad streams, and their state is maintained in either linear or indirect stream address generation logic. The controller should then select between any concurrent streams for address generation and send it to the associated scratchpad to maximize expected bandwidth. The linear address generator's operation is simple: it creates wide scratchpad requests using the linear access pattern.
The indirect address generator creates a vector of requests by combining each element of the stream of addresses (coming from the compute fabric, explained in the "Exploitable Data-Dependence Forms" section) with each element in the parent stream (i.e., b[i] in a[f(b[i])]). This vector of requests is sent to an arbitrated crossbar for distribution to banks, and a set of queues buffer requests for each static random access memory (SRAM) bank until they can be serviced.
Since there are no conflicts among indirect read/write requests, the requests are serviced from the top of the bank queue as soon as the scratchpad data bus becomes available. For atomic update requests, the requests can be serviced when both scratchpad read and write buses are available, and the updated address
does not conflict with the pending updates
issued from the same bank. As the ordering of
the data returned from read requests is critical
for dataflow operations, we employ an indirect
read reorder buffer (IROB) that maintains incom-
plete requests in a circular buffer (see Figure 2).
IROB entries are deallocated in-order when a
request’s data is sent to the compute unit.
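The servicing rule for atomic updates can be illustrated with a small cycle-count model. This is a minimal sketch rather than the authors' simulator: it assumes a single bank, a three-cycle read-modify-write (the latency used in the Figure 2 discussion), at most one issue per cycle, and in-order issue from the bank queue. The function names and the UPDATE_LATENCY constant are hypothetical and introduced only for illustration.

```python
from collections import deque

UPDATE_LATENCY = 3  # assumed cycles per atomic read-modify-write


def lock_bit_cycles(addresses):
    """Baseline with one lock bit per bank: each update holds the bank
    for the full latency, so updates are never pipelined."""
    return len(addresses) * UPDATE_LATENCY


def af_indirect_cycles(addresses):
    """Issue one update per cycle unless its absolute address matches an
    update that is still in flight (a real dependence)."""
    pending = deque()  # (address, cycle at which the update completes)
    cycle = 0
    for addr in addresses:
        while True:
            # Retire updates that have completed by the current cycle.
            while pending and pending[0][1] <= cycle:
                pending.popleft()
            if all(p_addr != addr for p_addr, _ in pending):
                break
            cycle += 1  # stall for one cycle on a conflicting address
        # Only the previous (latency - 1) updates can still be in flight,
        # so at most two addresses need to be compared here.
        assert len(pending) <= UPDATE_LATENCY - 1
        pending.append((addr, cycle + UPDATE_LATENCY))
        cycle += 1
    return pending[-1][1] if pending else 0


if __name__ == "__main__":
    for stream in ([0, 1, 2, 3], [5, 5], [0, 1, 0]):
        print(stream, lock_bit_cycles(stream), af_indirect_cycles(stream))
```

Under these assumptions, a stream of distinct addresses is fully pipelined, while a back-to-back repeat of the same address stalls for two cycles, which is consistent with the maximum two-cycle bubble mentioned earlier.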
Control ISA: We leverage an open-source stream-dataflow ISA [7] for the control core's implementation of streams, and add support for indirect reads/writes/updates, the stream-join dataflow model, and typed dataflow graphs. The ISA contains stream instructions for data transfer, including reading/writing to main memory and scratchpad.
METHODOLOGY
SPU: We implemented SPU's DGRA in Chisel, and implemented it in an industry 28-nm technology. We built an SPU simulator in gem5, using a RISC-V ISA for the control core.
Architecture comparison points: Table 1 shows the characteristics of the architectures we compare against, including their on-chip memory sizes, FU composition, and memory bandwidth. We also address whether an inorder processor is sufficient by comparing against "SPU-inorder," where the DGRA is replaced by an array of eight inorder cores (total of 512 cores). For reference, we also compared against a dual-socket Intel Skylake CPU, with 24 cores.
Workload implementations: We implement SPU kernels (both dense/sparse) for each workload, and use a combination of libraries and hand-written code to compare against CPU/GPU versions.
Table 1. Characteristics of evaluated architectures.
Characteristics   GPU        SPU-inorder   SPU
Processor         GP104      In-order      SPU-core
Cache+Scratch     4064 kB    2560 kB       2560 kB
Cores             1792       512           64 SPU cores
FP32 Unit         3584       2048          2432
FP64 Unit         112        512           160
Max Bw            243 GB/s   256 GB/s      256 GB/s
Figure 4. Overall performance.
EVALUATION
Our evaluation broadly addresses the question of whether restricted data-dependence forms exposed to an ISA (and exploited in hardware) can help achieve general-purpose acceleration.
Comparison to general-purpose accelerators: Figure 4 shows how SPU fares against CPU and GPU for workloads across ML, graph processing, and databases.
The workloads with a stream-join pattern, namely kernel support vector machines (KSVM), TPCH sort-heavy queries (SH), and gradient boosting decision trees (GBDT), achieve up to 10× speedup over the CPU due to avoiding the throughput-limiting cyclic dependence loop and lower computational density. The GPU also suffers from hardware underutilization as control leads to masking in vector lanes.
On workloads with AF-Indirect, namely fully connected layer (FC), convolution layer (CONV), arithmetic circuits (AC), Graph, and TPCH non-sort-heavy queries (N-SH), both GPU and SPU use a histogram-based approach. However, SPU's aggressive reordering of indirect updates in the compute-enabled scratchpad far outperforms the limited ordering in the GPU.
Finally, the ability to support both stream-join and AF-Indirect enables the efficient use of new compression techniques like run-length encoding. These techniques effectively reduce the required memory bandwidth, thus improving performance.
Even though SPU-inorder can relieve some of the vectorization overheads suffered by the GPU, it is insufficient due to lower peak throughput.
Domain accelerator comparison: Accelerators for FC [4], CONV [10], and Graphs [3] all employ compute-enabled banked memory to achieve high
indirect throughput. SPU is able to remain within 57% of their performance. The difference is due to other specializations, e.g., higher radix NoC in the Graph application-specific integrated circuit (ASIC) and higher buffer access bandwidth in
Table 2. Analysis of related works.
Specialized architectures   Exploitable dependence forms
TPUv1 (Dense ML)
bit-level operations. These optimizations can apply to SPU. More importantly, by finding structure and commonality in the dependence forms across domains, it becomes clear how to apply these optimizations to other superficially different problems, like database join or decision tree training.
Impact on general-purpose processors: This work focused on reconfigurable dataflow-like processors for implementing dependence-form specialization. While it was convenient, other architectures can equally benefit from such specialization.
Indirection in GPUs: A conceivable extension to a GPU ISA could enable the annotation of a program region as being alias-free indirect (informed by programmer or compiler). This would allow GPU scratchpads to eliminate memory dependence checking and enable aggressive reordering, leading to reduced impact of bank conflicts and higher throughput. NVIDIA's tensor core is a precedent that such specialization is feasible.
Stream-join SIMD: Stream-join control could be supported in a CPU, for example, through extensions to SIMD operations. An approach could be to add specialized instructions, which allow treating registers as FIFOs, and the branch instructions may control the order of data consumption (using a simple finite-state machine at FIFOs).
Hybrid FPGAs: Recent FPGAs (Xilinx Alveo) include neural network accelerator units, demonstrating the need for specialization of even reconfigurable hardware. Increasing these units' flexibility to be similar to SPU could simultaneously provide many of the same efficiency benefits as an ASIC while also retaining the fundamental value proposition of FPGAs: broad workload efficiency while retaining fine-grain reprogrammability.
Other exploitable data-dependence forms: It is possible that there may be alternate formulations or definitions of restricted data-dependence forms, which could lead to new opportunities for specialization. For example, a coarser grain form of data dependence than we have explored is data-dependent parallelism (aka dynamic parallelism). At the other end of the spectrum could be data-dependent data types, where at a fine grain, the data-type size is chosen to meet the precision requirements. One could imagine exposing these forms as first-class primitives in the hardware/software interface, and each could be plausibly useful in many domains.
Effect on algorithms: Finally, we think it is important not to forget that systems and application experts are constantly innovating new algorithms, and are now doing so with deep knowledge of the underlying hardware. Our results support the notion that a rigid architecture can limit certain algorithmic approaches from being viable. Therefore, we believe that incorporating support for structured irregularity into existing and new programmable architectures can lead to innovations in novel algorithms and data structures.
ACKNOWLEDGMENTS
We would like to thank G. Van den Broeck and A. Choi for their insights and help with arithmetic circuits workloads. We would also like to thank D. Ott and P. Subrahmanyam for their thoughtful conversations on the nature of irregularity and data dependence. This work was supported in part by the National Science Foundation under Grant CCF-1751400 and Grant CCF-1937599 and in part by the gift funding from VMware.
REFERENCES
1. J. Gomez-Luna, J. M. Gonzalez-Linares, J. I. Benavides Benitez, and N. Guil Mata, "Performance modeling of atomic additions on GPU scratchpad memory," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 11, pp. 2273–2282, Nov. 2013.
2. A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, "SparTen: A sparse tensor accelerator for convolutional neural networks," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 151–165.
3. T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, "Graphicionado: A high-performance and energy-efficient accelerator for graph analytics," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., Oct. 2016, pp. 1–13.
4. S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 243–254.
5. K. Hegde et al., "ExTensor: An accelerator for sparse tensor algebra," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 319–333.
6. A. K. Mishra, E. Nurvitadhi, G. Venkatesh, J. Pearce, and D. Marr, "Fine-grained accelerators for sparse machine learning workloads," in Proc. 22nd Asia South Pacific Design Autom. Conf., 2017, pp. 635–640.
7. T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, "Stream-dataflow acceleration," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 416–429.
8. T. Nowatzki, V. Gangadhar, K. Sankaralingam, and G. Wright, "Pushing the limits of accelerator efficiency while retaining programmability," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., Mar. 2016, pp. 27–39.
9. S. Pal et al., "OuterSPACE: An outer product based sparse matrix multiplication accelerator," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., Feb. 2018, pp. 724–736.
10. A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 27–40.
11. NVIDIA Whitepaper, "CUDA C best practices guide," May 2019. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
12. L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, "Q100: The architecture and design of a database processing unit," in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 255–268.
Vidushi Dadu is currently working toward the Ph.D. degree with the Department of Computer Science, University of California Los Angeles. Her current research focuses on hardware–software codesign to enable general-purpose acceleration. Dadu received the B.Tech. degree in electronics and communication engineering from the Indian Institute of Technology Roorkee. She is a student member of IEEE. Contact her at [email protected].
Jian Weng is currently working toward the Ph.D. degree with the Department of Computer Science, University of California Los Angeles. His research interests include analyzing and designing reconfigurable spatial architectures along with the associated compilation techniques. Weng received the B.Eng. degree in computer science from Shanghai Jiao Tong University. He is a member of the Association for Computing Machinery. Contact him at [email protected].
Sihao Liu is currently working toward the Ph.D. degree with the Department of Computer Science, University of California Los Angeles. His research interests include spatial architecture prototyping and design space exploration. Liu received the B.Eng. degree in electrical engineering from Xi'an Jiaotong University. He is a student member of IEEE. Contact him at [email protected].
Tony Nowatzki is currently an Assistant Professor with the Department of Computer Science, University of California Los Angeles. His research interests include architecture and compiler codesign and novel hardware/software interfaces. Nowatzki received the Ph.D. degree in computer science from the University of Wisconsin-Madison. He is a member of IEEE. Contact him at [email protected].
Theme Article: Top Picks
Varifocal Storage: Dynamic Multiresolution Data Storage
Yu-Ching Hu, University of California, Riverside
Te I, Google
Murtuza Lokhandwala, North Carolina State University
Hung-Wei Tseng, University of California, Riverside
Digital Object Identifier 10.1109/MM.2020.2985955
Date of publication 10 April 2020; date of current version 22 May 2020.
May/June 2020. Published by the IEEE Computer Society. 0272-1732 © 2020 IEEE
FOLLOWING THE HINTS of Amdahl's law, computer architecture/system designers always try to "make the common case fast" and focus on optimizing the most time-consuming component. However, it is easy for us to forget that the common case changes all the time and that optimizing the most common case can also introduce new overhead.
Modern computer systems and applications intensively rely on heterogeneous hardware accelerators and algorithms inspired by approximate computing to significantly shrink the execution time and improve energy efficiency in compute kernels, but leave other architectural components the same as in traditional, exact computing. As a result, we have now reached a point that the side effects of approximate computing on accelerators, moving data and adjusting data resolutions (e.g., data precision levels, summarized results, intermediate results, and sampled
contexts), have overtaken compute kernels as a new bottleneck in many applications.
This research presents varifocal storage (VS), a new architecture that coordinates application demands, hardware accelerators, and intelligent data storage devices to efficiently support various input resolutions of system components, but still maintain the flexibility and quality without additional costs. VS revisits the task allocation of approximate computing on general-purpose computers to place tasks such as raw-data retrieval, data-resolution adjustment, and quality control in the most appropriate place within a system from a full-stack system design perspective. Instead of faithfully shipping the raw data, the storage device in VS can work directly with the running application to generate and deliver data sets in the desired resolution and quality before going through the narrower system interconnect. In this way, VS minimizes the bandwidth demand from the data source and decreases the most latency-critical data-transfer overhead.
We evaluate VS by running a wide range of applications on our prototype SSD. The ideal, programmer-directed VS achieves 1.52× speedup on average over conventional approximate computing, while the automatic VS still achieves 1.46× speedup without programmers' hints.
DEMAND OF PRESENTING DATA SETS IN DIFFERENT RESOLUTIONS
Figure 1 illustrates the architecture of a modern heterogeneous computer. In addition to a CPU and a set of DRAM modules where the computer hosts the operating system and provides a synchronization point for runtime data storage, the computer may incorporate other computing units, including general-purpose graphics processing units, digital signal processors, or tensor processing units (TPUs), to accelerate the execution of compute kernels of workloads. These accelerators usually accept data in different precisions (e.g., 32 bits in regular GPU cores, 16 bits in Tensor Cores, 8 bits in TPUs) from the host processor architecture (e.g., 64 bits). The computer also stores input/output data persistently using solid state drives (SSDs) or storage over the network through a network interface card (NIC).
These heterogeneous hardware components exchange data through the system interconnect (i.e., PCIe) where the root complex is nowadays located on the CPU. Due to the limited total links available from the root complex, the system usually allocates a relatively larger number of links to high-throughput accelerators (e.g., 16 PCIe lanes for GPUs). The remaining components (e.g., SSDs, NICs) can only use relatively smaller numbers of PCIe links or even have to share the links with other peripherals through a PCIe switch.
As the computer may 1) use a data set for different purposes (e.g., an application can request an image in resolutions as high as 7680 × 4320 pixels to display or edit, but inferencing in a machine learning application only requires 1/64 of that resolution) or 2) compute data on accelerators with different precisions, the computer usually stores persistent data with high-resolution content and needs to generate the input data in the desired resolution dynamically. Figure 2 explains the resulting data-processing pipeline of running approximate-computing applications or using these hardware accelerators in modern heterogeneous computers.
Figure 1. Architecture of a modern heterogeneous computer.
The computer first needs to issue I/O commands for the storage device to access raw data
from its internal data arrays and then transfer the raw data through the underlying system interconnect while simultaneously serving other data-access requests. Once the host computer receives a chunk of data, the CPU can start producing data sets in lower resolutions. The compute kernel can then perform computations using the resolution-adjusted data sets. If the kernel can leverage a hardware accelerator, the system must additionally exchange data among different components through the interconnects before the accelerator can compute on the prepared data.
Figure 2. Data-processing pipeline of approximate applications using the conventional execution model.
With these highly optimized approximate-computing-based acceleration techniques but relatively limited bandwidth for data exchange, the latency of retrieving and preparing data for approximate-compute kernels becomes the most critical stage in the data-processing pipeline. Figure 3 compares the latency of receiving raw data chunks from a high-end NVM-Express (NVMe) storage device against the execution time of performing approximate/mixed-precision compute kernels on the same data chunks using an NVIDIA Tesla T4 GPU for a set of applications. Using a highly optimized I/O library, the overhead of receiving/preparing data sets exceeds the kernel execution as the most critical stage in a majority of these applications. We expect the gap in Figure 3 to grow with the relatively fast evolution of hardware accelerators, but slowly improved I/O and storage technologies.
VS SYSTEM ARCHITECTURE
Figure 4 shows VS in a heterogeneous computer system. VS revisits the storage-system stack to allow the device to dynamically produce data with different resolutions on demand. The VS core layer resides inside the storage device to change data resolutions presented to applications. The VS layer interacts with existing system I/O interfaces and provides an extended interface for resolution adjustments. The VS layer also works together with the SSD management layer (i.e., the flash translation layer in flash-based SSDs) to locate the requested data.
The host system needs an extended kernel driver and API functions for the applications to send requests, exchange data, and receive feedback from the VS core layer. The host application interacts with the API and sends commands specifying operators that VS should apply to the raw data.
The VS core layer supports a set of operators that are especially effective for applications that contain high data-level parallelism, but are able to tolerate inaccuracies in data sets. The VS core
Cost. The VS core layer can leverage existing SSD controllers and minimize extra hardware costs for the following reasons. 1) Empirical studies [9], [10], as well as our measurements in the unmodified prototype SSD, reveal that the SSD controller cores are mostly idle due to the relatively long latency of accessing NVM devices and the overprovisioning of processing power. 2) The critical path of the data-access pipeline is determined by either the access time of flash chips or the latency of the DMA stage, leaving slack that can be taken up by VS to apply operators without the need for additional accelerators. 3) SSD controllers enjoy the benefits of exclusive resources within the storage device and can perform data adjustment more efficiently than the host CPU.
The rest of this section will briefly describe the current programming model, operators, quality control mechanisms, and architectural support of VS.
Programming Model
To prepare an application to take advantage of the VS model, the programmer uses the VS library to specify data resolutions and retrieve adjusted data for the application. These library functions help the application to set up 1) the operators required to read data and whether any quality control mechanism is enabled, and 2) the parameters that allow the underlying storage device to adjust data as well as control variables that quality control mechanisms use to assure the quality of the adjusted data.
Figure 6 shows the KMeans code with VS function calls inserted. The modified KMeans code initiates VS by calling vs_setup to set the desired operator, resolution, and the data format. VS starts adjusting data only if the application calls the vs_read function. This function resembles the existing Linux read function except that 1) the resulting data size may be different from the requested data size, since operators will trim data sizes in most cases, and 2) the function will provide feedback regarding the resolution that VS selects. If the program calls a regular read function to read data, VS will act as a conventional data storage and not change the data resolution.
Figure 6. KMeans code sample with inserted VS function calls.
If VS successfully adjusts the data, the application can use a compute kernel that supports lower resolution input (e.g., cluster_approximate) to further reduce the total execution time of the program. Depending on the approximate compute kernels that the application uses, the programmer can choose different VS operators for data adjustments when calling the vs_setup function. In addition to the programmer's choice of resolutions, the programmer can optionally enable VS's quality control mechanisms, Autofocus and iFilter. Autofocus can automatically decide the resolution using a set of control variables for a chosen operator. The decisions that Autofocus makes are usually more conservative than those of a programmer, but Autofocus can nonetheless help applications adapt to data sets. If a given application can apply multiple versions of approximate kernels for different VS operators, the programmer can use the iFilter mechanism to let VS choose both the most appropriate operators and resolutions for each data set.
VS Operators
VS provides a set of operators to adjust data resolutions and exposes these operators through the NVMe interface as well as the system API. VS operators are selected under the following criteria: 1) The computation overhead must match
the processing power inside the storage device. Therefore, VS can minimize the impact on access latency and power consumption and avoid extra hardware costs. 2) A wide range of applications must be able to apply the operator, thereby allowing for more efficient use of valuable device resources (VS identifies the most useful operators from previous efforts [11], [12]). 3) The operator must allow VS to take advantage of mismatches between external and internal bandwidths and downsize the outgoing data. These operators can flexibly support various resolutions and accommodate exact computing.
The current VS framework supports the following categories of operators for diverse data types.
Data Packing: The data-packing operator trims the data set size by using fewer bytes to express each item and by condensing the layout in memory. Since the data-packing operator translates raw data into a less-precise data type, it can potentially decrease accuracy (e.g., double→float→half or int64→int32→short→char).
Quantization: The quantization operator rescales the raw values into a smaller value space while preserving the relative order of values. The quantization operator is applicable to applications that require a large value space.
Reduction/Tiling: The reduction operator applies a function (e.g., average) to a group of input values and yields a single output value. After applying a reduction operator, VS sends only the resulting value of each group to reduce the amount of data passing through the system interconnect.
Sampling: The sampling operator chooses a subset of items from the raw data and sends the selected items to the host computer. Operators in this category can perform uniform/random data selection or report only the most representative data. The sampling operator can potentially achieve the same effect as that of loop perforation but without any code modification.
First-Level Quality Control in the Storage Devices
If the input quality deviates too far from the original, approximate computing can hardly generate meaningful results [3]. Therefore, we designed two quality control mechanisms, Autofocus and iFilter. Both control the quality of adjusted data, avoiding the cases of sending data that fail to pass the quality control knobs to the host before the computation occurs. In contrast, all previous research projects censor computation results and always require at least part of the raw data to be present in the host main memory as well as being computed using exact computing.
Autofocus allows the programmer to simply specify the desired VS operator, letting VS decide the most appropriate resolution such that every checked piece of the adjusted data successfully passes the predefined, operator-dependent threshold values. If low-resolution data fail on the quality control knobs, VS will reject the low-resolution data and gradually use higher resolutions until the quality fits the demand, and then apply this resolution for the same data set later. Utilizing another important observation from previous research that a small subset of input data is representative of the rest of the input data in approximate-computing applications that tolerate inaccuracies [3], Autofocus selects the resolution using only a small portion of the raw input data from a requested data set and then monitors the quality of the adjusted input data.
iFilter can work without programmer input and is more effective than Autofocus for applications having compute kernels that are compatible with multiple VS operators. The iFilter algorithm is similar to the Autofocus algorithm in that it selects the most appropriate resolution for each compatible operator, except that iFilter will keep track of the resolution and the resulting data size for each operator. After selecting an operator that passes all quality control variables and generates the smallest data size among all passing operators, iFilter will enter the monitoring phase as in Autofocus.
Building a VS-Compliant Storage Device
Building a VS-compliant storage device means tackling challenges associated with 1) providing a hardware/software interface that allows applications to describe the resolutions and quality of the target data, and 2) minimizing the computational overhead/cost of adjusting data resolutions. VS overcomes the former challenge by extending the NVMe interface; this
IEEE Micro
requires the fewest modifications to the system stack and applications. VS addresses the latter challenge by exploiting the idle cycles available in modern SSD controllers.

RESULTS
In this article, we built a VS-compliant SSD by extending a commercialized, datacenter-class SSD. We attached the VS-compliant SSD to a high-end heterogeneous machine with a GPU. The host operating system contains the extended NVMe driver to support additional VS NVMe commands. We measured the performance of the resulting system with several workloads that span a wide range of applications.
Figure 7 shows the relative end-to-end latency of running complete workloads, with the conventional approximate computing approach using GPU-accelerated kernels as the baseline. Since VS efficiently prepares input data sets in storage devices for approximate computing kernels running on the GPU, the manual programmer-directed VS leads to a speedup of 1.52× for these applications. VS also achieved an average energy savings of 32% for applications compared to the conventional approximate computing.

Figure 7. Speedup of the end-to-end latency.

Using Autofocus to dynamically select the desired data resolutions, these applications achieve an average speedup of 1.43×. Without any programmer intervention, iFilter can improve performance by 1.46×. In contrast, the exact computing is 7% slower than the conventional approximate computing.
Several groups of our workloads shared the same data sets, but applied different operators and resolutions to accommodate each individual demand. Without an architecture like VS, the storage system must store multiple versions of a shared data set or provide raw data to the host for preprocessing, hurting either space efficiency or performance.
We also compare our mechanisms with other alternatives. Autofocus outperforms a state-of-the-art quality control mechanism by up to 2.86× because VS does not require the storage device to deliver raw data to the host. This article also measured that VS outperforms the best compression algorithm for each data set by 4.40× since VS incurs zero overhead in decoding data on the host side.

CONCLUSION
As an optimization usually comes with its own overhead, we really need to holistically revisit the interactions among different system/architectural components to negate the introduced side effects to scale up performance with new technologies. This article, in particular, demonstrates the case of modern heterogeneous computers running both exact and approximate computing applications. By making the demand of applications visible to the storage system and also making the computing capability available for preparing data in different resolutions, this article fundamentally alleviates the side effects from the conventional, single-point-optimization design: the data adjustment and movement overhead.
The resulting architecture not only hides the latency of data adjustment within the NVM data access pipeline of the storage device, but more importantly, it exploits the fact that approximate computing only needs lower resolution inputs to further reduce the data volume flowing through the system interconnect. Without a full-stack, holistic design like this work, we can never take full advantage of approximate computing.
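To make the data-volume argument concrete, the following minimal Python sketch mimics the reduction/tiling and sampling operators and the iFilter selection rule described earlier in the article. It is an illustration under stated assumptions, not the VS firmware: the function names (`reduce_tiles`, `sample_uniform`, `pick_operator`), the tile size, the stride, and the candidate tuples are all hypothetical.

```python
from statistics import mean


def reduce_tiles(values, tile, fn=mean):
    """Reduction/tiling: apply fn to each consecutive group of `tile` values."""
    return [fn(values[i:i + tile]) for i in range(0, len(values), tile)]


def sample_uniform(values, stride):
    """Uniform sampling: keep one item out of every `stride` items."""
    return values[::stride]


def pick_operator(candidates):
    """Hypothetical stand-in for the iFilter selection rule.

    candidates: list of (name, data_size, passes_quality) tuples.
    Returns the passing candidate with the smallest data size, or None.
    """
    passing = [c for c in candidates if c[2]]
    return min(passing, key=lambda c: c[1]) if passing else None


raw = [float(v) for v in range(16)]        # stand-in for raw data on the device
reduced = reduce_tiles(raw, tile=4)        # one average per group of four
sampled = sample_uniform(raw, stride=4)    # one item out of every four

# Assumed quality-check outcomes, purely for illustration.
chosen = pick_operator([
    ("reduction", len(reduced), True),
    ("sampling", len(sampled), False),
    ("raw", len(raw), True),
])
```

In this toy run both operators send a quarter of the raw items across the interconnect, and the selection rule keeps the smallest output among the candidates that were assumed to pass the quality check.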
May/June 2020
This article leverages the success of near-data/in-storage processing to implement the proposed idea. However, in addition to the “offloading computation” that prior work mainly focuses on, this article reveals the potential of “offering new features” (e.g., the quality control mechanisms in VS) to streamline the rest of the computation. As the budget for building storage devices is usually very limited, near-data/in-storage processors cannot compete with host CPUs and hardware accelerators. We hope this work inspires researchers to invent more “add-on” features that fit the capabilities of these devices to improve application performance.
We also expect the outcome of this article to inspire researchers to further discover the issues introduced by local optimizations and to consider the presence of heterogeneous computing resources, or intelligent data storage and I/O devices, to achieve glocal optimizations as we demonstrated in this article. More research on hardware/software interfaces that do not hide power but maintain good tradeoffs on programmability, simplicity, flexibility, and efficiency is necessary for emerging computer architectures.

ACKNOWLEDGMENTS
We would like to thank the AI infrastructure group and academic relations group from Face-
development of our prototype SSD. This work was sponsored by the two National Science Foundation Awards 1940046 and 1940048. This work was
North Carolina State University and University of California, Riverside.

REFERENCES
1. Y. Li et al., “A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks,” in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 175–188.
2. D. S. Khudia, B. Zamirai, M. Samadi, and S. Mahlke, “Rumba: An online quality management system for approximate computing,” in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit., Jun. 2015, pp. 554–566.
3. M. A. Laurenzano, P. Hill, M. Samadi, S. Mahlke, J. Mars, and L. Tang, “Input responsiveness: Using canary inputs to dynamically steer approximation,” in Proc. 37th ACM SIGPLAN Conf. Program. Lang. Design Implementation, 2016, pp. 161–176.
4. H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, “Dynamic knobs for responsive power-aware computing,” in Proc. 16th Int. Conf. Archit. Support Program. Lang. Operating Syst., 2011, pp. 199–212.
5. A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “Enerj: Approximate data types for safe and general low-power computation,” in Proc. 32nd ACM SIGPLAN Conf. Program. Lang. Design Implementation, 2011, pp. 164–174.
6. X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali, “Proactive control of approximate programs,” in Proc. 21st Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2016, pp. 607–621.
7. A. Sampson, J. Nelson, K. Strauss, and L. Ceze, “Approximate storage in solid-state memories,” in Proc. 46th Annu. IEEE/ACM Int. Symp. Microarchit., 2013, pp. 25–36.
8. S. Ganapathy, A. Teman, R. Giterman, A. Burg, and
10. G. Koo et al., “Summarizer: Trading bandwidth with computing near storage,” in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchit., 2017,
11. M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “Sage: Self-tuning approximation for graphics engines,” in Proc. 46th Annu. IEEE/ACM Int. Symp. Microarchit., 2013, pp. 13–24.
12. M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, “Paraprox: Pattern-based approximation for data parallel applications,” in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 35–50.
Yu-Ching Hu is currently working toward the Ph.D. degree with the Department of Computer Science and Engineering, University of California, Riverside. His research interests focus on improving the performance of database and machine learning applications through optimizing their interactions with heterogeneous computing units and storage systems. He is a member of IEEE. Contact him at [email protected].

Murtuza Lokhandwala is currently working as a Design Verification Engineer. His interests include system design and architecture for processors, digital design, and verification. Lokhandwala received the master’s degree in computer engineering from North Carolina State University. Contact him at [email protected].

Te I is currently a Software Engineer at Google, working in the Google Translate Team. Te I received the M.S. degree in computer science from North Carolina State University. Contact him at [email protected].

Hung-Wei Tseng is an Assistant Professor with the Department of Electrical and Computer Engineering, University of California, Riverside. His research interests include heterogeneous computer architectures and nonvolatile memory based storage systems as well as their programming languages, runtime systems, compilers, and applications. Tseng received the Ph.D. degree in computer science from the University of California, San Diego. Contact him at [email protected].
Theme Article: Top Picks
AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
Nayana Prasad Nagendra, Princeton University
Grant Ayers, Google
David I. August, Princeton University
Hyoun Kyu Cho and Svilen Kanev, Google
Christos Kozyrakis, Stanford University
Trivikram Krishnamurthy, Nvidia
Heiner Litz, University of California, Santa Cruz
Tipp Moseley and Parthasarathy Ranganathan, Google
Abstract: It is well known that the datacenters hosting today’s cloud services waste a significant number of cycles on front-end stalls. However, prior work has provided little insight into the source of these front-end stalls and how to address them. This work analyzes the cause of instruction cache misses at a fleet-wide scale and proposes a new compiler-driven software code prefetching strategy to reduce instruction cache misses by 90%.
Digital Object Identifier 10.1109/MM.2020.2986212
Date of publication 16 April 2020; date of current version 22 May 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.

Due to the continued growth of cloud-based digital services, warehouse-scale computers (WSCs) are now serving billions of devices across the world. This massive growth necessitates improving the cost and efficiency of WSCs through microarchitectural and system-software-based optimizations.
WSC workloads are characterized by deep software stacks in which individual requests can traverse many layers of data retrieval, data
processing, communication, logging, and monitoring. As a result, the instruction working set sizes of WSC workloads today are often 100× larger than server-class L1 instruction caches (i-cache)1 and are currently expanding at rates of over 20% per year.2 As cache sizes have not improved significantly over the last many years, WSC workloads are becoming increasingly front-end bound. Thus, processors are no longer able to sustain a high instruction fetch rate, which manifests itself in large unrealized performance gains due to front-end stalls, which are dominated by increased i-cache misses. While prior work has identified the growing importance of this problem, to date, there has been little analysis of the sources of these misses and of available opportunities to address them.
We corroborate this challenge for our WSCs on Google web search leaf servers, in which 13.8% of the total performance potential is wasted due to “front-end latency,” principally caused by i-cache misses. We also measured L1 i-cache miss rates of 11 misses per kilo-instruction, and a hot steady-state instruction working set of approximately 4 MiB. This is significantly larger than the sizes of the L1 and L2 caches on today’s server CPUs, but small and hot enough to easily fit and remain in the shared L3 cache (typically 10s of MiB).1
To understand and improve the i-cache behavior of WSC applications, we focus on tools and techniques for “broad” acceleration* of thousands of WSC workloads. At the scale of a typical WSC server fleet, performance improvements of a few percentage points (and even sub-1% improvements) lead to millions of dollars in cost and energy savings, as long as they are widely applicable across workloads. To that end, our work provides three primary contributions: i) a methodology for analyzing instruction profiles at a fleet-wide scale; ii) detailed insights about code fragmentation and the perils of micro-optimization; and iii) a novel software-based code prefetch algorithm for reducing i-cache misses at fleet-wide scales.

AsmDB: A WSC ASSEMBLY DATABASE
To enable the necessary horizontal analysis and optimization across the server fleet, we built a continuously updated assembly database (AsmDB) to collect instruction- and basic-block-level information for most observed CPU cycles across the thousands of real production services executing across the Google fleet. AsmDB aggregates instruction and control-flow data collected from hundreds of thousands of machines each day and grows by multiple TiB each week. We have been continuously populating AsmDB over several years with the goal of providing easy-to-query assembly-level information for nearly every unique instruction executed in our WSCs. We demonstrate several cases where AsmDB proves invaluable for front-end optimization, including spotting opportunities for manual optimizations, finding areas for improvement in existing compiler passes, as well as serving as a data source for a novel compiler-driven technique to improve i-cache hit rates.
AsmDB is an always-on, massive-scale fleet-wide performance monitoring system. It uses hardware support to collect bursty execution traces, performs fleet-wide temporal and spatial sampling, and leverages sophisticated offline post-processing to construct full-program dynamic control-flow graphs. Collecting and processing profiling data from hundreds of thousands of machines is a daunting task by itself. However, we have carefully designed the system architecture such that it can capture and process profiling data in a cost-efficient way while still processing terabytes of data each week.
A fleet-wide assembly database, such as AsmDB, provides a scalable solution to search for performance antipatterns and opens up new

* “Deep” acceleration would involve focusing on a handful of workloads and trying to recover most of the 15% performance opportunity.
Figure 3. Fraction of hot code within a function among the 100 hottest fleet-wide functions. From the left-hand side to right-hand side, “hot code” is defined as covering 90%, 99%, and 99.9% of execution.

into the cache, in addition to the necessary hot instructions leading to hot/cold fragmentation and thus suboptimal utilization of the limited cache resources.
We more formally define fragmentation to be the fraction of code (in bytes) that is necessary to cover the last 10%, 1%, or 0.1% of executions of a function. Because functions are sequentially laid out in memory, these cold bytes are very likely to be brought into the cache by next-line prefetchers. Intuitively, this definition measures the fraction of i-cache capacity potentially wasted by loading cold cache lines.
We find that intrafunction fragmentation is especially prevalent. Even after compiling with feedback-directed optimization, 50% of the code in all functions is cold, frequently interleaved with hot code sections, and thus practically never executed despite being likely to be in the cache. This is true even among the hottest and most well-optimized functions in our server fleet.
Using AsmDB data, we calculate the measure of fragmentation for the top 100 functions by execution count in our server fleet. Figure 3 plots it against the containing function size. If we consider code covering the last 1% of execution as “cold,” 66 functions out of the 100 are comprised of more than 50% cold code. Even with a stricter definition of cold (<0.1%), 46 functions have more than 50% cold code. Perhaps not surprisingly, there is a loose correlation with function size: larger (more complex) functions tend to have a larger fraction of cold code.
We attribute the intrafunction fragmentation to the deep inlining that the compiler needs to perform when optimizing typical WSC flat execution profiles. Hence, this suggests that combining inlining with more aggressive hot/cold code splitting can achieve better i-cache utilization, freeing up the scarce capacity.
On a finer granularity, we find that the individual cache lines are also often fragmented and waste cache capacity, especially for small functions. Unlike cold cache lines within a function, cold bytes in a cache line are always brought in along with the hot ones, introducing an even more significant performance issue. This suggests that there exist opportunities to improve the basic-block layout, at link or postlink time, when compiler profile information is precise enough to reason about specific cache lines.
We provide a concrete example of optimizing code bloat and fragmentation by focusing on memcmp, one of the hottest functions contributing to cache misses. memcmp clearly stands out of the correlation between call frequency and function size in Figure 2. It is both extremely frequent and, at almost 6 KiB of code, 10× larger than memcpy, which is conceptually of similar complexity. Examining its layout and execution patterns (see Figure 4) suggests that it does suffer from a high amount of fragmentation, as we observed fleet wide in the previous section. While covering 90% of executed instructions in memcmp only requires two cache lines, getting up to 99% coverage requires 41 lines or 2.6 KiB of cache capacity. Not only is more than 50% of the code cold, it is also interspersed with hot regions, increasing the likelihood to be brought in by next-line prefetchers. Such code bloat is costly
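The fragmentation measure defined above can be illustrated with a short Python sketch. This is only one plausible reading of the definition, not the AsmDB implementation: the function name `fragmentation`, the `(size_bytes, exec_count)` block representation, the per-mille parameter, and the sample profile are all assumptions made for illustration.

```python
def fragmentation(blocks, cold_tail_permille):
    """Fraction of a function's bytes needed only for the cold tail of execution.

    blocks: list of (size_bytes, exec_count) pairs, one per code region.
    cold_tail_permille: size of the cold tail in thousandths of total execution
        (100 for the last 10%, 10 for the last 1%, 1 for the last 0.1%).
    """
    total_exec = sum(count for _, count in blocks)
    total_bytes = sum(size for size, _ in blocks)
    # Execution that the hot code must cover, scaled by 1000 to stay in integers.
    needed = (1000 - cold_tail_permille) * total_exec
    covered = 0
    hot_bytes = 0
    # Greedily take the most frequently executed regions first.
    for size, count in sorted(blocks, key=lambda b: b[1], reverse=True):
        if covered * 1000 >= needed:
            break
        covered += count
        hot_bytes += size
    return (total_bytes - hot_bytes) / total_bytes


# Hypothetical profile: four 64-byte cache lines with very skewed execution counts.
profile = [(64, 900), (64, 90), (64, 9), (64, 1)]
frag_10 = fragmentation(profile, 100)   # code beyond 90% coverage counts as cold
frag_1 = fragmentation(profile, 10)     # code beyond 99% coverage counts as cold
frag_01 = fragmentation(profile, 1)     # code beyond 99.9% coverage counts as cold
```

With this made-up profile, three of the four lines are cold at the 90% threshold and one line is still cold at 99.9%, which mirrors the shape of the memcmp observation: a small number of lines carries most of the execution.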
Figure 6. Miss coverage and performance improvement for the best-performing configuration for each workload.

At its core, our prefetch injection strategy leverages the observation that the injection site of a prefetch instruction can be freely moved within the window of opportunity to minimize fan-in and fan-out. We call this approach dynamic window injection. At a high level, our prefetch procedure first constructs the execution history for each miss and then traverses the control flow graph in the reverse direction until it reaches the end of the instruction window, calculated based on the application-level IPC. Next, prefetch injection sites are searched for each miss among each of its execution paths, which have minimal fan-in and fan-out. Prefetch instructions are then automatically inserted in the selected injection sites for the corresponding misses as part of the final linking steps.
We prototype the effects of our proposed software prefetching technique on memory traces from several WSC workloads. We evaluate on a modified version of the zsim simulator,5 using the system parameters modeled against an Intel Haswell datacenter-scale server processor. We focus primarily on three WSC applications: a web search leaf node, an ads matching service, and a knowledge graph back-end. For each workload, we collect traces during a representative single-machine load test, which sends realistic loads to the server under test.
Figure 6 shows that our prefetching technique is able to eliminate 91%–96% of all i-cache misses, with a performance improvement proportional to the front-end boundedness of the application and the gap left from NLP. In all cases, fewer than 2.5% of additional dynamic instructions are added for code prefetches.

LONG-TERM IMPLICATIONS
With increased technological growth, WSCs now serve billions of devices and applications across the planet. Due to their success, we expect an ever-greater reliance on WSCs in the near future, providing faster, more reliable, and more secure services to society. These increasing demands necessitate achieving higher performance for WSCs in order to be cost- and energy-efficient for WSC companies and their customers while simultaneously reducing the environmental impact on our world.
In combination with the slowdown of Moore’s law, improving the efficiency of existing hardware in WSCs becomes even more critical. We analyzed a web search binary, showing that 68% of the CPU performance potential is lost due to pipeline stalls, of which 13.8% are due to the front-end not being able to deliver instructions fast enough.
This article addresses the front-end bottleneck on the following fronts.

First, we have built a tool that is capable of collecting data from live datacenter applications at the granularity of instructions and at the scale of a WSC. We have described the architecture design decisions in detail, enabling other WSC operators to reproduce our system.
Second, this article is the first work that shows detailed characterization studies of the processor front-end at the scale of a WSC by describing previously unreleased performance characteristics of WSC workloads.
Third, we have proposed and evaluated a novel software-based code prefetch strategy to automatically and effectively reduce i-cache misses across large WSC workloads.

This work provides a powerful methodology to perform further at-scale research to obtain a detailed understanding of the microarchitectural characteristics and the interplay between current software and hardware. In addition, its reproducibility enables other WSC companies to perform similar research. Overall, such research would
enable hardware vendors to work closely with software developers to better design future processors.
Our front-end characterization studies benefit the compiler and architecture communities both in academic and industrial settings. Our results on micro-optimizations, fragmentation, and code bloat can help in fine-tuning compiler passes, optimizing inlining strategies, and basic block layouts. Similarly, our studies provide valuable information to architecture researchers, exposing existing software loopholes that can be addressed with next-generation hardware designs.
Our work on software code prefetching serves as a strong case study for hardware vendors to provide support for a software code prefetch instruction and to implement such an instruction in the instruction set architecture (ISA). With this, compiler writers and software developers can leverage code prefetching and its resulting performance improvements in an automatic and scalable way.
More broadly, this article provides two insights, which we believe will have a significant and long-lasting impact on future research in the performance optimization and computer architecture domain. The first insight teaches the importance of enabling fleet-wide performance optimizations, which we also refer to as the Amdahl’s law of WSC performance. Traditionally, performance optimizations have been focused on individual applications. In this approach, applications are profiled to determine the most compute-intensive regions, resulting in the largest performance gains when optimized. However, this approach no longer applies to WSCs as datacenters run thousands of different applications simultaneously. As a result, compute-intensive application-specific kernels are no longer worth optimizing. Instead, performance engineers need to focus on code that is shared among many applications in the fleet, representing the largest aggregated percentage of compute cycles.
The second insight teaches the importance of designing domain-specific general-purpose processors. WSCs have grown to a size at which designing domain-specific accelerators becomes feasible and cost-efficient. However, while this approach has proven successful for domains such as deep learning, most of the fleet cycles are still executed on general-purpose processors as many applications are too complex and rapidly changing to render custom-designed hardware feasible. Nevertheless, as this article showed, the performance characteristics of WSC applications are fundamentally different from traditional applications such as the SPEC benchmark suite. WSC processors may differ with capabilities such as our proposed instruction prefetching mechanism, which may be of little use to SPEC applications, but which delivers significant performance gains for datacenter applications.
In summary, the evidence is strong that this article will promote the research and development of new compiler techniques, new processor designs, and new ways of collecting and analyzing behaviors at the warehouse scale.

CONCLUSION
This work focused on understanding and improving i-cache behavior, which is a critical performance constraint for WSC applications. We developed AsmDB, a database for instruction and basic-block information across thousands of WSC production binaries, to characterize i-cache miss-working sets and miss-causing instructions. We used these insights to motivate fine-grain layout optimizations to split hot and cold code and better utilize limited i-cache capacity. We also proposed a new feedback-driven optimization that inserts software instructions for code prefetching based on the control-flow information and miss profiles in AsmDB. This prefetching optimization can cover up to 96% of i-cache misses without significant changes to the processor and while requiring only very simple front-end fetch mechanisms.

ACKNOWLEDGMENTS
This work was supported by the NSF Award CCF-1823559. Nayana Prasad Nagendra and Grant Ayers contributed equally to this work.
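As a reading aid for the prefetch-injection step summarized in the conclusion, the sketch below shows one simplified way to choose an injection site along a single execution path leading to a miss. It is a hypothetical illustration, not the authors' algorithm: the function name `pick_injection_site`, the per-block instruction counts, the `fanout` table, and the `min_dist`/`max_dist` window bounds (which the article derives from application-level IPC) are all assumed, and fan-out alone stands in for the combined fan-in/fan-out criterion.

```python
def pick_injection_site(path, sizes, fanout, min_dist, max_dist):
    """Choose a basic block in which to place a code prefetch for one miss.

    path: basic-block ids executed before the miss, oldest first.
    sizes: instructions per basic block.
    fanout: number of CFG successors per basic block.
    min_dist, max_dist: window bounds, in instructions ahead of the miss.
    """
    dist = 0
    best = None
    for block in reversed(path):          # walk backward from the miss
        dist += sizes[block]              # instructions from block start to the miss
        if dist > max_dist:
            break                         # beyond the window: prefetch would be too early
        if dist >= min_dist:              # far enough ahead to hide the miss latency
            if best is None or fanout[block] < fanout[best]:
                best = block
    return best


# Hypothetical execution history for one miss.
path = ["A", "B", "C", "D"]
sizes = {"A": 10, "B": 8, "C": 6, "D": 4}
fanout = {"A": 1, "B": 3, "C": 1, "D": 2}
site = pick_injection_site(path, sizes, fanout, min_dist=5, max_dist=20)
```

Here block D is too close to the miss, block A is outside the window, and C is preferred over B because it has fewer successors, so a prefetch placed in C is less likely to execute on paths that never reach the miss.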
REFERENCES
1. G. Ayers, J. H. Ahn, C. Kozyrakis, and P. Ranganathan, “Memory hierarchy for web search,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 643–656.
2. S. Kanev et al., “Profiling a warehouse-scale computer,” in Proc. Int. Symp. Comput. Archit., 2015, pp. 158–169.
3. R. Kumar, B. Grot, and V. Nagarajan, “Blasting through the frontend bottleneck with shotgun,” in Proc. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 30–42.
4. G. Ren, E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt, “Google-wide profiling: A continuous profiling infrastructure for data centers,” IEEE Micro, vol. 30, no. 4, pp. 65–79, Jul./Aug. 2010.
5. D. Sanchez and C. Kozyrakis, “ZSim: Fast and accurate microarchitectural simulation of thousand-core systems,” in Proc. Int. Symp. Comput. Archit., 2013, pp. 475–486.

Nayana Prasad Nagendra is currently working toward the Ph.D. degree with the Department of Computer Science, Princeton University. Her research interests include performance analysis and microarchitectural design with a focus on data centers. This work was done while she was an intern at Google. She is a student member of IEEE and ACM. Contact her at [email protected].

Grant Ayers is currently a Software Engineer at Google. His research interests include computer architecture, security, and accelerators. He joined Google after receiving the Ph.D. degree in computer science from Stanford University. This work was done while he was an intern at Google. Contact him at [email protected].

David I. August is currently a Professor with the Department of Computer Science, Princeton University, where he directs the Liberty Research Group. His research interests include compilers and computer architectures. August received the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign. Contact him at [email protected].

Hyoun Kyu Cho is currently a Software Engineer at Google. His research interests include compiler optimization, parallel computing, and performance analysis. Cho received the Ph.D. degree in computer science and engineering from the University of Michigan at Ann Arbor. Contact him at [email protected].

Svilen Kanev is currently a Software Engineer at Google, working on translating datacenter performance analysis insights into performance and TCO gains. He is broadly interested in anything that straddles the hardware-software interface. Kanev received the Ph.D. degree in computer science from Harvard University. Contact him at [email protected].

Christos Kozyrakis is currently a Professor of electrical engineering and computer science with Stanford University. His research interests include hardware architectures and system software for cloud computing and emerging workloads. Kozyrakis received the Ph.D. degree in computer science from the University of California Berkeley. He is a Fellow of IEEE and ACM. Contact him at [email protected].

Trivikram Krishnamurthy is currently a Senior Engineering Manager at Nvidia. Before joining Nvidia, he was a Software Engineer at Google. Krishnamurthy received the M.S. degree in electrical and computer engineering from the University of California Santa Barbara. Contact him at [email protected].

Heiner Litz is currently an Assistant Professor in the Computer Science and Engineering Department, University of California, Santa Cruz (UCSC) and the Associate Director of the Center for Research in Storage Systems. His main research interests include computer architecture, operating systems, and storage with a focus on data centers. Before joining UCSC, he was a Researcher at Google. Litz received the Ph.D. degree from Mannheim University. He is a member of IEEE and ACM. Contact him at [email protected].

Tipp Moseley is currently a Principal Software Engineer at Google, where he works on datacenter-scale performance analysis. His research interests include compilers, operating systems, performance analysis, runtime systems, fault tolerance, and optimized lock-free data structures. Moseley received the Ph.D. degree in computer science from the University of Colorado at Boulder. Contact him at [email protected].

Parthasarathy Ranganathan is currently a Distinguished Engineer at Google, where he is designing their next-generation systems. His research interests include systems architecture and management, power management, and energy efficiency for servers and datacenters. Ranganathan received the Ph.D. degree in computer engineering from Rice University. He is a Fellow of IEEE and ACM. Contact him at [email protected].
Theme Article: Top Picks
0272-1732 © 2020 IEEE Published by the IEEE Computer Society IEEE Micro
intermediate scale quantum (NISQ) computation. The NISQ regime considers near-term machines
with just tens to hundreds of quantum bits (qubits) and moderate errors.

Given the severe constraints on quantum resources, it is critical to fully optimize the
compilation of a quantum algorithm in order to have successful computation. Prior architectural
research has explored techniques such as mapping, scheduling, and parallelism to extend the
amount of useful computation possible. In this article, we consider another technique: quantum
trits (qutrits).

While quantum computation is typically expressed as a two-level binary abstraction of qubits,
the underlying physics of quantum systems is not intrinsically binary. Whereas classical
computers operate in binary states at the physical level (e.g., clipping above and below a
threshold voltage), quantum computers have natural access to an infinite spectrum of discrete
energy levels. In fact, hardware must actively suppress higher level states in order to achieve
the two-level qubit approximation. Hence, using three-level qutrits is simply a choice of
including an additional discrete energy level, albeit at the cost of more opportunities for
error.

Prior work on qutrits (or more generally, d-level qudits) identified only constant factor gains
from extending beyond qubits. In general, this prior work1 has emphasized the information
compression advantages of qutrits. For example, N qubits can be expressed as $N/\log_2(3)$
qutrits, which leads to $\log_2(3) \approx 1.6$ constant factor improvements in runtimes.

Our approach utilizes qutrits in a novel fashion, essentially using the third state as temporary
storage, but at the cost of higher per-operation error rates. Under this treatment, the runtime
(i.e., circuit depth or critical path) is asymptotically faster, and the reliability of
computations is also improved. Moreover, our approach only applies qutrit operations in an
intermediary stage: The input and output are still qubits, which is important for initialization
and measurement on real devices.2,3

The net result of our work is to extend the frontier of what quantum computers can compute. In
particular, the frontier is defined by the zone in which every machine qubit is a data qubit,
for example, a 100-qubit algorithm running on a 100-qubit machine. This is indicated by the
yellow region in Figure 1. In this frontier zone, we do not have room for nondata workspace
qubits known as ancilla. The lack of ancilla in the frontier zone is a costly constraint that
generally leads to inefficient circuits. For this reason, typical circuits instead operate below
the frontier zone, with many machine qubits used as ancilla. This article demonstrates that
ancilla can be substituted with qutrits, enabling us to extend the ancilla-free frontier zone of
quantum computation.

BACKGROUND

A qubit is the fundamental unit of quantum computation. Unlike their classical counterparts,
which take values of either 0 or 1, qubits may exist in a superposition of the two states. We
designate these two basis states as $|0\rangle$ and $|1\rangle$ and can represent any qubit as
$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with $\|\alpha\|^2 + \|\beta\|^2 = 1$.
$\|\alpha\|^2$ and $\|\beta\|^2$ correspond to the probabilities of measuring $|0\rangle$ and
$|1\rangle$, respectively.

Quantum states can be acted on by quantum gates, which preserve valid probability distributions
that sum to 1 and guarantee reversibility. For example, the X gate transforms a state
$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ to $X|\psi\rangle = \beta|0\rangle + \alpha|1\rangle$.
The X gate is also an example of a classical reversible operation, equivalent to the NOT
operation. In quantum computation, we have a single irreversible operation called measurement
that transforms a quantum state into one of the two basis states with a given probability based
on $\alpha$ and $\beta$.

In order to interact different qubits, two-qubit operations are used. The CNOT gate appears both
in classical reversible computation and in quantum computation. It has a control qubit and a
target qubit. When the control qubit is in the $|1\rangle$ state, the CNOT performs a NOT
operation on the target.
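The X and CNOT actions described in the Background can be checked numerically on amplitude lists. The following is a minimal sketch in plain Python, not code from the article: the function names, the example amplitudes, and the basis ordering $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ (control qubit on the left) are assumptions made for illustration.

```python
import math

def x_gate(state):
    """Apply X to a single-qubit state [alpha, beta], giving [beta, alpha]."""
    alpha, beta = state
    return [beta, alpha]

def probabilities(state):
    """Measurement probability of each basis state: squared amplitude magnitude."""
    return [abs(amplitude) ** 2 for amplitude in state]

def cnot(state):
    """Apply CNOT to a two-qubit state ordered |00>, |01>, |10>, |11>.

    The left qubit is the control (an assumed convention), so the gate
    swaps the amplitudes of |10> and |11> and leaves the others alone.
    """
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

# Hypothetical example state: alpha = sqrt(0.2), beta = sqrt(0.8).
psi = [math.sqrt(0.2), math.sqrt(0.8)]
probs_before = probabilities(psi)          # approximately [0.2, 0.8]
probs_after = probabilities(x_gate(psi))   # approximately [0.8, 0.2]

# Control in |1>: the target is flipped, so |10> becomes |11>.
after_cnot = cnot([0, 0, 1, 0])
```

Because both gates only permute amplitudes, the probabilities still sum to 1 after each step, which is the "valid probability distribution" property mentioned above.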
Table 1. Asymptotic comparison of N-controlled gate decompositions. The total gate count for all circuits scales
linearly (except for Barenco et al.,6 which scales quadratically). Our construction uses qutrits to achieve logarithmic
depth without ancilla. We benchmark our circuit construction against Gidney,4 which is the asymptotically best
ancilla-free qubit circuit.
Ancilla 0 0 N 0 0 0
past research toward building practical quantum computers has focused on qubits.

This article introduces qutrit-based circuits, which are asymptotically better than equivalent
qubit-only circuits. Unlike prior work, we demonstrate a compelling advantage in both runtime
and reliability, thus justifying the use of qutrits.

Generalized Toffoli Gate

The Toffoli gate itself is a simple extension of the CNOT gate, but has two controls instead of
one control. In a Toffoli gate, the NOT is applied if and only if both controls are $|1\rangle$.
Similarly, a Generalized Toffoli gate has N controls and flips the target qubit if and only if
all N control qubits are $|1\rangle$. The Generalized Toffoli gate is an important primitive
used across a wide range of quantum algorithms, and it has been the focus of extensive past
optimization work. Table 1 compares past circuit constructions for the Generalized Toffoli gate
to our construction, which is presented in full in the “Generalized Toffoli Gate” section.

Among prior work, the Gidney,4 He et al.,5 and Barenco et al.6 designs are all qubit-only. The
three circuits have varying tradeoffs. While Gidney and Barenco operate at the ancilla-free
frontier, they have large circuit depths: linear with a large constant for Gidney and quadratic
for Barenco. While the He circuit achieves logarithmic depth, it requires an ancilla for each
data qubit, effectively halving the potential of any given quantum hardware and operating far
below the frontier. Nonetheless, in practice, most circuit implementations use these
linear-ancilla constructions due to their small depths and gate counts.

As in our approach, circuit constructions from Wang and Perkowski7 and Lanyon et al.8 have
attempted to improve the ancilla-free Generalized Toffoli gate by using qudits. Wang and
Perkowski7 achieve a linear circuit depth, but by operating each control as a qutrit. The Lanyon
et al.8 construction, which has been demonstrated experimentally, achieves linear circuit depth
by operating the target as a $d = N$-level qudit.

Our circuit construction, presented in the “Generalized Toffoli Gate” section, has similar
structure to the He design, which can be represented as a binary tree of gates. However, instead
of storing temporary results with a linear number of ancilla qubits, our circuit temporarily
stores information directly in the qutrit $|2\rangle$ state of the controls. Thus, no ancilla
are needed.

In our simulations, we benchmark our circuit construction against the Gidney construction4
because it is the asymptotically best qubit circuit in the ancilla-free frontier zone. We label
these two benchmarks as QUTRIT and QUBIT.

CIRCUIT CONSTRUCTION

In order for quantum circuits to be executable on hardware, they are typically decomposed into
single- and two-qudit gates. Performing efficient low depth and low gate count decompositions is
important in both the NISQ regime and beyond.

Key Intuition

We develop the intuition for how qutrits can be useful by considering the example of
constructing an AND gate. In the framework of quantum
algorithms. Here, we note two important applica-
tions of our circuit decomposition.
Arithmetic Circuits
The Generalized Toffoli is a key subcircuit
in many arithmetic circuits such as constant
addition, modular multiplication, and modular
exponentiation. The circuit for computing a
square root is also improved by a more effi-
cient Generalized Toffoli gate. As shown by
Gokhale,10 the circuit for the initial approximation to $1/\sqrt{x}$ involves a sequence of
standard Toffoli gates terminated by a large
OðnÞ-width Generalized Toffoli gate. Our cir-
cuit construction is directly applicable to this
terminal gate.
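To make the role of the third level concrete, the sketch below simulates a Generalized Toffoli on classical basis states, using the value 2 of each control as temporary storage in place of ancilla. It is an illustration under stated assumptions, not the construction from this article: it chains the controls linearly (so its depth is linear in N, not logarithmic like the tree-structured circuit), and the function name and gate ordering are hypothetical.

```python
from itertools import product

def generalized_toffoli_qutrit_chain(controls, target):
    """Flip `target` iff every control is 1, using level 2 as scratch space.

    `controls` holds values in {0, 1}; they are treated as qutrits
    internally and restored to their inputs before returning.
    Illustrative linear chain, not the log-depth tree construction.
    """
    c = list(controls)
    n = len(c)

    def is_active(i):
        # Control 0 is "active" when it is 1; later controls are active
        # only when they have been elevated to 2 by the previous step.
        return c[i] == (1 if i == 0 else 2)

    # Compute: a +1 (mod 3) on control i, controlled on control i-1 being
    # active, lifts control i to 2 exactly when controls 0..i are all 1.
    for i in range(1, n):
        if is_active(i - 1):
            c[i] = (c[i] + 1) % 3

    # The last control now summarizes the AND of all the controls.
    if is_active(n - 1):
        target ^= 1

    # Uncompute in reverse order so each step sees the same control value
    # it saw during the compute phase.
    for i in reversed(range(1, n)):
        if is_active(i - 1):
            c[i] = (c[i] - 1) % 3

    return c, target

# Exhaustive check against the definition for a small hypothetical size.
N = 3
results = {
    (bits, t): generalized_toffoli_qutrit_chain(bits, t)
    for bits in product([0, 1], repeat=N)
    for t in (0, 1)
}
```

Every entry of `results` leaves the controls unchanged and flips the target only for the all-ones input, with no extra workspace beyond the controls themselves.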
Figure 6. Circuit simulation results for all possible pairs of circuit constructions and noise models. Each bar
represents 1000+ trials, so the error bars are all $2\sigma < 0.1\%$. Our QUTRIT construction significantly
outperforms the QUBIT construction.
outperform the ancilla-free QUBIT benchmark (blue bars) in fidelity (success probability) by a
factor of more than 10 000.

For the SC, SC+T1, and SC+GATES noise models, our qutrit constructions achieve between 57–83%
mean fidelity, whereas the ancilla-free qubit constructions all have almost 0% fidelity. Only
the lowest error model, SC+T1+GATES, achieves modest fidelity of 26% for the QUBIT circuit, but
in this regime, the qutrit circuit is close to 100% fidelity.

The trapped ion noise models achieve similar results: the DRESSED_QUTRIT and the BARE_QUTRIT
achieve approximately 95% fidelity via the QUTRIT circuit, whereas the TI_QUBIT noise model has
only 45% fidelity. Between the dressed and bare qutrits, the dressed qutrit exhibits higher
fidelity than the bare qutrit, as expected. Moreover, the dressed qutrit is resilient to leakage
errors, so the simulation results should be viewed as a lower bound on its advantage over the
qubit and bare qutrit.

Our qutrit-assisted Generalized Toffoli gate has already attracted interest from both device
physics and algorithms communities. To this end, major quantum software packages like Cirq are
now compatible with qutrit (and qudit) simulations. We have also been working with hardware
groups to experimentally implement the ideas presented here. One promising direction is to use
OpenPulse, an open standard for pulse-level quantum control, to experimentally demonstrate a
generalized Toffoli gate. We also envision other advantages to higher radix quantum computing.
For example, the information-compression advantage of qudits may be particularly well suited to
the NISQ hardware, where device connectivity, and therefore diameter, is a bottleneck.
Compressing a qubit computation via qudits would allow us to reduce the graph diameter.

The results presented in this article are applicable to quantum computing in the near term on
machines that are expected within the next five years. The net result of this article is to
extend the frontier of what is computable by quantum hardware, and hence to accelerate the
timeline for practical quantum computing. Emphatically, our results are driven by the use of
qutrits for asymptotically faster ancilla-free circuits. Moreover, we also improve linearity
constants by two orders of magnitude. Finally, as verified by our circuit simulator coupled with
realistic noise models, our circuits are more reliable than qubit-only equivalents. In sum,
clever use of qutrits offers a path to more sophisticated quantum computation today, without
needing to wait for better hardware. We are optimistic that continued hardware–software codesign
may further extend the frontier of quantum computers.

ACKNOWLEDGMENTS

We would like to thank Michel Devoret and Steven Girvin for suggesting to investigate
qutrits. We also acknowledge David Schuster for helpful discussion on superconducting qutrits.
This work was supported in part by EPiQC, an NSF Expedition in Computing, under Grant
CCF-1730449/1832377; in part by STAQ under Grant NSF Phy-1818914; and in part by DOE Grants
DE-SC0020289 and DE-SC0020331. The work of Pranav Gokhale was supported by the Department of
Defense through the National Defense Science and Engineering Graduate Fellowship Program.

REFERENCES

1. A. Pavlidis and E. Floratos, “Arithmetic circuits for multilevel qudits based on quantum
Fourier transform,” 2017, arXiv:1707.08834.
2. J. Randall et al., “Efficient preparation and detection of microwave dressed-state qubits and
qutrits with trapped ions,” Phys. Rev. A, vol. 91, 2015, Art. no. 012322.
3. J. Randall, A. M. Lawrence, S. C. Webster, S. Weidt, N. V. Vitanov, and W. K. Hensinger,
“Generation of high-fidelity quantum control methods for multilevel systems,” Phys. Rev. A,
vol. 98, Oct. 2018, Art. no. 043414.
4. C. Gidney, “Constructing large controlled nots,” 2015.
5. Y. He, M.-X. Luo, E. Zhang, H.-K. Wang, and X.-F. Wang, “Decompositions of n-qubit toffoli
gates with linear circuit complexity,” Int. J. Theor. Phys., vol. 56, pp. 2350–2361, Jul. 2017.
6. A. Barenco et al., “Elementary gates for quantum computation,” Phys. Rev. A, vol. 52,
pp. 3457–3467, Nov. 1995.
7. Y. Wang and M. Perkowski, “Improved complexity of quantum oracles for ternary grover algorithm
for graph coloring,” in Proc. 41st IEEE Int. Symp. Multiple-Valued Logic, May 2011, pp. 294–301.
8. B. P. Lanyon et al., “Simplifying quantum logic using higher-dimensional Hilbert spaces,”
Nature Phys., vol. 5, pp. 134–140, 2009.
9. Y.-M. Di and H.-R. Wei, “Elementary gates for ternary quantum logic circuit,” 2011,
arXiv:1105.5485.
10. P. Gokhale, “Implementation of square root function using quantum circuits,” Undergraduate
Awards, 2014.
11. D. K. Park, F. Petruccione, and J.-K. K. Rhee, “Circuit-based quantum random access memory
for classical data,” Sci. Rep., vol. 9, no. 1, 2019, Art. no. 3949.

Pranav Gokhale is currently working toward the Ph.D. degree with the University of Chicago. His
research focuses on breaking the abstraction barrier between quantum hardware and software. He
is the founder of Super.tech. Contact him at [email protected].

Jonathan M. Baker is currently working toward the Ph.D. degree with the University of Chicago.
His research is primarily focused on vertical integration of the quantum computing
hardware–software stack. Contact him at [email protected].

Casey Duckering is currently working toward the Ph.D. degree with the University of Chicago,
aiming to efficiently bring together quantum algorithms with their physical implementation on
quantum computers. Contact him at [email protected].

Frederic T. Chong is the Seymour Goodman Professor with the Department of Computer Science,
University of Chicago. He is also Lead Principal Investigator for the EPiQC Project (Enabling
Practical-scale Quantum Computing), an NSF Expedition in Computing. Contact him at
[email protected].

Natalie C. Brown is currently working toward the Ph.D. degree with Georgia Institute of
Technology. Her research focuses on leakage error correction and mitigation in topological
surface codes. Contact her at [email protected].

Kenneth R. Brown is an Associate Professor of electrical and computer engineering with Duke
University and the Director of the NSF Software Enabled Architectures for Quantum co-design
(STAQ) project developing applications, software, and hardware for ion trap quantum computers.
Contact him at [email protected].
Theme Article: Top Picks
Architecting Noisy Intermediate-Scale Quantum Computers: A Real-System Study

Prakash Murali, Princeton University
Norbert M. Linke, Joint Quantum Institute, University of Maryland
Margaret Martonosi, Princeton University
Ali Javadi Abhari, IBM T. J. Watson Research Center
Nhung Hong Nguyen, University of Maryland
Cinthia Huerta Alderete, University of Maryland and Instituto Nacional de Astrofísica, Óptica y Electrónica
Digital Object Identifier 10.1109/MM.2020.2985683
Date of publication 6 April 2020; date of current version 22 May 2020.
May/June 2020 Published by the IEEE Computer Society 0272-1732 © 2020 IEEE

QUANTUM COMPUTING (QC) is a fundamentally new model of computation, which exploits quantum
mechanical phenomena to perform computation. QC systems use qubits (quantum bits) to represent
information and gates (quantum instructions) to manipulate quantum information. While the basic
principles of QC have been known since the 1980s, recent hardware progress has ushered in the
era of noisy intermediate-scale quantum (NISQ) devices. These systems represent an important
milestone toward large scale QC, and are expected to scale to 500–1000 qubits in coming years.
In spite of being too error-prone and resource-constrained for well-known applications like
Shor’s factoring, NISQ systems are capable of very powerful computations. Notably, Google
recently demonstrated a classically
intractable computation on an NISQ system with 54 qubits.1

Being early-stage, NISQ devices are highly diverse in terms of hardware and architecture.
Leading QC vendors including IBM, Rigetti, Google, IonQ, and others have adopted very different
approaches for building hardware qubits. To support their qubit choices, vendors have also
chosen different instruction sets and hardware communication topologies. Further, QC systems
also have variance in hardware noise, owing to fundamental challenges in qubit control and
manufacturing. While this diversity itself poses a challenge for efficient and portable
application execution, there is also a huge gap between the QC hardware that is buildable now
and the resource requirements of compelling real-world applications. Many interesting
applications demand large systems with several thousand quantum bits and high-precision
operations, but current hardware has fewer than 100 qubits and error-prone operations. To fully
attain practical and powerful QC, computer architecture techniques and software toolchains must
be employed to narrow the algorithm-to-devices resource gap across a wide range of algorithms
and devices.

To this end, our article2 offers one of the deepest explorations of cross-platform
characteristics in QC systems, presenting a full-stack, benchmark-driven, hardware–software
analysis. Viewing QC through the lens of computer architecture, we evaluate important hardware
design decisions (qubit types, system size, connectivity, noise), the hardware–software
interface (gate set choices), and software optimizations to tackle fundamental design questions:
What instructions should QC systems expose to software? Should instructions be unified in a
device-independent ISA across different qubit types? How do hardware connectivity and noise
characteristics impact benchmark performance? Can hardware limitations be overcome with a
compiler?

To answer these questions, we use real-system measurements to evaluate a suite of QC
applications on seven systems from three leading vendors: IBM, Rigetti, and University of
Maryland. The systems studied represent different points in the design space, with two leading
qubit technologies (superconducting and trapped ion qubits), different connectivity topologies,
programming interfaces, and noise behavior. The diversity of systems studied is important for
understanding which aspects of QC design hold across different design choices and which are more
implementation specific. Our work represents the most comprehensive cross-platform, real-system
measurements of QC prototypes ever performed.

On the other hand, this design space diversity also poses serious challenges for accurate
comparative studies. In particular, our comparisons hinge on developing a toolflow and
evaluation approach common to all platforms, and yet not penalizing any particular platform
while pursuing toolflow generality. Our toolflow, TriQ, is the first top-to-bottom multivendor
QC compiler toolflow. TriQ optimizes high-level language programs for QC hardware by leveraging
deep but parameterized knowledge of the target device characteristics, including the gate set,
connectivity, and noise profile. Importantly, TriQ avoids inefficiencies in vendor toolflows,
offering up to two orders of magnitude higher reliability compared to IBM’s Qiskit3 and
Rigetti’s Quil4 compiler, which are the default toolchains for the respective hardware. TriQ,
therefore, allows us to perform architectural analysis across diverse QC systems using
high-level application performance measurements and is also a common compiler toolflow.

Our experiments with TriQ reveal several architectural insights for QC systems. We quantify the
importance of gate set, ISA, and connectivity choices and offer design recommendations. We also
evaluate the effects of hardware noise on applications and the importance of software
optimizations to mitigate such noise. Our results have also attracted significant academic and
industry attention with vendors including IBM
and Rigetti incorporating our optimizations in their compiler toolflows. In coming years,
hardware and architectural insights from our study are likely to influence QC.

Figure 1. Hardware qubit technology, native gate set, and software-visible gate set in the
systems used in our study. Each qubit technology lends itself to a set of native gates. For
programming, vendors expose these gates in a software-visible interface or construct composite
gates with multiple native gates.

BACKGROUND ON QC

A qubit is the fundamental building block of a QC system. Unlike a classical bit, which is
restricted to be either in the state 0 or 1 at any instant, a qubit can exist in a
superposition state where it is a probabilistic combination of the two basis states. This
property allows an n-qubit QC system to represent $2^n$ basis states simultaneously, unlike
classical registers, which can be in exactly one of the $2^n$ values at any given time. To
manipulate information, QC gates are implemented to operate on one or more qubits, using some
physical interaction such as a microwave or laser pulse. Similar to universal gates in classical
systems, QC computations can be expressed using a small universal set of single-qubit (1Q) and
two-qubit (2Q) gates. In particular, 2Q gates create entanglement, which is a key property
exploited by algorithms. To obtain classical output from the system, qubits are measured (read
out), collapsing the superposition state to either 0 or 1.

QC ARCHITECTURE CHOICES AND TRADEOFFS

NISQ systems have very diverse hardware and architecture. While classical metrics such as
performance (time) and area are important to evaluate these options, a key figure of merit in
the current NISQ regime is the likelihood of correct execution of applications. Owing to
hardware noise, a single execution of an application may be corrupted. Hence, programs are
typically run multiple times and the success rate is measured as the fraction of trials which
yield the correct answer. Toward understanding how system design affects success rate and
performance, we briefly discuss the key design choices here and refer the reader to our original
paper for more details.2

Figure 1 shows the different hardware qubit technologies used in IBM, Rigetti, and UMD systems.
IBM and Rigetti use superconducting qubits, while UMD uses trapped ion qubits. On one hand,
these choices are similar to how classical computers can be realized using vacuum tubes, relay
circuits, or CMOS transistors. On the other hand, qubit technologies are very different and do
not lend themselves to abstraction similar to the ON–OFF switch abstraction in classical
technologies. For example, on IBM’s superconducting qubits, the two-qubit interactions are
achieved using the cross-resonance effect, where one qubit is driven at the resonant frequency
of another qubit using a coupled hardware resonator. In contrast, in UMD’s trapped ion qubits,
two-qubit interactions are achieved using collective motional modes of an ion chain, mediated
through laser pulses.

Owing to these fundamental differences, vendors implement different native gates or
microoperations that are feasible on their platform. Figure 1 shows these native 1Q and 2Q
gates. Even among superconducting qubits, the native interactions may be different. For example,
Rigetti uses the controlled Z operation as the fundamental 2Q operation instead of the
cross-resonance gate in IBM. Using these native gates, vendors choose a software-visible
programming interface which includes either native gates themselves or composite gates which use
multiple native gates. These choices for software-visible gates also differ widely across
vendors.
optimization using these inputs is a distinguishing feature of TriQ that allows it to obtain
high success rates across platforms. As output, TriQ generates optimized code in the
vendor-specified assembly code.

To compile the IR, the first step is to map program qubits onto distinct hardware qubits. For
example, program qubits can be assigned to hardware qubits according to the order they are used
in the program. This policy can result in high communication overhead and poor success rate when
qubits participating in 2Q gates are not mapped close together. If program qubits are mapped
onto unreliable hardware qubits, it can further worsen the success rate. Therefore, TriQ uses a
noise-adaptive mapping strategy which optimizes both communication and reliability. TriQ chooses
a set of qubits that match well with the communication requirements of the application and,
simultaneously, ensures that this set of qubits has low error rates for the instruction mix of
the application. TriQ implements this policy using a satisfiability modulo theory (SMT)
optimization, solved using Microsoft’s Z3 SMT solver.

To flexibly target different devices, we designed the SMT optimization to work with an abstract
representation of the hardware. TriQ preprocesses the target device’s connectivity graph and
gate error rate data and converts them to a reliability matrix representation. For each pair of
qubits, the matrix specifies the reliability of the lowest error rate path for a 2Q gate between
the qubits. When two hardware qubits are far away in the communication topology, the reliability
of the best path will be low. It will also be low if all paths between the two qubits have high
error rate edges. Therefore, using the matrix, TriQ can pick communication- and
reliability-optimized mappings. Since the core functionality of the pass operates using this
matrix abstraction, we can flexibly compute good mappings for any device topology and noise
profile simply by changing compile-time inputs.

Second, TriQ schedules gates in the program in a topologically-sorted order using the IR. This
allows the maximum number of operations to be executed in parallel, reducing the errors due to
qubit decoherence. For devices which do not support full connectivity, TriQ automatically
inserts the necessary communication operations to bring qubits into adjacent positions before
executing a 2Q gate. To improve success rates, TriQ incorporates noise-awareness in this step by
selecting the lowest error rate paths for moving qubits, rather than any shortest distance path.

Third, TriQ translates high-level IR gates into device-specific IR. Using a set of legal code
transformations that are provided as input, TriQ replaces IR gates with equivalent
device-specific gates, e.g., OpenQASM code for IBM systems. During this pass, TriQ also applies
a 1Q gate optimization where continuous sequences of 1Q gates are compressed into shorter
sequences. TriQ exploits knowledge of hardware error rates in this step as well. On all three
vendors, single-qubit rotation gates along the Z-axis of the qubit have no error.6 While
compressing gate sequences, TriQ maximizes the use of these Z rotations, further increasing
success rates.

REAL-SYSTEM ARCHITECTURAL STUDIES USING TRIQ

We performed real-system measurements for a set of 12 benchmarks on 7 QC systems. These
benchmarks include important QC kernels such as the Toffoli gate and quantum Fourier transform

Figure 4. Overview of the TriQ toolflow. Inputs are high-level Scaffold programs and their
inputs, as well as device-specific QC system properties. Output is optimized code in one of
three vendor-specific executable formats.
Figure 5. Success rate for 12 benchmarks on 7 systems. Success rate varies drastically across systems and is
influenced by error rates, qubit connectivity, and application-machine topology match. Benchmarks that are too large to be
mapped onto a machine are marked “X.” This comparison is intended to understand the impact of architectural design
choices such as gate set and connectivity on benchmark performance and is not intended to pick a winning technology,
vendor or implementation. Individual benchmark performance numbers may change over time. These measurements
represent a snapshot of the performance of these systems when we performed the experiments.
operation. To understand architectural choices, we performed multiple experiments with each
benchmark and system, varying the level of optimization and the inputs used for compilation. We
used three main variants of the compiler with increasing levels of optimization for gate
sequences, communication, and noise-adaptivity, and a fourth baseline version with no
optimization. We compared different executables in terms of instruction count and success rate.
Figure 5 shows the measured success rates using TriQ’s full optimizations. The key insights from
our study are summarized next.

Importance of Gate Set Specificity: We studied whether it is beneficial to expose native gates
to software, instead of abstracting them in a device-independent gate set. When TriQ has
information about the native gate set, the gate optimization passes offer significant benefits.
TriQ expresses several program instructions using a small number of native gates, leading to an
average 50% reduction in the instruction count and up to 26% increase in success rate.
Therefore, unlike prior proposals for device-independent ISAs for QC systems,7,8 our results
show that such abstractions are detrimental to high success rates. We recommend that vendors
make the most low-level native gates in their devices software visible. As an analogy to
classical microprocessors, this is similar to making microoperations software visible.9

Importance of Qubit Connectivity: Our work demonstrates that the match between application
communication requirements and device topology crucially impacts success rates. Comparing
near-neighbor versus fully-connected systems (like IBM and UMD systems) shows that machines with
dense qubit connectivity are less sensitive to application characteristics and allow a wider
variety of programs to execute successfully. Compared to a baseline, TriQ’s communication
optimizations offer up to 22X reduction in 2Q gate counts. For certain programs, this means the
difference between a failed execution where noise corrupts the output and a successful execution
where the correct answers dominate. When the architecture does not have full connectivity,
compilers like ours can allow applications to take maximum advantage of the available hardware
resources.

Importance of Noise Adaptivity: Our work shows that the noise variability in QC hardware can be
effectively mitigated by software techniques. By mapping programs onto reliable regions of the
hardware and orchestrating communication along reliable hardware paths, TriQ effectively shields
applications from spatiotemporal noise variations. These optimizations provide further average
success rate gains of 2.8X over gate and communication optimizations, and allow more
applications to execute successfully. Put together, TriQ’s optimizations offer 1.5X to 28X
higher success rates than IBM’s Qiskit, Rigetti’s Quil compiler, and hand optimized code from
UMD. Our work is the first to show that such optimizations are important even on trapped ion
systems, which have less variability. Noise variations are likely in all near-term QC systems in
the next 5 to 10 years. Therefore, compilers like TriQ will be crucial for reliable program
executions.

TriQ’s functionality is portable across diverse platforms while still performing full
top-to-bottom optimizations for device and
application characteristics provided as compile-time inputs. Leveraging microarchitecture details such as native gate sets and noise rates was the key to our improvements. Therefore, QC systems are not yet ready for device-independent abstraction layers that hide and obstruct information flow between hardware and software.
IMPACT OF OUR WORK
Recently, tech news was dominated by discussions of Google’s so-called “quantum supremacy” announcement and reactions from other scientists and QC vendors.1 While QC systems offering high revenue streams (e.g., as cloud accelerators) are still in the future, clearly QC is increasing in importance and has reached an inflection point in terms of engineering achievements in real implementations. This makes our work extremely timely, with high potential for impact. Just this year, several academic and industry vendors have already adjusted their compiler toolflows and aspects of their exposed gate sets in response to our work. Our optimizations are already part of IBM’s Qiskit Terra compiler as of version 0.8 and Rigetti’s Quil compiler version 1.16. TriQ, open sourced at https://github.com/prakashmurali/TriQ, is also the first compiler for trapped ion systems.
Our study features systems with different qubit, noise, and architectural attributes and provides important insights for designing better architecture and hardware. These insights will likely influence future QC ISA design. Although QC applications work with any universal gate set, we demonstrate that shielding the natural gates for a qubit technology by abstracting them into more common gates imposes severe reliability and performance overheads on NISQ systems. Future QC ISAs need to work in tandem with the underlying qubit technology. Our work also underscores the importance of matching the application’s communication requirements and hardware topology by codesigning them. When they are not well-matched, successful executions are unlikely.
Our work also breaks new ground in QC benchmarking by being distinct from the existing practices of measuring isolated hardware characteristics or benchmarking custom-designed applications. On one hand, vendors characterize systems in terms of metrics such as gate error rates and qubit coherence times. These metrics are isolated measurements for each hardware component, and not direct measurements of program behavior. TriQ enables direct and accurate measurements of program behavior across widely divergent QC platforms. In classical computing, this is akin to the difference between knowing characteristics like core counts and clock rate, versus knowing actual benchmark performance. On the other hand, vendors have developed benchmarking applications such as quantum volume. These methods use a family of custom generated circuits to measure hardware quality. Our work does not field a preferred benchmark, but instead relies on a suite of diverse applications to understand the impact of hardware on applications. This is similar to the difference between benchmarking supercomputers with LINPACK or other dedicated algorithms, and measuring the performance of real applications. We believe that this approach of application-based benchmarking will become common practice in QC, much like how benchmark suites such as SPEC are used for classical benchmarking.
Most importantly, our work represents a significant advance on the way to practically viable QC, which requires us to close a five to six order of magnitude gap between algorithm needs and device capabilities. Our work demonstrates methods for achieving up to two orders of magnitude improvements in program success rates and our approaches work well across vendor implementations. In a world where increasing qubit count comes only with great engineering effort, our work offers substantial and orthogonal advances over underlying hardware progress alone.
1. F. Arute et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
2. P. Murali, N. M. Linke, M. Martonosi, A. J. Abhari, N. H. Nguyen, and C. H. Alderete, “Full-stack, real-system quantum computer studies: Architectural comparisons and design insights,” in Proc. 46th Int. Symp. Comput. Archit., 2019, pp. 527–540.
3. IBM, “IBM Qiskit,” 2018, Accessed on: Jan. 1, 2020. [Online]. Available: https://qiskit.org/
4. Rigetti, “QuilC compiler,” 2020, Accessed on: Jan. 1, 2020. [Online]. Available: https://github.com/rigetti/quilc
5. A. J. Abhari et al., “Scaffold: Quantum programming language,” Princeton University, Princeton, NJ, USA, Tech. Rep. TR-934-12, 2012.
6. D. C. McKay, C. J. Wood, S. Sheldon, J. M. Chow, and J. M. Gambetta, “Efficient z gates for quantum computing,” Phys. Rev. A, vol. 96, Aug. 2017, Art. no. 022330.
7. X. Fu et al., “A microarchitecture for a superconducting quantum processor,” IEEE Micro, vol. 38, no. 3, pp. 40–47, May 2018.
8. A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, “Open quantum assembly language,” 2017, arXiv:1707.03429. [Online]. Available: https://arxiv.org/abs/1707.03429
9. M. V. Wilkes, “The best way to design an automatic calculating machine,” in The Early British Computer Conferences. Cambridge, MA, USA: MIT Press, 1989, pp. 182–184.
Prakash Murali is currently a Ph.D. student in the Computer Science Department, Princeton University. His research focuses on accelerating the progress toward practical quantum computation using computer architecture and compilation techniques. Contact him at [email protected].
Margaret Martonosi is currently the Hugh Trumbull Adams ’35 Professor of Computer Science with Princeton University. Her research focuses on computer architecture and hardware–software interface issues in both classical and quantum systems. Martonosi received the Ph.D. degree in electrical engineering from Stanford University. She is a Fellow of IEEE and the Association for Computing Machinery (ACM). Contact her at [email protected].
Ali Javadi Abhari is currently a Research Staff Member with IBM, Armonk, NY, USA, and a Manager of the Quantum Compiler Group. His research interests include quantum computing software, compilation, and architecture. Javadi Abhari received the Ph.D. degree in electrical engineering from Princeton University. Contact him at [email protected].
Nhung Hong Nguyen is currently a Ph.D. student with Linke Lab, University of Maryland. She was a Research Assistant with the Center for Quantum Technology, Singapore, working in satellite quantum key distribution. Her research focuses on digital quantum simulation, algorithms implementation, and error encoding on trapped ions. Nguyen received the B.S. degree in physics from Nanyang Technological University, Singapore, working on surface spectroscopy with neutral atoms. Contact her at [email protected].
Cinthia Huerta Alderete is currently a Ph.D. student with the National Institute of Astrophysics, Optics and Electronics (INAOE), San Andrés Cholula, Mexico, currently on a research stay at the Joint Quantum Institute, University of Maryland. Her research is focused on, but not limited to, the simulation of paraparticle oscillators in a trapped-ion system. Aside from this topic, she has collaborated on a few projects based on the circuit implementation of different phenomena in quantum physics. Contact her at [email protected].
Theme Article: Top Picks
May/June 2020 Published by the IEEE Computer Society 0272-1732 © 2020 IEEE
example, in Spectre V1 (see Figure 1) a branch misprediction enables the attacker to access and leak/transmit arbitrary program data by controlling the out-of-bounds address &array1[off]. We refer to such data, which is brought into the pipeline by a speculative instruction, as secret.
Figure 1. Spectre variant 1.
A secure, but conservative, way to block all speculative execution attacks—regardless of covert channel—is to delay executing all instructions that can access a secret until such instructions become nonspeculative. In nearly all attacks today, this would imply blocking all loads until they are nonspeculative, which would be tantamount to disabling speculative execution.
This article proposes a principled, high-performance mechanism that achieves the same security guarantee as the above conservative scheme. The key idea is that speculative execution is safe unless speculatively accessed data (secrets) reaches a covert channel. In many cases, speculative instructions either do not leak secrets or do not form covert channels, and so can execute freely under speculation. For example, the first load in Spectre V1 forms a covert channel, but it only leaks the attacker-selected address &array1[off]—not the secret data in that address. Likewise, many instructions (e.g., simple arithmetic) do not form covert channels even if their operands are secret values.
This article presents speculative taint tracking (STT), a framework that tracks the flow of speculatively accessed data through in-flight instructions (similar to dynamic information flow tracking/DIFT7) until it is about to reach an instruction that may form a covert channel. STT then delays the forwarding of the data until the instruction becomes nonspeculative or the execution squashes due to mis-speculation. To be secure and efficient, we address two key challenges.
First, we develop an abstraction that indicates how and when instructions can form covert channels, so as to delay data forwarding to the latest safe time.
Second, we identify and develop a microarchitecture to indicate exactly when data should be considered secret, so as to re-enable data forwarding at the earliest safe time.
Challenge #1: New Abstractions for Describing All Microarchitectural Covert Channels
Covert channels come in different shapes and sizes. For example, attackers can monitor how loads interact with the cache,5 the timing of SIMD units,6 execution pipeline port contention,2 branch predictor state,1 and more. To comprehensively block leakage through these different channels, it is necessary to understand their common characteristics.
To address this challenge, this article proposes a new abstraction through which to view covert channels on speculative microarchitectures, discovers new points where instructions can create covert channels, and discovers a new class of covert channels. We find that all covert channels are one of two flavors, which we call explicit and implicit channels (which are related to explicit and implicit information flow8). In an explicit channel, data is directly passed to an instruction whose execution creates operand-dependent hardware resource usage, and that resource usage reveals the data. For example, how a load impacts the cache depends on the load address.5 In an implicit channel, data indirectly influences how (or whether) one or more instructions execute, and these changes in resource usage reveal the data. For example, the instructions executed after a branch reveal the branch predicate.2,6 This article further defines subclasses of the implicit channel, based on when the leakage occurs and based on the nature of the secret-dependent condition that forms the channel.
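The distinction between explicit and implicit channels can be made concrete with a small model. The following Python sketch is only an illustration under stated assumptions: it treats the attacker's observation as the list of cache lines a code fragment touches, and the function names, the 64-byte line size, and the addresses are all hypothetical rather than taken from STT.

```python
LINE = 64  # assumed cache-line size in bytes (illustrative only)

def explicit_channel(secret, table_base=0x1000):
    """Load whose address is computed from the secret.

    The secret is an operand of the load, so the cache line that gets
    touched reveals it directly (an explicit channel).
    """
    address = table_base + secret * LINE
    return [address // LINE]          # observable trace: touched lines

def implicit_channel(secret, taken_addr=0x2000, fallthrough_addr=0x3000):
    """Branch whose predicate is computed from the secret.

    Both load addresses are public; the secret only decides which load
    executes, so it is revealed indirectly (an implicit channel).
    """
    if secret & 1:
        return [taken_addr // LINE]
    return [fallthrough_addr // LINE]

def simple_arithmetic(secret):
    """Arithmetic on the secret with no operand-dependent resource usage
    in this model, so the observable trace is empty."""
    _ = secret + 1
    return []

def distinguishes(instruction, secret_a, secret_b):
    """True if an observer of the trace can tell the two secrets apart."""
    return instruction(secret_a) != instruction(secret_b)

print(distinguishes(explicit_channel, 3, 4))    # True
print(distinguishes(implicit_channel, 2, 3))    # True
print(distinguishes(simple_arithmetic, 3, 4))   # False
```

In this toy model the first two fragments leak because their traces depend on the secret, while the arithmetic fragment can run on secret operands without producing any observable difference.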
Key Advance: Safe Prediction. Through its investigation of implicit channels, this article makes a key advance by showing how to use hardware predictors safely. Spectre attacks were born from attackers mistraining predictors to leak secrets. Through its abstraction for implicit channels, STT enforces a policy that prevents arbitrary predictor mistraining from leaking any secret data over any covert channel. The article shows how this enables existing predictors to stay enabled without leaking secrets, dramatically improving performance. In the future, we expect the idea of safe prediction to enable further innovation, i.e., by enabling the design of new predictors without fear of opening new security holes. Indeed, our follow-on work uses this idea to safely improve the performance of instructions that create explicit channels.11
Challenge #2: Mechanisms to Quickly and Safely Disable Protection
Once we have mechanisms to block secret data from reaching covert channels, the next question is when and how to disable that protection, if speculation turns out to be correct. This is crucial for performance, as delaying data forwarding longer than necessary increases the chance that delayed instructions reach the head of the reorder buffer (ROB) and block retirement.
STT tackles this problem with a safe but aggressive approach, by re-enabling data forwarding as soon as data becomes a function of retired register file state. This represents the earliest safe point, but is nontrivial to implement in hardware. For example, a delayed instruction’s operand(s) may be the result of a complex dependence chain across many control flow and speculative operations. Intuitively, determining that data is a function of nonspeculative information would require retracing a backwards slice of the program’s execution, which is costly to do quickly.
Despite the above challenges, STT proposes a simple hardware mechanism that can disable protection/re-enable forwarding for an arbitrary instruction in a single cycle, using hardware similar to traditional instruction wake-up logic. The key idea is that to determine whether data is a function of retired state, it is sufficient to determine whether the youngest load, whose return value influences the data, has become nonspeculative. Checking this condition is akin to tracking a single extra dependence for each instruction, as opposed to performing complex backwards slice tracking.
Security Guarantees and Formal Analysis
Alongside the main paper, we formally prove that STT enforces a novel form of noninterference3 with respect to speculatively accessed data. In a nutshell, we show that hardware resource usage patterns over time are independent of data that eventually squashes (covering microarchitectural interference- and timing-based attacks). We released a companion technical report12 with detailed formal analysis and a security proof for this property.
Putting It All Together
Putting everything together, STT provides both high security and high performance. It does not require partitioning or flushing microarchitectural resources, and does not require changes to the cache/memory subsystem or the software stack. When evaluated on SPEC06 workloads, STT incurs 8.5% or 14.5% performance overhead (depending on the threat model) relative to an insecure machine.
ATTACKER MODEL AND PROTECTION SCOPE
Attacker Model. STT assumes a powerful adversary that can monitor any microarchitectural covert channel from anywhere in the system, and induce arbitrary speculative execution to access secrets and create covert channels. For example, the attacker can monitor covert channels through the cache/memory system,5 data-dependent arithmetic,4 port contention,2 branch predictors,1 etc.
Scope: Protecting Speculatively Accessed Data. We distinguish attacks based on whether the access instruction is doomed-to-squash (transient) or bound to retire (nontransient). STT’s goal is to block attacks involving doomed-to-squash access instructions, shown in Figure 2. These attacks can access data that a correct (not mis-speculated) execution would never access, which often results in being able to read from any location in memory. Attacks involving bound-to-retire access instructions are out of
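The Challenge #2 observation, that checking the youngest influencing load is enough, can be illustrated with a short Python sketch. It is a simplified model under stated assumptions and not the hardware mechanism: the `Instr` class and function names are hypothetical, instructions are numbered in program order, and a load counts as nonspeculative once its sequence number is below an assumed `first_speculative_seq` threshold (reflecting that instructions become nonspeculative in program order).

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    seq: int                       # program-order position; smaller means older
    is_load: bool = False
    srcs: list = field(default_factory=list)   # producer Instr objects

def influencing_loads(instr):
    """Sequence numbers of all loads in the backward slice of the value
    produced by instr (including instr itself if it is a load)."""
    loads, stack, seen = set(), [instr], set()
    while stack:
        cur = stack.pop()
        if cur.seq in seen:
            continue
        seen.add(cur.seq)
        if cur.is_load:
            loads.add(cur.seq)
        stack.extend(cur.srcs)
    return loads

def safe_by_full_slice(instr, first_speculative_seq):
    """Costly check: every influencing load must be nonspeculative."""
    return all(s < first_speculative_seq for s in influencing_loads(instr))

def safe_by_youngest_load(instr, first_speculative_seq):
    """Cheap check: only the youngest influencing load is compared."""
    loads = influencing_loads(instr)
    return not loads or max(loads) < first_speculative_seq

# Hypothetical dependence chain: two loads feed an add.
ld1 = Instr(seq=1, is_load=True)
ld2 = Instr(seq=3, is_load=True)
add = Instr(seq=4, srcs=[ld1, ld2])

for threshold in range(6):
    assert safe_by_full_slice(add, threshold) == safe_by_youngest_load(add, threshold)
```

Because loads leave speculation in program order, the two checks always agree in this model, which is why a single extra dependence per instruction can stand in for the whole backward slice.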
Figure 4. Rewriting a store-load pair as an implicit branch. implIf reveals a potential covert channel as a function of memory aliasing to the older store. This occurs if the microarchitecture supports store-to-load forwarding or memory-dependence speculation.
Figure 5. Resolution-based implicit channel due to secret-dependent pipeline squashes. When the branch (B) resolves, it leaks the secret based on whether a squash occurs, as this causes the younger load to execute once or twice. There is an analogous case when the (public) predictor state takes the branch.
As shown in Figure 5, if the branch mis-speculates and subsequently squashes, the load may execute either once or twice depending on the value of secret.
Finally, the abstraction applies to a large set of microarchitectural optimizations. For example, the representation of store-to-load forwarding (see Figure 4) also captures the behavior of memory-dependence speculation with a store set predictor. Here, the store set predictor is modeled as a prediction on the implicit branch (implIf in the figure). As we will see, being able to represent different optimizations as predictions on implicit branches will enable STT to apply a uniform mechanism to block leakage through a variety of structures (e.g., branch, store set, etc., predictors).
STT: DESIGN
Framework and Concepts
STT requires that the microarchitect define what instructions write secrets into registers (access instructions, mainly loads), what instructions can form explicit channels (transmitters), and what instructions form implicit channel branch predicates (for both explicit and implicit branches). Finally, the architect must define the Visibility Point, after which speculation is considered safe (e.g., at the point of the oldest unresolved branch, or at the head of the ROB). If the visibility point refers to an instruction older than an access instruction, we call the access instruction unsafe; otherwise it is considered safe.
We provide guidelines for microarchitects on identifying access and transmit instructions. An instruction should be classified as an access instruction if it has the potential to read a secret. Except for loads, there are only a handful of such instructions, which can be identified manually.
An instruction should be classified as a transmit instruction if its execution creates operand-dependent resource usage that can reveal the operand (partially or fully). Identifying implicit branches is similar: the architect must analyze whether the resource usage of some in-flight instruction changes as a function of some other instruction’s operand. This definition can be formalized by analyzing (offline) how information flows in each functional unit at the SRAM-bit and flip-flop levels to determine whether resource usage depends on the input value, in the style of the OISA10 or GLIFT8 formal frameworks. Automatically performing such analysis is important future work.
Taint and Untaint Propagation
Conceptually, in each clock cycle, STT applies the following taint rules to instructions in the ROB:
The output register of an access instruction is tainted if and only if the access instruction is unsafe.
The output register of a non-access instruction is tainted if and only if at least one of its input operands is tainted.
In the implementation, taint propagation is piggybacked on the existing register renaming logic in an out-of-order core. Tainting is therefore fast. In contrast, it is difficult to propagate “untaint” to
all dependencies of an access instruction that becomes safe in a single cycle. We address this with a single-cycle implementation for untaint in the “STT: Implementation” section.
Unlike prior DIFT schemes,7 STT does not require tracking taint in any part of the memory system or across store-to-load forwarding. The reason is that because loads are access instructions, the taint of their output is determined only based on whether they have reached the visibility point. That is, the output of an unsafe load is always tainted.
Blocking Covert Channels
Given STT’s rules for tainting/untainting data and its abstraction for covert channels, STT blocks all covert channels by applying a uniform rule across each type.
Blocking Explicit Channels. STT blocks explicit channels by delaying the execution of any transmit instruction whose operands are tainted until they become untainted. This scheme imposes relatively low overhead because it only delays the execution of transmit instructions if they have tainted operands. For example, a load that only reads a (potential) secret but does not transmit one—such as the load on line 2 in Figure 1—executes without delay. The load on line 3, however, will be delayed and eventually squashed, thereby defeating the attack.
Blocking Implicit Channels. STT blocks implicit channels by enforcing an invariant that the sequence of instructions fetched/executed/squashed never depends on tainted data. That is, STT makes the program counter independent of tainted data. To enforce this invariant efficiently, without needing to delay execution of instructions following a tainted branch, we introduce two general principles to neutralize the sources of implicit channels:
Prediction-based implicit channels are eliminated by preventing tainted data from affecting the state of any predictor structure.
Resolution-based implicit channels are eliminated by delaying the effects of branch resolution until the branch’s predicate becomes untainted.
STT’s principles can be applied to efficiently make any hardware predictor impossible to exploit as a covert channel for leaking speculatively accessed data.
Conceptually, the protection mechanism does not need to reason about whether an implicit channel is caused by an explicit or implicit branch: both types have a predicate and the policy with respect to the predicate is the same in both cases. The implementation, however, must identify the predicate. We illustrate this by showing how the STT microarchitecture handles explicit branches.
Applying Principle #1 (Prediction-Based Channels). STT requires that every frontend predictor structure be updated based only on untainted data. This makes the execution path fetched by the frontend unaffected by the output of unsafe access instructions. STT passes a branch’s resolution results to the direct/indirect branch predictors only after the branch’s predicate and target address become untainted; if the branch gets squashed before this, the predictor will not be updated.
Figure 6(c) demonstrates the effect of STT on a speculative execution of the code snippet in Figure 6(a), in which the branch B0 is mispredicted as taken. No matter how many experiments the attacker runs, the predicted direction of the branch B will not be a function of secret, because the branch predictor is not updated when B resolves. As a result, the execution path does not depend on secret (top versus bottom)—it only depends on the predicted branch direction (left versus right).
Applying Principle #2 (Resolution-Based Channels). STT delays squashing a branch that resolves as mispredicted until the branch’s predicate becomes untainted. As a result, a doomed-to-squash branch with a tainted predicate (such as the branch B in Figure 6(c)) will never be squashed and re-executed, preventing the implicit channel leak discussed in the “Insights From Analysis of Implicit Channels” section. As Figure 6(c) shows, the doomed-to-squash branch B is eventually squashed once an older (mispredicted) branch with an untainted predicate squashes. Thus, the squash does not leak any information about the branch’s resolution. Importantly, it is safe to
resolve a branch as soon as its predicate becomes untainted, even if an older branch with a tainted predicate has not yet resolved.
Figure 6. STT executing the code in (a), which includes an untainted branch B0, an access instruction reading secret, and an implicit channel (due to branch B). (a) Implicit channel formed through the squash/control dependency on B. (b) When earlier branch B0 is predicted correctly. (c) When earlier branch B0 is predicted incorrectly (left: B predicts taken, right: B predicts not taken).
STT only increases the latency of recovering from a tainted branch misprediction. For example, in Figure 6(b), the load does not execute immediately after B resolves. Fortunately, tainted branch mispredictions are only a small fraction of overall branch mispredictions, which are infrequent in the first place because successful speculation requires accurate branch prediction.
Implicit Branches. The STT paper applies the above principles to secure several common microarchitectural optimizations that can be formulated as implicit branches, namely: store-to-load forwarding, memory dependence speculation, and memory consistency speculation. In the process, the paper details various optimizations and cases which arise when dealing with implicit channels. In particular: whether the explicit/implicit branch has a prediction step, can be resolved early or can be optimized in some other way. For example, because store-to-load forwarding can only result in two observable outcomes (issue the load or forward from a prior store), we hide which one occurs by unconditionally accessing the cache.
STT: IMPLEMENTATION
We previously assumed untaint information propagated along data dependencies instantly. This is difficult to implement in hardware because a word of tainted data may be a function of complex dependence chains involving many access instructions.
A tainted register needs to be untainted once all access instructions on which it depends reach the visibility point, i.e., become safe. Our key observation is that it suffices to track only when the youngest access instruction becomes safe, because instructions become nonspeculative in program order in the processor ROB. We call this youngest access instruction the youngest root of taint (YRoT).
Determining the YRoT is done through modifications to rename logic in the processor frontend. Specifically, the YRoT for an instruction X being renamed is given by the max of 1) the YRoT(s) of the instruction(s) producing the arguments for X, if those instructions are not access instructions; or 2) the ROB index of the instruction(s) producing the arguments for X, otherwise. (By convention, we assume the ROB index increases from ROB head to tail.) After rename, the YRoT is stored alongside the instruction in its reservation station and is conceptually an extra dependence for that instruction. When the visibility point changes, its new position is broadcast to in-flight instructions, akin to a normal writeback broadcast, and instructions whose YRoT is less than the visibility point’s new position are allowed to execute (assuming their other dependencies are satisfied). The entire architecture requires modest changes to the frontend rename logic, storage in reservation stations for the YRoT, and logic to compare the YRoT to the visibility point, which is comparable to normal instruction wakeup logic.
Figure 7 shows an example. Assume the Spectre attack model, i.e., that the visibility point will be set to the ROB index of the oldest unresolved branch. The ROB contains three unresolved
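The rename-time YRoT rule and the visibility-point comparison described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the STT hardware design: the `Renamer` class, the `NO_ROOT` sentinel, the register names, and the four-instruction example (modeled loosely on the Spectre V1 pattern) are hypothetical, and ROB wraparound and retirement are not modeled.

```python
NO_ROOT = -1  # assumed sentinel: value depends on no in-flight access instruction

class Renamer:
    """Tracks, per architectural register, the in-flight producer's
    ROB index, whether it is an access instruction, and its YRoT."""

    def __init__(self):
        self.producer = {}  # register name -> (rob_index, is_access, yrot)

    def rename(self, rob_index, dst, srcs, is_access=False):
        """Compute the YRoT of the instruction being renamed.

        For each source operand: if its producer is an access instruction,
        use the producer's ROB index; otherwise use the producer's YRoT.
        The instruction's YRoT is the max over all sources.
        """
        yrot = NO_ROOT
        for reg in srcs:
            if reg in self.producer:
                p_index, p_is_access, p_yrot = self.producer[reg]
                yrot = max(yrot, p_index if p_is_access else p_yrot)
        if dst is not None:
            self.producer[dst] = (rob_index, is_access, yrot)
        return yrot

def operands_untainted(yrot, visibility_point):
    """Wakeup-style comparison: operands are untainted once the YRoT is
    older than the visibility point (ROB index grows from head to tail)."""
    return yrot < visibility_point

# Hypothetical ROB contents, head first.
r = Renamer()
y_branch = r.rename(0, None, ["off"])                   # unresolved bounds check
y_load1 = r.rename(1, "r1", ["off"], is_access=True)    # reads potential secret
y_mul = r.rename(2, "r2", ["r1"])                       # depends on load at index 1
y_load2 = r.rename(3, "r3", ["r2"], is_access=True)     # transmitter with tainted address

vp = 0  # Spectre model: index of the oldest unresolved branch
print(operands_untainted(y_load1, vp), operands_untainted(y_load2, vp))
vp = 4  # branch resolved; visibility point moves past these instructions
print(operands_untainted(y_load2, vp))
```

With the visibility point at index 0, the first load may issue (its address is untainted) while the second load is held back; once the visibility point advances past index 1, the same comparison releases it.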
us to verify the STT machine’s branch predictions and determine whether a prediction leads to mis-speculation.
attack models is viable with STT without sacrificing much performance. Compared to the baseline secure scheme (DelayExecute) described in the introduction, STT reduces overhead by 4.0× in the Spectre model and 10.5× in the Futuristic model, on average.
for very helpful discussions. We would especially like to thank our colleagues at Intel who contrib-
5. P. Kocher et al., “Spectre attacks: Exploiting speculative execution,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 1–19.
6. M. Schwarz, M. Schwarzl, M. Lipp, and D. Gruss,
Chong, and T. Sherwood, “Complete information flow tracking from the gates up,” in Proc. 14th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2009, pp. 109–120.
9. Y. Yarom and K. Falkner, “Flush+Reload: A high resolution, low noise, L3 cache side-channel attack,” in Proc. Usenix Secur. Symp., 2014, pp. 719–732.
high performance computing,” in Proc. 26th Netw. Distrib. Syst. Secur. Symp. [Online]. Available: https://
Artem Khyzha is a Postdoctoral Fellow in the School of Computer Science, Tel Aviv University. His research interests include formal methods for software and hardware systems. Khyzha received a joint Ph.D. degree in computer science from Technical University of Madrid and IMDEA Software Institute. Contact him at artkhyzha@mail.tau.ac.il.
Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana–Champaign (UIUC). He is the Director of the Center for Programmable Extreme Scale Computing and past Director of the Illinois-Intel Parallelism Center. His research interests include computer architecture and parallel processing. Torrellas received the Ph.D. degree from Stanford University. Contact him at [email protected].
Theme Article: Top Picks
MicroScope: Enabling Microarchitectural Replay Attacks
Dimitrios Skarlatos
University of Illinois at Urbana–Champaign
Mengjia Yan
Massachusetts Institute of Technology
Bhargava Gopireddy
Nvidia
Read Sprabery
Google
Josep Torrellas and Christopher W. Fletcher
University of Illinois at Urbana–Champaign
Digital Object Identifier 10.1109/MM.2020.2986204
Date of publication 16 April 2020; date of current version 22 May 2020.
IT IS NOW well understood that modern processors leak secrets over microarchitectural side and covert channels. These channels are seemingly everywhere—from the cache1–3 to the branch predictor4 and other structures5,6—and are capable of leaking program secrets at many points in a program’s execution.
Yet, a fundamental challenge for attackers exploiting these channels is that the channels are notoriously noisy. This means that multiple measurements of the same event often return wildly different values. This occurs, for example, when attempting to glean secret-dependent control flow by measuring port contention inside the pipeline.6 As a result, the attacker requires that the victim program run many times (e.g.,
thousands of times6) to increase the signal-to- secret is leaked. All the while, the victim has
noise ratio. This fact prevents attackers from logically run only once.
learning secrets in many important scenarios, This article introduces microarchitectural
such as when the victim program runs only replay attacks, and provides a complete proto-
once. type tool called MicroScope that runs the
To eliminate this limitation, this work attack on real hardware. We investigate the
introduces microarchitectural replay attacks attack’s ability to leak secrets in straight-line
(MRAs), a new class of attacks that enables an code, branches, and loops. As a proof of con-
attacker to offset the measurement variation cept, we demonstrate how the attack can
for (i.e., to denoise) potentially any microarch- denoise the notoriously noisy side channel of
itectural side channel, even if pipeline port contention in a sin-
the victim code is executed only gle run of the victim. Finally, we
This article introduces
once. The key observation is discuss how changing different
microarchitectural
that, in modern out-of-order parameters in the attack setup
replay attacks, and
speculative cores, a dynamic yields new flavors of the attack—
provides a complete
instruction may be forced to exe- prototype tool called e.g., enabling an attack to theo-
cute multiple times due to pipe- MicroScope that runs retically bias the output of a
line squashes caused by page the attack on real hardware instruction that gener-
faults, exceptions, or other hardware. ates true random numbers.
events. By forcing the squash We released the full Micro-
and reexecution of an instruction Scope framework as a kernel mod-
multiple times, the attacker can repeatedly ule, available at https://github.com/dskarlatos/
measure the execution characteristics of such MicroScope.
instruction. We call this attack an MRA.
This work also describes and implements a
specific family of MRAs that are applicable in BRIEF BACKGROUND
the context of Intel’s Software Guard Extensions Secure enclaves,8 such as Intel’s SGX7 allow
(SGX).7 Specifically, in this environment, the sensitive user-level code to run securely on a
attacker controls the operating system (OS) and, platform alongside an untrusted supervisor
while the attacker cannot see the victim’s data (i.e., an OS and/or hypervisor). Intel’s SGX uses
directly, it controls the victim’s demand paging. the OS for translation lookaside buffer (TLB)
Now, suppose that there is an instruction I that, and page table management. Each page table
based on secret data, forms a noisy side or entry contains a present bit, which identifies if
covert channel. In addition, suppose that the the physical page is present in memory or not.
attacker finds a public-address load L that is If the bit is cleared, then the translation process
older than I and is in the reorder buffer (ROB) at fails and a page fault exception is raised. The
the same time as I. In this case, the attacker can OS is then invoked to handle it. To keep the TLB
arrange for L to page fault after a long page coherent while updating page table entries, the
walk (e.g., by clearing the present bit of the OS can selectively flush TLB entries through the
corresponding page table entry, and evicting INVLPG instruction.
the multilevel page table entries from the
cache). While the page walk is underway, I exe-
SUMMARY OF THE MICROSCOPE
cutes and the attacker observes a noisy sam-
ATTACK
ple. Then, the OS pretends to service the page
MRAs are based on the key observation that
fault but keeps the present bit cleared. As a
modern hardware allows recently executed, but
result, L will go through the page walk again
not retired, instructions to be rolled back and
and I will execute again. This process is
replayed if certain conditions are met. This
repeated many times, causing the replay of I
an arbitrary number of times until the signal-
The name MicroScope comes from the attack’s ability to peer inside nearly
to-noise ratio is reduced enough that the any microarchitectural side channel.
behavior can be exploited to mount a variety of attacks (see the “Generalizing Microarchitectural
Replay Attacks” section).

An MRA has three actors: Replayer, Victim, and Monitor. In a MicroScope attack, which is a type of
MRA, the Replayer is a malicious OS or hypervisor that is responsible for page table management.
The Victim is an application process that executes on some secret data that the attacker wishes to
exfiltrate. The Monitor is a malicious process that performs auxiliary operations, such as causing
contention and monitoring shared resources. Figure 1 shows the timeline of the interleaved
execution of the Replayer, Victim, and Monitor for a MicroScope attack.

Attack Setup
MicroScope is enabled by what we call a Replay Handle. A replay handle can be any memory access
instruction that occurs shortly before one or more security-sensitive instructions in program order.

In MicroScope, the Replayer sets up the attack by locating the page table entries required for
virtual-to-physical translation of the replay handle. Then, it performs the following steps, shown
in timeline 1 of Figure 1. First, it flushes the replay handle data from the cache. Second, it
clears the present bit of the leaf page table entry of the replay handle. Third, it flushes the
translation page table entries from the caches. Finally, it flushes the TLB entry that stores the
virtual-to-physical translation of the replay handle access. Together, these steps will cause the
replay handle to miss in the TLB, induce a hardware page walk to locate the translations, and
eventually suffer a page fault. In the meantime, instructions that are younger than the replay
handle, including the sensitive instruction(s), can execute speculatively but not retire.

Speculative Execution in the Shadow of Page Walks
After the attack is set up, the Replayer allows the Victim to resume execution and issue the replay
handle, as shown in timeline 3 of Figure 1. The replay handle access misses in the L1 TLB, L2 TLB,
and page walk cache (PWC), and initiates a page walk. The hardware page walker fetches the
necessary page table entries sequentially, starting from the page global directory (PGD), then the
page upper directory (PUD), the page middle directory (PMD), and finally the page table entry (PTE).

The Replayer can tune the duration of the speculative execution by choosing whether the Victim’s
page table entries are present in or absent from the cache hierarchy and PWC (shown by the arrows
above timeline 3 of Figure 1). The speculative instructions executing in the shadow of the page
walk may leave some state in the cache subsystem and/or create contention for hardware structures
in the core. This allows the Monitor to perform a noisy measurement of the secret data. At the end
of the page walk, the hardware raises a page fault exception and squashes the speculative
instructions in the pipeline.

The Replayer is then invoked to handle the page fault. The operation is shown in timeline 2 of
Figure 1. The Replayer chooses to keep the present bit cleared. Timeline 4 of Figure 1 shows the
actions of the Victim. In this case, after the Victim resumes and reissues the replay

Figure 1. Timeline of a MicroScope attack. The Replayer is an untrusted OS or hypervisor process
that forces the Victim code to replay, enabling the Monitor to denoise and extract the secret
information.
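
The replay loop described in this section can be illustrated with a toy simulation. The sketch below is not MicroScope code and touches no real page tables. It assumes a hypothetical `PageTableEntry` class, a Gaussian noise model, and made-up latencies (100 versus 110 cycles), purely to show why keeping the present bit cleared lets the Monitor average many samples while the Victim logically runs only once.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class PageTableEntry:
    """Hypothetical stand-in for the leaf PTE of the replay handle."""
    present: bool = True


def noisy_sample(secret_bit: int, rng: random.Random) -> float:
    """One Monitor measurement taken in the shadow of a page walk.

    Assumed model: the transmit instruction adds 10 cycles when the secret
    bit is 1, buried under Gaussian noise with a 40-cycle standard deviation.
    """
    return 100.0 + 10.0 * secret_bit + rng.gauss(0.0, 40.0)


def microscope_replay(secret_bit: int, replays: int, rng: random.Random):
    """Replay the Victim `replays` times; return (samples, victim_retirements)."""
    if replays < 1:
        raise ValueError("replays must be at least 1")
    pte = PageTableEntry(present=False)  # attack setup: clear the present bit
    samples = []
    while not pte.present:
        # The Victim issues the replay handle. During the page walk the
        # transmit instruction executes speculatively and is measured once.
        samples.append(noisy_sample(secret_bit, rng))
        # The walk ends in a page fault and the speculative work is squashed.
        # The Replayer keeps the bit cleared until it has enough samples.
        if len(samples) >= replays:
            pte.present = True
    # With the present bit set, the Victim re-executes and retires exactly once.
    victim_retirements = 1
    return samples, victim_retirements


def infer_bit(samples, threshold: float = 105.0) -> int:
    """Denoise by averaging, then compare against the midpoint of the two means."""
    return 1 if mean(samples) > threshold else 0
```

With a 40-cycle noise level a single sample says little about a 10-cycle difference, but the mean of a few thousand replayed samples separates the two cases reliably in this model. The numbers are illustrative and are not measurements from the article.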
where the Monitor executes in parallel with the Victim’s speculative execution.

Attack Summary
The attack has the following six steps.

1) The Replayer identifies a replay handle and prepares the attack, e.g., by priming
microarchitectural state.
2) When the Victim executes the replay handle, it suffers a TLB miss followed by a page walk. The
time taken by this step can be over 1000 cycles, and can be tuned as per the requirements of the
attack.
3) In the shadow of the page walk and until the page fault is serviced, the Victim continues to
execute speculatively past the replay handle into the sensitive region, potentially until the ROB
is full.
4) The Monitor can cause and measure contention on shared hardware resources during the Victim’s
speculative execution, or inspect the hardware state left by the Victim’s speculative execution.
5) When the replay handle triggers a page fault, the Replayer gains control and can optionally
leave the present bit cleared in the PTE. This will induce another replay cycle that the Monitor
can leverage to collect more information. Before the replay, the attacker

SIMPLE ATTACK EXAMPLES
Figure 2 shows several examples of code that present opportunities for MicroScope attacks. Each
example showcases a different use case.

Single-Secret Attack
Figure 2(a) shows a simple code that has a single secret. Line 2 accesses a public address (i.e.,
known to the OS). This access is the replay handle. After a few other instructions, sensitive code
at Line 4 processes some secret data. We call this computation the transmit computation of the
Victim, using terminology from prior work [9]. The transmit computation may leave some state in the
cache or may use specific functional units that create observable contention. The goal of the
adversary is to extract the secret information. The adversary can obtain it by using MicroScope to
repeatedly perform steps (2)–(5) from the “Attack Summary” section.

Loop-Secret Attack
We now consider the scenario where we want to monitor a given instruction in different iterations
of a loop. We call this case Loop Secret, and show an example in Figure 2(b). In the code, the loop
body has a replay handle and a transmit operation. In each iteration, the transmit operation
accesses a different secret. The adversary wants to obtain the secrets of all the iterations. The
challenging case is when the replay handle maps to the same physical data page in all the
iterations. This scenario highlights a common problem in side channel attacks: secret[i] and
secret[i+1] may induce similar effects, making it hard to disambiguate between the two. For
example, both secrets may be colocated in the same cache line, or induce similar pressure on the
execution units. This fact severely impedes the ability to distinguish the two accesses.

MicroScope addresses this challenge by using a second memory instruction to move between the replay
handles in different iterations. This second instruction is located after the transmit instruction
in program order, and we call it the Pivot instruction. For example, in Figure 2(b), the
instruction at Line 6 can act as the pivot.

MicroScope uses the pivot as follows. After the adversary infers secret[i] and is ready to proceed
to extract secret[i+1], the adversary performs one additional action during step 6 in the “Attack
Summary” section. Specifically, after setting the present bit in the PTE for the replay handle, it
clears the present bit in the PTE for the pivot, and resumes the Victim’s execution. As a result,
all the Victim instructions before the pivot are retired, and a new page fault is incurred for the
pivot.

When the Replayer is invoked to handle the pivot’s page fault, it sets the present bit for the
pivot and clears the present bit for the replay handle. When the Victim resumes execution, it
retires all the instructions of the current iteration and proceeds to the next iteration, suffering
a page fault in the replay handle. Steps 2–5 repeat, enabling the monitoring of secret[i+1]. The
process is repeated for all the iterations.

Control Flow Secret Attack
A final scenario that is commonly exploited using side channels is a secret-dependent branch
condition. We call this case Control Flow Secret, and show an example in Figure 2(c). In the code,
the direction of the branch is determined by a secret, which the adversary wants to extract.

As shown in the figure, the adversary uses a replay handle before the branch, and a transmit
operation in both paths out of the branch. The adversary can extract the direction taken by the
branch using at least two different types of side channels.

First, if lines 3 and 5 in Figure 2(c) access different cache lines, then the Monitor can perform a
cache-based side-channel attack to identify the cache line accessed, and deduce the branch
direction. A second case is when the two paths out of the branch perform different computations. In
this scenario, the Monitor can apply pressure on the functional units and, by monitoring
contention, deduce the operation that the code performs and, hence, the branch direction.

EVALUATION
We validated MRAs and MicroScope by denoising a notoriously noisy side channel: execution unit port
contention [6]. For this attack, we assume the SGX threat model. We use victim code similar to the
one in Figure 2(c), where one side of the branch executes two division operations, and the other
side executes two multiplication operations. The Replayer forces the replay of the code.
Concurrently, the Monitor executes a loop with one division operation in each iteration. We measure
the time taken by each iteration of the Monitor loop. If the Victim executes the code with the two
multiplications, the Monitor instructions execute fast and, hence, no contention is measured.
Figure 3(a) shows the latency of each iteration of the Monitor. We see that all but four of the
samples take less than 120 cycles, which we identify to be the

Figure 3. Latencies measured by performing a port contention attack. (a) Victim executes two
multiply operations. (b) Victim executes two division operations.
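
One way to turn the Monitor’s per-iteration latencies into a guess about the branch direction is a simple threshold test. The sketch below is an illustration only. The function names and sample values are hypothetical, the 120-cycle figure is taken from the text but its use as a contention threshold is an assumption, and the `max_outliers` tolerance of four merely mirrors the "all but four" figure quoted above.

```python
from typing import Sequence

# Value quoted in the text; treating it as the contention threshold is an assumption.
CONTENTION_THRESHOLD_CYCLES = 120


def count_contended(latencies: Sequence[int],
                    threshold: int = CONTENTION_THRESHOLD_CYCLES) -> int:
    """Count Monitor loop iterations that took at least `threshold` cycles."""
    return sum(1 for cycles in latencies if cycles >= threshold)


def infer_victim_operation(latencies: Sequence[int],
                           threshold: int = CONTENTION_THRESHOLD_CYCLES,
                           max_outliers: int = 4) -> str:
    """Guess which branch path the Victim replayed.

    The Monitor runs divisions, so it is slowed down only when the Victim
    also uses the divider. A handful of slow samples is tolerated as noise.
    """
    contended = count_contended(latencies, threshold)
    return "division" if contended > max_outliers else "multiplication"
```

For example, a trace with 996 samples at 95 cycles and 4 samples at 130 cycles would be classified as the multiplication path, while a trace in which hundreds of samples exceed the threshold would be classified as the division path.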
Page fault-oriented defense mechanisms could be effective to defeat MicroScope. Unfortunately,
solutions that rely on Intel TSX are not sufficient, since TSX itself creates a new mechanism with
which to create replays, through transaction aborts. Thus, we believe further research is needed
before applying either of the aforementioned defenses to any variant of MRA.

MRAs as a Speculative Defense Mechanism
While MicroScope is presented as an attack, its operation can be used to improve security defenses.
This is because it provides a window into what speculative execution attacks, such as Spectre and
Meltdown, can do. For example, a Spectre attack on a given branch cannot affect subsequent
instructions that are more than an ROB-long distance away from the branch dynamically. MicroScope
can provide this information, and thus can determine when an instruction needs protection.

Furthermore, MicroScope can be used to perform black-box analysis of microarchitectural structures,
such as the ROB, load store queue (LSQ), and others. Controlled fine-grain microarchitectural
replay capabilities can enable the reverse engineering of hardware structures. This approach can
reveal timing, number of ports, and interconnect information and, more importantly, uncover
previously unknown behavior of hardware units under speculative execution. Such information is not
only useful for discovering previously unknown vulnerabilities, but can further provide a
foundation for defense mechanisms.

Parallel Application Debugging
The mechanism that enables our current MicroScope prototype, namely the capture and reexecution of
an ROB-sized set of instructions, is a very useful primitive in software development and debugging.
Capturing these instructions can provide insights into the program state in a way that no current
tool can. Replaying these instructions deterministically may provide a way to debug
hard-to-reproduce software bugs, such as data races. Small enhancements may enable single-stepping
the code, or the ability to change the direction of branches on the fly. Overall, with a good
interface, MicroScope may become a unique debugging tool for sequential and parallel code.

REFERENCES
1. F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache side-channel attacks are
practical,” in Proc. IEEE Symp. Secur. Privacy, May 2015, pp. 605–622.
2. Y. Yarom, D. Genkin, and N. Heninger, “CacheBleed: A timing attack on OpenSSL constant time
RSA,” Int. Conf. Cryptographic Hardware Embedded Syst., 2016.
3. Y. Yarom and K. Falkner, “Flush+Reload: A high resolution, low noise, L3 cache side-channel
attack,” in Proc. USENIX Secur. Symp., 2014.
4. D. Evtyushkin, R. Riley, N. Abu-Ghazaleh, and D. Ponomarev, “Branchscope: A new side-channel
attack on directional branch predictor,” in Proc. Int. Conf. Archit. Support Program. Lang. Oper.
Syst., 2018, pp. 693–707.
5. M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher, R. Campbell, and J. Torrellas, “Attack
directories, not caches: Side channel attacks in a non-inclusive world,” in Proc. IEEE Symp. Secur.
Privacy, vol. 1, pp. 56–72, 2019, doi: 10.1109/SP.2019.00004.
6. A. C. Aldaya, B. B. Brumley, S. U. Hassan, C. P. García, and N. Tuveri, “Port contention for fun
and profit,” IEEE Symp. Secur. Privacy, 2019, pp. 870–887.
7. Intel, “Intel software guard extensions programming reference,” 2013. [Online]. Available:
https://software.intel.com/sites/default/files/329298-001.pdf
8. P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia, “A formal foundation for
secure remote execution of enclaves,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017,
pp. 2435–2450.
9. V. Kiriansky, I. A. Lebedev, S. P. Amarasinghe, S. Devadas, and J. Emer, “DAWG: A defense
against cache timing attacks in speculative execution processors,” in Proc. 51st Annu. IEEE/ACM
Int. Symp. Microarchit., 2018, pp. 974–987.
10. OpenSSL, “Open source cryptography and SSL/TLS toolkit,” 2019. [Online]. Available:
www.openssl.org
11. C. Canella et al., “A systematic evaluation of transient execution attacks and defenses,” in
Proc. 28th USENIX Secur. Symp., 2019, pp. 249–266.
12. A. Nazari, N. Sehatbakhsh, M. Alam, A. Zajic, and M. Prvulovic, “EDDIE: EM-based detection of
deviations in program execution,” in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit., 2017,
pp. 333–346.

Dimitrios Skarlatos is a Ph.D. candidate at the University of Illinois at Urbana–Champaign. His
research lies at the intersection of computer architecture, security, and operating systems. He
builds practical solutions that improve the performance and bolster (or sometimes break) the
security guarantees of computing systems. Contact him at [email protected].

Mengjia Yan is an Assistant Professor in the Electrical Engineering and Computer Science
Department, Massachusetts Institute of Technology. Her research interest lies in the areas of
computer architecture and hardware security, with a focus on side channel attacks and defenses. Yan
received the Ph.D. degree from the University of Illinois at Urbana–Champaign (UIUC). Contact her
at [email protected].

Bhargava Gopireddy is a Senior Architect at Nvidia, where he works on energy-efficient GPU
architectures. His research interests include energy efficient many-core architectures and
architectural support for operating systems/security. Gopireddy received the Ph.D. degree in
computer science from the University of Illinois at Urbana–Champaign in 2018. Contact him at
[email protected].

Read Sprabery completed his Ph.D. degree with a focus on cloud security in 2018 at the University
of Illinois at Urbana–Champaign and now works as a security researcher at Google. Contact him at
[email protected].

Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at
Urbana–Champaign (UIUC). He is the Director of the Center for Programmable Extreme Scale Computing
and past Director of the Illinois-Intel Parallelism Center. His research interests include computer
architecture and parallel processing. Torrellas received the Ph.D. degree from Stanford University.
Contact him at [email protected].

Christopher W. Fletcher is an Assistant Professor in computer science at the University of Illinois
at Urbana–Champaign. He has interests ranging from computer architecture to security to
high-performance computing (ranging from theory to practice, algorithm to software to hardware).
Fletcher received the Ph.D. degree from Massachusetts Institute of Technology in 2016. Contact him
at [email protected].
Theme Article: Top Picks
Creating Foundations for Secure Microarchitectures With Data-Oblivious ISA Extensions
Jiyong Yu, University of Illinois at Urbana–Champaign
Lucas Hsiung, SciFive
Mohamad El Hajj and Christopher W. Fletcher, University of Illinois at Urbana–Champaign
May/June 2020 Published by the IEEE Computer Society 0272-1732 © 2020 IEEE
A, arguably the, central problem in secure computer architecture today is how to reason about
security amid the sea of different microarchitectural side channel attacks. The prevailing approach
to stop these attacks is to block leakage stemming from one hardware structure at a time. For
example, by partitioning or randomizing the cache layout, we block (or at least impede) cache
timing attacks. Yet, many hardware structures have been shown to leak secrets, from the cache to
the branch predictors [5], speculative execution [8], port contention [1], arithmetic unit timing
[2], etc. Given the many avenues to leak a secret, it is paramount to explore holistic defenses
that provide a basis to block leakage through all hardware structures.

In this direction, the article proposes ISA design principles for what we call data-oblivious ISAs
(OISAs). The key idea with an OISA is to explicitly but abstractly specify security policy, so that
the policy can be decoupled from the microarchitecture and even the threat model. Analogous to a
traditional ISA, this enables an OISA to serve as a portable security-centric abstraction for
software while enabling security-aware implementation and optimization flexibility for hardware.

The OISA proposed in the article annotates what data is confidential and what instruction operands
are safe. Inspired by information flow policies (in particular, the classic policy High ↛ Low), the
hardware dynamically enforces that confidential data is never passed to unsafe operands, i.e.,
Confidential data ↛ Unsafe operands. Informally, “safe” in the article means “does not create a
microarchitectural side channel as a function of the operand” (we also provide formal definitions),
but other notions of safety can be retrofitted into the implementation without changing the OISA or
the programs that sit on top of it.

OISAs enable high security, portability, and efficiency. Consider a simple example OISA
instruction: a load with a safe address operand. Security-wise and portability-wise, the OISA
guarantees that when the load executes, the address will not leak through microarchitectural side
channels. Depending on the microarchitecture, this may require closing side channels through the
cache, translation lookaside buffer (TLB), etc. That is, security is not tied to closing a specific
side channel and the programmer works with a simple, portable guarantee across microarchitectures.
On the efficiency side, each microarchitecture can choose how to implement the safe load operation
in whatever way maximizes performance while preserving security (e.g., by microcoding the load into
simpler safe operations [2], or using hardware partitioning [11], or using cryptographic
techniques [9]).

Safe loads are just one example. More generally, deciding which instruction operands to designate
as safe opens a new, rich ISA design space which trades off performance and hardware complexity.

Beyond formulating design principles for OISAs, the article proposes a concrete OISA extension
built on top of RISC-V, implements (and open sources) that OISA extension on the BOOM out-of-order
(OoO) speculative RISC-V core [3], and provides a formal analysis showing how the OISA provides a
basis to achieve noninterference (“zero privacy leakage”) on an abstract OoO speculative machine.
Crucially, the security analysis and principles are robust to modern attacks. Case in point, the
article’s formal analysis shows how the OISA soundly defeats speculative execution attacks (such as
Spectre [8]) without introducing special case reasoning.

To our knowledge, this is the first proposal that provides a basis to block all traditional side
channel and speculative execution attacks on commercial-class microarchitectures.

MOTIVATION: SECURE AND EFFICIENT DATA-OBLIVIOUS PROGRAMMING
The OISA project came about by asking the following question: Is it possible today to write
microarchitectural side channel-free programs on modern microarchitectures?

The answer is no. Consider the most conservative approach used by practitioners, called
data-oblivious programming. (Data-oblivious programming goes by several other names, e.g.,
“constant-time programming” and “programming in the circuit abstraction,” depending on the
community.) In a nutshell, a data-oblivious program is one whose hardware resource usage is
independent of the program’s inputs. To write such programs, the guidelines are to use only simple
instructions, or otherwise ensure that complex instructions do not receive Confidential data as
operands. For example, simple bitwise math is allowed, but memory operations/branches with
Confidential data as addresses/predicates are not (out of fear of, e.g., cache-based/control
flow-related side channels).

Despite being extremely conservative, the abovementioned guidelines fail in light of ISA-invisible
microarchitecture-specific optimizations. For example, on one microarchitecture, a simple integer
addition might be safe (e.g., implemented as a single-cycle operation whose timing is independent
of its inputs) while on another it might be unsafe (e.g., implemented as a bit-serial operation
that skips runs of zeros to save time). The article describes 11 such optimizations, which have
been proposed in the literature or are otherwise known to be implemented already, that break
data-oblivious program security. These include data-in-use optimizations (such as data-dependent
arithmetic) and data-at-rest optimizations (such as cache compression).

In particular, the article points out for the first time that speculative execution breaks
data-oblivious program security, by steering execution so that Confidential data is consumed by an
instruction whose execution can leak privacy. This is nontrivial to see for realistic programs,
given the conservative guidelines used to write data-oblivious code. For example, consider
data-oblivious decryption:

    for (i = 0; i < NUM_ROUNDS; i++)
        state = OblDecryptRound(state, rkey[i]);  /* data-oblivious round logic */
    leak(state);  /* proxy for an instruction that reveals its argument */

That is, perform a fixed number of decryption rounds, where each round works on a part of the
secret key (rkey) and incrementally updates the round state (state). Here, we assume that
OblDecryptRound, the round logic, is data oblivious. leak() is a proxy for an instruction that
reveals its argument over a microarchitectural side channel.

This program is legal data-oblivious code: The branch outcome in each iteration is public
information, the round logic is data oblivious, and only the plaintext is meant to be revealed
after decryption is complete. Yet privacy still leaks, because benign branch mispredictions can
cause the loop to exit early. In this example, an early mispredict of “not taken” speculatively
executes leak(state) on a partially decrypted state. This allows the attacker to see state before
all rounds complete, which allows it to perform cryptanalysis and recover the secret key rkey.

Core Issue: No Abstraction for Security
To summarize, data-oblivious programming today is insecure and slow. It is insecure because of
ISA-invisible microarchitecture-specific optimizations. It is slow because, out of fear of leaking
privacy, programmers are forced into using only the simplest of instructions.

The article sets out to address these issues by introducing new ISA-level abstractions for
reasoning about security and enabling higher performance. A new ISA abstraction addresses the
security problem by defining how instructions leak privacy across all compliant
microarchitectures. It further enables higher performance by allowing data-oblivious programs to
take advantage of higher performance instructions, as long as those instructions are deemed safe by
the ISA, and gives microarchitects the ability to optimize those instructions subject to the
ISA-prescribed security policies.

FORMAL DEFINITIONS FOR MICROARCHITECTURAL SIDE CHANNELS
To start, the article develops a security definition for microarchitectural side channel-free
execution. There are two challenges. First, how to write the definition to account for any possible
microarchitectural side channel. Second, how to write the definition so that it sheds insight on
which instructions are “safe” from a microarchitectural side channel perspective.

To define privacy, we adopt a trace-based indistinguishability style definition inspired by the
oblivious RAM (ORAM) [7] literature. We
consider a program $P$ which takes public data $x$ and confidential data $y$ as input. That
program’s execution trace, on a microarchitecture mArch, i.e., “all the atoms in the universe that
are perturbed as a result of running $P(x, y)$ on mArch,” is denoted $\mathrm{mArch}(P(x, y))$. The
subset of this trace that the attacker can see (called the view) is denoted
$\mathrm{View}(\mathrm{mArch}(P(x, y)))$. For privacy, we require that the information in the View
does not depend on confidential information, i.e., that

$$\mathrm{View}(\mathrm{mArch}(P(x, y))) \simeq \mathrm{View}(\mathrm{mArch}(P(x, y')))$$

for all confidential data $y$ and $y'$. In this setting, $\simeq$ informally means “equal, given
the capabilities of any computationally bounded adversary.” For example, in ORAM schemes the view
is the “memory access pattern” and ORAM seeks to make the memory access pattern independent of
confidential data.

Next, we must define a view that captures any possible microarchitectural side channel that an
arbitrary software-based attacker can monitor. This is nontrivial as the attacker can monitor many
aspects of the program’s execution: for example, its execution time, use of the cache, arithmetic
units, etc. The article makes a key observation that all of these leakages can be modeled as
confidential data-dependent changes in the program’s hardware resource usage over time. For
instance, both arithmetic units and cache sets are hardware resources and the fact that they are
used at confidential data-dependent times is the crux of the attacks.

Then, the question is how to determine whether a hardware resource is currently being “used” by a
program. (Note that whether a hardware resource is being “used” is independent of the logic values
currently stored in that structure.) For this, we rely on an explicit gate-level information flow
abstraction similar to GLIFT [10]. Figure 1 shows an example using an arithmetic logic unit (ALU)
with operand-independent and then operand-dependent timing.

Figure 1. Changes in resource usage, as a function of data, create microarchitectural side
channels. If a latch is shaded in a given clock cycle, then it means that there is (explicit)
information flow from the operands to that latch in that cycle. Assume operands A and B are two
sets of distinct data values, meant to induce different ALU timings.

First assume a single-cycle ALU (see Figure 1, Case 1). Suppose the input arrives and is stored in
the input latches at the rising edge of cycle 1. Using terminology from information flow, we say
the input latch is tainted in cycle 1. Now, regardless of the logic values of the input, the same
latches are tainted in each cycle thereafter. That is, the output latches are tainted in cycle 2,
etc. Because which latches are tainted when is independent of the operands, we say the single-cycle
ALU does not form a microarchitectural side channel.

Next assume an ALU with operand-dependent timing (see Figure 1, Case 2). For example, a multiply
operation that takes one or two cycles, depending on whether an operand is 0. In this case,
depending on the input, the output latch is either tainted in cycle 2 or cycle 3. Because which
latches are tainted when is dependent on the operands, we say this ALU can form a
microarchitectural side channel.

Putting everything together, we model the processor as a state machine composed of combinational
logic and latches. (Without loss of generality, we treat any state element, such as a flip-flop or
SRAM cell, as a latch.) The subset of latches that store the Confidential input are denoted tainted
at the start. Then, the View outputs a trace that indicates which subset of latches are tainted in
each cycle, that is, hardware resource usage as a function of time. If the microarchitecture
ensures that the View is independent of Confidential data, the microarchitecture does not leak
privacy. Conversely, if the definition is not satisfied, we can pinpoint which
instruction caused the problem by looking at where the Views diverged. (To note, the article
defines taint propagation in a nonstandard way to model only explicit information flows. This
prevents taint explosion, which would render the definition not useful. Implicit information flows
are modeled by quantifying over all $y'$.)

PRINCIPLES OF OISA DESIGN
The design principles for OISAs are twofold. First, the OISA should expose security guarantees in a
microarchitecture-independent way. That is, programs written using an OISA should maintain the same
security guarantees across all OISA-enabled microarchitectures. Second, OISAs should not preclude
modern hardware performance optimizations, except when those optimizations have a chance to leak
privacy.

To address these goals, the OISA abstraction proposed in the article has two parts. First, the OISA
labels data to be confidential/public, to capture whether that data is a function of user secrets
(i.e., the sensitive program inputs from

Figure 2. Protection policies, checked before each instruction executes.

(Confidential → Safe) When Confidential data is sent to a safe operand: The hardware designer must
add mechanisms to enforce the security definition given that instruction’s execution (see the
“Formal Definitions for Microarchitectural Side Channels” section), for a specified view. For
example, by disabling performance optimizations, scrubbing side effects and masking exceptions that
occur as a function of confidential operands.

(Public → Safe/Unsafe) When public data is sent to safe or unsafe operands, no special treatment is
needed and execution can proceed without protection.

Despite these rules’ simplicity, they provide both security and efficiency benefits. As we will see
in the “Formal Analysis,” they provide a uniform handling for both traditional- and
speculative-execution-based attacks [8]. Case in point, the only mention of speculation is a detail
in the rule for Confidential ↛ Unsafe, where we say such an information flow delays the
instruction’s execution until it is nonspeculative. This removes false-positive violations due to
benign misspeculations. At the same time, the rules enable blocking attacks with low overhead. Case
in point, the rules encode some intuitive optimizations such as “Public data does not need
protection” and “it is safe to compute on confidential data with safe instructions.” The only
situation where instruction execution is impeded is if confidential data is consumed by an unsafe
operand.†
the “Formal Definitions for Microarchitectural
Key Idea: Abstract security policies facilitate pro-
Side Channels” section). Second, the OISA speci-
gramming simplicity, implementation flexibility,
fies, for each instruction, whether each instruc-
and performance optimizations. Specifying instruc-
tion operand is safe or unsafe.
tion operand security policy abstractly, i.e., as
Finally, compliant microarchitectures must
safe/unsafe, provides significant flexibility to both
monitor and take different actions based on what
the ISA and hardware designer while simplifying
data is consumed by what instruction operands
programmer-level reasoning about security. At
at runtime. Specifically, hardware must enforce
the ISA level, an ISA designer can decide which
the following rules (shown in Figure 2).
instructions are sufficiently important to warrant
(Confidential Z Unsafe) When confidential safe operands. These choices should be made
data is presented to an unsafe operand: The carefully: On one hand, safe operands impose a
hardware must delay that instruction’s exe- burden on hardware designers as the processor
cution until it is nonspeculative. If the rule must support mechanisms to uphold security viz.,
still applies when the instruction is nonspec- the “Formal Definitions for Microarchitectural
ulative, the program terminates with a fault
(as continuing would constitute an informa- y
This principle directly inspired our follow-on work to block, specifically,
tion leak). speculative execution attacks.12
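The three protection rules of Figure 2 amount to a small decision table. The sketch below is one hypothetical way to write that table in Python. The names `Label`, `Operand`, `Action`, `check_operand`, and `check_instruction` are illustrative assumptions and come neither from the article nor from the BOOM prototype. Combining several operands by taking the most restrictive action is also an assumption made here for illustration.

```python
from enum import Enum


class Label(Enum):
    PUBLIC = "public"
    CONFIDENTIAL = "confidential"


class Operand(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"


class Action(Enum):
    PROCEED = "proceed"   # no special treatment needed
    PROTECT = "protect"   # execute with data-independent resource usage
    DELAY = "delay"       # wait until the instruction is nonspeculative
    FAULT = "fault"       # terminate the program


def check_operand(label, operand, speculative):
    """Apply the rule that matches one (data label, operand kind) pair."""
    if label is Label.PUBLIC:
        return Action.PROCEED          # Public -> Safe/Unsafe
    if operand is Operand.SAFE:
        return Action.PROTECT          # Confidential -> Safe
    # Confidential data reaching an unsafe operand:
    # delay while speculative, fault once nonspeculative.
    return Action.DELAY if speculative else Action.FAULT


def check_instruction(operands, speculative):
    """Assumed policy: the most restrictive per-operand action wins."""
    order = [Action.PROCEED, Action.PROTECT, Action.DELAY, Action.FAULT]
    actions = [check_operand(label, kind, speculative) for label, kind in operands]
    return max(actions, key=order.index, default=Action.PROCEED)
```

For example, `check_operand(Label.CONFIDENTIAL, Operand.UNSAFE, speculative=True)` yields `Action.DELAY`, while the same call with `speculative=False` yields `Action.FAULT`. This matches the statement that execution is impeded only when confidential data is consumed by an unsafe operand.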
bitonic sort. On the other hand, if sort is specified as a single safe instruction in the OISA, an implementation based on hardware partitioning can achieve O(n log n) time if implemented as a constant-time merge sort.
HARDWARE PROTOTYPE ON RISC-V BOOM
We prototype all hardware changes needed to support our OISA on top of the RISC-V BOOM processor (for “Berkeley OoO Machine”).3 BOOM is the most sophisticated open RISC-V processor, featuring modern performance optimizations such as speculative and OoO execution, and is similar to commercial machines that run data-oblivious code today.
Microarchitectural changes to support the OISA are shown in Figure 4. The main changes are logic at instruction issue/execute to enforce the rules from the “Principles of OISA Design” section, storage/logic to implement the oblivious-memory extension, and logic to track and denote data as confidential/public. For the latter, we implement a hardware information flow tracking mechanism similar to hardware dynamic information flow tracking, but capable of checking and updating whether data is confidential/public (the data’s label) at any stage in the pipeline.
Figure 4. Microarchitectural changes needed to support the OISA from the “Design of A Concrete OISA” section (including the oblivious-memory extension, denoted “omp”). Label stations check and enforce the transition rules from Figure 2.
FORMAL ANALYSIS
In parallel to our hardware prototype, we develop a formal analysis that models an abstract BOOM-class processor (OoO, speculative, superscalar), and describe how to map the abstract BOOM to our concrete BOOM prototype. Through this model, we prove that the OISA provides a basis to satisfy strong security definitions such as those we defined in the “Formal Definitions for Microarchitectural Side Channels” section. Our security analysis is general, and applies given any implementation of several important processor structures (e.g., it models the branch predictor as an arbitrary function that takes previous branch resolutions as input).
Importantly, we are able to prove security while allowing high-performance hardware optimizations (e.g., OoO, speculative execution) to remain enabled in the common case and without ever requiring hardware flushes to structures such as the cache or branch predictors.
Security intuition. Informally, to argue security, we need to show the following.
a) Each instruction’s resource usage/side-effects are independent of Confidential data.
b) The sequence of instructions that are executed, i.e., the processor program counter (PC), is independent of Confidential data.‡
Condition (a) follows by definition by applying the rules in Figure 2 to each instruction as it executes. A more subtle point is that condition (b) also follows from applying the same rules. To see why, first consider a simple unpipelined, in-order processor with no speculation. In this case, it is clear condition (b) holds because the only instruction type from Figure 3 that changes the PC as a function of data is a branch, and the OISA requires that the branch predicate and target be Public data. What happens when we consider more advanced pipelines, e.g., with prediction and speculation? In that case, microarchitectural state outside of program semantics, e.g., the branch predictor state, influences the PC. To extend our security argument to these machines, we must extend what we mean by “resource usage/side-effect” to include these structures. Then, using induction, one can show that if conditions (a) and (b) hold up to fetching the ith instruction, the branch predictor state when fetching the
‡ Similar requirements on “not tainting” the PC also govern prior work.10
We make a key observation that the underlying programming abstraction assumed for those works is the same abstraction provided by an OISA. For example, a homomorphic encryption operation is akin to a safe instruction, just using a different implementation suitable for a different threat model. This enables a new, large-scale research agenda to port insights and advances made in the applied cryptography community to/from the microarchitectural side channel community. For example, we can enable high-level programming abstractions for writing OISA-secure code by adding a new OISA backend to existing data-oblivious compiler frameworks. At the same time, the notion of safe instructions provides a new theory to explore in applied cryptography. In particular, algorithm design in encrypted computation assumes only extremely simple safe operations (e.g., bit add or multiply). With an OISA, however, we can choose which operations support safe operands, and co-design algorithms with this in mind to improve performance.
REFERENCES
1. A. C. Aldaya, B. B. Brumley, S. U. Hassan, C. P. García, and N. Tuveri, “Port contention for fun and profit,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 870–887.
2. M. Andrysco, D. Kohlbrenner, K. Mowery, R. Jhala, S. Lerner, and H. Shacham, “On subnormal floating point and abnormal timing,” in Proc. IEEE Symp. Secur. Privacy, 2015, pp. 623–639.
3. C. Celio, P.-F. Chiu, B. Nikolic, D. A. Patterson, and K. Asanovic, “BOOM v2: An open-source out-of-order RISC-V core,” Tech. Rep. UCB/EECS-2017-157, EECS Dept., Univ. California, Berkeley, CA, USA, 2017.
4. D. E. Denning, “A lattice model of secure information flow,” Commun. ACM, vol. 19, pp. 236–243, May 1976.
5. D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, and D. Ponomarev, “BranchScope: A new side-channel attack on directional branch predictor,” in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 693–707.
6. A. Ferraiuolo, M. Zhao, A. C. Myers, and G. E. Suh, “HyperFlow: A processor architecture for nonmalleable, timing-safe information flow security,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2018, pp. 1583–1600.
7. O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious RAMs,” J. ACM, vol. 43, pp. 431–473, May 1996.
8. P. Kocher et al., “Spectre attacks: Exploiting speculative execution,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 1–19.
9. S. Sasy, S. Gorbunov, and C. W. Fletcher, “ZeroTrace: Oblivious memory primitives from Intel SGX,” in Proc. Netw. Distrib. Syst. Secur. Symp., San Diego, CA, USA, Feb. 18–21, 2018. Available: http://dx.doi.org/10.14722/ndss.2018.23239
10. M. Tiwari, H. M. Wassel, B. Mazloom, S. Mysore, F. T. Chong, and T. Sherwood, “Complete information flow tracking from the gates up,” in Proc. 14th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2009, pp. 109–120.
11. Z. Wang and R. B. Lee, “New cache designs for thwarting software cache-based side channel attacks,” in Proc. 34th Annu. Int. Symp. Comput. Archit., 2007, pp. 494–505.
12. J. Yu, M. Yan, A. Khyzha, A. Morrison, J. Torrellas, and C. Fletcher, “Speculative taint tracking (STT): A comprehensive protection for speculatively accessed data,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 954–968.
Jiyong Yu is currently working toward the Ph.D. degree at the University of Illinois at Urbana–Champaign. His research interests are in processor security. Contact him at [email protected].
Lucas Hsiung received the B.S. degree from the University of Illinois at Urbana–Champaign in 2019 and now works as a Security Verification Engineer at SciFive. Contact him at [email protected].
Mohamad El Hajj is currently working toward the M.S. degree at the University of Illinois at Urbana–Champaign with research interests in hardware security. Contact him at [email protected].
Christopher W. Fletcher is an Assistant Professor in computer science at the University of Illinois at Urbana–Champaign. He has interests ranging from computer architecture to security to high-performance computing (ranging from theory to practice, algorithm to software to hardware). Fletcher received the Ph.D. degree from Massachusetts Institute of Technology in 2016. Contact him at [email protected].
Theme Article: Top Picks
Digital Object Identifier 10.1109/MM.2020.2986113
Date of publication 8 April 2020; date of current version 22 May 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
PRIVACY IN THE digital age has become increasingly difficult to achieve and a contentious topic. As technologies that capitalize on facial recognition, location services, and personal health tracking become mainstream, addressing these complex privacy issues is of foremost importance. Policy makers have put in place regulations on data protection through the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Computer scientists and engineers must develop systems and tools for embedding privacy into existing and new workflows. In this article, we describe a new approach to privacy, wringing, with particular applicability to the problem of sharing program traces.
When working toward application-tuned systems, developers often find themselves caught between the need to share information (so that partners can make intelligent design choices) and the need to hide information (to protect
proprietary methods or sensitive data). One place where this problem comes to a head is in the release of program traces; even the simplest memory access traces leak a tremendous amount of information. For example, we can capture the memory access behavior of a critical cryptographic function (which is known to be a function of the secret key), a set of lookups corresponding to the parsing of a social security number, or even detailed system configuration parameters that are considered a trade secret. While the sharing of these traces between technology partners can lead to more robust and high-performance systems, it can also leak highly sensitive information, and expose user data to security vulnerabilities. Today when such traces are needed, programmers may be asked to obfuscate the key algorithm behaviors to hide sensitive data or provide models of the system, which approximate the same behavior but omit sensitive parts. Hand-built models of the system are both tedious to code and of limited predictive power. Since there is no well-defined and well-trusted approach to this problem, developers are often forced to resort to rough human-language descriptions of the behavior of programs (e.g., it is 80% pointer-chasing). This leads to missed opportunities, frustrated optimization, and the design process ultimately suffers. Ideally, engineers would have access to methods to eliminate any sensitive information from the traces while still capturing the program behavior and its interaction with the underlying hardware. However, the extent to which “sensitive” data influences program behavior is rarely understood by a single party, and even harder to argue is that it is completely absent from a trace.
We present a new formulation of this problem of sharing traces where before release one knows (a priori) exactly how much information a trace is leaking in the worst case. The key idea, wringing, is to squeeze as much information as possible out of the trace without completely compromising its utility. In the ideal case, only the useful structure of the trace remains and all potentially sensitive data has been eliminated. While there is no known mechanism of quantifying the amount of sensitive data that remains in an arbitrary trace, we can at least say how much total information is shared, which provides a useful upper bound. If we share only a couple thousand bits about a trace, we can then be certain we are not giving away every user’s social security number by accident. Reconstructing a useful trace from a few thousand bits of information is hard, but interestingly we are free to use any public information about the nature of these traces in helping us accomplish this. Compression, when taken to this extreme and lossy form, connects to privacy in this unexpected way. However, as is often the case in computer architecture, an important tradeoff remains between the information leaked and the ability of the trace to capture the program behavior.
We formalize this new approach specifically in the context of memory address traces in part because we have many prior trace analysis techniques to build on.7,9,12 To expose the tradeoff inherent to this problem, we explore a new class of memory trace synthesis techniques based on ideas from signal processing. By projecting the address space onto a wrapped 2-D heatmap, we decompose memory behavior into an orthogonal set of features that can then be replayed to reproduce the same “visible” patterns as the traces under examination. Specifically, we use a Hough-transformed3 trace to find both constant and strided access patterns. We find that for memory traces it is indeed possible for useful program behavior to be conveyed in only a few thousand bits. We demonstrate the utility of wrung traces through cache simulation with bounded leakage, and even examine the sensitivity of wrung traces to a class of attacks on AES encryption.
TRACE WRINGING AS A NEW GAME
The program traces we look at in this article are memory access traces specifically, but more generally fall into a class of traces useful for application-tuning and hardware–software
heatmap for gcc where instruction count (time) runs along the x-axis and the address runs along the y-axis. If we were to plot this for the entire memory, it would clearly be too large for such a graph (the distance between the stack and heap would dwarf any local behavior), so instead, we plot the address modulo a large power of two. Heatmaps such as this have the advantage of mapping addresses onto a more manageable space, but at the same time, keep the spatio–temporal structures that would actually impact a real cache.
Interesting and intuitive patterns emerge after looking over this graph. The flat horizontal lines in the graph are patterns of repeating access to a set of addresses. These are high temporal locality behaviors. Sharp diagonal lines, on the other hand, are regions of high spatial locality as addresses are accessed one after the other in succession. If we can concisely capture the character of these behaviors, without transmitting the addresses themselves, we can minimize the amount of information leaked. The modulo-memory heatmaps exhibit hierarchical organization. Globally, there exists a recurrence of similar patterns on the order of a few tens of thousands of instructions, i.e., the presence of program phases, and within them, we observe patterns that we associate with the more local memory access activity. In order to find some representative of the higher echelons of this hierarchy, we employ k-means clustering for program phase analysis.2,9 Rather than encoding the entire trace monolithically, we can encode just the k representative clusters independently. By breaking the pattern down into a set of simpler behaviors, we can then tackle them one-by-one. Figure 2 shows the result of running the phase detector on the memory address trace for gcc. Each of the three colors in the bar in the figure shows the occurrence of one of three unique phases in the memory access trace. The technique does a good job of lining up with the repeating structures in the heatmap. With these phases marked, we can encode the k representative clusters with log₂ k bits.
Given that both strong temporal and spatial locality features show up as lines, decomposition into a set of line segments is a natural place to start. The Hough transform can then be used to find the locations and orientations of certain geometric primitives, such as lines, in the given space. We apply the Hough transformation, a popular computer vision technique for detecting patterns in images; for our features, we employ the Hough line transform. Specifically, we use the progressive probabilistic Hough transform,5 a rendition of the Hough transform algorithm that only performs voting on a subset of the input points. These input points are chosen based on certain features of the expected result, such as a threshold, the length of the expected line, interpolation strategies, and the angle of the line. By interleaving the voting process with line detection, this algorithm finds the most prevalent features first, while also minimizing the computational load. The progressive probabilistic Hough transform returns a set of lines, with each line’s (x, y) coordinates in the modulo-memory heatmap space. We also introduce a variable, weight, for each line, which is a measure of darkness of the line. Some intuition about how the probabilistic Hough transform functions is described in Figure 3.
The list of phase identifiers (the result of clustering), the two (x, y) coordinates of each line segment detected by the Hough transforms, and the line’s weight in the representative phase, create compact “information packets.” The size of the total “transmission” is n and bounds the maximum amount of information leaked.
After phase detection and Hough-line transformation, we end up with a set of lines for each representative phase. We can see the decomposition of a phase of gcc into lines in Figure 4. Each phase
Figure 2. Phases visible in the trace generated by gcc after k-means clustering. Each of the three colors in the bar marks a unique phase in the trace. Note, importantly, that phases reoccur over time.
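As a rough worked example of the bound n, the sketch below adds up the size of the “information packets”: one phase identifier per trace interval, plus two (x, y) endpoints and a weight for every line in each representative phase. Every number here is an assumption chosen for illustration (the interval count, the lines per phase, and the coordinate and weight field widths). The function names `phase_id_bits` and `packet_bits` are hypothetical and do not come from the article.

```python
def phase_id_bits(k):
    """Bits needed to name one of k phases, i.e., ceil(log2 k)."""
    return (k - 1).bit_length()


def packet_bits(num_intervals, k, lines_per_phase, coord_bits, weight_bits):
    """Total transmission size n in bits, under the assumed field widths."""
    # One phase identifier per interval of the trace.
    phase_list_bits = num_intervals * phase_id_bits(k)
    # Each line segment carries two (x, y) endpoints and one weight.
    bits_per_line = 4 * coord_bits + weight_bits
    return phase_list_bits + sum(lines_per_phase) * bits_per_line


# Hypothetical setting: 100 intervals, k = 3 phases with 12, 8, and 10 lines,
# 10-bit heatmap coordinates, and 4-bit weights.
n = packet_bits(num_intervals=100, k=3, lines_per_phase=[12, 8, 10],
                coord_bits=10, weight_bits=4)
print(n)  # 1520
```

With these assumed widths the total is 200 bits of phase identifiers plus 30 lines at 44 bits each, which lands in the “few thousand bits” range the text describes.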
Figure 5. (a) Heatmaps for the sensitive input trace gcc and (b) the trace-wrung proxy generated by our
pipeline. The heatmap of the trace-wrung proxy shows that both global and local features line up with the input
trace. All but the subtlest of patterns are present in the trace-wrung proxy.
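The heatmaps in Figure 5 are built from the modulo-memory projection, and their line features are found with a Hough transform. The sketch below illustrates both steps on toy data, under loudly stated assumptions. It uses the standard Hough line transform that votes on every point, not the progressive probabilistic variant the article employs. The function names `wrapped_heatmap` and `hough_peak`, the bucket sizes, and the one-degree angle step are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def wrapped_heatmap(trace, modulus, rows, window):
    """Count accesses per (time bucket, wrapped address bucket) cell."""
    cells = Counter()
    for i, addr in enumerate(trace):
        x = i // window                          # time axis: instruction count
        y = (addr % modulus) * rows // modulus   # address axis: wrapped by the modulus
        cells[(x, y)] += 1
    return cells


def hough_peak(cells, theta_step_deg=1):
    """Return ((rho, theta_deg), votes) for the strongest line through the cells.

    Each nonempty cell votes once for every line rho = x*cos(theta) + y*sin(theta)
    passing through it. Assumes cells is nonempty.
    """
    votes = Counter()
    for (x, y) in cells:
        for theta in range(0, 180, theta_step_deg):
            t = math.radians(theta)
            rho = round(x * math.cos(t) + y * math.sin(t))
            votes[(rho, theta)] += 1
    return votes.most_common(1)[0]


# Repeated access to one address: a flat horizontal line (temporal locality).
constant = [0x1005] * 32
print(hough_peak(wrapped_heatmap(constant, modulus=64, rows=64, window=1)))

# Unit-stride accesses: a diagonal line (spatial locality).
strided = [0x2000 + i for i in range(32)]
print(hough_peak(wrapped_heatmap(strided, modulus=64, rows=64, window=1)))
```

In this toy setting the constant-address trace peaks at a line whose normal angle is 90 degrees (a horizontal line), and the unit-stride trace peaks at 135 degrees (a diagonal), mirroring the temporal and spatial locality patterns described in the text. A faithful implementation would also track a per-line weight and return segment endpoints, which this sketch omits.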
Deeksha Dangwal is currently working toward the Ph.D. degree in computer architecture with the Department of Computer Science, University of California, Santa Barbara. Her research interests include computer architecture, privacy, and information theory. She is a student member of IEEE and ACM. She is the corresponding author of this article. Contact her at [email protected].
Weilong Cui is currently a Software Engineer with Google. His research interests include statistical/economic-inspired methods and programming languages for computer architecture performance modeling, as well as novel micro/system-architecture and its interaction with software. Cui received the master’s and bachelor’s degrees in computer science from Peking University and the Ph.D. degree from the Department of Computer Science, University of California, Santa Barbara. Contact him at [email protected].
Joseph McMahan is currently a Research Scientist with the University of Washington, Seattle. His research interests include computer architecture, security, formal methods, and machine learning. McMahan received the Ph.D. degree from the University of California Santa Barbara. He is a member of ACM and IEEE. Contact him at [email protected].
Timothy Sherwood is currently a Professor of computer science and the Associate Vice-Chancellor for Research with the University of California, Santa Barbara. He is a cofounder of the hardware security startup Tortuga Logic and the 2016 ACM SIGARCH Maurice Wilkes Awardee “for contributions to novel program analysis advancing architectural modeling and security.” Sherwood received the B.S. degree in computer science from UC Davis, and the M.S. and Ph.D. degrees from UC San Diego. Contact him at [email protected].
Digital Object Identifier 10.1109/MM.2020.2984182
Date of current version 22 May 2020.
WE SPEAK OF viruses, though not the ones that infect computers. Believe it or not, this follows a long tradition in economics. Two centuries ago, T. Malthus first made his grim predictions about death from resource shortages. Economics has been known as the dismal science ever since.
Grim is the mood of the day. This is a painful moment for many stockholders, managers, and personnel. Many readers may not have had to think about this topic since the financial panic of 2008, or maybe even the dot-com bust.
Dismal as this topic is, let us understand how the pandemic could cause so much economic damage, and why it may take so long to recover. I will keep it basic and focus on the economics of this situation for the technology economy.
SHORT-RUN EFFECTS
Analysts have a special set of models for scenarios that arise with low probability and impose high expense. You may have come across such models when buying life insurance or catastrophic medical insurance. These types of forecasts are challenging to make, but less so in insurance markets due to the abundance of precedent. It is possible to ground a forecast in historical patterns.
Forecasts become more difficult when everything is hypothetical. For example, you may have heard about “stress tests” for banks, which were implemented after the financial mess in 2008. That “test” (more like an audit) focuses on whether the bank can survive a rare and hypothetical painful scenario. (Now that we are experiencing an actual scenario, we will find out if the hypothetical is planned appropriately.)
Most of today’s dislocation realizes scenarios that had been hypothetical until a few months ago. This time markets inherit the economic decline from shutting down services in the economy (more in a minute). There is nothing in recent history like this. The situation looks quite different than it did in 2008 or 2001, where the disruption in those instances originated with the mortgage lending crisis, and with the misdirected investments behind the dot-com boom.
In spite of the lack of precedent, today’s economic crisis broadly contains predictable features. Here is why. The economy is one big circular flow of expenditure in which one person’s purchase is another person’s sale, and that further goes into somebody’s paycheck from which
more purchases arise. In addition, it just keeps going. Virtually every part of the U.S. economy has been finely tuned to expect this circular flow. Until recently this flow was predictable.
The pandemic made the expectations about flows obsolete. That introduces many dislocations. Here is the chain of causality around which all hypothetical forecasts revolve. Restaurants, bars, schools, theaters, and sporting events have closed in all major locations where people “shelter in place.” As of this writing, more than a third of the population lives in areas with such lockdowns, and it could increase. Related parts of the economy also have slowed considerably: Travel on airlines, staying in hotels, vacationing at beaches, and enjoying amusement parks.
Depending on how you count it, that interrupts somewhere between 10% and 15% of the expenditure in the U.S. economy, and the jobs of somewhere between 10% and 20% of the labor force. In the third week of March over three million people, more than 2% of the labor force, filed new claims for unemployment. That has never before happened in one week.
What happens next? Two additional concepts help round out the big picture: When everybody cuts back a little, it adds up and reinforces the overall movements downward. Once everyone expects the worst, as we do now, then the low forecast becomes self-fulfilling. Economists call these “multiplier” effects, and “self-reinforcing expectations,” and it makes the whole worse than the sum of its parts.
By way of analogy, system engineers might recognize this phenomenon as secondary feedback effects. The sum of secondary feedbacks reinforces the first-order effects, and makes the system behave at a suboptimal equilibrium. Worst of all, once it settles (at a high level of unemployment), it does not move away easily.
The dot-com bust started from a different place, but events there illustrate how multiplier effects operate. In that case, the loss of financial confidence led many of those online firms to cancel orders for equipment at the same time: PCs, office networking, and application software. That caused liquidity issues at otherwise healthy and efficient suppliers. No funds came in for new sales, while inventories of final products piled up in warehouses, and the bills for last month’s inputs and workforce needed attention.
It almost goes without saying, but gloomy forecasts about illiquidity are the mood of the day at many firms. Relatedly, this is also part of the logic behind federal legislation to make credit available to businesses and cash available to households. It alleviates suffering and delays liquidity crises.
The uncertainty about expenditure explains some of the “wait and see” remarks from financial analysts.
WHAT THIS DOES TO FIRMS
Despite the uncertainty, what can we say? Representative examples can illustrate.
Many firms get their flow of funds directly from expenditures for leisure and travel. When that expenditure drops, so too do sales at, for example, Airbnb, Priceline, Hotel.com, Expedia, Orbitz, and more. When restaurants receive fewer visits, so too does advertising on Open Table and Yelp and, to some extent, Google.
Related local transportation, such as Uber and Lyft, declines. Close to a third of their trips go to airports. Additionally, many people are sheltering at home, neither going to the local restaurant nor far away on a vacation. Furthermore, while Uber Eats might make more deliveries, it is not enough to make up for the overall decline.
The next few months are a lousy time to sell a consumer product in a retail outlet. Sales of iPhones and smart phones have declined, as consumers put off an optional purchase and retail outlets close for health safety. Same forecast for printers, home networks, and tablets. Every parts supplier within those supply chains should expect lower sales in the near term. That hurts sales at Apple, Samsung, Intel, Qualcomm, and HP, as well as distributors, like CDW, Staples, and Office Max. Ouch.
The next generation of online games and entertainment looks like a mixed bag. Online channels help. Freemium services will tend to do better than subscription services, if any of them does well at all. General use will go up, especially for established brands, such as Electronic Arts. But not many devices will be sold, such as Xbox or PlayStation, except through online channels.
What about Facebook and Google? Both have seen massive increases in traffic from use, which
Micro Economics
means they sell more ads with more people online. With less shopping overall, however, each ad is not as valuable. The value of the ad-based business could increase or decline overall.

You may say: what about online retailing? Amazon should be able to take advantage of everyone shopping online instead of visiting malls. And perhaps some of their third-party suppliers will be fine as well (e.g., if they have hand sanitizer to sell). That said, the most profitable division at Amazon is AWS, and more broadly, nobody expects AWS, Azure, or Google Cloud to decline. The flexibility inherent in that business model gives them advantages over construction of new data centers or equipment purchases for business.

Streaming services appeal to all the people stuck at home. Established services, such as Netflix, Hulu, YouTube, and Amazon Prime Video, are positioned to do well. This environment also gives an opportunity to HBO Go, CBS, Sling, PlayStation Vue, and a few others.

That said, new launches are always risky, and especially so in this environment. For example, Quibi, with its hype and subscription price, probably would have been better off launching just a few months earlier, just as DisneyPlus did. Will Quibi gain traction in this environment? Anybody’s guess.

Some online communication tools have gotten more use too, such as WhatsApp, Zoom, and Slack. Google Hangouts, Skype, and Webex have seen a resurgence, as have lesser-known services, such as Join.me, GoToMeeting, Stride, RingCentral, and BigMarker.

The carrier business will see a mix of experiences. Traffic has gone up for both wireline and wireless networks, as home networks press into capacity utilization far above anticipated levels. This situation puts pressure on users to increase wireless data contracts. Wireless firms also should benefit from pressure on households to drop broadband and go only with wireless Internet. Broadband carriers have seen home traffic increase, while pressures for cord-cutting (i.e., dropping television contracts) have increased. The broadband business also has always had to manage a huge fraction of households who do not pay their bills on time, which also should grow worse over the next few months.

Finally, the financial side of technology will go through a somewhat predictable decline. No VC will have an IPO in the near term. Many startups with cash flow issues will be sold at fire sale prices to large buyers as acquisitions. As happened in 2009, many VCs will tend to cut off their worst performing firms, but which firms and when? Good luck with forecasting that.

UNANSWERED QUESTIONS

Most people want to know: how long will this last? Forecasting the recovery is especially difficult. Both biology and economics play a role. So does economic policy.

Many analysts want to know: Will any of the behavior exhibited during the period of home sheltering persist into the future?

As I write this, there is simply no precedent for making an educated guess. There are too many unknowable unknowns.

Shane Greenstein is a Professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.