On-Chip Networks, Second Edition
Natalie Enright Jerger, University of Toronto
Tushar Krishna, Georgia Institute of Technology
Li-Shiuan Peh, National University of Singapore
Synthesis Lectures on Computer Architecture
Series Editor: Margaret Martonosi, Princeton University
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015
Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00772ED1V01Y201704CAC040
Lecture #40
Series Editor: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
On-Chip Networks
Second Edition
Natalie Enright Jerger
University of Toronto
Tushar Krishna
Georgia Institute of Technology
Li-Shiuan Peh
National University of Singapore
Morgan & Claypool Publishers
ABSTRACT
This book targets engineers and researchers familiar with basic computer architecture concepts
who are interested in learning about on-chip networks. This work is designed to be a short
synthesis of the most critical concepts in on-chip network design. It is a resource for both
understanding on-chip network basics and for providing an overview of state-of-the-art research
in on-chip networks. We believe that an overview that teaches both fundamental concepts and
highlights state-of-the-art designs will be of great value to both graduate students and industry
engineers. While not an exhaustive text, we hope to illuminate fundamental concepts for the
reader as well as identify trends and gaps in on-chip network research.
With the rapid advances in this field, we felt it was timely to update and review the state
of the art in this second edition. We introduce two new chapters at the end of the book. We
have updated the latest research of the past years throughout the book and also expanded our
coverage of fundamental concepts to include several research ideas that have now made their
way into products and, in our opinion, should be textbook concepts that all on-chip network
practitioners should know. For example, these fundamental concepts include message passing,
multicast routing, and bubble flow control schemes.
KEYWORDS
interconnection networks, topology, routing, flow control, deadlock, computer architecture,
multiprocessor system on chip
To our families
for their encouragement and patience
through the writing of this book.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Advent of the Multi-core Era . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Communication Demands of Multi-core Architectures . . . . . . . . . . . . . 1
1.2 On-chip vs. Off-chip Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Network Basics: A Quick Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Evolution to On-chip Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 On-chip Network Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Performance and Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 This Book—Second Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 Traffic-independent Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Traffic-dependent Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Direct Topologies: Rings, Meshes, and Tori . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Indirect Topologies: Crossbars, Butterflies, Clos Networks, and Fat Trees . . . 32
3.4 Irregular Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Splitting and Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Topology Synthesis Algorithm Example . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Hierarchical Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.1 Place-and-route . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.2 Implication of Abstract Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Brief State-of-the-Art Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Types of Routing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Deadlock Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Deterministic Dimension-ordered Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Oblivious Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Adaptive Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Multicast Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Routing on Irregular Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.8 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8.1 Source Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8.2 Node Table-based Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8.3 Combinational Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8.4 Adaptive Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.9 Brief State-of-the-Art Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1 Messages, Packets, Flits, and Phits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Message-based Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Circuit Switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Packet-based Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.1 Store and Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.2 Virtual Cut-through . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Flit-based Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 Wormhole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Virtual Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6 Deadlock-free Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.6.1 Dateline and VC Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.6.2 Escape VCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6.3 Bubble Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.7 Buffer Backpressure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.8 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.8.1 Buffer Sizing for Turnaround Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.8.2 Reverse Signaling Wires . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.9 Flow Control in Application Specific On-chip Networks . . . . . . . . . . . . . . . . 72
5.10 Brief State-of-the-Art Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Router Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1 Virtual Channel Router Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Buffers and Virtual Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.1 Buffer Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2.2 Input VC State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3 Switch Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.1 Crossbar Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.2 Crossbar Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.3 Crossbar Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Allocators and Arbiters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4.1 Round-robin Arbiter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.2 Matrix Arbiter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4.3 Separable Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4.4 Wavefront Allocator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.5 Allocator Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.5.1 Pipeline Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.5.2 Pipeline Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6 Low-power Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.6.1 Dynamic Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.6.2 Leakage Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.7 Physical Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.1 Router Floorplanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.7.2 Buffer Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.8 Brief State-of-the-Art Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 Modeling and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.1 Analytical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.2 Ideal Interconnect Fabric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1.3 Network Delay-throughput-energy Curve . . . . . . . . . . . . . . . . . . . . . . 105
7.2 On-chip Network Modeling Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2.1 RTL and Software Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.2 Power and Area Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.3.1 Message Classes, Virtual Networks, Message Sizes, and Ordering . . 109
7.3.2 Application Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.3.3 Synthetic Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4 Debug Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.5 NoC Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.6 Brief State-of-the-Art Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Preface
This book targets engineers and researchers familiar with basic computer architecture concepts
who are interested in learning about on-chip networks. This work is designed to be a short
synthesis of the most critical concepts in on-chip network design. We envision this book as a
resource for both understanding on-chip network basics and for providing an overview of state-
of-the-art research in on-chip networks. We believe that an overview that teaches both funda-
mental concepts and highlights state-of-the-art designs will be of great value to both graduate
students and industry engineers. While not an exhaustive text, we hope to illuminate funda-
mental concepts for the reader as well as identify trends and gaps in on-chip network research.
With the rapid advances in this field, we felt it was timely to update and review the state
of the art in this second edition. We introduce two new chapters at the end of the book, as will
be detailed below. Throughout the book, in addition to updating the latest research in the past
years, we also expanded our coverage of fundamental concepts to include several research ideas
that have now made their way into products and, in our opinion, should be textbook concepts
that all on-chip network practitioners should know. For example, these fundamental concepts
include message passing, multicast routing, and bubble flow control schemes.
The structure of this book is as follows. Chapter 1 introduces on-chip networks in the
context of multi-core architectures and discusses their evolution from simple point-to-point
wires and buses for scalability.
Chapter 2 explains how networks fit into the overall system architecture of multi-core
designs. Specifically, we examine the set of requirements imposed by cache-coherence protocols
in shared memory chip multiprocessors, and contrast that with the requirements in message-
passing multi-cores. In addition to examining the system requirements, this chapter also de-
scribes the interface between the system and the network.
Once a context for the use of on-chip networks has been provided through a discussion of
system architecture, the details of the network are explored. As topology is often a first choice in
designing a network, Chapter 3 describes various topology trade-offs for cost and performance.
Given a network topology, a routing algorithm must be implemented to determine the path(s)
messages travel to be delivered throughout the network fabric; routing algorithms are explained
in Chapter 4. Chapter 5 deals with the flow control mechanisms employed in the network; flow
control specifies how network resources, namely buffers and links, are allocated to packets as
they travel from source to destination. Topology, routing, and flow control all factor into the
microarchitecture of the network routers. Details on various microarchitectural trade-offs and
design issues are presented in Chapter 6. This chapter includes the design of buffers, switches,
and allocators that comprise the router microarchitecture. Although power consumption can
be addressed through innovations in all areas of on-chip networks, we focus our new power
discussion in the microarchitecture chapter as this is where many such optimizations are realized.
New Chapter 7 covers the nuts and bolts of modeling and evaluating on-chip networks,
from software simulations to RTL design and emulation on FPGA, to architectural models
of delay, throughput, area, and power. The chapter also guides the reader on useful metrics for
evaluating on-chip networks and ideal theoretical yardsticks for comparing against.
With the plethora of industrial and academic on-chip network chips now available, we
dedicate a new Chapter 8 to a survey of these. The chapter provides the reader with a sweeping
understanding of how the various fundamental concepts presented in the earlier chapters come
together, and the implications of the design and implementation of such concepts.
Finally, in Chapter 9, we leave the reader with thoughts on key challenges and new areas
of exploration that will drive on-chip network research in the years to come. Substantial new
research has clearly surfaced, and here we focus on various significant trends that highlight the
cross-cutting nature of on-chip network research. Emerging new interconnects and devices sub-
stantially change the implementation tradeoffs of on-chip networks, and in turn prompt new de-
signs. Newly important metrics such as resilience, due to increasing variability in the fabrication
process, or quality-of-service that is prompted by multiple workloads running simultaneously
on many-cores, will add new dimensions and prompt new research ideas across the community.
Acknowledgments
We would like to thank Margaret Martonosi for her feedback and encouragement to create
the second edition of this book. We continue to be grateful to Mark Hill for his feedback and
support in crafting the previous edition. Additionally, we would like to thank Michael Morgan
for the opportunity to contribute once again to this lecture series. Many thanks to Timothy
Pinkston and Lizhong Chen for their detailed comments that were invaluable in improving
this manuscript. Thanks to Mario Badr, Wenbo Dai, Shehab Elsayed, Karthik Ganesan, Parisa
Khadem Hamedani, and Joshua San Miguel of the University of Toronto for proofreading our
early drafts. Thanks to Georgia Tech students Hyoukjun Kwon and Ananda Samajdar for feedback
on early drafts, and Monodeep Kar for help with literature surveys. Thanks also to the many
students and instructors who have used the first edition over the years and provided feedback
that led to this latest edition.
CHAPTER 1
Introduction
Since the introduction of research into multi-core chips in the late 1990s [40, 271, 336], on-
chip networks have emerged as an important and growing field of research. As core counts
increase, and multi-core processors emerge in diverse domains ranging from high-end servers
to smartphones and even Internet of Things (IoT) gateways, there is a corresponding increase
in bandwidth demand to facilitate high core utilization and a critical need for scalable on-chip
interconnection fabrics. This diversity of application platforms has led to research in on-chip
networks spanning a variety of disciplines from computer architecture to computer-aided design,
embedded systems, VLSI, and more. Here, we provide a synthesis of critical concepts in on-chip
networks to quickly bootstrap students and designers into this exciting field.
The two primary costs associated with an on-chip network are area and power. As mentioned,
many-core architectures operate under very tight power budgets. The impact of different designs
on power and area will be discussed throughout this book, and examined in more detail in
Chapter 6 on Router Microarchitecture.
CHAPTER 2
Interface with System Architecture
[Figures (diagrams not recovered): a tile architecture with core, L1 I/D cache, L2 cache tags and
data, controller logic, router, and memory controller; and the broadcast vs. directory comparison
of Figure 2.2, in which the directory receives a request (2) and sends data to a single core (3).]
Directory protocols do not rely on any implicit network ordering and can be mapped to an arbitrary topology. Directory
protocols rely on point-to-point messages rather than broadcasts; this reduction in coherence
messages allows this class of protocols to provide greater scalability. Rather than broadcast to
all cores, the directory contains information about which cores have the cache block. A single
core receives the read request from the directory in Figure 2.2b resulting in lower bandwidth
requirements.
Directories maintain information about the current sharers of a cache line in the system
and coherence state information. By maintaining a sharing list, directory protocols eliminate
the need to broadcast invalidation requests to the entire system. Addresses are interleaved across
directory nodes; each address is assigned a home node, which is responsible for ordering and han-
dling all coherence requests to that address. Directory coherence state is maintained in mem-
ory; to make directories suitable for on-chip many-core architectures, directory caches are used.
Going off-chip to memory for all coherence requests is impractical. By maintaining recently
accessed directory information in on-chip directory caches, latency is reduced.
[Figure (diagram not recovered): two cache controllers connected by separate request and reply
queues in each direction.]
Protocols can require several different message classes. Each class contains a group of
coherence actions that are independent of each other; that is, a request message in one class will
not lead to the generation of another request message in the same class, but can trigger a message
of a different class. Deadlock can occur when there are resource dependences between messages
of different classes [322]. Here we describe three typical classes: requests, interventions, and
responses. Request messages include loads, stores, upgrades, and writebacks. Interventions are
messages sent from the directory to request modified data be transferred to a new node. Exam-
ples of response messages include invalidation acknowledgments, negative acknowledgments
(indicating a request has failed) and data messages.
Multiple virtual channels can be used to prevent protocol-level deadlock. The Alpha
21364 [254] allocates one virtual channel per message class to prevent protocol-level deadlock.
By requiring different message classes to use different virtual channels, the cyclic dependence
between requests and responses is broken in the network. Virtual channels and techniques to
deal with protocol-level deadlock and network deadlock are discussed in Chapter 5.
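As a concrete illustration, the sketch below shows one way this class-to-VC mapping could be expressed in software. It is our own minimal sketch, not the Alpha 21364's actual logic; the class names and the one-VC-per-class constant are assumptions for illustration.

    from enum import Enum

    class MessageClass(Enum):
        REQUEST = 0       # loads, stores, upgrades, writebacks
        INTERVENTION = 1  # directory asks an owner to forward modified data
        RESPONSE = 2      # acks, nacks, and data replies

    VCS_PER_CLASS = 1  # assumed; a router may also give each class several VCs

    def virtual_channel(msg_class: MessageClass, vc_within_class: int = 0) -> int:
        """Return a global VC index. Classes never share VCs, so a stalled
        request can never block the response that would eventually free it."""
        assert vc_within_class < VCS_PER_CLASS
        return msg_class.value * VCS_PER_CLASS + vc_within_class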
[Figure (diagram not recovered): side-by-side tile organizations showing controller logic and
cache tags and data arrays.]
Figure 2.5 provides two walk-through examples of a many-core design configured with
private L2 caches. In Figure 2.5a, the load of A misses in L1, but hits in the core’s private L2
cache, and after step 3, the data are returned to the L1 and the core. However, in Figure 2.5b,
the load of A misses in the private L2 cache and must be sent to the network interface (4),
sent through the network to the memory controller (5), sent off chip, and finally re-traverse the
network back to the requestor (6). After step 6, the data are installed in the L2 and forwarded to
the L1 and the core. In this scenario, a miss to a private L2 cache requires two network traversals
and an off-chip memory access.
Alternatively, the L2 cache can be shared amongst all or some of the cores. Shared caches
represent a more effective use of storage as there is no replication of cache lines. However, L2
cache hits incur additional latency to request data from a different tile. Shared caches place
more pressure on the interconnection network as L1 misses also go into the network, but more
effective use of storage may reduce pressure on the off-chip bandwidth to memory. With shared
caches, more requests will travel to remote nodes for data. As shown in Figure 2.4b, the on-chip
network must attach to both the L1s and the L2 when the L2 is shared; both levels of cache
share the injection and ejection bandwidth of the router.
Figure 2.6 provides two walk-through examples similar to those in Figure 2.5 but with a
many-core system configured with a shared L2 cache. In Figure 2.6a, the L1 cache experiences
a miss to address A. Address A maps to a remote bank of the shared L2, so the load request
must be sent to the network interface (3) and traverse the network to the appropriate node. The
read request arrives at the remote node (4) and is serviced by the L2 bank (5). The data are sent
to the network interface (6) and re-traverse the network back to the requestor (7). After step 7,
[Figure 2.5 (diagrams not recovered): (a) Private L2 hit: the core issues LD A (1), misses in the
L1 I/D cache (2), and hits in its private L2. (b) Private L2 miss: after the L1 miss, a message is
formatted to the memory controller (4) and the request is sent off-chip (5).]
the data are installed in the local L1 and sent to the core. Here, an L2 hit requires two network
traversals when the address maps to a remote cache (e.g., addresses can be mapped by a function
A mod N, where N is the number of L2 banks).
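A minimal sketch of such a static mapping follows; the block size and bank count are assumed values, and we interleave block addresses (rather than raw byte addresses) across banks, a common refinement of the A mod N function above.

    BLOCK_BYTES = 64    # assumed cache line size
    NUM_L2_BANKS = 16   # N in the text; assumed bank count

    def home_bank(address: int) -> int:
        """Home L2 bank of a physical address: (A / block size) mod N."""
        return (address // BLOCK_BYTES) % NUM_L2_BANKS

    assert home_bank(0x0000) == 0 and home_bank(0x0040) == 1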
In Figure 2.6b, we give a walk-through example for an L2 miss in a shared configuration.
Initially, steps 1-4 are the same as the previous example. However, now the shared L2 bank
misses to address A (5). The read request must again be sent to the network interface (6), for-
warded through the network to the memory controller and sent off chip (7), returned through
the network to the shared L2 bank and installed in the L2 (8) and then sent through the network
The message format and send block generates the destination of each message (in
the case of a directory protocol, this will be the location of the home node as determined by the
memory address), the address of the cache line requested, and the message request type (e.g.,
Read). Below the message format and send block, we show several possible message formats
that may be generated depending on the type of request. When a reply message comes from the
network, the MSHR matches the reply to one of the outstanding requests and completes the
cache miss actions. The message receive block is also responsible for receiving request messages
from the directory or another processor tile to initiate cache-to-cache transfers; the protocol
finite state machine takes proper actions and formats a reply message to send back into the net-
work. Messages received from the network may also have several different formats that must be
properly handled by the message receive block and the protocol finite state machine.
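The fields of such a request message might look like the following sketch; the field names are our assumptions, not a format prescribed by the book.

    from dataclasses import dataclass

    @dataclass
    class RequestMessage:
        dest_node: int   # e.g., the home node derived from the memory address
        block_addr: int  # address of the cache line requested
        msg_type: str    # e.g., "Read", "Upgrade", "Writeback"
        mshr_id: int     # lets the reply be matched to the outstanding miss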
The memory-to-network interface (shown in Figure 2.9) is responsible for receiving memory
request messages from processors (caches) and initiating replies. Different types and sizes
of messages are received from the network and sent back into the network as shown above the
message format and send block and the message receive block. At the memory side, transaction
status handling registers (TSHRs) handle outstanding memory requests. If memory controllers
are guaranteed to service requests in order, the TSHRs could be simplified to a FIFO queue.
Figure 2.8: Processor-to-network interface (adapted from Dally and Towles [86]).
Figure 2.9: Memory-to-network interface (adapted from Dally and Towles [86]).
However, as memory controllers often reorder memory requests to improve utilization, a more
complicated interface is required. Once a memory request has been completed, the message
format and send block is responsible for formatting a message to be injected into the network
and sent back to the original requester. A network interface employing MSHRs and TSHRs is
similar to the design utilized by the SGI Origin [215].
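A minimal sketch of such a TSHR table is shown below, assuming replies carry a transaction id so that completions arriving out of order can still be matched to their original requesters; all field and method names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class TSHREntry:
        src_node: int   # requester to which the reply must be sent
        address: int    # memory address being serviced
        is_write: bool

    class TSHRTable:
        def __init__(self, num_entries: int = 16):
            self.free = list(range(num_entries))
            self.entries = {}  # transaction id -> TSHREntry

        def allocate(self, src_node: int, address: int, is_write: bool) -> int:
            tid = self.free.pop()  # back-pressure when no entry is free is omitted
            self.entries[tid] = TSHREntry(src_node, address, is_write)
            return tid

        def complete(self, tid: int) -> TSHREntry:
            """Called when memory finishes a request, possibly out of order."""
            entry = self.entries.pop(tid)
            self.free.append(tid)
            return entry  # the caller formats a reply message to entry.src_node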
Out-of-order transactions. Many of the latest versions of these standards, such as OCP
3.0 [297], AXI [26], and STNoC [245], relax the strict ordering of bus-based semantics so that
point-to-point interconnect fabrics such as crossbars and on-chip networks can be plugged in,
while retaining backward compatibility with buses.
For instance, AXI relaxes the ordering between requests and responses, so responses need
not return in the same order as that of requests. Figure 2.11 illustrates this feature of AXI which
allows multiple requests to be outstanding and slaves to be operating at different speeds. This
allows multiple address and data buses to be used, as well as split-transaction buses (where a
Figure 2.11: The AXI protocol allows messages to complete out of order: D21 returns data prior to
D11 even though A11 occurred prior to A21.
transaction does not hold on to the bus throughout, but instead, requests and responses of the
same transaction separately arbitrate for the bus), and ultimately, on-chip networks. In on-chip
networks, packets sent between different pairs of nodes can arrive in different order from the
sending order, depending on the distance between the nodes and the actual congestion level. A
global ordering between all nodes is difficult to enforce. So out-of-order communication stan-
dards are necessary for on-chip network deployment.
Coherence. System-wide coherence support is provided by AMBA 4 ACE (AXI Co-
herency Extensions) and the more recent AMBA 5 CHI (Coherent Hub Interface) [26]. This
is in the form of additional channels to support various coherence messages, snoop response
controllers, barrier support, and QoS. This allows multiple processors to share memory for ar-
chitectures like ARM’s big.LITTLE.
2.4 CONCLUSION
This chapter introduces several system-level concepts that provide an important foundation and
context for our discussion of on-chip networks. We provide a high level overview of how various
architectural choices can impact on-chip network traffic. We also present a brief overview of
interface standards. We will revisit the impact of architectural design choices in Chapter 7 on
evaluation and in Chapter 8 where we present case studies of recent academic and industrial
designs that feature on-chip networks.
CHAPTER 3
Topology
The on-chip network topology determines the physical layout and connections between nodes
and channels in the network. The effect of a topology on overall network cost-performance is
profound. A topology determines the number of hops (or routers) a message must traverse as
well as the interconnect lengths between hops, thus influencing network latency significantly. As
traversing routers and links incurs energy, a topology’s effect on hop count also directly affects
network energy consumption. Furthermore, the topology dictates the total number of alternate
paths between nodes, affecting how well the network can spread out traffic and hence support
bandwidth requirements. The implementation complexity cost of a topology depends on two
factors: the number of links at each node (node degree) and the ease of laying out a topology on
a chip (wire lengths and the number of metal layers required).
One of the simplest topologies is a bus, which connects a set of components with a single,
shared channel. Each message on the bus can be observed by all components on the bus; it is
an effective broadcast medium. However, buses have limited scalability due to saturation of the
shared channel as additional components are added.
In this chapter, we will focus on switched topologies, where a set of components is con-
nected to one another via a set of routers and links. We first describe several metrics that are very
useful for developing back-of-the-envelope intuition when comparing topologies. Next, we will
describe several commonly used topologies in on-chip networks and compare them using these
metrics.
3.1 METRICS
Since the choice of topology is frequently the first decision designers make when building an
on-chip network, it is useful to have a means for quick comparisons of different topologies
before the other aspects of a network (such as its routing, flow control, and microarchitecture)
are even determined. Here, we describe several abstract metrics that come in handy when
comparing different topologies. Figure 3.1 shows three commonly used on-chip topologies that
we use to illustrate these metrics.
1 Note that the figure illustrates the 2-D version of meshes and tori.
[Figure 3.1 (diagrams not recovered): (a) Ring, (b) Mesh, (c) Torus, with example nodes A and
B marked in each.]
2 If there are multiple such cuts possible, it is the minimum among all the cuts.
3.1.2 TRAFFIC-DEPENDENT METRICS
Next, we define a set of metrics that depends on the traffic (i.e., source-destination pairs) flowing
through the network.
Hop count. The number of hops a message takes from source to destination, or the num-
ber of links it traverses, defines hop count. This is a very simple and useful proxy for network
latency, since every node and link incurs some propagation delay, even when there is no con-
tention. The maximum hop count is given by the diameter of the network. In addition to the
maximum hop count, average hop count is very useful as a proxy for network latency. It is given
by the average hops over all possible source-destination pairs in the network.
For the same number of nodes, and assuming uniform random traffic where every node
has an equal probability of sending to every other node, a ring (Figure 3.1a) will lead to higher
hop count than a mesh (Figure 3.1b) or a torus [93] (Figure 3.1c). For instance, in the figure
shown, assuming bidirectional links and shortest-path routing, the maximum hop count of the
ring is four, that of a mesh is also four, while a torus improves the hop count to two. Looking at
average hop count, we see that the torus again has the lowest average hop count (1 1/3). The mesh
has a higher average hop count of 1 7/9. Finally, the ring has the worst average hop count of the
three topologies in Figure 3.1, with an average of 2 2/9. The formulas for deriving these values will
be presented in Section 3.2.
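For readers who want to check these numbers, the short script below (ours, not the book's) computes the quoted averages for the 9-node ring, 3 x 3 mesh, and 3 x 3 torus, averaging over all ordered source-destination pairs with self-pairs included, which is the convention that yields the fractions above.

    from fractions import Fraction
    from itertools import product

    def ring_hops(a, b, n=9):
        d = abs(a - b)
        return min(d, n - d)          # shortest way around the ring

    def mesh_hops(a, b, k=3):
        (ax, ay), (bx, by) = divmod(a, k), divmod(b, k)
        return abs(ax - bx) + abs(ay - by)

    def torus_hops(a, b, k=3):
        (ax, ay), (bx, by) = divmod(a, k), divmod(b, k)
        dx, dy = abs(ax - bx), abs(ay - by)
        return min(dx, k - dx) + min(dy, k - dy)  # wrap-around links

    def average(hops, n=9):
        return Fraction(sum(hops(a, b) for a, b in product(range(n), repeat=2)), n * n)

    print(average(ring_hops), average(mesh_hops), average(torus_hops))
    # 20/9 (= 2 2/9), 16/9 (= 1 7/9), 4/3 (= 1 1/3)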
Maximum channel load. This metric is useful as a proxy for estimating the maximum
bandwidth the network can support, or the maximum number of bits per second (bps) that can
be injected by every node into the network before it saturates:
Maximum Injection Bandwidth = 1 / Maximum Channel Load.
Intuitively, it involves first determining which link or channel3 in the network will be the most
congested given a particular traffic pattern, as this link will limit the overall network bandwidth.
For uniform random traffic, this link is often on the bisection cut. Next, the load on this channel
is estimated. Since at this early stage of design, we do not yet know the specifics of the links we are
using (how many actual interconnects form each channel, and each interconnect’s bandwidth
in bps), we need a relative way of measuring load. Here, we define it as being relative to the
injection bandwidth. So, when we say the load on a channel is two, it means that the channel is
loaded with twice the injection bandwidth. So, if we inject a flit every cycle at every node into
the network, two flits will wish to traverse this specific channel every cycle. If the bottleneck
channel can handle only one flit per cycle, it constrains the maximum bandwidth of the network
to half the link bandwidth, i.e., at most, a flit can be injected every other cycle. Thus, the higher
the maximum channel load, the lower the network bandwidth.
3 We use link to refer to the physical set of wires connecting routers in an on-chip network, and channel to refer to the
logical connection between routers in the network. In most designs, the link and channel are identical and can be used
interchangeably.
Channel load can be calculated in a variety of ways, typically using probabilistic analysis.
If routing and flow control are not yet determined, channel load can still be calculated assuming
ideal routing (the routing protocol distributes traffic amongst all possible shortest paths evenly)
and ideal flow control (the flow control protocol uses every cycle of the link whenever there is
traffic destined for that link).
Here, we will illustrate this with a simple example, but the rest of the chapter will just show
formulas for the maximum channel load of various common on-chip network topologies rather
than walking through their derivations. Figure 3.2 shows an example network topology with
two rings connected with a single channel. First, we assume uniform random traffic where every
node has an equal probability of sending to every other node in the network including itself. To
calculate maximum channel load, we need to first identify the bottleneck channel. Here, it is the
single channel between the rings, shown in bold. We will assume it is a bidirectional link. With
ideal routing, half of every node’s injected traffic will remain within its ring, while the other half
will be crossing the bottleneck channel. For instance, for every packet injected by node A, there
is 1/8 probability of it going to either B, C, D, E, F, G, H, or itself. When the packet is destined
for A, B, C, D, it does not traverse the bottleneck channel; when it is destined for E, F, G, H,
it does. Therefore, 1/2 of the injection bandwidth of A crosses the channel. So does 1/2 of the
injection bandwidth of the other nodes. Hence, the channel load on this bottleneck channel is
2. As a result, the network saturates at 1/2 the injection bandwidth. Adding more nodes to both
rings will further increase the channel load, and thus decrease the bandwidth.
Figure 3.2: Channel load example with two rings connected via a single channel.
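The arithmetic of this example is simple enough to check mechanically. The sketch below (ours) counts, for uniform random traffic, the fraction of each node's injected traffic that crosses the bottleneck channel, and splits the total between the two directions of the bidirectional link.

    nodes = list("ABCDEFGH")
    left_ring = set("ABCD")

    total_crossing = 0.0
    for src in nodes:
        # a packet crosses iff its destination lies in the other ring (4 of 8 choices)
        crossing = sum(1 for dst in nodes if (src in left_ring) != (dst in left_ring))
        total_crossing += crossing / len(nodes)   # 1/2 of each node's injection

    per_direction_load = total_crossing / 2       # bidirectional link: half each way
    print(per_direction_load)                     # 2.0, so saturation at 1/2 injection BW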
Path diversity. A topology that provides multiple shortest paths (|Rsrc-dst| > 1, where
R represents the path diversity) between a given source and destination pair has greater path
diversity than a topology where there is only a single path between a source and destination pair
(|Rsrc-dst| = 1). Path diversity within the topology gives the routing algorithm more flexibility
to load-balance traffic which reduces channel load and thus increases throughput. Path diversity
also enables packets to potentially route around faults in the network. The ring in Figure 3.1a
provides no path diversity (|R| = 1), because there is only one shortest path between pairs of
nodes. If a packet travels clock-wise between A and B (in Figure 3.1a), it traverses four hops;
if the packet goes counter-clockwise, it traverses five hops. More paths can be supplied only
at the expense of a greater distance traveled. With an even number of nodes in a ring, two
nodes that are half-way around the ring from each other will have a path diversity of two due
to two minimal paths. On the other hand, the mesh and torus in Figures 3.1b and c provide a
wider selection of distinct paths between source and destination pairs. In Figure 3.1b, the mesh
supplies six distinct paths between A and B, all at the shortest distance of four hops.
[Figure 3.6 (diagrams not recovered): (a) a three-stage Clos network with r = 4 input switches
of size n x m, m = 5 middle switches of size r x r, and r = 4 output switches of size m x n;
(b) the folded Clos described below.]
A Clos network can be folded along the middle set of switches so that the input and
output switches are shared. In Figure 3.6b, a 5-stage folded Clos network characterized by the
triple (2, 2, 4) is depicted. The center stage is realized with another 3-stage Clos formed using a
(2, 2, 2) Clos network. This Clos network is folded along the top row of switches.
[Diagrams not recovered: the VOPD blocks (VLD, run length decoder, inverse scan, AC/DC
prediction, iDCT, iQuant, VOP reconstruction, up-sampling, stripe memory, VOP memory,
padding, and an ARM core) connected by routers (R) in a regular mesh and in a custom
topology.]
Figure 3.7: A regular (mesh) topology and a custom topology for a video object plane decoder
(VOPD) (from [47]).
The algorithm synthesizes a number of different topologies, starting with a topology where
all IP cores are connected through one large switch to the other extreme where each core has its
own switch. For each switch count, the algorithm tunes the operating frequency and the link
width. For a given switch count i, the input graph (Figure 3.8a) is partitioned into i min-cut
partitions. Figure 3.8b shows a min-cut partition for i = 3. The min-cut partition is performed
so that the edges of the graph that cross partitions have lower weights than the edges within
partitions. Additionally, the number of nodes assigned to each partition remains nearly the same.
Such a min-cut partition will ensure that traffic flows that have high bandwidth use the same
switch for communication.
Once the min-cut partitions have been determined, routes must be restricted to avoid
deadlocks. We discuss deadlock avoidance in Chapter 4. Next, physical links between switches
must be established and paths must be found for all traffic flows through the switches. Once
the size of the switches and their connectivity is determined, the design can be evaluated to
see if power consumption of the switches and hop count objectives have been met. Finally, a
floorplanner is used to determine the area and wire lengths of a synthesized design.
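To make the min-cut step concrete, here is a toy sketch (ours, with a made-up communication graph and exhaustive search, not the heuristic used in [47]): it finds a balanced 3-way partition that minimizes the bandwidth crossing partitions, so that high-bandwidth flows end up sharing a switch.

    from itertools import product

    EDGES = {("a", "b"): 100, ("b", "c"): 90, ("c", "d"): 5,
             ("d", "e"): 80, ("e", "f"): 75, ("f", "a"): 10}  # hypothetical BW demands
    NODES = sorted({n for edge in EDGES for n in edge})

    def crossing_weight(assign):
        return sum(w for (u, v), w in EDGES.items() if assign[u] != assign[v])

    def balanced(labels, parts=3):
        return max(labels.count(p) for p in range(parts)) <= -(-len(labels) // parts)

    best = min((dict(zip(NODES, labels))
                for labels in product(range(3), repeat=len(NODES)) if balanced(labels)),
               key=crossing_weight)
    print(best, crossing_weight(best))  # low-weight edges end up crossing partitions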
3.6 IMPLEMENTATION
In this section, we discuss the implementation of topologies on a chip, looking at both physical
layout implications, and the role of abstract metrics defined at the beginning of the chapter.
3.6.1 PLACE-AND-ROUTE
There are two components of the topology which require careful thought during the physical
design: links and routers.
The links are routed on semi-global or global metal layers, depending on the channel
widths and the distance they need to traverse. Wire capacitance tends to be an order of mag-
nitude higher than that of transistors, and can dominate the energy of the network if not opti-
mized well. The target clock frequency determines the size of and distance between repeaters5
that need to be inserted to meet timing. Thicker wires with larger spacing between wires can
be employed to lower the wire resistance and coupling capacitance, thus increasing speed and
energy efficiency. However, metal layer density rules and design rules checking (DRC) can limit
how much one can play around with these parameters. In terms of area, the core to core links
can be routed over active logic, mitigating any area overheads apart from those of the repeaters.
But care needs to be taken since active switching of transistors can introduce cross talk in the
wires. Similarly, routing toggling links over sensitive circuits such as SRAMs, which operate at
low voltages, could introduce glitches and errors; as a result, the area over caches is usually
blocked off from wiring. Hence, the floorplanning of the entire chip needs to carefully consider
where router links lie relative to processor cores, caches, memory controllers, etc.
When implementing routers, the node degree (i.e., the number of ports in and out of
the router) determines the overhead, since each port has associated buffering and state logic,
and requires a link to the next node. As a result, while rings have poorer network performance
(latency, throughput, energy and reliability) when compared to higher-dimensional networks,
they have lower implementation overhead as they have a node degree of two while a mesh or
torus has a node degree of four. Similarly, high-radix topologies such as the 4 × 4 flattened
butterfly discussed in Section 3.3 have lower latency and higher throughput than a mesh for
the same channel width, but the seven-ported routers add a higher area and energy footprint,
especially due to the larger crossbar switch whose area grows as a square of the number of ports.
The 2-D floorplan of the logical topology can also often lead to implementation overheads.
As an example, the torus from Figure 3.1 has to be physically arranged in a folded form to
equalize wire lengths (see Figure 3.10) instead of employing long wrap-around links between
edge nodes. As a result, wire lengths in a folded torus are twice that in a mesh of the same size, so
5 An inverter or a pair of inverters.
per-hop latency and energy are actually higher. Furthermore, a torus requires twice the number
of links which must be factored into the wiring budget. If the available wire tracks along the
bisection are fixed, a torus will be restricted to narrower links than a mesh, thus lowering per-link
bandwidth, and increasing transmission delay. From an architectural comparison on the other
hand, a torus has lower hop count (which leads to lower delay and energy) compared to a mesh.
These contrasting properties illustrate the importance of considering implementation details in
selecting between alternative topologies.
Similarly, trying to create an irregular topology optimized for an application’s commu-
nication graph could end up having many criss-crossing links. These would show up as wire
congestion during place-and-route, forcing the automated tools or the designer to route around
congested nets, adding delay and energy overheads.
CHAPTER 4
Routing
After determining the network topology, the routing algorithm is used to decide what path a
message will take through the network to reach its destination. The goal of the routing algo-
rithm is to distribute traffic evenly among the paths supplied by the network topology, so as
to avoid hotspots and minimize contention, thus improving network latency and throughput.
All of these performance goals must be achieved while adhering to tight constraints on imple-
mentation complexity: routing circuitry can stretch critical path delay and add to a router’s area
footprint. While energy overhead of routing circuitry is typically low, the specific route cho-
sen affects hop count directly, and thus substantially affects energy consumption. In addition,
the path diversity enabled by the routing algorithm is also useful for increasing resiliency in the
presence of network faults.
4.1 TYPES OF ROUTING ALGORITHMS
In this section, we briefly discuss various classes of routing algorithms. Routing algorithms are
generally divided into three classes: deterministic, oblivious and adaptive.
While numerous routing algorithms have been proposed, the most commonly used rout-
ing algorithm in on-chip networks is dimension-ordered routing (DOR) due to its simplicity.
Dimension-ordered routing is an example of a deterministic routing algorithm, in which all
messages from node A to B will always traverse the same path. With DOR, a message traverses
the network dimension-by-dimension, reaching the ordinate matching its destination
before switching to the next dimension. In a 2-D topology such as the mesh in Figure 4.1,
X-Y dimension-ordered routing sends packets along the X-dimension first, followed by the
Y-dimension. A packet travelling from (0,0) to (2,3) will first traverse 2 hops along the X-
dimension, arriving at (2,0), before traversing 3 hops along the Y-dimension to its destination.
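A minimal sketch of this routing function follows (ours); each router needs only the current and destination coordinates, never any congestion information.

    def xy_next_port(cur, dst):
        (cx, cy), (dx, dy) = cur, dst
        if cx != dx:                       # finish the X dimension first
            return "E" if dx > cx else "W"
        if cy != dy:                       # then route in the Y dimension
            return "N" if dy > cy else "S"
        return "EJECT"                     # arrived: deliver to the local node

    # From (0,0) to (2,3): two hops east, then three hops north.
    hop, path = (0, 0), []
    while (port := xy_next_port(hop, (2, 3))) != "EJECT":
        path.append(port)
        hop = {"E": (hop[0] + 1, hop[1]), "W": (hop[0] - 1, hop[1]),
               "N": (hop[0], hop[1] + 1), "S": (hop[0], hop[1] - 1)}[port]
    print(path)  # ['E', 'E', 'N', 'N', 'N']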
Another class of routing algorithms is oblivious routing, in which messages traverse different
paths from A to B, but the path is selected without regard to network congestion. For instance,
a router could randomly choose among alternative paths prior to sending a message. Figure 4.1
shows an example where messages from (0,0) to (2,3) can be randomly sent along either the
Y-X route or the X-Y route. Deterministic routing is a subset of oblivious routing.
A more sophisticated routing algorithm can be adaptive, in which the path a message
takes from A to B depends on the network traffic situation. For instance, a message can initially
follow the X-Y route and see congestion at (1,0)’s east outgoing link. Due to this congestion,
Figure 4.1: DOR illustrates an X-Y route from (0,0) to (2,3) in a mesh, while Oblivious shows
two alternative routes (X-Y and Y-X) between the same source-destination pair that can be chosen
obliviously prior to message transmission. Adaptive shows a possible adaptive route that branches
away from the X-Y route if congestion is encountered at (1,0).
the message will instead choose to take the north outgoing link toward the destination (see
Figure 4.1).
Routing algorithms can also be classified as minimal and non-minimal. Minimal routing
algorithms select only paths that require the smallest number of hops between the source and
the destination. Non-minimal routing algorithms allow paths to be selected that may increase
the number of hops between the source and destination. In the absence of congestion, non-
minimal routing increases latency and also power consumption as additional routers and links
are traversed by a message. With congestion, the selection of a non-minimal route that avoids
congested links may result in lower latency for packets.
Before we get into details on specific deterministic, oblivious, and adaptive routing algo-
rithms, we will discuss the potential for deadlock that can occur with a routing algorithm.
Figure 4.2: A classic network deadlock where four packets cannot make forward progress as they
are waiting for links that other packets are holding on to.
[Figure (diagram not recovered): Valiant’s routing algorithm routes from source s to destination
d through a randomly chosen intermediate node d'.]
Valiant’s routing algorithm and minimal oblivious routing are deadlock free when used
in conjunction with X-Y routing. An example of an oblivious routing algorithm that is not
deadlock free is one that randomly chooses between X-Y or Y-X routes. The oblivious algorithm
that randomly chooses between X-Y or Y-X routes is not deadlock-free because all four turns
from Figure 4.2 are possible leading to potential cycles in the link acquisition graph.
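A sketch of Valiant's two-phase routing under these rules (ours, using X-Y routing for both phases) is shown below.

    import random

    def xy_path(src, dst):
        """Hops of a dimension-ordered (X then Y) route."""
        (x, y), (dx, dy) = src, dst
        path = []
        while x != dx:
            x += 1 if dx > x else -1
            path.append((x, y))
        while y != dy:
            y += 1 if dy > y else -1
            path.append((x, y))
        return path

    def valiant_route(src, dst, k=3):
        """Phase 1: X-Y to a random intermediate node d'; phase 2: X-Y on to dst."""
        mid = (random.randrange(k), random.randrange(k))
        return xy_path(src, mid) + xy_path(mid, dst)

    print(valiant_route((0, 0), (2, 2)))  # non-minimal with high probability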
minimal paths could exploit a large degree of path diversity to provide load balancing and fault
tolerance.
Adaptive routing can be restricted to taking minimal routes between the source and the
destination. An alternative option is to employ misrouting, which allows a packet to be routed
in a non-productive direction, resulting in non-minimal paths. When misrouting is permitted,
livelock becomes a concern. Without mechanisms to guarantee forward progress, livelock can
occur as a packet is continuously misrouted so as to never reach its destination. We can combat
this problem by allowing a maximum number of misroutes per packet and giving higher priority
to packets that have been misrouted a large number of times. Misrouting increases the hop
count but may reduce end-to-end packet latency by avoiding congestion (queueing delay).
With a fully adaptive routing algorithm, deadlock can become a problem. For example,
the adaptive route shown in Figure 4.1 is a superset of oblivious routing and is subject to poten-
tial deadlock. Planar-adaptive routing [73] limits the resources needed to handle deadlock by
restricting adaptivity to only two dimensions at a time. Duato has proposed flow control tech-
niques that allow full routing adaptivity while ensuring freedom from deadlock [109]. Deadlock-
free flow control will be discussed in Chapter 5.
Another challenge with adaptive routing is preserving inter-message ordering as may be
needed by the coherence protocol. If messages must arrive at the destination in the same order
that the source issued them, adaptive routing can be problematic. Mechanisms to re-order mes-
sages at the destination can be employed or messages of a given class can be restricted in their
routing to prevent re-ordering.
ADAPTIVE TURN MODEL ROUTING
We introduced turn model routing earlier in Section 4.3 and discussed how dimension order
X-Y routing eliminates two out of four turns (Figure 4.3). Here, we explain how turn model
can be more broadly applied to derive deadlock-free adaptive routing algorithms. Adaptive turn
model routing eliminates the minimum set of turns needed to achieve deadlock freedom while
retaining some path diversity and potential for adaptivity.
With dimension order routing, only four of the eight turns available in a 2-D mesh are
permitted. Turn model routing [131] increases the flexibility of the algorithm by
allowing six out of eight turns. Only one turn from each cycle is eliminated.
In Figure 4.6, three possible routing algorithms are illustrated. Starting with all eight
possible turns, the North to West turn is eliminated; once this elimination is made, the three
routing algorithms shown in Figure 4.6 can be derived. In Figure 4.6a, the West-First
algorithm is shown; in addition to eliminating the North to West turn, the South to West turn is
eliminated. In other words, a message must first travel in the West direction before traveling in
any other direction. The North-Last algorithm (Figure 4.6b) eliminates both the North to West
and the North to East turns. Once a message has turned North, no further turns are permitted;
hence, the North turn must be made last. Finally, Figure 4.6c removes turns from North to West
and East to South to create the Negative-First algorithm. A message travels in the negative
directions (West and South) first before it is permitted to travel in positive directions (East and
North). All three of these turn model routing algorithms are deadlock-free. Figure 4.7 illustrates
a possible turn elimination that is invalid; the elimination of North to West combined with the
elimination of West to North can lead to deadlock. A deadlock cycle is depicted in Figure 4.7b
that can result from a set of messages using the turns specified in Figure 4.7a.
Figure 4.6: Turn model routing algorithms: (a) West-First turns; (b) North-Last turns; (c) Negative-First turns.
Odd-even turn model routing [74] proposes eliminating a set of two turns depending on
whether the current node is in an odd or even column. For example, when a packet is traversing
a node in an even column (a column whose dimension-0 coordinate is even), turns from East
to North and from North to West are prohibited. For packets traversing an odd column node,
turns from East to South and from South to West are prohibited. With this set of restrictions,
the odd-even turn model is deadlock free provided
Figure 4.8: Negative-First turn model routing applied to two source-destination pairs: (a) routes from (0,0) to (2,3); (b) the single route from (0,3) to (2,0).
180° turns are disallowed. The odd-even turn model provides better adaptivity than other turn
model algorithms such as West-First. With West-First, destinations to the West of the source
have no flexibility; with odd-even routing, there is flexibility depending on the allowable turns
for a given column.
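To make these restrictions concrete, the following sketch (our illustration, not from any particular design) encodes each algorithm as the set of turns it eliminates, where a turn is written as (current direction, new direction); a route is legal under a model if it never takes an eliminated turn.

# Each turn is (direction currently traveling, new direction).
ELIMINATED = {
    "west-first":     {("N", "W"), ("S", "W")},
    "north-last":     {("N", "W"), ("N", "E")},
    "negative-first": {("N", "W"), ("E", "S")},
}

def eliminated_odd_even(column, turn):
    """Odd-even turn model: the eliminated turns depend on column parity."""
    if column % 2 == 0:                      # even column
        return turn in {("E", "N"), ("N", "W")}
    return turn in {("E", "S"), ("S", "W")}  # odd column

def route_is_legal(model, directions):
    """Check every turn taken along a route (a list of hop directions)."""
    turns = [t for t in zip(directions, directions[1:]) if t[0] != t[1]]
    return all(t not in ELIMINATED[model] for t in turns)

# An X-Y route (E, E, N, N) only takes the E-to-N turn, which all
# three models permit:
assert route_is_legal("west-first", ["E", "E", "N", "N"])
# E-to-N is prohibited in even columns under odd-even, allowed in odd ones:
assert eliminated_odd_even(0, ("E", "N")) and not eliminated_odd_even(1, ("E", "N"))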
In Figure 4.8, we apply the Negative-First turn model routing to two different source
destination pairs. In Figure 4.8a, three possible routes are shown between (0,0) and (2,3) (more
are possible); turns from North to East and from East to North are permitted allowing for
significant flexibility. However, in Figure 4.8b, there is only one path allowed by the algorithm
to route from (0,3) to (2,0). The routing algorithm does not allow the message to turn from East
to South. Negative routes must be completed first, resulting in no path diversity for this source-destination pair. As illustrated by this example, turn model routing provides more flexibility and
adaptivity than dimension-order routing but it is still somewhat restrictive.
stored routes may have different lengths. Additionally, by choosing the entire route at the source
node, source-based routing is unable to take advantage of dynamic network conditions to avoid
congestion. However, as mentioned, multiple routes can be stored in the table and selected either
randomly or with a given probability to improve the load distribution in the network.
Table: Example source routing table for a 3 × 3 mesh. Rows give the source node, columns the destination, and each entry the sequence of directions stored for the route (X marks the local node).

From\To   00   01   02   10   11   12   20   21   22
00        X-   N-   N-   E-   EN   EN   E-   NE   NE
01        S-   X-   N-   ES   E-   EN   ES   E-   EN
02        S-   S-   X-   ES   ES   E-   ES   ES   E-
10        W-   W-   W-   X-   N-   N-   E-   EN   EN
11        W-   W-   W-   S-   X-   N-   ES   E-   NE
12        W-   W-   W-   S-   S-   X-   ES   ES   E-
20        W-   W-   W-   W-   W-   W-   X-   N-   N-
21        W-   W-   W-   W-   W-   W-   S-   X-   N-
22        W-   W-   W-   W-   W-   W-   S-   S-   X-
Node-based routing tables can also be programmable. By allowing the routing tables to be changed,
the routing algorithm is better able to tolerate faults in the network.
The most significant downside to node routing tables is the increase in packet delay. Source
routing requires a single look-up to acquire the entire routing path for a packet. With node-based
routing, the latency of a look-up must be expended at each hop in the network.
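A small sketch (ours; node names, table contents, and helper functions are hypothetical) contrasts the two schemes: node-table routing pays a table lookup at every hop, whereas source routing performs a single lookup at injection and carries the resulting port sequence in the header.

# Hypothetical per-node tables for routing from node "00" to node "22"
# in a 3 x 3 mesh (node "xy": x = column, y = row).
node_table = {
    "00": {"22": "E"}, "10": {"22": "E"},
    "20": {"22": "N"}, "21": {"22": "N"},
}

def next_node(node, port):
    """Apply one hop in the mesh."""
    x, y = int(node[0]), int(node[1])
    dx, dy = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}[port]
    return f"{x + dx}{y + dy}"

def node_route(src, dst):
    """Node-table routing: one table lookup per hop."""
    node, ports = src, []
    while node != dst:
        port = node_table[node][dst]  # lookup latency paid at every router
        ports.append(port)
        node = next_node(node, port)
    return ports

# Source routing instead looks up the whole port sequence once at injection.
source_table = {("00", "22"): ["E", "E", "N", "N"]}
assert node_route("00", "22") == source_table[("00", "22")]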
Figure: Adaptive route computation logic: the remaining offsets to the destination (sx − x, sy − y) are compared against zero to form a productive direction vector over {+x, +y, -x, -y} (plus exit), which route selection logic combines with the downstream queue lengths to produce the selected direction vector.
As will be discussed in Chapter 6, this routing computation can add a pipeline stage to the
router traversal.
CHAPTER 5
Flow Control
Flow control governs the allocation of network buffers and links. It determines when buffers
and links are assigned to messages, the granularity at which they are allocated, and how these
resources are shared among the many messages using the network. A good flow control proto-
col lowers the latency experienced by messages at low loads by not imposing high overhead in
resource allocation, and drives up network throughput by enabling effective sharing of buffers
and links across messages. In determining the rate at which packets access buffers (or skip buffer
access altogether) and traverse links, flow control is instrumental in determining network energy
and power consumption. The implementation complexity of a flow control protocol includes the
complexity of the router microarchitecture as well as the wiring overhead required for commu-
nicating resource information between routers.
Figure 5.1: Composition of message, packets, flits: the head flit carries the route, packet type, VC ID, and address fields, followed by payload flits carrying bytes 0–15, 16–31, 32–47, and 48–63 of the cache line. Assuming 16-byte wide flits and 64-byte cache lines, a cache line packet will be composed of 5 flits and a coherence command will be a single-flit packet. The sequence number (Seq#) is used to match incoming replies with outstanding requests, or to ensure ordering and detect lost packets.
As shown in Figure 5.1, many messages will in fact be single-flit packets. For example, a coherence
command need only carry the command and the memory address, which can fit in a 16-byte wide flit.
Flow control techniques are classified by the granularity at which resource allocation oc-
curs. We will discuss techniques that operate on message, packet and flit granularities in the next
sections with a table summarizing the granularity of each technique at the end.
We start with circuit-switching, a technique that operates at the message level, which is the
coarsest granularity, and then refine these techniques to finer granularities.
Figure 5.2: Circuit-switching example from Core 0 to Core 8, with Core 2 being stalled. S: Setup
flit, A: Acknowledgement flit, D: Data message, T: Tail (deallocation) flit. Each D represents a
message; multiple messages can be sent on a single circuit before it is deallocated. In cycles 12 and 16,
the source node has no data to send.
Figure 5.3: Store-and-forward flow control example: the 5-flit packet (H B B B T) must be received in its entirety at each node (0, 1, 2, 5, 8) before being forwarded to the next.
current router. Latency experienced by a packet is thus drastically reduced over store-and-forward
flow control, as shown in Figure 5.4a. In Figure 5.3, 25 cycles are required to transmit the entire
packet; with virtual cut-through, this delay is reduced to 9 cycles. However, bandwidth and
storage are still allocated in packet-sized units. Packets still move forward only if there is enough
storage at the next downstream router to hold the entire packet. On-chip networks with tight
area and power constraints may find it difficult to accommodate the large buffers needed to
support virtual cut-through when packet sizes are large (such as 64- or 128-byte cache lines).
In Figure 5.4b, the entire packet is delayed when traveling from node 2 to node 5 even
though node 5 has buffers available for 2 out of 5 flits. No flits can proceed until all 5 flit buffers
are available.
Figure 5.4: Virtual cut-through example: (a) VCT with no delay; (b) VCT with delay, where contention holds the entire packet at node 2 until node 5 can buffer all five flits.
Figure 5.5: Wormhole flow control example: contention at node 2 stalls the flits in place, but buffers and links are allocated per flit rather than per packet.
Wormhole flow control reduces packet latency by allowing a flit to leave the router as
soon as a downstream buffer is available (in the absence of contention, the latency is the same
as virtual cut-through). Additionally, wormhole flow control can be implemented with fewer
buffers than packet-based techniques. Due to the tight area and power constraints of on-chip
networks, wormhole flow control is the predominant technique adopted thus far.
5.5 VIRTUAL CHANNELS
Virtual channels have been described as the “swiss-army knife” of interconnection networks [86].
They were first proposed as a solution for deadlock avoidance [87], but have also been applied to
mitigate head-of-line blocking in flow control, thus extending throughput. Head-of-line blocking
occurs in all the above flow control techniques where there is a single queue at each input;
when a packet at the head of the queue is blocked, it stalls subsequent packets lined up
behind it, even when there are available resources for the stalled packets.
Essentially, a virtual channel (VC) is a separate queue in the router; multiple VCs
share the physical wires (the physical link) between two routers. By associating multiple separate
queues with each input port, head-of-line blocking can be reduced. Virtual channels arbitrate
for physical link bandwidth on a cycle-by-cycle basis. When a packet holding a virtual channel
becomes blocked, other packets can still traverse the physical link through other virtual channels.
Thus, VCs increase the utilization of the physical links and extend overall network throughput.
Technically, VCs can be applied to all the above flow control techniques to alleviate head-
of-line blocking, though Dally first proposed them with wormhole flow control [87]. For in-
stance, circuit switching can be applied to virtual channels rather than the physical channel, so
a message reserves a series of VCs rather than physical links, and the VCs are time-multiplexed
onto the physical link cycle-by-cycle; this is also called virtual circuit switching [128]. Store-and-
forward flow control can also be used with VCs, with multiple packet buffer queues, one per VC,
and VCs multiplexed on the link packet-by-packet. Virtual cut-through flow control with VCs works
similarly, except that VCs are multiplexed on the link flit-by-flit. However, as on-chip network
designs overwhelmingly adopt wormhole flow control for its small area and power footprint,
and use virtual channels to extend the bandwidth where needed, for the rest of this book, when
we mention virtual channel flow control, we assume that it is applied to wormhole, with both
buffers and links managed and multiplexed at the granularity of flits.
A walk-through example illustrating the operation of virtual channel flow control is de-
picted in Figure 5.6. Packet A initially occupies VC 0 and is destined for Node 4, while Packet B
initially occupies VC 1 and is destined for Node 2. At time 0, Packet A and Packet B both have
flits waiting in the west input virtual channels of Node 0. Both A and B want to travel outbound
on the east output physical channel. The head flit of Packet A is allocated virtual channel 0 for
the west input of router 1 and wins switch allocation (techniques to handle this allocation are
discussed in Chapter 6). The head flit of packet A travels to router 1 at time 1. At time 2, the
head flit of packet B is granted switch allocation and travels to router 1 and is stored in virtual
channel 1. Also at time 2, the head flit of A fails to receive a virtual channel for router 4 (its next
hop); both virtual channels are occupied by flits of other packets. The first body flit of A inherits
virtual channel 0 and travels to router 1 at time 3. Also at time 3, the head flit of B is able to
allocate virtual channel 0 at router 2 and continues on. At time 4, the first body flit of packet B
inherits virtual channel 1 from the head flit and wins switch allocation to continue to router 1.
By time 7, all of the flits of B have arrived at router 2, the head and body flits have continued
on and the tail flit remains to be routed. The head flit of packet A is still blocked waiting for a
free virtual channel to travel to router 4.
Figure 5.6: Virtual channel flow control walk-through example. Two packets A and B are broken into 4 flits each (H: head, B: Body, T: Tail).
With wormhole flow control using a single virtual channel, packet B would be blocked
behind packet A at router 1 and would not be able to continue to router 2 despite the availability
of buffers, links and the switch to do so. Virtual channels allow packet B to proceed toward its
destination despite the blocking of packet A. Virtual channels are allocated once at each router
to the head flit and the remainder of flits inherit that virtual channel. With virtual-channel flow
control, flits of different packets can be interleaved on the same physical channel, as seen in the
example between time 0 and 2.
Virtual channels are also widely used to break deadlocks, both within the network (see
Section 5.6), and for handling system-level or protocol-level deadlocks (see Section 2.1.3).
The previous sections have explained how different techniques handle resource allocation
and utilization. These techniques are summarized in Table 5.1.
Figure 5.7: Two virtual channels with separate buffer queues, denoted with white and grey circles at each router, are used to break the cyclic route deadlock in Figure 4.2; packets switch VCs upon crossing the dateline.
The same idea works across various oblivious/adaptive routing algorithms that allow all
turns and are thus deadlock-prone. A routing algorithm that randomly chooses between X-Y
and Y-X routes can be made deadlock-free by enforcing all X-Y packets to use VC 0 and all
Y-X packets to use VC 1. Similarly, routing algorithms that wish to allow all turns for path
diversity can be made deadlock-free by implementing a certain turn model in VC 0 and another
turn model in VC 1, and not allowing packets in one VC to jump to the other throughout the
traversal.
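As a sketch of this idea (ours, with hypothetical names), the VC class is fixed by the routing decision made at the source and never changes mid-route, keeping each class's channel dependence graph acyclic:

import random

def assign_vc_class(route_type):
    """Hypothetical two-VC router: X-Y packets are confined to VC 0
    and Y-X packets to VC 1 for their entire traversal."""
    return {"XY": 0, "YX": 1}[route_type]

route_type = random.choice(["XY", "YX"])  # chosen randomly at the source
vc = assign_vc_class(route_type)          # fixed until the packet drains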
At the system level, messages that can potentially block each other can be assigned to
different message classes that are mapped to different virtual channels within the network, such
as request and acknowledgment messages of coherence protocols. These designs scale to mul-
tiple VCs by dividing all available VCs into multiple classes, and enforcing the ordering rules
described above across these classes. Within each class, flits can acquire any VC. Implementa-
tion complexity of virtual channel routers will be discussed in detail next in Chapter 6 on router
microarchitecture.
5.6.2 ESCAPE VCS
The previous section discussed the benefits of enforcing ordering between VCs to prevent dead-
locks. However, enforcing an order on VCs lowers their utilization, affecting network through-
put when the number of VCs is small. In Figure 5.7, all packets are initially assigned VC 0 and
remain on VC 0 until they cross the dateline. As a result, VC 1 is underutilized. Duato proposed
escape VCs to address this [108]. Duato proved that the requirement of an acyclic channel
dependence graph (CDG) is a sufficient condition for a deadlock-free routing algorithm but not
a necessary one; even if the CDG is cyclic, as long as there is an acyclic sub-part of the CDG,
it can be used to escape out of the cyclic dependency. This acyclic connected sub-part of the
CDG defines an escape virtual channel. Hence, rather than enforcing a fixed order/priority
between all VCs, it suffices that there is a single escape VC that is deadlock-free; all other
VCs can then use fully adaptive routing with no routing restrictions. This escape VC is typically
made deadlock-free by using a deadlock-free routing function within it. For instance, if VC 0 is
designated as the escape channel, all traffic on VC 0 must be routed using dimension-ordered
routing, while all other VCs can be routed with arbitrary routing functions. Explained simply,
so long as access to VCs is arbitrated fairly, a packet always has a chance of landing on the
escape VC, and thus of escaping a deadlock. Escape VCs help increase the utilization of VCs,
or permit a higher throughput with a smaller number of VCs, making for leaner routers.
In Figure 5.8a, we illustrate once again how unrestricted routing with a single virtual
channel can lead to deadlock. Each packet is trying to acquire resources to make a clockwise
turn. Figure 5.8b utilizes two virtual channels. Virtual channel 1 serves as an escape virtual
channel. For example, Packet A could be allocated virtual channel 1 (and thus dimension order
routed to its destination). By allocating virtual channel 1 at router 4 for packet A, all packets
can make forward progress. The flits of packet A will eventually drain from VC 0 at router 1,
allowing packet B to be allocated either virtual channel at router 1. Once the flits of packet B
have drained, packet D can continue on virtual channel 0 or be allocated to virtual channel 1
and make progress before packet B has drained. The same goes for the flits of packet C.
Figure 5.8: Escape virtual channel example: (a) deadlock without escape VCs; (b) escape VCs break the deadlock cycle. Virtual channel 1 serves as an escape virtual channel that is dimension-order XY routed.
Bubble flow control restricts injection into the ring to make sure a closed cyclic dependency is
not created. A packet can only be injected if there is empty buffer space in the ring to accommodate
two packets. Requiring empty buffer space for two packets guarantees that if the packet is injected,
there will still be one empty packet buffer in the ring. This empty buffer, referred to as a bubble,
ensures that at least one packet in the ring will be able to make forward progress, thus preventing
the cycle from closing. Figure 5.9 shows an example where R1 has two empty bubbles, which will
allow Packet P1 to be injected. The remaining routers only have one free bubble each, preventing
the injection of Packets P0 and P2. The same rule applies to packets changing dimensions, which
is considered injection into a new dimension.
Figure 5.9: Bubble flow control example: R1 has two free packet-sized bubbles, so P1 may be injected; the other routers have only one free bubble each, so P0 and P2 must wait.
Due to the complexity associated with searching all buffers in a ring, bubble flow control
requires that there be two empty packet buffers in the local queue in order for a packet to be
injected [298]. This increases the minimum buffer size requirements, which can be undesirable
for maintaining a low area and power footprint in on-chip networks. Recent work has explored
adapting bubble flow control to wormhole switching to reduce the buffering requirements and
make it more compatible with on-chip networks [68, 147, 237, 355].
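The injection check itself is simple; a sketch (ours), assuming buffers are counted in packet-sized units at the local queue:

def can_inject(free_packet_buffers):
    """Bubble flow control: inject only if two packet-sized buffers are
    free, so at least one bubble remains in the ring after injection."""
    return free_packet_buffers >= 2

# Matching Figure 5.9: R1 (two free bubbles) may inject P1, while routers
# with a single free bubble must hold P0 and P2.
assert can_inject(2) and not can_inject(1)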
5.8 IMPLEMENTATION
The implementation complexity of a flow control protocol essentially involves the complexity of
the entire router microarchitecture and the wiring overhead imposed in communicating resource
information between routers. Here, we focus on the latter, as Chapter 6 elaborates on router
microarchitectures and associated implementation issues.
When choosing a specific buffer backpressure mechanism, we need to consider its perfor-
mance in terms of buffer turnaround time, and its overhead in terms of the number of reverse
signaling wires.
5.8.1 BUFFER SIZING FOR TURNAROUND TIME
Buffer turnaround time is the minimum idle time between when successive flits can reuse a
buffer. A long buffer turnaround time leads to inefficient reuse of buffers, which results in poor
network throughput. If the number of buffers implemented does not cover the buffer turnaround
time, then the network will be artificially throttled at each router, since flits will not be able to
flow continuously to the next router even when there is no contention from other ports of the
router. As shown in Figure 5.10, the link between two routers is idle for 6 cycles while waiting
for a free buffer at the downstream router.
Figure 5.10: Throttling due to too few buffers: the upstream credit count repeatedly drops to zero and the link idles while credits return. Flit pipeline stages are discussed in Chapter 6. C: Credit send. C-LT: Credit link traversal. C-Up: Credit update.
For credit-based buffer backpressure, a buffer is held from the time a flit departs the cur-
rent node (when the credit counter is decremented), to the time the credit is returned to inform
the current node that the buffer has been released (so the credit counter can be incremented
again). Only then can the buffer be allocated to the next flit, although it is not actually reused
until the flit traverses the current router pipeline and is transmitted to the downstream router.
Hence, the turnaround time of a buffer is at least the sum of the propagation delay of a data flit
to the next node, the credit delay back, and the pipeline delay, as is shown in Figure 5.11a.
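A minimal sketch (ours) of the upstream side of credit-based backpressure for a single queue, ignoring pipeline details:

class CreditTracker:
    """Upstream router's view of one downstream buffer queue."""
    def __init__(self, downstream_buffers):
        self.credits = downstream_buffers   # one credit per flit buffer

    def try_send(self, flit, link):
        if self.credits == 0:
            return False        # downstream may be full: stall
        self.credits -= 1       # decremented when the flit departs
        link.append(flit)
        return True

    def credit_returned(self):
        self.credits += 1       # downstream freed a buffer

link, tx = [], CreditTracker(downstream_buffers=2)
assert tx.try_send("H", link) and tx.try_send("B", link)
assert not tx.try_send("T", link)   # must wait for a credit to return
tx.credit_returned()
assert tx.try_send("T", link)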
In comparison, in on/off buffer backpressure, a buffer is held from the time a flit arrives
at the next node and occupies the last buffer (above the threshold), triggering the off signal to
be sent to stop the current node from sending. This persists until a flit leaves the next node and
frees up a buffer (causing the free buffer count to go over the threshold). Consequently, the on
signal is asserted, informing the current node that it can now resume sending flits. This buffer is
occupied again when the data flit arrives at the next node. Here, the buffer turnaround time is
thus at least twice the on/off signal propagation delay plus the propagation delay of a data flit,
and the pipeline delay, as shown in Figure 5.11b.
Figure 5.11: Buffer turnaround time for (a) credit-based and (b) on/off backpressure, composed of the flit pipeline delay (actual buffer usage) plus the propagation and pipeline delays of the credit or on/off signals and of the returning data flit; with on/off signaling, node 0 stops sending once it receives and processes the off signal.
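For comparison, the downstream half of on/off backpressure reduces to a threshold check; in this sketch (ours), the threshold must cover the flits that can still arrive while the off signal propagates and is processed:

def on_off_signal(free_buffers, in_flight_window):
    """Assert 'off' while flits already in flight could consume the
    remaining free buffers; assert 'on' once there is ample slack."""
    return "off" if free_buffers <= in_flight_window else "on"

# With a signaling round trip that can have, say, 3 flits in flight
# (hypothetical), 'off' must be raised while 3 or fewer buffers are free.
assert on_off_signal(free_buffers=3, in_flight_window=3) == "off"
assert on_off_signal(free_buffers=4, in_flight_window=3) == "on"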
A custom network for an MPSoC may result in a heterogeneous set of switches; these
switches may differ in terms of number of ports, number of virtual channels, and number of
buffers [47]. For the flow control implementation, different numbers of buffers may be instan-
tiated at each node depending on the communication characteristics [164]. Buffering resources
will impact the thresholds of on/off flow control or the reverse signaling wires required by credit-
based flow control. Additionally, non-uniform link lengths in a customized topology will impact
the buffer turnaround time of the flow control implementation. Regularity and modularity are
sacrificed in this type of environment; however, the power, performance, and area gains can be
significant.
CHAPTER 6
Router Microarchitecture
Routers must be designed to meet latency and throughput requirements amid tight area and
power constraints; this is a primary challenge designers are facing as many-core systems scale.
Router complexity increases with bandwidth demands; very simple routers (unpipelined, worm-
hole, no VCs, limited buffering) with low area and power overheads can be built when high
throughput is not needed. Challenges arise when the latency and throughput demands on on-
chip networks are raised.
A router’s microarchitecture determines its critical path delay which affects per-hop delay
and overall network latency. The implementation of the routing, flow control, and the actual
router pipeline affect the efficiency at which buffers and links are used which governs over-
all network throughput. Router microarchitecture also impacts network energy—both dynamic
and leakage—as it determines the circuit components in a router and their activity. Finally, the
microarchitecture and underlying circuits directly contribute to the area footprint of the network.
Figure 6.1: Baseline virtual channel router microarchitecture: each input port (1–5) holds input buffers organized into VCs 1–4, feeding a crossbar switch that connects to output ports 1–5, under the control of route computation, VC allocator, and switch allocator units.
Buffers are used to house packets or flits when they cannot be forwarded right away onto output
links. Flits can be buffered on the input ports and on the output ports. Output buffering occurs
when the allocation rate of the switch is greater than the rate of the channel. Crossbar speedup
(discussed in Section 6.3.2) requires output buffering since multiple flits can be allocated to a
single output channel in the same cycle.
Most on-chip network routers have buffering at input ports, as input
buffer organization permits area- and power-efficient single-ported memories. We will, therefore,
focus our discussion on input-buffered routers here, dissecting how such buffering is organized
within each input port.
Figure 6.2: Input buffer organization: multiple virtual channels (VC 0, VC 1), each a queue with head and tail pointers, share a single physical channel.
// Excerpt of a bit-sliced 5 x 5 crossbar: one bitxbar instance per bit of the
// phit; the enclosing module's header and remaining bit slices are elided.
bitxbar bx1(in0[1],in1[1],in2[1],in3[1],in4[1],out0[1],out1[1],out2[1],out3[1],
out4[1],colsel0reg,colsel1reg,colsel2reg,colsel3reg,colsel4reg,1'bx);
bitxbar bx2(in0[2],in1[2],in2[2],in3[2],in4[2],out0[2],out1[2],out2[2],out3[2],
out4[2],colsel0reg,colsel1reg,colsel2reg,colsel3reg,colsel4reg,1'bx);
bitxbar bx3(in0[3],in1[3],in2[3],in3[3],in4[3],out0[3],out1[3],out2[3],out3[3],
out4[3],colsel0reg,colsel1reg,colsel2reg,colsel3reg,colsel4reg,1'bx);
endmodule

module bitxbar(i0,i1,i2,i3,i4,o0,o1,o2,o3,o4,sel0,sel1,sel2,sel3,sel4,inv);
input i0,i1,i2,i3,i4;
output o0,o1,o2,o3,o4;
input [2:0] sel0, sel1, sel2, sel3, sel4;
input inv;
// One-bit 5 x 5 crossbar slice: each output is driven by the input selected
// by its 3-bit column select (mux logic elided in the original excerpt).
endmodule
Figure 6.4: A 5 × 5 crosspoint crossbar switch connecting the five input ports (inject, N, S, E, W) to the five output ports (eject, N, S, E, W). Each horizontal and vertical line is w bits wide (1 phit width). The bold lines show a connection activated from the south input port to the east output port.
Figure 6.5: Crossbars with different speedups for a 5-port router. (a) No crossbar speedup, (b) cross-
bar with input speedup of 2, (c) crossbar with output speedup of 2, and (d) crossbar with input and
output speedup of 2.
Crossbar speedup can also be achieved by clocking the crossbar at a higher frequency than
the rest of the router. For instance, if the crossbar is clocked at twice the router frequency, it can
then send two flits each cycle between a single pair of input-output ports, achieving the same
performance as a crossbar with input and output speedup of 2. This is less likely in on-chip
networks where a router tends to run off a single clock supply that is already aggressive.
Figure 6.8: Matrix arbiter. The boxes w_ij represent priority bits. When bit w_ij is set, request i has a higher priority than request j.
Figure 6.9: Matrix arbiter priority update for the request stream from Figure 6.7.
arbiters of size N/k:1 to generate M grants, where k can be some parameter specific to the design.
Figure 6.10 shows an example; here a 3:4 separable allocator (an allocator matching 3 requests to
4 resources) is composed of arbiters. For instance, consider a separable switch allocator for a router
with four ports and three input VCs per input port. During the first stage of the allocator (comprised
of four 3:1 arbiters), each arbiter corresponds to an input port and chooses one of the three input
VCs as a winner. The winning VCs from the first stage then arbitrate for an output port in the
second stage (comprising four 4:1 arbiters, one per output port). Each arbiter chooses one of the
requesting input VCs as the winner for its output port. Different arbiters have been used in practice,
with round-robin arbiters being the most popular due to their simplicity.
Figure 6.10: A separable 3:4 allocator (3 requestors, 4 resources) which consists of four 3:1 arbiters
in the first stage and three 4:1 arbiters in the second. The 3:1 arbiters in the first stage decide which
of the 3 requestors wins a specific resource, while the 4:1 arbiters in the second stage ensure a requestor
is granted just 1 of the 4 resources.
Figure 6.11 shows one potential outcome from a separable allocator. Figure 6.11a shows
the request matrix. Each of the 3:1 arbiters selects one value from each row of the matrix; these
first-stage results of the allocator are shown in the matrix in Figure 6.11b. The second set of
4:1 arbiters then arbitrates among the requests set in the intermediate matrix. The final result
(Figure 6.11c) shows that only one of the initial requests was granted. Depending on the arbiters
used and their initial states, more allocations could result.
Figure 6.11: Separable allocation example: (a) request matrix, (b) intermediate matrix after the first stage of arbitration, and (c) final grant matrix.
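The two stages can be sketched in a few lines (our illustration, using fixed-priority arbiters in place of the round-robin arbiters a real design would prefer):

def arbiter(requests):
    """Fixed-priority arbiter: grant the lowest-indexed active request."""
    for i, active in enumerate(requests):
        if active:
            return i
    return None

def separable_allocate(request):
    """request[i][o] is True if requestor i wants resource o.
    Stage 1 picks one resource request per requestor; stage 2 then
    picks one winning requestor per resource."""
    stage1 = [arbiter(row) for row in request]
    grants = {}
    for o in range(len(request[0])):
        winner = arbiter([stage1[i] == o for i in range(len(request))])
        if winner is not None:
            grants[winner] = o
    return grants

# Both requestors ask for resource 0 first, so only one grant survives,
# even though requestor 1 could have used resource 1:
print(separable_allocate([[True, False], [True, True]]))  # {0: 0}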
Figure 6.12: A 4 × 4 wavefront allocator. Diagonal priority groups are connected with bold lines. Connections for passing tokens are shown with grey lines.
6.5 PIPELINE
Figure 6.15a shows the logical pipeline stages for a basic virtual channel router, with all the
components discussed so far. Like the logical pipeline stages of a typical processor (instruction
fetch, decode, execute, memory, and writeback), these are logical stages that will map onto a
physical pipeline depending on the actual clock frequency.
A head flit, upon arriving at an input port, is first decoded and buffered according to
its input VC in the buffer write (BW) pipeline stage. Next, the routing logic performs route
computation (RC) to determine the output port for the packet. The header then arbitrates for
a VC corresponding to its output port (i.e., the VC at the next router's input port) in the VC
allocation (VA) stage. Upon successful allocation of a VC, the header flit proceeds to the switch
allocation (SA) stage where it arbitrates for the switch input and output ports. On winning the
output port, the flit is then read from the buffer and proceeds to the switch traversal (ST) stage,
where it traverses the crossbar. Finally, the flit is passed to the next node in the link traversal
(LT) stage. Body and tail flits follow a similar pipeline except that they do not go through RC
and VA stages, instead inheriting the route and the VC allocated by the head flit. The tail flit,
on leaving the router, deallocates the VC reserved by the head flit.
A wormhole router with no VCs does away with the VA stage, requiring just four logical
stages. In Figure 6.1, such a router will not require a VC allocator, and will have only a single
deep buffer queue in each input port.
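The stage sequencing can be summarized in a short sketch (ours): head flits visit every logical stage, while body and tail flits inherit the route and VC and skip RC and VA.

HEAD_STAGES = ["BW", "RC", "VA", "SA", "ST", "LT"]
BODY_STAGES = ["BW", "SA", "ST", "LT"]   # route and VC inherited from head

def stages_for(flit_type):
    """Logical pipeline stages traversed by each flit type."""
    return HEAD_STAGES if flit_type == "head" else BODY_STAGES

for flit in ["head", "body", "body", "tail"]:
    print(flit, "->", " / ".join(stages_for(flit)))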
Figure 6.15: Router pipeline [BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal]: (a) traditional 5-stage pipeline (body/tail flits skip RC and VA, leaving bubbles); (b) lookahead routing pipeline; (c) low-load bypass pipeline; (d) speculative VC allocation or VC select pipeline; (e) lookahead pipeline, in which the header travels ahead as a lookahead (LA-LT) performing RC, VA, and SA while the payload performs only ST and LT.
If the physical pipeline has five stages just like the logical stages, then the stage with
the longest critical path delay will set the clock frequency. Typically, this is the VC or switch
allocation stage when the number of VCs is high, or the crossbar traversal stage with very wide,
highly ported crossbars. The clock frequency can also be determined by the overall system clock,
for instance set by the processor pipeline's critical path instead.
Increasing the number of physical pipeline stages increases the per-hop router delay for
each message, as well as the buffer turnaround time which affects the minimum buffering needed
and affects throughput. Thus, pipeline optimizations have been proposed and employed to re-
duce the number of stages. Common optimizations targeting logical pipeline stages are explained
next. State-of-the-art router implementations can perform all actions within a single cycle.
Figure 6.17: Lookahead bypass example: (a) low-load bypassing of A; in subsequent cycles, Lookahead_B wins and B bypasses while A gets buffered.
B's lookahead wins, so flit B traverses the switch in Cycle 2. Since A's lookahead loses, flit A gets
buffered; A performs switch allocation and VC selection in Cycle 2 and performs switch traversal
in Cycle 3.
State-of-the-art networks at modern technology nodes can be designed to spend a single
cycle on switch arbitration and VC selection in the router, and the subsequent cycle traversing
both the switch and link [160, 287], while operating at GHz frequencies. This enables a
two-cycle per-hop traversal (in the absence of contention).
Figure 6.18: (a) Router power (mW) at low load and at saturation, split into dynamic and static components contributed by the input VCs, crossbar, links, switch allocator, and other logic; (b) router area (mm²) breakdown across the input VCs, crossbar, switch allocator, and other logic.
6.6 LOW-POWER MICROARCHITECTURE
In this section, we discuss the techniques used across on-chip networks to reduce power
consumption. We refer readers to the Synthesis Lectures on Computer Architecture Techniques for
Power Efficiency [179] for a more detailed description of low-power techniques used in cores and
caches.
(1) For multiple voltage-frequency islands, bi-synchronous FIFOs have to be used at the in-
terfaces of every pair of different voltage-frequency islands, incurring excess delays.
(2) Most existing proposals assume the use of multiple supply lines for accessing different
voltages. However, use of multiple voltage rails requires multiple voltage converters out-
side the chip along with the area overhead for multiple power distribution networks. The
introduction of high bandwidth integrated voltage regulators can alleviate this problem by
allowing fast (sub 50 ns) voltage transitions.
As the on-chip network associated with a tile/core not only serves the flits injected from
that core, but also serves flits from different cores, the DVFS policy of the on-chip network
fabric has to be dealt with differently than for the cores. The existing literature on DVFS policies
for on-chip networks focuses on using network statistics such as average queue utilization and
average return time of memory requests to decide the new voltage-frequency (V-F) states
of the routers. Typically, a DVFS controller performs the following tasks: monitor
a suitable network parameter, compute feedback values based on previous states and the target
value, and update the V-F state. Some recent papers on DVFS for on-chip networks are discussed
later in the bibliography of this chapter.
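The monitor-compute-update loop can be sketched as follows (ours; the monitored parameter, thresholds, and V-F levels are all hypothetical):

def dvfs_step(avg_queue_util, current_vf, vf_levels, target=0.5):
    """One control epoch: compare the monitored parameter against the
    target and step the router's V-F state up or down one level."""
    i = vf_levels.index(current_vf)
    if avg_queue_util > target and i + 1 < len(vf_levels):
        return vf_levels[i + 1]    # congested: raise voltage/frequency
    if avg_queue_util < target / 2 and i > 0:
        return vf_levels[i - 1]    # lightly loaded: lower V-F to save power
    return current_vf

vf_levels = [(0.8, 1.0), (0.9, 1.5), (1.1, 2.0)]   # (V, GHz), hypothetical
print(dvfs_step(0.7, vf_levels[0], vf_levels))      # -> (0.9, 1.5)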
Power-Efficient Designs. The second class of techniques tries to reduce power consumption
by reducing capacitance or switching activity.
The dynamic power of on-chip networks can be reduced by reducing the effective capac-
itance being switched. Wires dominate network power since wire capacitance is much larger
than gate capacitance. Energy-efficient signaling in the form of low-swing [287] and equal-
ized links [314] has been studied in this regard. Router power can also be reduced by reducing
the number of pipeline stages, and optimizing the buffers, crossbar, and arbiter circuits/micro-
architecture. For instance, SRAMs are more energy-efficient than flip flops and register files
for implementing buffers, while matrix-style crossbars are often more efficient than mux-based
crossbars. Crossbars can be further segmented [351] or designed with low-swing links [287] to
reduce power consumption during traversals. Complex arbiters can be split into multiple simpler
arbiters [189, 291] to reduce power consumption further.
Lowering the switching activity is another technique to reduce dynamic power. Clock
gating is a popular method to eliminate the switching activity of latches in inactive circuits.
For instance, the dynamic power at low loads in Figure 6.18a is primarily due to the
clock, and not actual traffic, providing an opportunity to reduce power. Efficient encoding of
the bits being sent from one router to the other could also be exploited to reduce the number of
bit-toggles, and thereby dynamic power.
(a) Router Layout from Kumar et al. [208]. BF: Buffer, BFC: Buffer Control,
VA: VC Allocator, SA: Switch Allocator. P0: North port; P1: East; P2: West;
P3: South; P4: Injection/Ejection.
(b) Router Layout from Balfour and Dally [38]. M5 and M6 indicate the
metal layers used.
To target allocator delay, Figure 6.19a replicates the allocators at every input port, so allo-
cator grant signals will not incur a large RC delay before triggering buffer reads, and the crossbar
can also be set up more quickly as control signals now traverse a shorter distance. This comes at
the cost of increased area and power. Allocator request signals still have to traverse through the
entire crossbar height and width, but their delay is mitigated as that router uses a pipeline opti-
mization technique, advanced bundles, to trigger allocations in advance. Figure 6.19b, however,
leverages their use of the semi-global metal layers for the crossbar to place the allocators in the
middle, in the active area underneath the crossbar wiring, to lower wire delay to the allocators
without replication.
Here, we just aim to illustrate the many possible back-end design decisions that can be
made at floorplanning time to further optimize the router design. Note that as routers are just
one component of a many-core chip, their floorplans also need to be determined relative to the
positions of the network interfaces (NICs), cores, and caches.
We will revisit floorplanning in NoC prototypes in Chapter 8. Most recent chip prototypes
with NoCs [39, 72, 101] synthesize the entire router as one module rather than hierarchically,
letting the CAD tools automatically place the various components of the router within the speci-
fied area. For instance, Figure 6.18b plots the area distribution for a state-of-the-art mesh router
with 4 VCs at 32 nm [64], designed by letting the CAD tools perform the place-and-route. The
buffers and crossbar contribute over 70% and 20% of the area, respectively. In other designs, the
router's buffers and arbiters are synthesized and then laid out as one module, leaving aside area
for the crossbar, which is custom-designed and integrated during final place-and-route [287].
CHAPTER 7
Modeling and Evaluation
Latency. The average latency of a packet through the network can be expressed as

T = (H + 1) · t_router + H · t_wire + Σ_{h=0}^{H} t_contention(h),

where H is the average hop count through the topology, t_router is the pipeline delay through
a single router, t_wire is the wire delay between two routers, and t_contention(h) is the delay due to
contention between multiple messages competing for the same network resources at a router
h hops from the start. A factor of H + 1 is considered for router delay and contention since a
packet traverses the input router prior to the first hop through the network. t_router accounts for
the time each packet spends in various stages at each router as the router coordinates between
multiple packets; depending on the implementation, this can consist of one to several pipeline
stages as discussed in Chapter 6. t_router and t_wire are design-time metrics. They can be used to
determine a lower bound on the latency of any packet. H and t_contention(h) are runtime metrics
that depend on traffic.
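For intuition, the zero-load lower bound drops the contention term; the numbers below are hypothetical:

def zero_load_latency(H, t_router, t_wire):
    """Lower bound on packet latency: H + 1 router traversals plus
    H link traversals, with t_contention(h) = 0 at zero load."""
    return (H + 1) * t_router + H * t_wire

# E.g., an average of 4 hops with 2-cycle routers and 1-cycle links:
print(zero_load_latency(H=4, t_router=2, t_wire=1))   # 14 cycles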
Throughput. The bisection bandwidth, defined earlier in Chapter 3, is a design-time met-
ric for the throughput of any network. As a reminder, it is the inverse of the maximum load across
the bisection channels of any topology. Ideal throughput assumes perfect flow control and per-
fect load balancing from the routing algorithm. The actual throughput at saturation, however,
might vary heavily, depending on how routing and flow control interact with runtime traffic.
Throughput higher than the bisection bandwidth can be achieved if traffic does not go from one
end of the network to the other over the bisection links. Oftentimes, however, the achieved sat-
uration throughput is lower than the bisection bandwidth. A deterministic routing algorithm,
such as XY, might be unable to balance traffic across all available links in the topology in re-
sponse to network load. Heavily used paths will saturate quickly, reducing the rate of accepted
traffic. On the other hand, an adaptive routing algorithm using local congestion metrics could
lead to more congestion in downstream links. The inability of the arbitration schemes inside
the router to make perfect matches between requests and available resources can also degrade
throughput. Likewise, a limited number of buffers and the buffer turnaround latency can drive
down the throughput of the network.
Energy. The energy consumed by each flit during its network traversal is given by

E_flit = (H + 1) · (E_BW + E_RC + E_VA + E_SA + E_BR + E_ST) + H · E_wire,

where E_BW, E_RC, E_VA, E_SA, E_BR, and E_ST are the energy consumed for buffer write, route
computation, VC arbitration, switch arbitration, buffer read, and switch traversal, respectively,
and E_wire is the energy of a link traversal. E_RC and E_VA are only consumed by the head flit.
The relative contribution of these parameters is topology and flow control specific. For instance,
a high-radix router might have a larger E_ST and E_wire, but lower H. Similarly, a wormhole
router will not consume E_VA. Contention at every router determines the number of times a flit
may need to perform VA and SA before winning both and getting access to the switch. E_VA
and E_SA depend on the specific allocator implementation.
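A sketch (ours) of this accounting, assuming the per-hop structure of the equation above and hypothetical per-component energies:

def flit_energy(H, E_BW, E_RC, E_VA, E_SA, E_BR, E_ST, E_wire, is_head):
    """Per-flit traversal energy over H + 1 routers and H links;
    E_RC and E_VA are charged only to the head flit."""
    per_router = E_BW + E_SA + E_BR + E_ST
    if is_head:
        per_router += E_RC + E_VA
    return (H + 1) * per_router + H * E_wire

# Head flits pay more than body flits over the same 4-hop path
# (component energies in pJ, hypothetical):
head = flit_energy(4, 1.0, 0.2, 0.3, 0.3, 1.0, 0.8, 2.0, is_head=True)
body = flit_energy(4, 1.0, 0.2, 0.3, 0.3, 1.0, 0.8, 2.0, is_head=False)
assert head > body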
Area. The area footprint of an on-chip network depends on the area of its routers:

A_Network = N · A_router
          = N · (p · v · A_VC + p · A_RouteUnit + p · A_Arbiter_inport + p · A_Arbiter_outport + A_Crossbar),

where N is the number of routers (assuming all of them are homogeneous input-buffered de-
signs), p is the number of ports, and v is the number of VCs per input port. A_VC is the area
consumed by the buffers and control for each VC, which in turn depends on its implementation,
as Chapter 6 discussed. This equation assumes a separable switch allocator design; A_Arbiter_inport
represents the area of all the arbiters at each input port, and A_Arbiter_outport represents the area of
all the arbiters at each output port.
Wires do not directly contribute to the area footprint as they are often routed on higher
metal layers above logic; the link drivers are embedded within the crossbar while the link receivers
sit within the input VCs.
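The area model translates directly into code; a sketch with hypothetical component areas (in mm²):

def network_area(N, p, v, A_VC, A_route, A_arb_in, A_arb_out, A_xbar):
    """Total area for N homogeneous input-buffered routers."""
    A_router = (p * v * A_VC + p * A_route
                + p * A_arb_in + p * A_arb_out + A_xbar)
    return N * A_router

# 64 routers, 5 ports, 4 VCs per port (all component areas hypothetical):
print(network_area(N=64, p=5, v=4, A_VC=0.002, A_route=0.0005,
                   A_arb_in=0.0005, A_arb_out=0.0005, A_xbar=0.02))  # ~4.3 mm^2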
7.1.2 IDEAL INTERCONNECT FABRIC
Ideal latency. The lowest latency (or ideal latency) that can be achieved by the intercon-
nection network is one due solely to the wire delay between a source and destination.
This latency can be achieved by allowing all data to travel on dedicated pipelined wires that di-
rectly connect a source to a destination. Wire latency assumes the optimal insertion of repeaters
and flip-flops. The latency of this dedicated wiring would be governed only by the average wire
length D between the source and destination (assumed to be the Manhattan distance), packet
size L, channel bandwidth b, and propagation velocity v:

T_ideal = T_wire = D/v + L/b.

The first term corresponds to the time spent traversing the interconnect, while the second
corresponds to the serialization latency for a packet of length L to cross a channel of bandwidth
b. Ideally, serialization delay could be avoided with very wide channels. However, such a large
number of wires would not be feasible given the projected chip sizes.
Ideal throughput. The ideal throughput depends solely on the bandwidth provided by the
topology. It can be computed by calculating the load across all links in the topology for a par-
ticular traffic pattern with a specific routing algorithm, and taking the inverse of the load on
the maximally loaded link.
Ideal energy. The energy expended to communicate data between tiles should ideally be
just the energy of the interconnect wires, as given by

E_ideal = (L/b) · D · E_wire,

where D is again the distance between source and destination and E_wire is the interconnect
transmission energy per unit length.
Table 7.1: Simulation parameters.

Parameter                  Value
Technology                 45 nm
Vdd                        1.0 V
Frequency                  2 GHz
Topology                   8-ary 2-mesh
Routing                    Dimension-ordered (DOR)
Traffic                    Uniform Random and Bit Complement
Router pipeline depth      1
Number of router ports     5
VCs per port               4
Buffers per port           4 (1 per VC)
Flit size (channel width)  128 bits
Link length                1 mm
Figure 7.1: Latency (cycles) vs. injected load (flits/node/cycle) for an 8 × 8 mesh on-chip network compared against the ideal, under uniform random (left) and bit complement (right) traffic.
hop to get from the final router to the destination NIC. The ideal throughput is computed using
maximum channel load at the bisection links.
At low loads, the state-of-the-art design is close to the ideal latency; the gap is due to the
additional 1-cycle router delay at every hop in the former. This gap is small because of the
pipeline optimizations incorporated into the design; a 5-stage pipeline at every router would
increase it significantly, leading to system-level performance penalties. At very high loads, the
gap increases due to contention. The state-of-the-art VC router delivers about 80% throughput of the
ideal for both traffic patterns. The 20% gap is due to inefficiencies in routing and arbitration that
lead to a loss in link utilization at high loads. Simpler router designs will increase this throughput
gap significantly; wormhole flow control without virtual channels will saturate much earlier than
the curve shown. A small number of buffers will also reduce the saturation throughput.
Table 7.2: Synthetic traffic patterns for a k × k mesh.
Figure 7.2 plots the energy consumption of an ideal network and a state-of-the-art base-
line network, using the DSENT [328] energy models. This baseline architecture incorporates
many energy-efficient microarchitectural features but still significantly exceeds the energy con-
sumed solely by wires. This gap exists due to the additional buffering, switching, and arbitration
that occurs at each router; the gap widens until the network saturates.
Figure 7.2: Network energy (nJ/cycle) vs. injected load (flits/node/cycle) with uniform random traffic in an 8 × 8 mesh on-chip network.
7.3 TRAFFIC
The traffic through the on-chip network depends on the kind of system it has been plugged
into and the overlying communication protocol. Some of the common communication proto-
cols were described in detail in Chapter 2. Here we discuss how the communication protocol
prescribes the modeling and evaluation of on-chip networks. For the purpose of illustration, we
consider shared memory systems where the on-chip network interconnects the memory subsys-
tem (L1, L2, directory, memory controllers, etc.) and transfers cache-coherence traffic.
CHAPTER 8
Case Studies
Over the past decade, on-chip networks have been driving real multicore chips across commer-
cial products and research prototypes. We discuss a few of them here as case studies, focusing on
the system they are interconnecting and the design specifications. We highlight the topology,
routing algorithm, flow control and router microarchitecture and relate these designs back to
fundamental concepts presented in earlier chapters; however, in some cases, limited public infor-
mation is available about these chips so their treatment may not be complete. Table 8.1 summarizes
the features of all the chips discussed in this chapter. Case studies are presented in reverse
chronological order, starting with the most recent.
Figure: Eyeriss NoC: rows of PEs receive filter weights, input images (ifmaps), and partial sums (psums); (c) NoC topology comprising a global input network, global output network, and local network, with per-row global X buses and a global Y bus; (d) single-cycle unicast and multicast over the global input network for filters, ifmaps (e.g., AlexNet layers 1 and 4–5), and psums.
Figure: OpenPiton: (a) tile and chipset, with the OpenSPARC T1 core, CCX arbiter, FPU, L1.5 and L2 caches, NoC routers, chip bridge, and DRAM/memory path; (b) the three NoCs, carrying loads, stores, invalidations, downgrades, and memory requests with their corresponding acknowledgments and replies.
The interface to the OpenSPARC T1’s L1 is the OpenSPARC CCX (CPU-Cache Cross-
bar) interface. An inclusive L2 cache is distributed across all the tiles. The memory subsystem
maintains cache coherence with a directory-based MESI coherence protocol. It adheres to the
TSO memory consistency model used by the OpenSPARC T1. OpenPiton adds a new L1.5
cache to transduce CCX messages to the coherence protocol messages. Coherent messages be-
tween L1.5 caches and L2 caches communicate through three NoCs. A chip bridge connects
the tile array to the chipset, through the upper-left tile, for serving memory and I/O requests.
OpenPiton also includes an AXI4-Lite bridge that provides connectivity to a wide range of I/O
devices by interfacing memory mapped I/O operations from the NoCs to AXI-Lite.
Three NoCs transport messages across the various message classes of the coherence pro-
tocol. NoC1 transports requests from L1.5 to the L2s; NoC2 transports responses and requests
from the L2 to the cores and the memory controller, respectively; NoC3 transports writebacks
from the L1.5 and responses from the memory controller to the L2. To ensure deadlock-
freedom across the message classes, the priority order among the NoCs is NoC3 > NoC2 >
NoC1. This ensures that responses are always drained. The NoCs also maintain point-to-point
ordering.
All NoCs use 64-bit bi-directional links. Each NoC uses wormhole routers without any
virtual channels. The design essentially uses multiple physical networks instead of multiplexing
multiple VCs over the same physical links. Dimension-ordered XY routing is used to avoid
routing and protocol deadlocks. Each wormhole router takes one cycle when routing along the
same dimension, and two cycles at turns. In the ASIC prototype, the NoC routers consume less
than 3% of the entire chip area, which is dominated by the cores and caches.
8.3 INTEL XEON-PHI (2015)
Intel’s Xeon Phi [321] line of processors is targeted for High Performance Computing (HPC)
workloads and contain tens of cores. The first iteration, called Knights Corner was released in
2012. It is implemented in 22 nm and operates at 1–1.2 GHz. It contains 61 P54C (Pentium
Pro) cores, interconnected by a bi-directional ring. The same ring is used in Intel’s Xeon products.
To get high-bandwidth and ring scalability, Intel uses 10 separate rings, 5 in each direction. The
5 rings are: one BL (64-byte for data), two AD (address), and two AK (acknowledgment and
other coherence messages). All rings are of a different width, optimized for their traffic type. All
packets are 1-flit wide, and each ring takes a cycle to deliver packets between ring stops. Apart
from the cores, there are ring stops for eight memory controllers, PCIe controllers, and a few
others for bookkeeping.
The second iteration, called Knights Landing, was released in 2015. It is implemented in
14 nm and has 36 tiles, each with 2 Silvermont (Atom) cores. A high-level overview is shown in
Figure 8.3. There are 38 physical tiles, of which 36 are active; the remaining 2 tiles are for yield
recovery. Each tile comprises two cores, two vector processing units (VPUs) per core, a 1-Mbyte
level-2 (L2) cache that is shared between the two cores, and a slice of the distributed directory.
The NoC is a 6 × 6 mesh. There are four parallel meshes, each delivering a different type of
traffic. The mesh can deliver greater than 700 GB/s of total aggregate bandwidth. There are no
VCs within each mesh. The mesh is organized into rows and columns of “half” rings that fold
upon themselves at the end points. In other words, the output link at the edge of an edge tile is
connected to the input link at the same edge. All packets use YX routing: a packet first traverses
the Y links to reach the right row, and then turns along the X links to reach the destination. It takes
Figure 8.4: D. E. Shaw Research Anton 2: 4 × 4 mesh with skip channels [338].
unfairness. This is addressed by statically programming weights in the arbiters to provide service
proportional to load, also known as EoS (equality of service). This static programming is possible
since the class of MD applications running on the system is known.
Figure: Anton 2 latency-bound notification network router, with output ports to the west, south, and east.
The Swizzle-Switch is a high-radix (64 × 64) crossbar with 128-bit links. It operates at around
500 MHz, delivering a throughput of 4.5 Tb/s. Figure 8.7 shows the chip floorplan.
The key idea in the Swizzle-Switch is to re-use the data wires for arbitration, obviating
the need for a separate control plane for arbitration, which adds significant area and delay
penalties in crossbars due to high fanout and in turn limits scalability. At each crosspoint, there is
a vector of priority bits which specify which input ports this particular input port inhibits, i.e.,
has higher priority over. Each input port repurposes a particular bit of the horizontal input bus
to assert a request, and a particular bit of the output bus to use as an inhibit line. Every output
channel operates independently in two modes, arbitration and data transmission.
During the arbitration phase, all inhibit lines are pre-charged to 1. If an input channel
has active data, it discharges the inhibit lines corresponding to the input ports it inhibits. For
every output port, its highest priority input port wins arbitration and the result gets latched in
a Granted Flip Flop to setup the connection for data transmission. During data transmission,
the output buses are pre-charged to 1. At crosspoints where Granted Flip Flop is 1, the output
remains charged or gets discharged based on the input. The Granted Flip Flop uses a thyristor-
based sense amplifier to set the enabled latch, which only enables the discharge of the output
bus for a short period of time, reducing the voltage swing on the output wire. This reduced swing
coupled with the single-ended sense amplifier helps to increase the speed, reduce the crosstalk,
and reduce the power consumption of the Swizzle-Switch.
The Swizzle-Switch is proposed as a high-radix single-stage interconnect for a 64-core
topology. It takes four cycles to use the Swizzle-Switch: one cycle for the signals to reach the
crossbar, one cycle for arbitration, one for data transmission, and one to reach the destination
core.
Figure 8.7: Chip floorplan: three Swizzle-Switches at the center, surrounded by ARM Cortex-A5 cores (each with 32 kB instruction and data caches) and 512 kB L2 cache slices, with memory controllers along the edges.
Table: Network parameters (excerpt).

Response packet size             5 flits (cache data)
Bypass router-and-link latency   1 cycle
Operating frequency              1 GHz
Power supply voltage             1.1 V and 0.8 V
or unicasts. Response packets, representing cache lines, are 5-flits wide. There are 6 VCs in each
router: four 1-flit deep for requests, and two 3-flit deep for responses.
The router implements two key features to approach the ideal latency, throughput, and energy limits. The first is a multicasting crossbar with low-swing links, designed to optimize both energy and latency. The links swing at 300 mV, which leads to a 48.3% power reduction compared to an equivalent full-swing crossbar. It provides a single-cycle switch + link traversal (ST + LT), unlike conventional designs, which spend a cycle each in the crossbar and the link. The crosspoints of the crossbar allow flits to be forked out of multiple output ports. The second feature of the router is the bypassing of pipeline stages to allow flits to arbitrate for multiple ports in one cycle and traverse the crossbar and links in the next, without having to stop and get buffered. This is implemented by sending 15-bit lookaheads from the previous router to try to pre-arbitrate for one (or more) ports of the crossbar, one cycle before the actual flit. The arbiter is separable: the first stage (called mSA-I) arbitrates among input VCs at every input port, while the second stage (called mSA-II) arbitrates among input ports at every output port. The lookaheads have priority over local flits at each input port, and bypass mSA-I to directly enter mSA-II. If the lookahead wins arbitration for all of its ports, the incoming multicast flit is forked within the crossbar and not buffered at all. If the lookahead wins some or none of its output ports, the incoming flit is buffered and subsequently re-arbitrates for the remaining ports; partial grants are allowed. The regular router pipeline is three cycles: BW + mSA-I + VA in the first, mSA-II + lookahead traversal in the second, and ST + LT in the third. Successful arbitration by the lookaheads reduces the datapath to a single-cycle per-hop (ST + LT) low-swing traversal.
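The all-or-partial grant behavior of the lookaheads can be sketched as follows; this is a simplified functional model under our own assumptions, abstracting away VC state and mSA-I.

    # Sketch of lookahead pre-arbitration with partial grants.
    # A lookahead requests a set of output ports one cycle ahead of its flit.
    def process_lookahead(requested_ports, free_ports):
        """Return (bypass_ports, remaining_ports): ports the flit forks to
        directly in the crossbar (ST + LT), and ports it must buffer for and
        re-arbitrate through the regular mSA-I/mSA-II pipeline."""
        granted = requested_ports & free_ports
        remaining = requested_ports - granted
        return granted, remaining

    # A multicast flit wants ports E, W, and N; only E and N are free this cycle.
    granted, remaining = process_lookahead({"E", "W", "N"}, {"E", "N", "S"})
    assert granted == {"E", "N"}    # forked through the crossbar, never buffered
    assert remaining == {"W"}       # buffered; re-arbitrates for W next cycle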
Figure 8.9: Georgia Tech 3D-MAPS: 3-D chip with 8 × 8 2-D mesh on the logic tier [184].
Each core runs a modified version of MIPS and implements an in-order dual-issue pipeline. A 2D mesh is used to connect the cores together, controlled by explicit communication and synchronization instructions. However, there are no routers; explicit instructions are provided to move data generated by a core to its N, S, E, or W neighbor. The memory tier also has an 8 × 8 array of SRAM tiles, although these are not interconnected and are private to each core.
[Figure: block diagram with a 7 × 7 local crossbar switch connecting 16 SPUs and other IPs, with multicasting support and two external ports.]
Source routing is used for both unicasts and multicasts. The header flits carry 16 bits of routing information that specifies the route for unicasts and the destination SPU set for multicasts. The header also carries a 4-bit burst length for data bursts of up to 8 flits per packet, and a 2-bit priority for quality-of-service.
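Such a header is amenable to simple bit packing. The sketch below assumes a particular field order and adds a hypothetical multicast flag; neither is specified by the design.

    # Hypothetical packing of the header fields described above: 16 routing bits
    # (a route for unicasts, a one-hot SPU set for multicasts), a 4-bit burst
    # length (up to 8 flits), a 2-bit priority, and an assumed multicast flag.
    def pack_header(routing_bits, burst_len, priority, multicast):
        assert 0 <= routing_bits < (1 << 16)
        assert 1 <= burst_len <= 8 and 0 <= priority < 4
        return (routing_bits << 7) | (burst_len << 3) | (priority << 1) | int(multicast)

    def unpack_header(h):
        return {
            "routing":   (h >> 7) & 0xFFFF,
            "burst_len": (h >> 3) & 0xF,
            "priority":  (h >> 1) & 0x3,
            "multicast": bool(h & 0x1),
        }

    # Multicast to SPUs 0, 3, and 5 (one-hot set), 4-flit burst, priority 2.
    h = pack_header(0b101001, 4, 2, multicast=True)
    assert unpack_header(h)["routing"] == 0b101001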
There are no VCs; wormhole flow control is used. Each router has a 4-stage pipeline. In the first stage, an incoming flit's header is parsed and the flit is buffered in an 8-deep FIFO that manages synchronization across the heterogeneous clock domains of the IPs and the NoC. In the second stage, active input ports send request signals to each output-port arbiter. The arbiters perform round-robin scheduling according to the priority levels of the requests. In the third stage, the grants are received. For multicasts, a grant checker verifies whether all requested output ports were granted. If they were, the flit is dequeued and broadcast out of the crossbar in the fourth stage. If not, the flit retries for all ports in the next cycle, as partial grants are not allowed. A variable-strength driver at every input port of the crossbar provides sufficient drive strength for multicasting.
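In contrast to the partial grants of the previous design, this all-or-nothing grant check reduces to a subset test; a minimal sketch:

    # Sketch of the multicast grant checker: the flit is dequeued and broadcast
    # only if *all* requested output ports were granted; otherwise it retries
    # all of them next cycle (no partial grants).
    def grant_check(requested_ports, granted_ports):
        return requested_ports <= granted_ports   # set-subset test

    assert grant_check({"E", "N"}, {"E", "N", "S"}) is True
    assert grant_check({"E", "N"}, {"E"}) is False  # partial grant: full retry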
8.11 INTEL SINGLE-CHIP CLOUD (2009)
The Intel SCC [160] is a 48-core research prototype to study many-core architectures and their
programmability. All 48 IA-cores boot Linux simultaneously. The chip is implemented in 45 nm
CMOS and operates at 2 GHz. There is no hardware cache coherence; instead software main-
tains coherence using message passing protocols such as MPI and OpenMP. The SCC has
24 tiles, each housing 2 cores and a private L1 and L2 per core. The tiles are connected by a
6 4 mesh NoC offering a bisection bandwidth of 256 GB/s. Figure 8.11 shows an overview
of the chip.
Technology: 45 nm process; 1 poly, 9 metal (Cu)
Transistors: Die 1.3 B, Tile 48 M
Tile Area: 18.7 mm² (3.6 mm × 5.2 mm)
Die Area: 567.1 mm² (21.4 mm × 26.5 mm)
Signals: 970
Package: 1567-pin LGA
The routers have 5 ports, with each input port housing five 24-entry queues, a route pre-computation unit, and a virtual-channel (VC) allocator. Route pre-computation for the output port of the next router is done on queued packets. The links are 16 B wide, with 2 B sidebands used to transmit control information. XY dimension-ordered routing is enforced. There are a total of 8 VCs: two reserved for the request and response message classes, and the rest in a free pool. Credit-based flow control is used between the routers. Input-port and output-port arbitrations are done concurrently using a wrapped wave-front arbiter. The router uses virtual cut-through flow control and performs crossbar switch allocation in a single clock cycle at packet granularity. The router has a 3-cycle pipeline: input arbitration; route pre-compute + switch arbitration; followed by VC allocation. This is followed by a 1-cycle link traversal to the next router. A packet consists of a single flit or multiple flits (up to three) with header, body, and tail flits.
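The split between reserved and pooled VCs can be sketched as follows; the pool-management policy is our own assumption, as only the VC counts are specified.

    # Sketch of VC allocation with per-class reserved VCs plus a shared free
    # pool, mirroring the SCC's 8 VCs (2 reserved, 6 pooled).
    class VcAllocator:
        def __init__(self):
            self.reserved = {"request": 0, "response": 1}  # class -> dedicated VC
            self.free_pool = set(range(2, 8))
            self.busy = set()

        def allocate(self, msg_class):
            # Prefer a pooled VC; fall back to the class's reserved VC, which
            # guarantees each message class can always make forward progress.
            for vc in sorted(self.free_pool - self.busy):
                self.busy.add(vc)
                return vc
            rvc = self.reserved[msg_class]
            if rvc not in self.busy:
                self.busy.add(rvc)
                return rvc
            return None    # stall: no VC available for this class

        def release(self, vc):
            self.busy.discard(vc)

    alloc = VcAllocator()
    assert alloc.allocate("request") == 2   # pooled VCs are used first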
The die is divided into 8 voltage islands and 28 frequency islands, allowing the V/F of each island to be independently modulated by software. The complete 2D mesh is part of one V/F domain. The NoC contributes 5% and 10% of the total power at low-power (0.7 V; cores at 125 MHz, mesh at 250 MHz) and high-power (1.14 V; cores at 1 GHz, mesh at 2 GHz) operation, respectively.
[Figure: (a) block diagram, including motion estimation, Viterbi decoder, and FFT blocks, 16 KB shared memories, and pads to analog circuitry.]
[Figure: tile processor I/O structure: XAUI and PCIe PHY/MAC with serialize/deserialize blocks, GbE 0 and 1, UART, HPI, I2C, JTAG, SPI, and flexible I/O.]
At the chip frequency of 1 GHz, iMesh provides a bisection bandwidth of 320 GB/s.
Actual realizable throughput depends on the traffic and how it is managed and balanced by the
flow control and routing protocols.
The iMesh has sophisticated network interfaces supporting both shared memory and mes-
sage passing paradigms. There are six physical networks: Memory Dynamic Network (MDN),
Tile Dynamic Network (TDN), User Dynamic Network (UDN), Static Network (STN), I/O Dynamic Network (IDN), and Invalidation Dynamic Network (VDN). Traffic is statically
divided across the six meshes. The caches and memory controllers are connected to the MDN
and TDN with inter-tile shared memory cache transfers going through the TDN and responses
going through the MDN, providing system-level deadlock freedom through two separate phys-
ical networks. The UDN supports user-level messaging, so threads can communicate through
message passing in addition to the cache coherent shared memory. Upon message arrivals, user-
level interrupts are issued for fast notification. Message queues can be virtualized into off-chip
DRAM in case of buffer overflows in the NIC. The STN is used for routing large streaming
data. I/O and system messages use the IDN. The TILE64 contains the first five of these networks; the VDN was introduced in the TILEPro64 to carry invalidation traffic and accelerate cache coherence.
The five dynamic networks (UDN, IDN, MDN, TDN, and VDN) use the dimension-
ordered routing algorithm, with the destination address encoded in X-Y coordinates in the
header. The static network (STN) allows the routing decision to be pre-set. This is achieved
through circuit switching: a setup packet first reserves a specific route, the subsequent message
then follows this route to the destination.
The dynamic networks use simple wormhole flow control without virtual channels to lower
the complexity of the routers, trading off the lower bandwidth of wormhole flow control by
spreading traffic over multiple networks. Credit-based buffer management is used. The static
network uses circuit switching to enable the software to pre-set arbitrary routes while enabling
fast delivery for the subsequent data transfer; the setup delay is amortized over long messages.
Buffer management in each network varies. On the MDN, a conservative end-to-end approach is used, wherein every node communicating with DRAM is allocated a slot at the memory controller. This guarantees that traffic on the MDN is always drained, without causing any congestion. Acknowledgments are issued when the DRAM controller processes a request. The storage at the memory controller is sized to cover the acknowledgment latency and allow multiple in-flight memory requests. On the TDN, link-level flow control is used. As long as the MDN drains (due to the end-to-end flow control), the TDN can make forward progress. The IDN and UDN are software accessible and implement mechanisms to drain into the DRAM, and refill, to avoid deadlocks. In addition, the IDN utilizes pre-allocated buffering with explicit acknowledgments when communicating with I/O devices. The UDN can employ multiple end-to-end buffer management schemes depending on the programming model.
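The MDN's slot-per-node scheme amounts to end-to-end credits; a minimal sketch follows (our own functional model, not Tilera's implementation).

    # Sketch of MDN-style end-to-end buffer management: each node owns a fixed
    # number of slots at the memory controller and may inject a request only
    # while it holds a free slot, so MDN traffic can always drain.
    class EndToEndSlots:
        def __init__(self, nodes, slots_per_node=1):
            self.credits = {n: slots_per_node for n in nodes}

        def try_send(self, node):
            if self.credits[node] == 0:
                return False    # hold the request at the source, not in the NoC
            self.credits[node] -= 1
            return True

        def on_ack(self, node):
            # The controller acknowledges once it processes the request,
            # returning the slot to the sender.
            self.credits[node] += 1

    e2e = EndToEndSlots(nodes=range(4))
    assert e2e.try_send(0) is True
    assert e2e.try_send(0) is False   # must wait for the controller's ack
    e2e.on_ack(0)
    assert e2e.try_send(0) is True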
The iMesh's wormhole networks have a single-stage router pipeline on straight portions of a route, with an additional route-calculation stage when turning. Only a single buffer queue is needed at each of the five router ports, since no VCs are used. Only three flit buffers are used per port, just sufficient to cover the buffer turnaround time. This emphasis on simple routers results in a low area overhead of just 5.5% of the tile footprint.
8.14 ST MICROELECTRONICS STNOC (2008)
ST Microelectronics' STNoC [82] aims to provide a programmable on-chip communication platform on top of a simple network for heterogeneous multicore platforms. It encapsulates support for communication and synchronization primitives and low-level platform services within what it calls the Interconnect Processing Unit (IPU). Examples of communication primitives are send, receive, read, and write, while synchronization primitives include test-and-set and compare-and-swap. The aim is to have a library of different IPUs that support specific primitives so that MPSoC designers can select the ones compatible with their IP blocks. For instance, IP blocks that interface with the old STBus require read-modify-write primitives that will be mapped to appropriate IPUs. Currently, STNoC fully supports two widely used SoC bus standards, the STBus and AMBA AXI [27], with plans to add IPUs for other programming models and standards.
The STNoC proposes a novel pseudo-regular topology, the Spidergon, that can be readily
tailored depending on the actual application traffic characteristics, which are known a priori.
Figure 8.14 sketches several variants of spidergons. Figure 8.14a shows a 6-node spidergon that
can have more links added to cater to higher bandwidth needs (Figure 8.14b). Figure 8.14c shows
a maximally connected 12-node spidergon, where most links can be trimmed off when they are
not needed (Figure 8.14d). The pseudo-regularity in STNoC permits the use of identical degree-three router nodes across the entire range of Spidergon topologies, which simplifies design and
makes it easier for a synthesis algorithm to arrive at the optimal topology. A regular layout is
also possible, as Figure 8.14e illustrates.
The STNoC can be routed using regular routing algorithms that are identical at each node, leveraging the ring-like topology of the Spidergon. For instance, the Across-First routing algorithm sends packets along shortest paths, using the long across links that connect non-adjacent nodes only when doing so yields the shortest path, and only as the first hop. In Figure 8.14f, for example, when going from Node 0 to Node 4, packets are routed from Node 0 to Node 3 on the long across link, then from Node 3 to Node 4 on the short link, yielding a 2-hop route. Note, though, that link lengths clearly differ and need to be taken into account: despite the low hop count, the cycles spent traversing long links may increase packet latency under Across-First routing. The Across-First routing algorithm is not deadlock-free, relying instead on the flow control protocol to ensure deadlock freedom.
STNoC routing is implemented through source routing, encoding just the across-link turn and the destination ejection, since the ring links between adjacent nodes are the default routes. The Across-First algorithm can be implemented within the network interface controller using either routing tables or combinational logic.
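We sketch Across-First route computation on an N-node Spidergon below; the threshold for taking the across link is our own approximation of the shortest-path rule.

    # Sketch of Across-First routing on an n-node Spidergon (n even). Node i's
    # across link reaches node (i + n/2) mod n; ring links reach i +/- 1.
    def across_first_route(src, dst, n):
        hops = []
        cw, ccw = (dst - src) % n, (src - dst) % n
        if min(cw, ccw) > n // 4:      # the across link shortens the path:
            src = (src + n // 2) % n   # take it first, and only first
            hops.append(("across", src))
            cw, ccw = (dst - src) % n, (src - dst) % n
        step = 1 if cw <= ccw else -1  # then walk the ring the short way around
        while src != dst:
            src = (src + step) % n
            hops.append(("ring", src))
        return hops

    # Figure 8.14f's example: Node 0 to Node 4 on a 6-node Spidergon, 2 hops.
    assert across_first_route(0, 4, 6) == [("across", 3), ("ring", 4)]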
STNoC uses wormhole flow control, supporting flit sizes ranging from 16–512 bits depending on the bandwidth requirements of the application. Virtual channels are used to break deadlocks, using a dateline approach similar to what has been discussed, with the variant that nodes that do not route past the dateline need not be constrained to a specific VC, but can instead use any VC.
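This relaxed dateline rule can be sketched as follows, assuming clockwise routing on a ring with the dateline on the wrap-around link; the VC ids are illustrative.

    # Sketch of dateline VC selection with the relaxation described above:
    # only packets whose route crosses the dateline (the wrap past node 0)
    # are pinned to a VC; all other packets may use either VC freely.
    def allowed_vcs(src, dst, cur):
        """VCs usable on the link leaving node `cur` for a packet src -> dst,
        routed clockwise on the ring."""
        crosses_dateline = dst < src          # clockwise route wraps past node 0
        if not crosses_dateline:
            return {0, 1}                     # unconstrained: any VC
        return {1} if cur < src else {0}      # VC0 before the dateline, VC1 after

    # On a 6-node ring: a packet 4 -> 1 crosses the dateline; 1 -> 4 does not.
    assert allowed_vcs(4, 1, 5) == {0}        # not yet crossed
    assert allowed_vcs(4, 1, 0) == {1}        # crossed
    assert allowed_vcs(1, 4, 2) == {0, 1}     # never crosses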
Figure 8.14: Spidergon variants: (a) 6-node Spidergon; (b) 6-node Spidergon with extra links; (c) 12-node Spidergon with all links; (d) 12-node Spidergon with links removed; (e) 12-node full Spidergon layout; (f) route from Node 0 to Node 4 using the Across-First algorithm.
8.15 INTEL TERAFLOPS (2007)

Source routing enables the use of many possible oblivious or deterministic routing algorithms, or even adaptive routing algorithms that choose a route based on network congestion information at the injection point. This enables the TeraFLOPS to tailor routes to specific applications.
The TeraFLOPS has a minimum packet size of two flits (38-bit flits comprising 6 bits of control and 32 bits of data), with no limit placed on the maximum packet size by the router architecture. The network uses wormhole flow control with two virtual channels (called "lanes"), although the virtual channels are used only to avoid system-level deadlock and not for flow control, as packets are pinned to a VC throughout their network traversal. This simplifies the router design since no VC allocation needs to be done at each hop. Buffer backpressure is maintained using on/off signaling, with software-programmable thresholds.
The high 5 GHz (15 FO4) frequency of the TeraFLOPS chip mandates the use of aggressively pipelined routers. The router uses a five-stage pipeline: buffer write; route computation (extracting the desired output port from the header); two separable stages of switch allocation; and switch traversal. The pipeline is shown in Figure 8.15b. Note that the single-hop delay is still just 1 ns. Each port has two input queues, one per lane, each 16 flits deep. The switch allocator is separable in order to be pipelineable, implemented as a 5:1 port arbiter followed by a 2:1 lane arbiter. The first stage of arbitration within a particular lane essentially binds an input port to an output port for the entire duration of the packet, opting for router simplicity over the flit-level interleaving of multiple VCs. Hence, the VCs are not leveraged for bandwidth, but serve only deadlock-avoidance purposes.
The crossbar switch is custom-designed, using bit interleaving, or double pumping: alternate bits are sent on different phases of the clock, reducing the crossbar area by 50%. The crossbar is fully non-blocking, with a total bandwidth of 100 GB/s. The crossbar circuit is shown in Figure 8.15c. The layout of the router is custom, with the crossbar at the center and the queues, control, and arbiters for each VC (lane) on either side.
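A toy model of double pumping illustrates the wire savings; this is purely illustrative of the bit interleaving, not the circuit.

    # Toy model of a double-pumped (bit-interleaved) crossbar datapath: a w-bit
    # flit crosses w/2 physical wires, even-indexed bits on clock phase 0 and
    # odd-indexed bits on phase 1, halving crossbar width and thus its area.
    def double_pump_send(flit_bits):
        return flit_bits[0::2], flit_bits[1::2]   # (phase 0, phase 1)

    def double_pump_recv(phase0, phase1):
        flit = [None] * (len(phase0) + len(phase1))
        flit[0::2], flit[1::2] = phase0, phase1
        return flit

    bits = [1, 0, 1, 1, 0, 0, 1, 0]
    assert double_pump_recv(*double_pump_send(bits)) == bits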
The maximum frequency of the router ranges from 1.7 GHz at 0.75 V to 5.1 GHz at 1.2 V. The corresponding measured on-chip network power per tile with all router ports active ranges from 98–924 mW, consuming 39% of the tile power. Clock gating and sleep transistors at every port help reduce dynamic and leakage power, respectively, and can lower the total power to 126 mW, a 7.3× reduction.
8.16 IBM CELL (2005)

[Figure: elements on the Cell Element Interconnect Bus, including the MIC, SPEs, BIF, and IOIF0.]

Ring access is arbitrated with the highest priority given to the memory interface controller, so that requestors will not be stalled on read data. Other elements on the EIB have equal priority and are served in a round-robin manner.
The IBM Cell uses explicit message passing as opposed to a shared-memory paradigm. It is designed to preserve DMA over split-transaction bus semantics, so snoopy coherent transfers can be supported atop the four unidirectional rings. In addition to the four rings, these 12 elements interface to an address-and-command bus that handles bus requests and coherence requests. The rings are accessed in a bus-like manner: a sending phase, where the source element initiates a transaction (e.g., issues a DMA); a command phase through the address-and-command bus, where the destination element is informed about the impending transaction; then a data phase, where access to the rings is arbitrated and, if access is granted, data are actually sent from source to destination. Finally, the receiving phase moves data from the NIC (called the Bus Interface Controller, BIC) to the actual local or main memory or I/O.
8.17 CONCLUSION
As evidenced by the case studies in this chapter, there has been a significant uptick in commercial designs and research prototypes featuring on-chip networks. Although meshes remain the most common topology, rings and crossbars continue to be optimized. Dimension-ordered routing is widely favored across the designs studied due to its simplicity and deadlock freedom. Wider variation is seen in both flow control methods and router pipeline stages. Finally, these commercial designs and prototypes highlight the importance of considering the entire system: many of the designs support cache coherence protocols, message passing, and broadcasting/multicasting. Although common attributes have emerged as the field of on-chip networks matures, we anticipate exciting new research in all aspects of on-chip networks to drive the field in the next decade.
CHAPTER 9
Conclusions
The study of on-chip networks is a relatively new research field. Conference papers addressing them began appearing only in the late 1990s, and on-chip networks have only recently begun appearing in products in sophisticated forms. In this concluding chapter, we reflect on emerging
research challenges in this field, surveying the state of the art and summarizing several key
opportunities.
Field — Conferences
Architecture — International Symposium on Computer Architecture (ISCA); International Symposium on Microarchitecture (MICRO); International Symposium on High Performance Computer Architecture (HPCA); International Conference on Parallel Architectures and Compilation Techniques (PACT); International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
CAD — International Conference on Computer-Aided Design (ICCAD); Design Automation Conference (DAC); Design Automation and Test in Europe (DATE)
VLSI — International Conference on VLSI (VLSI); International Solid State Circuits Conference (ISSCC)
Network on Chip — International Network on Chip Symposium (NOCS)
9.6 BIBLIOGRAPHIC NOTES
Finally, we refer the reader to other summary and overview papers [44, 50, 151, 238, 267, 276]
to help guide them in further study of on-chip networks.
References
[1] Pablo Abad, Pablo Prieto, Lucia G Menezo, Valentin Puente, and José-Ángel Gregorio.
Topaz: An open-source interconnection network simulator for chip multiprocessors and
supercomputers. In Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Sym-
posium on, pages 99–106. IEEE, 2012. DOI: 10.1109/nocs.2012.19.
[2] Pablo Abad, Valentin Puente, and José Ángel Gregorio. MRR: Enabling fully
adaptive multicast routing for CMP interconnection networks. In International
Symposium on High Performance Computer Architecture, pages 355–366, 2009. DOI:
10.1109/hpca.2009.4798273.
[3] Pablo Abad, Valentin Puente, José Angel Gregorio, and Pablo Prieto. Rotary
router: An efficient architecture for CMP interconnection networks. In Proc. of the
International Symposium on Computer Architecture, pages 116–125, June 2007. DOI:
10.1145/1250662.1250678.
[4] Sergi Abadal, Albert Cabellos-Aparicio, Eduard Alarcón, and Josep Torrellas. WiSync:
An architecture for fast synchronization through on-chip wireless communication. In
Proc. of the Twenty-First International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’16, Atlanta, GA, USA, April 2-6, 2016, pages 3–
17, 2016. DOI: 10.1145/2872362.2872396.
[5] Sergi Abadal, Benny Sheinman, Oded Katz, Ofer Markish, Danny Elad, Yvan Fournier,
Damian Roca, Mauricio Hanzich, Guillaume Houzeaux, Mario Nemirovsky, et al.
Broadcast-enabled massive multicore architectures: A wireless rf approach. IEEE Micro,
35(5):52–61, 2015. DOI: 10.1109/mm.2015.123.
[6] Mohamed Abdelfattah and Vaughn Betz. Design tradeoffs for hard and soft fpga-based
networks-on-chip. In International Conference on Field-Programmable Technology, 2012.
DOI: 10.1109/fpt.2012.6412118.
[7] Mohamed Abdelfattah and Vaughn Betz. The power of communication: Energy-efficient
NoCs for FPGAs. In International Conference on Field-Programmable Logic and Applica-
tions, 2013. DOI: 10.1109/fpl.2013.6645496.
[8] Mohamed Abdelfattah and Vaughn Betz. Power analysis of embedded NoCs on FP-
GAs and comparison to custom buses. IEEE Trans. on VLSI, January 2016. DOI:
10.1109/tvlsi.2015.2397005.
[9] Mohamed Abdelfattah, Andrew Bitar, and Vaughn Betz. Take the highway: Design for
embedded NoCs on FPGAs. In International Symposium on Field Programmable Gate Ar-
rays, 2015. DOI: 10.1145/2684746.2689074.
[10] Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar,
Ronald G. Dreslinski, David Blaauw, and Trevor Mudge. Scaling towards kilo-core pro-
cessors with asymmetric high radix topologies. In Proc. of the International Symposium on
High Performance Computer Architecture, 2013. DOI: 10.1109/hpca.2013.6522344.
[11] Ahmed K Abousamra, Rami G Melhem, and Alex K Jones. Deja vu switching for multi-
plane NoCs. In Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium
on, pages 11–18. IEEE, 2012. DOI: 10.1109/nocs.2012.9.
[12] Dennis Abts, Abdulla Bataineh, Steve Scott, Greg Faanes, Jim Schwarzmeier, Eric Lund-
berg, Tim Johnson, Mike Bye, and Gerald Schwoerer. The Cray BlackWidow: a highly
scalable vector multiprocessor. In Proc. of the Conference on Supercomputing, page 17, 2007.
DOI: 10.1145/1362622.1362646.
[13] Dennis Abts, Natalie Enright Jerger, John Kim, Mikko Lipasti, and Dan Gibson. Achiev-
ing predictable performance through better memory controller placement in many-core
CMPs. In Proc. of the International Symposium on Computer Architecture, 2009. DOI:
10.1145/1555754.1555810.
[14] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. GARNET: A detailed
on-chip network model inside a full-system simulator. In Proc. of the IEEE International
Symposium on Performance Analysis of Systems and Software, pages 33–42, April 2009. DOI:
10.1109/ispass.2009.4919636.
[15] Niket Agarwal, Li-Shiuan Peh, and Niraj K. Jha. In-network coherence filtering: Snoopy
coherence without broadcasts. In International Symposium on Microarchitecture, 2009. DOI:
10.1145/1669112.1669143.
[16] Niket Agarwal, Li-Shiuan Peh, and Niraj K. Jha. In-network snoop ordering (INSO):
Snoopy coherence on unordered interconnects. In Proc. of the International Sympo-
sium on High Performance Computer Architecture, pages 67–78, February 2009. DOI:
10.1109/hpca.2009.4798238.
[17] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable
processing-in-memory accelerator for parallel graph processing. In Proc. of the 42nd Annual
International Symposium on Computer Architecture, pages 105–117, Portland, OR, June 13–
17, 2015. DOI: 10.1145/2749469.2750386.
[18] Minseon Ahn and Eun Jung Kim. Pseudo-circuit: Accelerating communication for on-
chip interconnection networks. In Proc. of the International Symposium on Microarchitecture,
December 2010. DOI: 10.1109/micro.2010.10.
[19] Konstantinos Aisopos, Chia-Hsin Owen Chen, and Li-Shiuan Peh. Enabling system-
level modeling of variation-induced faults in networks-on-chips. In Proc. of the 48th De-
sign Automation Conference, DAC ’11, pages 930–935, New York, NY, 2011. ACM. DOI:
10.1145/2024724.2024931.
[20] Konstantinos Aisopos, Andrew DeOrio, Li-Shiuan Peh, and Valeria Bertacco. ARI-
ADNE: agnostic reconfiguration in a disconnected network environment. In 2011 Inter-
national Conference on Parallel Architectures and Compilation Techniques, PACT, pages 298–
309, Galveston, TX, October 10–14, 2011. DOI: 10.1109/pact.2011.61.
[21] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur,
Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, Brian Taba,
Michael Beakes, Bernard Brezzo, Jente B. Kuang, Rajit Manohar, William P. Risk,
Bryan Jackson, and Dharmendra S. Modha. TrueNorth: Design and tool flow of a
65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, 2015. DOI:
10.1109/tcad.2015.2474396.
[22] Adrijean Andriahantenaina, Herve Charlery, Alain Greiner, Laurent Mortiez, and Ce-
sar Albenes Zeferino. SPIN: a scalable, packet switched, on-chip micro-network. In
Proc. of the Conference on Design, Automation and Test in Europe, pages 70–73, 2003. DOI:
10.1109/date.2003.1253808.
[23] Aniruddha N. Udipi, Naveen Muralimanohar, and Rajeev Balasubramonian. Towards scalable,
energy-efficient bus-based on-chip networks. In Proc. of the International Symposium on
High Performance Computer Architecture, 2010. DOI: 10.1109/hpca.2010.5416639.
[24] Amin Ansari, Asit Mishra, Jianping Xu, and Josep Torrellas. Tangle: Route-oriented
dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks.
In International Symposium on High Performance Computer Architecture, 2014. DOI:
10.1109/hpca.2014.6835953.
[25] Padma Apparao, Ravi Iyer, Xiaomin Zhang, Don Newell, and Tom Adelmeyer. Char-
acterization and analysis of a server consolidation benchmark. In Proc. of the In-
ternational Conference on Virtual Execution Environments, pages 21–30, 2008. DOI:
10.1145/1346256.1346260.
[26] ARM. AMBA open specifications. https://www.arm.com/products/amba-open-specifications.php.
[27] ARM. AMBA specifications. https://www.arm.com/products/system-ip/amba-specifications.
[30] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubi-
atowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel,
and Katherine Yelick. A view of the parallel computing landscape. Communications of the
ACM, 52(10):56–67, 2009. DOI: 10.1145/1562764.1562783.
[32] Todd Austin, Valeria Bertacco, Scott Mahlke, and Yu Cao. Reliable systems on un-
reliable fabrics. IEEE Design and Test of Computers, 25(4):322–332, July 2008. DOI:
10.1109/mdt.2008.107.
[33] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas
Avižienis, John Wawrzynek, and Krste Asanović. Chisel: constructing hardware in a scala
embedded language. In Proc. of the 49th Annual Design Automation Conference, pages 1216–
1225. ACM, 2012. DOI: 10.1145/2228360.2228584.
[34] Mario Badr and Natalie Enright Jerger. SynFull: Synthetic traffic models capturing cache
coherent behaviour. In Proc. of the International Symposium on Computer Architecture, 2014.
DOI: 10.1109/isca.2014.6853236.
[35] Ali Bakhoda, John Kim, and Tor M. Aamodt. Throughput-effective on-chip networks for
manycore accelerators. In Proc. of the International Symposium on Microarchitecture, 2010.
DOI: 10.1109/micro.2010.50.
[36] Ali Bakhoda, George L Yuan, Wilson WL Fung, Henry Wong, and Tor M Aamodt.
Analyzing CUDA workloads using a detailed GPU simulator. In Performance Analysis of
Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163–
174. IEEE, 2009. DOI: 10.1109/ispass.2009.4919648.
[37] Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Liqun Cheng, and
John Carter. Leveraging wire properties at the microarchitecture level. IEEE Micro,
26(6):40–52, Nov/Dec 2006. DOI: 10.1109/mm.2006.123.
[38] James Balfour and William J. Dally. Design tradeoffs for tiled CMP on-chip networks.
In Proc. of the International Conference on Supercomputing, pages 187–198, 2006. DOI:
10.1145/2591635.2667187.
[39] Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri Nguyen, Yanqi Zhou, Alexey
Lavrov, Mohammad Shahrad, Adi Fuchs, Samuel Payne, Xiaohua Liang, Matthew
Matl, and David Wentzlaff. OpenPiton: An open source manycore research frame-
work. In Proc. of the Twenty-First International Conference on Architectural Support
for Programming Languages and Operating Systems, pages 217–232. ACM, 2016. DOI:
10.1145/2872362.2872414.
[40] Luiz A. Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz
Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: a scalable
architecture based on single-chip multiprocessing. In Proc. of the International Symposium
on Computer Architecture, pages 282–293, 2000. DOI: 10.1109/isca.2000.854398.
[41] Edith Beigné, Fabien Clermidy, Pascal Vivet, Alain Clouard, and Marc Renaudin.
An asynchronous noc architecture providing low latency service and its multi-level
design framework. In Asynchronous Circuits and Systems, 2005. ASYNC 2005. Pro-
ceedings. 11th IEEE International Symposium on, pages 54–63. IEEE, 2005. DOI:
10.1109/async.2005.10.
[42] Luca Benini, Davide Bertozzi, Alessandro Bogliolo, Francesco Menichelli, and Mauro
Olivieri. MPARM: Exploring the multi-processor SoC design space with Sys-
temC. Journal of VLSI Signal Processing Systems, 41(2):169–182, September 2005. DOI:
10.1007/s11265-005-6648-1.
[43] Luca Benini and Giovanni De Micheli. Powering networks on chips. In Proc.
of the 14th International Symposium on System Synthesis, pages 33–38, 2001. DOI:
10.1109/isss.2001.957909.
[44] Luca Benini and Giovanni De Micheli. Networks on chips: a new SoC paradigm. IEEE
Computer, 35(1):70–78, January 2002. DOI: 10.1109/2.976921.
[45] Luca Benini and Giovanni De Micheli. Networks on Chips: Technology and Tools. Academic
Press, 2006.
[46] Keren Bergman, John Shalf, and Tom Hausken. Optical interconnects and ex-
treme computing. Optics and Photonics News, 27(4):32–39, April 2016. DOI:
10.1364/opn.27.4.000032.
[47] Davide Bertozzi, Antoine Jalabert, Srinivasan Murali, Rutuparna Tamhankar, Stergios
Stergiou, Luca Benini, and Giovanni De Micheli. NoC synthesis flow for customized
domain specific multiprocessor systems-on-chip. IEEE Transactions on Parallel and Dis-
tributed Systems, 16(2):113–129, February 2005. DOI: 10.1109/tpds.2005.22.
[48] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC bench-
mark suite: Characterization and architectural implications. In Proc. of the 17th Interna-
tional Conference on Parallel Architectures and Compilation Techniques, pages 72–81, October
2008. DOI: 10.1145/1454115.1454128.
[49] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi,
Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti,
Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A.
Wood. The gem5 simulator. SIGARCH Computer Architecture News, 39(2):1–7, 2011.
DOI: 10.1145/2024716.2024718.
[50] Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of
network-on-chip. ACM Computing Surveys, 38(1), 2006. DOI: 10.1145/1132952.1132953.
[51] Paul Bogdan, Radu Marculescu, and Siddharth Jain. Dynamic power management
for multidomain system-on-chip platforms: an optimal control approach. ACM Trans-
actions on Design Automation of Electronic Systems (TODAES), 18(4):46, 2013. DOI:
10.1145/2504904.
[52] Haseeb Bokhari, Haris Javaid, Muhammad Shafique, Jörg Henkel, and Sri Parameswaran.
Malleable NoC: dark silicon inspired adaptable network-on-chip. In Proc. of the 2015
Design, Automation & Test in Europe Conference & Exhibition, pages 1245–1248. EDA
Consortium, 2015. DOI: 10.7873/date.2015.0694.
[53] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. QNoC: QoS architecture and design process for cost-effective network on chip. Journal of Systems Architecture, special issue on Networks on Chip, 50(2):105–128, February 2004. DOI:
10.1016/j.sysarc.2003.07.004.
[54] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodony. Routing table min-
imization for irregular mesh NoCs. In Proc. of the Conference on Design, Automation and
Test in Europe, pages 942–947, 2007. DOI: 10.1109/date.2007.364414.
[55] Anja Boos, Luca Ramini, Ulf Schlichtmann, and Davide Bertozzi. Proton: An auto-
matic place-and-route tool for optical networks-on-chip. In International Conference on
Computer-Aided Design, pages 138–145. IEEE, 2013. DOI: 10.1109/iccad.2013.6691109.
[56] Aaron Carpenter, Jianyun Hu, Ovunc Kocabas, Michael Huang, and Hui Wu. Enhanc-
ing effective throughput for transmission line-based bus. In Proc. of the 39th Annual In-
ternational Symposium on Computer Architecture, ISCA ’12, pages 165–176, 2012. DOI:
10.1109/isca.2012.6237015.
[57] Aaron Carpenter, Jianyun Hu, Jie Xu, Michael Huang, and Hui Wu. A case for
globally shared-medium on-chip interconnect. In Proc. of the 38th Annual Inter-
national Symposium on Computer Architecture, ISCA ’11, pages 271–282, 2011. DOI:
10.1145/2000064.2000097.
[58] Mario R Casu and Paolo Giaccone. Rate-based vs delay-based control for DVFS in
NoC. In 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE),
pages 1096–1101. IEEE, 2015. DOI: 10.7873/date.2015.0613.
[59] Jeremy Chan and Sri Parameswaran. NoCGEN: A template based reuse methodology
for networks on chip architecture. In VLSI Design, 2004. Proceedings. 17th International
Conference on, pages 717–720. IEEE, 2004. DOI: 10.1109/icvd.2004.1261011.
[60] M. Frank Chang, Jason Cong, Adam Kaplan, Chunyue Liu, Mishali Naik, Jagannath
Premkumar, Glenn Reinman, Eran Socher, and Sai-Wang Tam. Power reduction of
CMP communication networks via RF-interconnects. In Proc. of the 41st Annual Interna-
tional Symposium on Microarchitecture, pages 376–387, November 2008. DOI: 10.1109/mi-
cro.2008.4771806.
[61] M. Frank Chang, Jason Cong, Adam Kaplan, Mishali Naik, Glenn Reinman, Eran Socher,
and Sai-Wang Tam. CMP network-on-chip overlaid with multi-band RF-interconnect.
In Proc. of the 14th International Symposium on High-Performance Computer Architecture,
pages 191–202, February 2008. DOI: 10.1109/hpca.2008.4658639.
[63] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha
Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing.
In Proc. of the 2009 IEEE International Symposium on Workload Characterization (IISWC),
IISWC ’09, pages 44–54, Washington, DC, USA, 2009. IEEE Computer Society. DOI:
10.1109/iiswc.2009.5306797.
[65] Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, and Li-Shiuan Peh. A low-
swing crossbar and link generator for low-power networks-on-chip. In International Con-
ference on Computer-Aided Design, 2011. DOI: 10.1109/iccad.2011.6105418.
[66] L. Chen, D. Zhu, M. Pedram, and T. M. Pinkston. Simulation of NoC power-gating:
Requirements, optimizations,and the Agate simulator. Journal of Parallel and Distributed
Computing, 2016. DOI: 10.1016/j.jpdc.2016.03.006.
[67] Lizhong Chen and Timothy M Pinkston. Nord: Node-router decoupling for effective
power-gating of on-chip routers. In Proc. of the 2012 45th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, pages 270–281. IEEE Computer Society, 2012.
DOI: 10.1109/micro.2012.33.
[68] Lizhong Chen and Timothy M. Pinkston. Worm-bubble flow control. In Proc. of the
International Symposium on High Performance Computer Architecture, February 2013. DOI:
10.1109/hpca.2013.6522333.
[69] Lizhong Chen, Lihang Zhao, Ruisheng Wang, and Timothy Mark Pinkston.
MP3: Minimizing performance penalty for power-gating of clos network-on-chip.
In International Symposium on High Performance Computer Architecture, 2014. DOI:
10.1109/hpca.2014.6835940.
[70] Lizhong Chen, Di Zhu, Massoud Pedram, and Timothy M Pinkston. Power punch: To-
wards non-blocking power-gating of NoC routers. In 2015 IEEE 21st International Sym-
posium on High Performance Computer Architecture (HPCA), pages 378–389. IEEE, 2015.
DOI: 10.1109/hpca.2015.7056048.
[71] Xi Chen, Zheng Xu, Hyungjun Kim, Paul V Gratz, Jiang Hu, Michael Kishinevsky, Umit
Ogras, and Raid Ayoub. Dynamic voltage and frequency scaling for shared resources in
multicore processor designs. In Proc. of the 50th Annual Design Automation Conference,
page 114. ACM, 2013. DOI: 10.1145/2463209.2488874.
[72] Yu-Hsin Chen, Tushar Krishna, Joel Emer, and Vivienne Sze. Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. In IEEE
International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers, pages
262–263, 2016. DOI: 10.1109/jssc.2016.2616357.
[73] Andrew A. Chien and Jae H. Kim. Planar-adaptive routing: low-cost adaptive net-
works for multiprocessors. In Proc. of the International Symposium on Computer Architecture,
pages 268–277, 1992. DOI: 10.1109/isca.1992.753323.
[74] Ge-Ming Chiu. The odd-even turn model for adaptive routing. IEEE Transactions on
Parallel and Distributed Systems, pages 729–738, July 2000. DOI: 10.1109/71.877831.
[75] Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas
Devadas. Oblivious routing on on-chip bandwidth-adaptive networks. In Proc. of the
International Conference on Parallel Architecture and Compilation Techniques, 2009. DOI:
10.1109/pact.2009.41.
[76] Eric S. Chung, James C. Hoe, and Ken Mai. CoRAM: An in-fabric memory architecture
for fpga-based computing. In International Symposium on Field-Programmable Gate Arrays,
2011. DOI: 10.1145/1950413.1950435.
[77] Mark J. Cianchetti, Joseph C. Kerekes, and David H. Albonesi. Phastlane: A rapid transit
optical routing network. In International Symposium on Computer Architecture, 2009. DOI:
10.1145/1555754.1555809.
[78] Christopher Condrat, Priyank Kalla, and Steve Blair. Crossing-aware channel routing for
integrated optics. TCAD, 33(6):814–825, 2014. DOI: 10.1109/tcad.2014.2317575.
[79] Christopher Condrat, Priyank Kalla, and Steve Blair. Thermal-aware synthesis of inte-
grated photonic ring resonators. In International Conference on Computer-Aided Design,
2014. DOI: 10.1109/iccad.2014.7001405.
[80] Kypros Constantinides, Stephen Plaza, Jason Blome, Bin Zhang, Valeria Bertacco, Scott
Mahlke, Todd Austin, and Michael Orshansky. BulletProof: A defect tolerant CMP
switch architecture. In Proc. of the International Symposium on High Performance Computer
Architecture, pages 5–16, 2006. DOI: 10.1109/hpca.2006.1598108.
[81] Pat Conway and Bill Hughes. The AMD Opteron Northbridge architecture, present and
future. IEEE Micro Magazine, 27:10–21, March 2007. DOI: 10.1109/MM.2007.43.
[82] Marcello Coppola. Spidergon STNoC: The technology that adds value to your sys-
tem. In Hot Chips 22 Symposium (HCS), 2010 IEEE, pages 1–39. IEEE, 2010. DOI:
10.1109/hotchips.2010.7480082.
[83] Marcello Coppola, Riccardo Locatelli, Giuseppe Maruccio, Lorenzo Pieralisi, and
A. Scandurra. Spidergon: a novel on chip communication network. In International Sym-
posium on System on Chip, page 15, November 2004. DOI: 10.1109/issoc.2004.1411133.
[84] Cisco CRS-1. http://www.cisco.com.
[85] D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach.
Morgan Kaufmann Publishers Inc., 1999.
[86] William Dally and Brian Towles. Principles and Practices of Interconnection Networks. Mor-
gan Kaufmann Pub., San Francisco, CA, 2003.
[87] William J. Dally. Virtual-channel flow control. In Proc. of the International Symposium on
Computer Architecture, 1990. DOI: 10.1109/isca.1990.134508.
[88] William J. Dally. Express cubes: Improving the performance of k-ary n-cube intercon-
nection networks. IEEE Transactions on Computers, 40(9):1016–1023, September 1991.
DOI: 10.1109/12.83652.
[89] William J. Dally and Hiromichi Aoki. Deadlock-free adaptive routing in multicomputer
networks using virtual channels. IEEE Transactions on Parallel and Distributed Systems,
4(4):466–475, 1993. DOI: 10.1109/71.219761.
[90] William J. Dally, Larry R. Dennison, David Harris, Kinhong Kan, and Thucydides Xan-
thopoulos. The reliable router: A reliable and high-performance communication substrate
for parallel computers. In Proc. of the First International Workshop on Parallel Computer
Routing and Communication, pages 241–255, 1994. DOI: 10.1007/3-540-58429-3_41.
[91] William J. Dally, J. A. Stuart Fiske, John S. Keen, Richard A. Lethin, Michael D. Noakes,
Peter R. Nuth, Roy E. Davison, and Gregory A. Fyler. The message-driven processor –
a multicomputer processing node with efficient mechanisms. IEEE Micro, 12(2):23–39,
April 1992. DOI: 10.1109/40.127581.
[92] William J. Dally and John W. Poulton. Digital Systems Engineering. Cambridge University
Press, 1998. DOI: 10.1017/cbo9781139166980.
[93] William J. Dally and Charles L. Seitz. The torus routing chip. Journal of Distributed
Computing, 1(3):187–196, 1986. DOI: 10.1007/bf01660031.
[94] William J. Dally and Charles L. Seitz. Deadlock-free message routing in multiprocessor
interconnection networks. IEEE Transactions on Computers, 36(5):547–553, 1987. DOI:
10.1109/tc.1987.1676939.
[95] William J. Dally and Brian Towles. Route packets, not wires: On-chip interconnection
networks. In Proc. of the 38th Conference on Design Automation, pages 684–689, 2001. DOI:
10.1109/dac.2001.935594.
[96] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani
Azimi. Application-to-core mapping policies to reduce memory system interference in
multi-core systems. In International Sympoisum on High Performance Computer Architec-
ture, pages 107–118, 2013. DOI: 10.1109/hpca.2013.6522311.
[97] Reetuparna Das, Soumya Eachempati, Asit K. Mishra, N. Vijaykrishnan, and Chita R.
Das. Design and evaluation of hierarchical on-chip network topologies for next generation
CMPs. In Proc. of the International Symposium on High Performance Computer Architecture,
pages 175–186, February 2009.
[98] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita Das. Application-aware
priorization mechanisms for on-chip networks. In Proc. of the International Symposium on
Microarchitecture, 2009. DOI: 10.1145/1669112.1669150.
[99] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita Das. Aergia: Exploiting
packet latency slack in on-chip networks. In Proc. of the International Symposium on Com-
puter Architecture, 2010. DOI: 10.1145/1815961.1815976.
[100] Reetuparna Das, Satish Narayanasamy, Sudhir Satpathy, and Ronald Dreslinski. Catnap:
Energy proportional multiple network-on-chip. In Proc. of the International Symposium on
Computer Architecture, 2013. DOI: 10.1145/2508148.2485950.
[101] Bhavya K. Daya, Chia-Hsin Owen Chen, Suvinay Subramanian, Woo-Cheol Kwon,
Sunghyun Park, Tushar Krishna, Jim Holt, Anantha P. Chandrakasan, and Li-Shiuan
Peh. SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable
mesh NoC with in-network ordering. In ACM/IEEE 41st International Symposium on
Computer Architecture, ISCA, pages 25–36, Minneapolis, MN, June 14–18, 2014. DOI:
10.1109/isca.2014.6853232.
[102] Martin De Prycker. Asynchronous Transfer Mode: Solution for Broadband ISDN, 3rd ed.,
Prentice Hall, 1995.
[103] Sujay Deb, Amlan Ganguly, Partha Pratim Pande, Benjamin Belzer, and Deukhyoun
Heo. Wireless noc as interconnection backbone for multicore chips: Promises and chal-
lenges. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2(2):228–239,
2012. DOI: 10.1109/jetcas.2012.2193835.
[104] Duo Ding, Yilin Zhang, Haiyu Huang, Ray T. Chen, and David Z. Pan. O-router: an
optical routing framework for low power on-chip silicon nano-photonic integration. In
Proc. of the Design Automation Conference. ACM, 2009. DOI: 10.1145/1629911.1629983.
[105] Dominic DiTomaso, Avinash Kodi, and Ahmed Louri. QORE: A fault toler-
ant network-on-chip architecture with power-efficient quad-function channel (QFC)
buffers. In International Symposium on High Performance Computer Architecture, 2014. DOI:
10.1109/hpca.2014.6835942.
[106] T. Dorta, J. Jiménez, J. L. Martín, U. Bidarte, and A. Astarloa. Overview of FPGA-based
multiprocessor systems. In International Conference on Reconfigurable Computing and FP-
GAs, 2009. DOI: 10.1109/reconfig.2009.15.
[107] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaob-
ing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting vision processing
closer to the sensor. In International Symposium on Computer Architecture, 2015. DOI:
10.1145/2749469.2750389.
[108] José Duato. A new theory of deadlock-free adaptive routing in wormhole networks.
IEEE Transactions on Parallel and Distributed Systems, 4(12):1320–1331, December 1993.
DOI: 10.1109/spdp.1993.395549.
[109] José Duato. A necessary and sufficient condition for deadlock-free adaptive routing in
wormhole networks. IEEE Transactions on Parallel and Distributed Systems, 6(10):1055–
1067, Oct 1995. DOI: 10.1109/71.473515.
[110] José Duato, Sudhakar Yalamanchili, and Lionel M. Ni. Interconnection Networks: An
Engineering Approach, 2nd ed., Morgan Kaufmann, 2003.
[111] Noel Eisley, Li-Shiuan Peh, and Li Shang. In-network cache coherence. In Proc. of the
39th International Symposium on Microarchitecture, pages 321–332, December 2006. DOI:
10.1109/micro.2006.27.
[112] Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li, and Gabriel H. Loh. NoC ar-
chitectures for silicon interposer systems. In International Symposium on Microarchitecture,
2014. DOI: 10.1109/micro.2014.61.
[113] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko H. Lipasti. Circuit-switched coher-
ence. In Proc. of the International Network on Chip Symposium, pages 193–202, April 2008.
DOI: 10.1109/nocs.2008.4492738.
[114] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko H. Lipasti. Virtual circuit tree mul-
ticasting: A case for on-chip hardware multicast support. In International Symposium on
Computer Architecture, pages 229–240, June 2008. DOI: 10.1109/isca.2008.12.
[115] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko H. Lipasti. Virtual tree coherence:
Leveraging regions and in-network multicast trees for scalable cache coherence. In Proc. of
the 41st International Symposium on Microarchitecture, pages 35–46, November 2008. DOI:
10.1109/micro.2008.4771777.
[116] Natalie Enright Jerger, Dana Vantrease, and Mikko H. Lipasti. An evaluation of server
consolidation workloads for multi-core designs. In IEEE International Symposium on Work-
load Consolidation, pages 47–56, September 2007. DOI: 10.1109/iiswc.2007.4362180.
[117] Chris Fallin, Chris Craik, and Onur Mutlu. CHIPPER: A low-complexity bufferless
deflection router. In Proc. of the International Symposium on High Performance Computer
Architecture, 2011. DOI: 10.1109/hpca.2011.5749724.
[118] Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, and John Shalf.
OpenSoC fabric: On-chip network generator. In IEEE International Symposium on Per-
formance Analysis of Systems and Software, ISPASS, pages 194–203, 2016. DOI: 10.1109/is-
pass.2016.7482094.
[119] David Fick, Andrew DeOrio, Gregory Chen, Valeria Bertacco, Dennis Sylvester, and
David Blaauw. A highly resilient routing algorithm for fault-tolerant NoCs. In Proc. of
the Conference on Design, Automation and Test in Europe, DATE ’09, pages 21–26, 3001
Leuven, Belgium, Belgium, 2009. European Design and Automation Association. DOI:
10.1109/date.2009.5090627.
[120] David Fick, Andrew DeOrio, Jin Hu, Valeria Bertacco, David Blaauw, and Dennis
Sylvester. Vicis: A reliable network for unreliable silicon. In Proc. of the 46th Annual De-
sign Automation Conference, DAC’09, pages 812–817, New York, NY, 2009. ACM. DOI:
10.1145/1629911.1630119.
[121] Finisar. Optimized fiber optics solutions for data center applications. http://www.finisar.com/markets/data-center.
[122] J. Flich, A. Mejia, P. Lopez, and J. Duato. Region-based routing: An efficient routing
mechanism to tackle unreliable hardware in network on chips. In Proc. of the First In-
ternational Symposium on Networks-on-Chip, NOCS’07, pages 183–194, Washington, DC,
2007. IEEE Computer Society. DOI: 10.1109/nocs.2007.39.
[123] Jose Flich, Andres Mejia, Pedro López, and José Duato. Region-based routing: An effi-
cient routing mechanism to tackle unreliable hardware in networks on chip. In Proc. of the
Network on Chip Symposium, pages 183–194, May 2007. DOI: 10.1109/nocs.2007.39.
[124] Jose Flich, Samuel Rodrigo, and José Duato. An efficient implementation of distributed
routing algorithms for NoCs. In Proc. of the International Network On Chip Symposium,
pages 87–96, April 2008. DOI: 10.1109/nocs.2008.4492728.
[125] Binzhang Fu, Yinhe Han, Jun Ma, Huawei Li, and Xiaowei Li. An abacus turn model
for time/space-efficient reconfigurable routing. In Proc. of the International Symposium on
Computer Architecture, June 2011. DOI: 10.1145/2000064.2000096.
[126] Mike Galles. Scalable pipelined interconnect for distributed endpoint routing: The SGI
SPIDER chip. In Proc. of Hot Interconnects Symposium IV, pages 141–146, 1996.
[127] Alan Gara, Matthias A. Blumrich, Dong Chen, George L.-T. Chiu, Paul Coteus,
Mark E. Giampapa, Ruud A. Haring, Philip Heidelberger, Dirk Hoenicke, Gerard V.
Kopcsay, Thomas A. Liebsch, Martin Ohmacht, Burkhard D. Steinmacher-Burow, Todd
Takken, and Pavlos Vranas. Overview of the Blue Gene/L system architecture. IBM Jour-
nal of Research and Developement, 49(2–3):195–212, 2005. DOI: 10.1147/rd.492.0195.
[128] Patrick T. Gaughan and Sudhakar Yalamanchili. Pipelined circuit-switching: a fault-
tolerant variant of wormhole routing. In Proc. of the Symposium on Parallel and Distributed
Processing, pages 148–155, December 1992. DOI: 10.1109/spdp.1992.242751.
[129] N. Genko, D. Atienza, G. De Micheli, J. Mendias, R. Hermida, and F. Catthoor. A
complete network-on-chip emulation framework. In Proc. of the Conference on Design Au-
tomation and Test in Europe, pages 246–251, March 2005. DOI: 10.1109/date.2005.5.
[130] R. Gindin, I. Cidon, and I. Keidar. NoC-based FPGA: Architecture and routing. In
International Symposium on Networks-on-Chip, 2007. DOI: 10.1109/nocs.2007.31.
[131] Christopher J. Glass and Lionel M. Ni. The turn model for adaptive routing. In Proc.
of the International Symposium on Computer Architecture, pages 278–287, May 1992. DOI:
10.1109/isca.1992.753324.
[132] Nitin Godiwala, Jud Leonard, and Matthew Reilly. A network fabric for scalable mul-
tiprocessor systems. In Proc. of the Symposium on Hot Interconnects, pages 137–144, 2008.
DOI: 10.1109/hoti.2008.24.
[133] Kees Goossens, Martijn Bennebroek, Jae Young Hur, and Muhammad Aqeel
Wahlah. Hardwired networks on chip in FPGAs to unify functional and configu-
ration interconnects. In International Symposium on Networks-on-Chip, 2008. DOI:
10.1109/nocs.2008.4492724.
[134] Kees Goossens, John Dielissen, Om Prakash Gangwal, Santiago Gonzalez Pestana, An-
drei Radulescu, and Edwin Rijpkema. A design flow for application-specific networks on
chip with guaranteed performance to accelerate SoC design and verification. In Proc. of the
Design, Automation and Test in Europe Conference, pages 1182–1187, March 2005. DOI:
10.1109/date.2005.11.
[135] Kees Goossens, John Dielissen, and Andrei Radulescu. Æthereal network on chip: Con-
cepts, architectures, and implementations. IEEE Design and Test, 22(5):414–421, Septem-
ber 2005. DOI: 10.1109/mdt.2005.99.
[137] Paul Gratz, Boris Grot, and Stephen W. Keckler. Regional congestion awareness
for load balance in networks-on-chip. In Proc. of the 14th IEEE International Sympo-
sium on High Performance Computer Architecture, pages 203–214, February 2008. DOI:
10.1109/hpca.2008.4658640.
[138] Paul Gratz, Changkyu Kim, Robert G. McDonald, Stephen W. Keckler, and Doug
Burger. Implementation and evaluation of on-chip network architectures. In IEEE
International Conference on Computer Design, pages 477–484, October 2006. DOI:
10.1109/iccd.2006.4380859.
[163] Jingcao Hu and Radu Marculescu. Energy- and performance-aware mapping for regular
NoC architectures. IEEE Transactions on Computer Aided Design for Integrated Circuits
Systems, 24(4):551–562, April 2005. DOI: 10.1109/tcad.2005.844106.
[164] Jingcao Hu, Umit Y. Ogras, and Radu Marculescu. System-level buffer allocation for
application specific networks-on-chip router design. IEEE Transactions on Computer-
Aided Design for Integrated Circuits System, 25(12):2919–2933, December 2006. DOI:
10.1109/tcad.2006.882474.
[165] Paolo Ienne, Patrick Thiran, Giovanni De Micheli, and Frédéric Worm. An adaptive low-
power transmission scheme for on-chip networks. In Proc. of the International Symposium
on Systems Synthesis, pages 92–100, 2002. DOI: 10.1145/581199.581221.
[167] Intel. From a few cores to many: A Tera-scale computing research overview. http://download.intel.com/research/platform/terascale/terascale_overview/_paper.pdf, 2006.
[168] Syed Ali Raza Jafri, Yu-Ju Hong, Mithuna Thottethodi, and T. N. Vijaykumar. Adaptive
flow control for robust performance and energy. In Proc. of the International Symposium on
Microarchitecture, 2010. DOI: 10.1109/micro.2010.48.
[169] Antoine Jalabert, Srinivasan Murali, Luca Benini, and Giovanni De Micheli.
xpipesCompiler: A tool for instantiating application specific networks on chip. In Proc. of
the Conference on Design, Automation and Test in Europe, volume 2, pages 884–889, Febru-
ary 2004. DOI: 10.1007/978-1-4020-6488-3_12.
[170] Nan Jiang, James Balfour, Daniel U Becker, Brian Towles, William J Dally, George
Michelogiannakis, and John Kim. A detailed and flexible cycle-accurate network-on-chip
simulator. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE Interna-
tional Symposium on, pages 86–96. IEEE, 2013. DOI: 10.1109/ispass.2013.6557149.
[171] A.P. Jose, G. Patounakis, and K.L. Shepard. Near speed-of-light on-chip interconnects
using pulsed current-mode signaling. In Symposium on VLSI Circuits, pages 108–111, June
2005. DOI: 10.1109/vlsic.2005.1469345.
[172] Norman P. Jouppi. System implications of integrated photonics. In Proc. of the Interna-
tional Symposium on Low Power Electronics and Design, pages 183–184, August 2008. DOI:
10.1145/1393921.1393923.
[173] J. A. Kahl, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Intro-
duction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4):589–
604, 2005. DOI: 10.1147/rd.494.0589.
[174] Andrew Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. Orion 2.0: A fast
and accurate NoC power and area model for early-stage design space exploration. In
Proc. of the Conference on Design, Automation and Test in Europe, April 2009. DOI:
10.1109/date.2009.5090700.
[175] Ajaykumar Kannan, Natalie Enright Jerger, and Gabriel H. Loh. Enabling interposer-
based disintegration of multi-core processors. In International Symposium on Microarchi-
tecture, 2015. DOI: 10.1145/2830772.2830808.
[176] Ajaykumar Kannan, Natalie Enright Jerger, and Gabriel H. Loh. Exploiting interposer
technologies to disintegrate and reintegrate multi-core processors for performance and
cost. IEEE Micro Top Picks from Computer Architecture, 2016. DOI: 10.1109/mm.2016.53.
[177] N. Kapre and J. Gray. Hoplite: Building austere overlay NoCs for FPGAs.
In International Conference on Field-Programmable Logic and Applications, 2015. DOI:
10.1109/fpl.2015.7293956.
[178] Evangelia Kasapaki, Martin Schoeberl, Rasmus Bo Sørensen, Christoph Müller, Kees
Goossens, and Jens Sparsø. Argo: A real-time network-on-chip architecture with an ef-
ficient GALS implementation. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 24(2):479–492, 2016. DOI: 10.1109/tvlsi.2015.2405614.
[179] Stefanos Kaxiras and Margaret Martonosi. Computer architecture techniques for
power-efficiency. Synthesis Lectures on Computer Architecture, 3(1):1–207, 2008. DOI:
10.2200/s00119ed1v01y200805cac004.
[180] Parviz Kermani and Leonard Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, 3(4):267–286, 1979. DOI: 10.1016/0376-5075(79)90032-1.
[181] B. Kim and V. Stojanović. A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmitter filter and transimpedance receiver in
90nm CMOS technology. In IEEE Solid-State Circuits Conference, February 2009. DOI:
10.1109/isscc.2009.4977310.
[182] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proc. of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 211–222, 2002. DOI: 10.1145/605397.605420.
[183] Dae Hyun Kim, Krit Athikulwongse, Michael Healy, Mohammad Hossain, Moongon Jung, Ilya Khorosh, Gokul Kumar, Young-Joon Lee, Dean Lewis, Tzu-Wei Lin, Chang Liu, Shreepad Panth, Mohit Pathak, Minzhen Ren, Guanhao Shen, Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, Gabriel H. Loh, Hsien-Hsin S. Lee, and Sung Kyu Lim. 3D-MAPS: 3D massively parallel processor with stacked memory. In IEEE International Solid-State Circuits Conference, pages 188–190, 2012. DOI: 10.1109/isscc.2012.6176969.
[184] Dae Hyun Kim, Krit Athikulwongse, Michael B. Healy, Mohammad M. Hossain, Moongon Jung, Ilya Khorosh, Gokul Kumar, Young-Joon Lee, Dean L. Lewis, Tzu-Wei Lin, Chang Liu, Shreepad Panth, Mohit Pathak, Minzhen Ren, Guanhao Shen, Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, Gabriel H. Loh, Hsien-Hsin S. Lee, and Sung Kyu Lim. Design and analysis of 3D-MAPS (3D massively parallel processor with stacked memory). IEEE Transactions on Computers, 64(1):112–125, 2015. DOI: 10.1109/tc.2013.192.
[185] John Kim. Low-cost router microarchitecture for on-chip networks. In Proc. of the In-
ternational Symposium on Microarchitecture, 2009. DOI: 10.1145/1669112.1669145.
[186] John Kim, James Balfour, and William Dally. Flattened butterfly topology for on-chip
networks. In Proc. of the 40th International Symposium on Microarchitecture, pages 172–182,
December 2007. DOI: 10.1109/micro.2007.29.
[187] John Kim, William Dally, Steve Scott, and Dennis Abts. Technology-driven, highly-
scalable dragonfly topology. In Proc. of the International Symposium on Computer Architec-
ture, pages 194–205, June 2008. DOI: 10.1109/isca.2008.19.
[188] Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, Reetuparna Das, Yuan Xie, N. Vijaykrishnan, Mazin S. Yousif, and Chita R. Das. A novel dimensionally-decomposed router for on-chip communication in 3D architectures. In International Symposium on Computer Architecture, pages 138–149, June 2007. DOI: 10.1145/1250662.1250680.
[189] Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, N. Vijaykrishnan, Mazin S.
Yousif, and Chita R. Das. A gracefully degrading and energy-efficient modular router
architecture for on-chip networks. In Proc. of the International Symposium on Computer
Architecture, pages 4–15, June 2006. DOI: 10.1109/isca.2006.6.
[190] Jongman Kim, Dongkook Park, T. Theocharides, N. Vijaykrishnan, and Chita R. Das.
A low latency router supporting adaptivity for on-chip interconnects. In International
Conference on Design Automation, pages 559–564, 2005. DOI: 10.1109/dac.2005.193873.
[191] Joo-Young Kim, Junyoung Park, Seungjin Lee, Minsu Kim, Jinwook Oh, and Hoi-Jun Yoo. A 118.4 GB/s multi-casting network-on-chip with hierarchical star-ring combined topology for real-time object recognition. IEEE Journal of Solid-State Circuits, 45(7):1399–1409, 2010. DOI: 10.1109/jssc.2010.2048085.
[192] Ryan Gary Kim, Wonje Choi, Guangshuo Liu, Ehsan Mohandesi, Partha Pratim Pande,
Diana Marculescu, and Radu Marculescu. Wireless NoC for VFI-enabled multicore chip
design: Performance evaluation and design trade-offs. IEEE Transactions on Computers,
65(4):1323–1336, 2016. DOI: 10.1109/tc.2015.2441721.
[193] Michel Kinsy, Myong Hyon Cho, Tina Wen, Edward Suh, Marten van Dijk, and Srinivas
Devadas. Application-aware deadlock-free oblivious routing. In Proc. of the International
Symposium on Computer Architecture, June 2009. DOI: 10.1145/1555754.1555782.
[194] Nevin Kirman, Meyrem Kirman, Rajeev K. Dokania, Jose F. Martinez, Alyssa B. Apsel,
Matthew A. Watkins, and David H. Albonesi. Leveraging optical technology in future
bus-based chip multiprocessors. In Proc. of the International Symposium on Microarchitecture,
pages 492–503, December 2006. DOI: 10.1109/micro.2006.28.
[195] Michael Kistler, Michael Perrone, and Fabrizio Petrini. Cell multiprocessor com-
munication network: Built for speed. IEEE Micro, 26(3):10–23, May 2006. DOI:
10.1109/mm.2006.49.
[196] Michihiro Koibuchi, Hiroki Matsutani, Hideharu Amano, and Timothy Mark
Pinkston. A lightweight fault-tolerant mechanism for network-on-chip. In
Proc. of the Second ACM/IEEE International Symposium on Networks-on-Chip, NOCS
’08, pages 13–22, Washington, DC, USA, 2008. IEEE Computer Society. DOI:
10.1109/nocs.2008.4492721.
[197] Pranay Koka, Michael O. McCracken, Herb Schwetman, Chia-Hsin Chen, Xuezhe
Zheng, Ron Ho, Kannan Raj, and Ashok V. Krishnamoorthy. A micro-architectural
analysis of switched photonic multi-chip interconnects. In International Symposium on
Computer Architecture, 2012. DOI: 10.1109/isca.2012.6237014.
[198] Pranay Koka, Michael O. McCracken, Herb Schwetman, Xuezhe Zheng, Ron Ho, and
Ashok V. Krishnamoorthy. Silicon-photonic network architectures for scalable, power-
efficient multi-chip systems. In International Symposium on Computer Architecture, 2010.
DOI: 10.1145/1815961.1815977.
[199] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A
32-way multithreaded SPARC processor. IEEE Micro, 25(2):21–29, 2005. DOI:
10.1109/mm.2005.35.
[200] Rajesh Kota. HORUS: Large scale SMP using AMD Opteron™. http://www.hypertransport.org/docs/tech/horus_external_white_paper_final.pdf. DOI: 10.1109/mm.2005.28.
[201] Yana Krasteva, Francisco Criado, Eduardo de la Torre, and Teresa Riesgo. A fast
emulation-based NoC prototyping framework. In RECONFIG, 2008. DOI: 10.1109/re-
config.2008.74.
[202] Yana E. Krasteva, Francisco Criado, Eduardo de la Torre, and Teresa Riesgo. A fast
emulation-based NoC prototyping framework. In Proc. of the 2008 International Conference
on Reconfigurable Computing and FPGAs, RECONFIG ’08, pages 211–216, Washington,
DC, USA, 2008. IEEE Computer Society. DOI: 10.1109/reconfig.2008.74.
[203] Tushar Krishna. garnet2.0. http://synergy.ece.gatech.edu/tools/garnet/.
[204] Tushar Krishna, Chia-Hsin Owen Chen, Woo Cheol Kwon, and Li-Shiuan Peh. Break-
ing the on-chip latency barrier using SMART. In Proc. of the International Symposium on
High Performance Computer Architecture, 2013. DOI: 10.1109/hpca.2013.6522334.
[205] Tushar Krishna, Amit Kumar, Patrick Chiang, Mattan Erez, and Li-Shiuan Peh. NoC
with near-ideal express virtual channels using global-line communication. In Proc. of Hot
Interconnects, pages 11–20, August 2008. DOI: 10.1109/hoti.2008.22.
[206] Tushar Krishna, Li-Shiuan Peh, Bradford M. Beckmann, and Steven K. Rein-
hardt. Towards the ideal on-chip fabric for 1-to-many and many-to-1 communica-
tion. In Proc. of the International Symposium on Microarchitecture, December 2011. DOI:
10.1145/2155620.2155630.
[207] John Kubiatowicz and Anant Agarwal. The anatomy of a message in the Alewife multi-
processor. In Proc. of the International Conference on Supercomputing, pages 195–206, July
1993. DOI: 10.1145/2591635.2667168.
[208] Amit Kumar, Partha Kundu, Arvind Singh, Li-Shiuan Peh, and Niraj K. Jha. A
4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS.
In Proc. of the International Conference on Computer Design, pages 63–70, October 2007.
DOI: 10.1109/iccd.2007.4601881.
[209] Amit Kumar, Li-Shiuan Peh, and Niraj K. Jha. Token flow control. In Proc. of the 41st International Symposium on Microarchitecture, pages 342–353, Lake Como, Italy, November 2008. DOI: 10.1109/micro.2008.4771803.
[210] Amit Kumar, Li-Shiuan Peh, Partha Kundu, and Niraj K. Jha. Express virtual chan-
nels: Toward the ideal interconnection fabric. In Proc. of 34th Annual International
Symposium on Computer Architecture, pages 150–161, San Diego, CA, June 2007. DOI:
10.1145/1250662.1250681.
[211] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, M. Forsell, Mikael Millberg, Johnny
Öberg, Kari Tiensyrjä, and Ahmed Hemani. A network on chip architecture and design
methodology. In Proc. of the IEEE Computer Society Annual Symposium on VLSI, pages 105–
112, April 2002. DOI: 10.1109/isvlsi.2002.1016885.
[212] G. Kurian, J. Miller, J. Psota, J. Michel, L. Kimerling, and A. Agarwal. ATAC: A 1000-
core cache-coherent processor with on-chip optical network. In International Conference
on Parallel Architectures and Compiler Techniques, 2010. DOI: 10.1145/1854273.1854332.
[213] Hyoukjun Kwon and Tushar Krishna. OpenSMART: Single-cycle multi-hop NoC generator in BSV and Chisel. In Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017.
[214] Ying-Cherng Lan, Shih-Hsin Lo, Yueh-Chi Lin, Yu-Hen Hu, and Sao-Jie Chen. BiNoC: A bidirectional NoC architecture with dynamic self-reconfigurable channel. In Proc. of the International Symposium on Networks-on-Chip, 2009. DOI: 10.1109/nocs.2009.5071476.
[215] James Laudon and Daniel Lenoski. The SGI Origin: a ccNUMA highly scalable server.
In Proc. of the 24th Annual International Symposium on Computer Architecture, pages 241–
251, May 1997. DOI: 10.1145/264107.264206.
[216] Doowon Lee, Ritesh Parikh, and Valeria Bertacco. Brisk and limited-impact NoC rout-
ing reconfiguration. In Proc. of the Conference on Design, Automation & Test in Europe,
DATE ’14, pages 306:1–306:6, 2014. DOI: 10.7873/date2014.319.
[217] Jae W. Lee, Man Cheuk Ng, and Krste Asanović. Globally synchronized frames for
guaranteed quality of service in on-chip networks. In Proc. of the International Symposium
on Computer Architecture, June 2008. DOI: 10.1109/isca.2008.31.
[218] Kangmin Lee, Se-Joong Lee, Sung-Eun Kim, Hye-Mi Choi, Donghyun Kim, Sunyoung
Kim, Min-Wuk Lee, and Hoi-Jun Yoo. A 51mW 1.6GHz on-chip network for low-power
heterogeneous SoC platform. In Proc. of the International Solid-State Circuits Conference,
pages 152–153, February 2004. DOI: 10.1109/isscc.2004.1332639.
[219] Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. Low-power network-on-chip for high-
performance SoC design. IEEE Transactions on VLSI Systems, 14(2), February 2006. DOI:
10.1109/tvlsi.2005.863753.
[220] Michael Lee, John Kim, Dennis Abts, Mike Marty, and Jae Lee. Probabilistic distance-based arbitration: Providing equality of service for many-core CMPs. In Proc. of the International Symposium on Microarchitecture, 2010. DOI: 10.1109/micro.2010.18.
[221] Whay Sing Lee, William J. Dally, Stephen W. Keckler, Nicholas P. Carter, and Andrew
Chang. An efficient protected message interface in the MIT M-Machine. IEEE Computer
Special Issue on Design Challenges for High Performance Network Interfaces, 31(11):69–75,
November 1998. DOI: 10.1109/2.730739.
[222] Charles Leiserson. Fat-trees: Universal networks for hardware-efficient supercom-
puting. IEEE Transactions on Computers, 34(10):892–901, October 1985. DOI:
10.1109/tc.1985.6312192.
[223] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, N. Vijaykrish-
nan, and Mahmut Kandemir. Design and management of 3D chip multiprocessors us-
ing network-in-memory. In Proc. of the International Symposium on Computer Architecture,
pages 130–141, June 2006. DOI: 10.1109/isca.2006.18.
[224] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 469–480, 2009. DOI: 10.1145/1669112.1669172.
[225] Zheng Li, Jie Wu, Li Shang, Robert Dick, and Yihe Sun. Latency criticality aware on-
chip communication. In Proc. of the IEEE Conference on Design, Automation, and Test in
Europe, March 2009. DOI: 10.1109/date.2009.5090820.
[226] Zimo Li, Joshua San Miguel, and Natalie Enright Jerger. The runahead network-on-
chip. In International Symposium on High Performance Computer Architecture, 2016. DOI:
10.1109/hpca.2016.7446076.
[227] Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla:
A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, March-April
2008. DOI: 10.1109/mm.2008.31.
[228] Jifeng Liu, Lionel C Kimerling, and Jurgen Michel. Monolithic Ge-on-Si lasers
for large-scale electronic-photonic integration. Semiconductor Science and Technology,
27(9):094006, 2012. DOI: 10.1088/0268-1242/27/9/094006.
[229] Yong Liu, Ping-Hsuan Hsieh, Seongwon Kim, Jae-sun Seo, Robert Montoye, Leland
Chang, Jose Tierno, and Daniel Friedman. A 0.1 pJ/b 5-to-10Gb/s charge-recycling
stacked low-power I/O for on-chip signaling in 45nm CMOS SOI. In 2013 IEEE In-
ternational Solid-State Circuits Conference Digest of Technical Papers, pages 400–401. IEEE,
2013. DOI: 10.1109/isscc.2013.6487787.
[230] Pejman Lotfi-Kamran, Boris Grot, and Babak Falsafi. NOC-Out: Microarchitecting
a scale-out processor. In Proc. of the International Symposium on Microarchitecture, 2012.
DOI: 10.1109/micro.2012.25.
[231] Zhonghai Lu, Ming Liu, and Axel Jantsch. Layered switching for networks on chip. In Proc. of the Conference on Design Automation, pages 122–127, San Diego, CA, June 2007. DOI: 10.1109/dac.2007.375137.
[232] Lian-Wee Luo, Noam Ophir, Christine P. Chen, Lucas H. Gabrielli, Carl B. Poitras, Keren Bergman, and Michal Lipson. WDM-compatible mode-division multiplexing on a silicon chip. Nature Communications, 5:3069, January 2014. DOI: 10.1038/ncomms4069.
[233] Sheng Ma, Natalie Enright Jerger, and Zhiying Wang. DBAR: An efficient routing algorithm to support multiple concurrent applications in networks-on-chip. In Proc. of the International Symposium on Computer Architecture, June 2011. DOI: 10.1145/2000064.2000113.
[234] Sheng Ma, Natalie Enright Jerger, and Zhiying Wang. Supporting efficient collective
communication in NoCs. In Proc. of the International Symposium on High Performance Com-
puter Architecture, February 2012. DOI: 10.1109/hpca.2012.6168953.
[235] Sheng Ma, Natalie Enright Jerger, and Zhiying Wang. Whole packet forwarding: Ef-
ficient design of fully adaptive routing algorithms for networks-on-chip. In Proc. of the
International Symposium on High Performance Computer Architecture, February 2012. DOI:
10.1109/hpca.2012.6169049.
[236] Sheng Ma, Natalie Enright Jerger, Zhiying Wang, Ming-Che Lai, and Libo Huang.
Holistic routing algorithm design to support workload consolidation in NoCs. IEEE
Transactions on Computers, 63(3), March 2014. DOI: 10.1109/tc.2012.201.
[237] Sheng Ma, Zhiying Wang, Zonglin Liu, and Natalie Enright Jerger. Leaving one slot
empty: Flit bubble flow control for torus cache-coherent NoCs. IEEE Transactions on
Computers, 64:763–777, March 2015. DOI: 10.1109/tc.2013.2295523.
[238] Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, Natalie Enright Jerger, and Yatin
Hoskote. Outstanding research problems in NoC design: System, microarchitecture, and
circuit perspectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 28(1):3–21, January 2009. DOI: 10.1109/tcad.2008.2010691.
[239] Theodore Marescaux, Andrei Bartic, Diederik Verkest, Serge Vernalde, and Rudy Lauwereins. Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs. In International Conference on Field-Programmable Logic and Applications, 2002. DOI: 10.1007/3-540-46117-5_82.
[240] Michael R. Marty and Mark D. Hill. Coherence ordering for ring-based chip multipro-
cessors. In Proc. of the 39th International Symposium on Microarchitecture, pages 309–320,
December 2006. DOI: 10.1109/micro.2006.14.
[241] Michael R. Marty and Mark D. Hill. Virtual hierarchies to support server consolidation.
In Proc. of the International Symposium on Computer Architecture, pages 46–56, June 2007.
DOI: 10.1145/1250662.1250670.
[242] Hiroki Matsutani, Michihiro Koibuchi, Hideharu Amano, and Tsutomu Yoshinaga. Pre-
diction router: Yet another low latency on-chip router architecture. In Proc. of the Inter-
national Symposium on High Performance Computer Architecture, pages 367–378, February
2009. DOI: 10.1109/hpca.2009.4798274.
[243] George Michelogiannakis, James Balfour, and William J. Dally. Elastic-buffer flow con-
trol for on-chip networks. In Proc. of the International Symposium on High Performance
Computer Architecture, pages 151–162, February 2009. DOI: 10.1109/hpca.2009.4798250.
[244] George Michelogiannakis, Nan Jiang, Daniel Becker, and William J. Dally. Packet chaining: Efficient single-cycle allocation for on-chip networks. In Proc. of the International Symposium on Microarchitecture, 2011. DOI: 10.1145/2155620.2155631.
[247] Mikael Millberg, Erland Nilsson, Rikard Thid, and Axel Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network-on-chip. In Proc. of the Conference on Design, Automation and Test in Europe (DATE), pages 890–895, 2004. DOI: 10.1109/date.2004.1269001.
[248] D.A.B. Miller. Rationale and challenges for optical interconnects to electronic chips.
Proc. of the IEEE, 88(6):728–749, June 2000. DOI: 10.1109/5.867687.
[249] David Miller. Attojoule optoelectronics for low-energy information processing and
communications: a tutorial review. https://arxiv.org/abs/1609.05510v2. DOI:
10.1109/jlt.2017.2647779.
[250] Asit K. Mishra, Reetuparna Das, Soumya Eachempati, Ravi Iyer, Narayanan Vijaykrishnan, and Chita R. Das. A case for dynamic frequency tuning in on-chip networks. In Proc. of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 292–303, 2009. DOI: 10.1145/1669112.1669151.
[253] Thomas Moscibroda and Onur Mutlu. A case for bufferless routing in on-chip networks.
In 36th International Symposium on Computer Architecture (ISCA), pages 196–207, Austin,
TX, June 20–24, 2009. DOI: 10.1145/1555754.1555781.
[254] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David
Webb. The Alpha 21364 network architecture. IEEE Micro, 22(1):26–35, 2002. DOI:
10.1109/40.988687.
[255] Robert Mullins, Andrew West, and Simon Moore. Low-latency virtual-channel routers
for on-chip networks. In Proc. of the International Symposium on Computer Architecture,
pages 188–197, June 2004. DOI: 10.1109/isca.2004.1310774.
[256] S. Murali and Giovanni De Micheli. SUNMAP: A tool for automatic topology selection
and generation for NoCs. In Proc. of the Design Automation Conference, pages 914–919, June
2004. DOI: 10.1145/996566.996809.
[258] Srinivasan Murali, Paolo Meloni, Federico Angiolini, David Atienza, Salvatore Carta,
Luca Benini, Giovanni De Micheli, and Luigi Raffo. Designing application-specific net-
works on chips with floorplan information. In International Conference on Computer-Aided
Design, pages 355–362, November 2006. DOI: 10.1109/iccad.2006.320058.
[259] Ted Nesson and S. Lennart Johnsson. ROMM routing on mesh and torus networks. In
Proc. of the Symposium on Parallel Algorithms and Architectures, pages 275–287, 1995. DOI:
10.1145/215399.215455.
[261] Rishiyur Nikhil. Bluespec System Verilog: Efficient, correct RTL from high-level specifications. In MEMOCODE, pages 69–70, 2004. DOI: 10.1109/memcod.2004.1459818.
[262] Erland Nilsson, Mikael Millberg, Johnny Öberg, and Axel Jantsch. Load distribution with proximity congestion awareness in a network on chip. In Proc. of the Conference on Design, Automation and Test in Europe, pages 1126–1127, March 2003. DOI: 10.1109/date.2003.1253765.
[263] Christopher Nitta, Kevin Macdonald, Matthew Farrens, and Venkatesh Akella. Inferring
packet dependencies to improve trace based simulation of on-chip networks. In Interna-
tional Symposium on Networks on Chip, 2011. DOI: 10.1145/1999946.1999971.
[265] Peter R. Nuth and William J. Dally. The J-Machine network. In Proc. of the
International Conference on Computer Design, pages 420–423, October 1992. DOI:
10.1109/iccd.1992.276305.
[266] Umit Y. Ogras, Paul Bogdan, and Radu Marculescu. An analytical approach for network-on-chip performance analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(12):2001–2013, 2010. DOI: 10.1109/tcad.2010.2061613.
[267] Umit Y. Ogras, Jingcao Hu, and Radu Marculescu. Key research problems in NoC design: A holistic perspective. In Proc. of the International Conference on Hardware/Software Codesign and System Synthesis, pages 69–74, September 2005. DOI: 10.1145/1084834.1084856.
[268] Umit Y. Ogras and Radu Marculescu. “It’s a small world after all”: NoC performance optimization via long-range link insertion. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(7):693–706, July 2006. DOI: 10.1109/tvlsi.2006.878263.
[269] Umit Y. Ogras, Radu Marculescu, Diana Marculescu, and Eun Gu Jung. Design and management of voltage-frequency island partitioned networks-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(3):330–341, 2009. DOI: 10.1109/tvlsi.2008.2011229.
[270] Jungju Oh, Milos Prvulovic, and Alenka Zajic. TLSync: Support for multiple fast barriers
using on-chip transmission lines. In Proc. of the 38th Annual International Symposium on
Computer Architecture, ISCA ’11, pages 105–116, 2011. DOI: 10.1145/2000064.2000078.
[271] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kun-Yung Chang. The case for a single-chip multiprocessor. In Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2–11, October 1996. DOI: 10.1145/237090.237140.
[272] Network-on-Chip (NoC) Blog. https://networkonchip.wordpress.com/2011/02/22/simulators/.
[273] Marta Ortin, Dario Suarez, Maria Villarroya, Cruz Izu, and Victor Vinals. Dynamic
construction of circuits for reactive traffic in homogeneous CMPs. In Design, Automation
and Test in Europe Conference and Exhibition (DATE), 2014, pages 1–4. IEEE, 2014. DOI:
10.7873/date.2014.254.
[274] Ralph H. Otten and Robert K. Brayton. Planning for performance. In Proc. of the Conference on Design Automation, pages 122–127, June 1998. DOI: 10.1109/dac.1998.724452.
[275] Jin Ouyang and Yuan Xie. LOFT: A high performance network-on-chip providing qual-
ity of service support. In Proc. of the International Symposium on Microarchitecture, 2010.
DOI: 10.1109/micro.2010.21.
[276] John D. Owens, William J. Dally, Ron Ho, D. N. Jayasimha, Stephen W. Keckler, and Li-Shiuan Peh. Research challenges for on-chip interconnection networks. IEEE Micro, Special Issue on On-Chip Interconnects for Multicores, 27(5):96–108, September/October 2007. DOI: 10.1109/mm.2007.4378787.
[278] Maurizio Palesi, Rickard Holsmark, Shashi Kumar, and Vincenzo Catania. A methodology for design of application specific deadlock-free routing algorithms for NoC systems. In Proc. of the International Conference on Hardware/Software Codesign and System Synthesis, pages 142–147, October 2006. DOI: 10.1145/1176254.1176289.
[279] Yan Pan, John Kim, and Gokhan Memik. FlexiShare: Energy-efficient nanophotonic crossbar architecture through channel sharing. In International Symposium on High Performance Computer Architecture, 2010. DOI: 10.1109/hpca.2010.5416626.
[280] Yan Pan, Prabhat Kumar, John Kim, Gokhan Memik, Yu Zhang, and Alok Choudhary.
Firefly: Illuminating future network-on-chip with nanophotonics. In Proc. of the Interna-
tional Symposium on Computer Architecture, June 2009. DOI: 10.1145/1555754.1555808.
[281] Michael K. Papamichael and James C. Hoe. CONNECT: Re-examining conventional wisdom for designing NoCs in the context of FPGAs. In Proc. of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 37–46, 2012. DOI: 10.1145/2145694.2145703.
[282] Ritesh Parikh and Valeria Bertacco. uDIREC: Unified diagnosis and reconfiguration for
frugal bypass of NoC faults. In International Symposium on Microarchitecture, 2013. DOI:
10.1145/2540708.2540722.
[283] Ritesh Parikh, Reetuparna Das, and Valeria Bertacco. Power-aware NoCs through rout-
ing and topology reconfiguration. In 2014 51st ACM/EDAC/IEEE Design Automation
Conference (DAC), pages 1–6. IEEE, 2014. DOI: 10.1109/dac.2014.6881489.
[284] Ritesh Parikh, Rawan Abdel Khalek, and Valeria Bertacco. Formally enhanced runtime
verification to ensure NoC functional correctness. In International Symposium on Microar-
chitecture, 2011. DOI: 10.1145/2155620.2155668.
[285] Dongkook Park, Reetuparna Das, Chrysostomos Nicopoulos, Jongman Kim, N. Vijaykr-
ishnan, Ravishankar Iyer, and Chita R. Das. Design of a dynamic priority-based fast path
architecture for on-chip interconnects. In Proc. of the 15th IEEE Symposium on High-
Performance Interconnects, pages 15–20, August 2007. DOI: 10.1109/hoti.2007.1.
[286] Dongkook Park, Soumya Eachempati, Reetuparna Das, Asit K. Mishra, Yuan Xie, N. Vi-
jaykrishnan, and Chita R. Das. MIRA: A multi-layered on-chip interconnect router ar-
chitecture. In Proc. of the International Symposium on Computer Architecture, pages 251–261,
June 2008. DOI: 10.1109/isca.2008.13.
[287] S. Park, T. Krishna, C.-H. O. Chen, B. Daya, A. P. Chandrakasan, and L.-S. Peh. Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI. In Proc. of the ACM/EDAC/IEEE Design Automation Conference (DAC), pages 398–405, 2012. DOI: 10.1145/2228360.2228431.
[288] Sunghyun Park, Masood Qazi, Li-Shiuan Peh, and Anantha P. Chandrakasan.
40.4fJ/bit/mm low-swing on-chip signaling with self-resetting logic repeaters embedded
within a mesh NoC in 45nm SOI CMOS. In Design, Automation and Test in Europe
(DATE), pages 1637–1642, 2013. DOI: 10.7873/date.2013.332.
[289] Sudeep Pasricha and Nikil Dutt. On-Chip Communication Architectures: System on Chip
Interconnect. Morgan Kaufmann, 2008.
[290] Li-Shiuan Peh and William J. Dally. Flit-reservation flow control. In Proc. of the 6th
International Symposium on High Performance Computer Architecture, pages 73–84, February
2000. DOI: 10.1109/hpca.2000.824340.
[291] Li-Shiuan Peh and William J. Dally. A delay model and speculative architecture for
pipelined routers. In Proc. of the International Symposium on High Performance Computer
Architecture, pages 255–266, January 2001. DOI: 10.1109/hpca.2001.903268.
[292] Li-Shiuan Peh and William J. Dally. A delay model for router microarchitectures. IEEE
Micro, 21(1):26–34, January 2001. DOI: 10.1109/40.903059.
[293] D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey,
H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille,
S. Posluszny, M. Riley, D. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel,
D. Wendel, and K. Yazawa. Overview of the architecture, circuit design, and physical
implementation of a first-generation Cell processor. IEEE Journal of Solid-State Circuits,
41(1):179–196, 2006. DOI: 10.1109/jssc.2005.859896.
[294] Timothy Mark Pinkston and José Duato. Appendix F: Interconnection networks. In
John L. Hennessy and David A. Patterson, Eds., Computer Architecture: A Quantitative
Approach, pages 1–114. Elsevier Publishers, 5th edition, September 2011.
[296] Andreas Prodromou, Andreas Panteli, Chrysostomos Nicopoulos, and Yiannakis Sazeides. NoCAlert: An on-line and real-time fault detection mechanism for network-on-chip architectures. In International Symposium on Microarchitecture, 2012. DOI: 10.1109/micro.2012.15.
[298] V. Puente, C. Izu, R. Beivide, J. A. Gregorio, F. Vallejo, and J. M. Prellezo. The adaptive bubble router. Journal of Parallel and Distributed Computing, 61(9):1180–1208, 2001. DOI: 10.1006/jpdc.2001.1746.
[299] Antonio Pullini, Federico Angiolini, Davide Bertozzi, and Luca Benini. Fault tolerance overhead in network-on-chip flow control schemes. In Proc. of the Symposium on Integrated Circuits and Systems Design, pages 224–229, September 2005. DOI: 10.1109/sbcci.2005.4286861.
[300] Martin Radetzki, Chaochao Feng, Xueqian Zhao, and Axel Jantsch. Methods for fault
tolerance in networks-on-chip. ACM Computing Surveys, 46(1):8:1–8:38, July 2013. DOI:
10.1145/2522968.2522976.
[301] Mukund Ramakrishna, Paul Gratz, and Alexander Sprintson. GCA: Global congestion
awareness for load balance in networks-on-chip. In Proc. of the International Symposium on
Networks-on-Chip, April 2013. DOI: 10.1109/nocs.2013.6558405.
[302] Aniruddh Ramrakhyani and Tushar Krishna. Static bubble: A framework for deadlock-
free irregular on-chip topologies. In Proc. of the International Symposium on High Perfor-
mance Computer Architecture, 2017.
[303] José Renau, B. Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Karin Strauss, Smruti Sarangi, Paul Sack, and Pablo Montesinos. SESC simulator. http://sesc.sourceforge.net.
[304] Samuel Rodrigo, Jose Flich, José Duato, and Mark Hummel. Efficient unicast and
multicast support for CMPs. In Proc. of the International Symposium on Microarchitecture,
pages 364–375, November 2008. DOI: 10.1109/micro.2008.4771805.
[305] Manuel Saldana, Lesley Shannon, and Paul Chow. The routability of multiprocessor
network topologies in FPGAs. In International Workshop on System-level Interconnect Pre-
diction, 2006. DOI: 10.1145/1117278.1117290.
[306] Ahmad Samih, Ren Wang, Anil Krishna, Christian Maciocco, Charlie Tai, and Yan Solihin. Energy-efficient interconnect via router parking. In Proc. of the 19th IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 508–519, 2013. DOI: 10.1109/hpca.2013.6522345.
[307] Daniel Sanchez and Christos Kozyrakis. ZSim: fast and accurate microarchitectural
simulation of thousand-core systems. In The 40th Annual International Symposium on
Computer Architecture, ISCA’13, pages 475–486, Tel-Aviv, Israel, June 23–27, 2013. DOI:
10.1145/2508148.2485963.
[309] Graham Schelle and Dirk Grunwald. On-chip interconnect exploration for multicore
processors utilizing FPGAs. In Proceedings of the 2nd Workshop on Architecture Research
using FPGA Platforms, February 2006.
[311] Steve Scott, Dennis Abts, John Kim, and William J. Dally. The BlackWidow high-radix
Clos network. In Proc. of the International Symposium on Computer Architecture, pages 16–
27, June 2006. DOI: 10.1109/isca.2006.40.
[312] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep
Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa,
Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 archi-
tecture for visual computing. ACM Transactions on Graphics, 27, August 2008. DOI:
10.1145/1360612.1360617.
[313] Daeho Seo, Akif Ali, Won-Taek Lim, Nauman Rafique, and Mithuna Thottethodi. Near-optimal worst-case throughput routing for two-dimensional mesh networks. In Proc. of the 32nd Annual International Symposium on Computer Architecture, pages 432–443, June 2005. DOI: 10.1109/isca.2005.37.
[314] Jae-sun Seo, Ron Ho, Jon K. Lexau, Michael Dayringer, Dennis Sylvester, and David
Blaauw. High-bandwidth and low-energy on-chip signaling with adaptive pre-emphasis
in 90nm CMOS. In IEEE International Solid-State Circuits Conference, ISSCC 2010, Di-
gest of Technical Papers, pages 182–183, San Francisco, CA, February 7–11, 2010. DOI:
10.1109/isscc.2010.5433993.
[315] Korey Sewell, Ronald G. Dreslinski, Thomas Manville, Sudhir Satpathy, Nathaniel Ross Pinckney, Geoffrey Blake, Michael Cieslak, Reetuparna Das, Thomas F. Wenisch, Dennis Sylvester, David Blaauw, and Trevor N. Mudge. Swizzle-switch networks for many-core systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2(2):278–294, 2012. DOI: 10.1109/jetcas.2012.2193936.
[316] Assaf Shacham, Keren Bergman, and Luca P. Carloni. On the design of a photonic
network on chip. In Proc. of International Symposium on Networks-on-Chip, pages 53–64,
May 2007. DOI: 10.1109/nocs.2007.35.
[317] Akbar Sharifi, Emre Kultursay, Mahmut Kandemir, and Chita R. Das. Addressing end-
to-end memory access latency in NoC-based multicores. In Proc. of the International Sym-
posium on Microarchitecture, 2012. DOI: 10.1109/micro.2012.35.
[318] Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, G. Edward Suh, and Srinivas Devadas. Static virtual channel allocation in oblivious routing. In Proc. of the 3rd ACM/IEEE International Symposium on Networks-on-Chip, pages 38–43, 2009. DOI: 10.1109/nocs.2009.5071443.
[319] Arjun Singh, William J. Dally, Amit K. Gupta, and Brian Towles. GOAL:
A load-balanced adaptive routing algorithm for torus networks. In Proc. of the
International Symposium on Computer Architecture, pages 194–205, June 2003. DOI:
10.1109/isca.2003.1207000.
[320] Arjun Singh, William J. Dally, Brian Towles, and Amit K. Gupta. Locality-preserving
randomized oblivious routing on torus networks. In SPAA, pages 9–13, 2002. DOI:
10.1145/564870.564873.
[321] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sun-
daram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. Knights Land-
ing: Second-generation Intel Xeon Phi product. IEEE Micro, 36(2):34–46, 2016. DOI:
10.1109/hotchips.2015.7477467.
[322] Yong Ho Song and Timothy Mark Pinkston. A progressive approach to handling
message-dependent deadlocks in parallel computer systems. IEEE Transactions on Parallel
and Distributed Systems, 14(3):259–275, March 2003. DOI: 10.1109/tpds.2003.1189584.
[323] Sonics. SonicsGN. http://sonicsinc.com/products/on-chip-networks/sonicsgn/.
[324] Sonics Inc. http://www.sonicsinc.com/home/htm.
[325] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool, 2011. DOI: 10.2200/s00346ed1v01y201104cac016.
[326] Krishnan Srinivasan and Karam S. Chatha. A low complexity heuristic for design of
custom network-on-chip architectures. In Proc. of the Conference on Design, Automation
and Test in Europe, pages 130–135, 2006. DOI: 10.1109/date.2006.244034.
[327] S. Stergiou, F. Angiolini, D. Bertozzi, S. Carta, L. Raffo, and G. De Micheli. xpipesLite: A synthesis-oriented design flow for networks on chip. In Proc. of the Conference on Design, Automation and Test in Europe, pages 1188–1193, 2005. DOI: 10.1109/date.2005.1.
[328] C. Sun, C.-H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic. DSENT: A tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In Proc. of International Symposium on Networks-on-Chip, pages 201–210, 2012. DOI: 10.1109/nocs.2012.31.
[329] C. Sun, Mark T. Wade, Yunsup Lee, Jason S. Orcutt, Luca Alloatti, Michael S. Georgas,
Andrew S. Waterman, Jeffrey M. Shainline, Rimas R. Avizienis, Sen Lin, Benjamin R.
Moss, Rajesh Kumar, Fabio Pavanello, Amir H. Atabaki, Henry M. Cook, Albert J. Ou,
Jonathan C. Leu, Yu-Hsin Chen, Krste Asanovic, Rajeev J. Ram, Milos A. Popovic, and
Vladimir M. Stojanovic. Single-chip microprocessor that communicates directly using
light. Nature, 528(7583):534–538, 2015. DOI: 10.1038/nature16454.
[330] Chen Sun, Mark Wade, Michael Georgas, Sen Lin, Luca Alloatti, Benjamin Moss, Rajesh Kumar, Amir H. Atabaki, Fabio Pavanello, Jeffrey M. Shainline, Jason S. Orcutt, Rajeev J. Ram, Milos Popovic, and Vladimir Stojanovic. A 45 nm CMOS-SOI monolithic
photonics platform with bit-statistics-based resonant microring thermal tuning. IEEE
Journal of Solid-State Circuits, 51(4):893–907, 2016. DOI: 10.1109/jssc.2016.2519390.
[331] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. WaveScalar. In Proc. of the 36th International Symposium on Microarchitecture, pages 291–302, 2003. DOI: 10.1109/micro.2003.1253203.
[332] Yasuhiro Take, Hiroki Matsutani, Daisuke Sasaki, Michihiro Koibuchi, Tadahiro
Kuroda, and Hideharu Amano. 3D NoC with inductive-coupling links for building-block
SiPs. IEEE Transactions on Computers, 63(3):748–763, 2014. DOI: 10.1109/tc.2012.249.
[333] Y. Tamir and H. C. Chi. Symmetric crossbar arbiters for VLSI communication switches. IEEE Transactions on Parallel and Distributed Systems, 4(1):13–27, 1993. DOI: 10.1109/71.205650.
[334] Yuval Tamir and Gregory L. Frazier. Dynamically-allocated multi-queue buffers for
VLSI communication switches. IEEE Transactions on Computers, 41(6):725–737, June
1992. DOI: 10.1109/12.144624.
[335] Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, and Anant Agarwal. Scalar
operand networks: On-chip interconnect for ILP in partitioned architectures. In Proc.
of the International Symposium on High Performance Computer Architecture, pages 341–353,
February 2003. DOI: 10.1109/hpca.2003.1183551.
[336] J. Tendler, J. Dodson, J. S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Journal of Research and Development, 46(1):5–26, 2002. DOI: 10.1147/rd.461.0005.
[337] Kevin Tien, Noah Sturcken, Naigang Wang, Jae-woong Nah, Bing Dang, Eugene J. O’Sullivan, Paul S. Andry, Michele Petracca, Luca P. Carloni, William J. Gallagher, and Kenneth L. Shepard. An 82%-efficient multiphase voltage-regulator 3D interposer with on-chip magnetic inductors. In Symposium on VLSI Circuits, page 192, 2015.
[338] Brian Towles, J. P. Grossman, Brian Greskamp, and David E. Shaw. Unifying on-chip and inter-node switching within the Anton 2 network. In Proc. of the 41st Annual International Symposium on Computer Architecture (ISCA), pages 1–12, 2014. DOI: 10.1109/isca.2014.6853238.
[339] Anh Thien Tran, Dean Nguyen Truong, and Bevan M. Baas. A reconfigurable source-
synchronous on-chip network for GALS many-core platforms. IEEE Transactions on CAD
of Integrated Circuits and Systems, 29(6):897–910, 2010. DOI: 10.1109/tcad.2010.2048594.
[340] Marc Tremblay and Shailender Chaudhry. A third-generation 65nm 16-core 32-thread
plus 32-scout-thread CMT SPARC processor. In Proc. of the International Solid-State
Circuits Conference, 2008. DOI: 10.1109/isscc.2008.4523067.
[341] Sebastian Turullols and Ram Sivaramakrishnan. SPARC T5: 16-core CMT processor with glueless 1-hop scaling to 8 sockets. In Hot Chips 24 Symposium (HCS), pages 1–37, 2012. DOI: 10.1109/hotchips.2012.7476504.
[342] L. G. Valiant and G. J. Brebner. Universal schemes for parallel communication. In Proc.
of the 13th Annual ACM Symposium on Theory of Computing, pages 263–277, 1981. DOI:
10.1145/800076.802479.
[343] J.W. van den Brand, C. Ciordas, K. Goossens, and T. Basten. Congestion-controlled
best-effort communication for networks-on-chip. In Proc. of the Conference on Design, Au-
tomation and Test in Europe, pages 948–953, April 2007. DOI: 10.1109/date.2007.364415.
[344] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer,
A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-
tile 1.28 TFLOPS network-on-chip in 65nm CMOS. In Proc. of the IEEE Inter-
national Solid-State Circuits Conference (ISSCC), pages 98–99, February 2007. DOI:
10.1109/isscc.2007.373606.
[345] Dana Vantrease, Robert Schreiber, Matteo Monchiero, Moray McLaren, Norman P.
Jouppi, Marco Fiorentino, Al Davis, Nathan L. Binkert, Raymond G. Beausoleil, and
Jung Ho Ahn. Corona: System implications of emerging nanophotonic technology.
In International Symposium on Computer Architecture, pages 153–164, June 2008. DOI:
10.1109/isca.2008.35.
[346] Anja von Beuningen and Ulf Schlichtmann. PLATON: A force-directed placement algorithm for 3D optical networks-on-chip. In International Symposium on Physical Design, 2016. DOI: 10.1145/2872334.2872356.
[347] S. Warnakulasuriya and T. M. Pinkston. Characterization of deadlocks in k-ary n-cube networks. IEEE Transactions on Parallel and Distributed Systems, 10(9):904–921, 1999. DOI: 10.1109/71.798315.
[348] S. Warnakulasuriya and T. M. Pinkston. A formal model of message blocking and deadlock resolution in interconnection networks. IEEE Transactions on Parallel and Distributed Systems, 11(3):212–229, 2000. DOI: 10.1109/71.841739.
[349] Danyao Wang, Natalie Enright Jerger, and J. Gregory Steffan. DART: A programmable
architecture for NoC simulation on FPGAs. In International Network on Chip Symposium
(NOCS), pages 145–152, 2011. DOI: 10.1145/1999946.1999970.
[350] Danyao Wang, Charles Lo, Jasmina Vasiljevic, Natalie Enright Jerger, and J. Gregory Steffan. DART: A programmable architecture for NoC simulation on FPGAs. IEEE Transactions on Computers, 63(3):664–678, 2014. DOI: 10.1109/tc.2012.121.
[351] Hang-Sheng Wang, Li-Shiuan Peh, and Sharad Malik. Power-driven design of router
microarchitectures in on-chip networks. In Proc. of the 36th International Symposium on
Microarchitecture, pages 105–116, November 2003. DOI: 10.1109/micro.2003.1253187.
[352] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A
power-performance simulator for interconnection networks. In Proc. of the 35th Interna-
tional Symposium on Microarchitecture, pages 294–305, November 2002. DOI: 10.1109/mi-
cro.2002.1176258.
[353] L. Wang, P. Kumar, K.H. Yum, and E.J. Kim. APCR: An adaptive physical channel
regulator for on-chip interconnects. In Proc. of the International Conference on Parallel Ar-
chitecture and Compilation Techniques, 2012. DOI: 10.1145/2370816.2370830.
[354] Lei Wang, Yuho Jin, Hyungjun Kim, and Eun Jung Kim. Recursive partitioning mul-
ticast: A bandwidth-efficient routing for on-chip networks. In Proc. of the International
Symposium on Networks-on-Chip, May 2009. DOI: 10.1109/nocs.2009.5071446.
[355] Ruisheng Wang, Lizhong Chen, and Timothy Mark Pinkston. Bubble coloring:
Avoiding routing- and protocol-induced deadlocks with minimal virtual channel require-
ment. In Proc. of the 27th International ACM Conference on International Conference on
Supercomputing, ICS ’13, pages 193–202, New York, NY, USA, 2013. ACM. DOI:
10.1145/2464996.2465436.
[356] David Wentzlaff, Patrick Griffin, Henry Hoffman, Liewei Bao, Bruce Edwards, Carl
Ramey, Matthew Mattina, Chyi-Chang Miao, John Brown III, and Anant Agarwal. On-
chip interconnection architecture of the Tile processor. IEEE Micro, 27(5):15–31, 2007.
DOI: 10.1109/mm.2007.4378780.
[357] Daniel Wiklund and Dake Liu. SoCBus: Switched network on chip for hard real time embedded systems. In Proc. of the International Parallel and Distributed Processing Symposium, pages 8–16, April 2003. DOI: 10.1109/ipdps.2003.1213180.
[359] P. Wolkotte, P. Holzenspies, and G. Smit. Fast, accurate and detailed NoC simulations.
In International Symposium on Networks on Chip, 2007. DOI: 10.1109/nocs.2007.18.
[360] Pascal T. Wolkotte, Gerard J. M. Smit, Gerard K. Rauwerda, and Lodewijk T. Smit. An energy-efficient reconfigurable circuit-switched network-on-chip. In Proc. of the 19th International Parallel and Distributed Processing Symposium, pages 155–162, 2005. DOI: 10.1109/ipdps.2005.95.
[361] Jae-Yeon Won, Xi Chen, Paul V. Gratz, Jiang Hu, and Vassos Soteriou. Up by their
bootstraps: Online learning in artificial neural networks for CMP uncore power manage-
ment. In International Symposium on High Performance Computer Architecture, 2014. DOI:
10.1109/hpca.2014.6835941.
[362] Steven C. Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta.
The SPLASH-2 programs: Characterization and methodological considerations. In Proc.
of the International Symposium on Computer Architecture, pages 24–36, June 1995. DOI:
10.1109/isca.1995.524546.
[363] Frédéric Worm, Paolo Ienne, Patrick Thiran, and Giovanni De Micheli. An adaptive
low-power transmission scheme for on-chip networks. In ISSS ’02: Proceedings of the 15th
international symposium on System Synthesis, pages 92–100, New York, NY, USA, 2002.
ACM. DOI: 10.1109/isss.2002.1227158.
[364] L. Wu, A. Lottarini, T. Paine, M. A. Kim, and K. A. Ross. Q100: The architecture and
design of a database processing unit. In International Conference on Architectural Support for
Programming Languages and Operating Systems, 2014. DOI: 10.1145/2541940.2541961.
[365] Yuan Xie, Jason Cong, and Sachin Sapatnekar. Three-dimensional IC: Design, CAD, and
Architecture. Springer, 2009. DOI: 10.1007/978-1-4419-0784-4.
[366] Yi Xu, Yu Du, Bo Zhao, Xiuyi Zhou, Youtao Zhang, and Jun Yang. A low-
radix and low-diameter 3D interconnection network design. In International Sym-
posium on High Performance Computer Architecture, pages 30–42, February 2009. DOI:
10.1109/hpca.2009.4798234.
[367] Yi Xu, Bo Zhao, Youtao Zhang, and Jun Yang. Simple virtual channel allocation for high
throughput and high frequency on-chip routers. In Proc. of the International Symposium on
High Performance Computer Architecture, 2010. DOI: 10.1109/hpca.2010.5416640.
[368] Haofan Yang, Jyoti Tripathi, Natalie Enright Jerger, and Dan Gibson. Dodec: Random-
link, low-radix on-chip networks. In Proc. of the International Symposium on Microarchitec-
ture, 2014. DOI: 10.1109/micro.2014.19.
[369] Yuan Yao and Zhonghai Lu. DVFS for NoCs in CMPs: A thread voting approach. In
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA),
pages 309–320. IEEE, 2016. DOI: 10.1109/hpca.2016.7446074.
[370] Yuan Yao and Zhonghai Lu. Memory-access aware DVFS for network-on-chip in
CMPs. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE),
pages 1433–1436. IEEE, 2016. DOI: 10.3850/9783981537079_0455.
[371] Jieming Yin, Onur Kayiran, Matthew Poremba, Gabriel Loh, and Natalie Enright
Jerger. Efficient synthetic traffic models for large complex SoCs. In Proc. of
the International Symposium on High Performance Computer Architecture, 2016. DOI:
10.1109/hpca.2016.7446073.
[372] Jieming Yin, Pingqiang Zhou, Sachin S. Sapatnekar, and Antonia Zhai. Energy-efficient time-division multiplexed hybrid-switched NoC for heterogeneous multicore systems. In Proc. of the 28th IEEE International Parallel and Distributed Processing Symposium, pages 293–303, 2014. DOI: 10.1109/ipdps.2014.40.
[373] Kunzhi Yu, Cheng Li, Hao Li, Alex Titriku, Ayman Shafik, Binhao Wang, Zhongkai
Wang, Rui Bai, Chin-Hui Chen, Marco Fiorentino, Patrick Yin Chiang, and Samuel
Palermo. A 25 Gb/s hybrid-integrated silicon photonic source-synchronous receiver with
microring wavelength stabilization. IEEE Journal of Solid State Circuits, 51(9):2129–2141,
September 2016. DOI: 10.1109/jssc.2016.2582858.
[374] Bilal Zafar, Timothy Mark Pinkston, Aurelio Bermúdez, and José Duato. Deadlock-
free dynamic reconfiguration over Infiniband™ networks. Parallel Algorithms Applications,
19(2–3):127–143, 2004. DOI: 10.1080/10637190410001725463.
[375] Lihang Zhao, Woojin Choi, Lizhong Chen, and Jeff Draper. In-network traffic regula-
tion for transactional memory. In International Symposium on High Performance Computer
Architecture, 2013. DOI: 10.1109/hpca.2013.6522346.
[376] Zhiping Zhou, Bing Yin, and Jurgen Michel. On-chip light sources for silicon photonics. Light: Science & Applications, 4:e358, November 2015. DOI: 10.1038/lsa.2015.131.
[377] A. K. Ziabari, J. L. Abellan, R. Ubal Tena, C. Chen, A. Joshi, and D. Kaeli. Lever-
aging silicon-photonic NoC for designing scalable GPUs. In International Conference on
Supercomputing, 2015. DOI: 10.1145/2751205.2751229.
[378] Amir Kavyan Ziabari, Jose L. Abellan, Yenai Ma, Ajay Joshi, and David Kaeli. Asym-
metric NoC architectures for GPU systems. In Proc. of the International Symposium on
Networks on Chip, 2015. DOI: 10.1145/2786572.2786596.
[379] Arslan Zulfiqar, Pranay Koka, Herb Schwetman, Mikko Lipasti, Xuezhe Zheng, and
Ashok V. Krishnamoorthy. Wavelength stealing: An opportunistic approach to channel
sharing in multi-chip photonic interconnects. In International Symposium on Microarchi-
tecture, 2013. DOI: 10.1145/2540708.2540728.
Authors’ Biographies
NATALIE ENRIGHT JERGER
Natalie Enright Jerger is an Associate Professor and the Percy Edward Hart Professor of Elec-
trical and Computer Engineering in the Edward S. Rogers Sr. Department of Electrical and
Computer Engineering at the University of Toronto. She completed her Ph.D. at the University
of Wisconsin-Madison in 2008. She received her Master of Science degree from the Univer-
sity of Wisconsin-Madison and Bachelor of Science in Computer Engineering from Purdue
University in 2004 and 2002, respectively. Her research interests include multi- and many-core
architectures, on-chip networks, cache coherence protocols, memory systems, and approximate
computing. Her research is supported by NSERC, Intel, CFI, AMD, and Qualcomm. She was awarded an Alfred P. Sloan Research Fellowship in 2015, the Borg Early Career Award in 2015, induction into the MICRO Hall of Fame in 2015, the Ontario Professional Engineers Young Engineer Medal in 2014, and the Ontario Ministry of Research and Innovation Early Researcher Award in 2012.
TUSHAR KRISHNA
Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineer-
ing at the Georgia Institute of Technology. He received a Ph.D. in Electrical Engineering and
Computer Science from the Massachusetts Institute of Technology in 2014. Prior to that, he received an M.S.E. in Electrical Engineering from Princeton University in 2009 and a B.Tech. in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi in 2007. Before
joining Georgia Tech in 2015, he worked as a researcher in the VSSAD Group at Intel, Mas-
sachusetts. His research interests span computer architecture, on-chip networks, heterogeneous
SoCs, deep learning accelerators, and cloud networks.
LI-SHIUAN PEH
Li-Shiuan Peh is Provost’s Chair Professor in the Department of Computer Science of the
National University of Singapore, with a courtesy appointment in the Department of Electrical
and Computer Engineering since September 2016. Previously, she was Professor of Electrical Engineering and Computer Science at MIT, where she had been on the faculty since 2009. She
was also the Associate Director for Outreach of the Singapore-MIT Alliance for Research and Technology (SMART). Prior to MIT, she was on the faculty of Princeton University from
2002. She graduated with a Ph.D. in Computer Science from Stanford University in 2001, and
a B.S. in Computer Science from the National University of Singapore in 1995. Her research
focuses on networked computing, in many-core chips as well as mobile wireless systems. She was named an IEEE Fellow in 2017, and received the NRF Returning Singaporean Scientist Award in 2016, the ACM Distinguished Scientist Award in 2011, induction into the MICRO Hall of Fame in 2011, the CRA Anita Borg Early Career Award in 2007, a Sloan Research Fellowship in 2006, and the NSF CAREER award in
2003.