6Design-Space Exploration and Optimization 07556373
Abstract—A 3-D network-on-chip (NoC) enables the design of high performance and low power many-core chips. Existing 3-D NoCs are inadequate for meeting the ever-increasing performance requirements of many-core processors since they are simple extensions of regular 2-D architectures and they do not fully exploit the advantages provided by 3-D integration. Moreover, the anticipated performance gain of a 3-D NoC-enabled many-core chip may be compromised due to the potential failures of through-silicon-vias that are predominantly used as vertical interconnects in a 3-D IC. To address these problems, we propose a machine-learning-inspired predictive design methodology for energy-efficient and reliable many-core architectures enabled by 3-D integration. We demonstrate that a small-world network-based 3-D NoC (3-D SWNoC) performs significantly better than its 3-D MESH-based counterparts. On average, the 3-D SWNoC shows 35% energy-delay-product improvement over 3-D MESH for the PARSEC and SPLASH2 benchmarks considered in this paper. To improve the reliability of 3-D NoC, we propose a computationally efficient spare-vertical link (sVL) allocation algorithm based on a state-space search formulation. Our results show that the proposed sVL allocation algorithm can significantly improve the reliability as well as the lifetime of 3-D SWNoC.

Index Terms—3-D network-on-chip (NoC), discrete optimization, machine-learning, small-world (SW).

I. INTRODUCTION

THREE-DIMENSIONAL ICs are capable of achieving better performance, functionality, and packaging density compared to their traditional planar counterparts [1], [2]. On the other hand, network-on-chip (NoC) enables integration of large numbers of embedded cores in a single die. Three-dimensional NoC architectures combine the benefits of these two new paradigms to offer an unprecedented performance gain [2], [3]. With freedom in the third (vertical) dimension, NoC architectures that were previously impossible or prohibitive due to wiring constraints in planar ICs are now realizable in 3-D NoC, and many 3-D implementations can outperform their 2-D counterparts. However, existing 3-D NoC architectures predominantly follow straightforward extensions of regular 2-D NoC designs, which do not fully exploit the advantages provided by the 3-D integration technology [3]. Another challenge is that the anticipated performance gain of 3-D NoC-enabled many-core chips may be compromised due to potential failures of the through-silicon-vias (TSVs) used as vertical interconnects. TSVs in a 3-D IC fail due to voids, cracks, and different kinds of fabrication challenges [4]. Additionally, the workload induced stress increases the resistance of the TSVs, which leads to different mean-time-to-failure (MTTF) for different TSVs [5], [6].

The main focus of this paper is to explore and consequently establish performance-energy-reliability tradeoffs for 3-D small-world NoC (SWNoC) [7], [8]. To this end, we make the following contributions.
1) We consider the design space of 3-D SWNoC architectures, where the vertical connections predominantly work as long-range shortcuts for SW networks. 3-D SWNoC architectures (as shown in Fig. 1) help with both energy-efficiency (small average path length) and reliability (average path length grows insignificantly due to link failures). This is the first work to exploit the advantages of 3-D integration to design a power-law-based SW network-enabled 3-D NoC architecture.
2) The design space of a 3-D SWNoC is combinatorial in nature. Hence, we leverage machine-learning techniques to intelligently explore the design space to optimize the placement of both planar and vertical communication links for high performance and energy efficiency.
3) We consider spare-vertical link (sVL) allocation to improve the reliability of the 3-D NoC. This is another combinatorial optimization problem, where we do not know the cost function. We can experimentally compute the quality (or cost) of a solution by running a simulation. We solve this problem using a state-space search formulation, where the simulations guide the search process. We leverage the structure of the problem and domain knowledge of the 3-D SWNoC to efficiently produce an sVL allocation that can significantly improve the reliability of the 3-D NoC.

Manuscript received February 25, 2016; revised May 17, 2016 and July 8, 2016; accepted August 13, 2016. Date of publication August 30, 2016; date of current version April 19, 2017. This work was supported in part by the U.S. National Science Foundation under Grant CNS 1564014, Grant CCF-0845504, Grant CNS-1059289, and Grant CCF-1162202, and in part by the Army Research Office under Grant W911NF-12-1-0373. This paper was recommended by Associate Editor S. Pasricha.
S. Das, J. R. Doppa, and P. P. Pande are with the School of Electrical Engineering and Computer Engineering, Washington State University, Pullman, WA 99163 USA (e-mail: [email protected]; [email protected]; [email protected]).
K. Chakrabarty is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2016.2604288
0278-0070 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
720 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 36, NO. 5, MAY 2017
Algorithm 1 NoC Design Optimization via STAGE
1: Input: D = design space, O = cost function,
   (I, S) = initial state and successor generation functions,
   C = network constraints, φ = feature function for NoC design,
   A = local search procedure, R = regression learner,
   MAX = maximum iterations
2: Output: d_best, the best NoC design
3: Initialization: initialize evaluation function E, training set Z, initial design d_0, O_best = O(d_0), and d_best = d_0
4: Repeat:
5:   Base search: From d_0, run the search procedure A guided by O until a local optimum is reached, leading to a search trajectory (d_0, d_1, ..., d_T).
6:   Generate training data: For each design d_i on the search trajectory, add (φ(d_i), y_i) to Z, where y_i is the best value along the search trajectory.
7:   Re-train E: E = R(Z).
8:   Meta search: From d_T, run the search procedure A guided by E until a local optimum is reached to produce the best predicted starting state d̂.
9:   Next starting state: If d̂ = d_T (no search progress), set d_0 using I. Otherwise, set d_0 = d̂.
10:  Update O_best and d_best if y* < O_best, where y* is the best value encountered during base search and meta search.
11: Until MAX iterations or convergence.
12: Return best design d_best.

TABLE I
FEATURE DESCRIPTION

A. Challenges
The main challenges in applying STAGE to 3-D NoC design are as follows.
1) We need to define additional features of the optimization problem that can be exploited to learn improved evaluation functions for efficient design space exploration. We provide these features for 3-D NoC designs (Table I), but they can be adapted to other types of NoC designs as well.
2) Defining appropriate search spaces by leveraging the domain knowledge can potentially improve the effectiveness of the STAGE algorithm. We need to identify a good starting state distribution (subset of initial 3-D NoC design solutions) and search operators (actions to get successor states from a given state) to navigate the design space. We have explored γ-greedy for the starting state distribution with the hope of improving over a random starting state distribution (see "starting states and successor function" below).
3) We need to find a good knowledge representation for the evaluation function E that is expressive, can be trained quickly, and makes fast predictions. We picked regression trees (RTs) as they satisfy all the requirements.

B. Instantiation for 3-D NoC Optimization
In this section, we provide all the details needed to apply the STAGE algorithm to our 3-D NoC optimization problem.
1) Design Space: Our design space depends on a set of network resources, which are given as input to the optimization algorithm. These resources are defined as follows.
   1) Cores (C): A set of all cores C = {C_1, C_2, ..., C_N}, where N is the total number of cores. We assume that every core is connected to at least one switch.
   2) Planar Dies (P): A set of all dies P. For N = 64, we consider four dies with each die containing 16 cores. For core placement, we follow a greedy algorithm to minimize $\sum (f_{ij} \ast d_{ij})$, where f_ij and d_ij are the communication frequency and Cartesian distance between the cores, respectively. In this step, we form clusters with 16 cores in each die.
   3) Link Distribution (L): The link length distribution L = {l_1, l_2, ..., l_k}, where k depends on the size and topology of the network; the l_i's are determined based on the SW connectivity parameter α. For higher values of α, l_k decreases.
   4) Communication Frequency (F): The communication frequency among different cores F = {f_ij | 1 ≤ i, j ≤ N, i ≠ j}. We assume that F for each application is given as an input to perform application-specific network optimization.
2) Objective Function O: We define O as the communication cost of the given 3-D NoC, which is the product of hop count, frequency of communication, and link length summed over every source and destination pair, that is

$$O = \sum_{i=1}^{N} \sum_{j=1,\, i \neq j}^{N} \left( r \ast h_{ij} + d_{ij} \right) \ast f_{ij} \qquad (1)$$

where f_ij and d_ij are defined as above; h_ij is the hop count between the ith and jth nodes, and r denotes the number of switch stages. From a practical point-of-view, r is the number of cycles a message spends inside a switch to move from input to output port. An NoC design with low O will have low latency and energy consumption, and hence, low energy-delay-product (EDP).
3) Network Constraints: To explore only physically feasible 3-D NoC designs, we enforce some constraints on the placement of VLs and switch configurations. If TSVs are considered as the VLs, we only allow placing them point-to-point (regularly) between the switches. Such constraints may put additional limits on the performance of NoC designs. However, efficient optimization can overcome such limitations. The SW network has an irregular connectivity. Hence, the number of links connected to each switch is not constant. For fair comparison between our SW network and 3-D MESH, we assume that both of them use the same average number of connections, <k_avg>, per switch. This also ensures that the 3-D SW NoC does not introduce additional links compared to a 3-D MESH. For a 64-core system, <k_avg> is 4.5 considering all the switches, including the peripheral ones. In addition, the maximum connectivity per node, <k_max>, is set to be 7 for the SW network as found in [33].
4) Starting States and Successor Function: For starting states, we randomly generate an SW network that satisfies the network constraints. The successor function S takes a network as input and returns a set of next states, and allows the search procedure to navigate the NoC design space. S generates one candidate state for each link connecting two nodes in the input network. It simply removes that link and places a link with the same length between two nodes in the NoC that are not directly connected.
The STAGE algorithm can benefit if we can specify the starting state distribution using some domain knowledge.
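As an illustration of the objective function O and the successor function S described in this section, the following minimal Python sketch may help. It reads Eq. (1) as (r * h_ij + d_ij) * f_ij, which is an assumption about the grouping, and it picks the first unconnected node pair of equal length when rewiring, which is also an assumption (the text does not say how the pair is chosen). The function names, the `length` callback, and the toy four-node example are hypothetical, and network constraints such as <k_max> are not enforced here.

```python
from collections import deque
from itertools import combinations


def hop_counts(n, links):
    """All-pairs hop counts via breadth-first search on an undirected link set."""
    adj = {u: set() for u in range(n)}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    hops = {}
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        hops[src] = dist
    return hops


def communication_cost(n, links, freq, dist, r):
    """Sum of (r * h_ij + d_ij) * f_ij over communicating pairs (assumed grouping)."""
    hops = hop_counts(n, links)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and freq[i][j] > 0:
                # Unreachable pairs get an infinite hop count.
                h = hops[i].get(j, float("inf"))
                total += (r * h + dist[i][j]) * freq[i][j]
    return total


def successors(n, links, length):
    """Yield one candidate per link: move it to an unconnected pair of equal length."""
    link_set = {frozenset(link) for link in links}
    for old in list(link_set):
        a, b = tuple(old)
        for u, v in combinations(range(n), 2):
            cand = frozenset((u, v))
            # Assumption: take the first matching pair; a random pick is equally valid.
            if cand not in link_set and length(u, v) == length(a, b):
                yield (link_set - {old}) | {cand}
                break


# Toy example: four switches on a line, one shortcut, traffic only from node 0 to node 3.
n = 4
dist = [[abs(i - j) for j in range(n)] for i in range(n)]
freq = [[0] * n for _ in range(n)]
freq[0][3] = 1
line = [(0, 1), (1, 2), (2, 3)]
with_shortcut = line + [(0, 2)]
cost_line = communication_cost(n, line, freq, dist, r=3)
cost_shortcut = communication_cost(n, with_shortcut, freq, dist, r=3)
```

In this toy case the shortcut lowers the hop count between the only communicating pair, so `cost_shortcut` is smaller than `cost_line`. A local search procedure would evaluate each state produced by `successors` with `communication_cost` in the same way.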
DAS et al.: DESIGN-SPACE EXPLORATION AND OPTIMIZATION OF AN ENERGY-EFFICIENT AND RELIABLE 3-D SWNoC 723
Therefore, we also consider a starting-state distribution, named γ-greedy. We formulate the starting state (design) construction as a sequential decision-making task, where we select the next link to be placed at each step. In the γ-greedy distribution, we select a link greedily with probability γ based on communication frequency and a random link with probability (1 − γ). We start with γ = 1 (completely greedy) and gradually reduce γ to increase the randomness.
5) Local Search Procedure A: We employed a stochastic hill-climbing procedure, where the next states are sampled stochastically.
6) Feature Function φ: The main challenge in adapting STAGE to our NoC domain is to define a set of features φ for each network that can drive the learner. We divide the whole network into several overlapping subgraphs or regions, and define a set of features that can be categorized into three types.
   1) Average Hop Count (h): which calculates the average hop count for each region or subnetwork.
   2) Weighted Communication: which is defined as the sum of the products of hop count and communication frequency over all source-destination pairs for a particular hop count ($\sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} f_{ij} \ast h_k$). The highest value of k depends on the network size and topology. If the value of this feature is small, it indicates that highly communicating cores are placed in the same neighborhood.
   3) Clustering Coefficient (C_c): which captures the connectivity of one core with its neighbors [34]. While the hop count takes into account mainly long-range communication, the clustering coefficient focuses more on local connectivity among the immediate neighbors. We found these features to sufficiently capture the network characteristics, efficient to compute, and allow learning highly accurate evaluation function, E.
In this paper, for N = 64 cores, we divide the whole network into nine regions. For each region, we consider average hop counts as the features. In addition, the initial network has the highest hop count of eight, and hence, we require eight features for weighted communication cost. Finally, for each die in the network, we consider the average clustering coefficient and it gives rise to four more features. Table I lists all these features.
7) Regression Learner: The quality of our optimization methodology depends on the accuracy of the evaluation function E. We can employ any regression learning algorithm, e.g., k nearest-neighbor, linear regression, support vector regression, and RT. However, a regression learner that is nonlinear, fast in terms of training time and prediction time will improve the effectiveness of the STAGE algorithm. Therefore, analytically, the RT learner suits our needs the best.
Our training data consists of a set of input–output pairs $\{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^m$ is a feature vector and $y_i \in \mathbb{R}$ is the corresponding output. The RT learning algorithm tries to learn a function E in the form of tree (a set of if-then rules) to minimize the deviation of the predicted output E(x_i) from the correct output y_i. The key idea in RT learning is to recursively partition the input space (as in hierarchical clustering) until we find regions that have very similar output values. The recursive partitioning is represented as a tree, where leaves correspond to the cells of the partition. Each leaf is assigned the sample mean of all the output variables in that cell as its prediction. During testing, we find the cell of the partition that input x belongs to through a series of comparison questions on the features, and return the prediction associated with that cell. RTs also allow us to identify the features that are important in making predictions.
We employed the WEKA machine-learning toolkit [35] to train RTs over training set Z, and tune the hyper-parameters using validation data.

V. SPARE-VERTICAL LINK ALLOCATION

The anticipated performance gain of 3-D NoC-enabled many-core chips can be compromised due to potential failures of the TSVs that are mainly used as vertical interconnects in a 3-D IC. Workload induced stress is one of the main reasons for the failure of TSV-based VLs in 3-D IC. Stress increases the resistance of the TSVs, which leads to different MTTF for different TSV-based VLs [4], [5]. The TSV failure model is described in detail in the longer version of this paper [30]. The performance of the 3-D NoC degrades over time, leading to eventual failure of the chip. Therefore, we consider the allocation of sVLs as a way to improve the reliability of the 3-D NoC.

A. Spare VL Allocation Problem
Given a set of m functional VLs F and budget size of sVLs n (n > 0, n << m), we want to select the subset of n functional VLs out of m those when provided with one sVL each will maximize the reliability (lifetime) of the 3-D NoC. We can experimentally compute the quality of a given sVL allocation solution by running a simulation. This is an instance of a combinatorial optimization problem with an unknown cost function, where the quality of a given solution can be computed only by making a simulator call. Here, the term "solution" refers to a particular 3-D NoC configuration incorporated with sVLs for n functional VLs.

B. Computational Challenges
The main challenge here is that we have a huge number $\binom{m}{n}$ of possible solutions or NoC configurations to allocate sVLs among the functional links. A naïve approach is to enumerate all possible solutions; compute the quality of each solution via simulator call; and pick the best solution. However, the simulator call is expensive in terms of both time and memory requirements. Hence, this exhaustive search approach to quantify the performance and lifetime of each of the candidate configuration is infeasible for practical purposes.

C. State Space Search Formulation
We solve the sVL allocation problem using a state-space search formulation, where the simulations guide the search process. Each state in our search space is a particular NoC configuration allocated with sVLs and consists of a set S ⊆ F, where S is a partial or complete solution. Our search space is a 3-tuple <I, A, T>, where I is the initial state function that returns the initial search state S = ∅ meaning solution set is empty; A is a finite set of actions (or search operators) corresponding to growing the partial solution S by one element from F\S; and T is the terminal state predicate that maps search nodes to {1, 0} indicating whether the node is a terminal or not. Each terminal state in the search space corresponds to a complete solution (|S| = n, where |S| denotes the total number of candidates of S), while nonterminal states correspond to a partial solution (|S| < n). Thus, the decision process for constructing a complete solution corresponds to selecting a sequence of actions leading from the initial state (none of the sVLs are allocated) to a terminal state (all the n sVLs are allocated). In principle, we can employ any heuristic search procedure (e.g., greedy and beam search) guided by simulations.
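The state-space formulation above can be sketched with a greedy search procedure in a few lines of Python. This is only an illustrative sketch: `toy_lifetime` is a hypothetical stand-in for the simulator call (it assumes the chip lifetime is set by the weakest VL and that a spare doubles a VL's MTTF), and the MTTF values are made up for the example. Neither reflects the failure model used in the paper.

```python
def greedy_allocate(candidates, budget, simulate):
    """Grow the partial solution S by one VL per step until |S| equals the budget."""
    remaining = set(candidates)
    solution = set()  # initial state: no spare is allocated
    calls = 0
    while len(solution) < budget and remaining:  # terminal test: |S| == n
        best_vl, best_value = None, float("-inf")
        for vl in sorted(remaining):  # one action per element of F \ S
            value = simulate(solution | {vl})
            calls += 1
            if value > best_value:
                best_vl, best_value = vl, value
        solution.add(best_vl)
        remaining.remove(best_vl)
    return solution, calls


# Hypothetical MTTF values (arbitrary units) for four functional VLs.
MTTF = {0: 10, 1: 3, 2: 5, 3: 8}


def toy_lifetime(spared):
    """Stand-in for the simulator: lifetime is limited by the weakest VL."""
    return min(t * (2 if vl in spared else 1) for vl, t in MTTF.items())


chosen, n_calls = greedy_allocate(MTTF.keys(), budget=2, simulate=toy_lifetime)
```

With these made-up values the greedy procedure first spares the VL with the lowest MTTF and then the next weakest one, using m + (m − 1) = 7 calls for m = 4 and n = 2. Passing only a subset of the VLs as `candidates` restricts the search in the same way as pruning to a set of critical links would.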
Fig. 3. Nonhomogeneous VL utilization pattern of the 3-D SWNoC for the CANNEAL benchmark. The region between second and third dies is denoted by VLs numbering 17–32, and carries 45% of the total VL traffic of the four die 3-D system.

Algorithm 2 Greedy sVL Allocation
1: Input: F = set of m functional VLs,
   n = budget for spare-VLs
2: Output: S, the best set of n fVLs that get spares
3: Initialization: initialize solution set S = ∅
4: for each greedy step = 1 to n
5:   for each choice x ∈ F
6:     value(x) = simulator_call(S ∪ x)
7:   end for
8:   x* = arg max_{x ∈ F} value(x)
9:   S = S ∪ x*   // Functional VL x* gets spare
10:  F = F\x*   // x* is removed from F
11: end for
12: return S
*simulator_call is a procedure that calculates and returns the network performance and lifetime for a given NoC, benchmark suite, and routing algorithm through extensive experiments.

D. Greedy Search for Spare-VL Allocation
This is the simplest search procedure (Algorithm 2). We start with an empty solution set S. In each greedy step, we add to the solution set S the functional VL from F\S that, when provided with a spare link, improves the reliability by the maximum amount. We repeat this greedy selection step until S is a complete solution (|S| = n). The time complexity of greedy search is O(m ∗ n − n²) simulator calls.
The greedy search is able to produce highly effective sVL allocation that can significantly improve the reliability of the 3-D NoC. This effect of greedy sVL allocation was observed through experimental studies as the cost function is unknown and we need to find solutions via simulator calls. The allocation policy to allocate a spare (if sVL budget allows) to the first functional VL that fails with a given functional and sVL-based 3-D NoC configuration is highly effective. Intuitively, if we do not allocate a spare to the functional VL that is expected to fail first, it will result in a cascade of VL failures reducing the lifetime of the chip drastically.

E. Domain Knowledge for Sound Pruning
In 3-D NoC enabled many-core chips, some VLs experience heavy traffic and high utilization as the underlying routing algorithm tries to find shortest paths between source and destination cores via these links. As a result, those VLs with high utilization undergo heavier stress, and introduce additional delay in the path and fail more quickly when compared to others. Moreover, this is not an independent phenomenon: one VL failure can decrease the time to failure of a neighboring VL leading to a clustering effect as the workload of the neighboring links increases. For example, in Fig. 3, we show the traffic densities and the MTTF values of all the VLs for a 64-core and four-layer 3-D SWNoC for the CANNEAL benchmark (one of the PARSEC benchmarks with the highest traffic injection load and skewed traffic). We can see that the traffic densities of VLs 17–32 (we call this region the critical region) are significantly higher than those of the others and expectedly, their MTTF values are significantly lower.
Our key insight is that for a small budget size n (say less than the number of critical VLs), the spares should be allocated to some of the critical VLs only and there is no benefit for allocating spares to noncritical VLs (chip will fail due to the failure of all critical VLs). We can use this domain knowledge to prune the search space of possible solutions for the spare-VL allocation problem. Let H ⊆ F correspond to the critical VLs and the total number of critical VLs is h, where h = |H|. If we consider complete solutions from H only (i.e., subsets of size n from H), we can still retain the optimal solution. In other words, we get huge computational savings without losing any accuracy due to sound pruning. For exhaustive search, we can consider $\binom{h}{n}$ instead of $\binom{m}{n}$ candidate solutions, where h < m. For greedy search, we only consider the VLs from H for spare allocation.
For the rest of the experiments and analysis, we denote the baseline greedy and exhaustive search by greedy-full and exhaustive-full, respectively. In addition, these techniques enabled with domain knowledge-based pruning are named as greedy-restricted and exhaustive-restricted.

VI. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we first present the achievable performance and energy consumption profiles of our optimized 3-D SW NoC architecture. Then, we present a detailed reliability analysis in the presence of sVL insertion.

A. Experimental Setup
To evaluate the performance of different NoCs, we use a cycle-accurate NoC simulator that can simulate any regular or irregular 3-D architecture [36]. We consider a chip multiprocessor consisting of 64 cores and 64 network switches equally partitioned in four layers. In each die, 16 cores are placed in regular interval in a grid pattern [3]. The length of each packet is 64 flits and each flit consists of 32 bits. The switches are synthesized from an RTL level design using TSMC 65-nm CMOS process in Synopsys Design Vision. All switch ports have a buffer depth of two flits and each switch port has four virtual channels in case of irregular NoC. The NoC simulator uses wormhole routing, where the data flits follow the header flits once the router establishes a path. For regular 3-D mesh-based NoC, XYZ-dimension order-based routing is used. For irregular architectures such as the SW network, the topology-agnostic adaptive layered shortest path routing algorithm is adopted [37]. The energy consumption of the network switches was obtained from the synthesized netlist by running Synopsys Prime Power,
while the energy dissipated by wireline links was obtained through HSPICE simulations. We consider four SPLASH-2 benchmarks, namely, FFT, RADIX, LU, and WATER [38], and five PARSEC benchmarks, namely, DEDUP, VIPS, FLUIDANIMATE, CANNEAL, and BODYTRACK (BT) [39] in this performance evaluation. These benchmarks vary in characteristics from computation intensive to communication intensive in nature and thus are of particular interest in this paper.

B. Performance of the Optimization Algorithm
In Section IV, we described the details of the STAGE optimization algorithm for designing the 3-D SWNoC architecture. Here, we first characterize the performance of the optimization algorithm by quantifying various performance metrics of the optimized 3-D SWNoC. To evaluate the performance of STAGE algorithm, we compare it with the well-known combinatorial optimization algorithms, viz., SA [40] and genetic algorithm (GA) [41]. We evaluate the performance in terms of both the quality of solution and the convergence time.
1) STAGE vs. SA and GA: We create the initial network following the power law distribution shown in Section IV, where long-range links are placed randomly. Our goal is to find an optimized network starting from this random SW network. We call this initial NoC architecture as 3-D SW_rand. Fig. 4 shows the communication cost of the optimized network from the STAGE, SA, and GA algorithm as a function of time.
To compare the performance of these two optimization algorithms, we consider two parameters, viz., the quality of the solution and the convergence time. To make the comparison fair, we consider the same NoC configuration and apply both STAGE and SA algorithm to optimize it. We used a machine configured with Intel Core i7-4700MQ processor and 8 GB RAM running at a clock frequency of 2.4 GHz. Fig. 4 shows the cost of the best solution obtained at any particular time for SA, GA, and STAGE. We consider the best explored cost, O_best, as the quality of the optimization algorithm. It is evident that STAGE reaches O_best very fast (within 5 min). During the optimization process, the learned function E predicts an initial network configuration to start the local search procedure that can lead to lower communication cost (O). During the initial exploration phase, the error-rate is nonmonotonic and high. After a few iterations the prediction error reduces to less than 1%, and after 20 iterations, the error is almost zero (0.05%). The prediction error remained more or less the same for all the subsequent iterations. These results indicate the effectiveness of our network features φ and the regression-learning algorithm. Note that the best O-value decreases monotonically as the set of explored designs increases over the iterations. We also ran the same experiment with the γ-greedy starting state distribution as mentioned above. However, the communication cost O and the prediction error have similar characteristics as the random distribution for the benchmarks and the system size considered in this paper. Therefore, we present and discuss our results with a random starting-state distribution.
It is also seen that, both the SA and GA show similar trends in the cost function optimization. Both of them reach O_best more gradually compared to STAGE, and even after 50 min their respective O_best does not reach the same solution as STAGE. It should be noted that we have to optimize the link locations for various applications. Hence, this additional time needed by SA and GA will be a significant overhead when we have to optimize and reconfigure the SWNoC in the field. It should be noted that the final link distribution of the optimized 3-D SWNoC is the same for SA, GA, and STAGE. However, as shown in Fig. 4 the benefit of STAGE over SA and GA mainly comes from the much faster convergence time. We can conclude that STAGE algorithm is more efficient in designing an optimized SWNoC with better performance. We denote the final optimized NoC as 3-D SW_opt.
2) Characteristics of the Design (Random vs. Optimized): Now we investigate why the STAGE-based optimization algorithm is suitable for developing energy-efficient NoC architectures. In Section IV, we described the details of the feature definition (φ), to represent each network. So, we will explore how the design features change before and after the optimization process. Here, we specifically consider the role of the weighted communication feature mentioned in Section IV. Fig. 5 shows the weighted communication feature, which reveals the percentage of total communication that is constrained between two nodes separated by k hops (k ≥ 1). Careful observation of Fig. 5 shows that for 3-D SW_opt, the traffic constrained within one, two, and three hop increases compared to 3-D SW_rand. Moreover, the amount of traffic that has to traverse beyond three hops decreases.
Hence, the internode communication that takes place in less than three hops becomes more frequent. Since the average hop count of the optimized network is calculated to be 2.94, any communication below this average hop count can be considered to be efficient. Essentially, the optimized network becomes more efficient for the same objective function.
The inset in Fig. 5 shows the percentage of communication versus the number of hops, where the area under the curve denotes the weighted communication feature mentioned in Section IV. We can see that the 3-D SW_opt curve shifts toward the left, which means that on an average,
Fig. 6. (a) Normalized network latency of 3-D SWNoC compared with other 3-D NoCs. (b) Normalized energy consumption per message of 3-D SWNoC compared with other 3-D NoCs. (c) Normalized EDP of 3-D SWNoC compared with other 3-D NoCs.

TABLE II
COMPARISON OF AVERAGE HOP COUNT AND COMMUNICATION COST OF 3-D NOC ARCHITECTURES

any message in the optimized network traverses fewer hops compared to the initial network. Hence, it spends less time inside the network and occupies fewer network resources. Therefore, the STAGE-based optimization algorithm converges to an efficient architecture.
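The shift of traffic toward smaller hop counts discussed above can be illustrated with a short Python sketch that computes the percentage of total communication carried between node pairs separated by k hops. The function name and the three-node hop-count and frequency matrices are hypothetical and serve only as an example; they are not taken from the paper's benchmarks.

```python
from collections import defaultdict


def traffic_share_by_hops(hops, freq):
    """Percentage of total traffic exchanged between nodes that are k hops apart."""
    n = len(freq)
    per_hop = defaultdict(float)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                per_hop[hops[i][j]] += freq[i][j]
                total += freq[i][j]
    if total == 0:
        return {}  # no traffic at all, so no distribution to report
    return {k: 100.0 * v / total for k, v in sorted(per_hop.items())}


# Hypothetical three-node example: hop-count matrix and communication frequencies.
hops = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
freq = [[0, 6, 2],
        [0, 0, 0],
        [2, 0, 0]]
share = traffic_share_by_hops(hops, freq)
```

Comparing such a distribution before and after optimization gives the kind of leftward shift described for 3-D SW_opt: a larger share at small k means that heavily communicating cores sit closer together.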
Fig. 8. Lifetime of the 3-D SWNoC as a function of the number of sVL for the CANNEAL, DEDUP, and VIPS (from the left) benchmark for greedy-full
and exhaustive-full sVL allocation.
knowledge-based pruning, which we will introduce later). For brevity, we show results for three representative benchmarks with varying traffic patterns, viz., CANNEAL, DEDUP, and VIPS. These benchmarks are chosen because they have a wide variation in message injection rates: high (CANNEAL), medium (DEDUP), and low (VIPS). Fig. 8(a)–(c) plots the lifetime of the 3-D SWNoC with sVL allocation using the greedy-full and exhaustive-full algorithms for different numbers of sVLs for the CANNEAL, DEDUP, and VIPS benchmarks, respectively. From these figures, we can see that both greedy-full and exhaustive-full sVL allocation algorithms achieve the same lifetime for the 3-D SWNoC. Note that greedy search (as expected) takes significantly less computation time to produce the solution than exhaustive search. To explain this, we first need to understand the details of the sVL allocation procedure. To be more specific, the VL failure sequence and its effects on NoC performance need to be explored.
If any functional VL fails, then the workload of this particular VL negatively affects the neighboring VLs and, as a result, the EDP increases rapidly. Consequently, allocating sVLs to the functional VL that fails first is expected to minimize the NoC performance penalty. If an sVL is allocated without following the VL failure sequence, then the effect of the allocation may not be visible on either the EDP profile or the lifetime.
To explain this behavior in more detail, we consider the case of the CANNEAL benchmark for the 3-D SWNoC and number the 48 VLs serially from 1 to 48 for a 64-core system (as shown in Fig. 3). For 8 sVLs, the sVL allocation solution from exhaustive search corresponds to assigning spares to functional VLs numbered 26, 22, 27, 10, 42, 43, 7, and 6. Somewhat surprisingly, greedy search also produced the same sVL allocation solution. Our experimental analysis showed that the greedy search produces sVL allocation solutions that can significantly improve the reliability of the 3-D NoC. The policy of allocating a spare (if the sVL budget allows) to the first functional VL that fails in a given functional and sVL-based 3-D NoC configuration is highly effective. Intuitively, if we do not allocate a spare to the functional VL that is expected to fail first, we will be faced with a cascade of VL failures, which will reduce the lifetime of the chip drastically. For example, the VL failure sequence without any spares allocated is 26, 22, 27, 10, 32, 30, 25, 18, and so on. Greedy search allocates the first spare to functional VL 26. The VL failure sequence after assigning a spare to VL 26 is 22, 26, 27, 23, 32, 30, 31, 25, 18, and so on. Greedy search allocates the second spare to functional VL 22. Continuing this policy, greedy search assigns spares to the same set of functional VLs as the exhaustive search. We found this behavior to be consistent across all the benchmarks.

B. Domain Knowledge for Pruning the Search Space
The time to compute the sVL allocation solution grows as the number of functional VLs (m) and the number of sVLs (n) increase, for both exhaustive search and greedy search. The time complexities of exhaustive search and greedy search (in terms of the number of simulator calls) are $O\left(\binom{m}{n}\right)$ and $O(mn - n^2)$, respectively. For example, for a 64-core 3-D NoC with m = 48 and n = 8, the total solution exploration times for exhaustive and greedy search are 377,348,994q and 356q, respectively (here q corresponds to the computation time of a single simulator call, which is ∼7 min in the current experimental setup using a machine configured with an Intel Core i7-4700MQ processor and 8 GB RAM running at a clock frequency of 2.4 GHz). Therefore, our sVL allocation algorithms may not scale for large-scale 3-D NoCs. We consider using domain knowledge of the workload of different functional VLs to prune the solution space as described in Section V. In a 3-D NoC, the workload of some fVLs (say critical VLs) is much higher than that of the others and hence, their failure probabilities are higher too. Intuitively, when the sVL budget (n) is small, it is beneficial to allocate spares to some of the critical VLs only because the chip will fail due to a cascade of critical VL failures.
We select a subset of critical VLs (say H) out of the m functional VLs that we will consider for allocating spares and prune the remaining ones. Pruning can improve the computational efficiency of solving the sVL allocation problem, but may potentially compromise the accuracy of solutions depending on the amount of pruning. We can consider varying amounts of pruning from |H| = n (only one candidate solution) to |H| = m (no pruning) to trade off speed and accuracy of producing sVL allocation solutions. A simple pruning strategy to achieve this goal is as follows: rank all the functional VLs according to their workload; select the top-|H| VLs to be considered for spare allocation; prune the remaining m − |H| VLs. We can use both exhaustive search and greedy search to find the solution from this restricted set of candidate solutions. We refer to these exhaustive and greedy sVL allocation algorithms as exhaustive-restricted and greedy-restricted, respectively.
Fig. 3 shows the traffic densities of all the VLs for a 64-core 3-D SWNoC consisting of four planar layers for the CANNEAL benchmark. It can be noted that the traffic densities of some VLs (critical VLs) are significantly higher than those of the others. To identify the critical VLs, we rank VLs according to workload and select the highest-workload ones. In this particular work, we consider 16 critical VLs. This number is chosen considering the worst-case VL failure scenario where all 16 critical VLs are placed in between two adjacent planar dies and if all of them fail together, then the NoC becomes completely unrouteable. Therefore, we prune all the noncritical VLs, a total of 32 out of 48 (other than the 16 critical VLs). In other words, |H| = 16, corresponding to the 16 high-workload VLs, which is significantly smaller than m = 48 (total number of VLs). We found that with this setting, both the search algorithms with pruning produce the same sVL allocation solutions as their counterparts
DAS et al.: DESIGN-SPACE EXPLORATION AND OPTIMIZATION OF AN ENERGY-EFFICIENT AND RELIABLE 3-D SWNoC 729
Fig. 9. (a) Estimated runtime for different number of sVL: greedy-full versus greedy-restricted. (b) Estimated runtime for different number of sVL:
exhaustive-full versus exhaustive-restricted.
without any pruning (exhaustive-full and greedy-full) for different numbers of sVLs (from n = 1 up to any upper limit). In other words, we do not lose accuracy due to pruning. We do not show these results for the sake of brevity. The main benefit of pruning is that it improves the computational efficiency of producing sVL allocation solutions. As an example, Fig. 9(a) and (b) shows the estimated runtime comparison of greedy-full and greedy-restricted, and exhaustive-full and exhaustive-restricted, respectively. We can see that the computational gains due to pruning are significant, without any loss of accuracy.

C. Computing the Lifetime of 3-D SWNoC With sVL Allocation
In this section, we describe the procedure to compute the lifetime of any 3-D NoC configuration. For better understanding, we plot the EDP profile of the 3-D SWNoC with and without sVLs incorporated into it, and graphically illustrate how to calculate the lifetime of any 3-D NoC.
As defined in the earlier section, the lifetime of any 3-D NoC is the time at which the EDP value of that particular NoC equals a certain threshold value. Since the performance requirement for the NoC is application and/or user dependent, the threshold value used to compute the lifetime of the 3-D NoC will vary.
Fig. 10 illustrates the lifetime computation procedure for a 3-D SWNoC incorporating 8 sVLs for the DEDUP benchmark. This particular configuration is chosen as an example; however, the procedure is applicable to any other 3-D NoC and benchmark. For the reference purposes, the EDP profile
Fig. 10. Lifetime determination algorithm is explained with normalized EDP profile of 3-D SWNoC with and without sVL allocation. As an example, lifetime calculation for 3-D SWNoC with DEDUP benchmark has been plotted. The EDP of 3-D MESH-0 sVL (dotted line) corresponds to time t = 0 and is extended only for reference purposes.

D. Effects of Spare-VL Allocation on 3-D NoC
Whenever an sVL is allocated to a functional VL, the sVL carries the traffic when the corresponding functional VL fails. This minimizes the effect of VL failure on 3-D NoC performance degradation and essentially helps in maintaining a lower EDP value over a longer period of time. However, there exists an upper limit for the number of sVLs, beyond which sVL allocation yields no further noticeable advantage. We call this number the optimum number of sVLs.
Depending on the benchmark and NoC configuration, the optimum number of sVLs varies. In this paper, we consider the 3-D SWNoC as the testbed for evaluating the performance of sVL allocation. However, the subsequent experiments and analysis are equally applicable to other 3-D NoC architectures.
1) Optimum Number of Spare VLs: In this section, we evaluate the effects of different numbers of sVLs on the 3-D NoC performance. Fig. 11(a)–(c) demonstrates the normalized EDP of the 3-D SWNoC with time for the CANNEAL, DEDUP, and VIPS benchmarks, respectively. Similar to the previous experiments, we have considered these three benchmarks as representative of high, medium, and low injection benchmarks from the PARSEC and SPLASH-2 suites. All the EDP values are normalized with respect to the EDP of the fault-free 3-D MESH with no sVLs allocated to it at t = 0.
From these figures, we can see that the EDP remains unchanged up to a certain point and after that, it increases when the functional VLs start failing. This happens due to the fact that initially no functional VL fails and the EDP remains constant up to a certain time. Subsequently, VLs from the critical region (as defined in Section V-E) having high traffic density start failing. In such a link failure scenario, the traffic of the failed VLs is carried by the neighboring VLs along with their own traffic. This has two kinds of negative effects. First, the EDP and the network latency of the NoC increase due to a critical link failure. Second, the neighboring functional VLs
Fig. 11. (a) Normalized EDP profile for 3-D SWNoC with different number of sVLs allocation for the CANNEAL benchmark. (b) Normalized EDP profile
for the 3-D SWNoC with different number of sVLs for the DEDUP benchmark. (c) EDP profile for 3-D SWNoC with different number of sVL allocation
for the VIPS benchmark.
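The lifetime definition used with these EDP profiles (the time at which the EDP reaches a threshold) and the notion of an optimum number of sVLs can be expressed as a minimal sketch. The sampled profile, the threshold, the linear interpolation between samples, and the tolerance parameter `tol` are assumptions made for illustration and are not values or procedures taken from the paper.

```python
def lifetime_from_profile(times, edp, threshold):
    """Return the first time at which the EDP reaches the threshold.

    Linear interpolation between samples is an assumption of this sketch.
    Returns None if the profile never reaches the threshold.
    """
    for i, (t, e) in enumerate(zip(times, edp)):
        if e >= threshold:
            if i == 0 or e == threshold:
                return t
            t0, e0 = times[i - 1], edp[i - 1]
            return t0 + (threshold - e0) * (t - t0) / (e - e0)
    return None


def optimum_spares(lifetime_by_n, tol=0.0):
    """Smallest sVL count whose lifetime is within tol of the best observed."""
    best = max(lifetime_by_n.values())
    return min(n for n, life in lifetime_by_n.items() if life >= best - tol)


# Hypothetical normalized EDP samples over time (arbitrary time units).
times = [0, 1, 2, 3]
edp = [0.65, 0.65, 0.85, 1.25]
print(lifetime_from_profile(times, edp, threshold=1.0))

# Hypothetical lifetimes for several sVL budgets; saturation appears at 8.
print(optimum_spares({0: 10.0, 4: 14.0, 8: 17.0, 12: 17.0}))
```

A profile that shifts to the right on the time axis crosses the same threshold later, which is why a larger lifetime is returned for configurations with more sVLs until the saturation point is reached.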
also fail quickly, which further degrades the NoC performance. As a result, the EDP increases at a faster rate.
Another interesting result is that as the number of allocated sVLs increases, the EDP profile shifts toward the right on the time scale. This implies that the 3-D SWNoC with sVL allocation can maintain a particular EDP level for a longer period of time. As expected, the lifetime of the 3-D SWNoC also increases with sVL allocation. In addition, we can see that the difference between the EDP profiles on the time axis decreases gradually as the number of sVLs increases. For the CANNEAL benchmark, the right-most EDP profile is found for 8 sVLs. Even if we increase the number of sVLs beyond 8 for CANNEAL, the EDP profile does not shift to the right anymore. This implies that no further improvement of the EDP profile is possible, and we call this scenario the saturation effect of sVL allocation. Similarly, for the 3-D SWNoC with the DEDUP and VIPS benchmarks, the EDP profile saturates at 14 sVLs.
2) Performance of 3-D SWNoC With Partial sVL Allocation: In this section, we evaluate the performance of the 3-D SWNoC with partial sVL allocation. With partial sVL allocation, instead of allocating an sVL (a total bundle of TSVs replacing the whole VL), we only allocate some sTSVs to an fVL and compare its performance with the full sVL allocation explored earlier. For partial sVL allocation, we need to consider the cross-coupling capacitance of the individual TSVs. If we consider a grid-based layout of the TSVs in a bundle, then the centrally located TSVs will have the highest cross coupling. We replace the TSVs that are affected most by the cross coupling in this partial allocation. As a case study, we consider this partial TSV allocation for the critical fVLs only and allocate 50% and 75% of the total TSVs in an fVL. We characterize the performance of this partial TSV allocation in comparison with the full sVL allocation.
Fig. 12(a)–(c) shows the EDP profile with time of the 3-D SWNoC with partial sTSV allocation. In these figures, 3-D SW-8 sVLs denotes the performance of the 3-D SWNoC with 8-sVL allocation (complete bundle allocation), whereas 3-D SW-8 sVLs_x% denotes the performance of the 3-D SWNoC with individual sTSV allocation within the bundle (VL). For example, 3-D SW-8 sVLs_50% indicates that 50% of the TSVs within the bundle (for 8 VLs) have sTSVs. From the figures, it is clear that complete sVL allocation performs better than partial sTSV allocation. As the percentage of sTSV allocation increases, the EDP profile shifts right on the time scale and the lifetime improves consequently. It should be noted that allocating 100% sTSVs is equivalent to full sVL allocation (3-D SW-8 sVLs in the figure) and achieves the best EDP profile and maximum lifetime for the 3-D NoC.
3) Saturation of Lifetime Improvement: In this paper, we have considered a one-to-one correspondence between sVLs and functional VLs, where any sVL replaces one functional VL regardless of the workload intensity. Allocation of such sVLs increases the traffic-carrying capability of the critical VLs and improves the lifetime of the 3-D NoC. As an example, Fig. 13 plots the percentage of lifetime improvement of the 3-D SWNoC for the CANNEAL benchmark with different numbers of allocated sVLs. Note that similar lifetime improvements are observed for other benchmarks as well.
From the figure, we can see that as the number of allocated sVLs increases, the lifetime of the 3-D SWNoC also increases. Initially, the gain in lifetime is almost linear with the number of allocated sVLs; later, the incremental gain decreases and the improvement saturates after some point. Allocation of an sVL increases the combined lifetime of the particular VL (consisting of the sVL and the functional VL in this case), which helps to minimize the network latency and EDP degradation due to VL failure.
Fig. 12. (a) Normalized EDP of 3-D SWNoC for CANNEAL benchmark with 8-sVL allocation. Here, 3-D SW-8 sVLs_x% denotes the partial sVL allocation
where x denotes the percentage of total TSVs needed to enable one full VL. (b) Normalized EDP of 3-D SWNoC for DEDUP benchmark with 8-sVL allocation.
Here, 3-D SW-8 sVLs_x% denotes the partial sVL allocation where x is the percentage of total TSVs needed to enable one full VL. (c) Normalized EDP
of 3-D SWNoC for VIPS benchmark with 8-sVL allocation. Here, 3-D SW-8 sVLs_x% denotes the partial sVL allocation where x is the percentage of total
TSVs needed to enable one full VL.
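The selection of TSVs for partial sTSV allocation can be sketched as follows. The paper states only that centrally located TSVs in a grid-based bundle experience the highest cross coupling and are replaced first; the use of the 8-neighbor count as a proxy for cross-coupling capacitance, the tie-breaking by distance from the grid center, and the 4 × 4 bundle size are assumptions of this sketch.

```python
def most_coupled_tsvs(rows, cols, fraction):
    """Pick the TSV grid positions that should receive spare TSVs first.

    Assumption: cross coupling grows with the number of adjacent TSVs
    (8-neighborhood), so interior positions rank above edges and corners.
    """
    def neighbors(r, c):
        return sum(1 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols)

    def center_distance(r, c):
        return abs(r - (rows - 1) / 2) + abs(c - (cols - 1) / 2)

    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Most neighbors first; among equals, closest to the grid center first.
    cells.sort(key=lambda rc: (-neighbors(*rc), center_distance(*rc)))
    count = round(fraction * rows * cols)
    return cells[:count]


# Hypothetical 4 x 4 TSV bundle with 50% and 75% sTSV allocation.
for x in (0.50, 0.75):
    print(x, most_coupled_tsvs(4, 4, x))
```

With this proxy, the four interior positions of the hypothetical bundle are always protected first and the corner positions last, which matches the stated intent of replacing the TSVs most affected by cross coupling.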
Fig. 13. Effect of sVL allocation on 3-D SWNoC for the CANNEAL benchmark. The improvement of lifetime of 3-D SWNoC initially increases linearly and saturates beyond 8-sVL allocation. The gain is normalized with respect to the initial lifetime of 3-D SWNoC at t = 0.
In general, the most critical VLs fail early compared to the other VLs. If the sVLs are allocated to the critical VLs, then they help in significantly increasing the lifetime of the NoC. However, the lifetime gain saturates as the number of allocated sVLs crosses a certain number. This happens due to the fact that the combined lifetime of some critical VLs, even with the sVL allocation, is shorter than that of other noncritical VLs. Consequently, even if we allocate sVLs to these noncritical VLs, they do not improve the EDP beyond what is achieved already. It is important to note that similar effects are also observed for the DEDUP and VIPS benchmarks (in these cases, the saturation effect was observed for 14 sVLs). However, we have omitted plotting such repetitive results and analysis.

VIII. CONCLUSION
We proposed a robust design optimization methodology to improve the energy efficiency of 3-D NoC architectures by combining the benefits of SW networks and machine-learning techniques to intelligently explore the design space. We showed that the optimized 3-D SWNoC architecture outperforms the existing 3-D NoCs. The optimized 3-D SWNoC on average achieves 35% EDP reduction over the conventional 3-D MESH. We also demonstrated the efficacy and robustness of the 3-D SWNoC in the presence of nonhomogeneous workload-induced VL failure. The proposed 3-D SWNoC shows better resilience and EDP profile against VL failure at any instant of time compared to state-of-the-art 3-D NoCs. We also proposed an sVL allocation mechanism to address the performance degradation and lifetime shortening problem due to VL failure. We showed that with a small number of sVLs, we could exploit NoC domain knowledge to develop efficient and computationally inexpensive algorithms to explore optimal solutions. The proposed sVL allocation significantly improves the reliability and lifetime of the 3-D NoC.

REFERENCES
[1] V. F. Pavlidis and E. G. Friedman, Three-Dimensional Integrated Circuit Design. San Francisco, CA, USA: Morgan Kaufmann, 2009.
[2] A. W. Topol et al., “Three-dimensional integrated circuits,” IBM J. Res. Develop., vol. 50, no. 4.5, pp. 491–506, Jul. 2006.
[3] B. S. Feero and P. P. Pande, “Networks-on-chip in a three-dimensional environment: A performance evaluation,” IEEE Trans. Comput., vol. 53, no. 1, pp. 32–45, Jan. 2009.
[4] A.-C. Hsieh et al., “TSV redundancy: Architecture and design issues in 3-D IC,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 4, pp. 711–722, Apr. 2012.
[5] T. Frank et al., “Resistance increase due to electromigration induced depletion under TSV,” in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Monterey, CA, USA, Apr. 2011, pp. 3F.4.1–3F.4.6.
[6] Y. Cheng et al., “A novel method to mitigate TSV electromigration for 3D ICs,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Natal, Brazil, Aug. 2013, pp. 121–126.
[7] U. Y. Ogras and R. Marculescu, “‘It’s a small world after all’: NoC performance optimization via long-range link insertion,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 7, pp. 693–706, Jul. 2006.
[8] S. Das, J. R. Doppa, D. H. Kim, P. P. Pande, and K. Chakrabarty, “Optimizing 3D NoC design for energy efficiency: A machine learning approach,” in Proc. ICCAD, Austin, TX, USA, Nov. 2015, pp. 705–712.
[9] P. Jacob et al., “Predicting the performance of a 3D processor-memory chip stack,” IEEE Design Test Comput., vol. 22, no. 6, pp. 540–547, Nov./Dec. 2005.
[10] H. G. Lee, N. Chang, U. Y. Ogras, and R. Marculescu, “On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches,” ACM Trans. Design Autom. Electron. Syst., vol. 12, no. 3, pp. 1–20, Aug. 2007.
[11] I. Loi, F. Angiolini, S. Mitra, L. Benini, and S. Fujita, “Characterization and implementation of fault-tolerant vertical links for 3-D networks-on-chip,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 1, pp. 124–134, Jan. 2011.
[12] F. Li et al., “Design and management of 3D chip multiprocessors using network-in-memory,” in Proc. ISCA, Boston, MA, USA, 2006, pp. 130–141.
[13] J. Kim et al., “A novel dimensionally-decomposed router for on-chip communication in 3D architectures,” in Proc. ISCA, San Diego, CA, USA, Jun. 2007, pp. 138–149.
[14] A.-M. Rahmani et al., “High-performance and fault-tolerant 3D NoC-bus hybrid architecture using ARB-NET-based adaptive monitoring platform,” IEEE Trans. Comput., vol. 63, no. 3, pp. 734–747, Mar. 2014.
[15] C. Seiculescu, S. Murali, L. Benini, and G. De Micheli, “SunFloor 3D: A tool for networks on chip topology synthesis for 3-D systems on chips,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 12, pp. 1987–2000, Dec. 2010.
[16] P. Zhou, P.-H. Yuh, and S. S. Sapatnekar, “Application-specific 3D network-on-chip design using simulated allocation,” in Proc. ASP-DAC, Taipei, Taiwan, Jan. 2010, pp. 517–522.
[17] S. Murali, C. Seiculescu, L. Benini, and G. De Micheli, “Synthesis of networks on chips for 3D systems on chips,” in Proc. ASP-DAC, Yokohama, Japan, Jan. 2009, pp. 242–247.
[18] Y. Xu et al., “A low-radix and low-diameter 3D interconnection network design,” in Proc. Symp. HPCA, Raleigh, NC, USA, Feb. 2009, pp. 30–42.
[19] C. A. M. Macron et al., “Tiny NoC: A 3D mesh topology with router channel optimization for area and latency minimization,” in Proc. Int. Conf. VLSI Design, Mumbai, India, 2014, pp. 228–233.
[20] S. Manipatruni et al., “Wide temperature range operation of micrometer-scale silicon electro-optic modulators,” Opt. Lett., vol. 33, no. 19, pp. 2185–2187, 2008.
[21] J. Kim, F. Wang, and M. Nowak, “Method and apparatus for providing through silicon via (TSV) redundancy,” U.S. Patent US 20100295600 A1, Nov. 2010.
[22] W.-P. Tu, Y.-H. Lee, and S.-H. Huang, “TSV sharing through multiplexing for TSV count minimization in high-level synthesis,” in Proc. IEEE SOCC, Taipei, Taiwan, Sep. 2011, pp. 156–159.
[23] Y. Wang et al., “Economizing TSV resources in 3-D network-on-chip design,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 3, pp. 493–506, Mar. 2015.
[24] U. Kang et al., “8 Gb 3-D DDR3 DRAM using through-silicon-via technology,” IEEE J. Solid-State Circuits, vol. 45, no. 1, pp. 111–119, Jan. 2010.
[25] E. J. Marinissen and Y. Zorian, “Testing 3D chips containing through-silicon vias,” in Proc. IEEE Int. Test Conf., Austin, TX, USA, 2009, pp. 1–11.
[26] H.-H. S. Lee and K. Chakrabarty, “Test challenges for 3D integrated circuits,” IEEE Des. Test Comput., vol. 26, no. 5, pp. 26–35, Sep./Oct. 2009.
[27] L. Jiang, Q. Xu, and B. Eklow, “On effective TSV repair for 3D-stacked ICs,” in Proc. DATE, Dresden, Germany, Mar. 2012, pp. 793–798.
[28] T. Petermann and P. D. L. Rios, “Spatial small-world networks: A wiring-cost perspective,” arXiv:cond-mat/0501420, Jan. 2005.
[29] S. Das, D. Lee, D. H. Kim, and P. P. Pande, “Small-world network enabled energy efficient and robust 3D NoC architectures,” in Proc. GLSVLSI, Pittsburgh, PA, USA, 2015, pp. 133–138.
[30] S. Das et al., “Design-space exploration and optimization of an energy-efficient and reliable 3D small-world network-on-chip,” arXiv preprint arXiv:1608.06972, 2016.
[31] J. A. Boyan and A. W. Moore, “Learning evaluation functions to improve optimization by local search,” J. Mach. Learn. Res., vol. 1, pp. 77–112, Nov. 2000.
[32] K. D. Boese, “Cost versus distance in the traveling salesman problem,” UCLA Comput. Sci. Dept., Los Angeles, CA, USA, Tech. Rep. CSD-950018, 1995.
[33] R. Kim et al., “Energy-efficient VFI-partitioned multicore design using wireless NoC architectures,” in Proc. CASES, Uttar Pradesh, India, Oct. 2014, pp. 1–9.
[34] M. D. Humphries, K. Gurney, “Network ‘small-world-ness’: A quantitative method for determining canonical network equivalence,” J. PLoS One, vol. 3, no. 4, Apr. 2008, Art. no. e0002051.
[35] WEKA Toolkit. Accessed on Feb. 22, 2016. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[36] P. Wettin et al., “Design space exploration for wireless NoCs incorporating irregular network routing,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 33, no. 11, pp. 1732–1745, Nov. 2014.
[37] O. Lysne, T. Skeie, S.-A. Reinemo, and I. Theiss, “Layered routing in irregular networks,” IEEE Trans. Parallel Distrib. Syst., vol. 17, no. 1, pp. 51–65, Jan. 2006.
[38] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proc. Int. Symp. Comput. Architect., Santa Margherita Ligure, Italy, 1995, pp. 24–36.
[39] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Dept. Comput. Sci., Princeton Univ., Princeton, NJ, USA, Jan. 2011.
[40] G. Palermo, C. Silvano, G. Mariani, R. Locatelli, and M. Coppola, “Application-specific topology design customization for STNoC,” in Proc. Euromicro Conf. Digit. Syst. Design Architect. Methods Tools, Lübeck, Germany, Aug. 2007, pp. 547–550.
[41] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[42] H. Matsutani et al., “Low-latency wireless 3D NoCs via randomized shortcut chips,” in Proc. DATE, Dresden, Germany, Mar. 2014, pp. 1–6.
[43] B. Noia and K. Chakrabarty, Design-for-Test and Test Optimization Techniques for TSV-Based 3D Stacked ICs. Cham, Switzerland: Springer, 2014.

Sourav Das (S’14) is currently pursuing the Ph.D. degree with the Electrical Engineering and Computer Engineering Department, Washington State University, Pullman, WA, USA.
His current research interest includes low-power network-on-chip design.

Janardhan Rao Doppa (M’14) received the Ph.D. degree from Oregon State University, Corvallis, OR, USA.
He is an Assistant Professor with Washington State University, Pullman, WA, USA.
Dr. Doppa was a recipient of the Outstanding Paper Award for the research on structured prediction at the AAAI 2013 conference.

Partha Pratim Pande (SM’11) received the M.S. degree in computer science from the National University of Singapore, Singapore, in 2002, and the Ph.D. degree in ECE from the University of British Columbia, Vancouver, BC, Canada, in 2005.
He is a Professor and holds the Boeing Centennial Chair in computer engineering with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA.

Krishnendu Chakrabarty (F’08) received the M.S.E and Ph.D. degrees from the University of Michigan, Ann Arbor, MI, USA, in 1992 and 1995.
He is the William H. Younger Distinguished Professor of Engineering with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA.
Prof. Chakrabarty served as an Editor-in-Chief for the IEEE Design & Test of Computers from 2010 to 2012 and the ACM Journal on Emerging Technologies in Computing Systems from 2010 to 2015. He currently serves as an Editor-in-Chief for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He is a fellow of ACM.