
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 28, NO. 9, SEPTEMBER 2020

NeuronLink: An Efficient Chip-to-Chip Interconnect for Large-Scale Neural Network Accelerators

Shanlin Xiao, Member, IEEE, Yuhao Guo, Student Member, IEEE, Wenkang Liao, Member, IEEE, Huipeng Deng, Yi Luo, Huanliang Zheng, Jian Wang, Cheng Li, Gezi Li, and Zhiyi Yu, Senior Member, IEEE

Abstract— Large-scale neural network (NN) accelerators typically consist of several processing nodes, which could be implemented as a multi- or many-core chip and organized via a network-on-chip (NoC) to handle the heavy neuron-to-neuron traffic. Multiple NoC-based NN chips are connected through chip-to-chip interconnection networks to further boost the overall neural acceleration capability. Huge amounts of multicast-based traffic travel on-chip or across chips, making the interconnection network design more challenging and making it the bottleneck of NN system performance and energy. In this article, we propose coupled intrachip and interchip communication techniques, called NeuronLink, for NN accelerators. For intrachip communication, we propose scoring crossbar arbitration, arbitration interception, and route computation parallelization techniques for virtual-channel routing, leading to a high-throughput NoC with a lower hardware cost for multicast-based traffic. For interchip communication, we propose a lightweight and NoC-aware chip-to-chip interconnection scheme, enabling efficient interconnection for NoC-based NN chips. In addition, we evaluate the proposed techniques on four connected NoC-based deep neural network (DNN) chips with four field-programmable gate arrays (FPGAs). The experimental results show that the proposed interconnection network can efficiently manage the data traffic inside DNNs with high throughput and low overhead against state-of-the-art interconnects.

Index Terms— Chip-to-chip interconnection, deep neural network (DNN), hardware accelerator, interconnection architecture, network-on-chip (NoC).

Manuscript received February 8, 2020; revised April 28, 2020 and June 26, 2020; accepted June 30, 2020. Date of publication July 20, 2020; date of current version August 26, 2020. This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0206200 and Grant 2018YFB2202601; in part by the National Natural Science Foundation of China (NSFC) under Grant 61674173, Grant 61834005, and Grant 61902443; and in part by Huawei Technologies Co., Ltd. (Corresponding authors: Zhiyi Yu; Shanlin Xiao.) Shanlin Xiao, Yuhao Guo, Wenkang Liao, Huipeng Deng, Yi Luo, Huanliang Zheng, and Jian Wang are with the School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China (e-mail: [email protected]; [email protected]; liaowk@mail2.sysu.edu.cn; [email protected]; [email protected]; [email protected]; [email protected]). Cheng Li and Gezi Li are with Huawei Technologies Co. Ltd., Hangzhou 310051, China (e-mail: [email protected]; [email protected]). Zhiyi Yu is with the School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China, and also with the School of Microelectronics Science and Technology, Sun Yat-sen University, Zhuhai 519082, China (e-mail: [email protected]). Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2020.3008185

I. INTRODUCTION

DEEP neural networks (DNNs), as a branch of neural networks (NNs), have been widely adopted for solving challenging real-world problems, such as image classification, object detection, natural language processing, and autonomous driving [1]–[3]. To enhance the feature extraction capability, DNNs go even deeper with more stacked layers. For example, the first DNN in the ImageNet Challenge that surpassed human-level accuracy, ResNet-152 [1], consists of 152 layers and requires 11.3G multiply-and-accumulates (MACs) and 60 million weights. The superior accuracy of DNNs comes at the cost of high computational complexity.

Loading millions of parameters and performing millions of MACs add throughput and energy-efficiency challenges to the hardware substrate. This has motivated various computing platforms [4]–[7], like CPU, GPU, ASIC, and field-programmable gate array (FPGA), to accelerate DNN operations. Among these platforms, CPUs suffer from low performance with limited parallelism, while GPUs are able to provide high parallelism and throughput but consume huge power. ASICs can offer the highest computational efficiency but lack flexibility, while FPGAs can provide the highest reconfigurability but still suffer from limited flexibility at runtime and large power consumption. These design paradigms fail to offer an energy-efficient scheme for processing various DNNs in large-scale NN acceleration systems.

Network-on-chips (NoCs) emerge as a promising solution to manage complex data movements with energy efficiency in modern multi- or many-core chips [8]–[13]. DNN accelerators could also benefit from the NoC-based design paradigm to obtain throughput, energy efficiency, flexibility, and scalability. NoC decouples DNN operations into data movement and computation. NoCs offer parallelism as multiple neuron processing units can communicate with each other and operate simultaneously. NoCs provide energy efficiency as they can reduce off-chip memory access by handling the data movements between processing units on-chip. NoCs provide flexibility as they can handle various DNN data flows through flexible interconnection. NoCs support scalability as computation units are independent of the data flow; adding computation resources would not change the existing interconnect.

In DNNs, one-to-many and many-to-one communications between different layers are common [14], [15]. This leads to a multicast-based communication pattern and introduces heavy and sophisticated data traffic in a DNN acceleration system. When the scale of the DNN becomes larger, DNN acceleration systems often integrate multiple chips to build a powerful computing platform. In such a system, multicast-based traffic travels on-chip or across chips, making the data movements more complex. Hence, interconnection networks become the bottleneck of system performance and energy. For example, in a modern DNN accelerator [16], over 26% of the chip area and 50% of the power consumption comes from the interconnection networks. A vital reason for this is that these accelerators fail to address the interconnection issue from a systemic perspective. In this article, we propose coupled intrachip and interchip communication methods to efficiently manage the data movements in large-scale NN acceleration systems. We call the proposed interconnection network NeuronLink. The main contributions of this article include the following.

1) Considering the multicast-based traffic characteristic inside DNNs, we propose a set of virtual-channel (VC) router optimization methods for on-chip interconnection, including scoring crossbar arbitration, arbitration interception, and route computation parallelization, leading to a high-throughput NoC with a lower hardware cost.
2) We propose a lightweight and NoC-aware chip-to-chip interconnection scheme, including a low-overhead data-link layer, physical layer, and flow control mechanism, enabling efficient interconnection for NoC-based DNN chips.
3) To validate the efficiency and effectiveness of the proposed interchip and intrachip interconnection techniques, we propose a DNN accelerator with four NoC-based chips organized in a 2-D mesh manner and evaluate it on a prototype of four connected FPGAs.

The rest of this article is organized as follows. Section II describes some of the related works. Section III introduces the background and motivation of this article. In Section IV, we explain our NeuronLink in detail. Section V demonstrates a large-scale DNN accelerator with the proposed interconnect. Finally, the experimental results are given in Section VI, and Section VII concludes this article.

II. RELATED WORK

Hardware acceleration of NNs has gained substantial attention in recent years. However, there are very limited works on the interconnection networks for running DNNs.

Neu-NoC [17] proposed a hybrid ring-mesh NoC for neuromorphic systems targeting the acceleration of multilayer perceptrons. In Neu-NoC, the neurons in the same layer are connected by one ring, and the neurons in the same ring share the same data to realize the multicast traffic. These local rings communicate with others via a mesh NoC to implement the data movements between different layers. However, ring topologies typically suffer in throughput and latency.

ClosNN [18] proposed a custom Clos topology-based indirect interconnection network for feed-forward NNs. It aims at addressing the low bisection bandwidth of a tree and the high diameter of a mesh topology, and it shows energy efficiency in handling the multicast-based traffic. However, the architecture suffers from wire physical limitations.

Eyeriss-v2 [14] proposed a hierarchical mesh network (HM-NoC) for DNNs. The processing elements (PEs) and global buffers (GLBs) are grouped into clusters and connected through HM-NoC. The NoC can be configured into different modes with circuit-switched routing, according to the transmitted data type. With the tailored NoC, different data types (input activation, weight, and partial sums) are transferred between PEs and GLBs, delivering from high bandwidth to high data reuse.

For large-scale NN architectures, DaDianNao [16] employed a fat-tree topology for intrachip communication and a mesh via HyperTransport 2.0 to manage data movements among chips. However, the separated intrachip and interchip designs do not fully consider the multicast traffic inside DNNs, making the interconnection network a bottleneck of the acceleration system.

Fig. 1. Simple NN.

III. BACKGROUND AND MOTIVATION

A. DNNs

An NN is an interconnected group of nodes called neurons. The operation of a neuron can be expressed as follows:

    OUT = φ( Σ_{i=0}^{N−1} W_i ∗ IN_i + b )    (1)

where W_i represents the weights, IN_i represents the input activations of a neuron, b represents the bias, OUT represents the output activation of a neuron, and φ(·) is the activation function for introducing nonlinearity in the network model.

Modern DNNs typically consist of multiple types of stacked layers, such as convolutional, pooling, activation, and fully connected layers. Within these layers, the convolutional and fully connected layers are the most computationally intensive. The basic operation used in convolutional and fully connected layers is an MAC operation, as depicted in (1). To compute Σ_{i=0}^{N−1} W_i ∗ IN_i in (1), a critical step is matrix-vector arithmetic. Fig. 1 depicts a simple NN.
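For illustration, the following is a minimal C++ sketch of (1) applied to a fully connected layer, i.e., a matrix-vector MAC followed by an activation. The ReLU activation and floating-point data types are assumptions made for the example only; they are not details of the accelerator described later.

    #include <vector>
    #include <algorithm>

    // Minimal sketch of (1): OUT_j = act( sum_i W[j][i] * IN[i] + b[j] ).
    // ReLU and float are illustrative assumptions.
    std::vector<float> fc_layer(const std::vector<std::vector<float>>& W,
                                const std::vector<float>& in,
                                const std::vector<float>& b) {
        std::vector<float> out(W.size(), 0.0f);
        for (size_t j = 0; j < W.size(); ++j) {      // one output neuron per weight row
            float acc = b[j];
            for (size_t i = 0; i < in.size(); ++i)   // N multiply-and-accumulates
                acc += W[j][i] * in[i];
            out[j] = std::max(0.0f, acc);            // activation function (ReLU assumed)
        }
        return out;
    }

Each output neuron is one row of the matrix-vector product, which is exactly the workload that the accelerator distributes over its processing nodes.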


Fig. 2. Traffic patterns inside DNN accelerators.

TABLE I
SPECIFICATION OF TRAFFIC PATTERNS INSIDE DNN ACCELERATORS

B. Traffic Patterns Inside DNN Accelerators

In DNNs, each hidden/output neuron is connected to a small region/all of the input neurons (e.g., convolutional layers/fully connected layers). The outputs of the neurons in one layer form the inputs of the neurons in the next layer. Communications occur between different layers, mainly including one-to-one, one-to-many, and many-to-one patterns, as indicated in Fig. 2.

Without loss of generality, we assume a DNN accelerator consisting of massive PEs for neuron computation and an on-chip memory buffer (which we refer to as the GLB) for inputs/weights/intermediate values from DRAM. Mapping DNN dataflows over these PEs can be abstracted to performing one of the three communications listed in Table I.
C. Motivation

Fig. 3. Some representative state-of-the-art DNN models. There is a visible trend toward increasingly larger NNs.

Our motivation is driven by two key observations. First, from an algorithm perspective, there is a visible trend toward increasingly larger NNs, as shown by the accuracy improvements on the ImageNet visual recognition challenge with increasing model size (see Fig. 3) [1], [19]–[26]. GoogLeNet was the winner of the 2014 ImageNet challenge [19]. It achieved 74.8% top-1 accuracy with about 4 million parameters. The winner of the 2017 ImageNet challenge was SENet [23], which has 82.7% top-1 accuracy with about 145.8 million parameters. The model size increased by over 36×. A recent model, AmoebaNet-B [26], even contains over 550 million parameters.

Second, from a hardware perspective, with the size of NNs increasing, long-range interconnection across cores/chips emerges as a new challenge for high performance and energy efficiency. The energy consumption of core-to-core/chip-to-chip communications increases quickly as the distance between the source and the destination increases. For example, in the IBM TrueNorth system [27], an intercore/interchip data movement could consume 224× the energy of an intracore/intrachip one (i.e., 894 versus 4 pJ per spike per hop).

In this work, we focus on systematic intrachip and interchip communication methods for large-scale NNs.

IV. NeuronLink

In this section, we introduce our proposed interconnect, NeuronLink, in detail. NeuronLink is a full chip-to-chip interconnect, including both intrachip and interchip parts. We name the intrachip part NeuronLink-R (i.e., NeuronLink router) and the interchip part NeuronLink-C (NeuronLink chip-to-chip connection). We first give the overall interconnect architecture (NeuronLink). Then, we propose optimizations of VC routing for on-chip communication (NeuronLink-R). Finally, we introduce a lightweight chip-to-chip interconnect scheme (NeuronLink-C).

A. Overall Interconnect Architecture

Fig. 4. Overall interconnect architecture. It consists of a physical layer, a data link layer, and a transaction layer (NoC).

The overall interconnect architecture is shown in Fig. 4. From the bottom to the top, we define a physical layer, a data link layer, and a transaction layer. It is noted that NoCs are implemented for the transaction layer. One of the key features of NeuronLink is that we introduce transmission priority to ease the congestion in chip-to-chip communication for large-scale NN applications.

At the transmitting side, the data link layer receives packets from the transaction layer (i.e., NoCs) through a bus. The bus consists of data, request, and acknowledge signals. The acknowledge signal (4 bits) indicates whether the data link layer receives packets. The header flit of the received packet is analyzed first, and the packet priority, multicast type, and destination address information from the header flit are then available. Then, the body flits are stored in the corresponding VCs according to their priorities. With these messages and credit mechanisms, the credit management (CRM) unit decides which VC is to be sent if there are several requests. Finally, the packet to be sent is stored into an asynchronous first-input–first-output (FIFO) buffer and backed up in the retry buffer.


If the CRM unit receives a retry data command, the retry buffer sends packets according to the flit address within the retry data command. At the same time, the CRM unit sends related commands to indicate the retrying operation.

The physical layer receives data and commands from the data link layer. We utilize asynchronous FIFOs to transfer data and an asynchronous handshake mechanism to transfer commands. The asynchronous handshake approach is introduced to address the issue that commands need a higher priority than data (e.g., for retransmission, commands are required to be transmitted before the data). Then, the synchronous header, check header, and package header are added to the data or commands by the encoder. After that, the scrambler follows the IEEE 802.3 standard to avoid successive "0" or "1" values and to balance the number of "0"s and "1"s. The encoded message will not be scrambled. Therefore, these two steps can be executed simultaneously to improve parallelism. Finally, control fields are generated by the byte striping module to avoid mistakes on the chips.

At the receiving side, the data processed by the physical medium attachment (PMA) sublayer is recovered to normal data after block boundary alignment and channel bonding. Then, the data are descrambled to normal order and sent to the decoder. After that, the elastic buffer is used to synchronize data from the recovery clock to the local clock. The command interface and data interface transfer commands or data to the data link layer by recognizing the sync header.

For a command, the CRM in the data link layer analyzes it and responds accordingly. For data, it is buffered and sent to different VCs depending on the priority and address. Once the data fails to pass the check by the decoder, the CRM generates a retry command to require the transmit side to retry the data, and it abandons any data until it receives the retried data.
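As an illustration of the header-driven VC selection described above, consider the following minimal C++ sketch. The text only states that the header flit carries priority, multicast type, and destination address information; the field widths and packing below are therefore assumptions made for the example.

    #include <cstdint>
    #include <deque>
    #include <array>

    // Hypothetical header-flit contents: priority, multicast type, and
    // destination address, as named in the text. Widths are assumptions.
    struct HeaderFlit {
        uint8_t  priority;   // 0..3, selects one of the four VCs
        bool     multicast;  // unicast versus multicast packet
        uint16_t dest;       // destination address (node or multicast identifier)
    };

    // Four virtual channels per port, one per priority level.
    using VirtualChannel = std::deque<uint64_t>;   // buffered body flits

    struct PortBuffers {
        std::array<VirtualChannel, 4> vc;

        // Body flits are stored in the VC selected by the priority carried
        // in the packet's header flit.
        void acceptBody(const HeaderFlit& hdr, uint64_t bodyFlit) {
            vc[hdr.priority & 0x3].push_back(bodyFlit);
        }
    };

The CRM's job then reduces to choosing, among the nonempty VCs, the one to forward next according to priority and the credit state of the receiver.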
B. On-Chip VC Routing Optimization

Fig. 5. Structure of our router.

Our router architecture is shown in Fig. 5, which is an optimization of a conventional router. It consists of five input ports, five output ports, VC buffers, a route computation unit, a VC allocator, a switch allocator, and a 5 × 5 crossbar switch. The dashed rectangle indicates that the component includes our optimization. Details about the VC routing optimizations are given in the following.

Fig. 6. VC routing optimization I. (a) Conventional round-robin crossbar arbitration. (b) Our scoring crossbar arbitration.

1) Scoring Crossbar Arbitration: Fig. 6(a) illustrates a conventional round-robin crossbar arbitration method. For each input port, if there exist packets of different priorities in the VCs, the highest priority one will be forwarded to the crossbar arbitration unit by the VC allocator. Then, packets from different input ports are granted by a priority comparator and round-robin arbitration to access the routing resource. This multilevel arbitration method introduces long latency and consumes more hardware resources.


To reduce latency, we propose a scoring arbitration method, as shown in Fig. 6(b). The main feature of our method is to perform crossbar priority arbitration and round-robin arbitration in parallel through a scoring mechanism. The score of the packet from each port is computed with both its priority (cls) and the current round-robin factor (res_last) as weights, and the one that gets the highest score is granted and transmitted to the next router or IP core. The score of the packets is defined as

    S_n = (cls ∗ req) + (req − res_last + req_n) % req    (2)

where S_n, cls, req, res_last, and req_n are the score of the nth request input, the packet priority of the request input, the total number of arbitration requests, the last arbitration result, and the input port number of the current request, respectively.
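For illustration, the scoring arbitration of (2) can be sketched in a few lines of C++: each requesting port's score is evaluated and the highest score wins. The four-level priority range and the tie-breaking rule are assumptions, since the text does not specify them.

    #include <vector>

    struct Request {
        int port;       // input port number of this request (req_n)
        int priority;   // packet priority (cls), e.g., 0..3 (range assumed)
    };

    // Scoring arbitration of (2): priority and round-robin fairness are folded
    // into a single score so both can be evaluated in parallel in hardware.
    // 'reqTotal' is the total number of requests (req) and 'resLast' is the
    // port granted in the previous round (res_last).
    int scoringArbitrate(const std::vector<Request>& reqs, int reqTotal, int resLast) {
        int bestPort  = -1;
        int bestScore = -1;
        for (const Request& r : reqs) {
            int score = r.priority * reqTotal
                      + (reqTotal - resLast + r.port) % reqTotal;  // round-robin term of (2)
            if (score > bestScore) {          // ties broken by first-seen order (assumption)
                bestScore = score;
                bestPort  = r.port;
            }
        }
        return bestPort;                      // -1 if there is no request
    }

The priority term dominates because it is scaled by req, while the modulo term rotates the preference among ports of equal priority, which is the round-robin behavior.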
Fig. 7. VC routing optimization II. (a) Conventional routing without arbitration interception. (b) Our proposed routing with arbitration interception.

2) Arbitration Interception: As Fig. 7(a) depicts, packets from the same upstream node are stored in the related VCs (VC0–VC3) of the input port based on their priority. When multiple input ports apply to the crossbar arbitration of the same output port, ungranted packets are held in the buffers until the next arbitration after the previously granted packets are transmitted to the next router or IP core. This undoubtedly aggravates the congestion and increases the latency of the NoC.

Therefore, we propose an arbitration interception to ease the congestion. When the high-priority channel from the same input port fails to go through the crossbar arbitration, the crossbar returns an arbitration interception signal to the VC allocator. After receiving the signal, the VC allocator disables the request of the high-priority channel and grants the lower priority channel to apply to the crossbar arbitration, as Fig. 7(b) illustrates. This approach prevents the situation in which lower priority packets have to wait in the VCs until higher priority packets get granted and transmitted to the next router or IP core, which effectively eases transmission congestion and improves the performance of the NoC.

Fig. 8. VC routing optimization III. (a) Conventional sequential route computation. (b) Our proposed parallel route computation.

3) Route Computation Parallelization: Fig. 8(a) shows the route computation process in a conventional router. The VC input buffering and route computation are processed in sequence. When the route computation involves a multicast lookup table, it may take a longer time to query the routing destination address in the lookup table. This increases the delay of data packet transmission and greatly lowers the performance of the on-chip network.

Fig. 8(b) shows our route computation. Unlike the process shown in Fig. 8(a), VC input buffering and route computation are processed in parallel. The packet header enters the VC FIFO unit and the route computation unit simultaneously. The route computation unit performs the corresponding route calculation based on the destination address, packet type, and priority information contained in the packet header. In the route computation unit, we adopt the XY routing algorithm for unicast packets and a lookup table-based routing algorithm for multicast packets. The routing algorithms are deterministic and effective in preventing deadlock. The body and tail flits of the packets only enter the VC FIFO unit and do not enter the route computation unit.
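For illustration, the route computation described above can be sketched as follows: dimension-ordered XY routing for unicast packets and a table lookup for multicast packets. The output-port encoding, coordinate orientation, and lookup-table layout are assumptions made for the example.

    #include <cstdint>
    #include <unordered_map>

    enum class Port : uint8_t { Local, North, South, East, West };

    // Dimension-ordered XY routing for unicast packets: route in X first,
    // then in Y, which is deterministic and deadlock-free on a 2-D mesh.
    Port xyRoute(int curX, int curY, int dstX, int dstY) {
        if (dstX > curX) return Port::East;
        if (dstX < curX) return Port::West;
        if (dstY > curY) return Port::North;   // coordinate orientation assumed
        if (dstY < curY) return Port::South;
        return Port::Local;                    // arrived at the destination node
    }

    // Multicast packets instead consult a routing lookup table indexed by a
    // multicast identifier; each entry is a bitmask of output ports to which
    // the packet is replicated (table contents are an assumption).
    using PortMask = uint8_t;                  // one bit per output port
    PortMask multicastRoute(const std::unordered_map<uint16_t, PortMask>& lut,
                            uint16_t mcastId) {
        auto it = lut.find(mcastId);
        return (it != lut.end()) ? it->second : 0;   // 0: no match, drop (assumption)
    }

Because only the header flit carries routing information, this lookup can start as soon as the header is seen, which is what allows it to overlap with the VC input buffering of the body and tail flits.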
C. Chip-to-Chip Interconnect

The flow control greatly impacts the throughput, latency, reliability, and cost of an interconnect. We introduce the details of our flow control in NeuronLink below.


Fig. 10. Large-scale DNN accelerator. (a) Overall architecture of the accelerator. (b) 2-D mesh NoC-based single chip. (c) Architecture of the processing node. (d) Architecture of the analog processing unit.

TABLE II
DIFFERENT TYPES OF COMMANDS

TABLE III
DIFFERENT TYPES OF CONTROL FIELDS

1) Flow Control Mechanism: As shown in Fig. 4, we introduce the CRM module for flow control, following the mechanism used in [28]. It serves the whole system's arbitration, the management of data receiving and sending, and the retry function. It arbitrates the requests from the VCs according to the target VC's credit and the local VC's priority. Besides, it is also in charge of sending and analyzing commands. Herein, we use a credit synchronization mechanism to avoid overflow at the RX side. This module sends a credit upgrade command according to the RX side's status and receives the other chip's credit command to upgrade the local credit. If the credit is reduced to 0, it indicates that the target VC at the RX side is full.
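For illustration, the credit bookkeeping described above can be sketched as follows in C++. The counter width and initial credit value are assumptions; in NeuronLink the CRM is a hardware module, so this is only a behavioral sketch.

    #include <cstdint>

    // Per-VC credit counter on the TX side: one credit per free buffer slot in
    // the corresponding RX-side VC. The initial value of 16 is an assumption.
    struct CreditCounter {
        uint16_t credits = 16;

        bool canSend() const { return credits > 0; }

        // Consume one credit whenever data is sent toward the RX-side VC.
        void onSend() { if (credits > 0) --credits; }

        // A credit synchronization (upgrade) command from the remote chip
        // restores credits as the RX side drains its VC; credits == 0 means
        // that the target VC at the RX side is full.
        void onCreditUpdate(uint16_t freedSlots) { credits += freedSlots; }
    };

The CRM then arbitrates only among VCs whose counters are nonzero, which is precisely the condition that prevents overflow at the RX side.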
Fig. 9. Encoding format.

Fig. 9 shows our command format. The command is protected with triple modular redundancy (TMR) [29]. With TMR, the CRM can still recognize the command correctly even if there are bit-flips during SerDes transmission. The command types are specified in Table II. For simplicity, we define only five kinds of commands. Unless an error occurs, the data link layer only sends a credit synchronization command, so the system sends effective data rather than commands in most cases. Compared to PCIe's packet redundancy in the data link layer, our design has lower overhead and can greatly increase the effective throughput. The techniques used in the flow control improve the efficiency of our interconnect, as demonstrated in Section VI-C.
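As a behavioral illustration, TMR reception amounts to a bitwise two-out-of-three majority vote over the three copies of a command, so a single bit-flip in any one copy is corrected. The 64-bit word width below follows the command length quoted in Section VI-C2; whether the hardware votes over whole command words or over individual fields is an assumption.

    #include <cstdint>

    // Bitwise 2-out-of-3 majority vote over the three TMR copies of a command.
    // A bit position is correct as long as at most one copy flipped that bit.
    uint64_t tmrVote(uint64_t copyA, uint64_t copyB, uint64_t copyC) {
        return (copyA & copyB) | (copyA & copyC) | (copyB & copyC);
    }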
2) Control Field: Since our encoding format is different from the traditional 64-/66-b encoding approach, new control fields are needed. The customized control fields are shown in Table III. Only the not ready and channel bonding packages are needed for initialization. Unless the elastic buffer is almost full, no control field will be sent. Therefore, most of the time, the physical layer sends effective data or command blocks.

V. LARGE-SCALE DNN ACCELERATOR WITH NeuronLink

Fig. 11. Mapping ResNet-18 to the proposed DNN accelerator.

In this section, we present a large-scale DNN accelerator with the proposed interconnect, NeuronLink.


The overall architecture of the DNN accelerator is shown in Fig. 10. The accelerator consists of four chips [see Fig. 10(a)]. These chips are connected through NeuronLink and organized in a ring. Each chip consists of 16 processing nodes, which are organized in a 2-D mesh NoC [see Fig. 10(b)]. For high-bandwidth off-chip communication, like data exchange with the main memory (DRAM), each chip maintains a PCIe interface, as indicated in Fig. 10(b). Each node is made up of eDRAM buffers to store input feature maps, four digital processing units (DPUs) to perform shift-and-add, activation, and max-pooling operations, and eight analog processing units (APUs) to perform in situ MAC operations, connected with a shared bus [see Fig. 10(c)]. Each APU contains several crossbar arrays, DACs, and ADCs, connected with a shared bus [see Fig. 10(d)]. Our architecture targets DNN inference, which is the dominant application in many fields.

Fig. 11 shows the mapping of the ResNet-18 model [1] to the proposed large-scale DNN accelerator. The circles and squares represent the routers and processing nodes, respectively. The number inside a circle indicates the layer number of the ResNet-18 model. The arrows represent the data movement inside a chip or across chips. The dark green circle indicates that this router transfers data from the local processing node to the main memory (DRAM) through the PCIe interface.

VI. EXPERIMENTAL RESULTS

A. Experiment Setting

We implement the proposed interconnect, NeuronLink, in C++ at the cycle-accurate level for fast performance evaluation and also in Verilog for FPGA-based hardware evaluation. Network throughput, average latency, channel utilization, and VC utilization are used as performance evaluation metrics. Throughput is the rate at which packets are delivered by the network for a particular traffic pattern. Latency is the time required for a packet to traverse the network from source to destination.

For the rest of the experimental parts, Section VI-B shows the FPGA-based hardware characteristics of NeuronLink, Section VI-C shows the performance of the interconnect, and Section VI-D gives comparisons against other interconnects.

B. Hardware Characteristic

Fig. 12. Our evaluation platform, which consists of four Xilinx ZCU102 FPGA boards.

TABLE IV
FPGA RESOURCE UTILIZATION PER CHIP

Fig. 13. Hardware resource comparison: on-chip versus chip-to-chip.

Fig. 12 shows our FPGA-based evaluation platform, which consists of four Xilinx ZCU102 FPGA boards. These FPGA boards are connected in a ring via optical fibers. We use the platform to evaluate the interconnection network of the DNN accelerator described in Section V. It is noted that our NeuronLink includes two parts, NeuronLink-R (router, see Section IV-B) and NeuronLink-C (chip-to-chip connection, see Section IV-C). Each FPGA board implements two NeuronLink-Cs for chip-to-chip communication and a 2-D 4 × 4 mesh NoC with 16 NeuronLink-Rs for on-chip communication. The overall system is simulated and evaluated based on Vivado 2016.4 with the ResNet-18 model mapped to the hardware platform (see Fig. 11). The system can run at 250 MHz. The FPGA resource utilization of a single chip is shown in Table IV. In Fig. 13, we also show the hardware resource comparison between the on-chip network (16 NeuronLink-Rs) and the chip-to-chip network (two NeuronLink-Cs).

C. Performance of NeuronLink

To better understand the performance of NeuronLink, we evaluate the performance of the on-chip network (NeuronLink-R) and the chip-to-chip network (NeuronLink-C), respectively. For the on-chip network, we give the network throughput, average latency, channel utilization, and VC utilization under different injection rates. For the chip-to-chip network, we give the network throughput and average latency under different injection rates, and the bit error rate.


Besides, we also evaluate the overall performance of NeuronLink in terms of network throughput and average latency.

1) On-Chip Interconnection Network Performance: The architecture of the on-chip network is shown in Fig. 10(b), which is a 2-D 4 × 4 mesh NoC with the optimized router (NeuronLink-R, see Fig. 5) described in Section IV-B. The performance of the on-chip network is evaluated under six test cases, as summarized in Table V. Our test cases cover different traffic patterns (shuffle, neighbor, and real DNN workloads), different packet types (unicast and multicast), different packet priorities (fixed and varying), and random packet lengths, to offer a comprehensive evaluation of the on-chip network. Figs. 14 and 15 show the throughput, average latency, channel utilization, and VC utilization under different injection rates of synthetic traffic (Test-1–Test-4) and real DNN workloads (Test-5 and Test-6), respectively. For the real DNN workloads, we choose two representative DNN models: AlexNet [30] and ResNet-18 [1]. Our DNN-to-NoC mapping strategy is similar to [31]. As demonstrated in Section III-B, unicast and multicast coexist in the DNN accelerators after the mapping.

For a better understanding of the on-chip interconnection performance, some key data are listed in Table VI, including the saturation injection rate, the light-load latency, and the saturation channel utilization. As can be seen from Figs. 14 and 15, and Table VI, the zero-load latency under all test cases is five cycles, which is as we expected and consistent with the router structure. For all test cases, the saturation throughput is over 0.45 flits/cycle/node, which achieves 70.3% of the ideal throughput (the maximum ideal throughput is 0.643 flits/cycle/node). Because the traffic patterns affect the latency and saturation point, the average latency curves vary under different traffic patterns. Compared with the shuffle pattern, the transmission path of the neighbor pattern faces less competition, so it is less prone to congestion. Thus, Test-4 shows a lower latency (23 cycles, indicated in Table VI) and a greater saturation point (0.75 flits/cycle/node, indicated in Table VI) in Fig. 14(b). It is noted that the DNN workloads show a similar latency and saturation point as the synthetic traffic, except for the latency of Test-6, because the residual connections in ResNet-18 make the traffic more sophisticated. Among the different shuffle patterns (Test-1–Test-3), the workload of Test-3 is heavier, with both unicast and multicast traffic. Thus, it reaches saturation earlier (0.64 flits/cycle/node, indicated in Table VI). For modest loads (Test-1 and Test-2), we observe that there are two exponential increases in latency [see Fig. 14(b)]. This is because of the traits of the shuffle pattern (the traffic is heavier at the center of the network and the routers in that region are prone to congestion) and the packet injection method we adopted (newly generated packets from a certain node are not injected into the network when that node faces congestion). As the injection rate increases, the traffic becomes heavier and the latency grows rapidly. When the injection rate reaches a certain point, packets are unable to be introduced into the network from the center region. Thus, the latency grows smoothly. After that, as the injection rate approaches saturation, the injected packets from the nodes at the edge of the network dominate the network traffic and cause the second exponential increase in latency.

The simulation results show that our network can effectively handle traffic imbalances. Also, the channel utilization and VC utilization increase as the injection rate increases, as shown in Figs. 14 and 15, which indicates that no blocking occurs in the network.

2) Chip-to-Chip Interconnection Network Performance: In Fig. 16, we give the performance of the chip-to-chip interconnect (NeuronLink-C), including network throughput and average latency. In our interconnect, it takes two cycles to transfer a flit in the physical layer. Therefore, the maximum throughput can be calculated as follows:

    Throughput_max = ( f_physical / (2 f_datalink) ) ∗ ( 2 num_data / (num_cmd + 2 num_data) )    (3)

where f_physical and f_datalink represent the working frequencies of the physical layer and the data link layer, respectively, and num_data and num_cmd represent the number of data and the number of commands, respectively. Here, f_physical is 250 MHz and f_datalink is 300 MHz. In our experiment, the ratio of commands to data is 1:5. Each packet has nine flits. The length of a flit is 130 bits, while the length of a command is 64 bits. With these parameters, the maximum throughput is calculated as about 0.388 flits/cycle. As shown in Fig. 16(a), the saturation throughput is consistent with the ideal maximum throughput. In other words, our interconnection system has reached the maximum utilization rate (100%) for SerDes transmissions.

In our interconnect, because no additional messages are added to the payload in the data link layer, the encoding efficiency is only affected by the encoder in the physical layer (see Fig. 4). During the encoding phase, every 64-bit payload is encoded to 80 bits with 16 extra bits. Therefore, the encoding efficiency of our interconnect is 80% (64/80 bits). The maximum bandwidth of NeuronLink can be calculated as follows:

    Bandwidth_max = N_lane ∗ V_lane ∗ E_encode    (4)

where N_lane, V_lane, and E_encode represent the number of lanes of the SerDes, the data transmission rate of the SerDes, and the encoding efficiency, respectively. Here, N_lane, V_lane, and E_encode are 2, 10 Gb/s, and 80%, so Bandwidth_max is calculated as 16 Gb/s. With the assumption that a credit synchronization command is sent per 5 data, the effective bandwidth is 16 × 5 × 130/(5 × 130 + 64) = 14.56 Gb/s. Compared to PCIe Gen2 × 4 (8 Gb/s), our effective bandwidth increases by 82%.
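The arithmetic behind (3) and (4) and the effective-bandwidth figure can be replayed with the short C++ program below; the constants are the ones quoted in the text.

    #include <cstdio>

    int main() {
        // Parameters quoted in Section VI-C2.
        const double f_physical = 250e6;       // physical-layer clock (Hz)
        const double f_datalink = 300e6;       // data-link-layer clock (Hz)
        const double num_cmd    = 1.0;         // commands : data = 1 : 5
        const double num_data   = 5.0;
        const double n_lane     = 2.0;         // SerDes lanes
        const double v_lane     = 10e9;        // 10 Gb/s per lane, in bit/s
        const double e_encode   = 64.0 / 80.0; // 64-bit payload -> 80-bit block

        // Equation (3): ideal maximum throughput in flits/cycle.
        double tput_max = (f_physical / (2.0 * f_datalink))
                        * (2.0 * num_data / (num_cmd + 2.0 * num_data));

        // Equation (4): raw bandwidth, then the effective bandwidth after one
        // 64-bit credit command per five 130-bit flits.
        double bw_max = n_lane * v_lane * e_encode;                     // 16 Gb/s
        double bw_eff = bw_max * (5.0 * 130.0) / (5.0 * 130.0 + 64.0);  // ~14.56 Gb/s

        std::printf("max throughput = %.3f flits/cycle\n", tput_max);
        std::printf("max bandwidth  = %.2f Gb/s\n", bw_max / 1e9);
        std::printf("effective bw   = %.2f Gb/s\n", bw_eff / 1e9);
        return 0;
    }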
For the average latency shown in Fig. 16(b), when the injection rate is lower than 0.2 flits/cycle, the transmission will not be blocked or influenced by credit updates, so the average latency is 105 cycles. When the injection rate increases but remains below 0.4 flits/cycle, the latency increases to 132 cycles due to the round-robin transmission of four routers. Because of the limitation of the SerDes's throughput and the credit mechanism, the average latency grows sharply after a 0.4-flits/cycle injection rate. In the saturation situation, the VCs hold about 40 packets (360 flits), and we need an additional 864 cycles to transfer them.

To verify the reliability and stability of NeuronLink, we utilize a built-in self-test (BIST) circuit to generate random numbers to test the bit error rate.

Fig. 14. Performance of on-chip interconnection under synthetic traffic (Test-1–Test-4). (a) Throughput. (b) Average latency. (c) Channel utilization. (d) VC utilization.

Fig. 15. Performance of on-chip interconnection under real DNN workloads (Test-5 and Test-6). (a) Throughput. (b) Average latency. (c) Channel utilization. (d) VC utilization.

TABLE V
SPECIFICATION OF DIFFERENT TYPES OF TEST CASES

TABLE VI
PERFORMANCE OF THE ON-CHIP NETWORK

In the test, we use the PRBS31 pattern. A huge amount of data (21 614 777 535 681) is transferred at the sending side for about 24 h, and no bit error occurs at the receiving side. According to the confidence calculation formula, we obtain a bit error rate of 2.663207e−15 for NeuronLink. Note that for PCIe used in 10G Ethernet, the required bit error rate is 1e−12. Our interconnect shows a better bit error rate.

3) Overall Interconnect Performance: In Fig. 17, we give the performance of the overall interconnect, including network throughput and average latency. The evaluation is based on the four-FPGA hardware platform, as demonstrated in Fig. 12, with a real workload, ResNet-18. It is noted that ResNet-18 is mapped to four chips, not a single chip, here for the overall interconnect performance evaluation. The ResNet-18 mapping is illustrated in Fig. 11. Within the mapping, the traffic includes one-to-one, one-to-many, and many-to-one communications. For the overall interconnect, the saturation injection rate, the saturation throughput, the light-load latency, and the saturation latency are 0.6 flits/cycle/node, 0.43 flits/cycle/node, 25 cycles, and 105 cycles, respectively. Even under a sophisticated real workload, our interconnect shows a similar performance to the test cases listed in Table V.


Fig. 16. Performance of chip-to-chip interconnection. (a) Throughput under different injection rates. (b) Average latency under different injection rates.

Fig. 17. Performance of the overall interconnect with ResNet-18 mapping. (a) Throughput under different injection rates. (b) Average latency under different injection rates.

TABLE VII
COMPARISON WITH A CONVENTIONAL ROUTER

Fig. 18. Dynamic power breakdown of the proposed router.

D. Comparison With Other Interconnects

1) Comparison With Other Routers: For comparison, we implement a conventional router on the Xilinx ZCU102 FPGA board. Both the proposed router (NeuronLink-R) and the conventional router have four input ports and five output ports. Each input port has four VCs (VC0–VC3). VC0–VC3 can hold 27 flits, 9 flits, 9 flits, and 9 flits, respectively, and the bit width of a single flit is 130 bits. The difference is that the conventional one does not adopt the scoring crossbar arbitration, arbitration interception, and route computation parallelization techniques proposed in this work. The implementation results are shown in Table VII. The experimental results show that the proposed router simplifies the complexity of the circuit, resulting in approximately 10% fewer LUTs and 47% fewer flip-flops than the conventional VC router with a round-robin arbiter at 300 MHz. The power consumption of NeuronLink-R is slightly lower than that of the conventional router. It is noted that the power listed in Table VII is the total power consumption, including the dynamic power and static power. Generally, the FPGA board has a high static power consumption. The dynamic power of the conventional router and the proposed router is 29 and 28 mW, respectively.


Fig. 19. Performance comparison between a conventional router and our proposed router. (a) Throughput under different injection rates. (b) Average latency under different injection rates.

TABLE VIII
COMPARISON WITH OTHER CHIP-TO-CHIP INTERCONNECTS

TABLE IX
SINGLE APU CHARACTERISTICS

TABLE X
PROPOSED ACCELERATOR CHARACTERISTICS

For a better understanding of the power consumption in the router, Fig. 18 gives the dynamic power breakdown of the proposed router.

We also implement an on-chip network with a conventional router for comparison. The on-chip network is based on a 2-D 4 × 4 mesh topology with the same architecture described in Fig. 10(b) (not including the NeuronLink interface). Both implementations adopt the XY routing algorithm, avoiding deadlock and livelock, and are tested with the neighbor traffic pattern here. The performance comparison is shown in Fig. 19. We can see that the average packet latency of our design increases significantly at an injection rate of 0.6 flits/cycle/node, before the NoC gets saturated. Our proposed VC router offers a twofold reduction in the average packet latency and a twofold increase in the network throughput when compared with those of the conventional VC router when the injection rate exceeds 0.7 flits/cycle/node.

2) Comparison With Other Chip-to-Chip Interconnects: We compare NeuronLink-C with two kinds of controllers: PCIe Gen2 X4 for an endpoint and SRIO 2.0 for a host. The comparison is given in Table VIII. It shows that our design has a lower hardware cost. Compared to other interconnects, our design simplifies the data flow. Most modules only need a cycle to process data, so we achieve lower LUT and flip-flop utilization. On the other hand, we need a large number of FIFOs to store packets from different routers and with different priorities (four kinds of priorities are supported in NeuronLink); thus, more BRAMs are needed.

3) Comparison With Other Large-Scale DNN Accelerators: As demonstrated in Fig. 10, the behavior of our DNN accelerator is a function of the NeuronLink (including NeuronLink-R and NeuronLink-C), the APUs (including crossbar arrays, ADCs, and DACs), the DPUs, the SerDes, and the eDRAM buffers. To perform an apples-to-apples comparison with the state-of-the-art NoC-based DNN accelerators, we synthesize the digital circuit modules at a 32-nm CMOS technology with 1.2 GHz using Synopsys Design Compiler.


TABLE XI
COMPARISON WITH NOC-BASED DNN ACCELERATORS

We model the area and energy for the crossbar array (100 ns, one cycle, the read latency for the crossbar array) using the results from [32], for the ADCs using the results from [33], for the DACs using the results from [34], and for the SerDes using the results from [35]. We use CACTI 7.0 [36] at 32 nm to model the area and energy for all on-chip buffers, including eDRAM and SRAM. Table IX summarizes the characteristics of a single APU. In our analysis, the accelerator contains four chips [see Fig. 10(a)]. Each chip consists of 16 processing nodes, 16 NeuronLink-Rs, 2 NeuronLink-Cs, and 2 SerDes. For a fair comparison, the PCIe shown in Fig. 10(b) is not included in the analysis. The hardware characteristics of the accelerator are given in Table X. Our accelerator has a 41.2-mm2 chip area and 15.4-W power consumption.

There are few works about the interconnection networks running DNNs. Eyeriss-v2 [14] and DaDianNao [16] are two representative works. In Table XI, we compare our accelerator with Eyeriss-v2 and DaDianNao. We consider two key metrics: power efficiency and area efficiency. Power efficiency represents the number of 8-/16-bit operations performed per watt (GOPS/W), while area efficiency represents the number of 8-/16-bit operations performed per mm2 (GOPS/mm2). The power efficiency and area efficiency of Eyeriss-v2 and DaDianNao are also scaled to 32 nm under 8-bit precision for a straight comparison. As shown in Table XI, our accelerator has 1361.7-GOPS/W power efficiency and 508.9-GOPS/mm2 area efficiency. The NeuronLink-based accelerator shows 1.34× better power efficiency and 9.12× better area efficiency against Eyeriss-v2, and 1.27× better power efficiency and 2.01× better area efficiency against DaDianNao.

VII. CONCLUSION

In this work, we proposed a complete chip-to-chip interconnection network for large-scale NN accelerators, including both intrachip and interchip communication techniques. The interconnect aims at efficiently managing the huge amount of multicast-based traffic inside DNN accelerators. The interconnect is fully evaluated with the cycle-accurate C++ and RTL models. Also, the interconnect is implemented on a four-FPGA hardware platform. Experimental results have shown that the proposed interconnect, NeuronLink, can achieve high throughput and a low bit error rate with low overhead against state-of-the-art interconnects. Furthermore, our NeuronLink-based DNN accelerator outperforms the previous NoC-based DNN accelerators by 1.27×–1.34× in terms of power efficiency and 2.01×–9.12× in terms of area efficiency.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 630–645.
[2] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[4] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2015, pp. 161–170.
[5] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2015, pp. 1737–1746.
[6] X. Zhou et al., "Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2018, pp. 15–28.
[7] S. Yin et al., "An energy-efficient reconfigurable processor for binary- and ternary-weight neural networks with flexible data bit width," IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1120–1136, Apr. 2019.
[8] P. Ou et al., "A 65 nm 39 GOPS/W 24-core processor with 11 Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2013, pp. 56–57.
[9] A. Touzene, "On all-to-all broadcast in dense Gaussian network on-chip," IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 4, pp. 1085–1095, Apr. 2015.
[10] B. Bohnenstiehl et al., "KiloCore: A 32-nm 1000-processor computational array," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 891–902, Apr. 2017.
[11] A. Firuzan, M. Modarressi, M. Daneshtalab, and M. Reshadi, "Reconfigurable network-on-chip for 3D neural network accelerators," in Proc. 12th IEEE/ACM Int. Symp. Netw.-Chip (NOCS), Oct. 2018, pp. 1–8.
[12] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), Mar. 2018, pp. 461–475.
[13] M. F. Reza and P. Ampadu, "Energy-efficient and high-performance NoC architecture and mapping solution for deep neural networks," in Proc. 13th IEEE/ACM Int. Symp. Netw.-Chip, Oct. 2019, pp. 1–8.
[14] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[15] K. Bhardwaj and S. M. Nowick, "A continuous-time replication strategy for efficient multicast in asynchronous NoCs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 2, pp. 350–363, Feb. 2019.
[16] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.
[17] X. Liu, W. Wen, X. Qian, H. Li, and Y. Chen, "Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems," in Proc. 23rd Asia South Pacific Design Automat. Conf. (ASP-DAC), Jan. 2018, pp. 141–146.
[18] R. Hojabr, M. Modarressi, M. Daneshtalab, A. Yasoubi, and A. Khonsari, "Customizing clos network-on-chip for neural networks," IEEE Trans. Comput., vol. 66, no. 11, pp. 1865–1877, Nov. 2017.
[19] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[21] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5987–5995.


[22] X. Zhang, Z. Li, C. C. Loy, and D. Lin, "PolyNet: A pursuit of structural diversity in very deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3900–3908.
[23] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[24] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[25] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "AutoAugment: Learning augmentation strategies from data," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 113–123.
[26] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 4780–4789.
[27] P. A. Merolla et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668–673, Aug. 2014.
[28] W. Liao, Y. Guo, S. Xiao, and Z. Yu, "A low-cost and high-throughput NoC-aware chip-to-chip interconnection," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020.
[29] F. Mireshghallah, M. Bakhshalipour, M. Sadrosadati, and H. Sarbazi-Azad, "Energy-efficient permanent fault tolerance in hard real-time systems," IEEE Trans. Comput., vol. 68, no. 10, pp. 1539–1545, Oct. 2019.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2012, pp. 1097–1105.
[31] K.-C. Chen and T.-Y. Wang, "NN-noxim: High-level cycle-accurate NoC-based neural networks simulator," in Proc. 11th Int. Workshop Netw. Chip Architectures (NoCArc), Oct. 2018, pp. 1–5.
[32] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 14–26.
[33] L. Kull et al., "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2013, pp. 468–469.
[34] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 8, pp. 1736–1748, Aug. 2011.
[35] B. Zhang et al., "A 28 Gb/s multistandard serial link transceiver for backplane applications in 28 nm CMOS," IEEE J. Solid-State Circuits, vol. 50, no. 12, pp. 3089–3100, Dec. 2015.
[36] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2007, pp. 3–14.
