NeuronLink: An Efficient Chip-to-Chip Interconnect for Large-Scale Neural Network Accelerators
TABLE I: Specification of Traffic Patterns Inside DNN Accelerators
to the flit address within the retry data command. At the same time, the CRM unit sends the related commands to indicate the retry operation.
The physical layer receives data and commands from the data link layer. We utilize asynchronous FIFOs to transfer data and an asynchronous handshake mechanism to transfer commands. The asynchronous handshake is introduced because commands need a higher priority than data (e.g., for retransmission, commands must be transmitted before the data). Then, the synchronous header, check header, and package header are added to the data or commands by the encoder. After that, the scrambler follows the IEEE 802.3 standard to avoid long runs of successive “0”s or “1”s and to balance the numbers of “0”s and “1”s. The encoded message is not scrambled; therefore, these two steps can be executed simultaneously to improve parallelism. Finally, control fields are generated by the byte striping module to avoid mistakes on the chips.

Fig. 6. VC routing optimization I. (a) Conventional round-robin crossbar arbitration. (b) Our scoring crossbar arbitration.
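The text above names the IEEE 802.3 scrambler; a minimal bit-serial sketch of that self-synchronous scrambler (polynomial G(x) = 1 + x^39 + x^58, as used by 64b/66b coding) is given below. The class and all names are our illustration, not the authors' RTL.

```cpp
#include <cstdint>

// Sketch of the IEEE 802.3 self-synchronous scrambler,
// G(x) = 1 + x^39 + x^58, processing one 64-bit word LSB first.
struct Scrambler {
    uint64_t state = (1ULL << 58) - 1; // 58-bit state, seeded non-zero

    uint64_t scramble(uint64_t in) {
        uint64_t out = 0;
        for (int i = 0; i < 64; ++i) {
            uint64_t inBit  = (in >> i) & 1;
            // Feedback taps correspond to x^39 and x^58.
            uint64_t fb     = ((state >> 38) ^ (state >> 57)) & 1;
            uint64_t outBit = inBit ^ fb;
            // The output bit is shifted back into the state.
            state = ((state << 1) | outBit) & ((1ULL << 58) - 1);
            out |= outBit << i;
        }
        return out;
    }
};
```

Because the scrambled output is shifted back into the state, a matching descrambler that shifts the received bit into its own state resynchronizes automatically after bit errors.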
At the receiving side, the data processed by the physical medium attachment (PMA) sublayer are recovered to normal data after block boundary alignment and channel bonding. Then, the data are descrambled to normal order and sent to the decoder. After that, an elastic buffer is used to synchronize the data from the recovered clock to the local clock. The command interface and the data interface transfer commands or data to the data link layer by recognizing the sync header.

For a command, the CRM in the data link layer analyzes it and responds accordingly. Data are buffered and sent to different VCs depending on their priority and address. Once data fail to pass the decoder's check, the CRM generates a retry command to require the transmit side to retry the data, and it abandons any incoming data until it receives the retried data.
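A minimal sketch of this receive-side retry behavior, under our reading of the protocol, follows; the sequence numbering and all names are our assumptions, not the authors' RTL.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical flit record: a sequence number plus the decoder's
// check result (payload omitted for brevity).
struct Flit { uint32_t seq; bool checkOk; };

struct CrmReceiver {
    bool     awaitingRetry = false;
    uint32_t retrySeq      = 0;

    // Returns true if the flit should be buffered into a VC.
    bool onFlit(const Flit& f, std::deque<uint32_t>& retryCmdOut) {
        if (awaitingRetry) {
            if (f.seq != retrySeq) return false; // abandon until retried data
            awaitingRetry = false;               // retransmission arrived
        }
        if (!f.checkOk) {                        // decoder check failed
            awaitingRetry = true;
            retrySeq = f.seq;
            retryCmdOut.push_back(f.seq);        // issue retry command
            return false;
        }
        return true;                             // forward by priority/address
    }
};
```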
B. On-Chip VCs Routing Optimization

Our router architecture is shown in Fig. 5; it is an optimized version of a conventional router. It consists of five input ports, five output ports, VC buffers, a route computation unit, a VC allocator, a switch allocator, and a 5 × 5 crossbar switch. The dashed rectangles indicate the components that carry our optimizations. Details of the VC routing optimizations are given in the following.

1) Scoring Crossbar Arbitration: Fig. 6(a) illustrates a conventional round-robin crossbar arbitration method. For each input port, if packets of different priorities exist in the VCs, the highest priority one is forwarded to the crossbar arbitration unit by the VC allocator. Then, the packets from different input ports are granted access to the routing resource by a priority comparator and round-robin arbitration. This multilevel arbitration method introduces long latency and consumes more hardware resources.

To reduce latency, we propose a scoring arbitration method, as shown in Fig. 6(b).
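A minimal sketch of such a single-stage scoring arbiter is shown below. The exact scoring function is an assumption on our part: we combine packet priority with waiting age so that priority dominates while old packets are not starved; all names are illustrative.

```cpp
#include <array>
#include <cstdint>

// One arbitration candidate per input port of the 5x5 crossbar.
struct Candidate {
    bool     valid    = false;
    uint8_t  priority = 0;  // higher = more urgent
    uint16_t age      = 0;  // cycles spent waiting at the input port
};

// Returns the winning input port for one output port, or -1 if none.
int scoringArbitrate(const std::array<Candidate, 5>& req) {
    int best = -1;
    uint32_t bestScore = 0;
    for (int i = 0; i < 5; ++i) {
        if (!req[i].valid) continue;
        // Priority dominates the score; age breaks ties and bounds waiting.
        uint32_t score = (uint32_t(req[i].priority) << 16) | req[i].age;
        if (best < 0 || score > bestScore) {
            best = i;
            bestScore = score;
        }
    }
    return best;
}
```

Collapsing the priority comparator and the round-robin stage into one score comparison is what removes the multilevel arbitration latency described above.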
Fig. 10. Large-scale DNN accelerator. (a) Overall architecture of the accelerator. (b) 2-D mesh NoC-based single chip. (c) Architecture of the processing
node. (d) Architecture of the analog processing unit.
A. Experiment Setting
We implement the proposed interconnect, NeuronLink,
in C++ at the cycle-accurate level for fast performance evalu-
ation and also in Verilog for FPGA-based hardware evaluation.
Network throughput, average latency, channel utilization, and
VC utilization are used as performance evaluation metrics.
Throughput is the rate at which packets are delivered by the
network for a particular traffic pattern. Latency is the time
required for a packet to traverse the network from source to
destination.
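As an illustration of how these two metrics can be extracted from a cycle-accurate model, a minimal sketch follows; the record layout and all names are our assumptions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-packet record collected by the simulator.
struct PacketRecord {
    uint64_t injectCycle;  // cycle the head flit entered the network
    uint64_t ejectCycle;   // cycle the tail flit left the network
    uint32_t flits;        // packet length in flits
};

// Throughput: delivered flits per cycle per node for one traffic pattern.
double throughput(const std::vector<PacketRecord>& delivered,
                  uint64_t simCycles, uint32_t numNodes) {
    uint64_t flits = 0;
    for (const auto& p : delivered) flits += p.flits;
    return double(flits) / double(simCycles) / double(numNodes);
}

// Average latency: mean source-to-destination time over delivered packets.
double averageLatency(const std::vector<PacketRecord>& delivered) {
    if (delivered.empty()) return 0.0;
    uint64_t sum = 0;
    for (const auto& p : delivered) sum += p.ejectCycle - p.injectCycle;
    return double(sum) / double(delivered.size());
}
```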
In the remainder of this section, Section VI-B presents the FPGA-based hardware characteristics of NeuronLink, Section VI-C reports the performance of the interconnect, and Section VI-D gives comparisons against other interconnects.
B. Hardware Characteristics

Fig. 12 shows our FPGA-based evaluation platform, which consists of four Xilinx ZCU102 FPGA boards connected in a ring topology via optical fibers. We use the platform to evaluate the interconnection network of the DNN accelerator described in Section V. It is noted that NeuronLink includes two parts: NeuronLink-R (the router, see Section IV-B) and NeuronLink-C (the chip-to-chip connection, see Section IV-C). Each FPGA board implements two NeuronLink-Cs for chip-to-chip communication and a 2-D 4 × 4 mesh NoC with 16 NeuronLink-Rs for on-chip communication. The overall system is simulated and evaluated with Vivado 2016.4, with the ResNet-18 model mapped to the hardware platform (see Fig. 11). The system runs at 250 MHz. The FPGA resource utilization of a single chip is shown in Table IV. In Fig. 13, we also show the hardware resource comparison between the on-chip network (16 NeuronLink-Rs) and the chip-to-chip network (two NeuronLink-Cs).

Fig. 13. Hardware resource comparison: on-chip versus chip-to-chip.

C. Performance of NeuronLink

To better understand the performance of NeuronLink, we evaluate the on-chip network (NeuronLink-R) and the chip-to-chip network (NeuronLink-C) separately. For the on-chip network, we give the network throughput, average latency, channel utilization, and VC utilization under different injection rates. For the chip-to-chip network, we give the network throughput and average latency under different injection rates, and the bit error rate. Besides,
we also evaluate the overall performance of NeuronLink in terms of network throughput and average latency.

1) On-Chip Interconnection Network Performance: The architecture of the on-chip network is shown in Fig. 10(b): a 2-D 4 × 4 mesh NoC with the optimized router (NeuronLink-R, see Fig. 5) described in Section IV-B. The performance of the on-chip network is evaluated under six test cases, as summarized in Table V. Our test cases cover different traffic patterns (shuffle, neighbor, and real DNN workloads), different packet types (unicast and multicast), different packet priorities (fixed and varying), and random packet lengths, to offer a comprehensive evaluation of the on-chip network. Figs. 14 and 15 show the throughput, average latency, channel utilization, and VC utilization under different injection rates for synthetic traffic (Test-1–Test-4) and real DNN workloads (Test-5 and Test-6), respectively. For the real DNN workloads, we choose two representative DNN models: AlexNet [30] and ResNet-18 [1]. Our DNN-to-NoC mapping strategy is similar to [31]. As demonstrated in Section III-B, unicast and multicast coexist in DNN accelerators after the mapping.

For a better understanding of the on-chip interconnection performance, some key data are listed in Table VI, including the saturation injection rate, light-load latency, and saturation channel utilization. As can be seen from Figs. 14 and 15 and Table VI, the zero-load latency under all test cases is five cycles, which is as expected and consistent with the router structure. For all test cases, the saturation throughput is over 0.45 flits/cycle/node, which achieves 70.3% of the ideal throughput (the maximum ideal throughput is 0.643 flits/cycle/node). Because the traffic pattern affects the latency and the saturation point, the average latency curves vary under different traffic patterns. Compared with the shuffle pattern, the transmission paths of the neighbor pattern face less competition, so it is less prone to congestion. Thus, Test-4 shows a lower latency (23 cycles, indicated in Table VI) and a greater saturation point (0.75 flits/cycle/node, indicated in Table VI) in Fig. 14(b). It is noted that the DNN workloads show a latency and saturation point similar to the synthetic traffic, except for the latency of Test-6, because the residual connections in ResNet-18 make the traffic more sophisticated. Among the different shuffle patterns (Test-1–Test-3), the workload of Test-3 is heavier, with both unicast and multicast traffic; thus, it reaches saturation earlier (0.64 flits/cycle/node, indicated in Table VI). For modest loads (Test-1 and Test-2), we observe two exponential increases in latency [see Fig. 14(b)]. This is because of the traits of the shuffle pattern (the traffic is heavier at the center of the network, and the routers in that region are prone to congestion) and the packet injection method we adopt (newly generated packets from a node are not injected into the network while that node faces congestion). As the injection rate increases, the traffic becomes heavier and the latency grows rapidly. When the injection rate reaches a certain point, packets can no longer enter the network from the center region, so the latency curve flattens. After that, as the injection rate approaches saturation, the packets injected from the nodes at the edge of the network dominate the network traffic and cause the second exponential increase in latency. The simulation results show that our network can effectively handle traffic imbalances. Also, the channel utilization and VC utilization increase as the injection rate increases, as shown in Figs. 14 and 15, which indicates that no blocking occurs in the network.

2) Chip-to-Chip Interconnection Network Performance: In Fig. 16, we give the performance of the chip-to-chip interconnect (NeuronLink-C), including network throughput and average latency. In our interconnect, it takes two cycles to transfer a flit in the physical layer. Therefore, the maximum throughput can be calculated as follows:

$$\mathrm{Throughput}_{\max} = \frac{f_{\mathrm{physical}}}{2\,f_{\mathrm{datalink}}} \times \frac{2\,num_{\mathrm{data}}}{num_{\mathrm{cmd}} + 2\,num_{\mathrm{data}}} \qquad (3)$$

where $f_{\mathrm{physical}}$ and $f_{\mathrm{datalink}}$ represent the working frequencies of the physical layer and the data link layer, respectively, and $num_{\mathrm{data}}$ and $num_{\mathrm{cmd}}$ represent the number of data flits and the number of commands, respectively. Here, $f_{\mathrm{physical}}$ is 250 MHz and $f_{\mathrm{datalink}}$ is 300 MHz. In our experiment, the ratio of commands to data is 1:5. Each packet has nine flits. The length of a flit is 130 bits, while the length of a command is 64 bits. With these parameters, the maximum throughput is calculated as about 0.388 flits/cycle. As shown in Fig. 16(a), the saturation throughput is consistent with this ideal maximum throughput. In other words, our interconnection system reaches the maximum utilization rate (100%) of the SerDes transmissions.

In our interconnect, because no additional messages are added to the payload in the data link layer, the encoding efficiency is only affected by the encoder in the physical layer (see Fig. 4). During the encoding phase, every 64-bit payload is encoded into 80 bits, adding 16 extra bits. Therefore, the encoding efficiency of our interconnect is 80% (64/80 bits). The maximum bandwidth of NeuronLink can be calculated as follows:

$$\mathrm{Bandwidth}_{\max} = N_{\mathrm{lane}} \times V_{\mathrm{lane}} \times E_{\mathrm{encode}} \qquad (4)$$

where $N_{\mathrm{lane}}$, $V_{\mathrm{lane}}$, and $E_{\mathrm{encode}}$ represent the number of SerDes lanes, the data transmission rate of each SerDes lane, and the encoding efficiency, respectively. Here, $N_{\mathrm{lane}}$, $V_{\mathrm{lane}}$, and $E_{\mathrm{encode}}$ are 2, 10 Gb/s, and 80%, so $\mathrm{Bandwidth}_{\max}$ is 16 Gb/s. Under the assumption that a credit synchronization command is sent per five data flits, the effective bandwidth is 16 × (5 × 130)/(5 × 130 + 64) = 14.56 Gb/s. Compared to PCIe Gen2 ×4 (8 Gb/s), our effective bandwidth increases by 82%.

For the average latency shown in Fig. 16(b), when the injection rate is lower than 0.2 flits/cycle, the transmission is neither blocked nor influenced by credit updates, so the average latency is 105 cycles. When the injection rate increases but stays below 0.4 flits/cycle, the latency increases to 132 cycles due to the round-robin transmission among the four routers. Because of the limits of the SerDes throughput and the credit mechanism, the average latency grows sharply beyond an injection rate of 0.4 flits/cycle. At saturation, the VCs hold about 40 packets (360 flits), and an additional 864 cycles are needed to transfer them.
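As a quick numeric sketch of (3) and (4) under the constants above, the following may help; the variable names are ours, and only the constants come from the text.

```cpp
#include <cstdio>

// Numeric sketch of (3) and (4) with the constants quoted in the text.
int main() {
    const double fPhysical = 250e6; // physical-layer clock (Hz)
    const double fDatalink = 300e6; // data-link-layer clock (Hz)
    const double numCmd  = 1.0;     // commands per group (cmd:data = 1:5)
    const double numData = 5.0;     // data flits per group

    // (3): a flit occupies two physical-layer cycles, a command one.
    const double tputMax = (fPhysical / (2.0 * fDatalink)) *
                           (2.0 * numData / (numCmd + 2.0 * numData));

    // (4): two lanes x 10 Gb/s x 80% (64b/80b) encoding efficiency.
    const double bwMax = 2.0 * 10e9 * 0.8;
    // Effective bandwidth with one 64-bit credit command per five
    // 130-bit flits: 16 x (5*130)/(5*130 + 64) Gb/s.
    const double bwEff = bwMax * (5.0 * 130.0) / (5.0 * 130.0 + 64.0);

    std::printf("Throughput_max ~= %.3f flits/cycle\n", tputMax);
    std::printf("Bandwidth_max = %.1f Gb/s, effective ~= %.2f Gb/s\n",
                bwMax / 1e9, bwEff / 1e9);
    return 0;
}
```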
Fig. 14. Performance of on-chip interconnection under synthetic traffic (Test-1–Test-4). (a) Throughput. (b) Average latency. (c) Channel utilization.
(d) VC utilization.
Fig. 15. Performance of on-chip interconnection under real DNN workloads (Test-5 and Test-6). (a) Throughput. (b) Average latency. (c) Channel utilization.
(d) VC utilization.
TABLE V: Specification of Different Types of Test Cases
TABLE VI: Performance of the On-Chip Network
To verify the reliability and stability of NeuronLink, we utilize a built-in self-test (BIST) circuit to generate random numbers to test the bit error rate. In the test, we use the PRBS31 pattern. A huge amount of data (21 614 777 535 681) is transferred at the sending side for about 24 h, and no bit error occurs at the receiving side. According to the confidence calculation formula, we obtain a bit error rate of 2.663207e−15 for NeuronLink. Note that for PCIe, as used in 10G Ethernet, the required bit error rate is 1e−12; our interconnect shows a better bit error rate.
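For context, a commonly used zero-error confidence bound takes the form below; this standard form is our addition, and the authors' exact confidence formula may differ:

$$\mathrm{BER}_{\mathrm{CL}} \leq \frac{-\ln(1 - \mathrm{CL})}{N}$$

where $\mathrm{CL}$ is the chosen confidence level and $N$ is the number of bits transferred without error.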
3) Overall Interconnect Performance: In Fig. 17, we give the performance of the overall interconnect, including network throughput and average latency. The evaluation is based on the four-FPGA hardware platform, as demonstrated in Fig. 12, with a real workload, ResNet-18. It is noted that ResNet-18 is mapped to four chips, not a single chip, here for overall interconnect performance evaluation. The ResNet-18 mapping is illustrated in Fig. 11. Within the mapping, the traffic includes one-to-one, one-to-many, and many-to-one communications. For the overall interconnect, the saturation injection rate, the saturation throughput, the light-load latency, and the saturation latency are 0.6 flits/cycle/node, 0.43 flits/cycle/node, 25 cycles, and 105 cycles, respectively. Even under a sophisticated real workload, our interconnect shows performance similar to the test cases listed in Table V.
Fig. 16. Performance of the chip-to-chip interconnection. (a) Throughput under different injection rates. (b) Average latency under different injection rates.
Fig. 17. Performance of the overall interconnect with ResNet-18 mapping. (a) Throughput under different injection rates. (b) Average latency under different injection rates.
TABLE VII: Comparison With a Conventional Router
Fig. 19. Performance comparison between a conventional router and our proposed router. (a) Throughput under different injection rates. (b) Average latency under different injection rates.
TABLE XI: Comparison With NoC-Based DNN Accelerators

the crossbar array) using the results from [32], for ADCs using the results from [33], for DACs using the results from [34], and for SerDes using the results from [35]. We use CACTI 7.0 [36] at 32 nm to model the area and energy of all on-chip buffers, including eDRAM and SRAM. Table IX summarizes the characteristics of a single APU. In our analysis, the accelerator contains four chips [see Fig. 10(a)]. Each chip consists of 16 processing nodes, 16 NeuronLink-Rs, two NeuronLink-Cs, and two SerDes. For a fair comparison, the PCIe shown in Fig. 10(b) is not included in the analysis. The hardware characteristics of the accelerator are given in Table X. Our accelerator has a 41.2-mm² chip area and 15.4-W power consumption.

There are few works on interconnection networks running DNNs; Eyeriss-v2 [14] and DaDianNao [16] are two representative works. In Table XI, we compare our accelerator with Eyeriss-v2 and DaDianNao. We consider two key metrics: power efficiency and area efficiency. Power efficiency represents the number of 8-/16-bit operations performed per watt (GOPS/W), while area efficiency represents the number of 8-/16-bit operations performed per mm² (GOPS/mm²). The power efficiency and area efficiency of Eyeriss-v2 and DaDianNao are also scaled to 32 nm under 8-bit precision for a direct comparison. As shown in Table XI, our accelerator achieves 1361.7-GOPS/W power efficiency and 508.9-GOPS/mm² area efficiency. The NeuronLink-based accelerator shows 1.34× better power efficiency and 9.12× better area efficiency than Eyeriss-v2, and 1.27× better power efficiency and 2.01× better area efficiency than DaDianNao.

VII. CONCLUSION

In this work, we proposed a complete chip-to-chip interconnection network for large-scale NN accelerators, including both intrachip and interchip communication techniques. The interconnect aims at efficiently managing the huge amount of multicast-based traffic inside DNN accelerators. The interconnect is fully evaluated with cycle-accurate C++ and RTL models. Also, the interconnect is implemented on a four-FPGA hardware platform. Experimental results have shown that the proposed interconnect, NeuronLink, can achieve high throughput and a low bit error rate with low overhead compared with state-of-the-art interconnects. Furthermore, our NeuronLink-based DNN accelerator outperforms previous NoC-based DNN accelerators by 1.27×–1.34× in terms of power efficiency and by 2.01×–9.12× in terms of area efficiency.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 630–645.
[2] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[4] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2015, pp. 161–170.
[5] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proc. Int. Conf. Mach. Learn. (ICML), Jul. 2015, pp. 1737–1746.
[6] X. Zhou et al., “Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach,” in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2018, pp. 15–28.
[7] S. Yin et al., “An energy-efficient reconfigurable processor for binary- and ternary-weight neural networks with flexible data bit width,” IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1120–1136, Apr. 2019.
[8] P. Ou et al., “A 65 nm 39 GOPS/W 24-core processor with 11 Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2013, pp. 56–57.
[9] A. Touzene, “On all-to-all broadcast in dense Gaussian network on-chip,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 4, pp. 1085–1095, Apr. 2015.
[10] B. Bohnenstiehl et al., “KiloCore: A 32-nm 1000-processor computational array,” IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 891–902, Apr. 2017.
[11] A. Firuzan, M. Modarressi, M. Daneshtalab, and M. Reshadi, “Reconfigurable network-on-chip for 3D neural network accelerators,” in Proc. 12th IEEE/ACM Int. Symp. Netw.-Chip (NOCS), Oct. 2018, pp. 1–8.
[12] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), Mar. 2018, pp. 461–475.
[13] M. F. Reza and P. Ampadu, “Energy-efficient and high-performance NoC architecture and mapping solution for deep neural networks,” in Proc. 13th IEEE/ACM Int. Symp. Netw.-Chip, Oct. 2019, pp. 1–8.
[14] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[15] K. Bhardwaj and S. M. Nowick, “A continuous-time replication strategy for efficient multicast in asynchronous NoCs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 2, pp. 350–363, Feb. 2019.
[16] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.
[17] X. Liu, W. Wen, X. Qian, H. Li, and Y. Chen, “Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems,” in Proc. 23rd Asia South Pacific Design Automat. Conf. (ASP-DAC), Jan. 2018, pp. 141–146.
[18] R. Hojabr, M. Modarressi, M. Daneshtalab, A. Yasoubi, and A. Khonsari, “Customizing Clos network-on-chip for neural networks,” IEEE Trans. Comput., vol. 66, no. 11, pp. 1865–1877, Nov. 2017.
[19] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[21] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5987–5995.
[22] X. Zhang, Z. Li, C. C. Loy, and D. Lin, “PolyNet: A pursuit of structural diversity in very deep networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3900–3908.
[23] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[24] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8697–8710.
[25] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “AutoAugment: Learning augmentation strategies from data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 113–123.
[26] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 4780–4789.
[27] P. A. Merolla et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, Aug. 2014.
[28] W. Liao, Y. Guo, S. Xiao, and Z. Yu, “A low-cost and high-throughput NoC-aware chip-to-chip interconnection,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020.
[29] F. Mireshghallah, M. Bakhshalipour, M. Sadrosadati, and H. Sarbazi-Azad, “Energy-efficient permanent fault tolerance in hard real-time systems,” IEEE Trans. Comput., vol. 68, no. 10, pp. 1539–1545, Oct. 2019.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2012, pp. 1097–1105.
[31] K.-C. Chen and T.-Y. Wang, “NN-Noxim: High-level cycle-accurate NoC-based neural networks simulator,” in Proc. 11th Int. Workshop Netw. Chip Architectures (NoCArc), Oct. 2018, pp. 1–5.
[32] A. Shafiee et al., “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 14–26.
[33] L. Kull et al., “A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2013, pp. 468–469.
[34] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, “Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 8, pp. 1736–1748, Aug. 2011.
[35] B. Zhang et al., “A 28 Gb/s multistandard serial link transceiver for backplane applications in 28 nm CMOS,” IEEE J. Solid-State Circuits, vol. 50, no. 12, pp. 3089–3100, Dec. 2015.
[36] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2007, pp. 3–14.