
E0294: SYSTEMS FOR MACHINE LEARNING

Lecture #9

Author
Boul Chandra Garai
PhD, CSA, IISc
SR No: 23318

6th February, 2024

Contents
1 Re-cap from the previous class: DianNao Accelerator

2 Performance scaling of DianNao DNN accelerators
2.1 DaDianNao: A big computer
2.1.1 Routers
2.1.2 5x5 Crossbar Architecture
2.1.3 Performance Achievements with the DaDianNao accelerator
2.2 ShiDianNao: Vision Computer

3 Communication aware DNN accelerators
3.1 Eyeriss
3.2 Eyeriss-V2

1 Re-cap from the previous class: DianNao Accelerator
DianNao [1] was designed as a custom accelerator for DNNs. It is implemented using tiling-based techniques to improve performance by reducing the number of memory accesses. The top-level architecture is shown in Figure 1 below:

Figure 1: DianNao, Tiling-based DNN accelerator

At a very high level, it has three Neural Functional Unit (NFU) stages: NFU-1 takes care of the multiplications, NFU-2 is the adder tree, and NFU-3 takes care of the non-linear operations. It has three separate buffers to store the inputs, weights and outputs. The buffer sizes differ so as to accommodate the different dimensions of the input, weight, and output fmaps.
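Although the lecture contains no code, the behaviour of the three NFU stages and the tiled (folded) execution can be summarised with a short Python sketch. The tile width Tn = 16, the use of floating-point values, and the sigmoid in NFU-3 are illustrative assumptions of this sketch; the real hardware operates on fixed-point tiles fully in parallel rather than in a Python loop.

```python
import math

Tn = 16  # tile width: the NFU works on Tn inputs x Tn output neurons at a time (assumed value)

def nfu1_multiply(inputs, weights):
    # NFU-1: element-wise multiplications of one input tile with one weight tile
    return [x * w for x, w in zip(inputs, weights)]

def nfu2_adder_tree(products, partial_sum):
    # NFU-2: adder tree reduces the products and accumulates into the running partial sum
    return partial_sum + sum(products)

def nfu3_nonlinear(acc):
    # NFU-3: non-linear activation (a sigmoid is assumed here for illustration)
    return 1.0 / (1.0 + math.exp(-acc))

def fc_layer(inputs, weight_rows):
    """One fully connected layer; the input dimension is tiled by Tn so the same
    small set of hardware units is time-multiplexed over the whole layer."""
    outputs = []
    for row in weight_rows:                    # one output neuron per weight row
        acc = 0.0
        for t in range(0, len(inputs), Tn):    # folded execution over input tiles
            acc = nfu2_adder_tree(nfu1_multiply(inputs[t:t + Tn], row[t:t + Tn]), acc)
        outputs.append(nfu3_nonlinear(acc))
    return outputs

# Tiny example: 32 inputs feeding 4 output neurons
inputs = [0.01 * i for i in range(32)]
weights = [[0.1] * 32 for _ in range(4)]
print(fc_layer(inputs, weights))
```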
At a high level, the architecture of DianNao can be described as follows:
1. Compute Units:
- DianNao consists of multiple compute units, each responsible for executing neural network
layers.
- These compute units are specialized for matrix multiplications and other operations com-
mon in deep learning.
2. Memory Hierarchy:
- DianNao employs a memory hierarchy to manage data movement efficiently.
- It includes local buffers for storing intermediate results and weights.
- The global memory holds the neural network parameters.
3. Control Logic:
- The control logic orchestrates the execution of layers.
- It manages data flow, synchronization, and control signals.
4. Interconnect:
- The interconnect connects compute units, memory, and control logic.
- It ensures efficient communication between components.
5. Overall Design Philosophy:
- DianNao’s architecture prioritizes energy efficiency and throughput.
- It achieves this by minimizing data movement and maximizing parallelism.
DianNao is a remarkable high-throughput and energy-efficient accelerator designed for deep
neural networks.
Performance of the DianNao accelerator:
1. Key Achievements:
- DianNao achieves a high throughput of 452 GOP/s (operations such as synaptic weight multiplications and neuron output additions).
- It accomplishes this in a small footprint of 3.02 mm² and consumes 485 mW.
- Compared to a 128-bit 2 GHz SIMD processor, DianNao is 117.87 times faster and reduces
total energy consumption by 21.08 times.
2. Memory Considerations:
- DianNao’s design accounts for memory access efficiency.
- It uses a folded approach, where a few hardware units serve multiple neurons in a time-
multiplexed manner.
- Buffers are employed to bring in the required elements, maximizing data reuse.
In summary, DianNao’s high throughput and compact design make it suitable for a broad
range of systems and applications, enabling the use of state-of-the-art machine learning
algorithms.
For more technical details and interesting facts about this accelerator, the reader is strongly encouraged to explore the original research paper [1].

2 Performance scaling of DianNao DNN accelerators


This section discusses how the custom accelerator DianNao was scaled up to further improve the performance of DNN accelerators. Two examples of improved versions of the DianNao architecture are:
– DaDianNao [2]
– ShiDianNao [3]
In the next section we will discuss two communication-aware DNN accelerators:
– Eyeriss [4]
– Eyeriss-V2 [5]

2.1 DaDianNao: A big computer


DaDianNao is a remarkable machine-learning supercomputer accelerator that has made waves in the field. Developed by researchers at the Institute of Computing Technology, Chinese Academy of Sciences, together with Inria, it is designed to accelerate neural network computations, particularly for tasks like deep learning and convolutional neural networks (CNNs).

We have seen that the previous design, DianNao, consists of a single NFU, which itself comprises three stages (NFU-1, NFU-2 and NFU-3), each dedicated to a specialized operation. DianNao is not well suited for executing large DNNs, because the high volume of parameters it must handle prevents it from delivering the desired throughput. To address this challenge, DaDianNao integrates many DianNao-style NFUs on a single chip, following a concept similar to multi-core architecture design.
The two immediate modifications in the DaDianNao architecture with respect to its predecessor are the introduction of:
– Synapses close to the neurons
– eDRAM instead of SRAM
The introduction of multiple NFU units calls for multiple buffers. Some of these buffers store the weights and intermediate values, and for this memory eDRAM is used in particular: eDRAM is more energy efficient and has a higher storage density than traditional SRAM, both of which are critical for such large-scale applications.
The following Figure 2 shows the simplified floorplan of the DaDianNao architecture.

Figure 2: Simplified floorplan with a single central NFU showing wire congestion.

If we closely observe Figure 2, we can see that each NFU is connected to four eDRAM banks, and sufficient eDRAM capacity is provided to hold all synapses in the combined eDRAM of all chips, which saves off-chip DRAM accesses. The NFU is similar to the NFU in the DianNao architecture, but here it is connected to the four eDRAM banks through wires. This unit of an NFU with four eDRAM banks forms a tile, and multiples of these tiles are connected together to form the DaDianNao architecture, a concept similar to today's multi-core architectures. DaDianNao integrates 16 such tiles, as shown in Figure 3.

Figure 3: Tile-based organization of a node (left) and tile architecture (right). A node
contains 16 tiles, two central eDRAM banks and fat tree interconnect; a tile has an NFU,
four eDRAM banks and input/output interfaces to/from the central eDRAM banks.

Sixteen such tiles are integrated with:

– Two central eDRAM banks connected through a fat-tree interconnect. (In an ordinary tree interconnect every branch has the same bandwidth; in a fat-tree, branches closer to the top of the hierarchy are "fatter", i.e., have higher bandwidth, which allows a more efficient, technology-specific use of the links.)
– A mesh NoC whose routers (Section 2.1.1) are built around 5×5 crossbars, connecting the tiles.
Each tile consists of four eDRAM banks, each connected to the NFU, and the fat-tree interconnect is used to connect the sixteen tiles to the central eDRAM banks. All the tiles are connected to each other using a mesh-NoC topology whose routers use the 5×5 crossbar architecture (Section 2.1.2).
It is worth noting that two types of interconnect are used in the DaDianNao accelerator design: (i) the fat-tree interconnect connecting the eDRAM banks, and (ii) the 5×5 crossbar mesh NoC connecting the tiles with each other.
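As a rough, back-of-the-envelope illustration of the "hold all synapses in on-chip eDRAM" idea, the sketch below estimates how many nodes are needed for a given model size. The per-bank capacity and the 16-bit weight size are assumptions chosen purely for illustration (only the 4 MB central eDRAM figure appears later in the text); the paper gives the real capacities.

```python
import math

def nodes_needed(num_weights, bytes_per_weight=2,
                 tiles_per_node=16, banks_per_tile=4,
                 kb_per_bank=512, central_kb=4 * 1024):
    """Estimate how many DaDianNao-style nodes keep all synapses resident in eDRAM.

    bytes_per_weight and kb_per_bank are illustrative assumptions; central_kb
    reflects the 4 MB central eDRAM mentioned later in the text.
    """
    per_node_kb = tiles_per_node * banks_per_tile * kb_per_bank + central_kb
    model_kb = num_weights * bytes_per_weight / 1024
    return math.ceil(model_kb / per_node_kb)

# e.g. a hypothetical network with 500 million synaptic weights
print(nodes_needed(500_000_000))
```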

2.1.1 Routers

This section briefly illustrates the architecture of a network router and its operation in a mesh network topology.

Figure 4: Simplified architecture of a network router and its operation in a mesh network
topology.

On the left side of the Figure 4, we can see the internal structure of a router[6]. It
consists of:
- Input Buffers: These are used to store incoming packets or data before they are processed.
There might be multiple buffers corresponding to different input ports (labeled Input 0 to
Input (p-1), where p is the number of ports).
- Route Computation: This is where the router decides the next hop for each incoming
packet based on its destination address.
- Switch Allocator: Once the routing decision is made, the switch allocator manages the
actual switching of the packet from the input buffer to the appropriate output line, ensuring
that packets do not collide.
- Crossbar: This is a switching fabric that allows multiple packets to be moved simultane-
ously from inputs to outputs in a non-blocking fashion.
Figure 4 also shows the router operational pipeline with the following stages (see the sketch after this list):
- IB: Input Buffering, where packets are initially stored.
- RC: Route Computation, determining the packet’s next hop.
- SA: Switch Allocation, the process of assigning the crossbar switch resources.
- ST: Switch Traversal, the actual transfer of packets through the crossbar.
- LT: Link Traversal, where the packet moves out onto the link to the next router.
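The per-hop pipeline can be illustrated with a small Python sketch. Dimension-ordered (XY) routing and one cycle per stage are assumptions made for the example; the figure itself does not specify the routing algorithm or the stage timing.

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing path in a 2-D mesh.

    XY routing is a common choice for mesh NoCs and is assumed here for
    illustration; the lecture figure does not name the routing algorithm.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # route along X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

PIPELINE = ["IB", "RC", "SA", "ST", "LT"]   # per-router stages from Figure 4

def zero_load_latency(src, dst, cycles_per_stage=1):
    """Zero-load latency: every hop pays the full 5-stage router pipeline."""
    hops = len(xy_route(src, dst)) - 1
    return hops * len(PIPELINE) * cycles_per_stage

print(xy_route((0, 0), (2, 3)))            # [(0,0),(1,0),(2,0),(2,1),(2,2),(2,3)]
print(zero_load_latency((0, 0), (2, 3)))   # 25 cycles under these assumptions
```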
The right side of Figure 4 is a representation of a mesh network topology, a network in which each node (represented by a circle) is connected to one or more neighbouring nodes, allowing multiple paths for data to travel. This provides high resilience and multiple routes for data packets to reach their destination.
The message path shows the route a message might take through the mesh network to reach
its destination, hopping from node to node. The dotted lines indicate the intended path
for a packet, which might be determined by the route computation in each router it passes
through.
The red dotted lines zoom from the router architecture into the mesh network, indicating that the router's operations (routing, switching, flow control, error control) are fundamental to the functioning of the mesh network, managing the path that messages take through the network.

2.1.2 5x5 Crossbar Architecture

In this section we briefly discuss the 5x5 crossbar architecture. Figure 5 shows the schematic of a router with multiple inputs and outputs, as used in network-on-chip (NoC) architectures and similar high-performance computing systems, and in particular in the DaDianNao DNN architecture.

Figure 5: Simplified 5x5 Crossbar NOC Architecture schematic.

This type of router is critical in managing data transmission within a multi-core processor or between processors in a system. The crossbar-based NoC architecture consists of multiplexers, offers high parallelism, and reduces head-of-line blocking. Here is an overview of all the components and their functions:
- Input Ports (N, E, S, W, L): These ports correspond to the different directions from which data packets can enter the router. N, E, S and W stand for North, East, South and West, representing the four cardinal directions in a mesh network, and L represents the local input from the processor core/DNN tile to which the router is attached.
- Crossbar (5x5): This is a switch matrix that allows for the dynamic connection of any of
the input ports to any of the output ports. The notation (5x5) indicates that there are five
input channels and five output channels, allowing for any combination of connections.
- Arbiter (VC and SW Allocator): This component serves two purposes: it acts as a Vir-
tual Channel (VC) allocator and a Switch Allocator (SW). The VC allocator manages the
assignment of virtual channels to incoming packets to ensure efficient use of the router’s
resources. The SW allocator, on the other hand, dynamically assigns paths through the
crossbar for the packets to reach their output ports.
- Virtual Channels (VC) Blocks: These are buffers associated with each input port and
are used to temporarily store data packets before they are forwarded through the crossbar.
The VC blocks help to manage congestion and provide multiple paths for packets to travel,
improving bandwidth and reducing latency.
- Buffer-full, Credit-in/out: These signals are part of the flow control mechanism. ”Buffer-
full” indicates whether the buffer associated with an output port is full. ”Credit-in” and
”Credit-out” signals are used to manage the flow of data by indicating whether space is
available for more data to be sent or received.
- Link N, E, S, W, L: These represent the physical or logical links to neighboring routers or
processor cores in the respective directions.
- Config: This is a configuration interface used for setting up or managing router settings.
- Req N, grant N: These signals are associated with the arbitration process for the North
input port. ”Req N” is the request signal from the North input port asking to send data,
and ”grant N” is the signal from the arbiter granting the request.
- Aselect: A signal related to the arbiter’s selection process, indicating which input port has
been granted access to the crossbar for data transmission.
The diagram is a simplified representation of a complex router architecture. In practice,
routers like this manage the transfer of data packets within a network, making decisions
based on the destination of the data, the availability of the network resources, and the op-
timal path to ensure efficient and reliable data communication.
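To make the arbitration idea concrete, here is a minimal Python sketch of one cycle of a 5-input, 5-output crossbar. Fixed-priority arbitration is assumed purely for brevity; practical routers use fairer schemes such as round-robin or matrix arbiters, and handle virtual channels and credit-based flow control as described above.

```python
PORTS = ["N", "E", "S", "W", "L"]   # four mesh directions plus the local tile port

def arbitrate(requests):
    """One cycle of a 5x5 crossbar: grant each output port to at most one input.

    requests maps input port -> desired output port. Fixed-priority arbitration
    (in PORTS order) is assumed here for simplicity.
    """
    grants = {}
    taken_outputs = set()
    for inp in PORTS:                       # priority order = position in PORTS
        out = requests.get(inp)
        if out is not None and out not in taken_outputs:
            grants[inp] = out               # this input wins the crossbar slot
            taken_outputs.add(out)
    return grants

# Two inputs contend for the East output; only one is granted this cycle,
# the other stays buffered in its virtual channel and retries next cycle.
print(arbitrate({"N": "E", "L": "E", "W": "S"}))   # {'N': 'E', 'W': 'S'}
```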

2.1.3 Performance Achievements with DaDianNao accelerator:

At a very high level, the architecture of DaDianNao is as follows:


Major Components:
- Input Buffer (NBin): This buffer handles input neurons.
- Output Buffer (NBout): Responsible for output neurons.
- Synaptic Weight Buffer (SB): Stores synaptic weights.
- Neural Functional Unit (NFU): The computational block that performs both synapse and
neuron computations.
- Control Logic (CP): Manages overall system control and coordination.
- Energy-efficient, high-density eDRAM to store synapses
- Fat-tree and 5×5 crossbar mesh NoC router interconnects
On a subset of large neural network layers, DaDianNao achieves remarkable results:
- Speedup: Up to 656.63x compared to a GPU.
- Energy Reduction: An average reduction of 184.05x for a 64-chip system.
Implementation Details:
- The architecture is implemented down to the place and route at 28 nm.
- It combines custom storage and computational units, along with electrical inter-chip in-
terconnects.
DaDianNao’s innovative design enables efficient and powerful neural network computations,
making it a significant advancement in machine-learning hardware. The cell-based layout of the chip is shown in Figure 6: 44.53% of the chip area is used by the 16 tiles, 26.02% by the four HT (HyperTransport) IPs, and 11.66% by the central block (including 4 MB eDRAM, router and control
logic). The wires between the central block and the tiles occupy 8.97% of the area. Over-
all, about a half (47.55%) of the chip is consumed by memory cells (mostly eDRAM). The
combinational logic and register only account for 5.88% and 4.94% of the area respectively.

Figure 6: Snapshot of the DaDianNao node layout.

Figure 7 compares the performance of the DaDianNao architecture against the GPU baseline (an NVIDIA K20 GPU). On average, the 1-node, 4-node, 16-node and 64-node architectures are respectively 21.38x, 79.81x, 216.72x, and 450.65x faster than the NVIDIA K20 GPU. The first reason for the higher performance is the large number of operators:
in each node, there are 9216 operators (mostly multipliers and adders), compared to the
2496 MACs of the GPU. The second reason is that the on-chip eDRAM provides the nec-
essary bandwidth and low-latency access to feed these many operators. For more technical details on this accelerator, the reader is strongly encouraged to explore the original research paper [2].

Figure 7: Speedup with respect to the NVIDIA K20 GPU (training).

For training, the energy reduction of the DaDianNao architecture with respect to the NVIDIA K20 GPU is 172.39x, 180.42x, 142.59x, and 66.94x for the 1-node, 4-node, 16-node and 64-node architectures, as shown in Figure 8. As we can see from the figure, if we increase the number of chips, the energy reduction decreases for almost all the convolutional layers (except for the POOL and LRN layers). This can be attributed to the increasing complexity of the interconnect network, clock distribution and control logic as the number of chips grows, but the benefit shows up as performance in terms of speed-up (Figure 7). As per the analysis in the paper [2], a significant portion of the energy is consumed by the NFU.

Figure 8: Energy reduction with respect to NVIDIA K20 GPU (training).

2.2 ShiDianNao: Vision Computer


The DianNao chip was mostly designed with the performance of multilayer perceptrons in mind, which is why convolutional neural networks were not its focus, whereas DaDianNao extended the design to fairly small convolutional neural networks. ShiDianNao, in contrast, is designed for practical DNN applications: it can be used wherever deep convolutional-layer computation is required, in a more generalised way. As illustrated in Figure 9, the ShiDianNao accelerator consists of the following main components: two buffers for input and output neurons (NBin and NBout), a buffer for synapses (SB), a neural functional unit (NFU) plus an arithmetic unit (ALU) for computing output neurons, and a buffer and a decoder for instructions (IB). Compared with DianNao and DaDianNao, it introduces two additional components:
– ALU (Arithmetic Logic Unit), this is a computational block that performs arithmetic and
logical operations. It’s connected to the NFU to carry out the necessary calculations
– Instruction buffer (this stores the instructions that will be decoded for the NFU to exe-
cute).

Figure 9: ShiDianNao Accelerator architecture.

High-level Architecture Overview:

Processing Elements (PEs): Instead of parallel multipliers and adders, ShiDianNao employs a grid of simple PEs. Each PE handles a single Multiply-Accumulate (MAC) operation or other operations required in a Convolutional Neural Network (CNN).
On-chip Buffers: The weights and neuron data reside in three buffers totaling 256 KB (with a focus on classifier layers).
Synaptic Buffer: Of these, a 128 KB synaptic buffer (SB) holds the weights.
Input/Output Buffers: These buffers handle input and output data.
Instruction Buffer: Stores instructions for the accelerator.
ShiDianNao is an accelerator designed to shift vision processing closer to the image sensor
(CMOS or CCD).
It exploits an important property of Convolutional Neural Networks (CNNs): shared weights
among many neurons. This property considerably reduces the neural network memory foot-
print.
By mapping the entire CNN within an SRAM, ShiDianNao eliminates all DRAM accesses
for weights. When placed next to the image sensor, it further eliminates all remaining
DRAM accesses (for inputs and outputs).
Executing DNNs:
Instead of focusing on a single type of layer (e.g., only convolutional layers), ShiDianNao ex-
ecutes the entire DNN. Both CNNs and DNNs are popular, but CNNs have a key difference:
each neuron shares its weights with all other neurons within a feature map. This weight
sharing significantly reduces the total number of weights compared to DNNs. ShiDianNao
leverages this weight sharing property to efficiently execute the entire DNN, including vari-
ous types of layers (convolutional, fully connected, etc.).
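The effect of weight sharing is easy to quantify. The sketch below compares the number of weights in a convolutional layer against a hypothetical fully connected layer producing an output of the same size; the layer dimensions are made up purely for illustration.

```python
def conv_weights(M, C, R, S):
    """Weights of a convolutional layer: M filters of size C x R x S (biases ignored)."""
    return M * C * R * S

def fc_weights(in_neurons, out_neurons):
    """Weights of a fully connected layer: every input connects to every output."""
    return in_neurons * out_neurons

# Hypothetical layer producing 64 feature maps of 28x28 from 32 input maps of 28x28
C, H, W = 32, 28, 28
M, R, S = 64, 3, 3
print(conv_weights(M, C, R, S))           # 18,432 shared weights
print(fc_weights(C * H * W, M * H * W))   # ~1.26 billion weights without sharing
```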
Dataflow: DianNao used an output-stationary dataflow, and ShiDianNao also uses the output-stationary dataflow, but it integrates many more processing elements, which calls for more complex control logic. Each PE processes one neuron at a time during convolutions. ShiDianNao reduces data movement for input neurons, output neurons, and weights.
Algorithm-hardware mapping: Given one large neural network, the mapping algorithm decides which layer goes to which tile so as to improve energy reduction and communication, and thereby overall performance, by executing the whole DNN instead of single layers. A convolutional layer constructs multiple output feature maps from multiple input feature maps. When executing a convolutional layer, the accelerator continuously performs the computations of one output feature map, and will not move to the next output feature map until the current map has been constructed. When computing each output feature map, each PE of the accelerator continuously accommodates a single output neuron, and will not switch to another output neuron until the current neuron has been computed. Figure 10 shows an example of how different neurons of the same output feature map are computed simultaneously. Without loss of generality, Figure 10 shows a small design having 2 × 2 PEs (PE0,0, PE1,0, PE0,1 and PE1,1) and a convolutional layer with a 3 × 3 kernel size (convolutional window size) and a 1 × 1 step size; a sketch of this mapping follows the figure.

Figure 10: Algorithm-hardware mapping between a convolutional layer (convolutional window: 3 × 3; step size: 1 × 1) and an NFU implementation (with 2 × 2 PEs) [3].
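The mapping in Figure 10 can be mimicked with a simplified, single-channel Python sketch in which each PE "owns" one output neuron until it is finished and the same weight is broadcast to all PEs at every step. The inter-PE FIFOs that ShiDianNao uses to shuffle input values between neighbouring PEs are omitted here for brevity, so inputs are simply re-read.

```python
def output_stationary_conv(ifmap, kernel, pe_rows=2, pe_cols=2):
    """Output-stationary mapping: each PE owns one output pixel until it is done.

    Single-channel sketch of the Figure 10 example (2x2 PE array, 3x3 kernel,
    stride 1); only the loop structure is meant to match the description.
    """
    R = len(kernel)
    out_h = len(ifmap) - R + 1
    out_w = len(ifmap[0]) - R + 1
    ofmap = [[0] * out_w for _ in range(out_h)]
    # Step over the output feature map in blocks matching the PE-array shape
    for by in range(0, out_h, pe_rows):
        for bx in range(0, out_w, pe_cols):
            # Weights are broadcast: all PEs see the same (ky, kx) weight each step
            for ky in range(R):
                for kx in range(R):
                    w = kernel[ky][kx]
                    for py in range(pe_rows):          # each (py, px) is one PE
                        for px in range(pe_cols):
                            oy, ox = by + py, bx + px
                            if oy < out_h and ox < out_w:
                                ofmap[oy][ox] += w * ifmap[oy + ky][ox + kx]
    return ofmap

ifmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(output_stationary_conv(ifmap, kernel))   # [[18, 21], [30, 33]]
```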

Performance comparison of the ShiDianNao accelerator with its counterparts, DianNao and DaDianNao:
1. ShiDianNao:
- Design Focus: ShiDianNao is specifically tailored for embedded devices located near the
CMOS or CCD sensors. It targets small networks that can fit on a chip, with a 288 KB SRAM budget.
- Design Point: It strikes an intermediate balance between energy efficiency and throughput.
While it offers relatively low throughput, it excels in energy efficiency.
Architecture:
- The weights and neuron data reside in three buffers, totaling 256 KB (with a focus on classifier layers).
- Instead of parallel multipliers and adders, ShiDianNao employs a grid of simple processing elements (PEs). Each PE handles a single Multiply-Accumulate (MAC) operation or other operations required in a Convolutional Neural Network (CNN).
- PEs also have small FIFO buffers (about 10 entries) for value shuffling.
- Additional components include a 128KB synaptic buffer, input/output buffers, and an
instruction buffer.
Dataflow:
- Each PE processes one neuron at a time during convolutions.
- For convolutions, synaptic weights are broadcast to all PEs, and each PE performs the
MAC operation.
- ShiDianNao reduces data movement for input neurons, output neurons, and weights.
Performance:
- Throughput: 194 GOP/s at 1 GHz with a power consumption of 320mW.
- Significantly more energy-efficient than DianNao and faster than DianNao (which accesses
DRAM).
Applications:
– Facial recognition
– Handwritten digit recognition
2. DianNao:
- Parallel multipliers and tree of adders per neuron.
- High wiring complexity due to many values being read from buffers.
- Energy consumption is higher compared to ShiDianNao.
3. DaDianNao:
- Larger area (68mm²) and higher throughput (6 TOP/s) than ShiDianNao.
- Throughput/area advantage for DaDianNao.
- ShiDianNao has a throughput/watt advantage.
- DaDianNao takes 50ms to process one image, while being less energy-efficient than ShiD-
ianNao.
The layout characteristics of the currently implemented ShiDianNao version are shown in Figure 11. ShiDianNao has 8 × 8 (64) PEs and a 64 KB NBin, a 64 KB NBout, a 128 KB SB,
and a 32 KB IB. The overall SRAM capacity of ShiDianNao is 288 KB (11.1× larger than
that of DianNao), in order to simultaneously store all data and instructions for a practical
CNN. Yet, the total area of ShiDianNao is only 3.52× larger than that of DianNao (4.86
mm² vs. 1.38 mm² )[3].

Figure 11: Layout of ShiDianNao (65 nm).

ShiDianNao strikes a balance between energy efficiency and throughput, making it suit-
able for embedded vision processing close to sensors. For further details and more information on ShiDianNao, the reader is strongly encouraged to read the original paper [3].

3 Communication aware DNN accelerators


Communication-aware DNN accelerators refer to specialized hardware architectures de-
signed to consider not only the computational resources but also the communication aspects
within the system.
1. Importance of Communication:
- In deep neural network (DNN) accelerators, computational resources (such as processing
elements, memory, and arithmetic units) play a crucial role.
- However, communication (data movement) is equally essential. Efficient data movement
can significantly impact performance and energy efficiency.
- For example, mapping neural network workloads in an accelerator often involves reusing or
parallelizing certain computations. This leads to different dataflows or mapping strategies,
which require effective communication mechanisms.
2. Challenges Addressed by Communication-Aware Accelerators:
- Irregular Dataflows: DNNs exhibit diverse layer types, shapes, cross-layer fusion, and
sparsity. These variations result in irregular dataflows within accelerators.
- PE Underutilization: Traditional accelerators with rigid and tightly coupled connections
among processing elements (PEs) and buffers may suffer from PE underutilization due to
irregular dataflows.
- Solution: Communication-aware DNN accelerators aim to address these challenges by
optimizing communication paths, enabling efficient mapping of both regular and irregular
dataflows. They achieve near 100% PE utilization by considering communication as a crit-
ical factor.
3. Examples of Communication-Aware DNN Accelerators:
- Eyeriss: Eyeriss is a groundbreaking DNN accelerator architecture designed to efficiently
run deep neural networks (DNNs) on resource-constrained platforms, such as mobile devices.
- Eyeriss v2: Eyeriss v2, an advanced DNN accelerator, also considers communication as-
pects. It introduces the Row-Stationary Plus (RS+) dataflow, optimizing spatial tiling and
parallelism for improved performance.

3.1 Eyeriss
Eyeriss is an energy-efficient reconfigurable accelerator designed for deep convolutional neural networks (CNNs). The paper titled "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks" by Chen, Krishna, Emer, and Sze [4] introduces this innovative architecture.

Figure 12: Eyeriss system architecture.

The new design proposals in the Eyeriss accelerator, in comparison to the DNN accelerators we have seen so far, are:
1. New Dataflow:
The Eyeriss accelerator introduces an improved dataflow compared to earlier accelerators.
This new dataflow enhances efficiency in processing neural network layers, resulting in bet-
ter performance and reduced energy consumption.
2. New Architecture:
The updated architecture of Eyeriss is more flexible and adaptable. It can handle various
convolutional neural network (CNN) shapes effectively. This flexibility allows it to accom-
modate different layer sizes and configurations, making it suitable for a wider range of deep
learning tasks.
3. Encoding and Decoding Technique:
Eyeriss leverages an innovative encoding and decoding technique. By analyzing data statis-
tics, it minimizes energy usage during data movement and storage. This optimization con-
tributes to overall energy efficiency and faster execution of neural network operations.
4. Fabricated Chip (Co-Processor):
Eyeriss is now available as a dedicated co-processor, designed to work alongside the main
processor. This fabricated chip accelerates deep learning workloads, particularly those in-
volving CNNs. Its integration enhances overall system performance and enables efficient
execution of neural network inference tasks.

The main features of Eyeriss are as follows:
1. Spatial Architecture and Memory Hierarchy:
- This design proposes a spatial architecture comprising 168 processing elements (PEs) that establishes a four-level memory hierarchy. By leveraging the low-cost levels, such as the PE scratch pads (spads) and inter-PE communication, it minimizes data accesses to the high-cost levels, including the large on-chip global buffer (GLB) and off-chip DRAM (a toy cost illustration follows this feature list).
2. Row Stationary (RS) CNN Dataflow:
- The Row Stationary (RS) dataflow dynamically reconfigures the spatial architecture to
efficiently compute a given CNN shape, optimizing for energy efficiency. This approach
adapts the architecture layer by layer, considering the varying dimensions of input feature
maps and filter weights.
3. Network-on-Chip (NoC) Support for RS Dataflow:
- The proposed NoC architecture combines multicast and point-to-point single-cycle data
delivery to facilitate the RS dataflow. This design enhances communication efficiency within
the system.
4. Energy-Efficiency Techniques:
- To further improve energy efficiency, the design employs run-length compression (RLC)
and PE data gating. These techniques exploit statistical properties of zero data in CNNs,
reducing unnecessary computations and memory accesses.
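Because the memory levels differ so much in energy per access, the benefit of keeping data in the low-cost levels can be shown with a toy calculation. The relative costs and access counts below are illustrative of the general trend (DRAM ≫ global buffer > inter-PE > spad), not figures taken from the Eyeriss paper.

```python
# Relative energy cost per access, normalized to one PE-local scratchpad (spad)
# access. Illustrative values only, chosen to reflect the qualitative ordering.
ENERGY_PER_ACCESS = {"spad": 1, "inter_pe": 2, "glb": 6, "dram": 200}

def data_movement_energy(accesses):
    """Total (relative) data-movement energy for a layer, given per-level access counts."""
    return sum(ENERGY_PER_ACCESS[level] * count for level, count in accesses.items())

# A dataflow that re-reads everything from DRAM...
naive = {"dram": 1_000_000, "glb": 0, "inter_pe": 0, "spad": 1_000_000}
# ...versus one that reuses data from the low-cost levels (made-up counts).
reuse = {"dram": 50_000, "glb": 300_000, "inter_pe": 500_000, "spad": 4_000_000}

print(data_movement_energy(naive))   # 201,000,000
print(data_movement_energy(reuse))   # 16,800,000 -> roughly 12x less in this toy example
```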
Figure 12 shows the block diagram of the Eyeriss deep neural network (DNN) accelerator architecture. Here is a brief description of the key components and their functionality:
- Two levels of control:
Top-Level Control:
This block is the central control unit that orchestrates the operations of the accelerator. It manages the flow of data and the configuration of the computational blocks:
– Traffic between DRAM and GLB
– Traffic between GLB and PE array (NoC)
– Operation of RLC and ReLU
Low-Level Control:
– Control logic in each PE
- RLC Decoder/Enc.: RLC stands for Run-Length Coding, a form of lossless data com-
pression. The decoder and encoder manage the compression and decompression of data to
optimize memory bandwidth and storage efficiency.
- ReLU: This is the Rectified Linear Unit, a non-linear operation commonly used as an
activation function in neural networks.
- Global Buffer: An on-chip buffer memory that temporarily stores data such as intermediate computations, inputs, and weights; it has a capacity of 108 KB.
- Accelerator: This is the main computational engine of the architecture, where the actual
processing of the DNN occurs.
- Config Scan Chain: A configuration mechanism that sets up the accelerator for specific
tasks by loading configuration bits.
- 12x14 PE Array: A Processing Element (PE) array that performs the matrix and vector
operations fundamental to DNN computations. The array is composed of 12 by 14 PEs,
each performing a part of the overall computation in parallel, which is key for achieving high throughput. PEs execute independently of each other: each PE starts processing whenever its fmaps or psums arrive, unlike in a systolic array.
- Spad: Scratchpad memory associated with each PE for storing local data.
- MAC Control: Multiply-Accumulate Control, which manages the multiply-accumulate op-
erations within each PE. This is a core operation in neural network computation.
- Processing Element: A single unit within the PE array that performs computation. It often
includes a small local memory and a simple processor capable of operations like multiply-
accumulate.
- 64-bit interface with off-chip DRAM: The external DRAM stores the larger datasets and weights that cannot be held within the on-chip memory due to space constraints; data is transferred between the DRAM and the accelerator over a 64-bit bus.
The architecture diagram also shows data paths and control signals, and two different clock domains, the Link Clock and the Core Clock, indicating that different parts of the system may operate at different clock rates. (A clock domain is a part of a circuit whose flip-flops and other timing-sensitive elements are synchronized by one particular clock signal; clock gating, i.e., selectively turning off the clock to parts of the circuit that are not in use so that the gated flip-flops do not switch, is a standard technique to reduce dynamic power.) The data paths between the various components indicate the flow of data, such as feature maps (ifmap/ofmap) and filters, through the system during operation. There are different levels of communication paths between:
– PE and Global Buffer (GLB)
– PE and PE
– PE and scratchpad (spad)
This architecture is designed to efficiently process deep neural networks by optimizing data
movement, reducing memory bandwidth requirements, and parallelizing computations.
ENERGY-EFFICIENT FEATURES:
The Eyeriss chip focuses on two main approaches to improve the energy efficiency:
1) reducing data movement and
2) exploiting data statistics.
1. Energy-Efficient Dataflow: Row Stationary (RS)
In Eyeriss, the designers implemented the RS dataflow, which maps the computation of any given CNN shape onto the PE array. It is reconfigurable for different shapes and optimizes for the
best energy efficiency. The RS dataflow minimizes data movement for all data types (ifmap,
filter, and psums/ofmap) simultaneously and takes the energy costs at different levels of
the memory hierarchy into account. Data accesses to the high-cost DRAM and GLB are
minimized through maximally reusing data from the low-cost spads and inter-PE commu-
nication. Compared with the earlier dataflows in the previous accelerators, the RS dataflow
is 1.4–2.5 times more energy efficient in AlexNet, a widely used CNN [7]. To minimize the
movement of ifmaps and filters, the goal is to maximize three forms of data reuse.
1) Convolutional Reuse: Each filter weight is reused E × F times in the same ifmap
plane, and each ifmap pixel is usually reused R × S times in the same filter plane.
2) Filter Reuse: Each filter weight is reused across the batch of N ifmaps.
3) Ifmap Reuse: Each ifmap pixel is reused across M filters (to generate M ofmap chan-
nels). To minimize the movement of psums, it is desirable that the psum accumulation across C × R × S values into one ofmap value can be done as soon as possible to save both
the storage space and memory R/W energy. However, maximum input data reuse cannot
be achieved simultaneously with immediate psum reduction, since the psums generated by
multiply and accumulations (MACs) using the same filter or ifmap value are not reducible.
Thus, the RS dataflow uses a systematic approach to optimize for all data types simultane-
ously.
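The reuse opportunities listed above can be tallied for a concrete layer shape. The dimensions used in the sketch below are made up for illustration; the symbols follow the text (N batch size, M filters, C input channels, R × S filter, E × F output feature map).

```python
def reuse_profile(N, M, C, R, S, E, F):
    """Count how often each data value is (re)used in one convolutional layer."""
    return {
        "each filter weight reused within an ifmap plane (E*F)": E * F,
        "each ifmap pixel reused within a filter plane (R*S)":   R * S,
        "each filter weight reused across the batch (N)":        N,
        "each ifmap pixel reused across filters (M)":            M,
        "psums accumulated into one ofmap value (C*R*S)":        C * R * S,
    }

# A made-up layer shape, for illustration only
for name, count in reuse_profile(N=4, M=64, C=32, R=3, S=3, E=28, F=28).items():
    print(f"{name}: {count}")
```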
Row Stationary (RS) is a novel dataflow scheme designed to optimize energy efficiency in the
processing of deep convolutional neural networks (CNNs). This dataflow is designed to min-
imize data movement energy consumption on a spatial architecture, specifically targeting
both convolutional and fully-connected layers in deep neural networks.
Here are the key points about the RS dataflow:
1. Motivation: Deep convolutional neural networks (CNNs) achieve high accuracy but
come with high computational complexity. The Row Stationary dataflow scheme is designed
to minimize data movement, which is a significant source of energy consumption in hard-
ware accelerators. By reducing the amount of data that needs to be moved around, the
Row Stationary approach aims to improve the energy efficiency of the system. The need to
process hundreds of filters and channels simultaneously in high-dimensional convolutions re-
sults in significant data movement. While parallel compute paradigms address computation
requirements, energy consumption remains high due to data movement costs.
2. RS Dataflow: This approach keeps a row of data stationary (fixed in one place) as
much as possible during computations. Instead of moving data to the processing units, the
Row Stationary dataflow brings the computation to the data. This significantly reduces the
energy consumed in data transfers, which is critical for improving overall energy efficiency.
The RS dataflow minimizes data movement energy by exploiting local data reuse of filter
weights and feature map activations. It achieves this by:
- Local Reuse: Leveraging local storage for filter weights and feature map pixels. The Row
Stationary dataflow makes extensive use of local memory (or on-chip memory) to store in-
termediate computations and weights close to the processing units. This reduces the need to
access external memory, which is both power-intensive and slow, further enhancing energy
efficiency.
- Partial Sum Accumulations: Minimizing data movement during partial sum accumula-
tion.
- Reconfiguration for Different Layers: Adapting to different CNN shape configurations.
One of the key features of the Eyeriss accelerator is its reconfigurability to adapt to differ-
ent types of layers in a CNN, such as convolutional layers, pooling layers, and fully connected
layers. The Row Stationary dataflow can be reconfigured to optimize the processing for each
type of layer, ensuring that energy efficiency is maintained across the diverse operations in
CNNs.
3. Energy Efficiency: The paper demonstrates through experimental results that the Row
Stationary dataflow, when implemented in the Eyeriss accelerator, leads to significant im-
provements in energy efficiency compared to other dataflow schemes. The RS dataflow is
more energy-efficient than existing dataflows in both convolutional (1.4 to 2.5 times) and
fully-connected layers (at least 1.3 times for batch size larger than 16) when evaluated using
CNN configurations like AlexNet.
4. Spatial Architecture: Eyeriss, the proposed accelerator, employs the RS dataflow on a
spatial architecture with 168 processing elements. It reconfigures computation mappings to
maximize local data reuse and reduce expensive data movement.
2. Exploit Data Statistics:
To further improve energy efficiency, the data statistics of the CNN are exploited to:
1) reduce DRAM accesses using compression, since DRAM access is the most energy-consuming data movement per access, on top of the optimized dataflow; and
2) skip the unnecessary computations to save processing power.

Figure 13: Encoding of the RLC implementation on Eyeriss system architecture.

RLC is used in Eyeriss to exploit the zeros in fmaps and save DRAM bandwidth. Figure 13 shows an example of RLC encoding. Consecutive zeros with a maximum run length
of 31 are represented using a 5-b number as the Run. The next value is inserted directly
as a 16-b Level, and the count for run starts again. Every three pairs of run and level are
packed into a 64-b word, with the last bit indicating if the word is the last one in the code.
Based on the experiments reported in the paper using AlexNet with the ImageNet data set, the compression rate of RLC only adds 5%–10% overhead to the theoretical entropy limit. Except for the input
data to the first layer of a CNN, all the fmaps are stored in RLC compressed form in the
DRAM. The accelerator reads the encoded ifmaps from DRAM, decompresses it with the
RLC decoder, and writes it into the GLB. The computed ofmaps are read from the GLB,
processed by the ReLU module optionally, compressed by the RLC encoder, and transmit-
ted to the DRAM. This saves both space and R/W bandwidth of the DRAM.
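The RLC format described above (a 5-bit run of up to 31 zeros, a 16-bit level, and three run/level pairs plus a last-word flag per 64-bit word) can be prototyped in a few lines of Python. The exact bit ordering inside the word and the padding of an incomplete final word are assumptions of this sketch, since the text does not spell them out.

```python
def rlc_encode(values):
    """Run-length-code a list of non-negative 16-bit activation values."""
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < 31:
            run += 1                          # extend the current run of zeros
        else:
            pairs.append((run, v & 0xFFFF))   # 5-bit run + 16-bit level
            run = 0
    if run:                                   # trailing zeros: flush with a zero level (assumption)
        pairs.append((run, 0))
    while len(pairs) % 3:                     # pad the final 64-bit word (assumption)
        pairs.append((0, 0))
    words = []
    for i in range(0, len(pairs), 3):
        word = 0
        for r, level in pairs[i:i + 3]:
            word = (word << 21) | (r << 16) | level            # three 21-bit run/level fields
        word = (word << 1) | (1 if i + 3 >= len(pairs) else 0)  # last-word flag in the last bit
        words.append(word)
    return words

print([hex(w) for w in rlc_encode([0, 0, 0, 12, 0, 0, 0, 0, 0, 53, 0, 0, 22])])
```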
Here are the key aspects of the Eyeriss DNN accelerator:
1. Objective:
- Eyeriss aims to achieve state-of-the-art accuracy while minimizing energy consumption
across the entire system, including both the accelerator chip and off-chip DRAM.
- It focuses on supporting various CNN shapes by reconfiguring the architecture.
2. Architecture Highlights:
- Efficient Dataflow: Eyeriss employs an efficient dataflow design that minimizes data move-
ment. It includes:
- A spatial array for processing CNN layers.
- A memory hierarchy to exploit data reuse.
- An on-chip network to facilitate communication.
- Exploiting Data Statistics:
- Zeros skipping/gating: Eyeriss avoids unnecessary reads and computations by leveraging
data statistics.
- Data compression: Reduces off-chip memory bandwidth, which is a critical factor in energy
efficiency.
3. CNN Support:
- Eyeriss is capable of handling state-of-the-art CNNs with:
- Multiple layers.
- Millions of filter weights.
- Varying shapes (filter sizes, number of filters, and channels).
4. Energy Efficiency:
- By minimizing data movement and exploiting data statistics, Eyeriss achieves remarkable
energy efficiency.
- The design considers the entire system, including both on-chip and off-chip components.
5. Real-Time Performance:
- Eyeriss operates in real-time, delivering accurate results with minimal energy consump-
tion.
- It addresses the challenges posed by deep CNNs, which are computationally intensive.
In summary, Eyeriss represents a significant advancement in DNN accelerators, combining
architectural efficiency, energy savings, and support for complex CNNs. The paper provides
further technical details and insights into its implementation. If you’re interested, you can
explore the full paper [4].

3.2 Eyeriss-V2
Eyeriss v2 is a flexible accelerator designed for emerging deep neural networks (DNNs) on mobile devices. The paper titled "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices" by Chen, Yang, Emer, and Sze [5] introduces this innovative architecture.
Here are the key aspects of Eyeriss v2:
1. Objective:
- Eyeriss v2 targets resource-constrained platforms, such as mobile devices, where energy
efficiency and compactness are critical.
- It aims to handle compact and sparse DNNs, which differ significantly from traditional
large models in terms of layer shapes and sizes.
2. Architecture Highlights:
- Hierarchical Mesh Network:
- Eyeriss v2 introduces a highly flexible on-chip network called the hierarchical mesh.
- This mesh adapts to varying data reuse and bandwidth requirements for different data
types.
- It improves the utilization of computation resources by efficiently routing data.
- Sparse Data Processing:
- Eyeriss v2 processes sparse data directly in the compressed domain for both weights and activations (see the sketch after this list).
- This approach enhances both processing speed and energy efficiency when dealing with sparse models.
- Dataflow:
- Eyeriss v2 employs a new dataflow called Row-Stationary Plus (RS+).
- RS+ enables spatial tiling of data from all dimensions, fully utilizing parallelism for high performance.
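A generic illustration of compressed-domain processing, as referenced in the list above, is sketched below: both weights and activations are kept as (index, value) pairs, so multiplications involving zeros are never issued. This is only a conceptual sketch; Eyeriss v2's actual CSC-based storage format and PE-level implementation differ.

```python
def compress(vector):
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(vector) if v != 0]

def sparse_dot(weight_rows, activations):
    """Multiply a sparsified weight matrix with a sparsified activation vector.

    Both operands stay in compressed form, so zero entries trigger no MACs at all.
    """
    act = dict(compress(activations))          # index -> non-zero activation
    outputs = []
    for row in weight_rows:
        acc = 0
        for col, w in compress(row):           # iterate non-zero weights only
            a = act.get(col)
            if a is not None:                  # skip when the activation is zero too
                acc += w * a
        outputs.append(acc)
    return outputs

W = [[0, 2, 0, 0], [1, 0, 0, 3]]
x = [0, 5, 0, 7]
print(sparse_dot(W, x))    # [10, 21]
```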
3. Performance and Efficiency:
- With sparse MobileNet models, Eyeriss v2 achieves impressive results:
- In a 65nm CMOS process, it achieves a throughput of 1470.6 inferences/sec.
- The energy efficiency is 2560.3 inferences/J at a batch size of 1.
- Compared to the original Eyeriss running MobileNet, Eyeriss v2 is 12.6 times faster and
2.5 times more energy efficient.
4. Eyexam Analysis Methodology:
- The paper introduces Eyexam, a systematic approach to understanding performance limits
for DNN processors.
- Eyexam considers specific characteristics of the DNN model and accelerator design to
tighten performance bounds.
Eyeriss v2 addresses the challenges posed by compact and sparse DNNs, providing high
performance and energy efficiency for mobile devices. For further technical insights, you
can refer to the full paper [5].

References
[1] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao:
a small-footprint high-throughput accelerator for ubiquitous machine-learning,” in
Proceedings of the 19th International Conference on Architectural Support for
Programming Languages and Operating Systems, ser. ASPLOS ’14. New York, NY,
USA: Association for Computing Machinery, 2014, p. 269–284. [Online]. Available:
https://doi.org/10.1145/2541940.2541967

[2] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun,
and O. Temam, “Dadiannao: A machine-learning supercomputer,” in 2014 47th Annual
IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609–622.

[3] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and
O. Temam, “Shidiannao: shifting vision processing closer to the sensor,” in Proceedings
of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA ’15.
New York, NY, USA: Association for Computing Machinery, 2015, p. 92–104. [Online].
Available: https://doi.org/10.1145/2749469.2750389

[4] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfig-
urable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State
Circuits, vol. 52, no. 1, pp. 127–138, 2017.

[5] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for
emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and
Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.

[6] J. F. Kurose and K. W. Ross, Computer Networking: A Top-Down Approach (6th Edi-
tion), 6th ed. Pearson, 2012.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep


convolutional neural networks,” in Advances in Neural Information Processing Systems,
F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates,
Inc., 2012. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
