
Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network accelerator


W. Hokenmaier, R. Jurasek
Semiconductor Hardware Development
Green Mountain Semiconductor Inc.
Burlington, VT USA
[email protected], [email protected]

E. Bowen, R. Granger, D. Odom
Software, Algorithm, and Architecture Design
Non-Von LLC
Lyme, NH USA
[email protected], richard [email protected], [email protected]

Abstract—Why do security cameras, sensors, and Siri use cloud servers instead of on-board computation? The lack of very-low-power, high-performance chips greatly limits the ability to field untethered edge devices. We present the NV-1, a new low-power ASIC AI processor that greatly accelerates parallel processing (~10X) with a dramatic reduction in energy consumption (>100X), via many parallel combined processor-memory units, i.e., a drastically non-von-Neumann architecture, allowing very large numbers of independent processing streams without the bottlenecks of a typical monolithic memory. The current initial prototype fab arises from a successful co-development effort between AI algorithm- and software-driven architectural design and VLSI design realities. An innovative communication protocol minimizes power usage, and data transport costs among nodes were vastly reduced by eliminating the address bus in favor of local target address matching. Throughout the development process, the software/architecture team was able to innovate alongside the circuit design team's implementation effort. A digital twin of the proposed hardware was developed early on to ensure that the technical implementation met the architectural specifications, and the predicted performance metrics have now been thoroughly verified in real hardware test data. The resulting device is currently being used in a fielded edge sensor application; additional proofs of principle are in progress, demonstrating on the ground the capabilities of this new, real-world, extremely low-power, high-performance ASIC device.

Index Terms—co-design, low-power design, parallel processors, neural network accelerator, memory, instruction set design

I. INTRODUCTION

The low-power, high-performance AI processor described here, the NV-1, is a joint development effort between Non-Von LLC and Green Mountain Semiconductor. Non-Von's architecture designs originated from a novel machine "instruction set" of fundamental parallel operations and software that were initially derived (at Dartmouth College's Brain Engineering Laboratory) from the operation and arrangement of circuitry in the brain [1]. The collaboration between Non-Von and Green Mountain Semiconductor then arose to confront the challenge of translating this instruction set into a correspondingly efficient hardware implementation. Throughout the collaborative process, from the initial architecture designs all the way through tapeout, a digital-twin approach was used to enable closed-loop communication between the hardware capabilities reported by the engineering team and the algorithm developments from the software team. The process sometimes included modifying or even dropping certain instructions in order to keep the node size minimal. This effective simulation and communication approach facilitated the two teams' joint optimization of the final design for size, performance, and power, ensuring that proposed hardware implementations continued to meet the intended performance targets.

Notably, the approach encompasses network sizes far beyond a single die: the communication protocol expands seamlessly beyond individual die boundaries, allowing a multitude of identical chiplet processors to be connected to achieve a targeted network node count. Thus a given configuration may be as small as a single chip or chiplet for application domains such as internet-of-things (IoT) low-power devices, and can also scale directly up to huge arrays for uses such as server farms, while still operating at comparatively very low power budgets. (Although the approach is fully compatible with very small fab technology, the initial low-cost prototype presented here used 28nm TSMC manufacturing.) The interface integrates with FPGAs and SoCs for overall communication, so that one or multiple chiplets can act either alone or as massively parallel AI coprocessors.

The first prototype (NV-1) includes 3200 cores per chip, with seamless I/O compatibility to increase array size by chaining chips. During testing, this chip achieved 447 GB/s per 0.25 W, demonstrating both high performance and a radical power-use improvement over comparable hardware devices (see further discussion in Results). The chip has also been fielded in real-world settings, performing real-time processing of a chemical sensor with a power budget of < 10 mW, providing a direct initial demonstration that the chip is operational and applications-ready.

II. BACKGROUND

Software developers have forever been at the mercy of the hardware that is available to them [2]. The limitations of given hardware designs superimpose substantial constraints on algorithm and software design. In particular, algorithms that are intrinsically parallel will be enormously slowed down by typical hardware. The typical approach has been to use GPUs
and related hardware, but GPUs were of course designed for specialized image-processing operations, rather than broader parallel algorithms, and most systems typically must be written (or re-written) for GPU compatibility.

The systems designed at Dartmouth and Non-Von [1] derive from the operation of extremely large numbers of simple parallel elements in complex arrangements (neurons in brain circuitry); these are intrinsically massively parallel, rather than parallelized versions of inherently serial methods. Such intrinsically parallel algorithms are greatly sped up by appropriate parallel hardware, but it is rare and unusual for such hardware to be constructed for these parallel algorithms. Instead, the software must typically be adapted and compromised to fit available hardware, rather than new hardware architectures being developed to accommodate the parallel designs. The repurposing of GPUs to run accelerated neural networks illustrates this necessity for compromised software [3]; the approach is now so widespread that it is almost forgotten that GPUs are indeed far from an ideal hardware environment for parallel systems in general.

Again, GPUs were developed for particular image-processing tasks rather than for parallel algorithms in general. GPUs have simply been used in this adapted form solely because they existed, and because they were far closer to parallel software needs than standard CPU designs. But to take seriously the needs of massively parallel software, and to design hardware specifically for those needs, has been almost entirely absent from the field. Moreover, the need for very low power use, such as is required for fielded internet-of-things (IoT) devices, sensors, medical devices, and much more, operating in environments where large batteries or power sources are extremely limited, has been a longstanding unmet need. Rather than the continued repurposing of hardware developed for other tasks, such as GPUs, the hardware presented here was specifically developed for low-power, high-performance, massively parallel systems.

Current hardware for neural network solutions still relies on GPUs for training [4]. Hardware engineers have opened up low-level access for software engineers to explore more efficient algorithms, optimizing data movement and improving efficiency. Because of the success of GPUs in neural network acceleration, engineers have developed many different hardware solutions, ranging from training chips and inference chips to low-power edge devices and high-performance cloud architectures. For example, Google has released the TPU (Tensor Processing Unit) for its data servers.

Fig. 1. Traditional instruction and reduced instructions for the same task.

However, the current solutions for generative AI are not scalable, with current models requiring racks containing hundreds of chips to run [4]. Moreover, the power needs (and cooling needs) of current hardware typically entail very specific siting for server farms, often at sites of hydroelectric dams and other such resources [5]. Convolutional neural networks, other deep neural networks, and transformers can all be made somewhat more efficient, but many hardware solutions require a large amount of batching to achieve that efficiency.

Fig. 2. Multiple cores executing instructions.

From a software engineer's perspective, this is clearly a limitation, and it drives solutions toward an outcome that may not be needed for the original problem. Current state-of-the-art instruction sets also impose limitations. Creating hardware that focuses specifically on the instructions that are actually necessary instead allows vastly more efficient designs to be realized. Most current cores can run in a flexible manner, supporting more than a given program may need [17]; although this gives a flexible processing architecture, that flexibility comes at the cost of impaired performance. In the NV-1 we instead focus on a specific instruction set that is hugely accelerated, while other portions of software can be picked up by a coprocessor.

III. DESIGN

Co-design of the software and hardware systems was crucial for rendering Non-Von's initial pioneering instruction set architecture for neural network acceleration into a complete working solution. To facilitate this parallel development of software and hardware, a digital twin was created in the form of a C++ software-executable hardware model. This was done from the beginning of the project, based initially on behavioral Verilog models, and was maintained throughout the project as high-level models were subsequently replaced with synthesized RTL code. The model allowed for the abstraction of hardware details and provided an equivalent behavioral representation of the hardware to be developed. With both parties agreeing on the functionality of the model, a clear goal for the programming and for the hardware design was defined. This methodology laid the groundwork for later verification of the resulting hardware. Post tapeout, the goals for the silicon were to obtain physical power numbers and to demonstrate functionality in hardware. The same waveforms used to simulate the chip could be reused as vectors in physical testing, further increasing confidence in the methodology.
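As a rough illustration of this digital-twin flow, a behavioral node model can serve as the golden reference against which RTL simulation output is checked. The sketch below is only a minimal C++ stand-in under assumed names (NodeModel, matches_rtl); it is not the actual GMS/Non-Von model, and a summation task is used purely as an example instruction.

    // Minimal sketch of the digital-twin idea: a C++ behavioral node model
    // produces the "golden" result that the RTL simulation must match.
    // All names here are illustrative, not the production codebase.
    #include <cassert>
    #include <cstdint>
    #include <vector>

    struct NodeModel {
        uint16_t instruction;              // fixed at boot: one task per core
        std::vector<uint16_t> inputs;      // messages received this epoch

        uint16_t step() {                  // one epoch of behavioral execution
            uint32_t acc = 0;
            for (uint16_t v : inputs) acc += v;   // example: a summation task
            inputs.clear();
            return static_cast<uint16_t>(acc);
        }
    };

    // The same stimulus drives both this model and the Verilog simulation;
    // a mismatch localizes the bug to either the specification or the RTL.
    bool matches_rtl(uint16_t model_out, uint16_t rtl_out) {
        return model_out == rtl_out;
    }

    int main() {
        NodeModel n{/*instruction=*/0, {1, 2, 3}};
        uint16_t golden = n.step();
        assert(matches_rtl(golden, 6));    // stand-in for the checker comparison
        return 0;
    }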
Fig. 3. Physical layout of a single core.

The architecture of a single NV-1 node is made up of four main sub-blocks (Fig. 4). The Message Handler is the interface of the node; it handles all node-to-node communication on the bus, along with control of the system, decoding the node's programming, and initiating system start upon node activation. The Memory Handler and the SRAM work hand in hand, holding the node's communication information. The brain of the system is the IPU, which implements the functionality of the node, performing all of the calculations on data handed to it from the Message Handler. This structure is repeated across the whole chip in an array, creating a distributed computational system that can process inferences with radically less power and fewer operations than typical von Neumann implementations.

Fig. 4. Node sub-blocks.

The initial minimal concept utilizes 64k cores. While any core can perform any of the defined instructions, in typical practice each core is initialized to perform just one task. By allowing only one task per core, run-time sending of instructions is not needed, and both the power and the time for sending instructions are eliminated. This differs both from a traditional CPU, where instructions are sent for each command during execution, and from a GPU, where a single instruction is sent to all cores and processed on every core with different data (SIMD). In the NV-1 chip presented here, data can be sent from each core to every other core. Each core maintains a boot-loaded address table defining its connections to other cores. This in particular was a concept easily realized in software, but not a straightforward task in hardware: physical wiring limitations and timing considerations make bidirectional communication among 64k cores problematic. Each core has a fixed memory depth for core connections; 256 individual 16-bit entries allow a node to receive the outputs of up to 256 other nodes. An epoch is defined as the action of every core processing the messages from every core in its received-address memory and passing the results on for the next epoch. With intelligent programming of each core, repetitive tasks can be executed with very high efficiency.
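The boot-loaded address table and epoch concept can be sketched as follows. This is a simplified illustration under assumed names and sizes (it is not the NV-1 register map), again using a summation-style task as the example instruction.

    // Illustrative sketch of one core's connectivity state: up to 256 16-bit
    // source addresses are boot-loaded, so no address bus is needed at run time.
    #include <array>
    #include <cstdint>
    #include <vector>

    struct Core {
        std::array<uint16_t, 256> sources{};   // boot-loaded address table
        uint16_t num_sources = 0;
        std::vector<uint16_t> inbox;           // data received this epoch

        // Local target matching: accept a message only if the sender's address
        // appears in this core's table.
        void receive(uint16_t sender, uint16_t data) {
            for (uint16_t i = 0; i < num_sources; ++i)
                if (sources[i] == sender) { inbox.push_back(data); return; }
        }

        // One epoch: process everything received, emit one result for the next epoch.
        uint16_t run_epoch() {
            uint32_t acc = 0;                  // example: a summation-style instruction
            for (uint16_t d : inbox) acc += d;
            inbox.clear();
            return static_cast<uint16_t>(acc);
        }
    };

    int main() {
        Core c;
        c.sources[0] = 7; c.num_sources = 1;   // boot-load: listen to node 7 only
        c.receive(7, 41);                      // accepted (address matches)
        c.receive(9, 99);                      // ignored  (not in the table)
        return c.run_epoch() == 41 ? 0 : 1;
    }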
For this prototype, a multi-project-wafer tapeout, the maximum chip size was intentionally limited. The jointly developed reduced instruction set made it possible to optimize the physical size of the core so as to maximize the number of compute cores per die. Furthermore, innovation was needed to achieve a fully configurable bidirectional communication solution for up to 64k cores. The predefined address table removes the power- and area-intensive address bus, so that only data is transmitted. This first prototype includes 3200 cores. Notably, the communication protocol extends seamlessly beyond die boundaries, enabling the creation of arbitrarily sized arrays.
Compute core utilization under memory bottleneck (memory bandwidth and TOPS figures taken from the sources cited in [6]-[16]):

    Non-Von NV-1, single-chip configuration            100%    (see derivations in this manuscript)
    Embedded CPU, ARM Cortex-A8                         50.8%   [6]
    NVIDIA Jetson TX2                                    0.73%  [7]
    NVIDIA Jetson Orin Nano 4GB                          0.06%  [8]
    Data center GPU, NVIDIA H100 SXM (tensor cores)      0.03%  [9]
    Google Coral Dev Board Micro                         0.03%  [10]
    Google TPUv4                                         0.07%  [11]
    Intel Habana Gaudi 2                                 0.63%  [12]
    Tenstorrent Grayskull                                0.01%  [13]
    Cerebras WSE-2                                      100%    [14]
    Rebellions Atom                                      0.03%  [15]
    Graphcore Colossus MK2                               0.03%  [16]

Fig. 5. Utilization percentages in the presence of memory bottlenecks.

TABLE I
CROSS-CHIP SLOPE-INTERCEPT CURRENT AVERAGES (mA) ACCORDING TO FREQUENCY (MHz)

    Condition        Fit (Y = current in mA, x = frequency in MHz)
    DIN at VSS       Y = 3.25x + 6.3
    DIN at DVDD      Y = 3.23x + 6.4
    DIN at ¼ Clk     Y = 5.10x + 6.4
    DIN at ½ Clk     Y = 6.95x + 6.4
Fig. 6. (a) Relative current (mA) per instruction for the NV-1 chip; (b) the NV-1 (28nm TSMC fab).

Each die acts as a fully modular chiplet entity which can interact with identical neighbors, or can connect to a host computer or a hub which may in turn interface with other NV chiplet networks. Up to 21 chiplets can be combined to create a network of 64k cores. The first demonstration uses printed-circuit-board interconnects. Next-generation designs target a significantly larger overall network size, in the millions of cores, and may leverage advanced high-density 2.5D and 3D heterogeneous packaging methods for lowest power and further increased performance.

The prototype chip NV-1, a proof-of-concept array of 3200 nodes, was successfully completed via the joint development efforts of GMS and Non-Von. The chip is functional and showcases the architecture's very low power consumption. Designed in 28nm technology, the total array has dimensions of 3mm by 4mm. Further iterations of this device, along with smaller technology nodes, will continue to shrink the footprint. Figure 3 shows the single-node architecture, with its processing portion on the left and the SRAM block that stores connectivity on the right. The digital twin served throughout the design stages as a blueprint, ensuring that at each stage the hardware interpretation of what the network should be achieving lined up with the software concepts. This relationship continues in the following section, moving past design and into verification, where the model is used to determine correct functionality in real time in silicon.

IV. RESULTS

Throughout the design process, the functionality of the chip was under the scrutiny of the Universal Verification Methodology (UVM). The expected data for this testbench was validated by both the GMS and Non-Von design teams. This proved to be a good vehicle for cross-understanding between the hardware and software sides of the GMS and Non-Von design teams. A shared C++ model was used to generate the expected data; this model was iteratively updated and checked by both teams to ensure that correct functionality was interpreted in the same way from the top-level abstraction down to the hardware. Once the correct functionality was agreed upon, the checker component of the UVM testbench could be utilized.

The testbench is able to run a full-chip simulation in Verilog, with nodes either randomized or pre-programmed, in order to test potential corner cases. The whole system is then run, testing both the proper setup procedure and end-functionality correctness. The chip is viewed as a black box at the top level to ensure proper data output, and the nodes are also checked at the greybox level to ensure proper node-to-node communication. Within the testbench, nodes have been verified for correct message receiving and computation. Each node message is then properly shifted through the chip output and deemed correct at the black-box level.

The verification effort found correct functionality for all of the instructions in the instruction set, along with correct communication between nodes and proper operation at the chip level.

Figure 6a shows relative current per instruction for the NV-1 chip design, measured at 6.25 MHz, providing the root values for calculating the speed and power tradeoffs shown in Figures 5 and 7. It is worth noting that these figures amount to a maximum memory bandwidth of 447 GB/s per 0.25 W of power for a single NV-1 chip (number of nodes * single read per clock * clock speed: 447 GB/s = 3200 nodes * 50 MHz * (16 + 8 bits) / 8 / 1024 / 1024 / 1024), and a corresponding 7.2 TB/s for an array of 16 chips. (Note that Fig. 6 shows values measured at 6.25 MHz, whereas Fig. 7 and the memory bandwidth figures are for 50 MHz.)
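The bandwidth figure quoted above follows directly from the formula in the text; the short program below simply restates that arithmetic and introduces no data beyond the numbers already given.

    #include <cstdio>

    int main() {
        // 447 GB/s = 3200 nodes * 50 MHz * (16 + 8 bits) / 8 / 1024^3 (from the text)
        const double nodes         = 3200;
        const double clock_hz      = 50e6;      // 50 MHz
        const double bits_per_read = 16 + 8;    // per-clock read width used in the formula
        double bytes_per_s = nodes * clock_hz * bits_per_read / 8.0;
        double gb_per_s    = bytes_per_s / (1024.0 * 1024.0 * 1024.0);
        printf("single NV-1 chip: %.0f GB/s\n", gb_per_s);         // ~447 GB/s
        printf("16-chip array:    %.0f GB/s\n", gb_per_s * 16.0);  // ~7.2 TB/s as reported
        return 0;
    }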
Columns, left to right: NV1 (1 chip); NV1 (16 chips); NV2 12nm (8x8mm chip); NV2 7nm (6x6mm chip); Embedded CPU, ARM Cortex-A8; NVIDIA Jetson TX2; NVIDIA Jetson Orin Nano 4GB; NVIDIA H100 SXM (tensor cores); Google Coral Dev Board Micro; Google TPUv4.

    Power (mW)
      Idle:            6.2  |  99    |  18      |  10      |  17     |  ~100  |  ?       |  ?          |  388   |  90,000
      Nominal:         36   |  576   |  336     |  58      |  ?      |  7100  |  ?       |  ?          |  1050  |  170,000
      Peak Workload:   243  |  3893  |  20,348  |  3091    |  1552   |  7500  |  10,000  |  700,000    |  3000  |  192,000

    Adjusted Power* (mW @ 7nm equivalent)
      Idle:            0.4  |  6.2   |  6       |  10      |  0.2    |  ?     |  ?       |  ?          |  ?     |  ?
      Nominal:         2.25 |  36    |  114     |  58      |  ?      |  1359  |  ?       |  ?          |  ?     |  ?
      Peak Workload:   15   |  243   |  6924    |  3091    |  18     |  1436  |  7656    |  2,143,750  |  ?     |  ?

    Peak Compute Throughput (TOPS)
      Unstructured Sparse Data @ 50%:  0.2  |  2.6  |  41      |  17,043 excluded; see Bool row  |  0.002  |  1.3  |  10  |  1979  |  4  |  275
      Bool Arithmetic:                 21   |  329  |  10,441  |  17,043  |  0.5  |  ?  |  ?  |  ?  |  ?  |  ?

    Best-case Efficiency (TOPS/W)
      Unstructured Sparse @ 50%:  0.66  |  0.66  |  7     |  21    |  0.001  |  0.2  |  1    |  3  |  2  |  1.4
      Bool Arithmetic:            85    |  85    |  1908  |  5495  |  0.3    |  ?    |  ?    |  ?  |  ?  |  ?

    Best-case Adjusted Efficiency** (TOPS/adjusted W)
      Unstructured Sparse @ 50%:  11    |  11    |  6     |  22    |  0.1    |  1    |  1.3  |  1  |  ?  |  ?
      Bool Arithmetic:            1352  |  1352  |  1508  |  5513  |  28     |  ?    |  ?    |  ?  |  ?  |  ?

    * Power numbers adjusted for differences in fab process: divide by (nm^2)/(7^2).
    ** TOPS per adjusted watt.

Fig. 7. Power, TOPS, and efficiency across multiple architectures.
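The process adjustment in the Fig. 7 footnote can be checked with a one-line calculation. The snippet below is only a restatement of that footnote's scaling, using the NV-1 peak-workload entry as the worked example.

    #include <cstdio>

    // Fig. 7 footnote: power is scaled to a 7 nm equivalent by dividing by (nm/7)^2.
    double adjust_to_7nm(double power_mw, double node_nm) {
        return power_mw / ((node_nm / 7.0) * (node_nm / 7.0));
    }

    int main() {
        // NV-1 peak workload: 243 mW at 28 nm -> ~15 mW 7 nm-equivalent, matching Fig. 7
        printf("%.1f mW\n", adjust_to_7nm(243.0, 28.0));
        return 0;
    }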

The NV architecture was designed from first principles to eliminate almost all memory bandwidth bottlenecks, which are a considerable throughput and efficiency limitation in CPUs and GPUs. Because memory is so typically off-chip, memory bandwidth is often thought of in terms of an I/O protocol (such as DDR3), rather than in terms of the effect it has on the time and efficiency costs of real applied usage. Imagine beginning with a current GPU and asking how its performance would be affected by changes to its memory. First, if memory could be placed on-chip, this would itself result in an enormous speedup of the GPU in real applications. Second, even with on-chip memory, much of the von Neumann bottleneck would still slow the system down if that memory had to be treated as a monolithic entity; if the newly on-chip memory could instead be distributed across processing units into memory blocks that are independent of each other, then further speedups could be achieved. These two steps (placement on chip, and independent distribution across processors) are at the heart of the new architecture, rendering it highly non-von-Neumann in design.

Note that these enormous speedups do not change TOPS measures at all. TOPS measures are computed independently of any memory usage costs. That is, enormous speedups due to the elimination of memory bottlenecks will not even show up as an improvement if all one looks at are TOPS measures. TOPS measures are thus highly misleading in such cases, since they cannot reflect speedups that arise from re-architecting memory.

We therefore provide a range of measures intended to enable approximately apples-to-apples comparisons, i.e., what theoretical and pragmatic gains would be achieved when switching from the characteristics of one type of chip to another, such as from CPUs to GPUs, or from CPUs or GPUs to non-von-Neumann architectures.

A contemporary GPU has a reported peak memory bandwidth of 3.35 TB/s [4]. Calculating the peak memory bandwidth of NV-1 entails summing the node-internal memory reads that can be performed during the course of computing a single operation: bandwidth (GB/s) = (max ops per second * max bits per op) / (8 * 1024 * 1024 * 1024). Here we simply report the percent utilization that is possible given the nature of a memory bottleneck on particular hardware: let f = min(compute, bandwidth / n_bytes_per_op) / compute, where compute is peak throughput in operations per second, bandwidth is peak memory bandwidth in bytes per second, and n_bytes_per_op = 3 * 16 / 8 = 6 assumes that an operation uses two 16-bit inputs as operands and one 16-bit instruction. Figure 5 shows this as compute core utilization in the presence of the memory bottleneck.
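The utilization metric of Fig. 5 can be restated compactly in code. The helper below is an illustration of the formula only; the example inputs are approximate published peak-TOPS and bandwidth figures, so the printed values differ slightly from the table, which uses the cited sources directly.

    #include <algorithm>
    #include <cstdio>

    // f = min(compute, bandwidth / bytes_per_op) / compute, with
    // bytes_per_op = 3 * 16 / 8 = 6 (two 16-bit operands + one 16-bit instruction).
    double utilization(double peak_tops, double bandwidth_gb_s) {
        const double bytes_per_op = 6.0;
        double compute_ops = peak_tops * 1e12;                      // ops the cores could issue
        double memory_ops  = bandwidth_gb_s * 1e9 / bytes_per_op;   // ops the memory can feed
        return std::min(compute_ops, memory_ops) / compute_ops;
    }

    int main() {
        // Approximate example inputs (peak TOPS, memory bandwidth in GB/s):
        printf("embedded CPU example: %.1f%%\n", 100 * utilization(0.002, 6.4));   // Fig. 5 lists 50.8%
        printf("edge GPU example:     %.2f%%\n", 100 * utilization(1.33, 59.7));   // Fig. 5 lists 0.73%
        return 0;
    }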
We emphasize that these numbers are intended to illustrate the struggle presented by monolithic external memories. In practice, caches are used to mitigate this, and those caches are not represented in the numbers in Fig. 5. Standard approaches become very limited, as seen in the figure, because although they readily add more compute power (TOPS), they cannot add memory bandwidth anywhere near as easily. (This is reflected in how well the ARM Cortex does in this figure: it is a single core, so there is not much compute to consume memory cycles.) In sum, this is not to say that memory bandwidth considerations are the sole factor in performance, but we wish to emphasize that the factor is in fact important and that it is routinely overlooked by measures such as TOPS.

It should also be noted that the NV-1 is merely the first fabbed issuance of the Non-Von chip line; substantial further increases are already estimated for the upcoming NV-2 chip, using the same estimation methods that previously led to very accurate predictions of NV-1 performance. It is highly notable that the NV-1 does not use caches at all, nor a global memory space. Designers of GPUs use caches extensively to minimize the burden of their memory bottlenecks; these come at a cost of power, space (e.g., for cache-coherence logic), and unpredictable timing. Figure 7 contains partial information, extracted from a range of sources, to roughly compare power, TOPS, and the resulting efficiency ratio across multiple different hardware architectures.
V. CONCLUSIONS

The NV-1 test chip has been successfully manufactured (28nm TSMC technology), received in packaged dies, and functionally characterized and verified. System-level integration has been carried out to incorporate the chips into an existing sensor apparatus that has been tested in fielded conditions. The measured results from this new chip, shown in Figures 5, 6, and 7, demonstrate that it exhibits very high memory bandwidth performance at radically low power usage, outperforming standard competing chips by orders of magnitude.

The aim was to produce a new generation of AI hardware, rather than continuing to adapt systems such as GPUs that were intrinsically designed for quite different purposes. The new NV platform is specifically designed to accelerate massively parallel software, thus providing a natural processor and coprocessor setting for the innovative development of radically parallel systems. Moreover, these new platforms execute at extremely low power, that is, at just tiny fractions of the power budgets of typical extant devices. Working demonstrations have been implemented that run the Whisper transformer-based real-time speech-to-text system at very low power, and that run a fielded real-time chemical sensor, also at very low power (< 10 mW).

This project successfully demonstrates how software and hardware engineers can work together to co-design and optimize overall outcomes in terms of die size, performance, and power consumption. Rather than being forced to compromise by using hardware designs that happen to exist for other purposes, the possibility now arises to take innovative algorithms and software and to produce hardware ASIC designs that are well fitted to executing such software with both high performance and very low power.

With the ever-increasing demands on AI hardware capabilities, especially in fielded low-power settings, this type of co-development effort, aided by a digital twin enabling a continuous interdisciplinary verification and communication loop, may guide future projects to optimize TOPS/W not only as a pure hardware engineering task but as a joint endeavor. Design efforts are under way toward the next version, NV-2, which will further improve power usage and minimize the physical size of each core through resource sharing.

Current edge-focused processors are highly challenged by the restrictive low power budgets and high performance requirements encountered at the edge in practice, and they still typically resort to cloud computation that is costly (both in dollars and in power usage). We show here that even this initial prototype NV-1 device already drastically outperforms current technology in parallel computation tasks, both in performance and in power consumption. The ongoing approach addresses a very clear need seen across industries attempting to deploy AI and ML in real fielded applications.

REFERENCES

[1] Granger R. "Engines of the brain: The computational instruction set of human cognition." AI Magazine 27: 15-32 (2006).
    Moorkanikara J, Felch A, Chandrashekar A, Dutt N, Granger R, Nicolau A, Veidenbaum A. "Brain-derived vision algorithm on high-performance architectures." Int'l Journal of Parallel Prog. 27: 345-269 (2009).
    Chandrashekar A, Granger R. "Derivation of a novel efficient supervised learning algorithm from cortical-subcortical loops." Front. Comput. Neurosci. 5: 50, doi: 10.3389/fncom.2011.00050 (2012).
    Bowen E, Granger R, Rodriguez A. "A logical re-conception of neural networks: Hamiltonian bitwise part-whole architecture." Amer. Assoc. of Artif. Intell. (AAAI), https://openreview.net/forum?id=hP4dxXvvNc8 (2023).
[2] Barrett R, Borkar S, Dosanjh S, Hammond S, Heroux M, Hu X, Luitjens J, Parker S, Shalf J, Tang L. "On the role of co-design in high performance computing." Advances in Parallel Computing 24: 141-155 (2013).
[3] Alcorn P. "Intel 13th-Gen Raptor Lake specs, release date, benchmarks, and more." (20 Oct 2022) tinyurl.com/raptorlakespec
[4] https://www.nvidia.com/en-us/data-center/h100/; https://www.researchgate.net/publication/224262634 (GPUs and the Future of Parallel Computing)
[5] https://www.theatlantic.com/technology/archive/2024/03/ai-water-climate-microsoft/677602/; https://arxiv.org/abs/2312.12705
[6] ARM Cortex-A8: memory bandwidth from DDR3 specs; TOPS from 2 DMIPS/MHz scaled to 1 GHz; https://www.ti.com/lit/ds/symlink/am3358.pdf
[7] NVIDIA Jetson TX2: memory from https://developer.nvidia.com/embedded/jetson-tx2; TOPS from https://developer.nvidia.com/embedded/jetson-modules
[8] NVIDIA Jetson Orin Nano: memory from https://developer.nvidia.com/embedded/jetson-modules; TOPS from https://tinyurl.com/NvidiaJetsonTops
[9] NVIDIA H100 SXM: memory and TOPS from https://www.nvidia.com/en-us/data-center/h100/
[10] Google Coral Dev Board Micro: memory: LPDDR4 (4-channel, 32-bit bus width), https://tinyurl.com/CoralMem; int8 TOPS from https://coral.ai/products/accelerator-module/
[11] https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
[12] https://developer.habana.ai/resources/habana-models-performance/
[13] https://tenstorrent.com/cards/
[14] https://tinyurl.com/cerebrasWSE-2
[15] https://tinyurl.com/rebellionsATOM
[16] Graphcore Colossus MK2: memory and TOPS from https://www.graphcore.ai/products/ipu
[17] https://oscarlab.github.io/papers/instrpop-systor19.pdf

[*] Partial funding for the work reported herein was provided by the Office of Naval Research.
