TensorFlow Lite Micro: Embedded Machine Learning for TinyML Systems

ABSTRACT
Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny
embedded devices are omnipresent. But we must overcome major challenges before we can benefit from this
opportunity. Embedded processors are severely resource constrained. Their nearest mobile counterparts exhibit at
least a 100x to 1,000x difference in compute capability, memory availability, and power consumption. As a result,
the machine-learning (ML) models and associated ML inference framework must not only execute efficiently but
also operate in a few kilobytes of memory. Also, the embedded devices’ ecosystem is heavily fragmented. To
maximize efficiency, system vendors often omit many features that commonly appear in mainstream systems and that enable cross-platform interoperability, such as dynamic memory allocation and virtual memory. The
hardware comes in many flavors (e.g., instruction-set architecture and FPU support, or lack thereof). We introduce
TensorFlow Lite Micro (TF Micro), an open-source ML inference framework for running deep-learning models
on embedded systems. TF Micro tackles the efficiency requirements imposed by embedded-system resource
constraints and the fragmentation challenges that make cross-platform interoperability nearly impossible. The
framework adopts a unique interpreter-based approach that provides flexibility while overcoming these challenges.
This paper explains the design decisions behind TF Micro and describes its implementation details. Also, we
present an evaluation that demonstrates its low resource requirements and minimal run-time performance overhead.
1 INTRODUCTION

Tiny machine learning (TinyML) is a burgeoning field at the intersection of embedded systems and machine learning. The world has over 250 billion microcontrollers (IC Insights, 2020), with strong growth projected over the coming years. As such, a new range of embedded applications is emerging for neural networks. Because these models are extremely small (a few hundred kilobytes) and run on microcontrollers or DSP-based embedded subsystems, they can operate continuously with minimal impact on device battery life.

The most well-known and widely deployed example of this new TinyML technology is keyword spotting, also called hotword or wakeword detection (Chen et al., 2014; Gruenstein et al., 2017; Zhang et al., 2017). Amazon, Apple, Google, and others use tiny neural networks on billions of devices to run always-on inferences for keyword detection, and this is far from the only TinyML application. Low-latency analysis and modeling of sensor signals from microphones, low-power image sensors, accelerometers, gyros, PPG optical sensors, and other devices enable consumer and industrial applications, including predictive maintenance (Goebel et al., 2020; Susto et al., 2014), acoustic-anomaly detection (Koizumi et al., 2019), visual object detection (Chowdhery et al., 2019), and human-activity recognition (Chavarriaga et al., 2013; Zhang & Sawchuk, 2012).

Unlocking machine learning's potential in embedded devices requires overcoming two crucial challenges. First and foremost, embedded systems have no unified TinyML framework. When engineers have deployed neural networks to such systems, they have built one-off frameworks that require manual optimization for each hardware platform. These frameworks have tended to be narrowly focused, lacking features to support multiple applications and lacking portability across a wide range of hardware. The developer experience has therefore been painful, requiring hand optimization of models to run on a specific device, and altering a model to run on another device necessitated manual porting and repeated optimization effort. A second-order effect of this situation is that the slow pace and high cost of training and deploying models to embedded hardware prevent developers from easily justifying the investment required to build new features.

1 Google, 2 Harvard University. Correspondence to: Pete Warden <[email protected]>, Vijay Janapa Reddi (contributions made as a Google Visiting Researcher) <[email protected]>.
Another challenge limiting TinyML is that hardware vendors have related but separate needs. Without a generic TinyML framework, evaluating hardware performance in a neutral, vendor-agnostic manner has been difficult. Frameworks are tied to specific devices, and it is hard to determine the source of improvements because they can come from hardware, software, or the complete vertically integrated solution.

The lack of a proper framework has been a barrier to accelerating TinyML adoption and application in products. Beyond deploying a model to an embedded target, the framework must also have a means of training a model on a higher-compute platform. TinyML must exploit a broad ecosystem of ML tools, as well as tools for orchestrating and debugging models, which are beneficial for production devices.

Prior efforts have attempted to bridge this gap, and we discuss some of them later (Section 6). Briefly, we can distill the general issues facing many of these frameworks into the following:

• Inability to easily and portably deploy models across multiple embedded hardware architectures

• Lack of optimizations that take advantage of the underlying hardware without requiring framework developers to make platform-specific efforts

• Lack of productivity tools that connect training pipelines to deployment platforms and tools

• Incomplete infrastructure for compression, quantization, model invocation, and execution

• Minimal support features for performance profiling, debugging, orchestration, and so on

• No standard benchmarks that allow hardware vendors to quantify their chip's performance in a fair and reproducible manner

• Lack of testing in real-world applications

To address these issues, we introduce TensorFlow Lite Micro (TF Micro), which mitigates the slow pace and high cost of training and deploying models to embedded hardware by emphasizing portability and flexibility. TF Micro makes it easy to get TinyML applications running across architectures, and it allows hardware vendors to incrementally optimize kernels for their devices. TF Micro gives vendors a neutral platform to prove their hardware's performance in TinyML applications. It offers these benefits:

• Our interpreter-based approach is portable, flexible, and easily adapted to new applications and features

• We minimize the use of external dependencies and library requirements to be hardware agnostic

• We enable hardware vendors to provide platform-specific optimizations on a per-kernel basis without writing target-specific compilers

• We allow hardware vendors to easily integrate their kernel optimizations to ensure performance in production and comparative hardware benchmarking

• Our model-architecture framework is open to a wide machine-learning ecosystem and to the TensorFlow Lite model conversion and optimization infrastructure

• We provide benchmarks that are being adopted by industry-leading benchmark bodies such as MLPerf

• Our framework supports popular, well-maintained Google applications that are in production

This paper makes several contributions. First, we clearly lay out the challenges to developing a machine-learning framework for embedded devices that supports the fragmented embedded ecosystem. Second, we provide design and implementation details for a system specifically created to cope with these challenges. And third, we demonstrate that an interpreter-based approach, which is traditionally viewed as a low-performance alternative to compilation, is in fact highly suitable for the embedded domain, specifically for machine learning. Because machine-learning performance is largely dictated by linear-algebra computations, the interpreter design imposes minimal run-time overhead.

2 TECHNICAL CHALLENGES

Many issues make developing a machine-learning framework for embedded systems particularly difficult. In this section, we summarize the main ones.

2.1 Missing Features

Embedded platforms are defined by their tight limitations. Therefore, many advances from the past few decades that have made software development faster and easier are unavailable to these platforms because the resource tradeoffs are too expensive. Examples include dynamic memory management, virtual memory, an operating system, a standard instruction set, a file system, floating-point hardware, and other tools that seem fundamental to modern programmers (Kumar et al., 2017). Though some platforms provide a subset of these features, a framework targeting widespread adoption in this market must avoid relying on them.

2.2 Fragmented Market and Ecosystem
Many embedded-system uses require only fixed software developed alongside the hardware, usually by an affiliated team. The lack of applications capable of running on the platform is therefore much less important than it is for general-purpose computing. Moreover, backward instruction-set-architecture (ISA) compatibility with older software matters less than in mainstream systems because everything that runs on an embedded system is probably compiled from source code anyway. Thus, the hardware can aggressively diversify to meet power requirements, whereas even the latest x86 processor can still run instructions that are nearly three decades old (Intel, 2013).

These differences mean the pressure to converge on one or two dominant platforms or ISAs is much weaker in the embedded space, leading to fragmentation. Many ISAs have thriving ecosystems, and the benefits they bring to particular applications outweigh developers' cost of switching. Companies even allow developers to add their own ISA extensions (ARM, 2019; Waterman & Asanovic, 2019).

Matching the wide variety of embedded architectures are the numerous tool chains and integrated development environments (IDEs) that support them. Many of these systems are only available through a commercial license with the hardware manufacturer, and in cases where a customer has requested specialized instructions, they may be inaccessible to everyone else. These arrangements have no open-source ecosystem, leading to device fragmentation that prevents a lone development team from producing software that runs well on many different embedded platforms.

2.3 Resource Constraints

People build embedded devices because a more general-purpose computing platform exceeds their design limits. The biggest drivers are cost, with a microcontroller typically selling for less than a few dollars (IC Insights, 2020); power consumption, as embedded devices may require just a few milliwatts, whereas mobile and desktop CPUs require watts; and form factor, since capable microcontrollers are smaller than a grain of rice (Wu et al., 2018).

To meet these needs, hardware designers trade off capabilities. A common characteristic of an embedded system is its low memory capacity. At one end of the spectrum, a big embedded system has a few megabytes of flash ROM and at most a megabyte of SRAM. At the other end, a small embedded system has just a few hundred kilobytes or fewer, often split between ROM and RAM (Zhang et al., 2017).

These constraints mean both working memory and permanent storage are much smaller than most software written for general-purpose platforms would assume. In particular, the size of the compiled code in storage must be minimized.

Most software written for general-purpose platforms contains code that often goes uncalled on a given device. The reason is that choosing the code path at run time is a better use of engineering resources than shipping more highly customized executables. This run-time flexibility is hard to justify when code size is a concern and the potential uses are fewer. As a result, developers often must break through a library's abstraction if they want to make modifications to suit their target hardware.

2.4 Ongoing Changes to Deep Learning

Machine learning remains in its infancy despite its breakneck pace. Researchers are still experimenting with new operations and network architectures to glean better predictions from their models. Their success in improving results leads product designers to demand these enhanced models.

Because new mathematical operations, or other fundamental changes to neural-network calculations, often drive the model advances, adopting these models in software means porting the changes, too. Since research directions are hard to predict and advances are frequent, keeping a framework up to date and able to run the newest, best models requires a lot of work. For instance, while TensorFlow has more than 1,400 operations (TensorFlow, 2020e), TensorFlow Lite, which is deployed on more than four billion edge devices worldwide, supports only about 130 operations. Not all operations are worth supporting, however.

3 DESIGN PRINCIPLES

To address the challenges facing TinyML on embedded systems (Section 2), we developed a set of design principles that guide the TF Micro framework.

3.1 Minimize Feature Scope for Portability

We believe an embedded machine-learning (ML) framework should assume the model, input data, and output arrays are in memory, and it should handle only the ML calculations based on those values. The design should exclude any other function, no matter how useful. In practice, this approach means the library should omit features such as loading models from a file system or accessing peripherals for inputs.

This strong design principle is crucial because many embedded platforms are missing basic features, such as memory management and library support (Section 2.1), that mainstream computing platforms take for granted. Supporting the myriad possibilities would make porting the ML framework across devices unwieldy.

Fortunately, ML models are functional: they have clear inputs, outputs, and possibly some internal state, but no external side effects. Running a model need not involve calls to peripherals or other operating-system functions. To remain efficient, we focus only on implementing those calculations.
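As a sketch of what this minimal scope looks like in practice, the example below assumes the model FlatBuffer and a tensor arena already sit in memory and asks the framework only to run the calculation; no file system or peripheral access is involved. The identifiers g_model_data, kArenaSize, and RunInference are illustrative, and the exact headers, resolver classes, and constructor signatures have varied across TF Micro releases, so treat this as an approximation rather than the framework's definitive API.

#include <cstddef>
#include <cstdint>
#include <cstring>

#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// The model is compiled into the binary as a byte array; nothing is loaded
// from a file system and no peripherals are touched.
extern const unsigned char g_model_data[];

constexpr size_t kArenaSize = 16 * 1024;
static uint8_t tensor_arena[kArenaSize];

void RunInference(const int8_t* input_bytes, size_t input_len) {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the operators this model uses to keep code size small.
  static tflite::MicroMutableOpResolver<3> resolver;
  resolver.AddConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  static tflite::MicroErrorReporter error_reporter;
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize, &error_reporter);

  // Plan all tensor memory inside the caller-provided arena; no heap use.
  if (interpreter.AllocateTensors() != kTfLiteOk) return;

  // Input and output arrays live in memory; the application fills and reads them.
  std::memcpy(interpreter.input(0)->data.int8, input_bytes, input_len);
  if (interpreter.Invoke() != kTfLiteOk) return;
  const int8_t* results = interpreter.output(0)->data.int8;
  (void)results;  // The quantized output is now ready for the application.
}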
3.2 Enable Vendor Contributions to Span Ecosystem

All embedded devices can benefit from high-performance

[Figure: TensorFlow training environment -> TensorFlow Lite exporter -> TensorFlow Lite FlatBuffer file (training graph, inference graph, ordered op list, weights).]

mechanisms) is possible during model invocation.

This simplistic approach works well for initial prototyping, but it wastes memory because many allocations could overlap with others in time. One example is data structures that are only necessary during initialization. Their values are irrelevant after initialization, but because their lifetime is the same as the interpreter's, they continue to take up arena space. A model's evaluation phase also requires variables that need not persist from one invocation to another.
Hence, we modified the allocation scheme so that initialization- and evaluation-lifetime allocations reside in a separate stack from interpreter-lifetime objects. This design uses a stack that increments from the arena's lowest address for the function-lifetime objects ("Head" in Figure 3) and a stack that decrements from the arena's highest address for interpreter-lifetime allocations ("Tail" in Figure 3). When the two stack pointers cross, they indicate a lack of capacity.

The two-stack allocation strategy works well for both shared buffers and persistent buffers, but model preparation also holds allocation data that model inference no longer needs. Therefore, we use the space between the two stacks for temporary allocations while a model is in the memory-planning phase; any temporary data required during model inference resides in the persistent-stack allocation section.

Overall, our approach reduces the arena's required memory because the initialization allocations can be discarded once initialization is done, and that memory becomes reusable for evaluation variables. This approach also enables advanced applications to reuse the arena's function-lifetime section in between evaluation calls.
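To make the two-stack arena concrete, the sketch below shows the core of such an allocator: temporary buffers grow upward from the head, interpreter-lifetime buffers grow downward from the tail, and the arena is exhausted when the two pointers would cross. The class and method names are illustrative rather than TF Micro's internal allocator API, alignment is assumed to be a power of two, and error handling is reduced to returning null.

#include <cstddef>
#include <cstdint>

// Illustrative two-stack arena allocator. Initialization- and
// evaluation-lifetime buffers grow up from the head; interpreter-lifetime
// buffers grow down from the tail. When the two pointers meet, the arena
// is out of capacity.
class TwoStackArena {
 public:
  TwoStackArena(uint8_t* buffer, size_t size)
      : begin_(buffer), head_(buffer), tail_(buffer + size) {}

  // Persistent (interpreter-lifetime) allocation taken from the tail.
  uint8_t* AllocateFromTail(size_t bytes, size_t alignment) {
    if (bytes > Available()) return nullptr;
    uint8_t* aligned = AlignDown(tail_ - bytes, alignment);
    if (aligned < head_) return nullptr;  // Head and tail would cross.
    tail_ = aligned;
    return aligned;
  }

  // Temporary (initialization- or evaluation-lifetime) allocation from the head.
  uint8_t* AllocateFromHead(size_t bytes, size_t alignment) {
    uint8_t* aligned = AlignUp(head_, alignment);
    if (aligned + bytes > tail_) return nullptr;  // Out of arena space.
    head_ = aligned + bytes;
    return aligned;
  }

  // Discard every head allocation, e.g. once initialization finishes, so the
  // same bytes can be reused for evaluation-lifetime variables.
  void ResetHead() { head_ = begin_; }

 private:
  size_t Available() const { return static_cast<size_t>(tail_ - head_); }
  static uint8_t* AlignUp(uint8_t* p, size_t a) {
    uintptr_t v = reinterpret_cast<uintptr_t>(p);
    return reinterpret_cast<uint8_t*>((v + a - 1) & ~(a - 1));
  }
  static uint8_t* AlignDown(uint8_t* p, size_t a) {
    uintptr_t v = reinterpret_cast<uintptr_t>(p);
    return reinterpret_cast<uint8_t*>(v & ~(a - 1));
  }

  uint8_t* begin_;
  uint8_t* head_;
  uint8_t* tail_;
};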
model generality. TF Micro models simply list the operator
4.4.2 Memory Planner and tensor requirements. At run time, we allocate and enable
this capability for many model types.
A more complex optimization opportunity involves the
space required for intermediate calculations during model Offline-planned tensor allocation is an alternative memory-
evaluation. An operator may write to one or more output planning feature of TF Micro. It allows a more compact
buffers, and later operators may later read them as inputs. memory plan, gives memory-plan ownership and control
If the output is not exposed to the application as a model to the end user, imposes less overhead on the MCU during
output, its contents need only remain until the last operation initialization, and enables more-efficient power options by
that needs them has finished. Its presence is also unneces- allowing different memory banks to store certain memory
sary until just before the operation that populates it executes. areas. We allow the user to create a memory layout on a
Memory reuse is possible by overlapping allocations that host before run time. The memory layout is stored as model
are unneeded during the same evaluation sections. FlatBuffer metadata and contains an array of fixed-memory
arena offsets for an arbitrary number of variable tensors.
The memory allocations required over time can be visual-
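As an illustration of the first-fit decreasing strategy described above, the following sketch sorts intermediate-buffer requirements by size and places each one at the lowest arena offset where it does not overlap, in both time and address, with any buffer already placed. It is a simplified stand-in for the framework's actual planner code, and it uses std::vector for brevity; an embedded implementation would work in place over preallocated arrays.

#include <algorithm>
#include <cstddef>
#include <vector>

// One intermediate buffer: how many bytes it needs and the operator index
// range during which it must stay alive.
struct Allocation {
  size_t size;
  int first_use;   // Index of the op that produces the buffer.
  int last_use;    // Index of the last op that reads it.
  size_t offset;   // Output: byte offset assigned inside the arena.
};

// First-fit decreasing: sort by size (largest first), then place each buffer
// at the lowest offset where it does not conflict with any placed buffer
// whose lifetime overlaps. Returns the total arena bytes required.
size_t PlanArena(std::vector<Allocation>& allocs) {
  std::vector<Allocation*> order;
  order.reserve(allocs.size());
  for (Allocation& a : allocs) order.push_back(&a);
  std::sort(order.begin(), order.end(),
            [](const Allocation* a, const Allocation* b) {
              return a->size > b->size;
            });

  std::vector<const Allocation*> placed;
  size_t arena_size = 0;
  for (Allocation* a : order) {
    size_t offset = 0;
    bool moved = true;
    while (moved) {  // Bump past conflicts until a free gap is found.
      moved = false;
      for (const Allocation* p : placed) {
        const bool lifetimes_overlap =
            a->first_use <= p->last_use && p->first_use <= a->last_use;
        const bool addresses_overlap =
            offset < p->offset + p->size && p->offset < offset + a->size;
        if (lifetimes_overlap && addresses_overlap) {
          offset = p->offset + p->size;
          moved = true;
        }
      }
    }
    a->offset = offset;
    placed.push_back(a);
    arena_size = std::max(arena_size, offset + a->size);
  }
  return arena_size;
}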
Figure 5. Memory-allocation strategy for a single model versus a multi-tenancy scenario. In TF Micro, there is a one-to-one binding
between a model, an interpreter and the memory allocations made for the model (which may come from a shared memory arena).
4.9 Build System

To address the embedded market's fragmentation (Section 2.2), we needed our code to compile on many platforms. We therefore wrote the code to be highly portable, with few dependencies, but that alone was insufficient to give potential users a good experience on a particular device.

Most embedded developers employ a platform-specific IDE or tool chain that abstracts many details of building subcomponents and presents libraries as interface modules. Simply giving developers a folder hierarchy containing source-code files would still leave them with multiple steps before they could build and compile that code into a usable library.

Therefore, we chose a single makefile-based build system to determine which files the library required, and then generated the project files for the associated tool chains. The makefile held the source-file list, and we stored the platform-specific project files as templates that the project-generation process filled in with the source-file information. That process may also perform other postprocessing to convert the source files to a format suitable for the target tool chain.

Our platform-agnostic approach has enabled us to support a variety of tool chains with minimal engineering work, but it does have some drawbacks. We implemented the project generation through an ad hoc mixture of makefile scripts and Python, which makes the process difficult to debug, maintain, and extend. Our intent is for future versions to keep the concept of a master source-file list that only the makefile holds, but to delegate the actual generation to better-structured Python in a more maintainable way.

5 SYSTEM EVALUATION

TF Micro has been tested and deployed extensively on many processors based on the Arm Cortex-M architecture (Arm, 2020). It has been ported to other architectures, including the ESP32 (Espressif, 2020) and many digital signal processors (DSPs). The framework is also available as an Arduino library, and it can generate projects for environments such as Mbed (ARM, 2020) as well. In this section, we use two representative platforms to assess and quantify TF Micro's computational and memory overheads.

5.1 Experimental Setup

We selected two platforms on which to evaluate TF Micro (Table 1). First is the Sparkfun Edge, which has an Ambiq Apollo3 MCU. The Apollo3 is powered by an Arm Cortex-M4 core and operates in burst mode at 96 MHz (Ambiq Micro, 2020). The second platform is an Xtensa HiFi Mini DSP, which is based on the Cadence Tensilica architecture (Cadence, 2020).

Platform                       | Processor                | Clock  | Flash | RAM
Sparkfun Edge (Ambiq Apollo3)  | Arm Cortex-M4 (CPU)      | 96 MHz | 1 MB  | 0.38 MB
Xtensa HiFi Mini               | Tensilica HiFi (DSP)     | 10 MHz | 1 MB  | 1 MB

Table 1. Embedded-platform benchmarking.

Our benchmarks are INT8 TensorFlow Lite models in a serialized FlatBuffer format. We use the Visual Wake Words (VWW) person-detection model (Chowdhery et al., 2019), which represents a common microcontroller vision task of identifying whether a person appears in a given image. The model is trained and evaluated on images from the Microsoft COCO data set (Lin et al., 2014). It primarily stresses and measures the performance of convolutional operations.

Also, we use the Google Hotword model, which aids in detecting the key phrase "OK Google." This model is designed to be small and fast enough to run constantly on a low-power DSP in smartphones and other devices with Google Assistant. Because it is proprietary, we use a version with scrambled weights and biases.

The benchmarks run multiple inputs through a single model, measuring the time to process each input and produce an inference output. The benchmark does not measure the time necessary to bring up the model and configure the run time, since the recurring inference cost dominates total CPU cycles on most long-running systems.

5.2 Benchmark Performance

We provide two sets of benchmark results. First are the baseline results from running the benchmarks on reference kernels, which are simple operator-kernel implementations designed for readability rather than performance. Second are results for optimized kernels compared with the reference kernels. The optimized versions employ the high-performance Arm CMSIS-NN and Cadence libraries (Lai et al., 2018). The results in Table 6 are for the CPU (Table 6a) and the DSP (Table 6b). The total run time appears under the "Total Cycles" column, and the run time excluding the interpreter appears under the "Calculation Cycles" column. The difference between them is the interpreter overhead.

Comparing the reference kernels with the optimized kernels reveals considerable performance improvement. For example, between "VWW Reference" and "VWW Optimized," the CMSIS-NN library offers more than a 4x speedup on the Cortex-M4 microcontroller. Optimization on the Xtensa HiFi Mini DSP offers a 7.7x speedup. For Hotword, the speedups are smaller, 25% and 50% over the reference baseline, because less time goes to the kernel calculations and each inner loop accounts for a smaller share of the benchmark model's total run time.
Table 6. Benchmark results on (a) the Sparkfun Edge (Apollo3 Cortex-M4) and (b) the Xtensa HiFi Mini DSP.
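The measurement loop these benchmarks use can be approximated by the sketch below: it times only Invoke() across a set of inputs and excludes model bring-up and tensor allocation, mirroring the methodology described above. ReadCycleCounter() and TotalInvokeCycles() are placeholder names, not part of TF Micro or of the published benchmark harness.

#include <cstddef>
#include <cstdint>
#include <cstring>

#include "tensorflow/lite/micro/micro_interpreter.h"

// Placeholder for a platform cycle counter (e.g., the DWT cycle counter on
// Cortex-M parts); not a TF Micro API.
extern uint32_t ReadCycleCounter();

// Sum the cycles spent in Invoke() across several inputs. Model bring-up
// and AllocateTensors() are assumed to have run beforehand and are not
// part of the measured region.
uint64_t TotalInvokeCycles(tflite::MicroInterpreter& interpreter,
                           const int8_t* inputs, int num_inputs,
                           size_t bytes_per_input) {
  uint64_t total = 0;
  for (int i = 0; i < num_inputs; ++i) {
    // Copy the next input outside the timed region.
    std::memcpy(interpreter.input(0)->data.int8,
                inputs + i * bytes_per_input, bytes_per_input);
    const uint32_t start = ReadCycleCounter();
    interpreter.Invoke();
    total += static_cast<uint32_t>(ReadCycleCounter() - start);
  }
  return total;
}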
6 RELATED WORK

Here we discuss prior efforts and ongoing works that have the potential to mature and enable the broader ecosystem, much like our TF Micro effort.

ELL (Microsoft, 2020). The Embedded Learning Library (ELL) is an open-source library from Microsoft for embedded AI. ELL is a cross-compiler tool chain that enables users to run machine-learning models on resource-constrained platforms similar to the ones we have evaluated.

Graph Lowering (GLOW) (Rotem et al., 2018) is an open-source compiler that accelerates neural-network performance across a range of hardware platforms, both large and small. It initially targeted large machine-learning systems, but NXP recently extended it to focus on Arm Cortex-M MCUs and the Cadence Tensilica HiFi 4 DSP. GLOW employs optimized kernels from vendor-supported libraries. Unlike TF Micro's flexible interpreter-based solution, GLOW for MCUs is based on ahead-of-time compilation for both floating-point and quantized arithmetic.

STM32Cube.AI (STMicroelectronics, 2020) is the only other widely deployed production framework. It takes models from Keras, TensorFlow Lite, and other sources and generates code optimized for a range of STM32-series MCUs. It supports both FP32 and quantized models and comes with built-in optimizations to reduce model size. By comparison, TF Micro is more flexible, having been designed to serve a wide range of MCUs beyond the STMicroelectronics ecosystem.

TensorFlow-Native was an experimental Google system that compiled TensorFlow graphs into C++ code. The simplicity of the resulting code allowed porting of the system to many MCU and DSP targets, but it lacked quantization support as well as the platform-specific optimizations needed to achieve good performance. As we described in Section 3, we firmly believe that leveraging the existing infrastructure is essential to enable broad adoption of the framework. Leveraging the existing toolchain is also essential to provide strong engineering support for product-level applications that run on many devices in the real world.

TinyEngine (Lin et al., 2020) is an inference engine for MCUs. It uses a code-generator-based compiler approach that helps eliminate memory overhead. The authors claim it reduces memory usage by 2.7x and boosts inference speed by 22% over their baseline. TF Micro, by contrast, uses an interpreter-based method, and as our experiments show, the interpreter adds insignificant overhead.

TVM (Chen et al., 2018) is an open-source deep-learning compiler for CPUs, GPUs, and machine-learning accelerators. It enables machine-learning engineers to optimize and run computations efficiently on any hardware back end. It has been ported to Arm's Cortex-M7 and other MCUs.

uTensor (uTensor, 2020), a precursor to TF Micro, is a lightweight machine-learning inference framework specifically designed for Arm. It consists of an offline tool that translates a TensorFlow model into C++ machine code, as well as a run time for execution management.

7 CONCLUSION

TF Micro brings deep learning to embedded systems, significantly broadening the reach of machine learning. It is a framework specifically engineered to run machine learning effectively and efficiently on embedded devices with only a few kilobytes of memory. The framework fits in tens of kilobytes on microcontrollers and DSPs and can handle many basic models.

TF Micro's fundamental contributions are the design decisions that address the challenges of embedded systems: hardware heterogeneity in the fragmented ecosystem, missing software features, and resource constraints. We support multiple embedded platforms based on the widely deployed Arm Cortex-M series of microcontrollers, as well as other ISAs such as DSP cores from Tensilica. The framework does not require operating-system support, standard C or C++ libraries, or dynamic memory allocation, features that are commonly taken for granted in non-embedded domains. This allows it to run efficiently on bare-metal systems.

The methods and techniques presented here are a snapshot of the progress made so far. As embedded-system capabilities grow, so will the framework. For example, we are developing an offline memory planner for more effective memory allocation and fine-grained user control, and we are investigating new approaches to support concurrent execution of ML models. In addition to minimizing memory consumption and improving performance, we are also looking into providing better support for vendor optimizations and build-system support for development environments.

8 ACKNOWLEDGEMENTS

TF Micro is a community-based open-source project and, as such, rests on the work of many. We extend our gratitude to many individuals, teams, and organizations: Fredrik Knutsson and the CMSIS-NN team; Rod Crawford and Matthew Mattina from Arm; Raj Pawate from Cadence; Erich Plondke and Evgeni Gousef from Qualcomm; Jamie Campbell from Synopsys; Yair Siegel from Ceva; Sai Yelisetty from DSP Group; Zain Asgar from Stanford; Dan Situnayake from Edge Impulse; Neil Tan from the uTensor project; Sarah Sirajuddin, Rajat Monga, Jeff Dean, Andy Selle, Tim Davis, Megan Kacholia, Stella Laurenzo, Benoit Jacob, Dmitry Kalenichenko, Andrew Howard, Aakanksha Chowdhery, and Lawrence Chan from Google; and Radhika Ghosal, Sabrina Neuman, Mark Mazumder, and Colby Banbury from Harvard University.
Mattson, P., Reddi, V. J., Cheng, C., Coleman, C., Diamos, G., Kanter, D., Micikevicius, P., Patterson, D., Schmuelling, G., Tang, H., et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro, 40(2):8–16, 2020.

Microsoft. Embedded Learning Library, 2020. URL https://microsoft.github.io/ELL/.

Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., et al. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 446–459. IEEE, 2020.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907, 2018.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

uTensor. uTensor, 2020. URL https://github.com/uTensor/uTensor.

Waterman, A. and Asanovic, K. The RISC-V instruction set manual, volume I: Unprivileged ISA, document version 20190608-base-ratified. RISC-V Foundation, Tech. Rep., 2019.

Wu, X., Lee, I., Dong, Q., Yang, K., Kim, D., Wang, J., Peng, Y., Zhang, Y., Saligane, M., Yasuda, M., et al. A 0.04 mm³ 16 nW wireless and batteryless sensor system with integrated Cortex-M0+ processor and optical communication for cellular temperature measurement. In 2018 IEEE Symposium on VLSI Circuits, pp. 191–192. IEEE, 2018.

Zhang, M. and Sawchuk, A. A. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1036–1043, 2012.

Zhang, Y., Suda, N., Lai, L., and Chandra, V. Hello Edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.