
TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems

Robert David 1, Jared Duke 1, Advait Jain 1, Vijay Janapa Reddi 1 2, Nat Jeffries 1, Jian Li 1, Nick Kreeger 1, Ian Nappier 1, Meghna Natraj 1, Shlomi Regev 1, Rocky Rhodes 1, Tiezhen Wang 1, Pete Warden 1

arXiv:2010.08678v2 [cs.LG] 20 Oct 2020

Abstract
Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny
embedded devices are omnipresent. But we must overcome major challenges before we can benefit from this
opportunity. Embedded processors are severely resource constrained. Their nearest mobile counterparts exhibit at
least a 100x to 1,000x difference in compute capability, memory availability, and power consumption. As a result,
the machine-learning (ML) models and associated ML inference framework must not only execute efficiently but
also operate in a few kilobytes of memory. Also, the embedded devices’ ecosystem is heavily fragmented. To
maximize efficiency, system vendors often omit many features that commonly appear in mainstream systems,
including dynamic memory allocation and virtual memory, that allow for cross-platform interoperability. The
hardware comes in many flavors (e.g., instruction-set architecture and FPU support, or lack thereof). We introduce
TensorFlow Lite Micro (TF Micro), an open-source ML inference framework for running deep-learning models
on embedded systems. TF Micro tackles the efficiency requirements imposed by embedded-system resource
constraints and the fragmentation challenges that make cross-platform interoperability nearly impossible. The
framework adopts a unique interpreter-based approach that provides flexibility while overcoming these challenges.
This paper explains the design decisions behind TF Micro and describes its implementation details. Also, we
present an evaluation to demonstrate its low resource requirement and minimal run-time performance overhead.

1 Introduction

Tiny machine learning (TinyML) is a burgeoning field at the intersection of embedded systems and machine learning. The world has over 250 billion microcontrollers (IC Insights, 2020), with strong growth projected over coming years. As such, a new range of embedded applications is emerging for neural networks. Because these models are extremely small (a few hundred KB), running on microcontrollers or DSP-based embedded subsystems, they can operate continuously with minimal impact on device battery life.

The most well-known and widely deployed example of this new TinyML technology is keyword spotting, also called hotword or wakeword detection (Chen et al., 2014; Gruenstein et al., 2017; Zhang et al., 2017). Amazon, Apple, Google, and others use tiny neural networks on billions of devices to run always-on inferences for keyword detection, and this is far from the only TinyML application. Low-latency analysis and modeling of sensor signals from microphones, low-power image sensors, accelerometers, gyros, PPG optical sensors, and other devices enable consumer and industrial applications, including predictive maintenance (Goebel et al., 2020; Susto et al., 2014), acoustic-anomaly detection (Koizumi et al., 2019), visual object detection (Chowdhery et al., 2019), and human-activity recognition (Chavarriaga et al., 2013; Zhang & Sawchuk, 2012).

Unlocking machine learning's potential in embedded devices requires overcoming two crucial challenges. First and foremost, embedded systems have no unified TinyML framework. When engineers have deployed neural networks to such systems, they have built one-off frameworks that require manual optimization for each hardware platform. These frameworks have tended to be narrowly focused, lacking features to support multiple applications and lacking portability across a wide range of hardware. The developer experience has therefore been painful, requiring hand optimization of models to run on a specific device. And altering these models to run on another device necessitated manual porting and repeated optimization effort. A second-order effect of this situation is that the slow pace and high cost of training and deploying models to embedded hardware prevents developers from easily justifying the investment required to build new features.

1 Google, 2 Harvard University. Correspondence to: Pete Warden <[email protected]>, Vijay Janapa Reddi (contributions made as a Google Visiting Researcher) <[email protected]>.
Another challenge limiting TinyML is that hardware vendors have related but separate needs. Without a generic TinyML framework, evaluating hardware performance in a neutral, vendor-agnostic manner has been difficult. Frameworks are tied to specific devices, and it is hard to determine the source of improvements because they can come from hardware, software, or the complete vertically integrated solution.

The lack of a proper framework has been a barrier to accelerating TinyML adoption and application in products. Beyond deploying a model to an embedded target, the framework must also have a means of training a model on a higher-compute platform. TinyML must exploit a broad ecosystem of tools for ML, as well as for orchestrating and debugging models, which are beneficial for production devices.

Prior efforts have attempted to bridge this gap, and we discuss some of them later (Section 6). Briefly, we can distill the general issues facing many of the frameworks into the following:

• Inability to easily and portably deploy models across multiple embedded hardware architectures
• Lack of optimizations that take advantage of the underlying hardware without requiring framework developers to make platform-specific efforts
• Lack of productivity tools that connect training pipelines to deployment platforms and tools
• Incomplete infrastructure for compression, quantization, model invocation, and execution
• Minimal support features for performance profiling, debugging, orchestration, and so on
• No standard benchmarks that allow hardware vendors to quantify their chip's performance in a fair and reproducible manner
• Lack of testing in real-world applications.

To address these issues, we introduce TensorFlow Lite Micro (TF Micro), which mitigates the slow pace and high cost of training and deploying models to embedded hardware by emphasizing portability and flexibility. TF Micro makes it easy to get TinyML applications running across architectures, and it allows hardware vendors to incrementally optimize kernels for their devices. TF Micro gives vendors a neutral platform to prove their hardware's performance in TinyML applications. It offers these benefits:

• Our interpreter-based approach is portable, flexible, and easily adapted to new applications and features
• We minimize the use of external dependencies and library requirements to be hardware agnostic
• We enable hardware vendors to provide platform-specific optimizations on a per-kernel basis without writing target-specific compilers
• We allow hardware vendors to easily integrate their kernel optimizations to ensure performance in production and comparative hardware benchmarking
• Our model-architecture framework is open to a wide machine-learning ecosystem and the TensorFlow Lite model conversion and optimization infrastructure
• We provide benchmarks that are being adopted by industry-leading benchmark bodies like MLPerf
• Our framework supports popular, well-maintained Google applications that are in production.

This paper makes several contributions: First, we clearly lay out the challenges to developing a machine-learning framework for embedded devices that supports the fragmented embedded ecosystem. Second, we provide design and implementation details for a system specifically created to cope with these challenges. And third, we demonstrate that an interpreter-based approach, which is traditionally viewed as a low-performance alternative to compilation, is in fact highly suitable for the embedded domain, specifically for machine learning. Because machine-learning performance is largely dictated by linear-algebra computations, the interpreter design imposes minimal run-time overhead.

2 Technical Challenges

Many issues make developing a machine-learning framework for embedded systems particularly difficult. In this section, we summarize the main ones.

2.1 Missing Features

Embedded platforms are defined by their tight limitations. Therefore, many advances from the past few decades that have made software development faster and easier are unavailable to these platforms because the resource tradeoffs are too expensive. Examples include dynamic memory management, virtual memory, an operating system, a standard instruction set, a file system, floating-point hardware, and other tools that seem fundamental to modern programmers (Kumar et al., 2017). Though some platforms provide a subset of these features, a framework targeting widespread adoption in this market must avoid relying on them.

2.2 Fragmented Market and Ecosystem

Many embedded-system uses only require fixed software developed alongside the hardware, usually by an affiliated team. The lack of applications capable of running
on the platform is therefore much less important than it is for general-purpose computing. Moreover, backward instruction-set-architecture (ISA) compatibility with older software matters less than in mainstream systems because everything that runs on an embedded system is probably compiled from source code anyway. Thus, the hardware can aggressively diversify to meet power requirements, whereas even the latest x86 processor can still run instructions that are nearly three decades old (Intel, 2013).

These differences mean the pressure to converge on one or two dominant platforms or ISAs is much weaker in the embedded space, leading to fragmentation. Many ISAs have thriving ecosystems, and the benefits they bring to particular applications outweigh developers' cost of switching. Companies even allow developers to add their own ISA extensions (ARM, 2019; Waterman & Asanovic, 2019).

Matching the wide variety of embedded architectures are the numerous tool chains and integrated development environments (IDEs) that support them. Many of these systems are only available through a commercial license with the hardware manufacturer, and in cases where a customer has requested specialized instructions, they may be inaccessible to everyone. These arrangements have no open-source ecosystem, leading to device fragmentation that prevents a lone development team from producing software that runs well on many different embedded platforms.

2.3 Resource Constraints

People who build embedded devices do so because a more general-purpose computing platform exceeds their design limits. The biggest drivers are cost, with a microcontroller typically selling for less than a few dollars (IC Insights, 2020); power consumption, as embedded devices may require just a few milliwatts of power, whereas mobile and desktop CPUs require watts; and form factor, since capable microcontrollers are smaller than a grain of rice (Wu et al., 2018).

To meet their needs, hardware designers trade off capabilities. A common characteristic of an embedded system is its low memory capacity. At one end of the spectrum, a big embedded system has a few megabytes of flash ROM and at most a megabyte of SRAM. At the other end, a small embedded system has just a few hundred kilobytes or fewer, often split between ROM and RAM (Zhang et al., 2017). These constraints mean both working memory and permanent storage are much smaller than most software written for general-purpose platforms would assume. In particular, the size of the compiled code in storage requires minimization.

Most software written for general-purpose platforms contains code that often goes uncalled on a given device. The reason is that choosing the code path at run time is a better use of engineering resources than shipping more highly customized executables. This run-time flexibility is hard to justify when code size is a concern and the potential uses are fewer. As a result, developers often must break through the library's abstraction if they want to make modifications to suit their target hardware.

2.4 Ongoing Changes to Deep Learning

Machine learning remains in its infancy despite its breakneck pace. Researchers are still experimenting with new operations and network architectures to glean better predictions from their models. Their success in improving results leads product designers to demand these enhanced models.

Because new mathematical operations, or other fundamental changes to neural-network calculations, often drive the model advances, adopting these models in software means porting the changes, too. Since research directions are hard to predict and advances are frequent, keeping a framework up to date and able to run the newest, best models requires a lot of work. For instance, while TensorFlow has more than 1,400 operations (TensorFlow, 2020e), TensorFlow Lite, which is deployed on more than four billion edge devices worldwide, supports only about 130 operations. Not all operations are worth supporting, however.

3 Design Principles

To address the challenges facing TinyML on embedded systems (Section 2), we developed a set of principles to guide the design of the TF Micro framework.

3.1 Minimize Feature Scope for Portability

We believe an embedded machine-learning (ML) framework should assume the model, input data, and output arrays are in memory, and it should only handle ML calculations based on those values. The design should exclude any other function, no matter how useful. In practice, this approach means the library should omit features such as loading models from a file system or accessing peripherals for inputs.

This strong design principle is crucial because many embedded platforms are missing basic features, such as memory management and library support (Section 2.1), that mainstream computing platforms take for granted. Supporting the myriad possibilities would make porting the ML framework across devices unwieldy.

Fortunately, ML models are functional, having clear inputs, outputs, and possibly some internal state but no external side effects. Running a model need not involve calls to peripherals or other operating-system functions. To remain efficient, we focus only on implementing those calculations.
3.2 Enable Vendor Contributions to Span Ecosystem

All embedded devices can benefit from high-performance kernels optimized for a given microprocessor. But no one team can easily support such kernels for the entire embedded market because of the ecosystem's fragmentation (see Section 2.2). Worse, optimization approaches vary greatly depending on the target architecture.

The companies with the strongest motivation to deliver maximum performance on a set of devices are the ones that design and sell the underlying embedded microprocessors. Although developers at these companies are highly experienced at optimizing traditional numerical algorithms (e.g., digital signal processing) for their hardware, they often lack deep-learning experience. Therefore, evaluating whether optimization changes are detrimental to model accuracy and overall performance is difficult.

To improve the development experience for hardware vendors and application developers, we make sure optimizing the core library operations is easy. One goal is to ensure substantial technical support (tests and benchmarks) for developer modifications and to encourage submission to a library repository (details in Section 4).

3.3 Reuse TensorFlow Tools for Scalability

The TensorFlow training environment includes more than 1,400 operations, similar to other training frameworks (TensorFlow, 2020e). Most inference frameworks, however, explicitly support only a subset of these operations, making exports difficult. An exporter takes a trained model (such as a TensorFlow model) and generates a TensorFlow Lite model file (.tflite); after conversion, the model file can be deployed to a client device (e.g., a mobile or embedded system) and run locally using the TensorFlow Lite interpreter.

Figure 1. Model-export workflow: a training graph from the TensorFlow training environment passes through the TensorFlow Lite exporter, which produces a TensorFlow Lite FlatBuffer file containing an ordered operator list and the model weights.

Exporters receive a constant stream of new operations, most defined only by their implementation code. Because the operands lack clean semantic definitions beyond their implementations and unit tests, supporting these operations is difficult. Attempting to do so is like working with an elaborate CISC ISA without access to a basic data sheet.

Manually converting one or two models (and all the associated operations) to a new representation is easy. Users will want to convert a large space of potential models, however, and the task of understanding and changing model architectures to accommodate a framework's requirements is difficult. Often, only after users have built and trained a model do they discover whether all of its operations are compatible with the target inference framework. Worse, many users employ high-level APIs, such as Keras (Chollet et al., 2015), which may hide low-level operations, complicating the task of removing dependence on unsupported operations. Also, researchers and product developers often split responsibilities, with the former creating models and the latter deploying them. Since product developers are the ones who discover the export errors, they may lack the expertise or permission to retrain the model.

Model operators have no governing principles or a unified set of rules. Even if an inference framework supports an operation, particular data types may not be supported, or the operation may exclude certain parameter ranges or may only serve in conjunction with other operations. This situation creates a barrier to providing error messages that guide developers.

Resource constraints also add many requirements to an exporter. Most training frameworks focus on floating-point calculations, since they are the most flexible numerical representation and are well optimized for desktop CPUs and GPUs. Fitting into small memories, however, makes eight-bit and other quantized representations valuable for embedded deployment. Some techniques can convert a model trained in floating point to a quantized representation (Krishnamoorthi, 2018), but they all increase exporter complexity. Some also require support during the training process, necessitating changes to the creation framework as well. Other optimizations are also expected during export, such as folding constant expressions into fixed values (even in complex cases like batch normalization (Zhang et al., 2017)) and removing dropout and similar operations that are only useful during training (Srivastava et al., 2014).

Because writing a robust model converter takes a tremendous amount of engineering work, we built atop the existing TensorFlow Lite tool chain, as Figure 1 shows. We exploited the strong integration with the TensorFlow training environment and extended it for deeply embedded machine-learning systems. For example, we reused the TensorFlow Lite reference kernels in TF Micro, thus giving users a harmonized environment for model development and execution.

3.4 Build System for Heterogeneous Support

Another crucial feature of an embedded inference framework is a flexible build environment. The build system must support the highly heterogeneous ecosystem and avoid falling captive to any one platform. Otherwise, developers would avoid adopting it due to the lack of portability, and so would the hardware platform vendors.

In desktop and mobile systems, frameworks commonly provide precompiled libraries and other binaries as the main
software-delivery method. This approach is impractical in embedded platforms because they encompass too many different devices, operating systems, and tool-chain combinations to allow a balancing of modularity, size, and other constraints. Additionally, embedded developers must often make code changes to meet such constraints.

In response, we prioritize code that is easy to build using a wide variety of IDEs and tool chains. This approach means we avoid techniques that rely on build-system features that do not generalize across platforms. Examples of such features include setting custom include paths, compiling tools for the host processor, using custom binaries or shell scripts to produce code, and defining preprocessor macros on the command line.

Our principle is that we should be able to create source files and headers for a given platform, and users should then be able to drag and drop those files into their IDE or tool chain and compile them without any changes. We call it the "Bag of Files" principle. Anything more complex would prevent adoption by many platforms and developers.

4 Implementation

In this section, we discuss our implementation decisions and tradeoffs. We begin with a system overview (Figure 2) and then describe specific modules in detail.

Figure 2. Implementation-module overview: the application calls into TF Micro through the client API; the TF Micro interpreter comprises a model loader, memory planner, and operator resolver, and it invokes the operator implementations through the operator API.

4.1 System Overview

The first step in developing a TF Micro application is to create a live neural-network-model object in memory. To do so, the application developer produces an "operator resolver" object through the client API. The "OpResolver" API controls which operators link to the final binary, minimizing executable size.

The second step is to supply a contiguous memory array, called the "arena," that holds intermediate results and other variables the interpreter needs. Doing so is necessary because we assume dynamic memory allocation, such as malloc or new, is unavailable.

The third step is to create an interpreter instance (Section 4.2), supplying it with the model, operator resolver, and arena as arguments. The interpreter allocates all required memory from the arena during the initialization phase. We intentionally avoid any allocations afterward to prevent heap fragmentation from causing errors in long-running applications. Operator implementations may need to allocate memory for use during the evaluation, so the preparation functions of each operator are called during this phase, allowing their memory requirements to be communicated to the interpreter. The application-supplied OpResolver maps the operator types listed in the serialized model to the implementation functions.

A C API call handles all communication between the interpreter and operators to ensure operator implementations are modular and independent of the interpreter's implementation. This approach eases replacement of operator implementations with optimized versions, and it also encourages reuse of other systems' operator libraries (e.g., as part of a code-generation project).

The fourth step, after initialization, is model execution. The application retrieves pointers to the memory regions that represent the model inputs and populates them with values (often derived from sensors or other user-supplied data). Once the inputs are available, the application invokes the interpreter to perform the model calculations. This process involves iterating through the topologically sorted operations, using offsets calculated during memory planning to locate the inputs and outputs, and calling the evaluation function for each operation.

Finally, after it evaluates all the operations, the interpreter returns control to the application. Invocation is a simple blocking call, but an application can still perform one from a thread, and platform-specific operators can still split their work across processors. Once invocation finishes, the application can query the interpreter to determine the location of the arrays containing the model-calculation outputs and then use those outputs.

The framework omits any threading or multitasking support, since any such features would require less-portable code and operating-system dependencies. However, we support multitenancy. The framework can run multiple models as long as they do not need to run concurrently with one another.
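The four steps above map onto only a handful of client API calls. The sketch below is a condensed C++ example patterned after the public TF Micro sample applications; the exact header paths, resolver methods, and constructor signatures vary across library versions, and g_model_data is a hypothetical name standing in for a model that has been compiled into the binary as a byte array.

#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Hypothetical model array compiled into the binary (see Section 4.3.1).
extern const unsigned char g_model_data[];

// Step 2: a fixed-size arena that holds every tensor and all interpreter metadata.
constexpr int kArenaSize = 10 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

int RunInferenceOnce() {
  static tflite::MicroErrorReporter error_reporter;
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Step 1: register only the operators this model uses, keeping code size small.
  static tflite::MicroMutableOpResolver<3> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();

  // Step 3: the interpreter plans and allocates all memory from the arena here;
  // no further allocations happen during invocation.
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize, &error_reporter);
  if (interpreter.AllocateTensors() != kTfLiteOk) return -1;

  // Step 4: populate the input tensor (e.g., from a sensor) and run the model.
  TfLiteTensor* input = interpreter.input(0);
  for (size_t i = 0; i < input->bytes; ++i) input->data.int8[i] = 0;
  if (interpreter.Invoke() != kTfLiteOk) return -1;

  // Read the result from the output tensor.
  TfLiteTensor* output = interpreter.output(0);
  return output->data.int8[0];
}

Because the resolver, arena, and interpreter are all statically allocated, the application needs neither malloc/new nor a file system, which matches the design constraints of Section 3.1.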
4.2 TF Micro Interpreter

TF Micro is an interpreter-based machine-learning framework. The interpreter loads a data structure that clearly defines a model. Although the execution code is static, the interpreter handles the model data at run time, and this data controls which operators to execute and where to draw the model parameters from.

We chose an interpreter on the basis of our experience deploying production models on embedded hardware. We see a need to easily update models in the field, a task that may be infeasible using code generation. Using an interpreter, however, sharing code across multiple models and applications is easier, as is maintaining the code, since it allows updates without re-exporting the model.

The alternative to an interpreter-based inference engine is to generate native code from a model during export using C or C++, baking operator function calls into fixed machine code. This can increase performance at the expense of portability, since the code would need recompilation for each target.

We incorporate some important code-generation features in our approach. For example, because our library is buildable from source files alone (Section 3.4), we achieve much of the compilation simplicity of generated code.

4.3 Model Loading

As mentioned, the interpreter loads a data structure that clearly defines a model. For this work, we used the TensorFlow Lite portable data schema (TensorFlow, 2020b). Reusing the export tools from TensorFlow Lite enabled us to import a wide variety of models at little engineering cost.

4.3.1 Model Serialization

TensorFlow Lite for smartphones and other mobile devices employs the FlatBuffer serialization format to hold models (TensorFlow, 2020a). The binary footprint of the accessor code is typically less than two kilobytes. It is a header-only library, making compilation easy, and it is memory efficient because the serialization protocol does not require unpacking to another representation.

The downside to this format is that its C++ header requires the platform compiler to support the C++11 specification. We had to work with several vendors to upgrade their tool chains to handle this version, but since we had implicitly chosen modern C++ by basing our framework on TensorFlow Lite, it has been a minor obstacle.

Another challenge of this format was that most of our target devices lacked file systems, but because it uses a memory-mapped representation, files are easy to convert into C source files containing data arrays. These files are compilable into the binary, to which the application can refer.
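As a concrete illustration of this point, the serialized .tflite FlatBuffer is usually converted offline into a C/C++ source file (commonly with a tool such as xxd -i) and linked into the firmware image. The following is a hand-written sketch of what such a generated file looks like; the byte values and names are illustrative only.

// model_data.cc -- generated offline from a .tflite file; contents abbreviated.
#include <cstdint>

// Keeping the array aligned lets the FlatBuffer be read in place,
// with no unpacking step and no file system required.
alignas(16) const unsigned char g_model_data[] = {
    0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,  // illustrative header bytes
    // ... the remaining bytes of the serialized model ...
};
const unsigned int g_model_data_len = sizeof(g_model_data);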
4.3.2 Model Representation

We also copied the TensorFlow Lite representation, the stored schema of data and values that represent the model. This schema was designed for mobile platforms with storage efficiency and fast access in mind, so it has many features that eased development for embedded platforms. For example, operations reside in a topologically sorted list rather than a directed acyclic graph. Performing calculations is as simple as looping through the operation list in order, whereas a full graph representation would require preprocessing to satisfy the operations' input dependencies.

The biggest drawback of this representation is that it was designed to be portable from system to system, so it requires run-time processing to yield the information that inferencing requires. For example, it abstracts operator parameters from the arguments, which later pass to the functions that implement those operations. Thus, each operation requires a few code lines executed at run time to convert from the serialized representation to the structure in the underlying implementation. The code-size overhead is small, but it reduces the readability and compactness of the operator implementations.

Memory planning is a related issue. On mobile devices, TensorFlow Lite supports variable-size inputs, so all dependent operations may also vary in size. Planning the optimal memory layout of intermediate buffers for the calculations must therefore take place at run time when all buffer dimensions are known.

4.4 Memory Management

We are unable to assume the operating system can dynamically allocate memory. So the framework allocates and manages memory from a provided memory arena. During model preparation, the interpreter determines the lifetime and size of all buffers necessary to run the model. These buffers include run-time tensors, persistent memory to store metadata, and scratch memory to temporarily hold values while the model runs (Section 4.4.1). After accounting for all required buffers, the framework creates a memory plan that reuses nonpersistent buffers when possible while ensuring buffers are valid during their required lifetime (Section 4.4.2).

4.4.1 Persistent Memory and Scratchpads

We require applications to supply a fixed-size memory arena when they create the interpreter and to keep the arena intact throughout the interpreter's lifetime. Allocations with the same lifetime can treat this arena as a stack. If an allocation takes up too much space, we raise an application-level error.

To prevent memory errors from interrupting a long-running program, we ensure that allocations only occur during the interpreter's initialization phase. No allocation (through our mechanisms) is possible during model invocation.
This simplistic approach works well for initial prototyping, but it wastes memory because many allocations could overlap with others in time. One example is data structures that are only necessary during initialization. Their values are irrelevant after initialization, but because their lifetime is the same as the interpreter's, they continue to take up arena space. A model's evaluation phase also requires variables that need not persist from one invocation to another.

Hence, we modified the allocation scheme so that initialization- and evaluation-lifetime allocations reside in a separate stack relative to interpreter-lifetime objects. This scheme uses a stack that increments from the lowest address for the function-lifetime objects ("Head" in Figure 3) and a stack that decrements from the arena's highest address for interpreter-lifetime allocations ("Tail" in Figure 3). When the two stack pointers cross, they indicate a lack of capacity.

Figure 3. Two-stack allocation strategy: head allocations grow upward from the lowest address of the global tensor-arena buffer, tail allocations grow downward from the highest address, and "temp" allocations use the space between the two stacks.

The two-stack allocation strategy works well for both shared buffers and persistent buffers. But model preparation also holds allocation data that model inference no longer needs. Therefore, we use the space between the two stacks for temporary allocations while a model is in memory planning. Any temporary data required during model inference resides in the persistent-stack allocation section.

Overall, our approach reduces the arena's required memory because the initialization allocations can be discarded after that function is done, and the memory is reusable for evaluation variables. This approach also enables advanced applications to reuse the arena's function-lifetime section in between evaluation calls.
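To make the head/tail arrangement concrete, here is a deliberately simplified two-stack arena allocator. It is an illustrative reimplementation of the idea, not TF Micro's actual allocator class, and it omits alignment handling for brevity.

#include <cstddef>
#include <cstdint>

// Two stacks share one arena: the head grows up for function-lifetime data,
// the tail grows down for interpreter-lifetime data. When the pointers would
// cross, the arena is out of capacity.
class TwoStackArena {
 public:
  TwoStackArena(uint8_t* buffer, size_t size)
      : start_(buffer), head_(buffer), tail_(buffer + size) {}

  // Function-lifetime (temporary) allocations come from the head.
  void* AllocateFromHead(size_t bytes) {
    if (head_ + bytes > tail_) return nullptr;  // would collide with the tail
    void* result = head_;
    head_ += bytes;
    return result;
  }

  // Interpreter-lifetime (persistent) allocations come from the tail.
  void* AllocateFromTail(size_t bytes) {
    if (static_cast<size_t>(tail_ - head_) < bytes) return nullptr;
    tail_ -= bytes;
    return tail_;
  }

  // Discard every head allocation once initialization data is no longer
  // needed, so the same space can hold evaluation-time variables.
  void ResetHead() { head_ = start_; }

 private:
  uint8_t* start_;
  uint8_t* head_;
  uint8_t* tail_;
};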
4.4.2 Memory Planner

A more complex optimization opportunity involves the space required for intermediate calculations during model evaluation. An operator may write to one or more output buffers, and later operators may read them as inputs. If the output is not exposed to the application as a model output, its contents need only remain until the last operation that needs them has finished. Its presence is also unnecessary until just before the operation that populates it executes. Memory reuse is possible by overlapping allocations that are unneeded during the same evaluation sections.

The memory allocations required over time can be visualized using rectangles (Figure 4a), where one dimension is memory size and the other is the time during which each allocation must be preserved. The overall memory can be substantially reduced if some areas are reused or compacted together. Figure 4b shows a more optimal memory layout.

Figure 4. Intermediate allocation strategies: (a) naive placement of per-operator buffers versus (b) bin packing of the same buffers into a smaller footprint.

Memory compaction is an instance of bin packing (Martello, 1990). Calculating the perfect allocation strategy for arbitrary models without exhaustively trying all possibilities is an unsolved problem, but a first-fit decreasing algorithm (Garey et al., 1972) usually provides reasonable solutions. In our case, this approach consists of gathering a list of all temporary allocations, including size and lifetime; sorting the list in descending order by size; and placing each allocation in the first sufficiently large gap, or at the end of the buffer if no such gap exists. We do not support dynamic shapes in the TF Micro framework, so we must know at initialization all the information necessary to perform this algorithm. The "Memory Planner" encapsulates this process (Figure 2); it allows us to minimize the arena portion devoted to intermediate tensors. Doing so offers a substantial memory-use reduction for many models.

Memory planning at run time incurs more overhead during model preparation than a preplanned memory-allocation strategy. This cost, however, comes with the benefit of model generality: TF Micro models simply list their operator and tensor requirements, and the plan is built at run time, which enables this capability for many model types.

Offline-planned tensor allocation is an alternative memory-planning feature of TF Micro. It allows a more compact memory plan, gives memory-plan ownership and control to the end user, imposes less overhead on the MCU during initialization, and enables more-efficient power options by allowing different memory banks to store certain memory areas. We allow the user to create a memory layout on a host before run time. The memory layout is stored as model FlatBuffer metadata and contains an array of fixed-memory arena offsets for an arbitrary number of variable tensors.
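The first-fit-decreasing procedure just described is small enough to sketch in full. The version below is illustrative rather than TF Micro's actual MemoryPlanner code, and it uses std::vector for clarity even though the on-device planner works out of the arena without heap allocation.

#include <algorithm>
#include <cstddef>
#include <vector>

// One intermediate buffer: its size and the operation range over which it
// must stay valid.
struct Allocation {
  size_t size;
  int first_use;   // index of the first operation that touches the buffer
  int last_use;    // index of the last operation that needs its contents
  size_t offset;   // assigned arena offset (filled in by the planner)
};

static bool LifetimesOverlap(const Allocation& a, const Allocation& b) {
  return a.first_use <= b.last_use && b.first_use <= a.last_use;
}

// First-fit decreasing: sort buffers by size (largest first) and give each
// one the lowest offset that does not conflict with an already-placed buffer
// that is alive at the same time. Returns the arena bytes the plan needs.
size_t PlanIntermediateMemory(std::vector<Allocation>& allocations) {
  std::vector<Allocation*> order;
  for (Allocation& a : allocations) order.push_back(&a);
  std::sort(order.begin(), order.end(),
            [](const Allocation* a, const Allocation* b) { return a->size > b->size; });

  std::vector<Allocation*> placed;
  size_t arena_bytes = 0;
  for (Allocation* current : order) {
    size_t offset = 0;
    bool moved = true;
    while (moved) {  // keep bumping the candidate offset past conflicts
      moved = false;
      for (Allocation* other : placed) {
        const bool space_overlap = offset < other->offset + other->size &&
                                   other->offset < offset + current->size;
        if (space_overlap && LifetimesOverlap(*current, *other)) {
          offset = other->offset + other->size;
          moved = true;
        }
      }
    }
    current->offset = offset;
    placed.push_back(current);
    arena_bytes = std::max(arena_bytes, offset + current->size);
  }
  return arena_bytes;
}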
Figure 5. Memory-allocation strategy for a single model (a) versus a multi-tenancy scenario with multiple models (b). In TF Micro, there is a one-to-one binding between a model, an interpreter, and the memory allocations made for the model (which may come from a shared memory arena).

4.5 Multitenancy

Embedded-system constraints can force application-model developers to create several specialized models instead of one large monolithic model. Hence, supporting multiple models on the same embedded system may be necessary.

If an application has multiple models that need not run simultaneously, TF Micro supports multitenancy with some memory-planner changes that are transparent to the developer. TF Micro supports memory-arena reuse by enabling the multiple model interpreters to allocate memory from a single arena. We allow interpreter-lifetime areas to stack on each other in the arena and reuse the function-lifetime section for model evaluation. The reusable (nonpersistent) part is set to the largest requirement, based on all models allocating in the arena. The nonreusable (persistent) allocations grow for each model; allocations are model specific.

4.6 Multi-threading

TF Micro is thread-safe as long as no state corresponding to the model is kept outside the interpreter and the model's memory allocation within the arena. The interpreter's only variables are kept in the arena, and each interpreter instance is uniquely bound to a specific model. Therefore, TF Micro can safely support multiple interpreter instances running from different tasks or threads.

TF Micro can also run safely on multiple MCU cores. Since the only variables used by the interpreter are kept in the arena, this works well in practice. The executable code is shared, but the arenas ensure there are no threading issues.

4.7 Operator Support

Operators are the calculation units in neural-network graphs. They represent a sizable amount of computation, typically requiring many thousands or even millions of individual arithmetic operations (e.g., multiplies or additions). They are functional, with well-defined inputs, outputs, and state variables as well as no side effects beyond them.

Because the model execution's latency, power consumption, and code size tend to be dominated by the implementations of these operations, they are typically specialized for particular platforms to take advantage of hardware characteristics. We attracted library optimizations from hardware vendors such as Arm, Cadence, Ceva, and Synopsys.

Well-defined operator boundaries mean it is possible to define an API that communicates the inputs and outputs but hides implementation details behind an abstraction. Several chip vendors have provided a library of neural-network kernels designed to deliver maximum neural-network performance when running on their processors. For example, Arm has provided optimized CMSIS-NN libraries divided into several functions, each covering a category: convolution, activation, fully connected layer, pooling, softmax, and optimized basic math. TF Micro uses CMSIS-NN to deliver high performance, as we demonstrate in Section 5.

4.8 Platform Specialization

TF Micro gives developers flexibility to modify the library code. Because operator implementations (kernels) often consume the most time when executing models, they are prominent targets for platform-specific optimization.

We wanted to make swapping in new implementations easy. To do so, we allow specialized versions of the C++ source code to override the default reference implementation. Each kernel has a reference implementation in a directory, but subfolders contain optimized versions for particular platforms (e.g., the Arm CMSIS-NN library).

As we explain in Section 4.9, the platform-specific source files replace the reference implementations during all build steps when targeting the named platform or library (e.g., using TAGS="cmsis-nn"). Each platform is given a unique tag. The tag is a command-line argument to the build system that replaces the reference kernels during compilation. In a similar vein, library modifiers can swap or change the implementations incrementally with no changes to the build scripts and the overarching build system we put in place.
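The operator API boundary described in Sections 4.7 and 4.8 can be illustrated with TensorFlow Lite's TfLiteRegistration structure, which bundles a kernel's preparation and evaluation entry points. The toy kernel below simply doubles a float tensor; it is a schematic example rather than a real TF Micro kernel, and helper names and registration details differ between library versions. A vendor would keep the same registration but swap the Eval body for code tuned to its core.

#include "tensorflow/lite/c/common.h"
#include "tensorflow/lite/kernels/kernel_util.h"

namespace {

// Prepare runs once while the interpreter allocates memory: check types and
// shapes and request any scratch buffers here, never during evaluation.
TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteTensor* input = tflite::GetInput(context, node, 0);
  TfLiteTensor* output = tflite::GetOutput(context, node, 0);
  TF_LITE_ENSURE_EQ(context, input->type, kTfLiteFloat32);
  TF_LITE_ENSURE_EQ(context, input->bytes, output->bytes);
  return kTfLiteOk;
}

// Eval runs on every Invoke(); this toy version doubles each element.
TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteTensor* input = tflite::GetInput(context, node, 0);
  TfLiteTensor* output = tflite::GetOutput(context, node, 0);
  const int count = static_cast<int>(input->bytes / sizeof(float));
  for (int i = 0; i < count; ++i) {
    output->data.f[i] = 2.0f * input->data.f[i];
  }
  return kTfLiteOk;
}

}  // namespace

// The interpreter sees only this registration, so reference and optimized
// implementations of the same operator are interchangeable.
TfLiteRegistration* Register_DOUBLE() {
  static TfLiteRegistration registration = {/*init=*/nullptr, /*free=*/nullptr,
                                            /*prepare=*/Prepare, /*invoke=*/Eval};
  return &registration;
}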
4.9 Build System

To address the embedded market's fragmentation (Section 2.2), we needed our code to compile on many platforms. We therefore wrote the code to be highly portable, exhibiting few dependencies, but it was insufficient to give potential users a good experience on a particular device.

Most embedded developers employ a platform-specific IDE or tool chain that abstracts many details of building subcomponents and presents libraries as interface modules. Simply giving developers a folder hierarchy containing source-code files would still leave them with multiple steps before they could build and compile that code into a usable library.

Therefore, we chose a single makefile-based build system to determine which files the library required, then generated the project files for the associated tool chains. The makefile held the source-file list, and we stored the platform-specific project files as templates that the project-generation process filled in with the source-file information. That process may also perform other postprocessing to convert the source files to a format suitable for the target tool chain.

Our platform-agnostic approach has enabled us to support a variety of tool chains with minimal engineering work, but it does have some drawbacks. We implemented the project generation through an ad hoc mixture of makefile scripts and Python. This strategy makes the process difficult to debug, maintain, and extend. Our intent is for future versions to keep the concept of a master source-file list that only the makefile holds, but then delegate the actual generation to better-structured Python in a more maintainable way.

5 System Evaluation

TF Micro has undergone testing, and it has been deployed extensively with many processors based on the Arm Cortex-M architecture (Arm, 2020). It has been ported to other architectures, including the ESP32 (Espressif, 2020) and many digital signal processors (DSPs). The framework is also available as an Arduino library, and it can generate projects for environments such as Mbed (ARM, 2020) as well. In this section, we use two representative platforms to assess and quantify TF Micro's computational and memory overheads.

5.1 Experimental Setup

We selected two platforms on which to evaluate TF Micro (Table 1). First is the Sparkfun Edge, which has an Ambiq Apollo3 MCU. The Apollo3 is powered by an Arm Cortex-M4 core and operates in burst mode at 96 MHz (Ambiq Micro, 2020). The second platform is an Xtensa HiFi Mini DSP, which is based on the Cadence Tensilica architecture (Cadence, 2020).

Table 1. Embedded-platform benchmarking.

Platform                        Processor            Clock    Flash   RAM
Sparkfun Edge (Ambiq Apollo3)   Arm Cortex-M4 CPU    96 MHz   1 MB    0.38 MB
Xtensa HiFi Mini                Tensilica HiFi DSP   10 MHz   1 MB    1 MB

Our benchmarks are INT8 TensorFlow Lite models in a serialized FlatBuffer format. We use the Visual Wake Words (VWW) person-detection model (Chowdhery et al., 2019), which represents a common microcontroller vision task of identifying whether a person appears in a given image. The model is trained and evaluated on images from the Microsoft COCO data set (Lin et al., 2014). It primarily stresses and measures the performance of convolutional operations.

Also, we use the Google Hotword model, which aids in detecting the key phrase "OK Google." This model is designed to be small and fast enough to run constantly on a low-power DSP in smartphones and other devices with Google Assistant. Because it is proprietary, we use a version with scrambled weights and biases.

The benchmarks run multiple inputs through a single model, measuring the time to process each input and produce an inference output. The benchmark does not measure the time necessary to bring up the model and configure the run time, since the recurring inference cost dominates total CPU cycles on most long-running systems.

5.2 Benchmark Performance

We provide two sets of benchmark results. First are the baseline results from running the benchmarks on reference kernels, which are simple operator-kernel implementations designed for readability rather than performance. Second are results for optimized kernels compared with the reference kernels. The optimized versions employ high-performance Arm CMSIS-NN and Cadence libraries (Lai et al., 2018). The results in Figure 6 are for the CPU (Figure 6a) and DSP (Figure 6b). The total run time appears under the "Total Cycles" column, and the run time excluding the interpreter appears under the "Calculation Cycles" column. The difference between them is the interpreter overhead.

Comparing the reference kernel versions to the optimized kernel versions reveals considerable performance improvement. For example, between "VWW Reference" and "VWW Optimized," the CMSIS-NN library offers more than a 4x speedup on the Cortex-M4 microcontroller. Optimization on the Xtensa HiFi Mini DSP offers a 7.7x speedup. For Hotword, the speeds are 25% and 50% better than the baseline reference model because less time goes to the kernel calculations and each inner loop accounts for less time with respect to the total run time of the benchmark model.
(a) Sparkfun Edge (Apollo3 Cortex-M4)

Model                      Total Cycles   Calculation Cycles   Interpreter Overhead
VWW Reference              18,990.8K      18,987.1K            < 0.1%
VWW Optimized              4,857.7K       4,852.9K             < 0.1%
Google Hotword Reference   45.1K          43.7K                3.3%
Google Hotword Optimized   36.4K          34.9K                4.1%

(b) Xtensa HiFi Mini DSP

Model                      Total Cycles   Calculation Cycles   Interpreter Overhead
VWW Reference              387,341.8K     387,330.6K           < 0.1%
VWW Optimized              49,952.3K      49,946.4K            < 0.1%
Google Hotword Reference   990.4K         987.4K               0.3%
Google Hotword Optimized   88.4K          84.6K                4.3%

Figure 6. Performance results for TF Micro target platforms.

The "Interpreter Overhead" column in both Figure 6a and Figure 6b is insignificant compared with the total model run time on both the CPU and DSP. The overhead on the microcontroller CPU (Figure 6a) is less than 0.1% for long-running models, such as VWW. In the case of short-running models such as Google Hotword, the overhead is still minimal at about 3% to 4%. The same general trend holds in Figure 6b for non-CPU architectures like the Xtensa HiFi Mini DSP.

5.3 Memory Overhead

We assess TF Micro's total memory usage. TF Micro's memory usage includes the code size for the interpreter, memory allocator, memory planner, and so on, plus any operators that are required by the model. Hence, the total memory usage varies greatly by model. Large models and models with complex operators (e.g., VWW) consume more memory than their smaller counterparts like Google Hotword. In addition to VWW and Google Hotword, in this section we added an even smaller reference convolution model containing just two convolution layers, a max-pooling layer, a dense layer, and an activation layer to emphasize the differences.

Overall, TF Micro applications have a small footprint. Table 2 shows that for the convolutional and Google Hotword models, the memory consumed is at most 13 KB. For the larger VWW model, the framework consumes 26.5 KB.

Table 2. Memory consumption on Sparkfun Edge.

Model                      Persistent Memory   Nonpersistent Memory   Total Memory
Convolutional Reference    1.29 kB             7.75 kB                9.04 kB
VWW Reference              26.50 kB            55.30 kB               81.79 kB
Google Hotword Reference   12.12 kB            680 bytes              12.80 kB

To further analyze memory usage, recall that TF Micro allocates program memory into two main sections: persistent and nonpersistent. Table 2 reveals that depending on the model characteristics, one section can be larger than the other. The results show that we adjust to the needs of the different models while maintaining a small footprint.

5.4 Benchmarking and Profiling

TF Micro provides a set of benchmarks and profiling APIs (TensorFlow, 2020c) to compare hardware platforms and to let developers measure performance as well as identify opportunities for optimization. Benchmarks provide a consistent and fair way to measure hardware performance. MLPerf is adopting the benchmarks (Mattson et al., 2020; Reddi et al., 2020), and the tinyMLPerf benchmark suite imposes accuracy metrics for them (Banbury et al., 2020).

Although benchmarks measure performance, profiling is necessary to gain useful insights into model behavior. TF Micro has hooks for developers to instrument specific code sections (TensorFlow, 2020d). These hooks allow a TinyML application developer to measure overhead using a general-purpose interpreter rather than a custom neural-network engine for a specific model, and they can examine a model's performance-critical paths. These features allow identification, profiling, and optimization of bottleneck operators.

6 Related Work

Frameworks targeting embedded devices are starting to emerge. Besides frameworks, libraries that attempt to increase performance through optimized calls for MCU-based neural-network acceleration are in development as well. In this section, we discuss several related frameworks from both industry and academia.

Embedded machine learning is still developing, with much headroom for research innovation, development, and deployment. Therefore, rather than make head-on comparisons (which would not be very meaningful at this early stage of evolution), we instead see this as an opportunity to identify
ongoing works that have the potential to mature and enable the broader ecosystem, much like our TF Micro effort.

ELL (Microsoft, 2020). The Embedded Learning Library (ELL) is an open-source library from Microsoft for embedded AI. ELL is a cross-compiler tool chain that enables users to run machine-learning models on resource-constrained platforms, similar to the platforms that we have evaluated.

Graph Lowering (GLOW) (Rotem et al., 2018) is an open-source compiler that accelerates neural-network performance across a range of hardware platforms, both large and small. It initially targeted large machine-learning systems, but NXP recently extended it to focus on Arm Cortex-M MCUs and the Cadence Tensilica HiFi 4 DSPs. GLOW employs optimized kernels from vendor-supported libraries. Unlike TF Micro's flexible interpreter-based solution, GLOW for MCUs is based on ahead-of-time compilation for both floating-point and quantized arithmetic.

STM32Cube.AI (STMicroelectronics, 2020) is the only other widely deployed production framework. It takes models from Keras, TensorFlow Lite, and others to generate code optimized for a range of STM32-series MCUs. It supports both FP32 and quantized models and comes with built-in optimizations to reduce model size. By comparison, TF Micro is more flexible, having been designed to serve a wide range of MCUs beyond the STMicroelectronics ecosystem.

TensorFlow-Native was an experimental Google system that compiled TensorFlow graphs into C++ code. The simplicity of the resulting code allowed porting of the system to many MCU and DSP targets. It lacked quantization support as well as platform-specific optimizations to achieve good performance. As we described previously in Section 3, we firmly believe that it is essential to leverage the existing infrastructure to enable broad adoption of the framework. Leveraging the existing tool chain is also essential to provide strong engineering support for product-level applications that run on many devices in the real world.

TinyEngine (Lin et al., 2020) is an inference engine for MCUs. It is a code-generator-based compiler method that helps eliminate memory overhead. The authors claim it reduces memory usage by 2.7x and boosts inference speed by 22% over their baseline. TF Micro, by contrast, uses an interpreter-based method, and as our experiments show, the interpreter adds insignificant overhead.

TVM (Chen et al., 2018) is an open-source deep-learning compiler for CPUs, GPUs, and machine-learning accelerators. It enables machine-learning engineers to optimize and run computations efficiently on any hardware back end. It has been ported to Arm's Cortex-M7 and other MCUs.

uTensor (uTensor, 2020), a precursor to TF Micro, is a lightweight machine-learning inference framework specifically designed for Arm. It consists of an offline tool that translates a TensorFlow model into C++ machine code, as well as a run time for execution management.

7 Conclusion

TF Micro enables the transfer of deep learning onto embedded systems, significantly broadening the reach of machine learning. TF Micro is a framework that has been specifically engineered to run machine learning effectively and efficiently on embedded devices with only a few kilobytes of memory. The framework fits in tens of kilobytes on microcontrollers and DSPs and can handle many basic models.

TF Micro's fundamental contributions are the design decisions that address the challenges of embedded systems: hardware heterogeneity in the fragmented ecosystem, missing software features, and resource constraints. We support multiple embedded platforms based on the widely deployed Arm Cortex-M series of microcontrollers, as well as other ISAs such as DSP cores from Tensilica. The framework does not require operating-system support, any standard C or C++ libraries, or dynamic memory allocation, features that are commonly taken for granted in non-embedded domains. This allows us to run efficiently on bare-metal targets.

The methods and techniques presented here are a snapshot of the progress made so far. As embedded-system capabilities grow, so will the framework. For example, we are in the process of developing an offline memory planner for more effective memory allocation and fine-grained user control, and we are investigating new approaches to support concurrent execution of ML models. In addition to minimizing memory consumption and improving performance, we are also looking into providing better support for vendor optimizations and build-system support for development environments.

8 Acknowledgements

TF Micro is a community-based open-source project and, as such, rests on the work of many. We extend our gratitude to many individuals, teams, and organizations: Fredrik Knutsson and the CMSIS-NN team; Rod Crawford and Matthew Mattina from Arm; Raj Pawate from Cadence; Erich Plondke and Evgeni Gousef from Qualcomm; Jamie Campbell from Synopsys; Yair Siegel from Ceva; Sai Yelisetty from DSP Group; Zain Asgar from Stanford; Dan Situnayake from Edge Impulse; Neil Tan from the uTensor project; Sarah Sirajuddin, Rajat Monga, Jeff Dean, Andy Selle, Tim Davis, Megan Kacholia, Stella Laurenzo, Benoit Jacob, Dmitry Kalenichenko, Andrew Howard, Aakanksha Chowdhery, and Lawrence Chan from Google; and Radhika Ghosal, Sabrina Neuman, Mark Mazumder, and Colby Banbury from Harvard University.
References

Ambiq Micro. Apollo 3 Blue Datasheet, 2020. URL https://cdn.sparkfun.com/assets/learn_tutorials/9/0/9/Apollo3_Blue_MCU_Data_Sheet_v0_9_1.pdf.

ARM. Arm Enables Custom Instructions for Embedded CPUs, 2019. URL https://www.arm.com/company/news/2019/10/arm-enables-custom-instructions-for-embedded-cpus.

ARM. Mbed, 2020. URL https://os.mbed.com.

Arm. Arm Cortex-M, 2020. URL https://developer.arm.com/ip-products/processors/cortex-m.

Banbury, C. R., Reddi, V. J., Lam, M., Fu, W., Fazel, A., Holleman, J., Huang, X., Hurtado, R., Kanter, D., Lokhmotov, A., et al. Benchmarking TinyML systems: Challenges and direction. arXiv preprint arXiv:2003.04821, 2020.

Cadence. Tensilica HiFi DSP Family, 2020. URL https://ip.cadence.com/uploads/928/TIP_PB_HiFi_DSP_FINAL-pdf.

Chavarriaga, R., Sagha, H., Calatroni, A., Digumarti, S. T., Tröster, G., Millán, J. d. R., and Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013.

Chen, G., Parada, C., and Heigold, G. Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091. IEEE, 2014.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, 2018.

Chollet, F. et al. Keras, 2015. URL https://keras.io/.

Chowdhery, A., Warden, P., Shlens, J., Howard, A., and Rhodes, R. Visual Wake Words dataset. arXiv preprint arXiv:1906.05721, 2019.

Espressif. Espressif ESP32, 2020. URL https://www.espressif.com/en/products/socs/esp32.

Garey, M. R., Graham, R. L., and Ullman, J. D. Worst-case analysis of memory allocation algorithms. In Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, pp. 143–150, 1972.

Goebel, K. et al. NASA PCoE Datasets, 2020. URL https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/.

Gruenstein, A., Alvarez, R., Thornton, C., and Ghodrat, M. A cascade architecture for keyword spotting on mobile devices. arXiv preprint arXiv:1712.03603, 2017.

IC Insights. MCUs Expected to Make Modest Comeback after 2020 Drop, 2020. URL https://www.icinsights.com/news/bulletins/MCUs-Expected-To-Make-Modest-Comeback-After-2020-Drop--/.

Intel. Intel 64 and IA-32 Architectures Software Developer's Manual. Volume 3A: System Programming Guide, Part 1(64), 2013.

Koizumi, Y., Saito, S., Uematsu, H., Harada, N., and Imoto, K. ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 308–312, November 2019. URL https://ieeexplore.ieee.org/document/8937164.

Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

Kumar, A., Goyal, S., and Varma, M. Resource-efficient machine learning in 2 KB RAM for the Internet of Things. In International Conference on Machine Learning, pp. 1935–1944, 2017.

Lai, L., Suda, N., and Chandra, V. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601, 2018.

Lin, J., Chen, W.-M., Lin, Y., Cohn, J., Gan, C., and Han, S. MCUNet: Tiny deep learning on IoT devices. arXiv preprint arXiv:2007.10319, 2020.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.

Martello, S. Chapter 8: Bin packing. In Knapsack Problems: Algorithms and Computer Implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1990.
Mattson, P., Reddi, V. J., Cheng, C., Coleman, C., Diamos, G., Kanter, D., Micikevicius, P., Patterson, D., Schmuelling, G., Tang, H., et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro, 40(2):8–16, 2020.

Microsoft. Embedded Learning Library, 2020. URL https://microsoft.github.io/ELL/.

Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., et al. MLPerf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 446–459. IEEE, 2020.

Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., et al. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907, 2018.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

STMicroelectronics. STM32Cube.AI, 2020. URL https://www.st.com/content/st_com/en/stm32-ann.html.

Susto, G. A., Schirru, A., Pampuri, S., McLoone, S., and Beghi, A. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, 2014.

TensorFlow. TensorFlow Lite FlatBuffer Model, 2020a. URL https://www.tensorflow.org/lite/api_docs/cc/class/tflite/flat-buffer-model.

TensorFlow. TensorFlow Lite Guide, 2020b. URL https://www.tensorflow.org/lite/guide.

TensorFlow. TensorFlow Lite Micro Benchmarks, 2020c. URL https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/benchmarks.

TensorFlow. TensorFlow Lite Micro Profiler, 2020d. URL https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/micro_profiler.cc.

TensorFlow. TensorFlow Core Ops, 2020e. URL https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/ops.pbtxt.

uTensor. uTensor, 2020. URL https://github.com/uTensor/uTensor.

Waterman, A. and Asanovic, K. The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA, document version 20190608-base-ratified. RISC-V Foundation, Tech. Rep., 2019.

Wu, X., Lee, I., Dong, Q., Yang, K., Kim, D., Wang, J., Peng, Y., Zhang, Y., Saliganc, M., Yasuda, M., et al. A 0.04 mm3 16 nW wireless and batteryless sensor system with integrated Cortex-M0+ processor and optical communication for cellular temperature measurement. In 2018 IEEE Symposium on VLSI Circuits, pp. 191–192. IEEE, 2018.

Zhang, M. and Sawchuk, A. A. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1036–1043, 2012.

Zhang, Y., Suda, N., Lai, L., and Chandra, V. Hello Edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.
