
Applicability of TinyML for maintenance predictability

Niklas Exell
Master's thesis in Computer Engineering
Supervisor: Jerker Björkqvist
Åbo Akademi University
Faculty of Science and Engineering
Information Technologies
October, 2023
Abstract

Currently, a diverse set of systems is deployed in which large amounts of data are, or could be, gathered for some kind of analysis, either in real time or retroactively.

One type of analysis that could bring large savings is maintenance prediction based on data gathered from a running machine. In the worst case, it is obvious how detecting a fault in a machine early, in order to prevent a catastrophic failure, can save a lot of money. But there is also value in predicting smaller failures, which might "only" cause downtime: by predicting what kind of maintenance needs to be done, one also gains foresight into what kinds of parts need to be acquired so that the required maintenance can be carried out as smoothly as possible.

Different strategies can be used for this processing, both regarding what kind of processing is done and where it is done.

The processing itself is mostly some form of machine learning: either retroactive analysis, most commonly with some form of classification or clustering algorithm, or a model-based approach in which training data is used to train a model that is then used for inference in real time.

As to where the processing is done, it can either be done centrally, on large servers, or more decentrally on the edge. If the concept of edge computing is taken to the extreme, the computing is done on the microcontrollers that the sensors are connected to. TinyML is the concept of doing machine learning on the edge, on microcontrollers.

This thesis will cover some types of analysis that can be done with machine learning, where this processing should be done, and the possibility of using TinyML to do the bulk of the analysis directly on the microcontrollers that the sensors are connected to. Some examples of potential applications of TinyML, both based on accelerometer data, are also covered.

Keywords:
TinyML, Machine Learning, Embedded Systems, Edge, Decentralised

Contents

1 Preface
2 Introduction
3 Machine learning
  3.1 Supervised Learning
  3.2 Unsupervised Learning
  3.3 Reinforcement Learning
  3.4 Neural Networks
    3.4.1 Training
4 Anomaly Detection
  4.1 Categories of anomaly detection
  4.2 Use cases of anomaly detection
  4.3 Anomaly types
  4.4 Examples of anomaly detection methods
5 Edge Computing
  5.1 Why edge computing?
  5.2 Why not edge computing?
  5.3 Examples of edge computing
  5.4 Microcontrollers on the edge
    5.4.1 Are microcontrollers necessary?
  5.5 Hybrid
  5.6 Environmental impact
6 TinyML
  6.1 What is TinyML?
  6.2 Motivation for TinyML
    6.2.1 Power consumption
    6.2.2 Cost
  6.3 Difference to conventional Machine Learning
  6.4 TensorFlow vs TensorFlow Lite vs TensorFlow Lite Micro
    6.4.1 TensorFlow
    6.4.2 TensorFlow Lite
    6.4.3 TensorFlow Lite Micro
  6.5 Quantization
  6.6 FlatBuffers
  6.7 Adapting models for microcontrollers
  6.8 Computational and hardware need
    6.8.1 Training and deployment
    6.8.2 Deployed
7 TinyML for pattern recognition and maintenance prediction
  7.1 Example of a simple TinyML application: Magic wand
    7.1.1 Performance
    7.1.2 Epilogue
  7.2 Wake up phrase with TinyML
  7.3 Maintenance prediction in marine diesel engines using TinyML
    7.3.1 Collected data
    7.3.2 Analyses
8 Conclusion
9 Summary in Swedish
  9.1 Inledning
  9.2 Maskininlärning
  9.3 Identifiering av avvikelser
  9.4 Kantberäkning
  9.5 Maskininlärning på mikrokontroller - TinyML
  9.6 Analys
  9.7 Sammanfattning
1 Preface

With ever-growing data generation on the edge, processing the data on the edge becomes ever more valuable. Simultaneously, machine learning is growing in popularity. This thesis focuses on the combination of these two in the form of TinyML.

This thesis was written with the guidance of Prof. Jerker Björkqvist, who provided excellent feedback.
2 Introduction

In today's world, with an accelerating number of devices, data is often gathered and later analysed to be used as feedback in some form. Most of this computation is done in the cloud, which means that large amounts of data need to be transported to, stored at, and analysed in a central location. A large amount of computation is therefore needed in one place, large links are needed to send the data to the central location, and a large amount of storage is needed there for all the raw data from all the deployed devices equipped with sensors.

This is why edge computing has emerged, which, as the name implies, does the computing at the edge, as close to where the data is generated as possible. Doing the processing close to where the data is generated also lowers the latency of feedback, and performance can therefore be better. By doing the computation on the edge nodes, cost savings are also achieved by not needing a large amount of central computation or a large link to the central location.

These factors apply especially onboard a ship, where a link to the cloud onshore cannot be taken for granted while out at sea, and use of a high-bandwidth uplink, through Ethernet or WiFi while at port or cellular while close to shore, is limited by lack of proximity to the port/shore.

This thesis will research whether it is possible to predict maintenance needs using machine learning by the use of TinyML, computed on the edge. TinyML enables using ML models on the edge, close to or on the data-gathering embedded devices.

The scope is specifically medium-sized marine diesel engines. Similarly to how a marine engineer onboard a ship can listen to the engines in the engine room, recognize that some sound seems off, and sometimes even pinpoint where the change is coming from, a model should also be able to be trained to detect these changes. The hypothesis of this thesis is that it might be possible to train a model to recognize maintenance needs based on the vibrations gathered by accelerometers.

In this thesis, I will analyse data gathered on a Roll-on/Roll-off ferry that operated in the Baltic Sea between Finland and Sweden. This ship is equipped with four Wärtsilä 12V32 4SA four-stroke diesel engines. Each of the four engines had sensor units with accelerometers attached directly to the engine block and to the engine frameset. The engine room also had two sensor units that were used as a reference as well as to monitor the compute hardware enclosure temperature. All the described sensors were retrofitted.
3 Machine learning

Machine Learning (ML) is a subset of Artificial Intelligence (AI). In Machine Learning, a machine, meaning a computer, "learns" about a phenomenon of interest by using:

• Training data, which consists of as much clear data as possible in which the phenomenon occurs.

• A diverse set of algorithms for analysis, training and inference.

Based on this training data and a Machine Learning algorithm, a model can be created that is able to predict an output based on a given input.

Machine learning can be divided into Supervised, Unsupervised and Reinforcement Learning:

3.1 Supervised Learning

In Supervised Learning (SL) the data is tagged, which means that the algorithm can know from the tags what the desired output is for a set of inputs. When an SL algorithm is fed training data, which consists of input data and the desired output data (the tag), it produces an inferred function which can be used to map previously unseen input data. For this to work well, the algorithm needs to be able to generalize from the data and not simply memorize it exactly, which is called overfitting.

There are multiple algorithms used in SL; some common examples are listed here, followed by a small illustrative sketch:

• Support-vector machines: A supervised learning model that uses classification algorithms to sort data into two groups[1]. Compared to neural networks the computation needed is lower, but the classification also has to be simpler.

• Regression models: Regression is used to describe the relationship between two variables by fitting a line to the observed data. With linear regression, a straight line is fitted to the data, while with both logistic and nonlinear regression a curved line is fitted[2].

• Neural networks are often created with Supervised Learning, meaning labelled training data is used in training (described more below).
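As a minimal illustration of supervised learning (not from the thesis; the data and labels below are made up for the example), the following scikit-learn sketch trains a support-vector machine on tagged examples and uses the inferred function on unseen input:

```python
from sklearn import svm

# Tagged training data: inputs and their desired outputs (the tags).
X_train = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y_train = ["normal", "normal", "anomaly", "anomaly"]

# Fit a support-vector classifier to the labelled data.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

# The inferred function can now map previously unseen input data.
print(clf.predict([[0.95, 0.9]]))  # expected: ['anomaly']
```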

3.2 Unsupervised Learning

In Unsupervised Learning (UL) the data is not tagged.

UL can be broadly classified into probabilistic methods and neural networks:

• The two main types of probabilistic UL are principal component analysis and cluster analysis:

  – Principal component analysis is used to find the vectors that best describe the axes along which the data varies (a small sketch follows this list).

  – In cluster analysis the algorithm, as the name implies, segments the data into datasets with common attributes. Hence cluster analysis is used to group untagged data into clusters with commonalities and can be used to detect anomalies.

• Neural networks (described below).
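As a hedged sketch of principal component analysis (the data here is synthetic and purely illustrative), scikit-learn can extract the dominant axes of untagged data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, untagged data that mostly varies along one diagonal direction.
rng = np.random.default_rng(42)
t = rng.normal(size=200)
data = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# Find the vectors (principal components) that best describe the axes
# along which the data varies.
pca = PCA(n_components=2).fit(data)
print(pca.components_)                # the principal axes
print(pca.explained_variance_ratio_)  # how much variance each axis explains
```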

3.3 Reinforcement Learning

In Reinforcement Learning (RL) the algorithm is given a numerical performance score as guidance toward the desired output. This way the algorithm can achieve an optimal or near-optimal result, in a similar way to what happens in the biological brain with positive reinforcement, such as pleasure or ingestion of food, and negative reinforcement, such as pain or hunger. When the performance of the algorithm is analysed, a notion of so-called regret can be given, which is the difference between the desired outcome and the achieved outcome.

3.4 Neural Networks

Neural Networks (NN) can be both supervised and unsupervised. Neural Networks, sometimes called Artificial Neural Networks (ANN), are computing systems that are loosely inspired by the naturally occurring biological neural networks that exist in the brains of animals.

NNs consist of artificial neurons and edges. An artificial neuron loosely resembles its biological counterpart, the neuron, by processing the received input signal, applying some form of non-linear function to the sum of the inputs (e.g. the sigmoid function), and outputting the new signal to other artificial neurons that may be connected through edges. The signals consist of real numbers. The network is trained by adjusting weights that are associated with the neurons and edges. These weights are multipliers that affect the signal downstream of that neuron or edge. This means that neural networks are weighted graphs.

Neural networks are often arranged in layers, especially in deep learning. If the neurons are connected to all neurons in the layers above and below, the network is called fully connected; however, multiple patterns of connection exist.

The first layer, which receives external input, is called the input layer, and the last layer, which gives output, is called the output layer. Between the input and output layers there can be zero to multiple so-called hidden layers. Single-layer and unlayered networks also exist.

A neural network can be configured to either only feed information forward in the network (feedforward neural network) or have a form of memory of earlier input data (recurrent neural network):

• Feedforward Neural Networks (FNN) are the simplest neural networks since their output is only dependent on a single set of input data. FNNs are set up such that each layer only has connections from the layer directly before it and to the layer directly after it, so that no loops are formed. In other words, the network is a feedforward neural network only if it is a directed acyclic weighted graph. As a consequence, data only flows one way in an FNN.

• Recurrent Neural Networks (RNN), on the other hand, are set up in such a way that connections can form loops. This means that the output of an RNN is dependent on multiple successive sets of input data. An RNN can therefore have internal memory and can either process variable-length input data or more easily process consecutive data sequences where the data is only valid in the correct order, e.g. speech recognition.

  The most common RNN type is the Long Short-Term Memory (LSTM) network, which is an RNN with the addition of long-term memory in the form of an internal state where context relating to the current data sequence can be stored. This long-term memory is used to help with the vanishing gradient problem, though LSTMs can also suffer from it. The vanishing gradient problem occurs because, over many time steps, the gradients carrying the influence of previous input data tend toward zero (or, in the related exploding gradient problem, toward infinity), so the "memory" of earlier inputs is lost. Examples of uses of LSTM networks are speech recognition and machine translation.
3.4.1 Training

When an Artificial Neural Network (ANN) is trained, the term backpropagation is often used. Backpropagation is an algorithm used for the backwards propagation of errors using gradient descent, and it is used to adjust the weight values of the network. This is done backwards, starting from the outputs, since the desired outputs are known and errors can be calculated by comparing the current value and the desired value. Gradient descent is then used to adjust the weights such that the errors are minimized. This is done for each weight, and the new weights are applied at the end of a learning iteration. When training an ANN, multiple learning iterations are used to increase performance.
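As a compact statement of the update rule (notation mine, not from the source): each weight $w_{ij}$ is adjusted against the gradient of the error $E$, with the step size given by the learning rate $\eta$,

$$w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}.$$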
Before training is started, the desired type of network needs to be chosen. Depending on the data, the number of inputs and outputs also needs to be chosen before training starts.

When a neural network is trained, some static hyperparameters need to be set before the training is started. Some relate to the network structure, and some to the training algorithm.
Some examples of hyperparameters are[3] (a small training sketch follows this list):

• Network structure:

  – Number of hidden layers and neurons:
    As described above, a NN can have differing numbers of layers and neurons in those layers, which need to be chosen before training starts. Having a network that is too small will cause underfitting.

  – Dropout:
    Dropout is a technique used to avoid overfitting. With dropout, randomly selected neurons are ignored during each training update, which prevents the network from becoming overly reliant on individual neurons and helps it generalize better.

  – Network weight initialization:
    A uniform distribution of starting weights is most commonly used, but sometimes some other scheme might be helpful.

  – Activation function:
    Activation functions are used to introduce non-linearity into the model. Most commonly the rectifier (ReLU) activation function is used; other examples are sigmoid and softmax.

• Training algorithm:

  – Learning rate:
    When the network is trained, a learning rate is set which determines the size of the step taken to adjust the model. A large learning rate shortens the training time, at the expense of the precision of the final model. An adaptive learning rate can be applied in order to decrease training time, increase precision and avoid oscillations of the weights.

Figure 1: Learning rate

  – Momentum:
    Momentum helps to avoid oscillations by taking the direction of the previous step into account when choosing the next step.

  – Number of epochs:
    The number of times the full training data is shown during the training.

  – Batch size:
    The number of samples given to the network before a weight update happens.
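To make these hyperparameters concrete, here is a minimal Keras sketch (layer sizes, rates and counts are illustrative assumptions, not values from the thesis) showing where each one is set:

```python
import tensorflow as tf

# Network structure: hidden layers/neurons, dropout, activation functions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dropout(0.2),                    # dropout rate
    tf.keras.layers.Dense(3, activation="softmax"),  # output layer
])

# Training algorithm: learning rate and momentum live in the optimizer.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Number of epochs and batch size are passed to fit():
# model.fit(x_train, y_train, epochs=20, batch_size=32)
```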

4 Anomaly Detection

When doing maintenance prediction, anomalies can be good predictors of parts starting to fail, of sub-optimal running, and of maintenance being needed.

As IBM[4] states: "Anomaly detection is a process in machine learning that identifies data points, events, and observations that deviate from a data set's normal behaviour. And, detecting anomalies from time series data is a pain point that is critical to address for industrial applications." Anomaly detection is therefore of large interest, since it helps solve many problems.

4.1 Categories of anomaly detection

As with Machine Learning in general, anomaly detection can also be done in three ways:

• Supervised anomaly detection: As with supervised machine learning, supervised anomaly detection requires labelling of all the data, but here the data is labelled "normal" or "anomaly". Supervised anomaly detection is barely used, due to the need for labelling, which rarely exists, and the challenge of labelling anomalies, since by their nature anomalies can be chaotic.

• Semi-supervised anomaly detection: With semi-supervised anomaly detection, again as with ML, only some of the data is labelled. This makes the labelling easier, since the data points that are known to be normal or anomalous can be labelled as such and the rest left blank. Sometimes a model can be built of what is known to be normal data, so that going forward a prediction can be made as to whether a data point is normal or anomalous.

  Semi-supervised anomaly detection is more common than supervised anomaly detection since it requires less labelling. But labelled anomalies are rare, so there might not be enough data to generalize the anomalies, and there might be anomalies that stem from different phenomena than the captured and labelled anomalies.

• Unsupervised anomaly detection: As with unsupervised machine learning, in unsupervised anomaly detection the classification is not based on labelled data used to build a model, but simply on algorithms that detect data that stands out from the norm.

  Unsupervised methods are the most common since labelling the data is so expensive.
4.2 Use cases of anomaly detection

Anomaly detection can be applied to many things. Some examples, as given in[4], are:

• Outlier detection: Outlier detection is used to detect any outliers, or data that varies largely in range from the normal operating range or state of the system within the training data. All of the data is analyzed to find outliers.

• Novelty detection: Novelty detection is done by training with data that is known to have been gathered during normal operation, without anomalies. The goal then is to analyze testing data to see whether there is any novel behaviour and, if so, to label the data as an anomaly or novelty. To do this, some data that is known to be normal is needed; this is a form of semi-supervised anomaly detection.

• Event extraction: Event extraction can be done if the data operates in different states; for example, a sensor could be on or off. The goal is to detect when sensors behave differently during an event.

• Data cleaning: Data cleaning can be done to remove outliers or sudden infrequent changes in the distribution of data. This is often done before the data is needed for something that benefits from cleaner data.

4.3 Anomaly types

When trying to find anomalies, there are different types of anomalies that can occur in data, as given in[5]:

• Point anomaly: If an anomaly is represented by a single data point that deviates from the rest of the data.

Figure 2: Point anomaly, source:[5]

• Collective anomaly: If an anomaly is represented by multiple data points together.

Figure 3: Collective anomaly, source:[5]

• Contextual anomaly: If an anomaly is represented by a data point in context to other data points. By itself the data point might be normal, but in context it might be anomalous.

Figure 4: Contextual anomaly, source:[5]

4.4 Examples of anomaly detection methods

• K-Means clustering:
  With K-Means clustering, the data, consisting of n observations, is divided into k clusters. The observations are assigned to clusters based on the smallest squared Euclidean distance to the centroid of the cluster; at the start, however, the cluster centroids are just random seed values.

  Although K-Means clustering seems simple and effective, it does have drawbacks:

  – The number of clusters (k) needs to be chosen, though the elbow method is of great assistance for this. In the elbow method, the within-cluster sum of squares is plotted against the number of clusters, and then, by looking at the "elbow" in the graph, a suitable value of k can be read off. This of course requires running the algorithm iteratively while incrementing k.

  – It only works if clusters belong together due to proximity.

  – Finding the optimal clustering is computationally hard (NP-hard in general). However, heuristics can be used that converge quickly to an approximate result.

• Local Outlier Factor[5]:
  Local Outlier Factor (LOF) is a density-based method for finding anomalies. For each observation, the nearest neighbours are calculated. Then, within the computed neighbourhood, the local density is computed as the Local Reachability Density (LRD). Finally, the LOF score is calculated by comparing the LRD of an observation with the LRDs of its nearest neighbours (a small sketch of both methods follows).
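As a hedged illustration of both methods (the data is synthetic; the cluster count, neighbour count and outlier threshold are assumptions for the example, not values from the thesis):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

# Synthetic observations: mostly normal points plus a few injected outliers.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(500, 2)),
                  rng.normal(6, 0.5, size=(5, 2))])

# K-Means: observations far from their assigned centroid are suspicious.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
dist = np.linalg.norm(data - km.cluster_centers_[km.labels_], axis=1)
km_flags = dist > np.quantile(dist, 0.99)

# Local Outlier Factor: density-based comparison against nearest neighbours.
lof = LocalOutlierFactor(n_neighbors=20)
lof_flags = lof.fit_predict(data) == -1   # -1 marks outliers

print(km_flags.sum(), lof_flags.sum())
```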

Figure 5: Anomaly detection algorithms to choose from, source:[5]

5 Edge Computing

Edge computing at its simplest is putting the computation as close to the edge, where the data is created and used, as possible, or as IBM states:

"Edge computing is a distributed computing framework that brings enterprise applications closer to data sources such as IoT devices or local edge servers. This proximity to data at its source can deliver strong business benefits, including faster insights, improved response times and better bandwidth availability."[6]

5.1 Why edge computing?

When utilizing edge computing, compared to conventional centralized computing (the cloud), a majority of the processing is done close to the source of the data and to the actuators that might need instructions based on the data. This means that the data does not need to be sent to a centralized server, which would add delay and need comparably larger network throughput. Historically this has not been much of a problem, since the number of remote nodes was not that high. However, deployments are growing all the time as cheaper sensor devices become available, which makes more large-scale projects financially feasible, and these can create enormous amounts of data. Intel estimates: "It's estimated that by 2025, 75 percent of data will be created outside of central data centres, where most processing takes place today"[7]. Edge computing is especially beneficial when nodes are deployed in remote places with limited connectivity, such as ships out at sea.

The origins of edge computing can be seen in content delivery networks (CDNs), which were created in the late 1990s, predating edge computing. CDNs are implemented by deploying some of the servers closer to the users in order to reduce latency and the amount of data sent over large distances.

5.2 Why not edge computing?

When a deployed system does not generate significant amounts of data and the uplinks from the nodes are sufficient, a centralized computing system might be preferable for its simplicity, given the large amount of redundancy in many edge computing systems.

5.3 Examples of edge computing

• Autonomous vehicles:
  Autonomous vehicles are probably one of the most talked-about applications. In self-driving cars, decisions need to be made extremely quickly and no delay is tolerated. Hence, the processing needs to be done on the edge, in the car itself, because if mistakes happen people can be seriously injured, if not killed, due to large masses moving at high speeds close to each other.

  Examples of self-driving systems on the OEM side are Tesla's Autopilot and, on the aftermarket side, Comma.ai's Openpilot. While the implementations of these systems vary, with Tesla using more sensors beyond cameras and Comma focusing on just cameras since humans only need sight to drive, they both rely on edge computing to make decisions in time.

• Computer vision:
  For the application of computer vision, edge computing is quite clearly the way to go, due to the large amount of data gathered by the camera sensor(s), especially with moving objects at high resolution. By processing the images at the point of capture, only data that has been interpreted needs to be sent forward, and if actuation is needed it can be done immediately once the processing is done. Examples where computer vision is used are input for autonomous vehicles and barcode readers.

• Industrial automation and control:
  Similarly to how autonomous vehicles need quick response times, industrial automation and control need good response times for decent precision, while vast processing capacity is not needed. An example of this is Programmable Logic Controllers (PLCs), which are used for a large portion of industrial automation due to their ability to be reconfigured more easily compared to the conventional circuits they have largely replaced.

5.4 Microcontrollers on the edge

Since the advantage of edge computing is not needing to send much data forward, if any, because the processing is done at the edge, doing the processing on the devices that the sensors are connected to would be beneficial. This means that most, if not all, of the processing is run on the microcontrollers, and barely any data needs to be sent compared to the amount of raw data gathered from the sensors. The decreased ratio is of course highly dependent on the type of data that is gathered and the processing that is done on the data. Often a microcontroller is needed anyway to connect the sensors, so doing processing on it might be beneficial. The microcontroller of course needs to have enough address space and computational power to sufficiently do the wanted computations, as well as both gather the sensor data and sustain communications to other nodes in the system.

5.4.1 Are microcontrollers necessary?

When looking at edge computing, a microcontroller can look very tempting since it establishes the computation directly next to the sensors. But there are of course pros and cons to using microcontrollers compared to doing the processing further downstream on a system with more performance and a full operating system:

Pros:

• Price: Microcontrollers are significantly cheaper than fully-fledged systems with an OS and all, especially if there is one node for each sensor location. It might also be possible to utilize a microcontroller that is already needed in order to connect a sensor to the system.

• Scale up well: With microcontrollers, once there is an implementation in place it is quite easy to scale up the system if most of the computation is done on the deployed microcontroller devices. If the computation is done more centrally, the processing infrastructure also needs to grow at a sufficient pace such that the computational capacity is sufficient at all times for the data throughput.

• Power consumption: Power consumption is a hard metric to compare and could also be a con. However, when comparing processing on microcontrollers on the edge versus on large servers more centrally, doing it more centrally has more overhead, which most likely means a larger net power consumption. Some overheads are:

  – More data needs to be sent.

  – The servers need to run operating systems, which have some overhead compared to microcontrollers running the code more directly (though microcontrollers can also sometimes have an OS).

  However, in some systems where the edge nodes have a strict power budget, for example due to running on battery and/or photovoltaic cells, it might be better to not do any processing on these nodes and instead send the data to nodes that do not have a strict power budget. Though sending large amounts of data can also need a considerable amount of power, so some pre-processing might be necessary to achieve the lowest power consumption possible.

• Size: A single microcontroller is much smaller than a fully-fledged system, and in many cases the microcontroller and the sensors can be implemented on the same printed circuit board. In some cases this might also make the system mechanically more reliable, since the need for connectors could be reduced.

Cons:

• Development time: It is harder and takes longer to implement everything on microcontrollers versus doing some parts on generic hardware.

• Address space: Larger data sets can be loaded on servers, and therefore a larger context could be used in the computation.

• Performance: More complex computations can be done on servers since they have a larger amount of computational power.

• Centralized implementation: Sometimes it might be better to use an implementation that, for example, relies on a single high-performance system using computer vision to monitor something, instead of a fleet of sensors attached to microcontrollers scattered around.

5.5 Hybrid

When doing computation on gathered data, the approach used does of course not have to be 100% edge computing or 100% centralised. Usually a better approach might be to do some simple pre-processing on the edge that can drastically condense the data that needs to be sent, while only needing a small amount of computational performance on the edge nodes. Then, on a central server, the condensed data from multiple nodes can be further processed with the advantage of a wider context, by using data from multiple nodes. How much processing is done where varies considerably with the implementation at hand.

5.6 Environmental impact

The environmental impact of implementing a system, whether using microcontrollers or a larger computational device, is of course highly dependent on what problem is being solved and the way the problem at hand is approached. If the problem is simply connecting a number of spread-out sensors and doing some processing of the data from those sensors, using microcontrollers for each node of course has a smaller environmental impact, as well as cost, compared to a larger system.

The comparison becomes more complicated if different approaches are used: for example, doing object tracking by using microcontrollers attached to each tracked object, compared to using a single larger system that has the performance to do the task with computer vision. So, keeping this in mind, either can be the better choice depending on the application and implementation.
6 TinyML

In this chapter, I will paraphrase significantly from the book "TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers" by Pete Warden & Daniel Situnayake[8].

6.1 What is TinyML?

TinyML is defined as running a neural network at an energy cost of less than 1 mW[8]. Warden clarifies this seemingly arbitrary number as what is needed to achieve a battery life of one year on a coin battery.

6.2 Motivation for TinyML

The two main motivations for using TinyML are power consumption and cost.

6.2.1 Power consumption

As a comparison, the extremely popular and excellent Raspberry Pi boards use on the order of hundreds of milliwatts, and x86 processors in laptops and desktops use between a few watts and a few hundred watts. Both are far too much for long battery life in the field. Microcontrollers, in contrast, use on the order of milliwatts, though this of course depends largely on the duty cycle and clock speed of the implementation. A large factor, however, is what kind of peripherals are connected, as these use power too. Communication radios especially can use a significant amount of energy if used frequently, especially at longer ranges due to the inverse square law.

To compare this to power sources:

• A CR2032 coin battery might hold 2500 J (about one month at 1 mW).

• An AA battery might hold 15000 J (about six months at 1 mW).

• Harvesting temperature differences from industrial machines might generate 1 to 10 mW per square centimetre.

• Indoor photovoltaic cells might generate 1 µW per square centimetre.

• Outdoor photovoltaic cells might generate 1 mW per square centimetre.

So when it comes to self-powered devices, only outdoor photovoltaics and industrial temperature-difference harvesting are currently viable.
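As a quick sanity check of the coin-battery figure (arithmetic mine):

$$t = \frac{E}{P} = \frac{2500\,\mathrm{J}}{1\,\mathrm{mW}} = 2.5\times10^{6}\,\mathrm{s} \approx 29\ \mathrm{days},$$

which matches the quoted one month; correspondingly, $15000\,\mathrm{J}$ at $1\,\mathrm{mW}$ gives roughly six months.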

6.2.2 Cost

When comparing microcontrollers to systems intended to run an operating system as well as the needed application(s), it is noticeable that microcontrollers are much cheaper. As a comparison, the cheapest Raspberry Pi, the Raspberry Pi Zero, is about 5€ and can be used as a server. More typically, however, a larger x86 server is used, which can cost anywhere from 1000€ to 100 000€. When comparing this to 32-bit microcontrollers that are available for much less than 1€, the difference is significant if large deployments are needed. Though the comparison is not straightforward, since a server can do the processing of the data from a large array of sensor-gathering nodes.

These same microcontrollers also benefit, and will continue to benefit, from traditional analogue and electromechanical control circuits being replaced with software-defined alternatives on microcontrollers. As more microcontrollers are produced for devices installed on the edge, this will continue to bring down the price further, as well as increase flexibility.

6.3 Difference to conventional Machine Learning

TinyML differs from conventional ML in a few ways, since it runs on microcontrollers:

• As stated earlier, power consumption is kept low in order to make possible deployments that are self-sufficient in terms of power for long periods (months/years).

• An embedded 32-bit chip means that little RAM is available, a few hundred kilobytes, which means that models have to be kept small.

• No full Linux, since a memory controller and about 1 MB of RAM would be needed, which is not available on a microcontroller. By not having an OS, the system does not have any other processes running than the one being developed, meaning the system is simpler to understand.

• Dynamic memory allocation is often avoided, since it is not needed, and not using it increases reliability and makes the implementation more deterministic.

• Training is not done on the device but on a server/workstation; the model is then quantized and loaded onto the device. This is done because training on the device is simply not possible due to the hardware constraints.

• Weights are often quantized to 8-bit integers after training, before being loaded onto the device, because floating-point arithmetic is not guaranteed on microcontrollers. By going from 32-bit floating point to 8-bit integers, precision is obviously lost. However, training requires the largest dynamic range, and it is still done with 32-bit floating point, so no precision is lost there.

6.4 TensorFlow vs TensorFlow Lite vs TensorFlow Lite Micro

6.4.1 TensorFlow

As Warden and Situnayake compactly describe it: "TensorFlow is Google's open source machine learning library, with the motto 'An Open Source Machine Learning Framework for Everyone.' It was developed internally at Google and first released to the public in 2015."[8]

6.4.2 TensorFlow Lite

However, TensorFlow is aimed at servers, where there are gigabytes of RAM and terabytes of storage. This is why Google developed TensorFlow Lite, with lower size requirements, in order to easily run neural networks on mobile devices. In order to decrease the size, some features were cut:

• No training can be done with TensorFlow Lite. All training of models needs to be done on (a) server(s)/desktop(s) in the fully-fledged TensorFlow.

• It does not support all data types, for example double-precision floating point.

• Less-used operations are dropped.

Because of this, TensorFlow Lite can fit into a few hundred kilobytes, which makes it able to fit into size-constrained applications. TensorFlow Lite also has good support for 8-bit quantization of networks. Comparing the size of 8-bit versus 32-bit values, a 75% saving in space can be achieved, assuming a dense mapping of 8-bit values is supported.
6.4.3 TensorFlow Lite Micro

While TensorFlow Lite, with its constraints, is good and compact enough for mobile devices, microcontrollers have even tighter constraints, and that is what TensorFlow Lite Micro was created for. When the Google team started creating TensorFlow Lite Micro, they knew that they would have a number of constraints when running it on microcontrollers:

• No operating system dependencies
  Some of the platforms they were aiming for don't have operating systems at all, and machine learning algorithms are fundamentally just mathematical calculations, so avoiding OS dependencies enables broad compatibility.

• No standard C or C++ library dependencies at linker time
  Since the devices aimed for with TensorFlow Lite Micro have very limited memory, and even seemingly simple functions can take up relatively large amounts of memory, link-time parts of the libraries were cut to keep the framework lean. An important exception to this is the C math library, which is obviously valuable for all the calculations done.

• No floating-point hardware expected
  As many embedded platforms don't have floating-point arithmetic hardware, the implementation cannot rely on floating point in performance-critical calculations and instead needs to focus on 8-bit integer arithmetic (however, there is also support for floating point if needed).

• No dynamic memory allocation
  In many implementations, microcontrollers need to continuously run without rebooting for extended periods of time, months if not years. Using dynamic memory allocation can cause memory fragmentation and is therefore not reliable. While developing with dynamically allocated memory, it is also significantly harder to know exactly how much memory is used, and a shortage might not cause a problem until after testing. Therefore TensorFlow Lite Micro is implemented using a fixed-size arena specified at initialization. If the specified arena is too small, an error is returned immediately, and the arena will need to be enlarged and the application recompiled.

• Requires C++11
  TensorFlow Lite Micro was written in C++11 in order to keep consistency with TensorFlow Lite and to avoid having to rewrite it from scratch. So the team decided to trade support for older devices for sharing code with TensorFlow Lite.

• It expects 32-bit processors
  Similarly to the last point, the development team decided to keep consistency with TensorFlow Lite by expecting 32-bit processors.

When the team developing TensorFlow Lite Micro was deciding how to implement the model on microcontrollers, they compared the advantages and disadvantages of an interpreted model and code generation.

Interpreted model:
With an interpreted model, the model is loaded into data structures that define the model and that are separate from the executed code, which is static.

Code generation:
With code generation, the model is converted into C or C++ code, with parameters stored as data arrays and the architecture expressed as a series of function calls. The generated code often comprises a single large file with a few entry points, which can be included with the other code needed and then compiled.

Here are some key advantages of code generation:

• Ease of build:
  Since the model is defined directly in code without dependencies, it is easy to integrate: it can just be copied into the code base and then compiled together with the rest of the needed code.

• Modifiability:
  Since all the code is in a single file and without dependencies, it is easy to modify it without needing to know and find what parts of libraries are included.

• Inline data:
  Since the model is implemented in source code, no additional files are needed and therefore no loading or parsing is needed.

• Code size:
  If the platform and model are known, only needed code has to be included, keeping the size down.

Disadvantages of code generation:

• Upgradability:
  If you have locally modified the code and you then want to upgrade to a newer version of the framework, it might entail a significant amount of work to patch your changes and the updated framework together.

• Multiple models:
  Having multiple models with code generation also means that there will be a significant amount of source duplication, which can be harder to support.

• Replacing models:
  Since the model is expressed as generated source code, each time the model needs to be changed a recompilation is needed.

However, the team realized that many of the advantages of code generation can be had by using project generation instead.

Project generation:
In TensorFlow Lite, project generation creates copies of only the files needed to build a model, and optionally sets up IDE-specific project files so that they can be built easily. Project generation retains most of the advantages of code generation but also adds some:

• Upgradability: All source files are copies of the original and kept in the same place in the folder hierarchy. This means that if local changes are made, upstream upgrades can be merged using standard merge tools.

• Multiple and replacement models: As the underlying code is an interpreter, the models can be swapped out without compilation, or multiple models can be used.

• Inline data: Model parameters can still be compiled into the program if needed, so no unpacking or parsing is needed. This is done using the FlatBuffer serialization format.

• External dependencies: All the required header and source files are copied into the project, so no dependencies need to be separately downloaded and installed.

The largest advantage that does not come automatically is the code size, since the interpreter structure makes it hard to know which code paths will never be called. In TensorFlow Lite this can be resolved using the OpResolver mechanism to register only the kernel implementations expected to be used in the application.
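To make the interpreted-model approach concrete, the following sketch uses the desktop Python TensorFlow Lite interpreter (TensorFlow Lite Micro follows the same pattern in C++); the model file name is an assumption:

```python
import numpy as np
import tensorflow as tf

# Load the serialized model into the interpreter's data structures; the
# interpreter code itself is static and works for any compatible model.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one (here all-zero) input tensor and run inference.
sample = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)
```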

6.5 Quantization

Microcontroller hardware is better suited for integer calculations. To be able to use a model that has been trained with floating-point numbers, which is necessary because the values fluctuate greatly during training, the model needs to be quantized from floating-point numbers to integers before deploying it to a microcontroller.

A reduction from 32-bit floating point to 8-bit integers also gives a 75% reduction in the storage needed for the completed model, while not having a noticeable impact on the accuracy of inference.

Another benefit of using 8-bit integers is that many signal processing algorithms also use 8-bit integer multiply-and-accumulate instructions, which means that the same hardware can be utilized for TinyML.

Running a fully quantized model is also more efficient, which gives better latency on almost all devices.

As quantization is an active research field, there are many opinions on how it should be done. For weights, it is somewhat easy, since the range is known for each layer after the training process. However, it is trickier for activations, since the range of the output is not known. If too small a range is used, there will be clipping at the maximum and/or minimum, and if too large a range is used, the accuracy will suffer.

When using TensorFlow and TensorFlow Lite, the quantization is done at the same time as the model is converted from a TensorFlow training environment to a TensorFlow Lite graph. The two types of quantization done when converting to a TensorFlow Lite graph are:

• Post-training weight quantization is the most accessible type of quantization. It quantizes only the weights to 8-bit integers, which reduces the size of the model by 75% compared to 32-bit floating point, but leaves the activation layers as floating-point numbers. This means that the device the model is deployed on would still need floating-point support, which can be rare for embedded devices. On the other hand, this quantization is easier to do, since it does not require knowledge of the activation layers.

• Post-training integer quantization is used to create a model that only contains integers. This means that no floating-point hardware is needed, which is desirable since floating-point hardware can be rare. However, when doing the quantization, context about the ranges of the input needs to be supplied, in the form of example inputs that the model could expect to receive while deployed. Having the right range will result in greater accuracy without clipping.
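A minimal sketch of post-training integer quantization with the TensorFlow Lite converter (the saved-model path, input shape and calibration data are assumptions for illustration):

```python
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("trained_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Example inputs supply the context needed to pick activation ranges;
# in practice these should be drawn from real recorded sensor data.
def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 128, 3).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```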

6.6 FlatBuffers

In order to have efficient storage of the model, FlatBuffers are used: "FlatBuffers is an efficient cross-platform serialization library for C++, C#, C, Go, Java, Kotlin, JavaScript, Lobster, Lua, TypeScript, PHP, Python, Rust and Swift. It was originally created at Google for game development and other performance-critical applications."

FlatBuffers are well described by Warden et al.[8] and in the white paper by Google[9], and here are the main points borrowed from there:

• Designed for performance-critical applications, so it works well for embedded systems.

• The serialized form and the runtime in-memory representation are exactly the same, so no parsing or copying is needed.

• With the help of schemas, the FlatBuffer compiler creates native code (C, C++, Python, Java...).

The motivation for using FlatBuffers is to avoid the need to unpack/parse data[10]. "A FlatBuffer is a binary buffer containing nested objects (structs, tables, vectors,..) organized using offsets so that the data can be traversed in place just like any pointer-based data structure. Unlike most in-memory data structures, however, it uses strict rules of alignment and endianness (always little) to ensure these buffers are cross-platform. Additionally, for objects that are tables, FlatBuffers provides forwards/backwards compatibility and general optionality of fields, to support most forms of format evolution."[9] FlatBuffers are generated with the help of a schema, which describes the object types and which is used to compile efficient code for data access.
6.7 Adapting models for microcontrollers

In order to run the models on microcontrollers, the models need to be adapted for them. This is done with a converter that takes a trained model from Python and creates a TensorFlow Lite file. However, there are some things to consider:

• Quantization: as mentioned earlier, by quantizing the weights and activation layers to integers, floating-point hardware, which is not present on many embedded devices, is eliminated as a requirement to run the model. It also makes the model more efficient, both in storage and computation.

• All values that need to be variables during the training process, such as weights, need to be turned into constants.

• Features only needed for training need to be removed.

• While the models are trained on desktops/servers, they can easily become dependent on features of the desktop environment that are not supported on microcontrollers, such as snippets of Python code or advanced operations. This needs to be resolved before deploying onto microcontrollers.

• Most microcontrollers also don't have a filesystem, so the model needs to be compiled into the executable. This also means that every time the model needs to be updated, the executable needs to be recompiled and re-uploaded to the microcontroller, meaning over-the-air updates are somewhat harder. (A small sketch of embedding a model this way follows this list.)

• FlatBuffers are used so that the model data can be loaded into memory without the need to unpack or parse it. A FlatBuffer is exactly the same in memory as in its serialized form; this means that the model can be accessed directly from flash memory without needing to copy or parse it into RAM.
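As mentioned in the list above, a converted model is typically embedded as a byte array in the firmware because most microcontrollers lack a filesystem (the TensorFlow Lite Micro documentation uses `xxd -i` for this step). A hypothetical Python equivalent, with made-up file and symbol names:

```python
def tflite_to_c_array(path="model.tflite", name="g_model"):
    """Emit a C source snippet that embeds the model bytes in the binary."""
    data = open(path, "rb").read()
    body = ", ".join(str(b) for b in data)
    return (f"const unsigned char {name}[] = {{ {body} }};\n"
            f"const unsigned int {name}_len = {len(data)};\n")

with open("model_data.cc", "w") as f:
    f.write(tflite_to_c_array())
```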

6.8 Computational and hardware need

6.8.1 Training and deployment

When doing machine learning on performance-constrained devices, if a model-based approach is used, a model of course needs to be trained. While inference with the completed model does not require much computation, creating it does. So, in the same way as with conventional ML, the computationally intense training is completed on a workstation or server, which is also able to use larger training data sets since it can have vast amounts of RAM and non-volatile storage.

A decision also needs to be made on whether a specific set of hardware is to be used for all devices or whether generalized hardware is to be used. That is, will the system only use a defined type of microcontroller and sensors, or will the system be able to use differing hardware? In the latter case, some normalization needs to be applied to both of the following (a small sketch follows the list):

• The magnitude of the data collected. For example, two different accelerometers might output the same physical acceleration in different magnitudes and data types digitally.

• The sample rate. In order to consistently detect phenomena, the sample rate needs to be consistent, meaning that the time axis is not scaled depending on what sensor is used. This way, when a phenomenon happens, it is captured in the same number of samples independent of what hardware it was captured on.
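A hedged sketch of such normalization (the function, rates and full-scale value are illustrative assumptions, not from the thesis):

```python
import numpy as np
from scipy.signal import resample_poly

def normalize_window(samples, in_rate_hz, out_rate_hz=25, full_scale_g=4.0):
    """Bring a raw accelerometer window from any supported sensor to a
    common sample rate and magnitude scale before feeding the model."""
    # Resample so a phenomenon spans the same number of samples
    # regardless of the sensor's native rate (rates must be integers).
    resampled = resample_poly(samples, out_rate_hz, in_rate_hz, axis=0)
    # Scale raw readings to a common unit, assuming full_scale_g is the
    # sensor's configured measurement range in g.
    return np.asarray(resampled) / full_scale_g
```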

If, in the future, a performance increase of the model is desired based on more collected data, the model in most cases needs to be retrained on more performant hardware. The way a model is loaded onto deployed hardware might also limit how and when the model can be updated. If the model can only be loaded onto the devices by flashing, someone physically needs to be able to reach each device and flash it, which might not be physically or logistically possible due to the locations, as well as the number, of the devices that are deployed. To mitigate this, over-the-air (OTA) updates can be used; however, the system will then require both:

• Some form of communication hardware on the deployed devices, as well as a network in place to support the devices. This might already be a requirement if connectivity to the remote devices is needed for other purposes, for example remote configuration or real-time feedback.

• A device, as well as an implementation, that supports downloading and upgrading the model while deployed. This can be done in two ways:

  – Only keeping the updated model in volatile memory and simply switching to it once downloaded, with the risk of losing the model if power is lost and needing to fetch the model again on bootup from a server. This will of course increase recovery time after a power loss, and might not even be possible if the network connectivity is only intermittent.

  – Writing the model to non-volatile memory. For this, the device needs to have non-volatile memory that can be written to during runtime.

6.8.2 Deployed

Many of the hardware requirements of the devices used with TinyML come from where the system will be deployed, for how long, and what resources are available.

So, when looking at the constraints, there are two opposing ones: power consumption versus computational power and address space.

Power consumption: When it comes to the power used, obviously, the less that is used, the longer a device can be deployed with a set amount of energy stored or generated. As Warden[8] states, an energy cost of 1 mW or below can make many new applications possible. As stated earlier, this forces us to use microcontrollers. Since the overhead in computation and storage to run even a light operating system could consume more power than is available, not running one is often a must. Also, the energy-saving sleep of the device is significantly more complicated when an OS is involved.

Computation and address space: As mentioned earlier, TinyML needs a 32-bit processor to run. This means that while the focus is on embedded low-power devices, not just any microcontroller will do, since decent computational power, as well as enough address space to fit and run the model and the supporting framework, is still needed. The exact need of an application is obviously hard to judge before it is implemented in code and, when a model-based approach is used, a model is trained. Once these are known, the hardware can easily be optimized. This approach is of course not very flexible if no computational/address-space overhead is reserved for future improvements and additions, though unused future-proofing is an unwanted expense.

Things that need to fit into the address space:

• Operating system size: If some feature of an OS is a requirement, the size of the OS, in the configuration required to include the needed features, needs to be included.

• TensorFlow Lite Micro code size: In order to run the model, the TensorFlow Lite Micro code of course needs to be included, so that the neural network and the operators implemented in the model can be used. TensorFlow Lite Micro is designed to work with as little as 20 KB of flash and 4 KB of SRAM in some applications.

• Model data size: The size of the model is very application dependent; the model needs to be large enough to be able to generalize the phenomena at hand.

• Application code size: The application code, again, is very application dependent (as the name implies) and varies depending on the logic needed for the application.

Example of size: As an example of the total size of all the pieces that are needed for a TinyML application to run, I compiled the Magic Wand example in the TinyML GitHub[11]. The next section will describe this example more. With the Arduino NANO 33 BLE as a target, the compiled size is 172720 bytes, which in this case is 17% of the available storage.

Figure 6: Output from Arduino IDE after compilation of the magic wand
example from the TinyML GitHub[11].

7 TinyML for pattern recognition and maintenance prediction

7.1 Example of a simple TinyML application: Magic wand

This example is taken from the TFLite-Micro GitHub repository[11]. It is also covered in the book "TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers" by Pete Warden and Daniel Situnayake[8].

In this example, a "magic wand" is built, meaning a microcontroller that is able to detect "spells" that are cast, by recognizing, with the help of TinyML on accelerometer data, gestures that are made. The three gestures that are recognized are the "wing", the "ring" and the "slope".

Figure 7: What the gestures look like

The application can be divided into a few components:

• The main loop:


Runs all the components in order.

• Accelerometer handler:
Reads the values of the accelerometer in a way applicable to the hard-
ware in use and writes them to the model’s input tensor. In the case of
the Arduino Nano 33 BLE Sense, the data is also down-sampled from
119 Hz to 25 Hz.

• TFLite interpreter:
Runs the TensorFlow Lite model. This is the interesting part and will
be covered next.

• Model:
Contains the underlying data about the gestures to be recognized,
gathered during the training phase.

• Gesture predictor:
Takes the output of the model and decides whether a gesture has been
made, based on the probability and the number of consecutive positive
predictions. A sketch of this logic is shown below.

• Output handler:
When a gesture has been recognized, outputs which gesture was recog-
nized to the LED and the serial port, in a way applicable to the
hardware in use.

Figure 8: The components of the magic wand application
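As an illustration of the gesture predictor, here is a minimal Python
sketch of the debouncing step. The threshold and the required number of
consecutive detections below are assumptions, not the values used in the
actual example:

PROBABILITY_THRESHOLD = 0.8  # assumed; the real example tunes this per gesture
REQUIRED_CONSECUTIVE = 3     # assumed number of consecutive positive inferences

consecutive = 0
last_gesture = None

def predict_gesture(probabilities):
    # Return a gesture index only once it has won several inferences in a
    # row with a high enough probability; otherwise return None.
    global consecutive, last_gesture
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < PROBABILITY_THRESHOLD:
        consecutive, last_gesture = 0, None
        return None
    if best == last_gesture:
        consecutive += 1
    else:
        consecutive, last_gesture = 1, best
    return best if consecutive >= REQUIRED_CONSECUTIVE else None

Requiring several consecutive positive inferences trades a little latency
for far fewer spurious detections from a single noisy model output.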

7.1.1 Performance
When executing the shapes I tried to mimic the way Pete Warden performs
them in his presentations on YouTube. As a result, I can consistently
perform detectable wing shapes; however, the detection of the ring and slope
shapes is poor. As Warden states, they should be harder to perform than the
wing, but I am not sure to what degree.
The execution of the shapes is checked by outputting the accelerometer
data after sub-sampling and axis normalization.
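A minimal sketch of how such a check could be scripted on the host side,
assuming the board prints one comma-separated x,y,z triple per line on its
serial port (the port name is also an assumption):

import serial                    # pyserial
import matplotlib.pyplot as plt

samples = []
with serial.Serial("/dev/ttyACM0", 115200, timeout=1) as port:
    while len(samples) < 500:
        line = port.readline().decode(errors="ignore").strip()
        parts = line.split(",")
        if len(parts) != 3:
            continue  # skip incomplete lines
        try:
            samples.append([float(v) for v in parts])
        except ValueError:
            continue  # skip non-numeric lines

# Plot each axis so the shape execution can be inspected visually.
xs, ys, zs = zip(*samples)
for series, label in ((xs, "X"), (ys, "Y"), (zs, "Z")):
    plt.plot(series, label=label)
plt.legend()
plt.show()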

Wing Shape As can be seen in the plot (Figure 10), the accelerometer
data used for prediction is somewhat noisy, so the task of detecting a shape
is challenging, especially considering the model needs to be kept very small.
On the Z axis, a somewhat clear pattern from the motions of the wing
shape can be seen: peaks from the direction changes, but with some noise.
The X and Y axes are noisier, and it is much harder to see any pattern
there beyond a slight one from the rotation of the device during the
execution. What is also interesting is that the model did not detect the
wing shape unless the shape was executed somewhat violently, to the point
where the accelerometer clips. Considering all this, the model does well for
the wing shape.

Figure 9: Illustration of execution of Wing shape [8]

Figure 10: Accelerometer data plot of a successful attempt to detect a wing
shape. When looking down at the Arduino NANO 33 BLE Sense with the USB port
facing us, the axes are: X = Red, Y = Green and Z = Blue.

Ring Shape As can be seen in the plot (Figure 12), the attempts look
similar to the wing movement, but on closer analysis there is a difference.
In the ring, a somewhat constant acceleration toward the centre of the
circle should be seen, while the wing execution should show sudden
direction-change peaks. However, the model is mostly not able to detect the
ring shape being executed.

Figure 11: Illustration of execution of Ring shape [8]

Figure 12: Accelerometer data plot of a few unsuccessful attempts to detect
a ring shape. When looking down at the Arduino NANO 33 BLE Sense with the
USB port facing us, the axes are: X = Red, Y = Green and Z = Blue.

Slope Shape As can be seen in the plot (Figure 14), when executing the
slope shape there is some structure to it, with the three phases of
acceleration (start, direction change and stop), but the model seems to
struggle to detect the execution.

Figure 13: Illustration of execution of Slope shape [8]

Figure 14: Accelerometer data plot of a few unsuccessful attempts to detect
a slope shape. When looking down at the Arduino NANO 33 BLE Sense with the
USB port facing us, the axes are: X = Red, Y = Green and Z = Blue.

Performance conclusion So either I am interpreting the execution of the
shapes wrong, or the model is extremely honed in on the way Warden executes
them.

7.1.2 Epilogue
This example has since been removed[12] from the examples in the repository,
but it can still be found in the repository history.

7.2 Wake-up phrase with TinyML
An example of a TinyML-like approach that is already deployed and that many
people are familiar with is the voice assistants that use a wake-up phrase.
For example, Google’s Google Assistant responds to the wake-up phrase ”Ok
Google” and Amazon’s Alexa responds to the wake-up phrase ”Alexa”. Both have
implementations in IoT devices: Google with the Google Home devices (now
called Nest) and Amazon with the Amazon Echo.
Both of these use light models for the detection of the wake-up phrase.
Because it is possible to detect these somewhat simple phrases locally,
there is no need to send the audio to a server to be recognized. This saves
both a significant amount of network bandwidth and the need for huge server
infrastructure constantly trying to recognize when someone simply wants the
device to respond. Though these devices are not battery powered, the same
assistants are also implemented on mobile phones, which are somewhat battery
constrained (not as much as battery-powered IoT devices, but still) and
benefit from the lighter implementations.
However, once the wake-up phrase has been detected, the following speech,
containing the actual request, is sent to the cloud to be parsed and acted
upon, since it is much more complex to interpret.
Both of these systems are largely proprietary, so from the outside it is
hard to know exactly how they work.

7.3 Maintenance prediction in marine diesel engines
using TinyML
The data used in this example was collected in collaboration between Åbo
Akademi University and Wärtsilä Finland Oy. The collection was done and
documented by Andrei-Raoul Morariu et al. in the article Edge-based
vibration monitoring of marine vessel engines [13], which I will paraphrase.

7.3.1 Collected data


The data was collected on a cruise ferry travelling the Baltic Sea between
Finland and Sweden. The ship uses four Wärtsilä 12V32 4SA four-stroke
diesel engines for its propulsion, which are fitted with sensors on the
engine blocks and stands. The sensors used were Texas Instruments Sensor
BoosterPacks with temperature, accelerometer and gyroscope sensors, enclosed
in metallic boxes to protect them from factors such as temperature and
humidity; these are from now on called SUs (sensor units).
The sensors are connected to Raspberry Pi 4s, from now on called PUs
(processing units), through I2C; each SU was connected to a PU on a
dedicated I2C bus. I2C was chosen due to compatibility between the SUs and
PUs. (A sketch of what reading one SU over I2C could look like is shown
after Figure 16.) The PUs were installed in two enclosures, two PUs in each.
In each enclosure there is also a power supply and a switch to connect the
PUs. These switches are then connected to a switch in the control room,
which in turn connects to an industrial PC that collects the data and sends
it to shore when possible through different means.

Figure 15: Inside the enclosure.

Eight SUs have been placed as follows:

• Two on each engine:

– The first one on the engine block.
– The second one on the engine frameset, in close proximity to the
first one.

• One in each enclosure (of which there are two)

In each enclosure, the SUs were connected to the two PUs in the following
way:

• The first PU is connected to three SUs:

– The SU in the enclosure.
– One SU from the first or third engine.
– One SU from the second or fourth engine.

• The second PU is connected to two SUs:

– One SU from the first or third engine.
– One SU from the second or fourth engine.

This configuration was chosen for redundancy.

Figure 16: Installation setup
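As a hedged sketch of what reading one SU over I2C could look like on a PU,
using the smbus2 library; the device address and register values below are
hypothetical placeholders, since the actual BoosterPack registers would come
from its datasheet:

from smbus2 import SMBus

SENSOR_ADDR = 0x68  # hypothetical I2C address of the accelerometer
ACCEL_REG = 0x3B    # hypothetical first register of a 6-byte x/y/z block

# Each SU sits on a dedicated bus, so a PU serving several SUs would open
# one bus number (/dev/i2c-N) per connected sensor unit.
with SMBus(1) as bus:
    raw = bus.read_i2c_block_data(SENSOR_ADDR, ACCEL_REG, 6)
    # Combine high/low bytes into signed 16-bit axis readings.
    x, y, z = (int.from_bytes(bytes(raw[i:i + 2]), "big", signed=True)
               for i in (0, 2, 4))
    print(x, y, z)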

7.3.2 Analyses
For the analysis, an arbitrary data period was chosen, since the data should
be somewhat cyclic due to the ferry generally operating on the same route.
The acceleration dataset used consists of 20 million samples of four
variables:

• A timestamp in 64-bit (integer) epoch time.

• The acceleration for each axis (x, y, z) in an arbitrary unit as a float64.

The dataset is complete and contains no field without a value (null). There
are, however, some fields with a value of exactly zero for the acceleration
in the x and y axes, even though exact zeros should be rare in accelerometer
data. This might be due either to the system writing a zero in case of an
error, or simply to chance, since the x and y axes hover around zero. The
z-axis does not hover around zero, since it is aligned with gravity.
The number of zero values per axis:

• X: 8689

• Y: 3303

• Z: 0
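For illustration, both the null check and the zero counts above could be
reproduced with a couple of pandas expressions, assuming the axis columns
are named x, y and z as in the dataset description:

print(df.isnull().sum())                 # expect all zeros: no missing fields
print((df[["x", "y", "z"]] == 0).sum())  # exact-zero samples per axis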

In Figures 17, 18 and 19 the acceleration is plotted against time to show
the general behaviour of the data. From these we can see that there is some
noise in the data, more so in the x- and z-axes than in the y-axis.
Interestingly, the noise in the x-axis mostly reaches 2500 but slopes off at
the ends, whereas the noise in the other axes drops to zero arbitrarily
throughout. From this we can also see that the z-axis corresponds to up and
down in the real world, since it has a constant DC bias reflecting the
gravity of the earth.

Figure 17: Raw plot of the acceleration from the accelerometer x-axis; plot
x-axis = date, plot y-axis = acceleration.

Figure 18: Raw plot of the acceleration from the accelerometer y-axis; plot
x-axis = date, plot y-axis = acceleration.

Figure 19: Raw plot of the acceleration from the accelerometer z-axis; plot
x-axis = date, plot y-axis = acceleration.

Since the data in the y-axis is the cleanest, having the fewest outliers, we
will use it, with a threshold of 200 on the maximum value over a time window
(the minimum would work as well), as input data to train a simple model that
determines whether the engine is running. Obviously, this is not something
that requires machine learning to detect, since we can easily know whether
or not the machine is running; rather, it is an example of simple data that
can be analyzed with TinyML, and it shows that a model can be trained on the
collected accelerometer data. With data this simple it is also hard to know
whether the model is overfitted.

Figure 20: Plot of the windows in which the acceleration exceeded the
threshold and the engine is assumed to be running (1 = running). Window
length of 1000 samples; plot x-axis = window index.

Since my computer has limited resources, I could not train any more
sophisticated models. Because the computer lacked sufficient RAM and its
limited computing performance made iterating a very long process, I settled
for linear regression, since it should be able to mimic what we intuitively
did by choosing a threshold.
Code used for data formatting, training and prediction:

import glob

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Importing the data: all acceleration CSVs concatenated in order
path = 'data/20200513_20200609/raw_csv/ME1_SU1_csv/acceleration_0*.csv'
df = pd.concat(map(pd.read_csv, sorted(glob.glob(path))), ignore_index=True)

# Block size for creating the training data
blocksize = 1000
blocks = int(len(df.y) / blocksize)

# Finding the largest amplitude in each block
ymax = [np.max(df.y[x * blocksize:(x + 1) * blocksize])
        for x in range(blocks)]

# Checking whether the max amplitude of each block is over the threshold
ymaxbin = [m > 200 for m in ymax]

# Expanding the per-block labels back to one label per sample
ymaxlong = []
for a in ymaxbin:
    ymaxlong += [a] * blocksize

# Model input data: one acceleration sample per row
datax = np.array(df.y[:blocks * blocksize]).reshape(-1, 1)

# Model output training data
datay = ymaxlong

# Fitting the data to the model
reg = LinearRegression().fit(datax, datay)

# Doing an output prediction on the input data
predicted = reg.predict(datax)

# Averaging the output over 10 samples to get rid of erroneous samples
n = 10
avgResult = np.average(predicted.reshape(-1, n), axis=1)

# Rounding to binary since there are only two classes
predictedbin = avgResult.round()

Below is a plot of the predicted output. We can see that although the
general shape of the wanted output is there, the output varies
significantly, even though the output state should be stable for long
periods.

Figure 21: Plot of the prediction of when the engine is running
(1 = running); plot x-axis = sample index after averaging.

Even though the problem seems simple, we can see that analysing sequential
data can be challenging, and simple linear regression is often not enough;
more sophisticated algorithms are needed. We can also see that training the
model still requires a significant amount of compute performance, even
though it may be possible to run inference on minimal hardware. Both this
example and the TinyML wand example show that analysing sequential data can
be quite challenging.
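As a sketch of one step up in sophistication, the same labels can be reused
to train a classifier on per-window features instead of on raw samples. This
reuses the dataframe df and the window labels ymaxbin from the listing above
and is illustrative only, since the labels were derived from the same
signal:

import numpy as np
from sklearn.linear_model import LogisticRegression

window = 1000
n = len(df.y) // window
y_windows = np.array(df.y[:n * window], dtype=float).reshape(n, window)

# Summarize each window with two features: peak and RMS amplitude.
features = np.column_stack([
    np.abs(y_windows).max(axis=1),
    np.sqrt((y_windows ** 2).mean(axis=1)),
])
labels = np.array(ymaxbin[:n])

clf = LogisticRegression().fit(features, labels)
print(clf.score(features, labels))  # training accuracy over windows

Because each window is classified as a whole, the prediction cannot flicker
within a window, which directly addresses the instability seen in Figure 21.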

8 Conclusion
With the vast amount of data collected today and in the future, there is
clearly plenty that can be done with that data with the help of machine
learning: on more conventional computers with operating systems and perhaps
accelerators, on the edge, and in some cases on the edge on
microcontrollers. As for how the analysis is to be done, as can be seen in
this thesis, there is a vast and ever-growing set of approaches based on new
research and implementations of ways to process the data:

• With post-processing, all the processing is done afterwards.

• With a model-based approach, a model is trained with the use of training
data and then deployed to do inference on data gathered in real time.

• In the future, we might also have models that are trained in real time
as data is gathered.

However, as can also be seen in this thesis, it is important that the
quality of the data is good, in two senses:

• The quality of the data itself, meaning accuracy and precision.

• That the right kind of data is gathered. This requires planning ahead
in regard to the possible end uses for the gathered data. For example,
if the problem to be solved changes after the data has already been
gathered, it might be challenging to utilize the data.

For these reasons, among others, good-quality data is quite valuable, so
gathering as much diverse data as possible can often be of great value, as
long as it is done ethically. As data needs to be gathered from the real
world, the process cannot be as agile as other parts of computer
engineering.
the process can not be as agile as other parts of computer engineering.
As can also be seen in the thesis, TinyML and edge computing might not
always be the right choice for applying machine learning in a system. If the
processing needs a holistic view of the data gathered in the system, a more
centralized approach is better. Similarly, if only small amounts of data are
generated and sufficient uplinks are in place, a more centralized approach
might be preferred due to its simplicity.
In regard to the analysis, I feel that I did not manage to get any truly
meaningful output. I think the task of maintenance prediction would be
better served by a more diverse set of data from the engines, not just
acceleration data. Labelled data of when an engine is not running optimally,
or is about to break, would also be very valuable.
To conclude, when planning to do data analysis it is important to figure out
what the root requirements are and, based on those, select the right kind of
processing to be done as well as where it is to be done.

9 Summary in Swedish
Titel: Tillämplighet av TinyML för förutsägbart underhåll (Applicability of
TinyML for maintenance predictability).

9.1 Introduction
In today’s world we have a growing number of devices, many of which collect
data in some form that is then often used for analysis of the system or for
some kind of response, either regulation or control. As data volumes grow
explosively, problems can arise in how the data is handled, stored and
processed. To help us manage all the generated data, we can design our
systems in different ways depending on the requirements of those systems.
This thesis is about whether it is possible to process the generated data
with the help of edge computing on microcontrollers, specifically machine
learning in the form of TinyML.

9.2 Machine learning
Machine learning is a much-discussed field today, since we have vast amounts
of data collected from different kinds of systems. When we process this data
we cannot always intuitively know how different variables in the data relate
to each other, but with the help of machine learning we can use algorithms
to process the data in order to discover correlations or build models that
describe the behaviour of the system or systems where the data was
collected. These correlations and models can then be used to analyse, for
example, the performance or health of the system or of nodes in the system,
or to predict the behaviour of the system or of nodes in the system.

9.3 Anomaly detection
As the name implies, anomaly detection algorithms are used to find different
kinds of anomalies in a dataset. Finding anomalies in the data is valuable,
because an anomaly can mean either that part of the sensor data is faulty or
that part of the system is broken or about to break. Anomalies can take
different forms depending on the behaviour of the system being monitored.
An anomaly can, for example, be detected when:

• A data point lies statistically so far outside the expected range of
values that the value cannot be correct.

• A data point occurs with a value of deviating magnitude.

• In some systems two or more variables can be strongly correlated, and
if the variables in one or more samples do not follow the correlation
we can suspect that an anomaly has occurred.

9.4 Edge computing
In edge computing, computations are performed at the so-called edge of the
system, that is, on the devices where the data is generated, or physically
relatively close to them. This can give us the following advantages:

• It reduces the amount of raw data that needs to be transmitted.

• Less central computing power is needed.

• The response time decreases when the nodes need to act on locally
gathered data.

Depending on the architecture of the system, there can be a significant
amount of computing power near the edge of the system that in a conventional
system remains unused. These processors can be either microcontrollers or
more conventional computers (x86) with significant performance. The
computational needs vary between systems depending on what we want to derive
from the data. If the computation depends on data from the whole system, it
is usually not worthwhile to do it on nodes near the edge. If, on the other
hand, the computation depends only on data from the node itself or from
nodes in its direct vicinity, we can gain significant advantages by doing
the computation there.

9.5 Machine learning on microcontrollers – TinyML
TinyML is the concept of running machine learning models directly on the
microcontrollers that collect the data. The models are created on more
conventional computers, either servers or workstations. Compared to
conventional data processing, it is harder to do all the computations on
microcontrollers because they have limited resources:

• The processor performance is several orders of magnitude lower than
that of an ordinary (x86) processor.

• The storage capacity is limited, including working memory (RAM),
program memory (flash ROM) and storage memory (non-volatile memory).

• The available energy can also be limited due to where the node has to
be located. This means that we do not always have an ”infinite” amount
of power from the grid; the nodes may be battery powered, sometimes
with solar cells and sometimes without. This in turn means that the
nodes have to be very energy efficient.

But there are also advantages to doing the data processing on the edge on
microcontrollers:

• The need to send large amounts of data to a central server decreases
drastically.

• Since we already need a microcontroller in the edge node to connect the
sensors to, we save resources by using the computing power that already
exists in the nodes, which means that we need less processing power in
servers.

9.6 Analysis
The data analysed was collected from a car ferry travelling the Bothnian Bay
between Vaasa in Finland and Umeå in Sweden. Accelerometer sensors were
placed in the ship’s engine room and on the engines and their mounts, and
the data from these sensors was analysed. An arbitrary time interval was
used for the analysis. The analysis was done with linear regression, to
examine how well the algorithm handles classification of the data.

Figure 22: Plot of the training data where the engine has been assumed to be
running (1 = running). The data is divided into windows of 1000 data points.

Figure 23: The model’s estimate of when the engine is running (1 = running).

From the analysis it can be seen that such a simple algorithm is not very
good at classifying sequential data. The shape of the desired result is
clearly visible in the estimate, but during the periods when the engine is
constantly running, the estimate extremely often jumps to the engine being
off.

9.7 Summary
As with most problems, there is not always one solution that is a perfect
fit for every problem. TinyML is thus one more tool in the toolbox for
solving problems where machine learning may be the solution, but one still
has to consider what the goal of the solution is. TinyML is a good tool when
we want to reduce network usage and latency and maximize the use of
microcontrollers that we already have in service. At the same time, we also
see that TinyML does not change the fact that we still need high-quality
data in order to make the most of it.

References
[1] Bruno Stecanella. Support Vector Machines (SVM) Algorithm Explained.
https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/.
Accessed 2022-05.
[2] Rebecca Bevans. Simple Linear Regression: An Easy Introduction &
Examples. https://www.scribbr.com/statistics/simple-linear-regression/.
Accessed 2022-05.
[3] Pranoy Radhakrishnan. “What are Hyperparameters? And How to Tune the
Hyperparameters in a Deep Neural Network?” (2017). URL:
https://towardsdatascience.com/what-are-hyperparameters-and-how-to-tune-the-hyperparameters-in-a-deep-neural-network-d0604917584a.
[4] What is anomaly detection?
https://developer.ibm.com/learningpaths/get-started-anomaly-detection-api/what-is-anomaly-detection/.
Accessed 2022-04.
[5] Sahil Garg. Algorithm Selection for Anomaly Detection.
https://medium.com/analytics-vidhya/algorithm-selection-for-anomaly-detection-ef193fd0d6d1.
Accessed 2022-04.
[6] What is edge computing? https://www.ibm.com/cloud/what-is-edge-computing.
Accessed 2021-06.
[7] What is Edge Computing?
https://www.intel.com/content/www/us/en/edge-computing/what-is-edge-computing.html.
Accessed 2021-06.
[8] Pete Warden and Daniel Situnayake. TinyML: Machine Learning with
TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers. O’Reilly
Media Inc., 2020.
[9] FlatBuffers white paper.
https://google.github.io/flatbuffers/flatbuffers_white_paper.html. Accessed
2021-10.
[10] FlatBuffers. https://google.github.io/flatbuffers/. Accessed 2021-10.
[11] Magic wand example in the TFLite-Micro repository.
https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/examples/magic_wand.
[12] Commit removing the magic wand example.
https://github.com/tensorflow/tflite-micro/commit/bef8fe8bc6183cc4e1ce852579
[13] Andrei-Raoul Morariu, Wictor Lund, Andreas Lundell et al. “Edge-based
Vibration Monitoring of Marine Vessel Engines”. In: 12th Symposium on
High-Performance Marine Vehicles (HIPER), ed. by Volker Bertram. Conference
dates 12–14 October 2020. Germany: Technische Universität Hamburg-Harburg,
Oct. 2020, pp. 239–250.

