Abstract— Users of embedded and cyber-physical systems expect dependable operation for an increasingly diverse set of applications and environments. Reactive self-diagnosis techniques either use unnecessarily conservative guardbands, or do not prevent catastrophic failures. In this letter we utilize machine-learning techniques to design a prediction engine in order to predict failures on-device in embedded systems. We evaluate our prediction engine's effectiveness for predicting temperature behavior on a mobile system-on-chip, and propose a realizable hardware implementation for the use-case.

1 Harbin Institute of Technology, 1162620312 at stu.hit.edu.cn
2 Center for Embedded and Cyber-physical Systems, UC Irvine, {minjun.seo,bdonyana,dutt,kurdahi} at uci.edu
*This work was partially supported by NSF Grant CCF-1704859

I. Introduction

The complexity of embedded system platforms and the applications they support is continuously increasing: they run large and evolving applications on heterogeneous multi- or many-core processing platforms. Examples include automated and autonomous driving, smart buildings, Industry 4.0, and personal medical devices. Such systems are required to provide dependable operation for the user while dealing with a large number of internal and external variabilities, threats, and uncertainties over their lifetimes.

To provide such dependable operation, self-diagnosis techniques are developed for early detection of degradation and imminent failures, in order to maximize system life-cycle. These techniques can be combined with unsupervised platform self-adaptation to meet performance and safety targets. Self-diagnosis techniques that are reactive may (a) not be sufficient to address catastrophic failures, or (b) take overly conservative approaches that hinder performance. For example, consider thermal management of an embedded system-on-chip (SoC). One technique is to define a temperature threshold, and throttle performance (e.g., via dynamic voltage-frequency scaling (DVFS)) when the threshold is exceeded. This approach is reactive and must act conservatively to prevent overheating; the conservative frequency throttling may degrade performance, potentially unnecessarily. If the temperature behavior could be predicted, a proactive approach could manage the temperature without sacrificing performance excessively. However, system dynamics such as temperature can behave nonlinearly, and are hard to predict without workload knowledge.

Machine learning techniques such as neural networks are useful for identifying complex system dynamics. However, neural networks are complex and difficult to deploy on power-constrained embedded systems. In this paper, we propose a failure prediction technique for embedded systems using long short-term memory (LSTM), a type of recurrent neural network (RNN). We demonstrate the effectiveness of our predictor for predicting temperature behavior with respect to a threshold on an ODROID-XU3 [9] platform, making it a candidate for mitigating overheating failures and implementing efficient control policies. We specify an implementation that is realizable in hardware on low-power embedded systems. The specific contributions are as follows:
• We propose a method for hardware hazard prediction called the Long Short-Term Prediction Model.
• We propose an architecture and hardware implementation of a non-intrusive prediction engine based on the Long Short-Term Prediction Model to predict temperature behavior in embedded systems.
• We evaluate the predictor using measured temperature data from an ODROID-XU3.

II. Background and Related Work

When modern systems-on-chip (SoCs) operate near peak performance for extended periods, power dissipation can increase the temperature to the point that it adversely impacts chip reliability. If we can provide proactive thermal management, we can avoid potentially dangerous execution scenarios. Proaction requires prediction. A number of strategies have been proposed for on-chip thermal prediction, and the methods can be classified into two categories. The first builds prediction models from measured temperature and power consumption [21], [19], [14], [15], [17]. The second builds the prediction model indirectly using equations, without thermal measurements [5], [4], [7]. Separately, there have been many successful applications of machine learning techniques to failure detection or prediction in large-scale systems. With sufficient sensor input, machine learning models can extract complex or subtle dynamics, potentially resulting in accurate predictions when applied to new execution scenarios. Failure prediction has been proposed using support vector machines (SVMs) [10], [3], convolutional neural networks (CNNs) [16], and a combination of techniques [8].

RNNs are naturally suited for learning temporal sequences and modeling time-series behaviors, and have been applied to predict various behaviors in large-scale systems [20], [13], [6]. In [6], the authors compare an RNN solution with an LSTM solution, and observe that LSTMs significantly outperform RNNs in terms of accuracy. In [18], [2], [11], LSTMs are used in other domains for time-series predictions such as water quality estimation, stock transaction prediction, mechanical states, and more. The authors compare LSTM networks with alternatives such as back-propagation neural networks, online sequential extreme learning machines, and support vector regression machines (SVRM), and demonstrate the superiority of LSTMs.

III. Contributions

We propose a method for predicting runtime behavior in hardware: the Long Short-Term Prediction Engine. In this section, we describe how our predictor is composed by walking through our use-case: predicting runtime temperature behavior on an embedded system-on-chip. Our goal is to predict temperature behavior such that critical thermal scenarios can be detected in advance and avoided, with a solution that can feasibly be integrated in an embedded SoC. Our SoC consists of four ARM A15 cores with a shared L2 cache connected via bus. We measure total power and temperature of the entire core cluster, as well as per-core utilization. To generate workloads, we use a configurable synthetic microbenchmark [12]. The microbenchmark can stress the architecture across a wide range, and we generated a "general-purpose" workload by executing the microbenchmark in phases that exercised different behavior along these various dimensions. We execute different sequences on multiple cores to emulate different applications, both to train the model and to test its performance.

The prediction engine consists of two parts: a short-term binary model and a long-term regression model. The short-term binary model makes precise predictions quickly, useful for subtle changes, i.e., anticipating violations of a temperature threshold. The long-term regression model can make a prediction further in advance, useful for predicting general behavior in less-critical scenarios, i.e., predicting temperature trends in a safe state.

A. Short-Term Binary Model

The short-term binary model is used to predict unwanted behavior, i.e., constraint violation. In our case, in which we have a temperature threshold we do not want to violate, the short-term binary model is utilized when the measured temperature is nearing the threshold. In this scenario, a slight rise in temperature will cause a failure (violation of the constraint), so it is important to have a high recall rate. The recall rate must be tuned carefully to balance accuracy and overhead.

1) Model Definition

Our initial short-term binary model is defined as follows:
• Input: temperature, core utilization, power.
• Output: probability of failure (after applying a decision threshold, the model produces a binary result: '0' refers to normal and '1' refers to failure).

2) Model Training

Figure 1 shows measured temperature data from the ODROID-XU3¹. We first isolate the data above the critical point (85 °C) to use as training data. Because the range of this data is reduced, we amplify the changes in the data to increase its variation. When performing amplification at runtime, we must consider constraints such as the real-time hardware implementation and the short failure intervals. We therefore create a method called Sliding Average Amplification to efficiently preprocess the data and increase its variation, and apply it to the four input features. The method takes local data (5 timesteps) and uses min-max normalization to amplify the values. The following equations show the calculation of Sliding Average Amplification, where D(t) is the feature value at time t and n is the number of timesteps treated as local data:

average(t) = (1/n) · Σ_{i=0..n} D(t−i)    (1)

max(t) = MAX{D(t−n), D(t−n+1), ..., D(t)}    (2)

min(t) = MIN{D(t−n), D(t−n+1), ..., D(t)}    (3)

amplified(t) = (D(t) − average(t)) / (max(t) − min(t)) × 100    (4)

¹ Our use-case system, containing the described SoC.
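To make the computation concrete, here is a minimal Python sketch of Sliding Average Amplification, assuming a window of n = 5 timesteps; the guard for flat windows where max(t) = min(t) is our addition, since the equations leave that case implicit:

```python
import numpy as np

def sliding_average_amplification(d, n=5):
    """Amplify local variation of one feature series d per eqs. (1)-(4).

    d is assumed to be a 1-D NumPy array; the same routine is applied
    independently to each of the four input features.
    """
    out = np.zeros(len(d))
    for t in range(len(d)):
        window = d[max(0, t - n):t + 1]    # local data: last n+1 samples
        avg = window.mean()                 # eq. (1)
        rng = window.max() - window.min()   # eqs. (2)-(3)
        # eq. (4); flat windows (rng == 0) are mapped to 0 by assumption
        out[t] = 0.0 if rng == 0 else (d[t] - avg) / rng * 100.0
    return out
```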
Figure 2 shows the amplified data along with the original: the orange curve is the original data and the blue curve is the amplified data.

3) Improved Loss Function

Our initial binary model still has a significant issue: it is trained with imbalanced data. Normal samples (i.e., non-critical temperatures) account for nearly 99.5% of the training data. Due to the low ratio of failure samples (i.e., critical temperatures), the model becomes highly confident in predicting normal behavior, which is misleading. We augment the classic binary cross-entropy loss function with weights in order to increase the model's sensitivity to failure samples, where y is the actual label and ŷ is the predicted value. The weight factor α is determined empirically from the rate of failure samples in the training data:

Loss = −(α · y · log ŷ + (1 − α) · (1 − y) · log(1 − ŷ))    (5)

α = 0.992    (6)
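For reference, a NumPy sketch of the weighted loss; the clipping epsilon is our addition, to keep the logarithms finite:

```python
import numpy as np

def weighted_bce(y_true, y_prob, alpha=0.992, eps=1e-7):
    """Class-weighted binary cross-entropy, eqs. (5)-(6).

    y_true: ground-truth labels (1 = failure, 0 = normal)
    y_prob: predicted failure probabilities
    alpha:  weight on the rare failure class (eq. 6)
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    loss = -(alpha * y_true * np.log(y_prob)
             + (1.0 - alpha) * (1.0 - y_true) * np.log(1.0 - y_prob))
    return loss.mean()
```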
4) Model Structure

We propose the simplest structure of an RNN prediction model that provides the required accuracy, in order to minimize the hardware overhead. The LSTM internal structure is defined in the following equations, where x_t is the input feature vector, h_t is the output, W and b are weights and biases, c_t is the cell state, and ⊙ denotes element-wise multiplication:

i_t = σ(W_xi · x_t + W_hi · h_{t−1} + b_i)    (7)
f_t = σ(W_xf · x_t + W_hf · h_{t−1} + b_f)    (8)
o_t = σ(W_xo · x_t + W_ho · h_{t−1} + b_o)    (9)
c̃_t = tanh(W_xc · x_t + W_hc · h_{t−1} + b_c)    (10)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t    (11)
h_t = o_t ⊙ tanh(c_t)    (12)
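A direct NumPy transcription of one cell update follows; the dict packaging of the weights is ours, and real implementations typically fuse the four gates into a single matrix multiply:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step, mirroring eqs. (7)-(12); W and b are dicts
    keyed by the equation subscripts (e.g., W['xi'], b['i'])."""
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])        # (7)
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])        # (8)
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])        # (9)
    c_tilde = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # (10)
    c_t = f * c_prev + i * c_tilde                                 # (11)
    h_t = o * np.tanh(c_t)                                         # (12)
    return h_t, c_t
```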
Figure 3 (black and blue) illustrates the architecture of the proposed RNN/LSTM model, which contains two RNN/LSTM layers (the RNN and LSTM structures provide comparable accuracy, as shown in Section IV), one fully connected layer, and one binary classification layer based on sigmoid activation. The input features are time sequences of temperature, per-core utilization, and power. After the calculation of time step t in the first layer, the result is conveyed both to step t+1 of the same layer and to step t of the second layer; at the same time, the step t+1 input is added into the next step's calculation. Each RNN/LSTM layer spans 8 time steps with 64 hidden units. At the last time step, the result is passed to a fully-connected layer and a sigmoid layer for classification. The output is the failure probability; when the value is greater than 0.9, we classify it as a failure and output 1.
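A sketch of an equivalent network in Keras (our rendering; the letter specifies the structure, not a framework). The final Dense layer folds the fully-connected and sigmoid layers of Figure 3 into one:

```python
import numpy as np
import tensorflow as tf

N_STEPS, N_FEATURES = 8, 6   # 8 time steps; temperature, power, 4 core loads

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_STEPS, N_FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True),  # layer 1: per-step h
    tf.keras.layers.LSTM(64),                         # layer 2: last step only
    tf.keras.layers.Dense(1, activation='sigmoid'),   # failure probability
])
# In training, the weighted loss of eq. (5) would replace the plain BCE.
model.compile(optimizer='adam', loss='binary_crossentropy')

# Per the text, a probability above 0.9 is reported as a failure (output 1).
prob = model.predict(np.zeros((1, N_STEPS, N_FEATURES)))
print(int(prob[0, 0] > 0.9))
```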
Fig. 3: Integrated model structure. The structures are shared between the short-term binary model and the long-term regression model, depending on which is active. Functionality and structure specific to the short-term binary model is shown in blue, and that specific to the long-term regression model in red.

B. Long-Term Regression Model

The long-term regression model is used to predict behavior in the safe state, i.e., while the system is below the critical temperature.
The benchmarks vary in instructions-per-cycle (IPC), utilization, and cache miss rate, exercising the processor across a wide range. For unicore workloads, we first run each benchmark on one core to emulate a stable workload state; we then combine multiple benchmarks and start them one by one to emulate a changing workload state on one core. For multicore workloads, we assign different benchmarks to different cores and start them simultaneously. For shifting workloads, we assign the same benchmarks to different cores and start them at different times.

Raw data collected from the ODROID-XU3 does not initially appear stable, making filtering essential.² After trying several filters to smooth the raw data, and considering hardware feasibility, we conclude that data preprocessed by a recursive average filter produces the most accurate model. The filter size for each input is determined empirically.

² Data is stored in a userspace buffer, sampled from sensors via kernel drivers every 5 ms.
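The letter does not give the filter equation; a common recursive form of the moving-average filter, assumed here, updates the mean incrementally instead of re-summing the window. One filter instance would run per input feature, with the window size n chosen empirically as described above:

```python
from collections import deque

class RecursiveAverageFilter:
    """Sliding mean maintained recursively: avg += (new - oldest) / n."""

    def __init__(self, n):
        self.n, self.buf, self.avg = n, deque(), 0.0

    def update(self, x):
        self.buf.append(x)
        if len(self.buf) > self.n:
            self.avg += (x - self.buf.popleft()) / self.n
        else:
            self.avg += (x - self.avg) / len(self.buf)  # window still filling
        return self.avg
```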
3) Model Structure

LSTM is naturally suited to storing long-term memory; therefore, to handle the long-term cases, we choose an LSTM structure for our model. Compared to the short-term model, more historical data is needed to ensure precision when predicting over a large temperature range far in advance, which increases the model's time step count and execution time. We therefore apply stateful LSTM theory in the cell structure, feeding the output cell state back as the initial state. In this way, the structure can retain long-term memory and adapt better.

Figure 3 (black and red) illustrates the architecture of the proposed LSTM model. The input features are time sequences of temperature, per-core utilization, and power. After the calculation of step t, the cell state is recycled into the next term's calculation. There are 8 time steps in the LSTM layer and 64 hidden units in each cell. We need 16 previous steps for prediction; therefore, the cell state is passed on for initialization every second iteration.
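A self-contained sketch of this cell-state recycling (placeholder random weights; a real deployment would load trained values): the (h, c) pair that closes one 8-step window initializes the next, so two chained windows cover the required 16 steps of history.

```python
import numpy as np

HID, FEAT = 64, 6
rng = np.random.default_rng(0)
# Placeholder weights with the shapes eqs. (7)-(12) require.
W = {k: 0.1 * rng.standard_normal((HID, FEAT if k[0] == 'x' else HID))
     for k in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
b = {k: np.zeros(HID) for k in 'ifoc'}
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_window(x_win, h, c):
    """Run one 8-step window (eqs. 7-12), threading (h, c) through it."""
    for x in x_win:
        i = sig(W['xi'] @ x + W['hi'] @ h + b['i'])
        f = sig(W['xf'] @ x + W['hf'] @ h + b['f'])
        o = sig(W['xo'] @ x + W['ho'] @ h + b['o'])
        c = f * c + i * np.tanh(W['xc'] @ x + W['hc'] @ h + b['c'])
        h = o * np.tanh(c)
    return h, c

# Stateful chaining: the state that closes window k initializes window k+1,
# so two consecutive 8-step windows span 16 steps of history.
h, c = np.zeros(HID), np.zeros(HID)
for win in np.split(rng.standard_normal((16, FEAT)), 2):
    h, c = lstm_window(win, h, c)
```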
C. Hardware Implementation Framework

To integrate the short- and long-term models, we specify a single shared-hardware implementation that supports all of Figure 3. A judgement module receives temperature values from the sensor and decides which model to activate: if the temperature is ≥ 85 °C, the short-term prediction model is activated and its weights are loaded into the model structure; if it is < 85 °C, the long-term prediction model is activated. To reduce structural overhead, the core LSTM and fully connected layers are partially shared, composed with the least common parameters (LSTM: 8 time steps, 64 hidden units; fully connected: 4 hidden units). The excess time steps can be stored in a state buffer and fed back (Figure 3).

Using the LSTM implementation of Chang et al. [1], we calculate an overhead of 12960 flip-flops, 7201 LUTs, and 16 BRAMs. The LSTM hardware is 20 times faster than the Zynq ZC7020's ARM-based hard-core processor (4.4 µs per inference), and 44 times more power-efficient (performance-per-watt) than a software implementation on the Zynq ZC7020.
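Behaviorally, the judgement module is a threshold multiplexer over the shared structure. A Python sketch of the control flow follows; the 85 °C threshold and weight reloading come from the text, while the class and the load_weights hook are hypothetical packaging:

```python
THRESHOLD_C = 85.0  # switch point between the two models

class JudgementModule:
    """Activates the model whose weights should occupy the shared structure."""

    def __init__(self, short_term_model, long_term_model):
        self.short, self.long = short_term_model, long_term_model
        self.active = None

    def predict(self, temperature_c, features):
        model = self.short if temperature_c >= THRESHOLD_C else self.long
        if model is not self.active:
            model.load_weights()   # hypothetical hook: reload shared structure
            self.active = model
        return model.predict(features)
```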
IV. Evaluation

We evaluate the effectiveness of our Short-Term Binary Model and our Long-Term Regression Model separately, using additional measured data from the ODROID-XU3. The measured data consists of the model input data sampled at 5 ms intervals. We perform sensitivity analyses of the LSTM/RNN models for different parameters and structures.

A. Short-Term Binary Model Evaluation

1) Evaluation Metrics

The output of the short-term binary model is a binary classification. We evaluate the model by average precision (AP) score and F1 score. The average precision score summarizes a precision-recall curve as the weighted mean of the precision achieved at each recall threshold, with the increase in recall from the previous threshold used as the weight:

AP = Σ_n (R_n − R_{n−1}) · P_n    (13)

where P_n and R_n are the precision and recall at the nth threshold. The F1 score is a measure of a test's accuracy, defined as the harmonic mean of the test's precision (P) and recall (R), conveying a balance between the two:

F1 = (2 × P × R) / (P + R)    (14)
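Both metrics can be computed off the shelf; a scikit-learn sketch with illustrative arrays (not our measured data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1])                 # measured failures
y_prob = np.array([0.1, 0.4, 0.95, 0.80, 0.3, 0.92])  # predicted probabilities

ap = average_precision_score(y_true, y_prob)          # eq. (13)
f1 = f1_score(y_true, (y_prob > 0.9).astype(int))     # eq. (14) at the 0.9 cut
print(f"AP = {ap:.2f}, F1 = {f1:.2f}")
```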
Fig. 4: Sample prediction of one workload. Binary events (i.e., experiencing critical temperature) are predicted and observed.

2) Evaluation Results

The model can predict up to 8 steps (40 ms) ahead. The F1 score is 0.43 and the AP score is 0.78. The latency of the short-term binary model is 0.088 ms (based on execution in Python, without hardware acceleration). Figure 4 shows the prediction result for one dataset: the orange shows measured failures and the blue shows predicted failures. Observe that there are a number of mispredicted failures (false positives). This is preferable to false negatives (non-predicted failures), as we are trying to anticipate and potentially avoid undesirable system states. In fact, in the experiment shown in Figure 4, the recall value is 1, which means that all measured failures are predicted, i.e., we have no false negatives.

a) Model Structure Tradeoffs

To ensure the practical utility of our hardware predictor in low-power embedded systems, it is important to balance precision and complexity. Considering the feasibility constraints, we explore the impact of several hyper-parameters and layer structures on model performance. The parameters include RNN type, model structure, number of hidden neurons, decimal digits, and number of time steps. We evaluate the RNNs and LSTMs based on AP, F1, recall (performance), runtime, and degree of prediction. Figure 5 shows how the different hyper-parameters affect model performance: the left y-axes measure AP, F1, and recall scores, while the right y-axes measure the time it takes to generate one prediction; solid lines refer to the model with LSTM layers, and dotted lines to the model with RNN layers. Figure 5a shows how the number and type of layers affect performance: LSTM has better accuracy, and prediction time increases with the number of layers, so a 2-layer LSTM is the best choice. Figure 5b shows how the number of previous timesteps affects performance: beyond five timesteps, accuracy plateaus while prediction time increases, so five timesteps is the best choice. Figure 5c shows how the number of neurons affects performance: accuracy plateaus beyond 32 neurons, so we choose 32 neurons for the network. Figure 5d shows how the number of decimal digits influences performance: two digits is the minimum that maintains accuracy. Figure 5e shows how accuracy degrades as the prediction moves further in advance.

Fig. 5: Effect of hyper-parameters on model performance (solid lines: LSTM; dotted lines: RNN; left axes: AP/F1/recall; right axes: prediction latency). (a) Comparison for number of network layers. (b) Comparison for number of time steps considered in the network. (c) Comparison for number of neurons. (d) Comparison for number of decimal places (precision) used in the model. (e) Comparison for prediction distance.

B. Long-Term Regression Model

For the regression model, we use mean absolute error (MAE) to evaluate accuracy, where P_i is the temperature predicted k steps in advance and M_{i+k} is the temperature measured at step i + k:

MAE = (1/n) · Σ_{i=1..n} |P_i − M_{i+k}|    (15)
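A NumPy sketch of this k-step-aligned error, assuming pred[i] is the forecast issued k steps before meas[i + k]:

```python
import numpy as np

def mae_k_ahead(pred, meas, k=64):
    """Mean absolute error of k-step-ahead predictions, eq. (15)."""
    n = len(pred) - k
    return np.mean(np.abs(pred[:n] - meas[k:k + n]))
```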
Figure 6 shows a sample time plot of one experiment. The orange dashed line shows the measured temperature 64 steps (320 ms) in advance, and the blue line is the temperature predicted in real time. The latency of the long-term regression model is 0.108 ms (without hardware acceleration). The MAE achieved by the predictor for 320 ms in advance is 0.018. The highest accuracy achieved by existing prediction methods is 0.024 MAE [17], and the longest prediction step is 500 ms [4], which we improve by 25% and 36%, respectively.

V. Conclusion

We propose a new LSTM-based method for hardware hazard prediction called the Long Short-Term Prediction Engine. The prediction engine uses two models to provide predictions for both urgent and normal conditions, which have different prediction requirements. The integrated model is trained and tested on data collected on the ODROID-XU3 platform. The short-term model makes precise binary predictions near critical conditions 40 ms in advance, and reaches a 0.78 average precision score. The long-term model outputs temperature values up to 320 ms in advance with an MAE of 0.018. We simplify the structure of the network and its hyper-parameters to find one suited for hardware realization, sharing parts of the network and automatically switching between the two models according to temperature.

References

[1] A. X. M. Chang, B. Martini, and E. Culurciello, "Recurrent neural networks hardware implementation on FPGA," arXiv preprint arXiv:1511.05552, 2015.
[2] Z. Chen, Y. Liu, and S. Liu, "Mechanical state prediction based on LSTM neural network," in Chinese Control Conference, 2017.
[3] A. Chigurupati, R. Thibaux, and N. Lassar, "Predicting hardware failure using machine learning," in Reliability and Maintainability Symposium, 2016.
[4] R. Cochran and S. Reda, "Consistent runtime thermal prediction and control through workload phase detection," in ACM/IEEE Design Automation Conference, 2010.
[5] A. K. Coskun, T. S. Rosing, and K. C. Gross, "Utilizing predictors for efficient thermal management in multiprocessor SoCs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2009.
[6] F. D. d. S. Lima, G. M. R. Amaral, L. G. d. M. Leite, J. P. P. Gomes, and J. d. C. Machado, "Predicting failures in hard drives with LSTM networks."
[7] Y. Ge, Q. Qiu, and Q. Wu, "A multi-agent framework for thermal aware task migration in many-core systems," IEEE Transactions on Very Large Scale Integration Systems, 2012.
[8] I. Giurgiu, J. Szabo, D. Wiesmann, and J. Bird, "Predicting DRAM reliability in the field with machine learning," in ACM/IFIP/USENIX Middleware Conference: Industrial Track, 2017.
[9] Hardkernel, "ODROID-XU," Tech. Rep. [Online]. Available: http://www.hardkernel.com/main/main.php
[10] R. Kumar, S. Vijayakumar, and S. A. Ahamed, "A pragmatic approach [...] and big data technologies," in IEEE International Advance Computing Conference, 2014.
[11] S. Liu, G. Liao, and Y. Ding, "Stock transaction prediction modeling and analysis based on LSTM," in IEEE Conference on Industrial Electronics and Applications, 2018.
[12] T. Mück, S. Sarma, and N. Dutt, "Run-DMC: Runtime dynamic heterogeneous multicore performance and power estimation for energy efficiency," in International Conference on Hardware/Software Codesign and System Synthesis, 2015.
[13] S. Huang, C. Fung, K. Wang, P. Pei, Z. Luan, and D. Qian, "Using recurrent neural networks toward black-box system anomaly prediction," in IEEE/ACM International Symposium on Quality of Service, 2016.
[14] S. Sharifi, D. Krishnaswamy, and T. S. Rosing, "Prometheus: A proactive method for thermal management of heterogeneous MPSoCs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2013.
[15] G. Singla, G. Kaur, A. K. Unver, and U. Y. Ogras, "Predictive dynamic thermal and power management for heterogeneous mobile platforms," in Design, Automation & Test in Europe Conference & Exhibition, 2015.
[20] C. Xu, G. Wang, X. Liu, D. Guo, and T. Liu, "Health status assessment and failure prediction for hard drives with recurrent neural networks," IEEE Transactions on Computers, 2016.
[21] I. Yeo, C. C. Liu, and E. J. Kim, "Predictive dynamic thermal management for multicore systems," in ACM/IEEE Design Automation Conference, 2008.