1. Introduction
Real-time systems are characterized by the existence of timing constraints: tasks must be completed within a limited time frame. In hard real-time systems, missing a deadline can directly compromise the system’s safety. For instance, if an airbag system in a car does not deploy within the required time frame, it could result in severe injury or death during an accident. Ensuring real-time performance is therefore essential for maintaining safety in these systems. The Worst-Case Execution Time (WCET) of a computational task is the maximum length of time the task could take to execute on a specific hardware platform. Its offline estimation helps in choosing appropriate scheduling algorithms, ensuring the system’s performance, and guaranteeing the system’s energy efficiency [1]. Since the wide application of real-time systems makes real-time analysis of programs particularly important, WCET estimation has been studied extensively [2], and various approaches have been developed to derive such bounds [3]. In embedded systems, ARM processors are considered mainstream because of their high performance, low power consumption, reasonable price, and mature support ecosystem. They are particularly prominent in the fields of real-time control, connected automated vehicles, and mobile phones. Thus, WCET analysis for ARM processors is crucial, and methods to estimate the WCET of programs on newer ARM platforms such as ARMv8 should be provided.
Estimating the WCET of a program is difficult because the execution of a program depends on specific hardware [4]. In terms of micro-architecture, modern processors are often equipped with pipelines, caches, branch prediction, and other features. These units not only perform their own functions but also interact with each other. For example, modern processors often load multiple cache lines at once when transferring data from memory to cache; when branch instructions are involved, this loading scheme can lead to additional cache invalidation [5]. When a branch instruction is mispredicted, instructions along the predicted path are fetched and speculatively executed; once the misprediction is detected, these wrong-path instructions are squashed, incurring a branch misprediction penalty that can be substantially larger than the pipeline length [6]. These interactions between functional units greatly complicate the analysis of program execution, and there is little research on them [7,8].
In this paper, we model the ARMv8 ISA and the features of the processor’s micro-architecture, including the cache, dynamic branch predictor, prefetching strategy, and out-of-order pipeline. We also consider the interactions between these features, for example, cache misses caused by branch prediction and branch delays caused by the instruction pipeline. We estimate the WCET of a program using a static analysis method. The static estimation is performed by establishing and solving an Integer Linear Programming (ILP) problem using IPET, where the control flow information of the program and the processor’s micro-architecture are converted into a series of linear constraints and a target WCET equation. The estimated WCET of a program is obtained by solving its corresponding ILP problem. Using this process, we present a WCET analysis tool based on the open-source project Chronos [9]. The tool is tested on WCET benchmarks. By comparing the estimated WCET of the benchmarks with the observed results, we conclude that, in most cases, the analysis produces reliable WCET estimations.
Our contributions can be summarized as follows:
We use the IPET static WCET analysis method to estimate the WCET of programs on the ARMv8 platform.
In the analysis, we model features of the micro-architecture of the platform, along with the interactions between them.
We evaluate the performance of our WCET analysis tool by comparing the estimated WCET of benchmarks with the observed results. The results show that our tool can provide reliable WCET estimations.
The remainder of this paper is organized as follows.
Section 2 reviews background knowledge of WCET analysis and prior research in the field.
Section 3 introduces our WCET estimation method.
Section 4 presents a detailed implementation of our analysis tool.
Section 5 shows our evaluation method and experimental results. Lastly,
Section 6 concludes the paper.
2. Related Work
There are three types of WCET analysis methods: static analysis, dynamic analysis, and their hybrid [10]. Dynamic analysis obtains the estimated WCET by executing the given task on the given hardware or a simulator for some set of inputs and measuring the execution time of the task or its parts. Static analysis analyzes program code offline, without executing it. Because it is built on sound mathematical over-approximation, the estimate produced by static analysis is never lower than the actual value, making it the safest approach. It is usually used for hard real-time systems that have stringent execution time requirements [11]. Static analysis comprises three main sub-tasks: control flow analysis, processor behavior analysis, and WCET calculation. The tool in this paper is designed using static analysis.
The first task of IPET static analysis is to perform control flow analysis and construct the control flow graph (CFG). A CFG is a directed graph with one entry and one exit. Each node, or basic block, in a CFG represents a maximal sequence of consecutive instructions: except for the last instruction, there are no branch instructions in the block. Each edge between nodes represents a branch in the control flow. Basic blocks are identified by scanning the instructions and symbols in the binary.
Figure 1 shows an example program and its corresponding CFG.
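As a sketch of this step, basic blocks can be partitioned by computing “leader” addresses: the entry point, branch targets, and the instructions that follow branches. The following Python sketch assumes a simplified instruction list rather than a real ELF disassembly; the representation of an instruction as an (address, branch_target) pair is our own simplification.

```python
# A minimal sketch of basic-block partitioning, assuming a simplified
# instruction list instead of a real disassembly. An instruction is a pair
# (address, branch_target), with branch_target=None for non-branches.
def split_basic_blocks(instrs):
    addrs = [a for a, _ in instrs]
    leaders = {addrs[0]}                       # entry point starts a block
    for i, (addr, target) in enumerate(instrs):
        if target is not None:                 # a branch: its target and the
            leaders.add(target)                # fall-through instruction both
            if i + 1 < len(instrs):            # start new blocks
                leaders.add(addrs[i + 1])
    blocks, cur = [], []
    for addr, _ in instrs:
        if addr in leaders and cur:            # a leader closes the previous block
            blocks.append(cur)
            cur = []
        cur.append(addr)
    if cur:
        blocks.append(cur)
    return blocks

# Four instructions; the one at 0x08 branches back to 0x04.
prog = [(0x00, None), (0x04, None), (0x08, 0x04), (0x0C, None)]
print(split_basic_blocks(prog))   # -> [[0], [4, 8], [12]] (three blocks)
```

Real tools must additionally handle indirect branches, call/return pairs, and function boundaries from the symbol table; this sketch only illustrates the leader-based splitting rule.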
Currently, there are three types of static WCET analysis methods: the Tree-Based Technique, the Implicit Path Enumeration Technique [12,13], and the Path-Based Technique [14]. The Tree-Based Technique translates the structure of a program into a syntax tree, where reduction is performed from the bottom of the tree upwards; when reduction reaches the root of the syntax tree, the WCET of the program is obtained. The Implicit Path Enumeration Technique converts the search for the longest execution path into the problem of finding the maximum number of executions of each basic block, which can be modeled and solved as an ILP. The Path-Based Technique is similar to the Tree-Based Technique, but it uses a Scope Graph to represent the structure of the program hierarchically.
Table 1 provides a comparison of the characteristics of these methods. We adopt IPET in this paper because it achieves good performance in various aspects when the input program is not too complex.
The purpose of control flow analysis is to find the program path with the longest execution time. Since IPET is based on constructing and solving an ILP, the WCET target equation, ignoring hardware configuration, is given by Equation (1): WCET = max Σ_{i=0}^{n} c_i · x_i, where one possible control flow of the program passes through basic blocks 0 to n, and c_i and x_i are the execution time and execution count of basic block i, respectively.
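To make the objective concrete, the following sketch cross-checks it on a hypothetical loop-free CFG: for such a structure, the maximum of Σ c_i · x_i over feasible flows equals the cost of the longest entry-to-exit path, which can be found by plain enumeration. The diamond-shaped CFG and the block costs below are illustrative, not taken from the paper.

```python
# Illustration of the IPET objective on a toy, loop-free CFG: with each
# block executed at most once, max sum(c_i * x_i) over feasible flows
# equals the cost of the longest entry-to-exit path, found here by
# explicit path enumeration (feasible only for tiny acyclic graphs).
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}   # diamond-shaped CFG (successor lists)
cost = {0: 5, 1: 10, 2: 3, 3: 4}           # per-block execution time in cycles

def wcet_by_enumeration(cfg, cost, entry, exit_):
    def paths(node):
        if node == exit_:
            yield [node]
            return
        for succ in cfg[node]:
            for tail in paths(succ):
                yield [node] + tail
    return max(sum(cost[b] for b in p) for p in paths(entry))

print(wcet_by_enumeration(cfg, cost, 0, 3))   # -> 19 (path 0->1->3: 5+10+4)
```

The point of IPET is precisely that it avoids this exponential enumeration: the same maximum is obtained implicitly by the ILP solver from the flow constraints.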
Then, the purpose of control flow analysis shifts to obtaining constraints on the target equation from the program structure. Three kinds of constraints should be established: basic block constraints, loop iteration upper limits, and infeasible path constraints [13]. Basic block constraints are relations on the execution counts of basic blocks. Loop iteration upper limits include the maximum and minimum execution counts of multi-exit loops, the dependent iteration counts of inner loops in nested loops, and other constraints. Common methods for loop bound analysis are abstract interpretation [15] and symbolic execution [16]. Infeasible path constraints exclude control flow in the program that can never be reached, thereby improving the efficiency of the analysis. The detection of infeasible paths is necessary for the reliable analysis of hard, critical real-time systems [17]. Ref. [18] proposed an abstract interpretation technique to analyze infeasible paths in programs, but the technique requires manual intervention in test cases and cannot yet be fully automated.
After obtaining the preliminary WCET target equation and a series of constraints from the control flow of the program, the execution time of each basic block along the longest path is calculated by modeling the micro-architecture of the processor. This is a difficult task because most modern processors use multi-level caches, dynamic branch predictors, pipelines, etc., to improve CPU throughput, and the complex behavior and interaction of these units make the execution time very unpredictable [5,19].
To analyze the influence of an instruction cache on program execution time, ref. [20] proposes a method based on IPET. The basic idea is to estimate whether a cache access hits by analyzing potential cache conflicts within the program, using a Cache Conflict Graph (CCG) to model each cache line.
The dynamic branch prediction used by modern processors is based on history-state (HS) branch prediction algorithms, whose idea is to predict a branch based on its execution history. The implementation uses the Branch History Register (BHR) to form the index into the Branch History Table (BHT). When a branch instruction is executed, the branch prediction for this instruction is looked up and updated in the BHT. Ref. [21] considers the impact of branch prediction on WCET analysis by adding HS nodes to the CFG.
Superscalar out-of-order CPUs can achieve higher performance than in-order CPUs, but their WCET is difficult to guarantee. Ref. [22] modeled basic blocks and analyzed a five-stage pipeline, considering cache effects, by using an execution graph, and avoided enumerating all possible instruction execution times by representing them as intervals. Their pipeline analysis can also be expressed in the form of linear constraints, meaning it can be easily integrated into IPET. Their methods have been adopted by the static WCET analysis tool Chronos.
3. Design
The analysis method in this paper is divided into three steps: compile, analyze, and solve; the process is shown in Figure 2. In the compile step, C source files are compiled and disassembled. We focus on analyzing programs on the ARM architecture, so the output of this step is an object file in ELF format and its disassembled ARM code. In the analysis step, we perform control flow analysis and micro-architecture modeling to collect the target WCET equation and linear constraints, then combine them into the target ILP problem. In the solve step, the ILP problem obtained in the previous steps is solved to get the estimated WCET value.
3.1. Control Flow Analysis
Each node, or basic block, in the CFG is a maximal sequence of consecutive instructions. Let the directed edge variable d_{i,j} represent the count of executions flowing from basic block B_i to B_j, and let x_i denote the execution count of basic block B_i. For any node in the CFG, the count of control flows into the node equals the count of flows out of the node, and both equal the node’s execution count. Thus, the basic block constraint Equation (2) is obtained: x_i = Σ_j d_{j,i} = Σ_j d_{i,j}.
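As an illustration of this flow-conservation rule, the constraints for a small hypothetical CFG can be generated mechanically from its edge list. The variable names d_i_j and x_i below follow our own naming convention, not any particular tool’s output format.

```python
# A sketch of deriving the basic-block flow constraints of Equation (2):
# for every block, the sum of incoming edge counts equals the block count,
# which equals the sum of outgoing edge counts. Edge variables are named
# d_i_j and block variables x_i (an illustrative convention).
def flow_constraints(edges, blocks):
    cons = []
    for b in blocks:
        inc = [f"d_{i}_{j}" for i, j in edges if j == b]
        out = [f"d_{i}_{j}" for i, j in edges if i == b]
        if inc:
            cons.append(f"x_{b} = " + " + ".join(inc))
        if out:
            cons.append(f"x_{b} = " + " + ".join(out))
    return cons

# Diamond-shaped CFG: 0 -> {1, 2} -> 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
for c in flow_constraints(edges, [0, 1, 2, 3]):
    print(c)   # e.g. "x_0 = d_0_1 + d_0_2"
```

Entry and exit blocks would additionally get a constraint pinning their count to one execution per program run.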
The other two kinds of constraints mentioned in
Section 2, namely the loop iteration upper limit and infeasible path constraints, are provided by the user in our analysis.
Figure 3 shows a simple C program and its corresponding CFG, along with the structural constraints and loop iteration bounds. These form an ILP problem together with (1). By solving this preliminary problem, we can estimate the execution count of each basic block without considering micro-architecture details.
Within these constraints, the edge variables d_{i,j} denote the execution counts of the paths between blocks, x_i denotes the execution count of each basic block, and the loop variables bound the execution counts of the loops in the CFG. The identification of loops in the CFG will be described in Section 4.
3.2. Instruction Cache Analysis
In this paper, we use a Cache Conflict Graph (CCG) to model a direct-mapped cache scheme with a FIFO replacement policy. In the direct-mapped scheme, each block of main memory maps to exactly one cache location. We divide the instructions in a basic block into several cache basic blocks based on the size of the physical cache line configured by the user, each represented by a node in the CCG. Each split block is then mapped to a cache line based on its starting address. If multiple cache basic blocks are mapped to the same cache line, conflicts may occur during execution, represented by edges between nodes in the CCG. Assuming there are four cache sets, each with one cache line, Figure 4 shows a program, its cache mapping table, and the CCG of cache set 0. S and E represent the start and end nodes, B_{i.j} denotes the j-th cache basic block of basic block i, and an edge p(i.j, u.v) denotes the count of times cache basic block B_{i.j} is evicted by B_{u.v}.
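The set-mapping step can be sketched as follows, assuming a hypothetical configuration of four sets and a 16-byte line; cache basic blocks that map to the same set are exactly the candidates for CCG conflict edges.

```python
# A sketch of mapping cache basic blocks to sets in a direct-mapped cache.
# The geometry (4 sets, 16-byte lines) and the block addresses below are
# illustrative assumptions, not the paper's actual configuration.
LINE_SIZE, NUM_SETS = 16, 4

def cache_set(addr):
    # set index = (address / line size) mod number of sets
    return (addr // LINE_SIZE) % NUM_SETS

def conflicts(block_addrs):
    # group cache basic blocks by set; sets holding more than one
    # block are where CCG conflict edges arise
    by_set = {}
    for name, addr in block_addrs.items():
        by_set.setdefault(cache_set(addr), []).append(name)
    return {s: names for s, names in by_set.items() if len(names) > 1}

# Hypothetical cache basic blocks "i.j" with their starting addresses.
blocks = {"1.1": 0x00, "1.2": 0x10, "2.1": 0x20, "3.1": 0x40}
print(conflicts(blocks))   # -> {0: ['1.1', '3.1']}
```

Here blocks 1.1 and 3.1 both map to set 0, so executions alternating between them would evict each other, which is what the edge counts p(1.1, 3.1) and p(3.1, 1.1) record.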
The target equation, considering the instruction cache model, is revised to (3): WCET = max Σ_i Σ_{j=1}^{n_i} (c_{i.j}^{hit} · x_{i.j}^{hit} + c_{i.j}^{miss} · x_{i.j}^{miss}), where n_i is the number of cache basic blocks in basic block i, c_{i.j}^{hit} (c_{i.j}^{miss}) denotes the execution time of B_{i.j} when the cache hits (misses), and x_{i.j}^{hit} (x_{i.j}^{miss}) denotes the execution count of B_{i.j} when the cache hits (misses). The estimated WCET is the maximum sum of the execution times of the cache basic blocks over all possible control flows in the CFG.
In the CFG and CCG of a program, the execution count of a basic block equals the execution count of each of the cache basic blocks within it, i.e., x_i = x_{i.j}^{hit} + x_{i.j}^{miss}. The sum of the control flow into a cache basic block equals the control flow out of it, and the edges connected to the start and end nodes are executed exactly once. Based on these observations, the following constraints can be established for the target equation:
3.3. Branch Prediction Analysis
This paper adopts the GAg prediction model for micro-architecture analysis. In this model, the branch predictor first looks up the Global History Register (GHR) to obtain the current global history of the branches when predicting a branch instruction. This history is then used as an index into the Pattern History Table (PHT), which contains the predicted branch directions. Prediction errors lead to pipeline flushes and delays in instruction prefetching. To analyze the impact of branch prediction on program execution time, let m_i denote the count of prediction errors at basic block i and c_{mp} represent the misprediction delay penalty. The target equation is then revised to:
To obtain the constraints bounding the count of mispredictions under the GAg model, information about the execution history of each branch instruction must be recorded. Therefore, we introduce the Historical State (HS) into the CFG. Since there is at most one branch instruction per basic block, each basic block is given an HS attribute that records the possible branch histories when the control flow reaches that block. Assuming that the BHR is a two-bit register, the CFG after adding the HS attribute to each basic block is shown in Figure 5. Let 0 represent a branch that is not taken and 1 a branch that is taken. The starting basic block has no recorded branches, so its HS is always 00. A basic block with two incoming edges can be reached along two paths; if the branch histories along these paths are (not taken, not taken) and (taken, not taken), its HS set is {00, 10}. In an execution, the current HS of each basic block is used as its global history to access and modify the branch prediction result in the PHT, as prescribed by the GAg prediction model.
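A minimal simulation of the GAg scheme, under the assumption of a 2-bit global history register indexing a PHT of 2-bit saturating counters (predict taken when the counter is at least 2), illustrates how mispredictions arise from the shared global history:

```python
# A simplified GAg predictor: a global history register (GHR) indexes a
# pattern history table (PHT) of 2-bit saturating counters. The 2-bit
# history width and initial counter values are illustrative assumptions.
class GAg:
    def __init__(self, ghr_bits=2):
        self.ghr, self.bits = 0, ghr_bits
        self.pht = [1] * (1 << ghr_bits)      # all counters weakly not-taken

    def predict_and_update(self, taken):
        pred = self.pht[self.ghr] >= 2        # predict taken iff counter >= 2
        # update the 2-bit counter, saturating at 0 and 3
        self.pht[self.ghr] = min(3, self.pht[self.ghr] + 1) if taken \
                             else max(0, self.pht[self.ghr] - 1)
        # shift the actual outcome into the global history register
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.bits) - 1)
        return pred

bp = GAg()
outcomes = [True, True, True, True, False]    # loop-like branch behaviour
mispredicts = sum(bp.predict_and_update(t) != t for t in outcomes)
print(mispredicts)   # -> 4
```

Even for this short taken-dominated sequence the predictor needs several executions to warm up, and the final loop exit is mispredicted as well; bounding exactly such counts is the purpose of the HS constraints above.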
After adding the HS information, it still holds that for each basic block in the CFG, the sum of the control flows into the block equals the sum of the control flows out of the block. Let d_{i,j}^h denote the execution count of the flow from B_i to B_j that arrives with historical state h, and let HS_i denote the set of all historical states of basic block B_i. Then, the constraints for the target equation can be obtained:
If there is an edge from basic block B_i to basic block B_j, and the HS of this control flow at B_i is h, then the HS at B_j is obtained by shifting the branch outcome of the edge (i, j) into h. The sum of the incoming control flow from all possible basic blocks under all possible HSs is the execution count of a basic block. These constraints are given in Equation (12).
Let r_{i,j}^{h,n} denote the count of executions for which the HS h remains unchanged along the longest path from B_i to B_j, meaning that branches along the path are either always taken or always not taken, where n represents the branch outcome at B_j. If a basic block takes (1) the branch to the next block and the branch prediction under HS h is incorrect, the count of incorrect predictions must be no larger than the total execution count when the basic block takes the branch. So, the following constraints can be obtained:
3.4. Prefetching Strategy
In the previous two subsections, the analyses of the instruction cache and the dynamic branch predictor were performed independently. When analyzing the cache using the CCG, the impact of branch prediction on program execution time was ignored. Similarly, when analyzing the impact of branch prediction using HS, the influence of cache hits or misses due to the branch predictor was not considered. In reality, instruction cache prefetching occurs based on the predicted branch direction, which affects whether instructions hit the cache and, consequently, influences the program’s execution time. In this paper, we assume that the processor can only allow one branch instruction to be in the prediction stage at any given time. This ensures that all previous instructions have finished executing before the branch instruction is processed.
When branch prediction is incorrect, invalidated cache contents may cause cache misses, resulting in delays. For this scenario, we introduce virtual nodes into the CCG. If the cache basic blocks on the predicted path conflict with other cache basic blocks, a virtual node B′_{i.j}(x) is introduced for cache basic block B_{i.j} to describe the mutual influence between branch prediction and the instruction cache, where x denotes the actual execution outcome of the branch instruction. If a cache basic block along the predicted path does not conflict with other blocks, there is no need to add a virtual node. After adding virtual nodes, the CCG in Figure 4 is modified to the CCG shown in Figure 6. For branch instruction 2, when the actual execution is 0 but the prediction is 1, we do not need to add additional nodes because, in this scenario, instruction prefetching proceeds along the erroneous path and no instructions within the prefetched block conflict with other instructions. Similarly, for branch instruction 3, when the execution is 1 but the prediction is 0, we also do not need to add nodes. The virtual node B′_{3.1}(1) denotes the impact on cache basic block 3.1 when branch instruction 2 actually executes as 1 but the prediction is 0. One of its edges indicates that, because branch instruction 2 is predicted as 0, block 3.1 is already in the cache; when instruction 1.2 is then executed, a cache miss for 1.2 occurs because of 3.1. For the other edge, when the prediction for branch instruction 2 is 0 but the execution is 1, the incorrect prediction results in instruction 3.1 being cached, which actually improves the cache hit rate.
After adding virtual nodes, the target equation is modified to Equation (15), in which one term represents the total time consumed by branch mispredictions and another the total time consumed by cache misses. The equation reflects that, when a branch prediction error occurs, instruction prefetching proceeds along the erroneous path; consequently, when the program executes along the correct path, at least one cache miss will occur. Here, l denotes the number of instructions fetched in a single prefetch operation.
3.5. Pipeline Analysis
The purpose of pipeline analysis is to comprehensively consider the effects of both the cache model and branch prediction and to calculate the execution time of a basic block. Ref. [23] analyzes the impact of out-of-order execution pipelines on program execution time, noting that the execution time cannot be simply represented by the longest path of instruction execution; dependencies between instructions must also be considered. This paper analyzes the five-stage pipeline architecture commonly used in ARMv8 processors: Instruction Fetch (IF), Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). We use execution graphs to model and analyze out-of-order pipelines. Execution graphs illustrate the dependencies between instructions using directed edges, as follows:
The dependency between different stages of the same instruction. The completion of the previous stage is required before proceeding to the next stage.
The dependency between different instructions in the same pipeline stages. Earlier instructions in the program have a higher priority.
The data dependency between instructions.
Queuing for idle Instruction Fetch Buffers (I-buffers) and Reorder Buffers (ROB).
Assuming the size of the I-buffer is 2 and the size of the ROB is 4, a program and its corresponding execution graph are shown in Figure 7. Solid edges represent the dependencies between pipeline stages and the data dependencies between instructions, further edges represent the structural dependencies induced by the sizes of the I-buffer and ROB, respectively, and dashed edges indicate instructions competing for the same functional unit.
Since the analysis of the cache is performed using the CCG, the cache is ignored when establishing execution graphs; we assume that all instructions hit the cache. However, errors in branch prediction can cause the instruction pipeline to prefetch instructions along an incorrect path. To address this, we add instructions with the same size as the Reorder Buffer (ROB) along the erroneous path after the mispredicted branch instruction in the execution graph.
For each node of the execution graph, its earliest and latest finish times can be obtained using Algorithm 1; for a node i, est_i denotes its earliest start time, eft_i its earliest finish time, lst_i its latest start time, and lft_i its latest finish time. The latest finish time of the last node is the estimated WCET of the basic block. The algorithm relies on two procedures. The first calculates the latest ready, start, and finish times of the nodes. The latest start time of a node depends on its latest ready time, which in turn depends on the latest finish times of its predecessors and competitors. If a competitor has a lower priority than the instruction in question, it is excluded, as are competitors whose execution times do not overlap; if a competitor has a higher priority, it is assumed that all nodes preceding this competitor delay the node. Once a node’s latest start time is obtained, the latest ready times of its successor nodes are updated. The second procedure calculates the earliest ready, start, and finish times of the nodes; unlike the first, it only considers competitors that conflict with the node’s ready time and hardware resources.
Algorithm 1: Basic Block WCET Analysis Algorithm
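Algorithm 1 itself handles functional-unit contention and execution-time intervals; as a much-simplified sketch, if contention is ignored, the earliest finish times reduce to a longest-path computation over the dependency edges in topological order. The node latencies and edges below are hypothetical.

```python
# A much-simplified sketch of the timing calculation on an execution
# graph: ignoring contention, the earliest finish time of each node is a
# longest-path computation over the dependency edges in topological order.
def earliest_finish(nodes, edges, latency):
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    eft = {}
    for n in nodes:                      # nodes assumed listed in topological order
        est = max((eft[p] for p in preds[n]), default=0)
        eft[n] = est + latency[n]
    return eft

# Three instructions, each with IF and EX stage nodes; EX of i2 needs
# the result of i1 (a data dependency).
nodes = ["i1.IF", "i1.EX", "i2.IF", "i2.EX", "i3.IF", "i3.EX"]
edges = [("i1.IF", "i1.EX"), ("i2.IF", "i2.EX"), ("i3.IF", "i3.EX"),
         ("i1.IF", "i2.IF"), ("i2.IF", "i3.IF"),   # in-order fetch
         ("i1.EX", "i2.EX")]                        # data dependency i1 -> i2
latency = {"i1.IF": 1, "i1.EX": 3, "i2.IF": 1, "i2.EX": 1,
           "i3.IF": 1, "i3.EX": 2}
print(earliest_finish(nodes, edges, latency)["i3.EX"])   # -> 5
```

The data dependency makes i2.EX wait for the 3-cycle i1.EX, illustrating why the basic block time is not simply the sum or maximum of individual stage latencies; Algorithm 1 additionally propagates latest times and contention delays, which this sketch omits.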
4. Implementation
This section provides a description of the implementation process of the tool, including the establishment of the CFG and CCG, the extension of the CFG, and WCET calculation.
Each basic block in the CFG must indicate whether it contains a branch instruction. For blocks with branch instructions, there are two outgoing edges representing the two possible execution directions of the branch instruction. In addition to edges, other information, such as the basic block index, must also be stored. The data type of a basic block is shown in
Table 2.
To analyze the impact of the cache on program execution time, basic blocks are divided into multiple cache basic blocks. If a basic block is smaller than the size of a cache line configured by the user, it is treated as a single node. Otherwise, the basic block is divided into multiple cache basic blocks. The data type for a cache basic block is shown in
Table 3. Each cache basic block must record information such as the starting instruction address, the index of the cache set it is mapped to, and the index of the basic block it belongs to. This data type can also be used to describe nodes in the CCG. Consequently, the edges in the CCG only need to record the indices of the start and end nodes, and the count of conflicts.
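Since the exact fields of Table 2 and Table 3 are not reproduced here, the following is only a hypothetical Python rendering of such data types, with field names of our own choosing:

```python
# Hypothetical renderings of the CFG and CCG node data types described in
# the text; the actual field names and layout in Tables 2 and 3 may differ.
from dataclasses import dataclass

@dataclass
class BasicBlock:
    index: int                   # basic block index within the CFG
    has_branch: bool = False     # whether the block ends in a branch instruction
    taken: int = -1              # successor block index if the branch is taken
    fallthrough: int = -1        # successor block index otherwise

@dataclass
class CacheBasicBlock:
    start_addr: int              # address of the block's first instruction
    cache_set: int               # index of the cache set it maps to
    parent_block: int            # index of the basic block it belongs to

bb = BasicBlock(index=2, has_branch=True, taken=4, fallthrough=3)
cb = CacheBasicBlock(start_addr=0x40, cache_set=0, parent_block=2)
print(bb.taken, cb.cache_set)   # -> 4 0
```

With this layout, CCG edges only need to store the indices of their endpoint nodes and a conflict count, as the text notes.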
For branch prediction analysis, the introduction of HS requires the extension of CFG. A new node is used to store the branch history information for nodes in the CFG. The data type is shown in
Table 4. The main steps in analyzing branch prediction are collecting the set of instructions on the erroneous path, constructing the HSs of the nodes, and analyzing the HSs of the edges to feasible nodes. One helper function collects instructions along the predicted path, another collects control flow information between adjacent branch instructions under a specific HS, and a third uses BFS to collect the information of all nodes reachable from a given CFG node.
The pipeline model analysis is implemented using execution graphs, with the goal of determining the execution time of a basic block. When computing the execution time of a basic block, it is insufficient to consider the block in isolation; the block’s context within the CFG must also be considered. For example, if the instruction fetch buffer size is 2 and the reorder buffer size is 4, then before the execution of a basic block there may be five instructions waiting in the pipeline and one instruction being executed. Instructions may also have dependencies on instructions in nearby blocks: instructions that depend on the previous basic block are referred to as the prologue, while those with dependencies on the subsequent basic block are called the epilogue. By traversing paths in the CFG, we can establish the context between cache basic blocks, which allows us to account for dependencies in the contextual environment when analyzing the execution graph.
In the implementation of the pipeline analysis, one function calculates the execution time of each basic block in the CFG, another analyzes paths within the CFG, focusing on branch prediction and the contextual environment of the basic blocks, a third establishes the execution graph for each basic block, and a final function applies Algorithm 1 to derive the final WCET target equation.
IPET solves problems using ILP. The overall construction process for the ILP problem is shown in
Figure 8. There are two branches related to branch prediction: the function after the first branch extracts constraints related to branch prediction, while the function after the second branch extracts constraints related to the erroneous-path prefetching strategy. The roles of the other functions are as follows: one generates the linear target equation based on the micro-architecture configured by the user, invoking the appropriate routine to construct the target equation for correct branch prediction or for branch prediction errors, or establishing the target Equation (15) when branch prediction, the instruction cache, and their interactions are all considered; one generates constraints derived from the control flow information; one generates constraints derived from branch prediction; one generates constraints for cache hits and misses, as well as constraints for execution along erroneous paths; and one reads user-input constraint files and parses the linear constraints for the ILP problem.
Figure 8.
Process of implementing the ILP problem.
The ILP designed in this paper is based on the ILOG/CPLEX format [24]. Two ILP solvers are capable of solving problems in this format: CPLEX, a commercial product, and lp_solve, a free and open-source alternative. This paper uses lp_solve 5.5.0.4 as the third-party ILP solver for calculating the WCET.
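As a sketch, a tiny problem can be emitted in this LP text format as follows; the variable names and constraints are illustrative, and a real run would fill the constraint section from the analyses described above.

```python
# A sketch of emitting a tiny maximization problem in the CPLEX-style LP
# text format. The objective coefficients, constraints, and variable
# names below are illustrative placeholders.
def to_lp(costs, constraints, int_vars):
    lines = ["Maximize",
             " obj: " + " + ".join(f"{c} {v}" for v, c in costs.items()),
             "Subject To"]
    lines += [f" c{k}: {c}" for k, c in enumerate(constraints)]
    lines += ["Generals", " " + " ".join(int_vars), "End"]
    return "\n".join(lines)

lp = to_lp({"x0": 5, "x1": 10},          # objective: 5 x0 + 10 x1
           ["x0 - x1 = 0", "x0 <= 8"],   # flow and bound constraints
           ["x0", "x1"])                  # execution counts are integers
print(lp)
```

The resulting text declares the WCET objective, the constraint rows, and the integrality of the count variables, in that order, which is the general shape of the files handed to the solver.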
5. Results and Discussion
Figure 9 gives the estimated WCET in cycles for the set of Mälardalen benchmarks [25] supported by our WCET analysis tool. The processor configuration used in the experiments is listed in Table 5. Some of the user-provided constraints used for the benchmarks are shown in Table 6.
Among the Mälardalen WCET benchmarks, the insertion sort program (insertsort) has the largest estimated WCET, in contrast to the binary search program (bs) and the finite impulse response filter program (fir). This is because we changed the length of the target reversed array in insertsort to 1024, whereas the original benchmark only sorts 15 elements.
Two boards equipped with ARMv8-A CPUs, the Raspberry Pi 4 Model B from Sony, Pencoed, Wales, and the Firefly ROC-RK3568-PC-SE from T-CHIP, Zhongshan, Guangdong Province, China, are also used to obtain the observed WCETs of the benchmarks in a physical environment. Execution time in processor cycles is obtained by reading the generic timer register of the ARMv8-A CPU; the difference between the values before and after execution is the measured execution time of the program. Five measured execution times are averaged to obtain the final observed WCET. We configure the analysis tool so that the modeled micro-architecture is similar to the target processor. The two sets of estimated and observed WCETs of the benchmarks are shown in Figure 10 and Figure 11.
Given a C program and a processor configuration, the static analysis method guarantees that the estimated WCET is not less than the program’s actual execution time for any input. As shown in the figure, the ratio of the estimated to observed WCET values for all benchmarks is greater than 1, which ensures the reliability of the analysis tool. The analysis result is considered precise if the estimated value is close to the observed value.
However, some estimated WCET values for benchmarks are up to 2–3 times larger than the observed WCET. There may be two reasons for these differences. First, our micro-architecture modeling may not be detailed enough. For instance, features like data caches and Translation Lookaside Buffers (TLBs) supported by modern processors are not included in our model, leading to a lack of constraints. Second, the differences may be attributed to the pessimistic nature of the static analysis method.
6. Conclusions
This paper designs a WCET analysis tool that includes using a CCG to address cache conflicts, employing a control flow analysis method based on historical states to analyze branch predictors, and examining instruction cache prefetching strategies based on erroneous branch prediction paths by adding virtual nodes to the CCG. When calculating the execution time of a basic block, this paper integrates instruction pipelines into the analysis and uses execution graphs to determine the execution time of individual basic blocks.
After implementation, we evaluate the performance of the analysis tool by comparing the estimated WCETs of the benchmarks with the values observed on two boards. The ratio of estimated to observed WCET values is greater than 1 for all benchmarks, demonstrating the tool’s reliability. Some estimated values differ considerably from the observed values; this discrepancy is mainly due to our incomplete modeling of modern processor features.
Future works will mainly include the following:
Model and analyze caches shared among multiple cores to improve analysis accuracy.
Optimize the micro-architecture model to make it more suitable for modern processors.
Adopt algorithms with lower time complexity so that the tool can analyze more complex programs.
Use estimated WCET bounds to perform schedulability analyses for programs [26].