1. Introduction
Real-time systems are characterized by the existence of timing constraints: tasks must be completed within a limited time frame. In hard real-time systems, missing a deadline can directly compromise the system’s safety. For instance, if an airbag system in a car does not deploy within the required time frame, it could result in severe injury or death during an accident. Ensuring real-time performance is therefore essential for maintaining safety in these systems. The Worst-Case Execution Time (WCET) of a computational task is the maximum length of time the task could take to execute on a specific hardware platform. Its offline estimation helps in choosing appropriate scheduling algorithms, ensuring the system’s performance, and guaranteeing the system’s energy efficiency [1]. Since the wide application of real-time systems makes real-time analysis of programs particularly important, WCET estimation has been studied extensively [2], and various approaches have been developed to derive such bounds [3]. In embedded systems, ARM processors are considered mainstream because of their high performance, low power consumption, reasonable price, and mature support ecosystem. They are particularly prominent in the fields of real-time control, connected automated vehicles, and mobile phones. Thus, WCET analysis for ARM processors is crucial, and methods to estimate the WCET of programs on newer ARM platforms such as ARMv8 should be provided.
Estimating the WCET of a program is difficult because the execution of a program depends on specific hardware [4]. In terms of micro-architecture, modern processors are often equipped with pipelines, caches, branch prediction, and other features. These units not only perform their own functions but also interact with each other. For example, modern processors often load multiple cache lines at once when transferring data from memory to cache; when branch instructions are involved, this loading scheme can lead to additional cache invalidation [5]. When a branch instruction is mispredicted, instructions along the predicted path are fetched and speculatively executed; once the misprediction is detected, these wrong-path instructions are squashed, incurring a branch misprediction penalty that can be substantially larger than the pipeline length [6]. These interactions between functional units greatly complicate the analysis of program execution, and there is little research on them [7,8].
In this paper, we model the ARMv8 ISA and the features of the processor’s micro-architecture, including the cache, dynamic branch predictor, prefetching strategy, and out-of-order pipeline. We also consider the interactions between these features, for example, cache misses caused by branch prediction and branch delays caused by the instruction pipeline. We estimate the WCET of a program using a static analysis method. The static estimation is performed by establishing and solving an Integer Linear Programming (ILP) problem using IPET, where the control flow information of the program and the processor’s micro-architecture are converted into a series of linear constraints and a target WCET equation. The estimated WCET of a program is obtained by solving its corresponding ILP problem. Using this process, we present a WCET analysis tool based on the open-source project Chronos [9]. The tool is tested on WCET benchmarks. By comparing the estimated WCET of the benchmarks with the observed results, we conclude that, in most cases, the analysis produces reliable WCET estimations.
Our contributions can be summarized as follows:
We use the IPET static WCET analysis method to estimate the WCET of programs on the ARMv8 platform.
In the analysis, we model features of the micro-architecture of the platform, along with the interactions between them.
We evaluate the performance of our WCET analysis tool by comparing the estimated WCET of benchmarks with the observed results. The results show that our tool can provide reliable WCET estimations.
The remainder of this paper is organized as follows.
Section 2 reviews background knowledge of WCET analysis and prior research in the field.
Section 3 introduces our WCET estimation method.
Section 4 presents a detailed implementation of our analysis tool.
Section 5 shows our evaluation method and experimental results. Lastly,
Section 6 concludes the paper.
2. Related Work
There are three types of WCET analysis methods: static analysis, dynamic analysis, and their hybrid [10]. Dynamic analysis obtains the estimated WCET by executing the given task on the given hardware or a simulator for some set of inputs and measuring the execution time of the task or its parts. Static analysis analyzes program code offline, without executing it. Because it is built on sound mathematical over-approximation, the estimate produced by static analysis is never lower than the actual value, making it the safest approach. It is usually used for hard real-time systems that have stringent execution time requirements [11]. Static analysis comprises three main sub-tasks: control flow analysis, processor behavior analysis, and WCET calculation. The tool in this paper is designed using static analysis.
The first task of IPET static analysis is to perform control flow analysis and construct the control flow graph (CFG). A CFG is a directed graph with one entry and one exit. Each node, or basic block, in a CFG represents a maximal sequence of consecutive instructions: except for the last instruction, there are no branch instructions in the block. Each edge between nodes represents a branch in the control flow. Basic blocks are identified by scanning the instructions and symbols in the binary.
Figure 1 shows an example program and its corresponding CFG.
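As a sketch of this step, basic blocks can be partitioned by computing “leader” addresses: the entry point, branch targets, and the instructions that follow branches. The following Python sketch assumes a simplified instruction list rather than a real ELF disassembly; the representation of an instruction as an (address, branch_target) pair is our own simplification.

```python
# A minimal sketch of basic-block partitioning, assuming a simplified
# instruction list instead of a real disassembly. An instruction is a pair
# (address, branch_target), with branch_target=None for non-branches.
def split_basic_blocks(instrs):
    addrs = [a for a, _ in instrs]
    leaders = {addrs[0]}                       # entry point starts a block
    for i, (addr, target) in enumerate(instrs):
        if target is not None:                 # a branch: its target and the
            leaders.add(target)                # fall-through instruction both
            if i + 1 < len(instrs):            # start new blocks
                leaders.add(addrs[i + 1])
    blocks, cur = [], []
    for addr, _ in instrs:
        if addr in leaders and cur:            # a leader closes the previous block
            blocks.append(cur)
            cur = []
        cur.append(addr)
    if cur:
        blocks.append(cur)
    return blocks

# Four instructions; the one at 0x08 branches back to 0x04.
prog = [(0x00, None), (0x04, None), (0x08, 0x04), (0x0C, None)]
print(split_basic_blocks(prog))   # -> [[0], [4, 8], [12]] (three blocks)
```

Real tools must additionally handle indirect branches, call/return pairs, and function boundaries from the symbol table; this sketch only illustrates the leader-based splitting rule.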
Currently, there are three types of static WCET analysis methods: the Tree-Based Technique, the Implicit Path Enumeration Technique [12,13], and the Path-Based Technique [14]. The Tree-Based Technique translates the structure of a program into a syntax tree, where reduction is performed from the bottom of the tree upwards; when reduction reaches the root of the syntax tree, the WCET of the program is obtained. The Implicit Path Enumeration Technique converts the search for the longest execution path into the problem of finding the maximum number of executions of each basic block, which can be modeled and solved as an ILP. The Path-Based Technique is similar to the Tree-Based Technique, but it uses a Scope Graph to represent the structure of the program hierarchically.
Table 1 provides a comparison of the characteristics of these methods. We adopt IPET in this paper because it achieves good performance in various aspects when the input program is not too complex.
The purpose of control flow analysis is to find the program path with the longest execution time. Since IPET is based on constructing and solving an ILP, the WCET target equation, ignoring hardware configuration, is given by Equation (1): WCET = max Σ_{i=0}^{n} c_i · x_i, where one possible control flow of the program passes through basic blocks 0 to n, and c_i and x_i are the execution time and execution count of basic block i, respectively.
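To make the objective concrete, the following sketch cross-checks it on a hypothetical loop-free CFG: for such a structure, the maximum of Σ c_i · x_i over feasible flows equals the cost of the longest entry-to-exit path, which can be found by plain enumeration. The diamond-shaped CFG and the block costs below are illustrative, not taken from the paper.

```python
# Illustration of the IPET objective on a toy, loop-free CFG: with each
# block executed at most once, max sum(c_i * x_i) over feasible flows
# equals the cost of the longest entry-to-exit path, found here by
# explicit path enumeration (feasible only for tiny acyclic graphs).
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}   # diamond-shaped CFG (successor lists)
cost = {0: 5, 1: 10, 2: 3, 3: 4}           # per-block execution time in cycles

def wcet_by_enumeration(cfg, cost, entry, exit_):
    def paths(node):
        if node == exit_:
            yield [node]
            return
        for succ in cfg[node]:
            for tail in paths(succ):
                yield [node] + tail
    return max(sum(cost[b] for b in p) for p in paths(entry))

print(wcet_by_enumeration(cfg, cost, 0, 3))   # -> 19 (path 0->1->3: 5+10+4)
```

The point of IPET is precisely that it avoids this exponential enumeration: the same maximum is obtained implicitly by the ILP solver from the flow constraints.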
Then, the purpose of control flow analysis shifts to obtaining constraints on the target equation from the program structure. Three kinds of constraints should be established: basic block constraints, loop iteration upper limits, and infeasible path constraints [13]. Basic block constraints are relations on the execution counts of basic blocks. Loop iteration upper limits include the maximum and minimum execution counts of multi-exit loops, the dependent iteration counts of inner loops in nested loops, and other constraints. Common methods for loop bound analysis are abstract interpretation [15] and symbolic execution [16]. Infeasible path constraints exclude control flow in the program that can never be reached, thereby improving the efficiency of the analysis. The detection of infeasible paths is necessary for the reliable analysis of hard, critical real-time systems [17]. Ref. [18] proposed an abstract interpretation technique to analyze infeasible paths in programs, but the technique requires manual intervention in test cases and cannot yet be fully automated.
After obtaining the preliminary WCET target equation and a series of constraints from the control flow of the program, the execution time of each basic block along the longest path is calculated by modeling the micro-architecture of the processor. This is a difficult task because most modern processors use multi-level caches, dynamic branch predictors, pipelines, etc., to improve CPU throughput, and the complex behavior and interaction of these units make the execution time very unpredictable [5,19].
To analyze the influence of an instruction cache on program execution time, ref. [20] proposes a method based on IPET. The basic idea is to estimate whether a cache access hits by analyzing potential cache conflicts within the program, using a Cache Conflict Graph (CCG) to model each cache line.
The dynamic branch prediction used by modern processors is based on history-state (HS) branch prediction algorithms, whose idea is to predict a branch based on its execution history. The implementation uses the Branch History Register (BHR) to form the index into the Branch History Table (BHT). When a branch instruction is executed, the branch prediction for this instruction is looked up and updated in the BHT. Ref. [21] considers the impact of branch prediction on WCET analysis by adding HS nodes to the CFG.
Superscalar out-of-order CPUs can achieve higher performance than in-order CPUs, but their WCET is difficult to guarantee. Ref. [22] modeled basic blocks and analyzed a five-stage pipeline, considering cache effects, by using an execution graph, and avoided enumerating all possible instruction execution times by representing them as intervals. Their pipeline analysis can also be expressed in the form of linear constraints, meaning it can be easily integrated into IPET. Their methods have been adopted by the static WCET analysis tool Chronos.
3. Design
The analysis method in this paper is divided into three steps: compile, analyze, and solve; the process is shown in Figure 2. In the compile step, C source files are compiled and disassembled. We focus on analyzing programs on the ARM architecture, so the output of this step is an object file in ELF format and its disassembled ARM code. In the analysis step, we perform control flow analysis and micro-architecture modeling to collect the target WCET equation and linear constraints, then combine them into the target ILP problem. In the solve step, the ILP problem obtained in the previous steps is solved to get the estimated WCET value.
3.1. Control Flow Analysis
Each node, or basic block, in the CFG is a maximal sequence of consecutive instructions. Let the directed edge variable d_{i,j} represent the count of executions flowing from basic block B_i to B_j, and let x_i denote the execution count of basic block B_i. For any node in the CFG, the count of control flows into the node equals the count of flows out of the node, and both equal the node’s execution count. Thus, the basic block constraint Equation (2) is obtained: x_i = Σ_j d_{j,i} = Σ_j d_{i,j}.
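As an illustration of this flow-conservation rule, the constraints for a small hypothetical CFG can be generated mechanically from its edge list. The variable names d_i_j and x_i below follow our own naming convention, not any particular tool’s output format.

```python
# A sketch of deriving the basic-block flow constraints of Equation (2):
# for every block, the sum of incoming edge counts equals the block count,
# which equals the sum of outgoing edge counts. Edge variables are named
# d_i_j and block variables x_i (an illustrative convention).
def flow_constraints(edges, blocks):
    cons = []
    for b in blocks:
        inc = [f"d_{i}_{j}" for i, j in edges if j == b]
        out = [f"d_{i}_{j}" for i, j in edges if i == b]
        if inc:
            cons.append(f"x_{b} = " + " + ".join(inc))
        if out:
            cons.append(f"x_{b} = " + " + ".join(out))
    return cons

# Diamond-shaped CFG: 0 -> {1, 2} -> 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
for c in flow_constraints(edges, [0, 1, 2, 3]):
    print(c)   # e.g. "x_0 = d_0_1 + d_0_2"
```

Entry and exit blocks would additionally get a constraint pinning their count to one execution per program run.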
The other two kinds of constraints mentioned in
Section 2, namely the loop iteration upper limit and infeasible path constraints, are provided by the user in our analysis.
Figure 3 shows a simple C program and its corresponding CFG, along with the structural constraints and loop iteration bounds. These form an ILP problem together with (1). By solving this preliminary problem, we can estimate the execution count of each basic block without considering micro-architecture details.
Within these constraints, the edge variables d_{i,j} denote the execution counts of the paths between blocks, x_i denotes the execution count of each basic block, and the loop variables bound the execution counts of the loops in the CFG. The identification of loops in the CFG will be described in Section 4.
3.2. Instruction Cache Analysis
In this paper, we use a Cache Conflict Graph (CCG) to model a direct-mapped cache scheme with a FIFO replacement policy. In the direct-mapped scheme, each block of main memory maps to exactly one cache location. We divide the instructions in a basic block into several cache basic blocks based on the size of the physical cache line configured by the user, each represented by a node in the CCG. Each split block is then mapped to a cache line based on its starting address. If multiple cache basic blocks are mapped to the same cache line, conflicts may occur during execution, represented by edges between nodes in the CCG. Assuming there are four cache sets, each with one cache line, Figure 4 shows a program, its cache mapping table, and the CCG of cache set 0. S and E represent the start and end nodes, B_{i.j} denotes the j-th cache basic block of basic block i, and an edge p(i.j, u.v) denotes the count of times cache basic block B_{i.j} is evicted by B_{u.v}.
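The set-mapping step can be sketched as follows, assuming a hypothetical configuration of four sets and a 16-byte line; cache basic blocks that map to the same set are exactly the candidates for CCG conflict edges.

```python
# A sketch of mapping cache basic blocks to sets in a direct-mapped cache.
# The geometry (4 sets, 16-byte lines) and the block addresses below are
# illustrative assumptions, not the paper's actual configuration.
LINE_SIZE, NUM_SETS = 16, 4

def cache_set(addr):
    # set index = (address / line size) mod number of sets
    return (addr // LINE_SIZE) % NUM_SETS

def conflicts(block_addrs):
    # group cache basic blocks by set; sets holding more than one
    # block are where CCG conflict edges arise
    by_set = {}
    for name, addr in block_addrs.items():
        by_set.setdefault(cache_set(addr), []).append(name)
    return {s: names for s, names in by_set.items() if len(names) > 1}

# Hypothetical cache basic blocks "i.j" with their starting addresses.
blocks = {"1.1": 0x00, "1.2": 0x10, "2.1": 0x20, "3.1": 0x40}
print(conflicts(blocks))   # -> {0: ['1.1', '3.1']}
```

Here blocks 1.1 and 3.1 both map to set 0, so executions alternating between them would evict each other, which is what the edge counts p(1.1, 3.1) and p(3.1, 1.1) record.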
The target equation, considering the instruction cache model, is revised to (3): WCET = max Σ_i Σ_{j=1}^{n_i} (c_{i.j}^{hit} · x_{i.j}^{hit} + c_{i.j}^{miss} · x_{i.j}^{miss}), where n_i is the number of cache basic blocks in basic block i, c_{i.j}^{hit} (c_{i.j}^{miss}) denotes the execution time of B_{i.j} when the cache hits (misses), and x_{i.j}^{hit} (x_{i.j}^{miss}) denotes the execution count of B_{i.j} when the cache hits (misses). The estimated WCET is the maximum sum of the execution times of the cache basic blocks over all possible control flows in the CFG.
In the CFG and CCG of a program, the execution count of a basic block equals the execution count of each of the cache basic blocks within it, i.e., x_i = x_{i.j}^{hit} + x_{i.j}^{miss}. The sum of the control flow into a cache basic block equals the control flow out of it, and the edges connected to the start and end nodes are executed exactly once. Based on these observations, the following constraints can be established for the target equation:
3.3. Branch Prediction Analysis
This paper adopts the GAg prediction model for micro-architecture analysis. In this model, the branch predictor first looks up the Global History Register (GHR) to obtain the current global history of the branches when predicting a branch instruction. This history is then used as an index into the Pattern History Table (PHT), which contains the predicted branch directions. Prediction errors lead to pipeline flushes and delays in instruction prefetching. To analyze the impact of branch prediction on program execution time, let m_i denote the count of prediction errors at basic block i and c_{mp} represent the misprediction delay penalty. The target equation is then revised to:
To obtain the constraints bounding the count of mispredictions under the GAg model, information about the execution history of each branch instruction must be recorded. Therefore, we introduce the Historical State (HS) into the CFG. Since there is at most one branch instruction per basic block, each basic block is given an HS attribute that records the possible branch histories when the control flow reaches that block. Assuming that the BHR is a two-bit register, the CFG after adding the HS attribute to each basic block is shown in Figure 5. Let 0 represent a branch that is not taken and 1 a branch that is taken. The starting basic block has no recorded branches, so its HS is always 00. A basic block with two incoming edges can be reached along two paths; if the branch histories along these paths are (not taken, not taken) and (taken, not taken), its HS set is {00, 10}. In an execution, the current HS of each basic block is used as its global history to access and modify the branch prediction result in the PHT, as prescribed by the GAg prediction model.
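A minimal simulation of the GAg scheme, under the assumption of a 2-bit global history register indexing a PHT of 2-bit saturating counters (predict taken when the counter is at least 2), illustrates how mispredictions arise from the shared global history:

```python
# A simplified GAg predictor: a global history register (GHR) indexes a
# pattern history table (PHT) of 2-bit saturating counters. The 2-bit
# history width and initial counter values are illustrative assumptions.
class GAg:
    def __init__(self, ghr_bits=2):
        self.ghr, self.bits = 0, ghr_bits
        self.pht = [1] * (1 << ghr_bits)      # all counters weakly not-taken

    def predict_and_update(self, taken):
        pred = self.pht[self.ghr] >= 2        # predict taken iff counter >= 2
        # update the 2-bit counter, saturating at 0 and 3
        self.pht[self.ghr] = min(3, self.pht[self.ghr] + 1) if taken \
                             else max(0, self.pht[self.ghr] - 1)
        # shift the actual outcome into the global history register
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.bits) - 1)
        return pred

bp = GAg()
outcomes = [True, True, True, True, False]    # loop-like branch behaviour
mispredicts = sum(bp.predict_and_update(t) != t for t in outcomes)
print(mispredicts)   # -> 4
```

Even for this short taken-dominated sequence the predictor needs several executions to warm up, and the final loop exit is mispredicted as well; bounding exactly such counts is the purpose of the HS constraints above.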
After adding the HS information, it still holds that for each basic block in the CFG, the sum of the control flows into the block equals the sum of the control flows out of the block. Let d_{i,j}^h denote the execution count of the flow from B_i to B_j that arrives with historical state h, and let HS_i denote the set of all historical states of basic block B_i. Then, the constraints for the target equation can be obtained:
If there is an edge from basic block B_i to basic block B_j, and the HS of this control flow at B_i is h, then the HS at B_j is obtained by shifting the branch outcome of the edge (i, j) into h. The sum of the incoming control flow from all possible basic blocks under all possible HSs is the execution count of a basic block. These constraints are given in Equation (12).
Let r_{i,j}^{h,n} denote the count of executions for which the HS h remains unchanged along the longest path from B_i to B_j, meaning that branches along the path are either always taken or always not taken, where n represents the branch outcome at B_j. If a basic block takes (1) the branch to the next block and the branch prediction under HS h is incorrect, the count of incorrect predictions must be no larger than the total execution count when the basic block takes the branch. So, the following constraints can be obtained:
3.4. Prefetching Strategy
In the previous two subsections, the analyses of the instruction cache and the dynamic branch predictor were performed independently. When analyzing the cache using the CCG, the impact of branch prediction on program execution time was ignored. Similarly, when analyzing the impact of branch prediction using HS, the influence of cache hits or misses due to the branch predictor was not considered. In reality, instruction cache prefetching occurs based on the predicted branch direction, which affects whether instructions hit the cache and, consequently, influences the program’s execution time. In this paper, we assume that the processor can only allow one branch instruction to be in the prediction stage at any given time. This ensures that all previous instructions have finished executing before the branch instruction is processed.
When branch prediction is incorrect, invalidated cache contents may cause cache misses, resulting in delays. For this scenario, we introduce virtual nodes into the CCG. If the cache basic blocks on the predicted path conflict with other cache basic blocks, a virtual node B′_{i.j}(x) is introduced for cache basic block B_{i.j} to describe the mutual influence between branch prediction and the instruction cache, where x denotes the actual execution outcome of the branch instruction. If a cache basic block along the predicted path does not conflict with other blocks, there is no need to add a virtual node. After adding virtual nodes, the CCG in Figure 4 is modified to the CCG shown in Figure 6. For branch instruction 2, when the actual execution is 0 but the prediction is 1, we do not need to add additional nodes because, in this scenario, instruction prefetching proceeds along the erroneous path and no instructions within the prefetched block conflict with other instructions. Similarly, for branch instruction 3, when the execution is 1 but the prediction is 0, we also do not need to add nodes. The virtual node B′_{3.1}(1) denotes the impact on cache basic block 3.1 when branch instruction 2 actually executes as 1 but the prediction is 0. One of its edges indicates that, because branch instruction 2 is predicted as 0, block 3.1 is already in the cache; when instruction 1.2 is then executed, a cache miss for 1.2 occurs because of 3.1. For the other edge, when the prediction for branch instruction 2 is 0 but the execution is 1, the incorrect prediction results in instruction 3.1 being cached, which actually improves the cache hit rate.
After adding virtual nodes, the target equation is modified to Equation (15), in which one term represents the total time consumed by branch mispredictions and another the total time consumed by cache misses. The equation reflects that, when a branch prediction error occurs, instruction prefetching proceeds along the erroneous path; consequently, when the program executes along the correct path, at least one cache miss will occur. Here, l denotes the number of instructions fetched in a single prefetch operation.
3.5. Pipeline Analysis
The purpose of pipeline analysis is to comprehensively consider the effects of both the cache model and branch prediction and to calculate the execution time of a basic block. Ref. [23] analyzes the impact of out-of-order execution pipelines on program execution time, noting that the execution time cannot be simply represented by the longest path of instruction execution; dependencies between instructions must also be considered. This paper analyzes the five-stage pipeline architecture commonly used in ARMv8 processors: Instruction Fetch (IF), Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). We use execution graphs to model and analyze out-of-order pipelines. Execution graphs illustrate the dependencies between instructions using directed edges, as follows:
The dependency between different stages of the same instruction. The completion of the previous stage is required before proceeding to the next stage.
The dependency between different instructions in the same pipeline stages. Earlier instructions in the program have a higher priority.
The data dependency between instructions.
Queuing for idle Instruction Fetch Buffers (I-buffers) and Reorder Buffers (ROB).
Assuming the size of the I-buffer is 2 and the size of the ROB is 4, a program and its corresponding execution graph are shown in Figure 7. Solid edges represent the dependencies between pipeline stages and the data dependencies between instructions, further edges represent the structural dependencies induced by the sizes of the I-buffer and ROB, respectively, and dashed edges indicate instructions competing for the same functional unit.
Since the analysis of the cache is performed using the CCG, the cache is ignored when establishing execution graphs; we assume that all instructions hit the cache. However, errors in branch prediction can cause the instruction pipeline to prefetch instructions along an incorrect path. To address this, we add instructions with the same size as the Reorder Buffer (ROB) along the erroneous path after the mispredicted branch instruction in the execution graph.
For each node of the execution graph, its earliest and latest finish times can be obtained using Algorithm 1; for a node i, est_i denotes its earliest start time, eft_i its earliest finish time, lst_i its latest start time, and lft_i its latest finish time. The latest finish time of the last node is the estimated WCET of the basic block. The algorithm relies on two procedures. The first calculates the latest ready, start, and finish times of the nodes. The latest start time of a node depends on its latest ready time, which in turn depends on the latest finish times of its predecessors and competitors. If a competitor has a lower priority than the instruction in question, it is excluded, as are competitors whose execution times do not overlap; if a competitor has a higher priority, it is assumed that all nodes preceding this competitor delay the node. Once a node’s latest start time is obtained, the latest ready times of its successor nodes are updated. The second procedure calculates the earliest ready, start, and finish times of the nodes; unlike the first, it only considers competitors that conflict with the node’s ready time and hardware resources.
Algorithm 1: Basic Block WCET Analysis Algorithm
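Algorithm 1 itself handles functional-unit contention and execution-time intervals; as a much-simplified sketch, if contention is ignored, the earliest finish times reduce to a longest-path computation over the dependency edges in topological order. The node latencies and edges below are hypothetical.

```python
# A much-simplified sketch of the timing calculation on an execution
# graph: ignoring contention, the earliest finish time of each node is a
# longest-path computation over the dependency edges in topological order.
def earliest_finish(nodes, edges, latency):
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    eft = {}
    for n in nodes:                      # nodes assumed listed in topological order
        est = max((eft[p] for p in preds[n]), default=0)
        eft[n] = est + latency[n]
    return eft

# Three instructions, each with IF and EX stage nodes; EX of i2 needs
# the result of i1 (a data dependency).
nodes = ["i1.IF", "i1.EX", "i2.IF", "i2.EX", "i3.IF", "i3.EX"]
edges = [("i1.IF", "i1.EX"), ("i2.IF", "i2.EX"), ("i3.IF", "i3.EX"),
         ("i1.IF", "i2.IF"), ("i2.IF", "i3.IF"),   # in-order fetch
         ("i1.EX", "i2.EX")]                        # data dependency i1 -> i2
latency = {"i1.IF": 1, "i1.EX": 3, "i2.IF": 1, "i2.EX": 1,
           "i3.IF": 1, "i3.EX": 2}
print(earliest_finish(nodes, edges, latency)["i3.EX"])   # -> 5
```

The data dependency makes i2.EX wait for the 3-cycle i1.EX, illustrating why the basic block time is not simply the sum or maximum of individual stage latencies; Algorithm 1 additionally propagates latest times and contention delays, which this sketch omits.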
4. Implementation
This section provides a description of the implementation process of the tool, including the establishment of the CFG and CCG, the extension of the CFG, and WCET calculation.
Each basic block in the CFG must indicate whether it contains a branch instruction. For blocks with branch instructions, there are two outgoing edges representing the two possible execution directions of the branch instruction. In addition to edges, other information, such as the basic block index, must also be stored. The data type of a basic block is shown in
Table 2.
To analyze the impact of the cache on program execution time, basic blocks are divided into multiple cache basic blocks. If a basic block is smaller than the size of a cache line configured by the user, it is treated as a single node. Otherwise, the basic block is divided into multiple cache basic blocks. The data type for a cache basic block is shown in
Table 3. Each cache basic block must record information such as the starting instruction address, the index of the cache set it is mapped to, and the index of the basic block it belongs to. This data type can also be used to describe nodes in the CCG. Consequently, the edges in the CCG only need to record the indices of the start and end nodes, and the count of conflicts.
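Since the exact fields of Table 2 and Table 3 are not reproduced here, the following is only a hypothetical Python rendering of such data types, with field names of our own choosing:

```python
# Hypothetical renderings of the CFG and CCG node data types described in
# the text; the actual field names and layout in Tables 2 and 3 may differ.
from dataclasses import dataclass

@dataclass
class BasicBlock:
    index: int                   # basic block index within the CFG
    has_branch: bool = False     # whether the block ends in a branch instruction
    taken: int = -1              # successor block index if the branch is taken
    fallthrough: int = -1        # successor block index otherwise

@dataclass
class CacheBasicBlock:
    start_addr: int              # address of the block's first instruction
    cache_set: int               # index of the cache set it maps to
    parent_block: int            # index of the basic block it belongs to

bb = BasicBlock(index=2, has_branch=True, taken=4, fallthrough=3)
cb = CacheBasicBlock(start_addr=0x40, cache_set=0, parent_block=2)
print(bb.taken, cb.cache_set)   # -> 4 0
```

With this layout, CCG edges only need to store the indices of their endpoint nodes and a conflict count, as the text notes.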
For branch prediction analysis, the introduction of HS requires the extension of CFG. A new node is used to store the branch history information for nodes in the CFG. The data type is shown in
Table 4. The main steps in analyzing branch prediction are collecting the set of instructions on the erroneous path, constructing the HSs of the nodes, and analyzing the HSs of the edges to feasible nodes. One helper function collects instructions along the predicted path, another collects control flow information between adjacent branch instructions under a specific HS, and a third uses BFS to collect the information of all nodes reachable from a given CFG node.
The pipeline model analysis is implemented using execution graphs, with the goal of determining the execution time of a basic block. When computing the execution time of a basic block, it is insufficient to consider the block in isolation; the block’s context within the CFG must also be considered. For example, if the instruction fetch buffer size is 2 and the reorder buffer size is 4, then before the execution of a basic block there may be five instructions waiting in the pipeline and one instruction being executed. Instructions may also have dependencies on instructions in nearby blocks: instructions that depend on the previous basic block are referred to as the prologue, while those with dependencies on the subsequent basic block are called the epilogue. By traversing paths in the CFG, we can establish the context between cache basic blocks, which allows us to account for dependencies in the contextual environment when analyzing the execution graph.
In the implementation of the pipeline analysis, one function calculates the execution time of each basic block in the CFG, another analyzes paths within the CFG, focusing on branch prediction and the contextual environment of the basic blocks, a third establishes the execution graph for each basic block, and a final function applies Algorithm 1 to derive the final WCET target equation.
IPET solves problems using ILP. The overall construction process for the ILP problem is shown in
Figure 8. There are two branches related to branch prediction: the function after the first branch extracts constraints related to branch prediction, while the function after the second branch extracts constraints related to the erroneous-path prefetching strategy. The roles of the other functions are as follows: one generates the linear target equation based on the micro-architecture configured by the user, invoking the appropriate routine to construct the target equation for correct branch prediction or for branch prediction errors, or establishing the target Equation (15) when branch prediction, the instruction cache, and their interactions are all considered; one generates constraints derived from the control flow information; one generates constraints derived from branch prediction; one generates constraints for cache hits and misses, as well as constraints for execution along erroneous paths; and one reads user-input constraint files and parses the linear constraints for the ILP problem.
Figure 8.
Process of implementing the ILP problem.
The ILP designed in this paper is based on the ILOG/CPLEX format [24]. Two ILP solvers are capable of solving problems in this format: CPLEX, a commercial product, and lp_solve, a free and open-source alternative. This paper uses lp_solve 5.5.0.4 as the third-party ILP solver for calculating the WCET.
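As a sketch, a tiny problem can be emitted in this LP text format as follows; the variable names and constraints are illustrative, and a real run would fill the constraint section from the analyses described above.

```python
# A sketch of emitting a tiny maximization problem in the CPLEX-style LP
# text format. The objective coefficients, constraints, and variable
# names below are illustrative placeholders.
def to_lp(costs, constraints, int_vars):
    lines = ["Maximize",
             " obj: " + " + ".join(f"{c} {v}" for v, c in costs.items()),
             "Subject To"]
    lines += [f" c{k}: {c}" for k, c in enumerate(constraints)]
    lines += ["Generals", " " + " ".join(int_vars), "End"]
    return "\n".join(lines)

lp = to_lp({"x0": 5, "x1": 10},          # objective: 5 x0 + 10 x1
           ["x0 - x1 = 0", "x0 <= 8"],   # flow and bound constraints
           ["x0", "x1"])                  # execution counts are integers
print(lp)
```

The resulting text declares the WCET objective, the constraint rows, and the integrality of the count variables, in that order, which is the general shape of the files handed to the solver.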
5. Results and Discussion
Figure 9 gives the estimated WCET in cycles for the set of Mälardalen benchmarks [25] supported by our WCET analysis tool. The processor configuration used in the experiments is listed in Table 5. Some of the user-provided constraints used for the benchmarks are shown in Table 6.
Among the Mälardalen WCET benchmarks, the insertion sort program (insertsort) has the largest estimated WCET, in contrast to the binary search program (bs) and the finite impulse response filter program (fir). This is because we changed the length of the target reversed array in insertsort to 1024, whereas the original benchmark only sorts 15 elements.
Two boards equipped with ARMv8-A CPUs, the Raspberry Pi 4 Model B from Sony, Pencoed, Wales, and the Firefly ROC-RK3568-PC-SE from T-CHIP, Zhongshan, Guangdong Province, China, are also used to obtain the observed WCETs of the benchmarks in a physical environment. Execution time in processor cycles is obtained by reading the generic timer register of the ARMv8-A CPU; the difference between the values before and after execution is the measured execution time of the program. Five measured execution times are averaged to obtain the final observed WCET. We configure the analysis tool so that the modeled micro-architecture is similar to the target processor. The two sets of estimated and observed WCETs of the benchmarks are shown in Figure 10 and Figure 11.
Given a C program and a processor configuration, the static analysis method guarantees that the estimated WCET is not less than the program’s actual execution time for any input. As shown in the figure, the ratio of the estimated to observed WCET values for all benchmarks is greater than 1, which ensures the reliability of the analysis tool. The analysis result is considered precise if the estimated value is close to the observed value.
However, some estimated WCET values for benchmarks are up to 2–3 times larger than the observed WCET. There may be two reasons for these differences. First, our micro-architecture modeling may not be detailed enough. For instance, features like data caches and Translation Lookaside Buffers (TLBs) supported by modern processors are not included in our model, leading to a lack of constraints. Second, the differences may be attributed to the pessimistic nature of the static analysis method.
6. Conclusions
This paper designs a WCET analysis tool that includes using a CCG to address cache conflicts, employing a control flow analysis method based on historical states to analyze branch predictors, and examining instruction cache prefetching strategies based on erroneous branch prediction paths by adding virtual nodes to the CCG. When calculating the execution time of a basic block, this paper integrates instruction pipelines into the analysis and uses execution graphs to determine the execution time of individual basic blocks.
After implementation, we evaluate the performance of the analysis tool by comparing the estimated WCETs of the benchmarks with the values observed on two boards. The ratio of estimated to observed WCET values is greater than 1 for all benchmarks, demonstrating the tool’s reliability. Some estimated values differ considerably from the observed values; this discrepancy is mainly due to our incomplete modeling of modern processor features.
Future works will mainly include the following:
Model and analyze caches shared among multiple cores to improve analysis accuracy.
Optimize the micro-architecture model to make it more suitable for modern processors.
Adopt algorithms with lower time complexity so that the tool can analyze more complex programs.
Use estimated WCET bounds to perform schedulability analyses for programs [26].