
Received 21 April 2023, accepted 6 May 2023, date of publication 11 May 2023, date of current version 17 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3275437

Validation of Task Scheduling Techniques in Multithread Time Predictable Systems
ERNEST ANTOLAK AND ANDRZEJ PUŁKA, (Senior Member, IEEE)
Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland
Corresponding author: Ernest Antolak ([email protected])
This work was supported in part by the European Social Funds through ‘‘CyPhiS—the program of modern Ph.D. studies in the field of
cyber-physical systems’’ under Project POWR.03.02.00-00-I007/17-00, and in part by the Ministry of Science and Higher Education
under Grant BKM-573/RAu-11/2022 and Grant BK-246/RAu-11/2022.

ABSTRACT This paper presents a simulation-based environment for verification of a static task scheduling
methodology in a time predictable system. Different types of processed tasks are distinguished, and
a unified system design methodology is presented, consisting of the selection of the real time system configuration, task
mapping, scheduling, and generation of a sequence of task identifiers to control the interleaved pipeline.
An original Worst Case Timing Analyzer (WCTA) has been developed to automate the design process.
The methodology was introduced into the original PRET (PREcision Timed) architecture recently presented
(Antolak and Pulka, 2020), (Antolak and Pulka, 2021). The PRET system was implemented on a Virtex7
FPGA (Field Programmable Gate Array) platform. A dedicated verification environment is proposed that
allows on-line real time system monitoring, analysis of timing parameters, and comparison of the results with
initial requirements and design constraints. The practical experiments presented in the paper confirmed the
correct operation of the authors' hardware architecture. The obtained results confirmed the validity of the
proposed scheduling method and the concept of calculating the execution times of tasks before they are
started, which allows for optimal hardware matching to the tasks to be performed.

INDEX TERMS Real-time systems, timing simulation, dynamic scheduling, multitasking, pipeline interleaving, multithreading.

The associate editor coordinating the review of this manuscript and approving it for publication was Laxmisha Rai.

I. INTRODUCTION
Contemporary processors and multiprocessor systems are capable of processing very complex software algorithms, but they exhibit a very high degree of hardware complexity. In addition, the rapid development of semiconductor technology means that these systems are clocked with ever faster clock signals. Paradoxically, these technological advances are causing timing predictability issues to arise for such systems [3]. Initially, problems with processor timing predictability were solved at the software layer by creating real-time operating systems (RTOSs) [4], [5]. Unfortunately, this approach to the problem is fraught with a great deal of inaccuracy, since task handling is done at a very high level of abstraction [6], [7]. Most programming languages, such as C [8] or C++, do not give direct control over the elapsed time of tasks [9], which, combined with a large number of abstraction levels, allows only very rough control of timing [10], [11], [12].
The paper discusses scheduling methodologies in the system presented in [2]. It focuses on the generation of sequences of task identifiers, which is the key issue in interleaved pipeline processing of threads. We introduced a new category of tasks with the highest priority, called strong hard timed tasks, and we adjusted the scheduling process according to the type of task. It was shown that the appropriate order and frequency of tasks in a core's pipeline determine overall system efficiency and predictability. The proposed real-time system was implemented in the Verilog language, and its synthesis and implementation were carried out in the Vivado 2018.3 environment. Full simulation of such a system consumes a huge amount of time, and it is practically impossible to get a complete picture of the system's behavior. Mapping the hourly operation of

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.


VOLUME 11, 2023 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ 46979
E. Antolak, A. Pułka: Validation of Task Scheduling Techniques in Multithread Time Predictable Systems

a system implemented in an FPGA requires more than a week of simulation. Hence, the idea of developing a dedicated program to enable efficient timing analysis of the developed system emerged. Moreover, an offline Worst Case Timing Analyzer (WCTA) dedicated to the designed system was created. WCTA delivers many timing parameters that validate the developed methodology. It was assumed that regardless of when a task is started, it should be completed within the deadline. The proposed WCTA analyzer is based on a time model of the pipelined processing path with interleaved threads. The authors presented the subsequent stages of this processing in the paper. A series of experiments proved the correctness of the approach.
The main contributions of this paper are:
• Proposing an Interleaving Cycle Register (ICR) of task identifier sequences;
• Developing an extended task model;
• Dividing the task scheduling process into stages;
• Unifying the task mapping method by developing a universal BLTS (Balanced Load Task Scheduling) algorithm;
• Proposing a method of task scheduling by composing the contents of the ICR (hardware alternative for RTOS);
• Developing a pre-layout Worst Case Timing Analyzer (WCTA) to perform a timing analysis of the designed system, and in particular to estimate the execution times of individual tasks;
• Final hardware validation of the system.
The paper consists of six sections. First, the related work is briefly discussed, then the main elements of the proposed multitask system architecture and the assumed task model are recalled. In the fourth section the main stages of the scheduling process implemented in the methodology are described. Section five covers the experiments and discussion of the obtained results. The paper is summarized with the final conclusions.

II. RELATED WORK
The idea of time-predictable systems, the PREcision Timed machines (PRETs), was formulated and presented as early as 2007 by Edwards and Lee [13]. It assumed that such a system should always, regardless of load, perform scheduled tasks on time. The concept was developed by many researchers: Thiele and Wilhelm [14] formulated a set of recommendations and guidelines to facilitate the design of time-critical embedded systems. Ip and Edwards [15] proposed extending the command list of RISC processors to include a DEADLINE instruction to enable time-critical task control. The CHESS (Center for Hybrid and Embedded Software Systems) [16] group at UC Berkeley, led by Prof. Edward Lee, developed the idea of the PRET processor, and proposed, among other things, a way to predictably access memory by proposing a method called "memory wheel". Then, the CHESS group, in cooperation with centers from Germany (Saarland University) and Sweden (Linköping University), developed their ideas. This work resulted in the 2010 publication [43] of the new PRET architecture that supported the concurrent execution of programs. This architecture allowed integrating independent components while maintaining the temporal properties of tasks. This solution used a thread interleaving mechanism, scratchpad memories, and a composable and predictable DRAM (dynamic random-access memory) controller. Another paper from this research team [44] presented a substantial implementation of a precision-timed machine, the PTARM architecture. This architecture is based on the ARM family processors and implements a subset of the ARMv4 microarchitecture. The PTARM solution improved system performance by using a refined thread-interleaved pipeline, an exposed memory hierarchy, and a repeatable DRAM memory controller. In 2014, the CHESS group developed a new platform for processing mixed-criticality tasks called FlexPRET [42]. This paper distinguished hard real-time threads (HRTT) and soft real-time threads (SRTT). The FlexPRET architecture generated by the CHISEL tool [46] provided hardware-based isolation to HRTT tasks, while SRTT tasks efficiently utilized available processing resources. The authors of [42] introduced dedicated timing instructions to an ISA based on the RISC-V processor family and proposed a thread scheduler that kept control over the threads processed by the pipeline. The entire structure was also implemented in an FPGA device.
The paper [8] addressed the analysis of time-predictable systems at a higher level of abstraction. The authors of this work introduce a new lightweight and concurrent language, PRET-C (Precision Time version of C). PRET-C, thanks to its syntax, synchronous semantics, and very simple mechanisms for handling time, is well suited for predictable PRET architectures. The authors also proposed a hardware accelerator for PRET-C execution over soft-core processors allowing time-predictable execution of tasks with high efficiency. This time-predictable architecture was called ARPRET.
Yet another interesting solution is presented in another paper [45], in which the authors presented their own time-predictable architecture called ARPA-MT. The ARPA-MT architecture consists of 3 main elements: the main processing unit and two coprocessors, Cop0-MEC, responsible for memory operations management and for handling exceptions and interrupts, and Cop2-MEC, implementing and accelerating RTOS functions in hardware. The ARPA-MT [45] structure contains a very interesting 5-stage pipeline with the first two stages (IF and ID) replicated for each thread, while the other 3 stages of the pipeline (EX, MA and WB) process different threads that are interleaved. This solution presented interesting results of hardware-software synergy while designing real-time systems.
A group of researchers from Denmark, Austria, France, and the US presented the new original architecture of the multi-core processor called Patmos [17]. Then in 2015, a group of 24 European researchers presented the T-CREST project [3], which demonstrated a multi-core approach to the original PRET system concept. Problems arising from


scheduling tasks running in parallel in multicore systems were published in [18], [19], [20], [21], [22], and [23], along with attempts to solve them.
It is noticeable that the research carried out on time predictable systems for many years involved the analysis of very different issues related to both hardware [12], [14], [17], [24], [25], [26] and software [18], [19], [27], [28] development. Researchers often look for general solutions [8], [9], but many works concern dedicated solutions [17], [25]. In [29], the authors provide an overview of many issues related to the design of embedded time-predictable systems.
Authors of some papers [30], [31] analyzed the problem of load-balanced scheduling and proposed techniques enabling reduction of the energy consumed by the systems. In [32] one can find a technique for the effective utilization of processing elements in clusters of processors with shared L1 cache memory, based on optimized synchronization and communication between the system components. Some works demonstrated the application of heuristic methods and AI algorithms in the task scheduling process [33], [34], [35].
In the paper [1], an original real-time system solution was proposed. It was based on thread interleaved pipeline processing. The solution allows flexible configuration of the pipeline and uses a set of dedicated scheduling algorithms. In a subsequent paper [2], the problem of energy optimization in time-predictable systems was analyzed and the authors proposed a new solution to this problem in the form of enhanced and dedicated scheduling algorithms used for various design goals and constraints. In the present paper, the problem of generating sequences of thread identifiers in the process of scheduling tasks of different types is addressed more thoroughly. The methodology is based on analyzing the timing conditions with a specially constructed Worst Case Timing Analyzer (WCTA), which makes it possible to accurately analyze the timing of the system at the design stage. Unlike other approaches where the authors analyze the Deadline Miss Ratio (DMR) [36], this solution does not allow any task to exceed its deadline.

III. OVERVIEW OF THE SYSTEM
Every task executed in the system must be completed before its strictly defined execution time (deadline) regardless of the system load. Such tasks are inherently asynchronous since they can be triggered by external interrupts, the occurrence of which is very difficult or even impossible to predict. Before starting the system, only the number and type of tasks being executed are known, while the number of different tasks and the timing of their startup are unknown. Therefore, to guarantee the above assumptions, it is necessary to find the most heavily loaded task sequence on the system and prove before system startup that the task will be executed on time.
As an example of the application of such a system, one can recall an automotive safety system (see the diagram in Fig. 1). Such a system can run a cyclic task (Task Z1), invoked every 25 µs by a hardware timer. This task calculates the approximate braking distance at a given speed and weather conditions, and all operations must be completed in less than 20 µs. Additionally, each time the driver presses the brake, the system must calculate whether, and if so, which wheels have skidded. To make sure that safety systems such as ABS (Anti-lock Braking System) or traction control react correctly, this event must be detected no later than, for example, 10 µs after the brake pedal is pressed (see Task Z2). The wheels can also get into a skid after braking has started, so checking whether the wheels have gotten into a skid should take place all the time the brake is depressed (high state of the signal Z2).
FIGURE 1. An example of the system run.

A. SYSTEM ARCHITECTURE
The system architecture uses the thread interleaving method to avoid data and control hazards. In that case, there is no need for complex forwarding and jump prediction circuits. The thread interleaving method has been extended by the authors [1], [2] and has also been used to exchange data between tasks. The proposed architecture is based on pipeline processing. It can consist of 1 to 8 reconfigurable cores. It is possible to reconfigure the pipelines, which can contain from 5 to 12 stages [2], depending on application requirements. As described in [2], the basic five stages of the pipeline can be expanded to include sub-stages. Thus, the IF (Instruction Fetch) stage can contain three sub-stages (Select Bank and Instruction Address, Instruction Fetch, Select Instruction); the ID (Instruction Decode) stage can be divided into two sub-stages (Select General Purpose Register Bank, Instruction Decode); the EXE (Execution) stage can be expanded to three processing cycles (Shift, ALU, EXE); similarly, the MA (Memory Access) stage can also be expanded to three cycles (Select Bank and MEM Data Address, MEM Data Fetch, Select MEM Data).
Interleaving threads requires the introduction of multiplexers to switch task data; therefore, the authors proposed to place the multiplexers in separate stages of the pipeline. This minimizes the impact of the number of tasks on the maximum operating frequency of the system. The microarchitecture of our system is modeled after the ISA (Instruction Set Architecture) of the ARM processor family. The ISA was enhanced with special timing instructions [1]:
• AD counter_number – the instruction activating the appropriate deadline counter;
• DD counter_number – the instruction deactivating the appropriate deadline counter;


• SD counter_number value – the instruction loading the value to the deadline counter (setting the deadline);
• WFD counter_number value – the instruction suspending the execution of the program until the deadline counter reaches 1.
The system structure consists of one or more cores connected to a common bus. The bus is used to exchange information between threads. The system's cores communicate with external peripherals via input/output ports, which are mapped as part of the data memory of each thread. Every thread can have its own individual input/output ports. In addition, each thread is started by a corresponding individual external signal. The procedure of launching a given thread can be implemented either in hardware (using an external interrupt) or in software, by another processor thread executing the corresponding instruction. The system architecture is based on the authors' previous solutions [1], [2], [29]. The current implementation of the system allows simultaneous processing of up to 255 different tasks, with theoretically any distribution of these tasks between cores (the theoretical maximum number of cores is also 255).
A configurable number of supported threads is provided for every core. Because these threads are not dependent on each other, they must have their own independent resources such as memory and register files. In order to make the cores as compact as possible, only these independent resources are additional elements dedicated to a thread. The remaining elements of the core, such as the processing pipeline, are shared by threads mapped and executed in a single core. Fig. 2 presents the structure of a configurable core. The size of a core depends on the number of threads processed in a given core. For this purpose, a special parameter is defined for every core. The value of this parameter determines the number of register files, memory banks and the size of the multiplexers responsible for switching threads' resources. The rest of the core is designed to be independent of this parameter. So, the increase in resource requirements as the number of supported threads rises is minimal. If the system is built of several cores, some tasks can be processed concurrently. Moreover, the configurable cores may differ in their internal structures, so the hardware architecture of a designed system can be adjusted to the number of processed tasks and their timing requirements [1], [2].
FIGURE 2. The structure of the core.

B. PIPELINE CYCLE
As mentioned in the above section, every core has a dedicated number of assigned threads. The threads work in the interleaved scheme [38]. The interleaving eliminates the impact of the pipeline processing on the timing unpredictability. The order and frequency of the threads being processed are configurable to allow the best possible match between the core and the tasks being executed. The tasks performed by the threads are switched (interleaved) according to the order stored in a special Interleaving Cycle Register (ICR). The register contains threads' identifiers, and its length is also adjustable. With the appropriate ICR design, some threads can be processed more frequently than others. This mechanism allows the core to match tasks in such a way that threads with very strict timing constraints, and those with less rigorous constraints, can run within a single core. Fig. 3 shows an example of the contents of the ICR in a simplified five-thread core. The ICR has a length of 6 entries and processes threads 1, 2, 3, 1, 4 and 5. Assuming that the timing requirements for thread 1 are much stricter, this thread must be executed twice as often as the other threads 2, 3, 4, and 5. Therefore, the identifier of this thread appears more frequently in the ICR. The entire sequence is repeated from the beginning of the system's operation until it is shut down.
FIGURE 3. An example of ICR content.
To weaken the dependence between the number of threads processed by a single core and the maximum frequency of the clock, and to simplify the inspection of timing predictability of the system, the pipeline can be extended with additional stages, but one needs to remember that a core with an extended pipeline must process the appropriate number of threads [38].

C. TYPES OF THE PROCESSED TASKS
In the designed system, all tasks are of the hard type [39], [40], which means that no thread is allowed to miss


completion before its deadline time. The scheduling process calculates the maximum and minimum execution time for each task, which under certain conditions is always less than the deadline time. To facilitate the scheduling process, 3 types of tasks were introduced. They differ in the way they access the data exchange bus:
1) NCT: independent tasks that do not cooperate with one another;
2) CT: cooperating tasks in which the time of data exchange fluctuates between certain minimum and maximum values;
3) CTHM: cooperating tasks with a fixed data exchange time.
The latter category of tasks (CTHM) allows minimizing the difference between the maximum and minimum execution time of a task at the expense of higher system load (fewer such tasks can be assigned to a single core of the system).
Moreover, another special group of tasks denoted by SHT (Strong Hard Timed) was introduced into the scheduling process. Only tasks in the NCT or CTHM category can belong to this group of tasks (Fig. 4). SHT tasks have the highest priority during the scheduling process (especially since the proportion of such tasks is relatively small). Introducing such a priority minimizes the difference between the maximum and minimum execution time for such a task.
FIGURE 4. Various types of the processed tasks.
FIGURE 5. Main stages of the tasks' scheduling process.
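The eligibility rule for the SHT group stated above can be written down in a few lines. The following Python sketch is illustrative only; the names `TaskCategory` and `may_be_sht` are hypothetical and are not taken from the authors' tools.

```python
from enum import Enum


class TaskCategory(Enum):
    """Task categories, distinguished by how they use the data exchange bus."""
    NCT = "independent task, no cooperation"
    CT = "cooperating task, data exchange time varies between min and max"
    CTHM = "cooperating task, fixed data exchange time"


def may_be_sht(category: TaskCategory) -> bool:
    """Return True if a task of this category may be marked Strong Hard Timed.

    Per the text, only NCT and CTHM tasks can belong to the SHT group.
    """
    return category in (TaskCategory.NCT, TaskCategory.CTHM)


for category in TaskCategory:
    print(category.name, "may be SHT:", may_be_sht(category))
```

A CT task is excluded because its data exchange time fluctuates, which works against the goal of a narrow gap between minimum and maximum execution time.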

D. MODEL OF TASKS
The task model was extended [2] with two new parameters (flags) related to the new types of tasks introduced. Thus, the model of a given task unambiguously indicates its type and processing. Finally, the assumed model of the task processed in the proposed system is the following:
$$T_i = \{C_i, M_i, D_i, CTHM_i, SHT_i\} \tag{1}$$
where:
Ci: number of standard (without Mi) instructions of the i-th task;
Mi: number of memory access instructions that refer to the data of other threads;
Di: maximum acceptable execution time (deadline) of the i-th task;
CTHMi: flag (valid for Mi > 0) indicating whether the i-th task is of type CT ('0') or CTHM ('1');
SHTi: flag indicating whether the i-th task is strong hard timed ('1').
The previously defined TFi (Task Frequency) parameter [1], [2], which allows determining the requirement for computational power for each task, was also remodeled. Changes introduced into the micro-architecture of the system made it necessary to modify the TFi parameter. Its final version is presented by equation (2):
$$TF_i\,[\mathrm{MHz}] = \frac{C_i + M_i \cdot \left[\frac{M_{dur}}{Min_{indistance}}\right] + 2}{D_i\,[\mu\mathrm{s}]} \tag{2}$$
where:
Mdur: number of clock cycles required to execute an instruction of data exchange with another thread;
Minindistance: interleave depth corresponding to the minimum allowed distance between pipeline stages processing the same task at a given moment of time.
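As a worked illustration of the task model (1) and the task frequency (2), the sketch below evaluates TFi for one made-up task. It rests on two assumptions that the text does not confirm: the square brackets in (2) are read as rounding the ratio up to a whole number, and the constant 2 is added once per task. The class `Task`, the function `task_frequency_mhz` and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Task model T_i = {C_i, M_i, D_i, CTHM_i, SHT_i} from equation (1)."""
    c: int               # C_i: standard instructions (excluding M_i)
    m: int               # M_i: memory accesses referring to other threads' data
    d_us: float          # D_i: deadline in microseconds
    cthm: bool = False   # CTHM_i: meaningful only when m > 0
    sht: bool = False    # SHT_i: strong hard timed flag


def task_frequency_mhz(task: Task, m_dur: int, min_indistance: int) -> float:
    """Evaluate TF_i in MHz under the assumed reading of equation (2)."""
    # ASSUMPTION: the bracketed ratio is rounded up to whole thread slots.
    slots_per_exchange = math.ceil(m_dur / min_indistance)
    # ASSUMPTION: the constant 2 is added once per task, not per exchange.
    return (task.c + task.m * slots_per_exchange + 2) / task.d_us


# Made-up numbers: 100 standard instructions, 4 exchanges, 20 us deadline.
example = Task(c=100, m=4, d_us=20.0, cthm=True)
tf = task_frequency_mhz(example, m_dur=10, min_indistance=4)
print(f"TF = {tf:.2f} MHz")
```

Instructions divided by microseconds give MHz directly, which is why no unit conversion appears in the function.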


IV. TASKS' SCHEDULING PROCESS
The proposed tasks' scheduling procedure consists of several steps (Fig. 5). Before starting, it is necessary to complete a preliminary, preparation stage of the scheduling process, during which we determine all the necessary parameters describing the tasks (1) and calculate the task frequency TFi parameter for every task (2).
The first stage of the scheduling procedure is called task mapping. During this stage tasks are assigned to processing cores. Usually, a balanced load of cores is sought, i.e., the sums of TFs on the individual cores should be as close to each other as possible. In this stage the minimum theoretical operating frequency of each core (Fsysj) should also be estimated.
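The balance criterion of the mapping stage can be checked numerically. The sketch below sums the TFi values per core for a given mapping; treating that sum as the lower bound for Fsysj is an assumption made here for illustration (it is in line with formula (3) later in this section, but the text does not state how Fsysj is estimated). The function name `core_loads` and the task values are hypothetical.

```python
def core_loads(tf_mhz: dict[str, float], mapping: dict[str, int], n_cores: int) -> list[float]:
    """Sum the TF values (MHz) of the tasks mapped to each core."""
    loads = [0.0] * n_cores
    for task_id, core in mapping.items():
        loads[core] += tf_mhz[task_id]
    return loads


# Made-up task frequencies (MHz) and a candidate mapping onto two cores.
tf_mhz = {"T1": 12.0, "T2": 3.0, "T3": 7.5, "T4": 6.5}
mapping = {"T1": 0, "T2": 0, "T3": 1, "T4": 1}

loads = core_loads(tf_mhz, mapping, n_cores=2)
# ASSUMPTION: each per-core sum is taken as the minimum theoretical Fsys_j.
imbalance = max(loads) - min(loads)
print("per-core load [MHz]:", loads, "imbalance [MHz]:", imbalance)
```

A smaller imbalance means the cores can be clocked closer to a common frequency without one of them being the bottleneck.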
The second stage is called task arrangement. During this step of the scheduling process, based on TFi parameters, the sequences of tasks' identifiers are created. Each sequence is stored in the appropriate ICR of a given core. The length of the ICR and the sequence are selected so as to process each task at the appropriate frequency.
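To make the effect of an ICR sequence concrete, the sketch below takes the ICR contents from the earlier example (threads 1, 2, 3, 1, 4, 5) and computes how often each thread receives a pipeline slot, together with the smallest cyclic distance between two slots of the same thread. It assumes that one ICR entry is issued per clock cycle; the 60 MHz core frequency and both function names are hypothetical.

```python
from collections import Counter


def slot_rates_mhz(icr: list[int], core_freq_mhz: float) -> dict[int, float]:
    """Rate at which each thread gets a slot, assuming one ICR entry per clock cycle."""
    counts = Counter(icr)
    return {tid: core_freq_mhz * n / len(icr) for tid, n in counts.items()}


def min_cyclic_distance(icr: list[int], tid: int) -> int:
    """Smallest number of cycles between consecutive slots of one thread (cyclic ICR)."""
    positions = [i for i, t in enumerate(icr) if t == tid]
    best = len(icr)  # a thread that appears once returns after a full cycle
    for k, pos in enumerate(positions):
        nxt = positions[(k + 1) % len(positions)]
        gap = (nxt - pos) % len(icr)
        if gap:
            best = min(best, gap)
    return best


icr = [1, 2, 3, 1, 4, 5]
rates = slot_rates_mhz(icr, core_freq_mhz=60.0)
print(rates)                          # thread 1 gets twice the rate of the others
print(min_cyclic_distance(icr, 1))    # distance between the two slots of thread 1
```

Comparing each rate with the TFi of the task run by that thread gives a quick feasibility check for a candidate sequence, and the cyclic distance can be compared with the interleave depth Minindistance.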
The initial simulation stage runs a special application (the WCTA developed by the authors) that estimates several factors of the system. First, the maximum number of clock cycles necessary to complete a task is calculated for every task mapped in a given core. Then, based on the frequencies Fsysj determined in the task mapping stage, the maximum execution time of each task is calculated. If the maximum execution time of any task is greater than its Di time, a new frequency (Fszej) of the core in which the task is located is calculated. This frequency is calculated based on the maximum number of clock cycles needed to execute the task and its deadline (Di) time.
The fourth stage is frequency determination, which selects the system frequency Fszemax common for all cores as the maximum value among Fszej. This solution will enable potential further communication between tasks allocated in different cores.
During the last step, the final simulation of the entire system is carried out. All cores operate at the same frequency Fszemax. The simulation results provide a wide range of information, the most relevant of which seems to be data on the maximum and minimum execution time for each task.
Each stage of the task scheduling process is discussed in more detail below.

A. TASKS MAPPING
Depending on the design requirements, constraints, and many other factors, the task mapping process can be carried out in several ways.
When the number of resources required for the implementation of the system has to be minimized, one should aim to minimize the number of cores processing the tasks. This strategy is represented by 'Methodology 1' in Fig. 6. As our previous research [2] has shown, this approach does lead to some reduction in resources (in the extreme case to a single core with pipelined processing), but as a result, the frequency of the clock signal must be increased significantly. This is because as the number of cores is reduced, the number of concurrent tasks decreases. In turn, raising the frequency of the system results in a significant increase in dynamic power consumption.
Another strategy, shown in Fig. 6 as 'Methodology 2', is to reduce the energy consumed by the system. In this case, the designer should aim to increase the number of cores as much as possible. This would result in greater parallelization of calculations, i.e., multiple tasks could be performed simultaneously, and this, in turn, would enable the system clock frequency to be reduced. This will result in a reduction of power consumption. The disadvantage of this solution is the increased demand for hardware resources. In addition, note that there is an important limitation on the minimum number of tasks processed by the core pipeline [1], [2]. This limitation is a direct result of the thread interleaving [38] used. It should be noted that the task arrangement algorithm responsible for creating sequences of thread identifiers (implemented at the next scheduling stage) gives better results when the number of processed tasks is relatively large. Therefore, it is essential to find the optimal number of system cores. Other strategies tend to achieve a trade-off between energy consumption and resource utilization. These strategies are analyzed in the example presented in section V.
FIGURE 6. Various strategies of tasks mapping.
However, the starting point is the task mapping algorithm. The previous paper [2] presented two algorithms, BLIS II and STODER II. The algorithms were responsible for selecting the appropriate system structure, the number of cores, and


assigning tasks to these cores. The algorithms were presented on different optimization goals (methodologies). Here, based on further research and experiments, a modified approach using a single task mapping algorithm called BLTS (Balanced Load Task Scheduling) is presented. BLTS (Fig. 7) is designed for all types of tasks, including newly introduced SHT-type tasks. The algorithm has been adapted to the current version of the system architecture [2], in which the value of the Mdur parameter is independent of the number of working cores. The algorithm assigns tasks to the appropriate cores, based on such information as:
1) Fmax – the maximum operation frequency of the system. It depends on the system implementation technology;
2) CN – the number of cores that depends on the selected optimization methodology (Fig. 6);
3) Types of tasks and their TFi parameters.
FIGURE 7. Block diagram of BLTS algorithm.
As mentioned above, before starting the algorithm, a set TF containing the TFi parameters of all tasks should be determined. Additionally, when performing these calculations, the elements of the TF set are ordered and further operations are performed on the sorted TFsorted vector.
In general, it is assumed that the number of system cores is known before the BLTS algorithm starts. This is true for the first two methodologies: MINRES and MAXPRO. When it is necessary to reduce the cost of the system and minimize the required resources (MINRES), then the number of cores CNMINRES should be selected according to the formula (3):
$$CN_{MINRES} \ge \left\lceil \frac{\sum_{i=1}^{N} TF_i}{F_{max}} \right\rceil \tag{3}$$
In the case of the MAXPRO methodology, when the energy consumed by the system is minimized, which is equivalent to minimizing the operating frequency, the number of cores is maximized. However, the number of working cores is limited by a condition related to the minimum distance, expressed in cycles of pipelined processing, between consecutive instructions of the same task, Minindistance. In practice, the number of cores CNMAXPRO handling N tasks will be determined from the relation (4):
$$CN_{MAXPRO} \ge \left\lceil \frac{N}{2 \cdot Min_{indistance}} \right\rceil \tag{4}$$
In contrast, the third SFERA methodology proposed in the paper [2] is an intermediate solution, in which the cost of the system with specific energy constraints is to be reduced. Unfortunately, in this case, it may be necessary to perform iterative calculations (Fig. 8). Usually, the procedure starts from taking a certain number of cores, resulting from the assumed initial system cost. After the mapping (BLTS algorithm), the power consumed by the system must be estimated. If the imposed constraints are not met, the number of cores must be increased and the entire procedure should be repeated (Fig. 8).
FIGURE 8. SFERA methodology for scheduling tasks in the regime of efficient energy dissipation with a small amount of resource utilization.
The mapping procedure (Fig. 7) starts from assigning SHT tasks to the cores (step A of the algorithm). In the optimal


case, each core should perform at most one SHT task. If it turns out that the number of SHT tasks exceeds the number of cores, then in the next step (task arrangement), we should increase the values of their task frequencies TFi accordingly or freeze the TWi time window value for each SHT task.

Then the procedure of pre-assignment of tasks to cores is performed (steps B1-B3), during which two consecutive tasks with maximum and minimum TFi values are alternately assigned to successive cores. It should be noted that by operating on the sorted vector TFsorted, it is not necessary to search for the maximum and minimum values each time. The first and last elements of the TFsorted vector are taken. When all tasks are assigned and the vector TFsorted is empty (step C), the procedure is terminated.

The next step (D) of the algorithm is called "Load balancing of the cores" and is similar to the BLIS II algorithm [2]. Namely, the relative difference between the most loaded core and the least loaded one for a certain used percentage ratio should be minimized.

When we obtain a system with sufficiently evenly loaded cores, we take the appropriate frequency values (step E) and check that all the obtained values of the operating frequencies of the cores are within the acceptable operating range (step F). If any operating frequency exceeds the permissible value of Fmax, the number of cores must be increased and the algorithm should be repeated from step A.

The entire procedure is presented in the example depicted in the section describing the experiments.

B. TASKS ARRANGEMENT
This stage of task scheduling is responsible for preparing a sequence of thread identifiers (ThID) which will ultimately be stored in the ICR. The order of ThIDs determines the order of tasks processed in the core pipeline. For simulation purposes, the algorithm generates the sequence of identifiers in ASCII code and stores it in the appropriate file. The length of the sequence is adjusted experimentally and this parameter is denoted by WL (Window Length). In order to ensure the continuity of tasks' processing so that the length of the ICR does not determine the processing time, the register works in a round-robin scheme.

The scheduling algorithm is implemented as a program in C# and mimics the behavior of a digital circuit implemented in hardware. Thus, this program can be transferred into the corresponding hardware structure quickly and easily. Each task has its own independent set of counting registers: TCi, TWi and TTWi. The first register, the reverse counting register TCi (Thread Counter), stores the value corresponding to the number of clock cycles after which the i-th thread is to be loaded into the pipeline in order to meet its deadline. When a given register TCi reaches 0, it means that the appropriate thread must be executed.

The second register TWi (Time Window) contains the value representing the i-th task's period (the number of clock cycles), i.e., the maximum period of the execution of the consecutive instructions of a given thread to meet its timing requirements. The value of this parameter can be calculated from the following expression (5):

$$TW_i = \frac{F_{core_j} - F_{cmarg}}{TF_i} \qquad (5)$$

where:
Fcorej [MHz] – theoretical operating frequency of the j-th core of the system
Fcmarg [MHz] – the system's operating frequency margin

The Fcmarg parameter is an experimentally selected operating frequency margin of the system, and the results of previous experiments [2] have shown that the best results are achieved with a value of about 4-12% of the Fcorej parameter.

The decision determining which thread will be processed by the system in the next clock cycle is made based on the states of the TCi registers. When the algorithm is started, the initial states of the TCi registers are calculated by copying the values stored in the TWi registers. However, there is one exception to this rule: no two registers TCi and TCj (i ≠ j) can contain the same value at any moment of time (a conflict would arise). To avoid such a situation, in case the specified value is already occupied, the nearest free lower state is searched and written to the given TCi register.

FIGURE 9. ICR sequence generation algorithm.

The third register TTWi (Temporary Time Window) is used only when a conflict of TC values arises. This register holds the value corresponding to the difference between the actual cycle of the i-th task occurrence and the cycle resulting from the TWi value (5). The search procedure seeks to have a thread executed as early as possible, but in a critical situation, its service may occur later than the TWi parameter (in which case the TTWi value is negative). The task scheduling algorithm


strives to make the average TWi value of each task as close as possible to the value resulting from equation (5).

Fig. 9 shows the scheme of the ICR sequence generation procedure. In each loop of the algorithm, the value of all TC registers is decremented. When the value of any of them reaches 0, it means that such a thread will be processed in the next clock cycle. Then either the value of TWi (if it is free at the time) is assigned to this TCi counting register or a new value is searched for, and the difference is written to the TTWi register.

It may also occur that in a given cycle none of the counters has reached the zero state, in which case a thread with ThID0 is added to the cycle. This identifier means that no thread will be processed in a given cycle; it will be a so-called idle cycle not processing any instruction. Then the WL value is decremented and the entire cycle is repeated until the value of the WL counter (ICR sequence length counter) reaches 0.

The discussed scheduling algorithm can also eliminate the resulting idle cycles. If the next thread to be added to the cycle is to be the ThID0 thread, the algorithm checks whether it is possible to swap this address for another one. If so, the algorithm swaps it and updates its TC and TTW register values accordingly. This could minimize the number of idle cycles at the expense of faster processing of some threads. This means that this procedure may cause a larger difference between the minimum and maximum execution time of standard threads. Therefore, this part of the algorithm is optional.

SHT threads are hardware threads that either have no instructions to exchange data with other threads or execute instructions to exchange data with other threads with a fixed execution time (Fig. 4). This means that when SHT threads are used, all task execution time discrepancies originating from the system microarchitecture are eliminated. To preserve these properties of SHT threads, these threads take priority in the scheduling process and always appear in the ICR sequence at the exact moments determined by their TW value. This ensures that the difference between their maximum and minimum execution times is minimal.

Unfortunately, the scheduling of threads of this type comes with some limitations. SHT-type threads have priority in scheduling over regular threads, but there can be no conflict in scheduling of two threads of this type (SHT), because one of them could not be an SHT thread. To prevent such conflicts, the TW parameters of SHT-type threads running in a single core must be the same or must be a multiple of the smallest TW of the SHT-type thread. (If there is a conflict, the TW parameter of any of the SHT tasks should be increased, which will result in faster processing.) With the above assumption, the distances between the ThIDs of SHT threads in the generated sequence will always be constant, thus avoiding SHT thread conflicts.

C. INITIAL SIMULATION
The next stage of the proposed scheduling procedure is simulation. We have developed our own simulation environment and implemented it in C# in the form of a timing analyzer WCTA. Input data contains information about tasks (their parameters) and the threads' identifiers sequence (ICR) generated in the previous stage. The simulation of the configured system is done at a high level without stepping into details at the signal level.

During the initial simulation, the necessary information concerning on-line cores' timing parameters is collected. The initial simulation also allows reporting possible errors and suggesting the system's frequency correction. Obtained data also shows the spread of timing parameters. The following steps should be performed for each core operating in the system:

To calculate the maximum and minimum execution time of a given task, the length of the task must be expressed by the number of standard instructions. In the case of the not cooperating tasks (NCT), the problem is trivial, because this number corresponds to the Ci parameter of the task.

FIGURE 10. Finding the maximum (WCT) number of standard instructions corresponding to the memory instruction for three tasks.

For CT type tasks (where Mi ≠ 0) the length of the data exchange with another thread is expressed by the Mdur parameter. The method of the estimation of Mdur was discussed in [1], [2]. Mdur corresponds to the number of clock cycles needed for the completion of the memory operation. To determine the maximum number of standard instructions (Ci) corresponding to the data exchange operation (Mi) (in the worst case), the entire ICR sequence should be analyzed for the number of occurrences of the identifier of a specific task within a time window of the length of Mdur cycles (Fig. 10). The entire process is illustrated in Fig. 10 for three tasks #1, #5 and #12. During the initial simulation, a time window of Mdur clock cycles is created. The first part of the ICR sequence is analyzed in this time window. The number of occurrences of each ThIDi identifier is stored in the corresponding Mdur Insi parameter. The time window is then shifted by one symbol towards the end of the ICR sequence and these values are updated as necessary. To reduce the computational complexity of the algorithm, a search of the entire time window is performed only once at the beginning, while after each shift of the time window only the new identifier entering the time window from the right side and the identifier


ejected from the window from the left side are checked. As a result, we obtain a set of Mdur Insi values corresponding to the number of standard instructions Ci, with a total execution time equivalent to the maximum time required to exchange data for the i-th thread. It should be noted that for tasks of type SHT, the value Mdur Insi is constant in all time windows.

Now the total maximum length of a given thread MaxIi can be expressed by the number of standard instructions. The following relation (6) is, in practice, strongly overestimated (pessimistic), although it gives a 100 percent guarantee of time predictability, since, as is well known, operations with memory are affected by the greatest amount of uncertainty.

$$MaxI_i = MdurIns_i \cdot M_i + C_i \qquad (6)$$

where:
MaxIi – the equivalent of the maximum number of standard instructions needed to perform the i-th task
Mdur Insi – maximum number of standard instructions needed to execute a data exchange instruction with another thread

For NCT and CTHM tasks, the number of instructions needed will always be constant. Thus, if MinIi denotes the minimum number of standard instructions needed to execute a data exchange instruction with another thread, both parameters are equal, i.e. MaxIi = MinIi. Moreover, in particularly advantageous situations, when the data exchange operation takes the same length of time as the standard instruction, CT tasks can be completed faster and MinIi = Mi + Ci.

FIGURE 11. Finding the maximum number of system ticks required to complete the task.

In the next step of the simulation, the maximum and minimum number of clock ticks needed to execute a given thread is calculated. The simulation searches the sequence of ThIDs for all possible intervals in which a given task can be completely executed from the start to the end. In the ICR sequence, we look for substrings containing MaxIi + 1 identifiers of the i-th task for the maximum task execution time and MinIi identifiers for the case of minimum task execution time, respectively. Each sub-sequence must begin and end with the identifier of the corresponding thread. In this way, the maximum and minimum number of clock cycles required to execute each task mapped to a given core can be determined.

The method of determining the MaxExeTicks1 parameter is shown by the example depicted in Fig. 11. Assuming that the MaxI1 parameter is 2, the longest sequence of job ThIDs that begins and ends with the identifier "1" and contains MaxI1 + 1 of these identifiers (3 in this example) is searched for in the sequence of identifiers stored in the ICR. The longest sequence (MaxExeInterval1) has a length of 18. In the worst-case scenario, i.e., at the worst possible moment to start the task execution, the second ID is shown (marked in red). If the length of the task is 2 instructions (MaxI1 = 2), the task will complete as late as the point marked in green, that is, after 17 clock cycles. Thus, MaxExeTicks1 is going to be equal to 17.

In the best case, i.e., when the task starts at the best possible moment, the shortest sequence starting and ending with the identifier "1" and containing exactly MinI1 such identifiers is searched. MinExeTicks1 will just be equal to the length of this sequence.

In practice, the MaxExeTicksi and MinExeTicksi values denoting the maximum and minimum number of the clock cycles, respectively, needed to execute the i-th task should be expanded by a few cycles associated with the launch of the pipeline of a given core, depending on its configuration [1].

The parameter MaxExeTicksi divided by the deadline Di has a significant practical meaning, namely, it denotes the minimum operating frequency of the core at which the i-th task is predictable. Thus, the minimum operating frequency of the core can be determined from the relation (7):

$$MinFreq = \max_{i=1,\ldots,N} \left( \frac{MaxExeTicks_i}{D_i} \right) \qquad (7)$$

where:
N – number of tasks mapped to the core
MaxExeTicksi – maximum number of the clock cycles needed to execute the i-th task

D. FREQUENCY DETERMINATION
If the frequency of the core calculated from the equation (7) is higher than the frequency Fsysk determined at the stage of mapping tasks to the k-th core, the frequency should be updated and taken as the new value of the operating frequency Fszek of this core, otherwise Fszek = Fsysk.

When the initial simulation of all cores is completed and their operating frequencies determined, the final operating frequency of the entire system should be calculated. To ensure time predictability of all tasks, the maximum value among Fszek should be taken. At this point, the final simulation of the entire system can proceed.

E. THE FINAL SIMULATION
Fig. 12 presents the main window (the interface) of the simulator. As with the initial simulation, the numerical parameters


and types of tasks (1) performed in the system are known at the input, as well as the generated sequence of the tasks identifiers ICR. While performing the final simulation, with the knowledge of MaxExeTicksi and MinExeTicksi parameters, the maximum and minimum execution time for each task can be determined, among other issues. This information is presented in columns "Max Exe Time" and "Min Exe Time", respectively. The "MinFreq [MHz]" field provides interesting information. Here, the simulator provides the value of the minimum system frequency at which all tasks will fit within their deadlines.

FIGURE 12. The interface of the system timing analyzer WCTA.

FIGURE 13. The verification environment.

V. EXPERIMENTAL VERIFICATION OF THE SYSTEM
To verify and validate the proposed new time predictable design methodology, we have developed and arranged our own verification environment (Fig. 13). The previously published [2] task mapping methodology was modified and combined with the WCTA. The method was tested for a system processing 60 randomly generated tasks from a set of implemented programs [1]. In each case, different optimization goals and different system configurations were analyzed. All experiments were conducted using Xilinx's Virtex-7 XC7VX485T-2FFG1761 chip, where the entire system was implemented.

A. STRUCTURE OF THE VERIFICATION ENVIRONMENT
The structure of the proposed verification environment is depicted in Fig. 13. This environment enables independent monitoring of the execution time of each task. It consists of four main modules. The first module is an original multitasking PRET system [2]. According to the established concept of operation of the time predictable system [13], tasks are triggered asynchronously by independent external signals. This function is performed by the second module: the tasks' triggering signal generator module (Pseudo-random signal generator) implemented by means of the LFSR register. The next module (Task execution clock counting device) counts the number of clock cycles between the signal that starts the task and the feedback signal coming from the system signaling task execution. This module stores the maximum and minimum number of clock cycles counted since the system startup. Finally, the fourth module, the transmitter, converts parallel data into serial and transmits it to the external computer. The transmission can be initialized automatically or manually.
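The interplay of the second and third modules (pseudo-random triggering and min/max cycle counting) can be illustrated with a small software model. This is only a sketch under stated assumptions: the paper does not give the LFSR width or taps, so a common 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) is assumed here, and `toy_completion_cycle` is a hypothetical stand-in for the real system, in which the task receives one pipeline slot every `slot_period` cycles.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


class ExecutionCycleCounter:
    """Tracks the minimum and maximum cycle counts observed since startup."""

    def __init__(self):
        self.min_cycles = None
        self.max_cycles = None

    def record(self, start_cycle, done_cycle):
        cycles = done_cycle - start_cycle
        if self.min_cycles is None or cycles < self.min_cycles:
            self.min_cycles = cycles
        if self.max_cycles is None or cycles > self.max_cycles:
            self.max_cycles = cycles


def toy_completion_cycle(start_cycle, n_instr, slot_period):
    """Hypothetical system model: slots at cycles 0, slot_period, 2*slot_period, ..."""
    wait = (-start_cycle) % slot_period      # cycles until the next slot of this task
    return start_cycle + wait + (n_instr - 1) * slot_period + 1


def run_measurement(n_triggers=2000, n_instr=3, slot_period=8, seed=0xACE1):
    counter = ExecutionCycleCounter()
    state, cycle = seed, 0
    for _ in range(n_triggers):
        state = lfsr16_step(state)
        cycle += 32 + (state & 0x3F)         # pseudo-random gap of 32..95 cycles
        counter.record(cycle, toy_completion_cycle(cycle, n_instr, slot_period))
    return counter


measured = run_measurement()
```

In this toy model the measured execution time depends only on where the trigger falls relative to the task's slots, which mirrors the observation that asynchronous triggering is the source of the spread between minimum and maximum execution times.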


The transmitter can be used to transmit actual data coming from the real-time system regarding similar parameters that were obtained during the simulation (Fig. 12). This data is read out using dedicated software, where it is then analyzed.

B. VERIFICATION OF THE ADOPTED SCHEDULING METHODOLOGY AGAINST THE TIME PREDICTABILITY REQUIREMENTS OF THE SYSTEM
A series of conducted experiments confirmed the advantages of the adopted methodology based on the use of the WCTA. In this section, generalized, aggregated results showing averaged data obtained from at least one million executions of each of the tested tasks are presented. It should be noted that each time the test set was randomized. A more detailed representative example showing our approach step by step is included in the next subsection.

The tests comprised successively running three different system configurations related to different design goals: minimizing cost (MINRES), minimizing power (MAXPRO), and aiming to meet given energy constraints while optimizing cost (SFERA). Next, each system was operated and performed the tasks assigned to the computing resources (cores) for about one hour. Then the data acquired from the measurement environment (the maximum and minimum execution time of each task) was compared to the data acquired from the WCTA analyzer and the deadline (Di) parameter of the tasks. These experiments confirmed the correctness of the concept and execution of the scheduling methodology, and no deadline was missed. Furthermore, the performance of the task execution analyzer was verified. Each of the maximum and minimum task execution times is within the allocation calculated by WCTA.

Experiments have shown that the best results are obtained for ICR sequences consisting of 2000-5000 identifiers (WL parameter). The next three diagrams show the averaged results obtained from a series of experiments. The complete analysis, on the other hand, can be traced through a representative example, which is included in the next point of this section.

Figures 14 and 15 show the relative difference between the maximum and minimum execution time of each type of task related to the maximum execution time of the task (the worst case) obtained from simulations and measurements of the actual system, respectively.

The results obtained from the simulation are more pessimistic than the measurements made in the real system realized in the FPGA. This is because in the analysis process, the simulator always assumes the worst case of the task run time. It can be seen from the graphs that the largest fluctuations in task execution time occur for CT and CTHM tasks. A fairly high repeatability of SHT task execution times of about 0.1% was observed.

FIGURE 14. The relative difference between the maximum and minimum execution time obtained from WCTA related to the actual maximum execution time (mean values for various types of tasks).

FIGURE 15. The relative difference between the maximum and minimum execution time obtained from the practical experiments related to the actual maximum execution time (mean values for various types of tasks).

Fig. 16 shows the relative difference between the deadline and the average task completion time. These charts show how much earlier a given result can be obtained. The largest differences reaching 44% were obtained for SHT tasks. This is due to the fact that in the scheduling process, such tasks are assigned the highest priority, which makes it necessary to increase the frequency of work to meet the time requirements for all tasks.

C. REPRESENTATIVE EXAMPLE
The following example shows the process of scheduling of randomly selected 60 test tasks. In addition, randomization of task parameters was carried out: Ci, Mi and Di. The following constraints were assumed:
1) the maximum value of Ci is limited to 3000;
2) the maximum value of Mi is limited to 100;
3) the maximum value of TFi is limited to 4;
4) the probability of random drawing SHT task equals 5%;


5) the probability of random drawing CT task equals 25%;
6) the probability of random drawing CTHM task equals 5%.

Based on the above assumptions, 60 test tasks were generated, as shown in Table 1.

TABLE 1. Parameters of test tasks.

For the current implementation of the system in the Virtex-7 FPGA chip, Fmax = 150 MHz was assumed, and this value includes some safety margin [1], [2]. Because the total sum of all tasks' frequencies (TFi) is lower than Fmax, all tasks can be allocated in a single core. We then proceeded to determine the number of cores needed in each of the three methodologies, with a power limit of 1 W for the SFERA configuration. The resulting parameters are gathered in Table 2, with 1 core sufficient for the MINRES methodology, and 3 cores for SFERA (assumed as the initial configuration) proving sufficient to meet the power requirements.

The obtained final tasks mapping based on these parameters and taking into account the previously determined task frequencies is shown in Table 3.

As a result of running the task arrangement procedure, ICR sequences with appropriately selected lengths are obtained. For example, in the case of an architecture implemented according to the MINRES scenario, the length of the window is 2880 and it is a multiple of the time window of the largest SHT-type thread (TW5 = 72). The WL parameters of the other scenarios were selected similarly. Table 4 collects these parameters and shows the minimum operating frequencies of individual cores and the final frequency of the entire system obtained from the simulation process.

Then, each version of the system was implemented in the FPGA and tested in the environment shown in Fig. 13. During the experiments, the power consumption of the different system configurations (Fig. 17) and the requirements for post-implementation hardware resources (Fig. 18) were compared. In the case of energy demand, the difference between the two extreme implementations of MINRES and MAXPRO is about 42% for total power and almost 61% for dynamic power, respectively.

FIGURE 16. Relative difference between deadlines and average time to complete a task.
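The random generation of the test set under the six constraints listed above can be sketched as follows. This is a hedged illustration, not the authors' generator: the function name, the seed, the lower bounds of the drawn values, the assumption that the remaining 65% of tasks are NCT with Mi = 0, the assumption that every non-NCT task has at least one exchange instruction, and the omission of the deadline Di are all choices made here for the sake of a runnable example.

```python
import random

MAX_C, MAX_M, MAX_TF_MHZ = 3000, 100, 4.0   # limits taken from constraints 1)-3)
F_MAX_MHZ = 150.0                           # Fmax assumed in the example


def generate_test_tasks(n_tasks=60, seed=1):
    """Draw a task set; type probabilities follow constraints 4)-6)."""
    rng = random.Random(seed)
    tasks = []
    for thid in range(1, n_tasks + 1):
        draw = rng.random()
        if draw < 0.05:
            kind = "SHT"      # 5%
        elif draw < 0.30:
            kind = "CT"       # 25%
        elif draw < 0.35:
            kind = "CTHM"     # 5%
        else:
            kind = "NCT"      # assumption: the remaining 65%
        tasks.append({
            "thid": thid,
            "type": kind,
            "C": rng.randint(1, MAX_C),
            # Assumption: NCT tasks exchange no data, other types at least once.
            "M": 0 if kind == "NCT" else rng.randint(1, MAX_M),
            "TF": rng.uniform(0.05, MAX_TF_MHZ),
        })
    return tasks


tasks = generate_test_tasks()
total_tf = sum(t["TF"] for t in tasks)
fits_single_core = total_tf < F_MAX_MHZ     # the single-core condition used in the text
```

Fixing the seed makes the drawn set reproducible, which is convenient when the same tasks have to be mapped under several methodologies.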


TABLE 2. Number of processing cores for different methodologies.

TABLE 3. Results of tasks mapping algorithm.

TABLE 4. Selected results of tasks arrangement and initial simulation of the system.

FIGURE 17. Comparison of the energy properties of the different implementations of the system.

FIGURE 18. Comparison of the resources used for the different implementations of the system (post-implementation).

The relationship between resource utilization and the complexity of the structure is much more complicated. While one can observe a certain increase in logical resources as the number of cores increases, in the case of resources responsible for switching, this relationship is not directly proportional. The number of multiplexers strongly depends on the number of tasks being performed, and in some cases high-level synthesis tools manage to achieve some optimum.

A detailed analysis of the time predictability of tasks of different types provides very interesting conclusions. At the same time, it should be taken into account that this analysis is based on averaged results, i.e., each task was performed more than a million times. Graphical representation of all data showing the obtained timing parameters of tasks is impossible due to the amount of data obtained. Therefore, for greater readability, the results for the most characteristic cases are presented. Example graphs are provided for NCT, CT, CTHM and SHT tasks in Figs. 19 – 22, respectively. The graphs show Di time (Deadline), maximum (Max S) and minimum (Min S) task completion times calculated by WCTA, and maximum (Max M), minimum (Min M) and averaged (Mean M) task completion times measured in the hardware-implemented system.

Fig. 19 shows measurements for an NCT-type task (THID = 3). All the times are less than the Di time. It can also be observed that this difference varies in each scenario, which is due to the different ICR sequences determined in the scheduling process. In the case of this NCT task, the differences between WCTA indications and the measurements are very small and are at the level of single µs.

In the case of the CT task (Fig. 20), a significant difference can be observed between the maximum task execution time calculated by the timing analyzer and the actual task execution time of the system. This is an effect resulting from the WCET analysis performed during the simulation. Moreover,


with memory access operations the timing parameters are subject to the greatest degree of imprecision.

The results obtained for the CTHM task (Fig. 21) confirm the achievement of the optimization goal of minimizing the difference between maximum and minimum task execution time which, in this case, is significantly reduced compared to CT tasks.

In the case of the SHT task (Fig. 22), the highest consistency was achieved between the calculations obtained by simulation and the results of measurements of the practically implemented system.

FIGURE 19. Results obtained for the not cooperating task (NCT) no. 3.

FIGURE 20. Results obtained for the cooperating task (CT) no. 2.

FIGURE 21. Results obtained for the cooperating task with fixed data exchange time (CTHM) no. 1.

FIGURE 22. Results obtained for the strong hard timed task (SHT) no. 5.

In order to validate the concept of the scheduling process using a dedicated timing analyzer, the results obtained for all tasks are presented collectively. To make this possible, all the times are presented in percentage with respect to the Di parameter of each task. The diagrams show the average execution times of each task for each system configuration along with the range of changes (min-max), with Fig. 23 showing the data obtained from the simulation and Fig. 24 illustrating the results of measuring the real system.

The presented system adopts a completely asynchronous mode of running tasks that depend on external factors. Moreover, tasks are not handled as a system interrupt. Therefore, the WCET (Worst case execution time) and the BCET (Best case execution time) are dependent on a task triggering moment, which is completely random in a real-time system (see Fig. 11). The authors of the survey concerning real-time embedded systems [29] suggested a special quality measure parameter defined as BCET to WCET ratio. It turns out that this indicator is better the closer its value is to unity. Fig. 25 presents the values of the quality measure index for all the tasks analyzed in the example. The results obtained in the process of simulating the system before launch are much worse than the results obtained when analyzing the operation of the system implemented in the FPGA chip. The best quality measure values were obtained for SHT tasks, which is consistent with the adopted scheduling strategy. Also, for NCT tasks, the values of these coefficients are close to unity, which is understandable because these tasks do not require communication between threads. However, for the actual parameters obtained from the hardware-implemented system, all quality measure values lie above the 0.93 level regardless of the adopted system configuration and design goals.

D. COMPARISON TO OTHER APPROACHES
A conclusive and fair quantitative comparison with other solutions proposed in the literature is quite difficult because different authors highlight different indicators of their solutions, there is no full access to all data, and the used FPGA platforms are made in different technologies. Therefore, the comparative analysis is divided into two parts: a comparison of the components of the solutions (descriptive analysis) and a comparison of the resources of the selected solutions. The results of the first analysis have been gathered in Table 5, while the results of the numerical analysis of hardware resources are included in Fig. 26 and 27.

The proposed solution is implemented in the new family of modern FPGA devices (implemented in 28 nm technology), but this was not the only factor that determined the quality


FIGURE 23. The average execution times of each task for each system configuration along with the range of changes (min-max)
from WCTA.

FIGURE 24. The average execution times of each task for each system configuration along with the range of changes (min-max) from the
real system implemented in hardware.


FIGURE 25. The quality measure factors [29] obtained for all tasks in three tested configurations in the simulation (left charts) and in
the implemented system (right charts).

TABLE 5. Comparison to other approaches.

of the proposed solution. We have proposed a simple and effective task and context switching mechanism based on the interleaving cycle register controlling the pipeline processing. The processing cycles dedicated to a given task are continuously repeated; thus, without the need for interrupts, the task can be executed with full timing predictability regardless of when it starts. The proposed solution mimics an RTOS at the hardware layer. It is also possible to use a single thread as an operating system thread using timing instructions and deadline registers.
The two architectures closest to the proposed solution, in our opinion, are ARPA-MT [45] and FlexPRET [42], and we decided to compare the resources obtained after implementation with these two approaches. The cited papers provide resources for different system configurations and different numbers of tasks, so some normalization of the results has been introduced to make the comparison as relevant as possible.
The diagram depicted in Fig. 26 compares the resources consumed by the implementation of the presented architectures with the averaged resources required per single task in the ARPA-MT system [45]. That architecture [45] used hardware support for RTOS (Cop2-OSC), while tasks were processed in the main pipeline (Cop0). Such averaging


can only provide a rough comparison; however, all results indicate that our solution is more resource efficient.
Since in the FlexPRET solution [42] the data related to the context of the tasks being processed (including register files) are stored in memory, the information reported on resources consumed ignores these items. Thus, in order to make the comparison adequate, the resources needed to implement the general-purpose registers of individual tasks in our solutions have been neglected. A direct comparison of resources, especially with a large number of processed tasks, would give a significantly falsified result.

FIGURE 26. Comparison of resource consumption for the entire system implementation related to a single task.

FIGURE 27. Comparison of ‘processing’ resources (without storing components) consumption per single task.

VI. SUMMARY

The presented methodology allows adjusting the hardware structure so as to minimize the energy required for the system’s operation or the number of resources required, whilst ensuring the critical time predictability of the system. The proposed method of simulating task execution allows the maximum and minimum task execution times to be predicted even before the system starts up. The division into different types of tasks makes it possible to match the differences between maximum and minimum task execution times with the requirements of the task application.
We were able to achieve a 100% fulfillment rate of task completion time, i.e., DMR = 0% [36], for all tasks. There are, however, situations in which certain tasks may be completed clearly before their deadlines; this is particularly true for SHT tasks. For applications in secure systems, such a situation could be critical. However, since each task in our solution has a dedicated hardware deadline counter, this counter can be used to release the data/input signals of such a critical task.
The task processing strategy adopted in the approach, based on thread interleaving, does not require mechanisms for hazard control in the pipeline or for handling other issues related to the prediction of branches, jumps, etc.

REFERENCES
[1] E. Antolak and A. Pułka, “Flexible hardware approach to multi-core time-predictable systems design based on the interleaved pipeline processing,” IET Circuits, Devices Syst., vol. 14, no. 5, pp. 648–659, Aug. 2020, doi: 10.1049/iet-cds.2019.0521.
[2] E. Antolak and A. Pulka, “Energy-efficient task scheduling in design of multithread time predictable real-time systems,” IEEE Access, vol. 9, pp. 121111–121127, 2021, doi: 10.1109/ACCESS.2021.3108912.
[3] M. Schoeberl, “T-CREST: Time-predictable multi-core architecture for embedded systems,” J. Syst. Archit., vol. 61, no. 9, pp. 449–471, Oct. 2015, doi: 10.1016/j.sysarc.2015.04.002.
[4] M. Paolieri, E. Quinones, F. J. Cazorla, J. Wolf, T. Ungerer, S. Uhrig, and Z. Petrov, “A software-pipelined approach to multicore execution of timing predictable multi-threaded hard real-time tasks,” in Proc. 14th IEEE Int. Symp. Object/Component/Service-Oriented Real-Time Distrib. Comput., Mar. 2011, pp. 233–240, doi: 10.1109/ISORC.2011.36.
[5] T. Ungerer, F. Cazorla, P. Sainrat, G. Bernat, Z. Petrov, C. Rochange, E. Quinones, M. Gerdes, M. Paolieri, J. Wolf, H. Casse, S. Uhrig, I. Guliashvili, M. Houston, F. Kluge, S. Metzlaff, and J. Mische, “Merasa: Multicore execution of hard real-time applications supporting analyzability,” IEEE Micro, vol. 30, no. 5, pp. 66–75, Sep. 2010, doi: 10.1109/MM.2010.78.
[6] D. D. Gajski, Embedded System Design: Modeling, Synthesis and Verification. Dordrecht, The Netherlands: Springer, 2009.
[7] E. A. Lee, “Absolutely positively on time: What would it take? [embedded computing systems],” Computer, vol. 38, no. 7, pp. 85–87, Jul. 2005, doi: 10.1109/MC.2005.211.
[8] S. Andalam, P. S. Roop, A. Girault, and C. Traulsen, “A predictable framework for safety-critical embedded systems,” IEEE Trans. Comput., vol. 63, no. 7, pp. 1600–1612, Jul. 2014, doi: 10.1109/TC.2013.28.
[9] D. Broman, “Precision timed infrastructure: Design challenges,” in Proc. Electron. Syst. Level Synth. Conf. (ESLsyn), Austin, TX, USA, May 2013, pp. 1–6.
[10] T. Henzinger and C. Kirsch, “The embedded machine: Predictable, portable real-time code,” ACM Trans. Program. Lang. Syst., vol. 29, p. 33, Dec. 2002, doi: 10.1145/512529.512567.
[11] E. A. Lee, “The problem with threads,” Computer, vol. 39, no. 5, pp. 33–42, May 2006, doi: 10.1109/MC.2006.180.
[12] F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero, “Predictable performance in SMT processors: Synergy between the OS and SMTs,” IEEE Trans. Comput., vol. 55, no. 7, pp. 785–799, Jul. 2006, doi: 10.1109/TC.2006.108.
[13] S. A. Edwards and E. A. Lee, “The case for the precision timed (PRET) machine,” in Proc. 44th ACM/IEEE Design Autom. Conf., San Diego, CA, USA, Jun. 2007, pp. 264–265, doi: 10.1109/DAC.2007.375165.
[14] L. Thiele and R. Wilhelm, “Design for timing predictability,” Real-Time Syst., vol. 28, nos. 2–3, pp. 157–177, Nov. 2004, doi: 10.1023/B:TIME.0000045316.66276.6e.
[15] N. J. H. Ip and S. A. Edwards, “A processor extension for cycle-accurate real-time software,” in Embedded and Ubiquitous Computing, vol. 4096, E. Sha, S.-K. Han, C.-Z. Xu, M.-H. Kim, L. T. Yang, and B. Xiao, Eds. Berlin, Germany: Springer, 2006, pp. 449–458, doi: 10.1007/11802167_46.
[16] B. Lickly, I. Liu, S. Kim, H. D. Patel, S. A. Edwards, and E. A. Lee, “Predictable programming on a precision timed architecture,” in Proc. Int. Conf. Compil., Archit. Synth. Embedded Syst. (CASES), Atlanta, GA, USA, 2008, p. 137, doi: 10.1145/1450095.1450117.


[17] M. Schoeberl, P. Schleuniger, W. Puffitsch, F. Brandner, and C. W. Probst, “Towards a time-predictable dual-issue microprocessor: The Patmos approach,” in Proc. 1st Workshop Bringing Theory Pract., Predictability Perform. Embedded Syst. (PPES), Grenoble, France, 2011, pp. 11–21. Accessed: Nov. 12, 2019. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2011/3077
[18] M. Fernández, R. Gioiosa, E. Quiñones, L. Fossati, M. Zulianello, and F. J. Cazorla, “Assessing the suitability of the NGMP multi-core processor in the space domain,” in Proc. 10th ACM Int. Conf. Embedded Softw. (EMSOFT), Tampere, Finland, 2012, pp. 175–184, doi: 10.1145/2380356.2380389.
[19] A. Alhammad and R. Pellizzoni, “Time-predictable execution of multithreaded applications on multicore systems,” in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, 2014, pp. 1–6, doi: 10.7873/DATE.2014.042.
[20] S. Moulik, R. Devaraj, and A. Sarkar, “COST: A cluster-oriented scheduling technique for heterogeneous multi-cores,” in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Miyazaki, Japan, Oct. 2018, pp. 1951–1957, doi: 10.1109/SMC.2018.00337.
[21] R. Pathan, P. Voudouris, and P. Stenström, “Scheduling parallel real-time recurrent tasks on multicore platforms,” IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 4, pp. 915–928, Apr. 2018, doi: 10.1109/TPDS.2017.2777449.
[22] D. Kim, Y.-B. Ko, and S.-H. Lim, “Energy-efficient real-time multi-core assignment scheme for asymmetric multi-core mobile devices,” IEEE Access, vol. 8, pp. 117324–117334, 2020, doi: 10.1109/ACCESS.2020.3005235.
[23] J. Chen, C. Du, P. Han, and Y. Zhang, “Sensitivity analysis of strictly periodic tasks in multi-core real-time systems,” IEEE Access, vol. 7, pp. 135005–135022, 2019, doi: 10.1109/ACCESS.2019.2941958.
[24] B. Forsberg, L. Benini, and A. Marongiu, “HePREM: Enabling predictable GPU execution on heterogeneous SoC,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, Mar. 2018, pp. 539–544, doi: 10.23919/DATE.2018.8342066.
[25] B. Akesson and K. Goossens, Memory Controllers for Real-Time Embedded Systems, vol. 2. New York, NY, USA: Springer, 2012, doi: 10.1007/978-1-4419-8207-0.
[26] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, “PRET DRAM controller: Bank privatization for predictability and temporal isolation,” in Proc. 7th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth., Taipei, Taiwan, 2011, p. 99, doi: 10.1145/2039370.2039388.
[27] L. M. AlBarakat, P. V. Gratz, and D. A. Jiménez, “MTB-fetch: Multithreading aware hardware prefetching for chip multiprocessors,” IEEE Comput. Archit. Lett., vol. 17, no. 2, pp. 175–178, Jul. 2018, doi: 10.1109/LCA.2018.2847345.
[28] M. Schoeberl, “A Java processor architecture for embedded real-time systems,” J. Syst. Archit., vol. 54, nos. 1–2, pp. 265–286, Jan. 2008, doi: 10.1016/j.sysarc.2007.06.001.
[29] P. Axer, R. Ernst, H. Falk, A. Girault, D. Grund, N. Guan, B. Jonsson, P. Marwedel, J. Reineke, C. Rochange, M. Sebastian, R. V. Hanxleden, R. Wilhelm, and W. Yi, “Building timing predictable embedded systems,” ACM Trans. Embedded Comput. Syst., vol. 13, no. 4, pp. 1–37, Dec. 2014, doi: 10.1145/2560033.
[30] Y. Kim, J. Kong, and A. Munir, “CPU-accelerator co-scheduling for CNN acceleration at the edge,” IEEE Access, vol. 8, pp. 211422–211433, 2020, doi: 10.1109/ACCESS.2020.3039278.
[31] A. U. Rehman, “Dynamic energy efficient resource allocation strategy for load balancing in fog environment,” IEEE Access, vol. 8, pp. 199829–199839, 2020, doi: 10.1109/ACCESS.2020.3035181.
[32] F. Glaser, G. Tagliavini, D. Rossi, G. Haugou, Q. Huang, and L. Benini, “Energy-efficient hardware-accelerated synchronization for shared-L1-memory multiprocessor clusters,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 3, pp. 633–648, Mar. 2021, doi: 10.1109/TPDS.2020.3028691.
[33] H. Bahn and K. Cho, “Evolution-based real-time job scheduling for co-optimizing processor and memory power savings,” IEEE Access, vol. 8, pp. 152805–152819, 2020, doi: 10.1109/ACCESS.2020.3017014.
[34] H. Chniter, O. Mosbahi, M. Khalgui, M. Zhou, and Z. Li, “Improved multi-core real-time task scheduling of reconfigurable systems with energy constraints,” IEEE Access, vol. 8, pp. 95698–95713, 2020, doi: 10.1109/ACCESS.2020.2990973.
[35] Welcome to LPA. Accessed: Jan. 31, 2023. [Online]. Available: https://www.lpa.co.uk/
[36] Y. Gao, G. Pallez, Y. Robert, and F. Vivien, “Dynamic scheduling strategies for firm semi-periodic real-time tasks,” IEEE Trans. Comput., vol. 72, no. 1, pp. 55–68, Jan. 2023, doi: 10.1109/TC.2022.3208203.
[37] E. Antolak and A. Pułka, “An analysis of the impact of gating techniques on the optimization of the energy dissipated in real-time systems,” Appl. Sci., vol. 12, no. 3, p. 1630, Jan. 2022, doi: 10.3390/app12031630.
[38] E. Lee and D. Messerschmitt, “Pipeline interleaved programmable DSP’s: Architecture,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 9, pp. 1320–1333, Sep. 1987, doi: 10.1109/TASSP.1987.1165274.
[39] G. C. Buttazzo, Hard Real-Time Computing Systems, vol. 24. Boston, MA, USA: Springer, 2011, doi: 10.1007/978-1-4614-0676-1.
[40] E. L. Lamie, Real-Time Embedded Multithreading: Using ThreadX and ARM. San Francisco, CA, USA: CMP Books, 2005.
[41] A. Diavastos and T. E. Carlson, “Efficient instruction scheduling using real-time load delay tracking,” ACM Trans. Comput. Syst., vol. 40, nos. 1–4, pp. 1–21, Nov. 2022, doi: 10.1145/3548681.
[42] M. Zimmer, D. Broman, C. Shaver, and E. A. Lee, “FlexPRET: A processor platform for mixed-criticality systems,” in Proc. IEEE 19th Real-Time Embedded Technol. Appl. Symp. (RTAS), Berlin, Germany, Apr. 2014, pp. 101–110, doi: 10.1109/RTAS.2014.6925994.
[43] I. Liu, J. Reineke, and E. A. Lee, “A PRET architecture supporting concurrent programs with composable timing properties,” in Proc. Conf. Rec. Forty 4th Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, Nov. 2010, pp. 2111–2115, doi: 10.1109/ACSSC.2010.5757922.
[44] I. Liu, J. Reineke, D. Broman, M. Zimmer, and E. A. Lee, “A PRET microarchitecture implementation with repeatable timing and competitive performance,” in Proc. IEEE 30th Int. Conf. Comput. Design (ICCD), Montreal, QC, Canada, Sep. 2012, pp. 87–93, doi: 10.1109/ICCD.2012.6378622.
[45] A. S. R. Oliveira, L. Almeida, and A. D. B. Ferrari, “The ARPA-MT embedded SMT processor and its RTOS hardware accelerator,” IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 890–904, Mar. 2011, doi: 10.1109/TIE.2009.2028359.
[46] CHISEL Main Web Page. Accessed: Mar. 16, 2023. [Online]. Available: https://www.chisel-lang.org/community.html

ERNEST ANTOLAK received the M.Sc. degree in electronics and telecommunication engineering from the Silesian University of Technology, Gliwice, Poland, in 2018. He is currently pursuing the Ph.D. degree in project methodology of designing real-time systems. His research interests include real-time scheduling, designing safety-critical embedded systems, cyber-physical systems, systems on chips, and energy-efficient digital architectures.

ANDRZEJ PUŁKA (Senior Member, IEEE) received the M.Sc., Ph.D., and D.Sc. degrees in electronics from Silesian Technical University, Gliwice, Poland, in 1988, 1997, and 2013, respectively.
Currently, he is the University Professor of the Silesian University of Technology, Gliwice, and the Deputy Head of the Department of Electronics, Electrical Engineering and Microelectronics. He is the author and coauthor of approximately 90 scientific papers, including journal articles, book chapters, and conference papers. His research interests include the automated design of digital and mixed signal circuits in FPGAs, the modeling and simulation of electronic embedded systems, VHDL, Verilog, SystemVerilog, SystemC, real-time systems (precision time machines, PRET), the design of energy efficient systems, power optimization in SoC, AI, and commonsense reasoning modeling, and the applications of FPGA platforms for hardware acceleration of complex computations in bioinformatics. He is a member of the Electronics Commission of the Polish Academy of Sciences.
