Validation of Task Scheduling Techniques in Multithread Time Predictable Systems
ABSTRACT This paper presents a simulation-based environment for the verification of a static task scheduling methodology in a time-predictable system. Different types of processed tasks are distinguished, and a unified system design methodology is presented that consists of the selection of the real-time system configuration, task mapping, scheduling, and generation of a sequence of task identifiers to control the interleaved pipeline. An original Worst Case Timing Analyzer (WCTA) has been developed to automate the design process. The methodology was introduced into the original PRET (PREcision Timed) architecture recently presented in (Antolak and Pulka, 2020) and (Antolak and Pulka, 2021). The PRET system was implemented on a Virtex-7 FPGA (Field Programmable Gate Array) platform. A dedicated verification environment is proposed that allows on-line real-time system monitoring, analysis of timing parameters, and comparison of the results with the initial requirements and design constraints. The practical experiments presented in the paper proved the correct operation of the authors' hardware architecture. The obtained results confirmed the validity of the proposed scheduling method and of the concept of calculating the execution times of tasks before they are started, which allows the hardware to be matched optimally to the tasks to be performed.
INDEX TERMS Real-time systems, timing simulation, dynamic scheduling, multitasking, pipeline interleaving, multithreading.
I. INTRODUCTION
Contemporary processors and multiprocessor systems are capable of processing very complex software algorithms, but they exhibit a very high degree of hardware complexity. In addition, the rapid development of semiconductor technology means that these systems are clocked with ever faster clock signals. Paradoxically, these technological advances are causing timing predictability issues to arise for such systems [3]. Initially, problems with processor timing predictability were solved at the software layer, creating real-time operating systems, or so-called RTOSs (Real-Time Operating System) [4], [5]. Unfortunately, this approach to the problem is fraught with a great deal of inaccuracy, since task handling is done at a very high level of abstraction [6], [7]. Most programming languages such as C [8] or C++ do not give direct control over the elapsed time of tasks [9], which, combined with a large number of abstraction levels, allows only very rough control of timing [10], [11], [12].
The paper discusses scheduling methodologies in the system presented in [2]. It focuses on the generation of the task identifier sequence, which is the key issue of interleaved pipeline processing of threads. We introduced a new category of tasks with the highest priority, called strong hard timed tasks, and we adjusted the scheduling process according to the type of task. It was shown that the appropriate order and frequency of tasks in a core's pipeline decides overall system efficiency and predictability. The proposed real-time system was implemented in the Verilog language, and its synthesis and implementation were carried out in the Vivado 2018.3 environment.
Full simulation of such a system consumes a huge amount of time, and it is practically impossible to get a complete picture of the system's behavior. Mapping one hour of operation of
a system implemented in an FPGA requires more than a week of simulation. Hence, the idea of developing a dedicated program to enable efficient time analysis of the developed system emerged. Moreover, an offline Worst Case Timing Analyzer (WCTA) dedicated to the designed system was created. WCTA delivers many timing parameters that validate the developed methodology. It was assumed that regardless of when a task is started, it should be completed within the deadline. The proposed WCTA analyzer is based on a time model of the pipelined processing path with interleaved threads. The authors present the subsequent stages of this processing in the paper. A series of experiments proved the correctness of the approach.
The main contributions of this paper are:
• Proposing an Interleaving Cycle Register (ICR) of task identifier sequences;
• Developing an extended task model;
• Dividing the task scheduling process into stages;
• Unifying the task mapping method - developing a universal BLTS (Balanced Load Task Scheduling) algorithm;
• Proposing a method of task scheduling by composing the contents of the ICR (a hardware alternative to an RTOS);
• Developing a pre-layout Worst Case Timing Analyzer (WCTA) to perform a timing analysis of the designed system, and in particular to estimate the execution times of individual tasks;
• Final hardware validation of the system.
The paper consists of six sections. First, the related work is briefly discussed, then the main elements of the proposed multitask system architecture and the assumed task model are recalled. In the fourth section the main stages of the scheduling process implemented in the methodology are described. Section five covers the experiments and discussion of the obtained results. The paper is summarized with the final conclusions.

II. RELATED WORK
The idea of time-predictable systems, the PREcision Timed machines (PRETs), was formulated and presented as early as 2007 by Edwards and Lee [13]. It assumed that such a system should always, regardless of load, perform scheduled tasks on time. The concept was developed by many researchers: Thiele and Wilhelm [14] formulated a set of recommendations and guidelines to facilitate the design of time-critical embedded systems. Ip and Edwards [15] proposed extending the command list of RISC processors to include a DEADLINE instruction to enable time-critical task control. The CHESS (Center for Hybrid and Embedded Software Systems) [16] group at UC Berkeley, led by Prof. Edward Lee, developed the idea of the PRET processor, and proposed, among other things, a way to predictably access memory by proposing a method called "memory wheel". Then, the CHESS group, in cooperation with centers from Germany (Saarland University) and Sweden (Linköping University), developed their ideas. This work resulted in the 2010 publication [43] of the new PRET architecture that supported the concurrent execution of programs. This architecture allowed integrating independent components while maintaining the temporal properties of tasks. This solution used a thread interleaving mechanism, scratchpad memories, and a composable and predictable DRAM (dynamic random-access memory) controller. Another paper from this research team [44] presented a substantial implementation of a precision-timed machine, the PTARM architecture. This architecture is based on the ARM family of processors and implements a subset of the ARMv4 microarchitecture. The PTARM solution improved system performance by using a refined thread-interleaved pipeline, an exposed memory hierarchy, and a repeatable DRAM memory controller. In 2014, the CHESS group developed a new platform for processing mixed-criticality tasks called FlexPRET [42]. This paper distinguished hard real-time threads (HRTT) and soft real-time threads (SRTT). The FlexPRET architecture generated by the CHISEL tool [46] provided hardware-based isolation to HRTT tasks, while SRTT tasks efficiently utilized the available processing resources. The authors of [42] introduced dedicated timing instructions to an ISA based on the RISC-V processor family and proposed a thread scheduler that kept control over the threads processed by the pipeline. The entire structure was also implemented in an FPGA device.
The paper [8] addressed the analysis of time-predictable systems at a higher level of abstraction. The authors of this work introduce a new lightweight and concurrent language, PRET-C (Precision Time version of C). PRET-C, thanks to its syntax, synchronous semantics, and very simple mechanisms handling time, is well suited for predictable PRET architectures. The authors also proposed a hardware accelerator for PRET-C execution over soft-core processors allowing time-predictable execution of tasks with high efficiency. This time-predictable architecture was called ARPRET.
Yet another interesting solution is presented in another paper [45], in which the authors presented their own time-predictable architecture called ARPA-MT. The ARPA-MT architecture consists of 3 main elements: the main processing unit, two coprocessors Cop0-MEC responsible for memory operations management and exceptions and interrupts handling, and Cop2-MEC implementing and accelerating RTOS functions in hardware. The ARPA-MT [45] structure contains a very interesting 5-staged pipeline with the first two stages (IF and ID) replicated for each thread, while the other 3 stages of the pipeline (EX, MA and WB) process different threads that are interleaved. This solution presented interesting results of hardware-software synergy while designing real-time systems.
A group of researchers from Denmark, Austria, France, and the US presented the new original architecture of the multi-core processor called Patmos [17]. Then in 2015, a group of 24 European researchers presented the T-CREST project [3], which demonstrated a multi-core approach to the original PRET system concept. Problems arising from
scheduling tasks running in parallel in multicore systems were published in [18], [19], [20], [21], [22], and [23], along with attempts to solve them.
It is noticeable that the research carried out on time-predictable systems for many years involved the analysis of very different issues related to both hardware [12], [14], [17], [24], [25], [26] and software [18], [19], [27], [28] development. Researchers often look for general solutions [8], [9], but many works concern dedicated solutions [17], [25]. In [29] the authors provide an overview of many issues related to the design of embedded time-predictable systems.
The authors of some papers [30], [31] analyzed the problem of load-balanced scheduling and proposed techniques enabling reduction of the energy consumed by the systems. In [32] one can find a technique of the effective utilization of processing elements in the clusters of processors with the shared L1 cache memory based on the optimized synchronization and communication between the system components. Some works demonstrated the application of heuristic methods and AI algorithms in the task scheduling process [33], [34], [35].
In the paper [1], an original real-time system solution was proposed. It was based on thread-interleaved pipeline processing. The solution allows flexible configuration of the pipeline and uses a set of dedicated scheduling algorithms. In a subsequent paper [2], the problem of energy optimization in time-predictable systems was analyzed and the authors proposed a new solution of this problem in the form of enhanced and dedicated scheduling algorithms used for various design goals and constraints. In the presented paper, the problem of generating sequences of thread identifiers in the process of scheduling tasks of different types was more thoroughly addressed. The methodology was based on analyzing the timing conditions with a specially constructed Worst Case Timing Analyzer (WCTA), which makes it possible to accurately analyze the timing of the system already at the design stage. Unlike other approaches where the authors analyze the Deadline Miss Ratio (DMR) [36], this solution does not allow any task to exceed the deadline time.

III. OVERVIEW OF THE SYSTEM
Every task executed in the system must be completed before its strictly defined execution time (deadline) regardless of the system load. Such tasks are inherently asynchronous since they can be triggered by external interrupts, the occurrence of which is very difficult or even impossible to predict. Before starting the system, only the number and type of tasks being executed are known, while the number of different tasks and the timing of their startup are unknown. Therefore, to ensure the above assumptions, it is necessary to find the most heavily loaded task sequence on the system and prove before system startup that the task will be executed on time.
As an example of the application of such a system, one can recall an automotive safety system (see the diagram in Fig. 1). Such a system can run a cyclic task, called every 25 µs by a hardware timer (Task Z1). This task calculates the approximate braking distance at a given speed and weather conditions, and all operations must be completed in less than 20 µs. Additionally, each time the driver presses the brake, the system must calculate whether, and if so, which wheels have skidded. To make sure that safety systems such as ABS (Anti-lock Braking System) or traction control react correctly, this event must be detected no later than, for example, 10 µs after the brake pedal is pressed (see Task Z2). The wheels can also get into a skid after braking has started, so checking whether the wheels have gotten into a skid should take place all the time the brake is depressed (high state of the signal Z2).

FIGURE 1. An example of the system run.

A. SYSTEM ARCHITECTURE
The system architecture uses the thread interleaving method to avoid data and control hazards. In that case, there is no need for complex forwarding and jump prediction circuits. The thread interleaving method has been extended by the authors [1], [2] and has also been used to exchange data between tasks. The proposed architecture is based on pipeline processing. It can consist of 1 to 8 reconfigurable cores. It is possible to reconfigure the pipelines, which can contain from 5 to 12 stages [2], depending on application requirements. As described in [2], the basic five stages of the pipeline can be expanded to include sub-stages. Thus, the IF (Instruction Fetch) stage can contain three sub-stages (Select Bank and Instruction Address, Instruction Fetch, Select Instruction); the ID (Instruction Decode) stage can be divided into two sub-stages (Select General Purpose Register Bank, Instruction Decode); the EXE (Execution) stage can be expanded to three processing cycles (Shift, ALU, EXE); similarly, the MA (Memory Access) stage can also be expanded to three cycles (Select Bank and MEM Data Address, MEM Data Fetch, Select MEM Data).
Interleaving threads requires the introduction of multiplexers to switch task data, therefore the authors proposed to place the multiplexers in separate stages of the pipeline. This minimizes the impact of the number of tasks on the maximum operating frequency of the system. The microarchitecture of our system is modeled after the ISA (Instruction Set Architecture) of the ARM processor family. The ISA was enhanced with special timing instructions [1]:
• AD counter_number – the instruction activating the appropriate deadline counter;
• DD counter_number – the instruction deactivating the appropriate deadline counter;
• SD counter_number value – the instruction loading the value to the deadline counter (setting the deadline);
• WFD counter_number value – the instruction suspending the execution of the program until the deadline counter reaches 1.
The system structure consists of one or more cores connected to a common bus. The bus is used to exchange information between threads. The system's cores communicate with external peripherals via input/output ports, which are mapped as part of the data memory of each thread. Every thread can have its own individual input/output ports. In addition, each thread is started by a corresponding individual external signal. The procedure of launching a given thread can be implemented either hardware-wise (using an external interrupt) or software-wise by executing the corresponding instruction by another processor thread. The system architecture is based on the authors' previous solutions [1], [2], [29]. The current implementation of the system allows simultaneous processing of up to 255 different tasks, with theoretically any distribution of these tasks between cores (the theoretical maximum number of cores is also 255).

FIGURE 2. The structure of the core.

A configurable number of supported threads is provided for every core. Because these threads are not dependent on each other, they must have their own independent resources such as memory and register files. In order to make the cores as compact as possible, only these independent resources shall be additional elements dedicated to a thread. The remaining elements of the core, such as the processing pipeline, are shared by the threads mapped and executed in a single core. Fig. 2 presents the structure of a configurable core. The size of a core depends on the number of threads processed in a given core. For this purpose, a special parameter is defined for every core. The value of this parameter determines the number of register files, memory banks and the size of the multiplexers responsible for switching threads' resources. The rest of the core is designed to be independent of this parameter. So, the increase in resource requirements

B. PIPELINE CYCLE
As mentioned in the above section, every core has a dedicated number of assigned threads. The threads work in the interleaved scheme [38]. The interleaving eliminates the impact of the pipeline processing on the timing unpredictability. The order and frequency of the threads being processed is configurable to allow the best possible match between the core and the tasks being executed. The tasks performed by the threads are switched (interleaved) according to the order stored in a special Interleaving Cycle Register (ICR). The register contains threads' identifiers, and its length is also adjusted. With the appropriate ICR design, some threads can be processed more frequently than others. This mechanism allows the core to match tasks in such a way that threads with very strict timing constraints, and those with less rigorous constraints, can run within a single core. Fig. 3 shows an example of the contents of the ICR in a simplified five-thread core. The ICR has a length of 6 identifiers and processes threads 1, 2, 3, 1, 4 and 5. Assuming that the timing requirements for thread 1 are much stricter, this thread must be executed twice as often as the other threads 2, 3, 4, and 5. Therefore, the identifier of this thread appears more frequently in the ICR. The entire sequence is repeated from the beginning of the system's operation until it is shut down.

FIGURE 3. An example of ICR content.

To reduce the relationship between the number of threads processed by a single core and the maximum frequency of the clock, and to simplify the inspection of the timing predictability of the system, the pipeline can be extended with additional stages, but one needs to remember that a core with an extended pipeline must process the appropriate number of threads [38].

C. TYPES OF THE PROCESSED TASKS
In the designed system, all tasks are of the hard type [39], [40], which means that no thread is allowed to miss
D. MODEL OF TASKS
The task model was extended [2] with two new parameters (flags) related to the new types of tasks introduced. Thus, the model of a given task unambiguously indicates its type and processing. Finally, the assumed model of the task processed in the proposed system is the following:

Ti = {Ci, Mi, Di, CTHMi, SHTi}    (1)

where:
Ci – number of standard (without Mi) instructions of the i-th task;
Mi – number of memory access instructions that refer to the data of other threads;
Di – maximum acceptable execution time (deadline) of the i-th task;
CTHMi – flag (valid for Mi > 0) indicating whether the i-th task is of type CT ('0') or CTHM ('1');
SHTi – flag indicating whether the i-th task is strong hard timed ('1').

The previously defined TFi (Task Frequency) parameter [1], [2], which allows determining the requirement for computational power for each task, was also remodeled. Changes introduced into the micro-architecture of the system made it necessary to modify the TFi parameter. Its final version is presented by equation (2):

TFi [MHz] = (Ci + Mi · ⌈Mdur / Minindistance⌉ + 2) / Di [µs]    (2)

where:
Mdur – number of clock cycles required to execute an instruction of data exchange with another thread;
Minindistance – interleave depth corresponding to the minimum allowed distance between pipeline stages processing the same task in a given moment of time.
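To make the use of equation (2) concrete, the short sketch below evaluates TFi for a task described by model (1). It is only an illustrative rendering of the formula, not the authors' WCTA code; the C# type, its field names, the ceiling applied to the bracketed ratio, and the numerical values in the example are assumptions introduced here.

```csharp
using System;

// Task model of equation (1): Ti = {Ci, Mi, Di, CTHMi, SHTi}.
// Field names are illustrative; Di is expressed in microseconds.
record TaskModel(int C, int M, double D_us, bool CTHM, bool SHT);

static class TaskFrequency
{
    // Equation (2): TFi [MHz] = (Ci + Mi * ceil(Mdur / Minindistance) + 2) / Di [us].
    // The rounding of Mdur / Minindistance is assumed to be a ceiling here.
    public static double TF(TaskModel t, int mDur, int minInDistance)
    {
        int memAsStdInstr = (int)Math.Ceiling((double)mDur / minInDistance);
        return (t.C + t.M * memAsStdInstr + 2) / t.D_us;
    }

    static void Main()
    {
        // Assumed example: 1200 standard instructions, 20 memory-exchange
        // instructions, 500 us deadline, Mdur = 12 cycles, interleave depth 4.
        var t1 = new TaskModel(1200, 20, 500.0, CTHM: false, SHT: false);
        Console.WriteLine($"TF = {TF(t1, mDur: 12, minInDistance: 4):F3} MHz");
    }
}
```

Under these assumed values the task requires roughly 2.5 MHz of computational power, which is the figure the mapping stage compares against the core budget.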
case, each core should perform at most only one SHT task. If it turns out that the number of SHT tasks exceeds the number of cores, then in the next step (task arrangement), we should increase the values of their task frequencies TFi accordingly or freeze the TWi time window value for each SHT task.
Then the procedure of pre-assignment of tasks to cores is performed (steps B1-B3), during which two consecutive tasks with maximum and minimum TFi values are alternately assigned to successive cores. It should be noted that by operating on the TFsorted vector, it is not necessary to search for the maximum and minimum values each time; the first and last elements of the TFsorted vector are taken. When all tasks are assigned and the TFsorted vector is empty (step C), the procedure is terminated.
The next step (D) of the algorithm is called "Load balancing of the cores" and is similar to the BLIS II algorithm [2]. Namely, the relative difference between the most loaded core and the least loaded one for a certain used percentage ratio should be minimized.
When we obtain a system with sufficiently evenly loaded cores, we take the appropriate frequency values (step E) and check that all the obtained values of the operating frequencies of the cores are within the acceptable operating range (step F). If any operating frequency exceeds the permissible value of Fmax, the number of cores must be increased and the algorithm should be repeated from step A.
The entire procedure is presented in the example depicted in the section describing the experiments.

B. TASKS ARRANGEMENT
This stage of task scheduling is responsible for preparing a sequence of thread identifiers (ThID) which will ultimately be stored in the ICR. The order of ThIDs determines the order of tasks processed in the core pipeline. For simulation purposes, the algorithm generates the sequence of identifiers in ASCII code and stores it in the appropriate file. The length of the sequence is adjusted experimentally and this parameter is denoted by WL (Window Length). In order to ensure the continuity of tasks' processing, so that the length of the ICR does not determine the processing time, the register works in a round-robin scheme.
The scheduling algorithm is implemented as a program in C# and mimics the behavior of a digital circuit implemented in hardware. Thus, this program can be transferred into the corresponding hardware structure quickly and easily.
Each task has its own independent set of counting registers: TCi, TWi and TTWi. The first, the TCi reverse counting register (Thread Counter), stores the value corresponding to the number of the clock cycles after which the i-th thread is to be loaded into the pipeline, in order to meet its deadline. When a given register TCi reaches 0, it means that the appropriate thread must be executed.
The second register, TWi (Time Window), contains the value representing the i-th task's period (the number of clock cycles), i.e., the maximum period of the execution of the consecutive instructions of a given thread to meet its timing requirements. The value of this parameter can be calculated from the following expression (5):

TWi = (Fcorej − Fcmarg) / TFi    (5)

where:
Fcorej [MHz] – theoretical operating frequency of the j-th core of the system;
Fcmarg [MHz] – the system's operating frequency margin.

The Fcmarg parameter is an experimentally selected operating frequency margin of the system, and the results of previous experiments [2] have shown that the best results are achieved with a value of about 4-12% of the Fcorej parameter. The decision determining which thread will be processed by the system in the next clock cycle is made based on the states of the TCi registers. When the algorithm is started, the initial states of the TCi registers are calculated by copying the values stored in the TWi registers. However, there is one exception to this rule: no two registers TCi and TCj (i ≠ j) can contain the same value at any moment of time (a conflict would arise). To avoid such a situation, in case the specified value is already occupied, the nearest free lower state is searched for and written to the given TCi register.

FIGURE 9. ICR sequence generation algorithm.

The third register, TTWi (Temporary Time Window), is used just when a conflict of TC values arises. This register holds the value corresponding to the difference between the actual cycle of the i-th task occurrence and the cycle resulting from the TWi value (5). The search procedure seeks to have a thread executed as early as possible, but in a critical situation, its service may occur later than the TWi parameter (in which case the TTWi value is negative). The task scheduling algorithm
strives to make the average TWi value of each task as close as possible to the value resulting from equation (5).
Fig. 9 shows the scheme of the ICR sequence generation procedure. In each loop of the algorithm, the value of all TC registers is decremented. When the value of any of them reaches 0, it means that such a thread will be processed in the next clock cycle. Then either the value of TWi (if it is free at the time) is assigned to this TCi counting register, or a new value is searched for and the difference is written to the TTWi register.
It may also occur that in a given cycle none of the counters has reached the zero state, in which case a thread with ThID0 is added to the cycle. This identifier means that no thread will be processed in a given cycle - it will be a so-called idle cycle not processing any instruction. Then the WL value is decremented and the entire cycle is repeated until the value of the WL counter (ICR sequence length counter) reaches 0.
The discussed scheduling algorithm can also eliminate the resulting idle cycles. If the next thread to be added to the cycle is to be the ThID0 thread, the algorithm checks whether it is possible to swap this address for another one. If so, the algorithm swaps it and updates its TC and TTW register values accordingly. This could minimize the number of idle cycles at the expense of faster processing of some threads. This means that this procedure may cause a larger difference between the minimum and maximum execution time of standard threads. Therefore, this part of the algorithm is optional.
SHT threads are hardware threads that either have no instructions to exchange data with other threads or execute instructions to exchange data with other threads with a fixed execution time (Fig. 4). This means that when SHT threads are used, all task execution time discrepancies originating from the system microarchitecture are eliminated. To preserve these properties of SHT threads, these threads take priority in the scheduling process and always appear in the ICR sequence at the exact moments determined by their TW value. This ensures that the difference between their maximum and minimum execution times is minimal.
Unfortunately, the scheduling of threads of this type comes with some limitations. SHT-type threads have priority in scheduling over regular threads, but there can be no conflict in the scheduling of two threads of this type (SHT), because then one of them could not be an SHT thread. To prevent such conflicts, the TW parameters of SHT-type threads running in a single core must be the same or must be a multiple of the smallest TW of the SHT-type thread. (If there is a conflict, the TW parameter of any of the SHT tasks should be increased, which will result in faster processing.) With the above assumption, the distances between the ThIDs of SHT threads in the generated sequence will always be constant, thus avoiding SHT thread conflicts.

C. INITIAL SIMULATION
The next stage of the proposed scheduling procedure is simulation. We have developed our own simulation environment and implemented it in C# in the form of a timing analyzer, WCTA. The input data contains information about the tasks (their parameters) and the thread identifier sequence (ICR) generated in the previous stage. The simulation of the configured system is done at a high level without stepping into details at the signal level.
During the initial simulation, the necessary information concerning on-line cores' timing parameters is collected. The initial simulation also allows reporting possible errors and suggesting the system's frequency correction. The obtained data also shows the spread of the timing parameters. The following steps should be performed for each core operating in the system:
To calculate the maximum and minimum execution time of a given task, the length of the task must be expressed by the number of standard instructions. In the case of the not cooperating tasks (NCT), the problem is trivial, because this number corresponds to the Ci parameter of the task.

FIGURE 10. Finding the maximum (WCT) number of standard instructions corresponding to the memory instruction for three tasks.

For CT type tasks (where Mi ≠ 0) the length of the data exchange with another thread is expressed by the Mdur parameter. The method of the estimation of Mdur was discussed in [1], [2]. Mdur corresponds to the number of clock cycles needed for the completion of the memory operation. To determine the maximum number of standard instructions (Ci) corresponding to the data exchange operation (Mi) (in the worst case), the entire ICR sequence should be analyzed for the number of occurrences of the identifier of a specific task within a time window of the length of Mdur cycles (Fig. 10). The entire process is illustrated in Fig. 10 for three tasks #1, #5 and #12. During the initial simulation, a time window of Mdur clock cycles is created. The first part of the ICR sequence is analyzed in this time window. The number of occurrences of each ThIDi identifier is stored in the corresponding MdurInsi parameter. The time window is then shifted by one symbol towards the end of the ICR sequence and these values are updated as necessary. To reduce the computational complexity of the algorithm, a search of the entire time window is performed only once at the beginning, while after each shift of the time window only the new identifier entering the time window from the right side and the identifier
ejected from the window from the left side are checked. As a result, we obtain a set of MdurInsi values corresponding to the number of standard instructions Ci, with a total execution time equivalent to the maximum time required to exchange data for the i-th thread. It should be noted that for tasks of type SHT, the value MdurInsi is constant in all time windows.
Now the total maximum length of a given thread, MaxIi, can be expressed by the number of standard instructions. The following relation (6) is, in practice, strongly overestimated (pessimistic), although it gives a 100 percent guarantee of time predictability, since, as is well known, operations with memory are affected by the greatest amount of uncertainty.

MaxIi = MdurInsi · Mi + Ci    (6)

where:
MaxIi – the equivalent of the maximum number of standard instructions needed to perform the i-th task;
MdurInsi – maximum number of standard instructions needed to execute a data exchange instruction with another thread.

For NCT and CTHM tasks, the number of instructions needed will always be constant. Thus, if MinIi denotes the minimum number of standard instructions needed to execute a data exchange instruction with another thread, both parameters are equal, i.e. MaxIi = MinIi. Moreover, in particularly advantageous situations, when the data exchange operation takes the same length of time as the standard instruction, CT tasks can be completed faster and MinIi = Mi + Ci.

FIGURE 11. Finding the maximum number of system ticks required to complete the task.

In the next step of the simulation, the maximum and minimum number of clock ticks needed to execute a given thread is calculated. The sequence of ThIDs checked in the simulation is searched for all possible intervals in which a given task can be completely executed from the start to the end. In the ICR sequence, we look for substrings containing MaxIi + 1 identifiers of the i-th task for the maximum task execution time and MinIi identifiers for the case of minimum task execution time, respectively. Each sub-sequence must begin and end with the identifier of the corresponding thread. In this way, the maximum and minimum number of clock cycles required to execute each task mapped to a given core can be determined. The method of determining the MaxExeTicks1 parameter is shown by the example depicted in Fig. 11. Assuming that the MaxI1 parameter is 2, the longest sequence of job ThIDs that begins and ends with the identifier "1" and contains MaxI1 + 1 of these identifiers (3 in this example) is searched for in the sequence of identifiers stored in the ICR. The longest sequence (MaxExeInterval1) has a length of 18. In the worst-case scenario, i.e., the worst possible moment to start the task execution, the second ID is shown (marked in red). If the length of the task is 2 instructions (MaxI1 = 2), the task will complete as late as the point marked in green, that is, after 17 clock cycles. Thus, MaxExeTicks1 is going to be equal to 17.
In the best case, i.e., when the task starts at the best possible moment, the shortest sequence starting and ending with the identifier "1" and containing exactly MinI1 such identifiers is searched for. MinExeTicks1 will just be equal to the length of this sequence.
In practice, the MaxExeTicksi and MinExeTicksi values, denoting the maximum and minimum number of the clock cycles, respectively, needed to execute the i-th task, should be expanded by a few cycles associated with the launch of the pipeline of a given core, depending on its configuration [1].
The parameter MaxExeTicksi divided by the deadline Di has a significant practical meaning, namely, it denotes the minimum operating frequency of the core at which the i-th task is predictable. Thus, the minimum operating frequency of the core can be determined from the relation (7):

MinFreq = max(i=1,...,N) (MaxExeTicksi / Di)    (7)

where:
N – number of tasks mapped to the core;
MaxExeTicksi – maximum number of the clock cycles needed to execute the i-th task.

D. FREQUENCY DETERMINATION
If the frequency of the core calculated from relation (7) is higher than the frequency Fsysk determined at the stage of mapping tasks to the k-th core, the frequency should be updated and taken as the new value of the operating frequency Fszek of this core; otherwise Fszek = Fsysk. When the initial simulation of all cores is completed and their operating frequencies determined, the final operating frequency of the entire system should be calculated. To ensure time predictability of all tasks, the maximum value among Fszek should be taken. At this point, the final simulation of the entire system can proceed.

E. THE FINAL SIMULATION
Fig. 12 presents the main window (the interface) of the simulator. As with the initial simulation, the numerical parameters
The transmitter can be used to transmit actual data coming from the real-time system regarding similar parameters that were obtained during the simulation (Fig. 12). This data is read out using dedicated software, where it is then analyzed.

B. VERIFICATION OF THE ADOPTED SCHEDULING METHODOLOGY AGAINST THE TIME PREDICTABILITY REQUIREMENTS OF THE SYSTEM
A series of conducted experiments confirmed the advantages of the adopted methodology based on the use of the WCTA. In this section, generalized, aggregated results showing averaged data obtained from at least one million executions of each of the tested tasks are presented. It should be noted that each time the test set was randomized. A more detailed representative example showing our approach step by step is included in the next subsection.
The tests comprised successively running three different system configurations related to different design goals: minimizing cost (MINRES), minimizing power (MAXPRO), and aiming to meet given energy constraints while optimizing cost (SFERA). Next, each system was operated and performed the tasks assigned to the computing resources (cores) for about one hour. Then the data acquired from the measurement environment (the maximum and minimum execution time of each task) was compared to the data acquired from the WCTA analyzer and the deadline (Di) parameter of the tasks. These experiments confirmed the correctness of the concept and execution of the scheduling methodology, and no deadline was missed. Furthermore, the performance of the task execution analyzer was verified. Each of the maximum and minimum task execution times is within the allocation calculated by WCTA.
Experiments have shown that the best results are obtained for ICR sequences consisting of 2000-5000 identifiers (the WL parameter). The next three diagrams show the averaged results obtained from a series of experiments. The complete analysis, on the other hand, can be traced through a representative example, which is included in the next point of this section.
Figures 14 and 15 show the relative difference between the maximum and minimum execution time of each type of task, related to the maximum execution time of the task (the worst case), obtained from simulations and measurements of the actual system, respectively.

FIGURE 14. The relative difference between the maximum and minimum execution time obtained from WCTA related to the actual maximum execution time (mean values for various types of tasks).

FIGURE 15. The relative difference between the maximum and minimum execution time obtained from the practical experiments related to the actual maximum execution time (mean values for various types of tasks).

The results obtained from the simulation are more pessimistic than the measurements made in the real system realized in the FPGA. This is because in the analysis process, the simulator always assumes the worst case of the task run time. It can be seen from the graphs that the largest fluctuations in task execution time occur for CT and CTHM tasks. A fairly high repeatability of SHT task execution times, of about 0.1%, was observed.
Fig. 16 shows the relative difference between the deadline and the average task completion time. These charts show how much earlier a given result can be obtained. The largest differences, reaching 44%, were obtained for SHT tasks. This is due to the fact that in the scheduling process, such tasks are assigned the highest priority, which makes it necessary to increase the frequency of work to meet the time requirements for all tasks.

C. REPRESENTATIVE EXAMPLE
The following example shows the process of scheduling 60 randomly selected test tasks. In addition, randomization of the task parameters Ci, Mi and Di was carried out. The following constraints were assumed:
1) the maximum value of Ci is limited to 3000;
2) the maximum value of Mi is limited to 100;
3) the maximum value of TFi is limited to 4;
4) the probability of random drawing an SHT task equals 5%;
TABLE 1. Parameters of test tasks.

Based on the above assumptions, 60 test tasks were generated, as shown in Table 1.
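As a sketch of how such a constrained random test set can be produced, the fragment below draws task parameters that respect constraints 1)-4). It is a hypothetical generator written only for illustration: the CT/CTHM split, the deadline spread, the Mdur and Minindistance values, and the use of equation (2) to keep TFi below 4 MHz are all assumptions, not the authors' published procedure.

```csharp
using System;
using System.Collections.Generic;

static class TestTaskGenerator
{
    record TestTask(int C, int M, double D_us, bool CTHM, bool SHT);

    // Draws tasks that respect constraints 1)-4): Ci <= 3000, Mi <= 100,
    // TFi <= 4 MHz (enforced through the deadline), P(SHT) = 5%.
    static List<TestTask> Generate(int count, int mDur, int minInDistance, int seed = 1)
    {
        var rnd = new Random(seed);
        var tasks = new List<TestTask>();
        for (int i = 0; i < count; i++)
        {
            int c = rnd.Next(1, 3001);                    // constraint 1
            int m = rnd.Next(0, 101);                     // constraint 2
            bool sht = rnd.NextDouble() < 0.05;           // constraint 4
            bool cthm = m > 0 && rnd.NextDouble() < 0.5;  // assumed CT/CTHM split

            // Constraint 3: pick Di so that TFi from equation (2) stays below 4 MHz.
            double work = c + m * Math.Ceiling((double)mDur / minInDistance) + 2;
            double dMin = work / 4.0;                          // shortest admissible deadline [us]
            double d = dMin * (1.0 + rnd.NextDouble() * 4.0);  // assumed spread of deadlines
            tasks.Add(new TestTask(c, m, d, cthm, sht));
        }
        return tasks;
    }

    static void Main() =>
        Console.WriteLine($"Generated {Generate(60, mDur: 12, minInDistance: 4).Count} test tasks.");
}
```

A generator of this kind makes the randomized experiments repeatable for a fixed seed, which is useful when comparing runs of the three system configurations.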
For the current implementation of the system in the Virtex-7 FPGA chip, Fmax = 150 MHz was assumed, and this value includes some safety margin [1], [2]. And because the total sum of all tasks' frequencies (TFi) is lower than Fmax, all tasks can be allocated in a single core. We then proceeded to determine the number of cores needed in each of the three methodologies, with a power limit of 1 W for the SFERA configuration. The resulting parameters are gathered in Table 2, with 1 core sufficient for the MINRES methodology, and 3 cores for SFERA (assumed as the initial configuration) proving sufficient to meet the power requirements.
The obtained final tasks mapping based on these parameters and taking into account the previously determined task frequencies is shown in Table 3.
As a result of running the task arrangement procedure, ICR sequences with appropriately selected lengths are obtained. For example, in the case of an architecture implemented according to the MINRES scenario, the length of the window is 2880 and it is a multiple of the time window of the largest SHT-type thread (TW5 = 72). The WL parameters of the other scenarios were selected similarly. Table 4 collects these parameters and shows the minimum operating frequencies of individual cores and the final frequency of the entire system obtained from the simulation process.
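For readers who want to relate Table 4 to the arrangement step, the snippet below is a deliberately simplified sketch of the ICR sequence generation described in the tasks arrangement stage: every cycle all TC counters are decremented, a thread whose counter expires is emitted and reloaded with its TW value, and ThID0 marks an idle cycle. The TTW bookkeeping, conflict resolution, SHT prioritization, and idle-cycle elimination of the full algorithm (Fig. 9) are intentionally left out, and the thread periods in the example are invented.

```csharp
using System;
using System.Collections.Generic;

static class IcrGenerator
{
    // Simplified ICR sequence generation: tw[i] is the time window (period in
    // cycles) of thread i+1; identifier 0 denotes an idle cycle (ThID0).
    static List<int> Generate(int[] tw, int windowLength)
    {
        var tc = (int[])tw.Clone();            // TC counters start from the TW values
        var icr = new List<int>();
        for (int cycle = 0; cycle < windowLength; cycle++)
        {
            for (int i = 0; i < tc.Length; i++) tc[i]--;   // decrement all TC counters

            int due = Array.FindIndex(tc, v => v <= 0);    // first thread whose counter expired
            if (due >= 0)
            {
                icr.Add(due + 1);        // emit the thread identifier
                tc[due] += tw[due];      // reload with its time window
            }
            else
            {
                icr.Add(0);              // no counter expired: idle cycle ThID0
            }
        }
        return icr;
    }

    static void Main()
    {
        // Assumed toy configuration: three threads with periods 3, 4 and 6 cycles.
        var icr = Generate(new[] { 3, 4, 6 }, windowLength: 24);
        Console.WriteLine(string.Join(" ", icr));
    }
}
```

Running it with these toy periods prints a 24-symbol sequence in which thread 1 appears most often, mirroring the way a stricter time window makes an identifier recur more frequently in the ICR.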
Then, each version of the system was implemented in the FPGA and tested in the environment shown in Fig. 13. During the experiments, the power consumption of the different system configurations (Fig. 17) and the requirements for post-implementation hardware resources (Fig. 18) were compared. In the case of energy demand, the difference between the two extreme implementations of MINRES and MAXPRO is about 42% for total power and almost 61% for dynamic power, respectively.
TABLE 2. Number of processing cores for different methodologies.

TABLE 4. Selected results of tasks arrangement and initial simulation of the system.
FIGURE 23. The average execution times of each task for each system configuration along with the range of changes (min-max) from WCTA.

FIGURE 24. The average execution times of each task for each system configuration along with the range of changes (min-max) from the real system implemented in hardware.

FIGURE 25. The quality measure factors [29] obtained for all tasks in three tested configurations in the simulation (left charts) and in the implemented system (right charts).
of the proposed solution. We have proposed a simple and effective task and context switching mechanism based on the interleaving cycle register controlling the pipeline processing. The processing cycles dedicated to a given task are continuously repeated and thus, without the need of interrupts, the task can be executed with full timing predictability regardless of when it starts. The proposed solution mimics an RTOS at the hardware layer. In the proposed solution, it is possible to use a single thread as an operating system thread using timing instructions and deadline registers.
The two architectures closest to the proposed solution, in our opinion, are ARPA-MT [45] and FlexPRET [42], and we decided to compare the resources obtained after implementation with these two approaches. The cited papers provide resources for different system configurations and different numbers of tasks, so some normalization of the results has been introduced to make the comparison as relevant as possible.
The diagram depicted in Fig. 26 contains the comparison of resources consumed for the implementation of the presented architectures with the averaged resources required per single task in the ARPA-MT system [45]. That architecture [45] used hardware support for RTOS (Cop2-OSC), while tasks were processed in the main pipeline (Cop0). Such averaging
can only provide a rough comparison; however, all results indicate that our solution is more resource efficient.
Since in the FlexPRET solution [42] the data related to the context of the tasks being processed (including register files) are stored in memory, the information reported on resources consumed ignores these items. Thus, in order to make the comparison adequate, the resources needed to implement the general-purpose registers of individual tasks in our solutions have been neglected. A direct comparison of resources, especially with a large number of processed tasks, would give a significantly falsified result.

FIGURE 27. Comparison of 'processing' resources (without storing components) consumption per a single task.

VI. SUMMARY
The presented methodology allows adjusting the hardware structure so as to minimize the energy required for the system's operation or the number of resources required, whilst ensuring the critical time predictability of the system. The proposed method of simulating task execution allows the maximum and minimum task execution times to be predicted even before the system starts up. The division into different types of tasks makes it possible to match the differences between maximum and minimum task execution times with the requirements of the task application.
We were able to achieve a 100% fulfillment rate of task completion time, i.e. DMR = 0% [36], for all tasks. Although

REFERENCES
[1] E. Antolak and A. Pułka, "Flexible hardware approach to multi-core time-predictable systems design based on the interleaved pipeline processing," IET Circuits, Devices Syst., vol. 14, no. 5, pp. 648–659, Aug. 2020, doi: 10.1049/iet-cds.2019.0521.
[2] E. Antolak and A. Pulka, "Energy-efficient task scheduling in design of multithread time predictable real-time systems," IEEE Access, vol. 9, pp. 121111–121127, 2021, doi: 10.1109/ACCESS.2021.3108912.
[3] M. Schoeberl, "T-CREST: Time-predictable multi-core architecture for embedded systems," J. Syst. Archit., vol. 61, no. 9, pp. 449–471, Oct. 2015, doi: 10.1016/j.sysarc.2015.04.002.
[4] M. Paolieri, E. Quinones, F. J. Cazorla, J. Wolf, T. Ungerer, S. Uhrig, and Z. Petrov, "A software-pipelined approach to multicore execution of timing predictable multi-threaded hard real-time tasks," in Proc. 14th IEEE Int. Symp. Object/Component/Service-Oriented Real-Time Distrib. Comput., Mar. 2011, pp. 233–240, doi: 10.1109/ISORC.2011.36.
[5] T. Ungerer, F. Cazorla, P. Sainrat, G. Bernat, Z. Petrov, C. Rochange, E. Quinones, M. Gerdes, M. Paolieri, J. Wolf, H. Casse, S. Uhrig, I. Guliashvili, M. Houston, F. Kluge, S. Metzlaff, and J. Mische, "Merasa: Multicore execution of hard real-time applications supporting analyzability," IEEE Micro, vol. 30, no. 5, pp. 66–75, Sep. 2010, doi: 10.1109/MM.2010.78.
[6] D. D. Gajski, Embedded System Design: Modeling, Synthesis and Verification. Dordrecht, The Netherlands: Springer, 2009.
[7] E. A. Lee, "Absolutely positively on time: What would it take? [embedded computing systems]," Computer, vol. 38, no. 7, pp. 85–87, Jul. 2005, doi: 10.1109/MC.2005.211.
[8] S. Andalam, P. S. Roop, A. Girault, and C. Traulsen, "A predictable framework for safety-critical embedded systems," IEEE Trans. Comput., vol. 63, no. 7, pp. 1600–1612, Jul. 2014, doi: 10.1109/TC.2013.28.
[9] D. Broman, "Precision timed infrastructure: Design challenges," in Proc. Electron. Syst. Level Synth. Conf. (ESLsyn), Austin, TX, USA, May 2013, pp. 1–6.
[10] T. Henzinger and C. Kirsch, "The embedded machine: Predictable, portable real-time code," ACM Trans. Program. Lang. Syst., vol. 29, p. 33, Dec. 2002, doi: 10.1145/512529.512567.
[11] E. A. Lee, "The problem with threads," Computer, vol. 39, no. 5, pp. 33–42, May 2006, doi: 10.1109/MC.2006.180.
[12] F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero, "Predictable performance in SMT processors: Synergy between the OS and SMTs," IEEE Trans. Comput., vol. 55, no. 7, pp. 785–799, Jul. 2006, doi: 10.1109/TC.2006.108.
[13] S. A. Edwards and E. A. Lee, "The case for the precision timed (PRET) machine," in Proc. 44th ACM/IEEE Design Autom. Conf., San Diego, CA, USA, Jun. 2007, pp. 264–265, doi: 10.1109/DAC.2007.375165.
[14] L. Thiele and R. Wilhelm, "Design for timing predictability," Real-Time Syst., vol. 28, nos. 2–3, pp. 157–177, Nov. 2004, doi: 10.1023/B:TIME.0000045316.66276.6e.
[15] N. J. H. Ip and S. A. Edwards, "A processor extension for cycle-accurate real-time software," in Embedded and Ubiquitous Computing, vol. 4096, E. Sha, S.-K. Han, C.-Z. Xu, M.-H. Kim, L. T. Yang, and B. Xiao, Eds. Berlin, Germany: Springer, 2006, pp. 449–458, doi: 10.1007/11802167_46.
[16] B. Lickly, I. Liu, S. Kim, H. D. Patel, S. A. Edwards, and E. A. Lee, "Predictable programming on a precision timed architecture," in Proc. Int. Conf. Compil., Archit. Synth. Embedded Syst. (CASES), Atlanta, GA, USA, 2008, p. 137, doi: 10.1145/1450095.1450117.
[17] M. Schoeberl, P. Schleuniger, W. Puffitsch, F. Brandner, and C. W. Probst, "Towards a time-predictable dual-issue microprocessor: The Patmos approach," in Proc. 1st Workshop Bringing Theory Pract., Predictability Perform. Embedded Syst. (PPES), Grenoble, France, 2011, pp. 11–21. Accessed: Nov. 12, 2019. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2011/3077
[18] M. Fernández, R. Gioiosa, E. Quiñones, L. Fossati, M. Zulianello, and F. J. Cazorla, "Assessing the suitability of the NGMP multi-core processor in the space domain," in Proc. 10th ACM Int. Conf. Embedded Softw. (EMSOFT), Tampere, Finland, 2012, pp. 175–184, doi: 10.1145/2380356.2380389.
[19] A. Alhammad and R. Pellizzoni, "Time-predictable execution of multithreaded applications on multicore systems," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, 2014, pp. 1–6, doi: 10.7873/DATE.2014.042.
[20] S. Moulik, R. Devaraj, and A. Sarkar, "COST: A cluster-oriented scheduling technique for heterogeneous multi-cores," in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Miyazaki, Japan, Oct. 2018, pp. 1951–1957, doi: 10.1109/SMC.2018.00337.
[21] R. Pathan, P. Voudouris, and P. Stenström, "Scheduling parallel real-time recurrent tasks on multicore platforms," IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 4, pp. 915–928, Apr. 2018, doi: 10.1109/TPDS.2017.2777449.
[22] D. Kim, Y.-B. Ko, and S.-H. Lim, "Energy-efficient real-time multi-core assignment scheme for asymmetric multi-core mobile devices," IEEE Access, vol. 8, pp. 117324–117334, 2020, doi: 10.1109/ACCESS.2020.3005235.
[23] J. Chen, C. Du, P. Han, and Y. Zhang, "Sensitivity analysis of strictly periodic tasks in multi-core real-time systems," IEEE Access, vol. 7, pp. 135005–135022, 2019, doi: 10.1109/ACCESS.2019.2941958.
[24] B. Forsberg, L. Benini, and A. Marongiu, "HePREM: Enabling predictable GPU execution on heterogeneous SoC," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, Mar. 2018, pp. 539–544, doi: 10.23919/DATE.2018.8342066.
[25] B. Akesson and K. Goossens, Memory Controllers for Real-Time Embedded Systems, vol. 2. New York, NY, USA: Springer, 2012, doi: 10.1007/978-1-4419-8207-0.
[26] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, "PRET DRAM controller: Bank privatization for predictability and temporal isolation," in Proc. 7th IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth., Taipei, Taiwan, 2011, p. 99, doi: 10.1145/2039370.2039388.
[27] L. M. AlBarakat, P. V. Gratz, and D. A. Jiménez, "MTB-fetch: Multithreading aware hardware prefetching for chip multiprocessors," IEEE Comput. Archit. Lett., vol. 17, no. 2, pp. 175–178, Jul. 2018, doi: 10.1109/LCA.2018.2847345.
[28] M. Schoeberl, "A Java processor architecture for embedded real-time systems," J. Syst. Archit., vol. 54, nos. 1–2, pp. 265–286, Jan. 2008, doi: 10.1016/j.sysarc.2007.06.001.
[29] P. Axer, R. Ernst, H. Falk, A. Girault, D. Grund, N. Guan, B. Jonsson, P. Marwedel, J. Reineke, C. Rochange, M. Sebastian, R. V. Hanxleden, R. Wilhelm, and W. Yi, "Building timing predictable embedded systems," ACM Trans. Embedded Comput. Syst., vol. 13, no. 4, pp. 1–37, Dec. 2014, doi: 10.1145/2560033.
[30] Y. Kim, J. Kong, and A. Munir, "CPU-accelerator co-scheduling for CNN acceleration at the edge," IEEE Access, vol. 8, pp. 211422–211433, 2020, doi: 10.1109/ACCESS.2020.3039278.
[31] A. U. Rehman, "Dynamic energy efficient resource allocation strategy for load balancing in fog environment," IEEE Access, vol. 8, pp. 199829–199839, 2020, doi: 10.1109/ACCESS.2020.3035181.
[32] F. Glaser, G. Tagliavini, D. Rossi, G. Haugou, Q. Huang, and L. Benini, "Energy-efficient hardware-accelerated synchronization for shared-L1-memory multiprocessor clusters," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 3, pp. 633–648, Mar. 2021, doi: 10.1109/TPDS.2020.3028691.
[33] H. Bahn and K. Cho, "Evolution-based real-time job scheduling for co-optimizing processor and memory power savings," IEEE Access, vol. 8, pp. 152805–152819, 2020, doi: 10.1109/ACCESS.2020.3017014.
[34] H. Chniter, O. Mosbahi, M. Khalgui, M. Zhou, and Z. Li, "Improved multi-core real-time task scheduling of reconfigurable systems with energy constraints," IEEE Access, vol. 8, pp. 95698–95713, 2020, doi: 10.1109/ACCESS.2020.2990973.
[35] Welcome to LPA. Accessed: Jan. 31, 2023. [Online]. Available: https://www.lpa.co.uk/
[36] Y. Gao, G. Pallez, Y. Robert, and F. Vivien, "Dynamic scheduling strategies for firm semi-periodic real-time tasks," IEEE Trans. Comput., vol. 72, no. 1, pp. 55–68, Jan. 2023, doi: 10.1109/TC.2022.3208203.
[37] E. Antolak and A. Pułka, "An analysis of the impact of gating techniques on the optimization of the energy dissipated in real-time systems," Appl. Sci., vol. 12, no. 3, p. 1630, Jan. 2022, doi: 10.3390/app12031630.
[38] E. Lee and D. Messerschmitt, "Pipeline interleaved programmable DSP's: Architecture," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 9, pp. 1320–1333, Sep. 1987, doi: 10.1109/TASSP.1987.1165274.
[39] G. C. Buttazzo, Hard Real-Time Computing Systems, vol. 24. Boston, MA, USA: Springer, 2011, doi: 10.1007/978-1-4614-0676-1.
[40] E. L. Lamie, Real-Time Embedded Multithreading: Using ThreadX and ARM. San Francisco, CA, USA: CMP Books, 2005.
[41] A. Diavastos and T. E. Carlson, "Efficient instruction scheduling using real-time load delay tracking," ACM Trans. Comput. Syst., vol. 40, nos. 1–4, pp. 1–21, Nov. 2022, doi: 10.1145/3548681.
[42] M. Zimmer, D. Broman, C. Shaver, and E. A. Lee, "FlexPRET: A processor platform for mixed-criticality systems," in Proc. IEEE 19th Real-Time Embedded Technol. Appl. Symp. (RTAS), Berlin, Germany, Apr. 2014, pp. 101–110, doi: 10.1109/RTAS.2014.6925994.
[43] I. Liu, J. Reineke, and E. A. Lee, "A PRET architecture supporting concurrent programs with composable timing properties," in Proc. Conf. Rec. Forty 4th Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, Nov. 2010, pp. 2111–2115, doi: 10.1109/ACSSC.2010.5757922.
[44] I. Liu, J. Reineke, D. Broman, M. Zimmer, and E. A. Lee, "A PRET microarchitecture implementation with repeatable timing and competitive performance," in Proc. IEEE 30th Int. Conf. Comput. Design (ICCD), Montreal, QC, Canada, Sep. 2012, pp. 87–93, doi: 10.1109/ICCD.2012.6378622.
[45] A. S. R. Oliveira, L. Almeida, and A. D. B. Ferrari, "The ARPA-MT embedded SMT processor and its RTOS hardware accelerator," IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 890–904, Mar. 2011, doi: 10.1109/TIE.2009.2028359.
[46] CHISEL Main Web Page. Accessed: Mar. 16, 2023. [Online]. Available: https://www.chisel-lang.org/community.html

ERNEST ANTOLAK received the M.Sc. degree in electronics and telecommunication engineering from the Silesian University of Technology, Gliwice, Poland, in 2018. He is currently pursuing the Ph.D. degree in project methodology of designing real-time systems. His research interests include real-time scheduling, designing safety-critical embedded systems, cyber-physical systems, systems on chips, and energy-efficient digital architectures.

ANDRZEJ PUŁKA (Senior Member, IEEE) received the M.Sc., Ph.D., and D.Sc. degrees in electronics from Silesian Technical University, Gliwice, Poland, in 1988, 1997, and 2013, respectively. Currently, he is the University Professor of the Silesian University of Technology, Gliwice, and the Deputy Head of the Department of Electronics, Electrical Engineering and Microelectronics. He is the author and coauthor of approximately 90 scientific papers, including journal articles, book chapters, and conference papers. His research interests include the automated design of digital and mixed signal circuits in FPGAs, the modeling and simulation of electronic embedded systems, VHDL, Verilog, SystemVerilog, SystemC, real-time systems (precision time machines, PRET), the design of energy efficient systems, power optimization in SoC, AI, and commonsense reasoning modeling, and the applications of FPGA platforms for hardware acceleration of complex computations in bioinformatics. He is a member of the Electronics Commission of the Polish Academy of Sciences.