Dynamic Partitioned Scheduling of Real-Time Tasks on ARM big.LITTLE
Article history: Received 13 July 2020; Received in revised form 25 October 2020; Accepted 9 December 2020; Available online 11 December 2020.

Keywords: Real-time scheduling; ARM big.LITTLE; Heterogeneous multicore processing; Energy-efficiency

Abstract

This paper presents the Big-LITTLE Constant Bandwidth Server (BL-CBS), a dynamic partitioning approach to schedule real-time task sets in an energy-efficient way on multi-core platforms based on the ARM big.LITTLE architecture. BL-CBS is designed as an on-line and adaptive scheduler, based on a push/pull architecture that is suitable to be incorporated in the current SCHED_DEADLINE code base in the Linux kernel. It employs a greedy heuristic to dynamically partition the real-time tasks among the big and LITTLE cores, aiming to minimize both the energy consumption and the migrations imposed on the running tasks. The new approach is validated through the open-source RTSim simulator, which has been extended by integrating an energy model of the ODROID-XU3 board, tightly fitting the power consumption profiles of the big and LITTLE cores of the board. An extensive set of simulations has been run with randomly generated real-time task sets, leading to promising results.

© 2020 Elsevier Inc. All rights reserved.
https://doi.org/10.1016/j.jss.2020.110886
A. Mascitti, T. Cucinotta, M. Marinoni et al. The Journal of Systems & Software 173 (2021) 110886
to experience unnecessary deadline misses, or forcing the platform into excessively high power consumption. More appropriate solutions are needed, designed in the realm of real-time systems (Balsini et al., 2019, 2016). An interesting feature of the latter kind, made recently available in the Linux kernel, is the SCHED_DEADLINE scheduler, a multi-processor variant of the well-known Constant Bandwidth Server (CBS) (Abeni and Buttazzo, 1998), employing a reservation-based scheduling strategy that can be conveniently configured to use either global, partitioned or clustered EDF scheduling underneath. However, the interesting energy-awareness features of the EAS framework cannot be exploited by SCHED_DEADLINE at the moment.

1.1. Paper contributions

In this paper, we propose a novel energy-aware scheduling strategy for soft real-time tasks running on big.LITTLE architectures, in the context of a complex embedded OS like Linux, where applications cannot be known beforehand and can be dynamically started and terminated. The proposed approach combines partitioned EDF scheduling (CBS, actually) with an on-line partitioning heuristic that is activated only at task wake-up (or creation) and suspension (or termination) times, performing a single task placement (or migration) action that: (1) ensures schedulability of the real-time tasks that have been admitted into the system; (2) achieves the lowest expected power consumption for the execution of the real-time tasks; (3) avoids unnecessary migrations that might degrade the tasks' performance.

This work extends a preliminary prior work of ours (Mascitti et al., 2020) along several directions: we clarify and refine several aspects of the proposed mechanism, identify the theoretical conditions under which schedulability is guaranteed, discuss key implementation details related to the performance of the mechanism, and present a much more comprehensive evaluation of the technique under various workload conditions. The performed simulations show that our approach is indeed promising, allowing 15% energy savings on average with respect to the current state of the SCHED_DEADLINE code base.

1.2. Paper organization

This paper is organized as follows: after a brief review of the related research in Section 2, key background concepts related to the adopted real-time task model and CBS scheduling are presented in Section 3; then, the computing platform we focus on and its energy model are described in Section 4, along with some accompanying notation that is adopted throughout the rest of the paper. The proposed BL-CBS technique is described in Section 5, along with the main factors driving its design. Section 6 introduces theoretical conditions on the real-time task sets that are schedulable under BL-CBS, i.e., that are guaranteed not to miss any deadlines. Then, after providing a few important details in Section 7 related to an efficient implementation within an in-kernel scheduler, the proposed BL-CBS technique is evaluated by simulation in Section 8. Finally, conclusions are drawn in Section 9, sketching out possible directions for future research on the topic.

2. Related work

Energy-efficient scheduling for real-time tasks has been widely investigated in the research literature, starting from some seminal works on uni-processor systems. In particular, real-time schedulability analysis has been combined with DVFS techniques to reduce the CPU frequency as much as possible without breaking guarantees. For example, the RT-DVS algorithms proposed by Pillai and Shin (2001) used an approach exploiting the task's unused computation time (with respect to its worst case) to decrease the working frequency dynamically. Aydin et al. (2004) used similar static scaling algorithms to minimize the speed while guaranteeing task deadlines, coupling them with dynamic reclaiming of unused computation time, both intra-task and inter-task. While these authors focused on solutions based on dynamic priorities and the Earliest Deadline First (EDF) scheduling policy, Saewong and Rajkumar (2003) presented similar algorithms focusing on fixed priorities. More dynamic approaches have been proposed, for example, by Zhu and Mueller (2004, 2007), who used a feedback mechanism to maximize energy savings for the average execution time, while still guaranteeing the timing constraints in case of worst-case execution times.

When considering multi-core CPUs, minimizing the consumed energy is not as simple as selecting the minimum frequency, and different strategies can be used. See Bambagini et al. (2016) for an overview of the solutions presented in the literature (in particular, its Section 7 considers uniform heterogeneous multi-processors or multi-cores, as defined in Funk, 2004). For example, the placement of tasks on the various cores (or the migration of tasks between cores) has a significant impact on the operating frequencies, and on some platforms (such as ARM big.LITTLE) different cores can be characterized by different power characteristics (and different sets of possible operating frequencies). Hence, frequency scaling, task placement, and migration actions must be strictly coordinated.

Multi-core real-time schedulers are generally classified as global schedulers or partitioned schedulers: while a global scheduler is free to migrate tasks between different cores according to the scheduling policy (and hence, conceptually, the scheduler uses one single global ready queue that contains all the tasks ready for execution), a partitioned scheduler does not migrate tasks between cores (tasks are statically assigned to cores by the system designer). As a consequence, when using partitioned scheduling, the problem of scheduling tasks on m CPUs is reduced to m scheduling problems on a single CPU, and single-processor DVFS algorithms can be re-used. Hence, the main challenge in partitioned scheduling is the task assignment, so that the DVFS mechanism can reduce the consumed energy as effectively as possible.

Semi-partitioned scheduling (Andersson and Tovar, 2006; Burns et al., 2012; Casini et al., 2017) represents a trade-off between global and partitioned scheduling, making it possible to schedule task sets that are not partitionable. It relies on splitting a real-time task into two parts with reduced demand that fit into two different cores and execute with a precedence constraint (one after the other), preserving schedulability. These techniques have also been applied to heterogeneous multi-cores, and used to reduce energy consumption (Liu et al., 2016). However, this power-saving technique relies on an off-line placement of the tasks (or parts of the tasks) on the various cores. Some authors (Casini et al., 2017) investigated the use of linear-time approximation methods to perform the splittings so that they can be performed on-line, but this technique does not take energy consumption into account.

The present work advocates a simpler utilization-based method where task splitting is not used and a ''restricted migrations'' approach (Baruah and Carpenter, 2003) is adopted: a task resides mostly on one core for each job, and it is normally migrated only across subsequent activations.² This is easier to compute in an OS kernel (see Section 5 for details). On the other hand, task splitting requires migrating tasks while they are running: when one split of the task has exhausted its assigned time on one CPU, the subsequent split continues execution on a different CPU.

² As it will become clear in Section 5, a task can also be migrated in the middle of a job to balance workload and bring down the island frequency.
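The reduction of partitioned scheduling to per-core tests can be illustrated with a small sketch (the task parameters are made up, and a first-fit heuristic is used for brevity; this is not the strategy proposed in this paper):

```python
# Sketch: partitioned EDF reduces scheduling on m CPUs to m
# single-CPU admission tests (sum of utilizations <= 1 per core).
# Task parameters (C, T) below are invented for illustration.

def admit_partitioned(tasks, m):
    """First-fit partitioning of (C, T) tasks onto m cores under EDF.
    Returns per-core task lists, or None if this heuristic finds no fit."""
    cores = [[] for _ in range(m)]
    loads = [0.0] * m
    for c, t in tasks:
        u = c / t  # task utilization
        for h in range(m):
            if loads[h] + u <= 1.0:   # single-CPU EDF test on core h
                cores[h].append((c, t))
                loads[h] += u
                break
        else:
            return None  # no core can host this task
    return cores

cores = admit_partitioned([(2, 10), (3, 5), (4, 8), (1, 4)], m=2)
```

Note that first-fit is only one of several bin-packing heuristics usable here; the per-core test stays the same regardless of the assignment policy.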
Notice that frequency scaling is not the only technique that can be used to reduce energy consumption. For example, Dynamic Power Management (DPM) allows for reducing the energy consumption by putting the CPU in a ''low-power'' (or sleep) state whenever it is idle. DPM is not an alternative to DVFS, but can be combined with it. For example, these techniques are jointly investigated in the context of real-time scheduling in Bambagini et al. (2013) and Moulik et al. (2019). Other works (Imes and Hoffmann, 2015) investigated the trade-offs between DVFS and DPM, studying whether it is better to execute at the highest possible frequency and then set cores to a sleep state for as long as possible (relying on DPM only), or to find the lowest frequency needed to make tasks respect their deadlines and never go idle (relying on DVFS only).

Several authors reduced the complexity of the scheduling problem by reducing the number of decisions taken at run-time (for example, by statically assigning tasks to cores or core islands, statically deciding the frequency at which each core or core island executes, or resorting to a static schedule computed off-line). This allows for taking some decisions by solving an optimization problem (Chwa et al., 2015; Thammawichai and Kerrigan, 2018; Qin et al., 2019a,b). Polynomial-time algorithms have been proposed (Liu et al., 2015) to divide real-time streaming applications across DVFS-capable islands, where tasks are statically assigned to an island and then globally scheduled inside it via an optimal scheduler. In Colin et al. (2014), it is found that the most efficient way to allocate real-time tasks and save energy is neither balancing the load nor choosing the most power-efficient core. They find off-line the optimal load distribution via integer linear programming (ILP) and try to approximate that result on-line via heuristics. Other authors (Thammawichai and Kerrigan, 2018) divide the scheduling problem into workload partitioning and next-task ordering. The first step determines what parts of the tasks should be executed at what frequency within a time interval such that feasibility constraints are satisfied, while the second part establishes how to order the pieces of tasks for each core. Also, an analysis of the task code structure, coupled with an ILP returning the minimum frequency and location to be used for each code segment, has been proposed (Qin et al., 2019a). In the same work, the authors tried to moderate the use of the LITTLE-Core-First principle, according to which one should always fill LITTLE cores while possible before selecting a big one. Using the task execution variance (i.e., the ratio between the WCET on LITTLE and on big for a given task), an ILP formulation is used to compute the optimal distribution of the utilization of the tasks between the islands and their minimum frequencies to respect the deadlines. A heuristic is then used to assign tasks and set the frequencies. This approach, however, is tied to ARM DynamIQ and makes use of per-core DVFS, which is not feasible for generic big.LITTLE platforms. Optimization methods have also been used in Nogues et al. (2016) for choosing the best frequency for each node of stream-processing applications represented as Synchronous Data Flows with end-to-end deadlines, making their parallelism explicit in the model.

While most of the above works focus on optimizing some decisions taken off-line, in this paper we deal with designing a strategy that can be applied on-line within an OS kernel scheduler. Hence, our algorithm copes with both on-line scheduling of tasks (considering both task migrations and possible task overruns, which we handle by using CBS servers) and dynamic frequency scaling.

Focusing on recent developments in the mainline Linux kernel, notable energy-aware features for big.LITTLE have been integrated within the EAS framework, which is mostly focused on the CFS scheduler for general-purpose workloads. On the other hand, for real-time tasks, the SCHED_DEADLINE policy has been recently enriched with on-line support for DVFS (Scordino et al., 2018, 2019), by integrating a variant of the GRUB-PA algorithm (Scordino and Lipari, 2004). However, the current implementation does not support non-symmetric multi-cores such as ARM big.LITTLE platforms. Also, other investigations of ours on the use of adaptive partitioning schedulers for multi-cores (Abeni and Cucinotta, 2020), albeit in a non-energy-aware context, highlighted that these techniques can be effective in scheduling real-time task sets, reducing deadline violations compared to the global EDF used by SCHED_DEADLINE. Following this research track, this paper focuses on an on-line scheduling algorithm that dynamically partitions real-time tasks, exploiting job-level migration. It is suitable to be implemented in the existing code base of the Linux kernel, in the SCHED_DEADLINE scheduling class, and works with incremental, dynamic task sets, which are not known a priori. Off-line approaches cannot be used in this context, since they are not suitable for dynamic workloads. However, off-line partitioning algorithms commonly give better results, since all the optimizations can be performed on a task set known a priori, leading to better solutions in terms of real-time guarantees and energy savings. Conversely, an on-line algorithm, like the one proposed in this paper, is expected to make swift decisions, aiming at achieving a good-enough performance in a reduced computation time.

Some authors deal with the problem of scheduling real-time DAG tasks (Guo et al., 2017, 2019; Li et al., 2019) and digraphs (Zahaf et al., 2019) on ARM big.LITTLE, evaluating the approach either through implementation on real hardware or by simulation. Our work is currently limited to the scheduling of independent real-time task sets, although extensions along those lines will be considered in future work.

Finally, an interesting approach is the one proposed in Balsini et al. (2016), where the authors highlight that power consumption may vary in a non-negligible way depending also on the type of workload being computed, supporting the argument with real data measured on an ODROID-XU3 platform. Possible integration of such a per-application power-consumption model might be an interesting area of future extension of the present work.

In this paper, the validation is carried out using RTSim (Palopoli et al., 2002, 2001; Scordino and Lipari, 2006), an open-source tool we have been evolving over time and have used in several previous research works. Although other real-time task simulators (Pillai and Isha, 2013; Thakare and Deshmukh, 2017; Cheramy et al., 2014) were available, RTSim was an easier choice for us, also because its modifications in Balsini et al. (2016), integrating a realistic energy consumption model for the ODROID-XU3 big.LITTLE platform, have been used as a starting point to develop the energy-aware adaptive partitioning technique presented in this paper.

3. Background

In this paper we consider a set of real-time tasks {τ_i} to be scheduled on a number of CPUs. A real-time task τ_i is characterized by a minimum inter-arrival period T_i, equal to its relative deadline (implicit-deadline case), and by the worst-case execution time (WCET), which will be discussed in depth in the following sections for the case of the ARM big.LITTLE architecture. τ_i generates a sequence of jobs J_{i,j} and, for a job arriving at time r_{i,j}, its finishing time is denoted by f_{i,j} > r_{i,j}. Task τ_i respects all of its deadlines if ∀j, f_{i,j} ≤ r_{i,j} + T_i. Generally, r_{i,j+1} ≥ r_{i,j} + T_i, but for a periodic real-time task we have r_{i,j+1} = r_{i,j} + T_i.

In this paper we consider soft real-time tasks, meaning that a job missing its deadline will not cause severe consequences, but rather will lessen the Quality of Service (QoS) of the system. For example, a multimedia application that needs to periodically
perform buffering, decoding and visualization of frames needs to complete each activation by a precise deadline. However, a relatively small percentage of deadline misses can be tolerated, with the user perceiving a degraded quality as said percentage grows.

As many real-time tasks may be concurrently active on the same CPU, they interfere with each other, jeopardizing their deadlines. Therefore, we make use of resource reservations to enforce temporal isolation among tasks. Each task τ_i is associated with its own reservation (Q_i, P_i), meaning that τ_i is guaranteed to be scheduled on the CPU for Q_i time units (a.k.a. budget) in every time interval of length P_i (a.k.a. reservation period).

In this paper, we use the Constant Bandwidth Server (CBS) (Abeni and Buttazzo, 1998) as the resource-reservation mechanism. In CBS, reservations are realized by means of the Earliest Deadline First (EDF) scheduler, which schedules tasks {τ_i} based on their scheduling deadlines {d_i}, assigned by the CBS algorithm. When a new job J_{i,j} arrives, the server checks whether the current deadline is sufficient to schedule it; otherwise, it assigns a new deadline equal to r_{i,j} + P_i. While the job executes, the budget of the associated CBS is decreased. If the job executes for more than Q_i time units, its scheduling deadline is postponed by P_i. Therefore, the job is prevented from executing for more than Q_i time units with the same scheduling deadline, and it is guaranteed a computation bandwidth of B_i = Q_i / P_i regardless of the behaviour of the other tasks. This ensures temporal isolation, preventing a misbehaving task from causing deadline misses on jobs of other tasks with farther-away deadlines. To guarantee the schedulability of each task, the following schedulability condition must hold:

    ∑_i B_i ≤ U_max,    (1)

with U_max = 1 in the case of EDF.

A CBS server may be associated with many tasks; however, in this paper, each task will have its own CBS server. For the reader's convenience, Table 1 summarizes the notation symbols used throughout this paper.

An energy-aware extension of the CBS is GRUB-PA (Greedy Reclamation of Unused Bandwidth - Power Aware), integrating the ability to reclaim processor capacity (bandwidth) that goes unused because some of the servers may have no jobs awaiting execution, and exploiting the hardware DVFS capabilities to reduce the cores' frequency. GRUB-PA has been implemented in the mainline Linux kernel for SCHED_DEADLINE CBS reservations, starting from version 4.13, released in September 2017. Its energy-related behaviour on multiprocessor platforms can be summarized as follows. When a new task instance J_{i,j} arrives, the first free core is selected, if available; otherwise, the core with the latest-deadline task is chosen, and the new task is dispatched onto it. Then, in the case of ARM big.LITTLE, the highest-utilization core of each island is used to determine the frequency (or the highest frequency is picked if the busiest core has utilization greater than 1.0). When a task in a server ends, leaving its core idle, the task with the closest deadline is pulled onto it, and the island frequencies are adjusted. The reader can find more details about GRUB-PA in Scordino and Lipari (2004).

4. Notation and energy model

4.1. Platform model

The processing platform under study is composed of two core islands, where each island s (s ∈ I ≜ {L, B}) has m_s identical cores that can (all together) be switched among a set of k_s possible operational performance points (OPPs) with different frequencies {f_{s,1}, ..., f_{s,k_s}}, ordered from the minimum to the maximum one. The per-core power consumption at frequency f_{s,j} is p_{s,j} when computing, or p^idle_{s,j} < p_{s,j} when staying idle, where ∀j₁, j₂ ∈ {1, ..., k_s}, j₁ < j₂ ⟹ p_{s,j₁} < p_{s,j₂} ∧ p^idle_{s,j₁} < p^idle_{s,j₂}. For the sake of simplicity, in this work we ignore the existence of multiple deep-sleep idle modes of the CPU(s) with different associated power consumptions, postponing their proper integration into our technique to future work.

We define the maximum speed x_s ≤ 1 of a core of an island s as the ratio between the processing time C of a task deployed on a core of the big island at maximum frequency f_{B,k_B} and its processing time C_{s,k_s} when deployed on a core of island s at its maximum frequency f_{s,k_s}: x_s = C / C_{s,k_s}.

A core of island s running at OPP j < k_s has a reduced speed x_{s,j} < x_s, where a common assumption is that x_{s,j} = x_s · f_{s,j} / f_{s,k_s}, albeit the approach of this paper relies on arbitrary processing speed factors {x_{s,j}} (monotonically increasing with j), being inspired by the capacity matrix available in the device tree of the Linux kernel supporting EAS.

4.2. Task model

Each task τ_i is assumed to be a periodic task with known period T_i, or equivalently a sporadic task with minimum inter-arrival time T_i between two subsequent activations, and with a known nominal WCET C_i, i.e., the WCET of the task when running on a big core at maximum frequency. Similarly, the nominal utilization U_i is defined as U_i ≜ C_i / T_i. Whenever the task is running on a core of an island s at an OPP j, its timing is characterized by the scaled WCET C̃_{i,s,j} and scaled utilization Ũ_{i,s,j}, defined as:

    C̃_{i,s,j} = C_i / x_{s,j} = (C_i / x_s) · (f_{s,k_s} / f_{s,j});    Ũ_{i,s,j} ≜ C̃_{i,s,j} / T_i ≡ U_i / x_{s,j}.    (2)

Our task model comprises a few common task states: a task arrives, or activates, becoming ready to run, periodically (or with a minimum periodicity), generating a sequence of jobs. When the task is selected for execution on a CPU, it becomes running, and its current job starts being executed. While running, a task may be preempted by another task that has just arrived (or migrated from another CPU) with an earlier deadline, causing the former task to go back to the ready state. Each job completes with the task suspending, waiting for the next activation, when the task will wake up and become ready to run again; its next job will then start executing when it is scheduled.

This paper focuses on approaches where the set of scheduled tasks Γ = {τ_1, ..., τ_n} can be partitioned among the available cores, so that, at any time, each core h of an island s hosts a subset of tasks Γ_{s,h} ⊆ Γ meeting some schedulability condition. For example, if EDF is used, the well-known EDF schedulability condition has to be met on the scaled utilizations, considering the OPP j at which the island is configured:

    ∑_{i∈Γ_{s,h}} Ũ_{i,s,j} ≤ 1  ⟺  ∑_{i∈Γ_{s,h}} U_i ≤ x_{s,j},    ∀h ∈ {1..m_s}.    (3)
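Eqs. (2) and (3) map directly onto a small per-core admission test (a sketch; the speed value used below is an illustrative placeholder, not an ODROID-XU3 figure):

```python
# Sketch of Eqs. (2)-(3): scale nominal utilizations by the island
# speed x_{s,j} at the chosen OPP and apply the EDF test per core.
# The speed value below is an invented placeholder.

def scaled_utilization(U_i, x_sj):
    # Eq. (2): scaled utilization is U_i / x_{s,j}
    return U_i / x_sj

def core_schedulable(nominal_utils, x_sj):
    # Eq. (3): sum of scaled utilizations <= 1, equivalently sum U_i <= x_{s,j}
    return sum(scaled_utilization(u, x_sj) for u in nominal_utils) <= 1.0

# A LITTLE core with hypothetical speed x_{L,j} = 0.35 at some OPP j:
print(core_schedulable([0.10, 0.20], x_sj=0.35))  # True  (0.30 <= 0.35)
print(core_schedulable([0.10, 0.30], x_sj=0.35))  # False (0.40 >  0.35)
```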
Table 1
Summary of the symbols used in the paper. Some terms are introduced in the following sections.

Symbol: Meaning
τ_i: i-th task (i = 1, ..., n)
Γ: Set of tasks Γ = {τ_i}, i = 1, ..., n
C_i: Nominal worst-case execution time (WCET) of τ_i
T_i: Period of the instances of τ_i
d_i: Scheduling deadline of τ_i
U_i = C_i / T_i: Nominal utilization of task τ_i with WCET C_i and period T_i
k_s: Number of OPPs for island s
f_{s,j}: CPU frequency of island s when running at OPP j (j = 1, ..., k_s)
p_{s,j}: Per-core power consumption at frequency f_{s,j}
x_{s,j}: Speed of island s at OPP j
x_s: Maximum speed each core of island s can reach
Ũ_{i,s,j}: Scaled utilization of τ_i on island s with OPP j
C̃_{i,s,j}: Scaled WCET of τ_i on island s with OPP j
E_{i,s,j}: Energy consumption of τ_i over its period T_i on island s with OPP j
E_{Γ,s,j}: Energy consumption over the hyperperiod of the tasks in Γ on island s with OPP j
P_{Γ,s,j}: Power consumption over the hyperperiod of the tasks in Γ on island s with OPP j
Ũ_{s,j_s}: Scaled utilization of the tasks on all cores of island s running at OPP j_s
m_s: Number of cores in island s
Γ_{s,h}: Set of tasks on core h of island s
V_{s,h} = ∑_{i∈Γ_{s,h}} U_i: Overall nominal utilization of the tasks on core h of island s
V_s = ∑_{h=1}^{m_s} V_{s,h}: Overall nominal utilization of the tasks on all cores of island s
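The CBS rules recalled in Section 3 (one task per server) can be sketched as follows; the variable names and the simplified interface are ours, not those of an actual kernel implementation:

```python
# Sketch of the CBS rules: on job arrival, the current deadline is kept
# only if the residual budget can be served by it at bandwidth Q/P;
# otherwise, a fresh deadline r + P and a full budget are assigned.
# On budget exhaustion, the deadline is postponed by P and the budget
# recharged, which bounds execution per deadline to Q time units.

class CBS:
    def __init__(self, Q, P):
        self.Q, self.P = Q, P      # budget and reservation period
        self.q, self.d = 0, 0      # current budget and scheduling deadline

    def job_arrival(self, r):
        # Standard CBS test: if q >= (d - r) * (Q / P), the current
        # deadline cannot serve the residual budget -> reset.
        if self.q >= (self.d - r) * (self.Q / self.P):
            self.d = r + self.P
            self.q = self.Q

    def executed(self, delta):
        # Budget is consumed while the job runs; on exhaustion the
        # deadline is postponed and the budget recharged.
        self.q -= delta
        if self.q <= 0:
            self.d += self.P
            self.q += self.Q

s = CBS(Q=2, P=10)
s.job_arrival(r=0)   # fresh job at t = 0: d = 10, q = 2
s.executed(3)        # overrun by 1 unit: d postponed to 20, q = 1
```

This is exactly the temporal-isolation mechanism of Section 3: the overrunning job keeps running only under a later deadline, so it cannot steal bandwidth from other servers.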
A task τ_i deployed alone on a core of an island s at OPP j keeps the core busy for a time C̃_{i,s,j} and idle for a time T_i − C̃_{i,s,j} in each time window with a duration of its period T_i, resulting in an overall energy consumption over its period equal to:

    E_{i,s,j} = p_{s,j} C̃_{i,s,j} + p^idle_{s,j} (T_i − C̃_{i,s,j}).

Similarly, for a schedulable set of tasks Γ deployed on a core of island s at OPP j, we can compute the overall energy consumption over the hyperperiod H_Γ ≜ LCM_i{T_i | τ_i ∈ Γ}, considering that each task τ_i will have H_Γ / T_i instances over a time duration of H_Γ (with reference to a schedule with null initial offsets):

    E_{Γ,s,j} = p_{s,j} ∑_{i∈Γ} (H_Γ / T_i) C̃_{i,s,j} + p^idle_{s,j} (H_Γ − ∑_{i∈Γ} (H_Γ / T_i) C̃_{i,s,j}).

For any practical calculation, it is convenient to divide the above equation by H_Γ, obtaining the average power consumption P_{Γ,s,j} (over the hyperperiod) of a schedulable task set Γ deployed on a core of an island s at OPP j:

    P_{Γ,s,j} ≜ E_{Γ,s,j} / H_Γ = p_{s,j} ∑_{i∈Γ} Ũ_{i,s,j} + p^idle_{s,j} (1 − ∑_{i∈Γ} Ũ_{i,s,j}).

Considering a system where each island s is running at OPP j_s and each core h on island s is hosting a task set Γ_{s,h}, the overall average power consumption of the system is defined as:

    P ≜ ∑_{s∈I} ∑_{h=1}^{m_s} P_{Γ_{s,h},s,j_s}
      = ∑_{s∈I} ∑_{h=1}^{m_s} [ p_{s,j_s} ∑_{τ_i∈Γ_{s,h}} Ũ_{i,s,j_s} + p^idle_{s,j_s} (1 − ∑_{τ_i∈Γ_{s,h}} Ũ_{i,s,j_s}) ]
      = ∑_{s∈I} [ p_{s,j_s} Ũ_{s,j_s} + p^idle_{s,j_s} (m_s − Ũ_{s,j_s}) ],    (4)

where Ũ_{s,j_s} ≜ ∑_{h=1}^{m_s} ∑_{τ_i∈Γ_{s,h}} Ũ_{i,s,j_s} is the overall scaled utilization of the tasks hosted on all cores of island s when running at OPP j_s.

The metric P defined in Eq. (4) is the main driver for our proposed task placement algorithm, which is presented next.

In the proposed approach, real-time tasks are dynamically partitioned among the available cores, and they are scheduled on the assigned cores using the CBS scheduling policy, based on EDF. In the rest of the paper, a common practice is applied that assigns a single task to each CBS server, setting the server budget equal to the task WCET C_i and the server period equal to the task minimum inter-arrival time T_i. Migrations of tasks among CPUs can dynamically occur at job level, whenever a task suspends (i.e., its current job ends) and its active utilization expires, or a task wakes up (i.e., a new job begins).

In order to decide the CPU on which to place a new task that becomes ready to run, and at what CPU frequency, we propose a greedy algorithm aiming at minimizing the average power consumption P as defined in Eq. (4). The heuristic is based on choosing, on a task wake-up (bringing it back into the scheduler queue of ready tasks), the core placement decision that causes the minimum increase of P, among all the possible moves that keep the schedulability of all the tasks. Also, whenever the active utilization of a task expires on a core, the frequency of the corresponding island is lowered to the minimum one that keeps the schedulability of all the tasks on the island.

Whenever multiple choices are available that bring the same difference in the average power consumption, we give preference to spreading and balancing the workload across the available cores, adopting a worst-fit strategy.

In the following, we start from important observations on the power consumption of a single task in Section 5.1, a set of tasks across single-core islands in Section 5.2, and multi-core islands in Section 5.3. Then, the overall placement algorithm is presented in Section 5.4. Additional implementation details and observations are discussed in Section 7, including the discussion of possible efficiency issues and the computational complexity of the proposed algorithm.

5.1. Single-task placement

Proposition 1. A set of tasks Γ_{s,h} can be hosted on any core of an island s at a given OPP j only if their overall nominal utilization does not exceed a maximum value U^max_{s,j} equal to the speed x_{s,j} corresponding to the OPP j.
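The greedy wake-up placement described in this section can be sketched in a simplified form that collapses each island to its aggregated utilization (all power and speed figures below are invented placeholders; the actual BL-CBS also applies per-core schedulability tests and worst-fit tie-breaking across cores):

```python
# Sketch of the greedy heuristic: on task wake-up, evaluate each island
# (at the minimum OPP keeping it schedulable) and pick the placement
# causing the smallest increase of the average power P of Eq. (4).
# All numbers are hypothetical, not measured ODROID-XU3 values.

def island_power(V, opps, m):
    """Minimum average power of an m-core island hosting aggregated
    nominal utilization V; opps = [(speed x, p_busy, p_idle), ...],
    sorted by increasing speed. Returns None if V does not fit."""
    for x, p, p_idle in opps:
        if V <= m * x:  # capacity condition at this OPP (cf. Proposition 1)
            # per-island form of Eq. (4): m*p_idle + (p - p_idle) * V / x
            return m * p_idle + (p - p_idle) * V / x
    return None

def place(u_new, islands):
    """islands: {name: (V_current, opps, m)}; returns the island whose
    power increase when adding u_new is smallest, or None if no fit."""
    best = None
    for name, (V, opps, m) in islands.items():
        after = island_power(V + u_new, opps, m)
        if after is None:
            continue  # the new task does not fit on this island
        delta = after - island_power(V, opps, m)
        if best is None or delta < best[1]:
            best = (name, delta)
    return best[0] if best else None

islands = {
    "LITTLE": (0.20, [(0.20, 0.15, 0.05), (0.35, 0.25, 0.08)], 1),
    "big":    (0.71, [(0.80, 1.20, 0.30), (1.00, 1.80, 0.45)], 1),
}
print(place(0.02, islands))  # -> big
```

With these placeholder numbers, the sketch reproduces the qualitative effect of the τ_3 example discussed next: a tiny task is cheaper on the already-loaded big core than on a LITTLE core that would need a frequency bump.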
Fig. 1. Contribution to the average power consumption (on the Y axis) due to a single core as a function of the aggregated nominal utilization of the hosted tasks
(on the X axis), for LITTLE cores (left plot) or big ones (right plot).
Proof. This follows easily from the schedulability condition in Eq. (3) and the scaled WCET definition in Eq. (2):

    ∑_{i∈Γ_{s,h}} Ũ_{i,s,j} ≤ 1  ⟺  ∑_{i∈Γ_{s,h}} (C_i / x_{s,j}) / T_i ≤ 1  ⟺  ∑_{i∈Γ_{s,h}} C_i / T_i ≡ ∑_{i∈Γ_{s,h}} U_i ≤ x_{s,j} ≜ U^max_{s,j}.  □

Note that the x_{s,j} ≡ U^max_{s,j} are monotonically increasing with the OPP j.

Highlighting the role of the nominal utilizations in Eq. (4), we have:

    P = ∑_{s∈I} [ p_{s,j_s} Ũ_{s,j_s} + p^idle_{s,j_s} (m_s − Ũ_{s,j_s}) ]
      = ∑_{s∈I} [ m_s p^idle_{s,j_s} + (p_{s,j_s} − p^idle_{s,j_s}) V_s / x_{s,j_s} ],    (5)

where we introduced the aggregated nominal utilization V_s of the tasks ∪_{h=1}^{m_s} Γ_{s,h} in all cores of an island s: V_s ≜ ∑_{h=1}^{m_s} ∑_{τ_i∈Γ_{s,h}} U_i. Similarly, we use V_{s,h} to refer to the aggregated nominal utilization of the tasks Γ_{s,h} currently on core h: V_{s,h} ≜ ∑_{τ_i∈Γ_{s,h}} U_i.

Eq. (5) states that the power consumption of an island running at a given OPP is linearly dependent on the overall nominal utilization deployed across the cores of the island. This is highlighted in Fig. 1, which reports, for each available OPP (different curves) of either the big or the LITTLE island (different plots) of an ODROID-XU3 board, the contribution to the average power consumption of each core (Y axis) as a function of the aggregated nominal utilization hosted on the core (X axis). Due to Proposition 1, the maximum nominal utilization that can be hosted on a single core of island s at OPP j cannot exceed the associated speed x_{s,j}; thus, each line in Eq. (5) is only displayed in the range [0, x_{s,j}] of the X axis.

Looking at the power curves in Fig. 1, it is clear that, given a single real-time task with nominal utilization U_i, the minimum average power consumption on each island is attained by using the minimum OPP with speed x_{s,j} ≥ U_i. By taking such a choice for each possible U_i, we obtain the thick power-utilization curves labelled as ''Lowest power''.

However, optimum placement decisions need to take into account the multitude of tasks and cores in the platform, as discussed next.

However, our real system has a number of real-time tasks. To highlight how the problem becomes more complex, consider a simple conceptual example. Imagine a system with just one core in the big island and one core in the LITTLE island; we have already placed one lightweight task τ_1 on the LITTLE core, with a nominal utilization of U_1 = 0.2296, which is just a tiny bit below a relatively big power consumption jump for LITTLE cores, as shown in Fig. 1(a); also, we have already placed a heavyweight task τ_2, with nominal utilization above the LITTLE maximum speed of x_L = 0.345328, for example U_2 = 0.71, as shown in Fig. 1(b). Clearly, both cores are running at the minimum frequency guaranteeing schedulability, i.e., the big core at 1400 MHz and the LITTLE one at 800 MHz.

Under said conditions, if a very lightweight task τ_3 with a small U_3 arrives, placing it on the LITTLE core is not optimal. Indeed, to preserve schedulability, such a placement would force bumping the LITTLE core frequency up one step, adding to the power consumption P a fixed term ∆P = 0.0372 W, plus an additional increment per nominal utilization unit of 0.355 W. On the other hand, the big core is in a state in which it would be able to host τ_3 without any frequency change, causing an increase in P of just a (steeper) increment per nominal utilization unit of 0.855 W. Therefore, for sufficiently small U_3 values, the most energy-efficient action is to place τ_3 on the big core, alongside τ_2. Fig. 2 shows the total increase in the average power consumption obtained by placing τ_3 on either the big or the LITTLE core, as a function of the new task utilization U_3. As visible, it is more convenient to place τ_3 on the big core if U_3 ≲ 0.034, while the LITTLE core is a better choice for 0.034 ≲ U_3 ≲ 0.116. Beyond this last utilization, the task cannot fit on the LITTLE core and the only choice is an assignment on the big one.

The above reasoning introduces the motivations behind our scheduling strategy design, based on two points. First, every time a task enters the ready queue (i.e., at task creation or wake-up time), it is placed on the core causing the minimum possible increase of the overall average power consumption P as defined in Eq. (5), also considering the potential need for increasing the operating frequency of the destination island in order to preserve schedulability. Second, every time a task exits the ready queue (i.e., it goes to sleep or terminates), a pull operation is scheduled at the task active utilization expiry time (the task virtual time if other tasks are ready on the CPU, or the current time if
the CPU remains idle — this ensures that schedulability is pre-
5.2. Multi-task placement served, Scordino et al., 2019), which migrates a task from another
CPU determining the move with the maximum possible decrease
The minimum among the thick curves in the plots in Fig. 1 in P, also considering the potential need for switching to a higher
allows us to decide, given a single task with a utilization Ui , where OPP for the destination island, and possibly the opportunity to
to place and schedule it in the most energy-efficient way, so that switch the big island to a lower OPP, in case we can pull from
no deadlines can be missed. there.
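As a concrete illustration of Eq. (5), the per-island contribution to P can be computed from the per-OPP power table alone. The following C++ sketch (C++ being the language of RTSim) uses struct, function names and numeric values of our own choosing for illustration; they are not the actual ODROID-XU3 tables:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One OPP of an island, as in the power model of Section 5.1:
// speed x_{s,j}, busy power p_{s,j} and idle power p^idle_{s,j}.
// All values here are hypothetical placeholders.
struct Opp {
    double speed;  // x_{s,j}: maximum nominal utilization per core
    double busyW;  // p_{s,j}: per-core power when computing [W]
    double idleW;  // p^idle_{s,j}: per-core power when idle [W]
};

// Average power of an island with 'cores' cores at OPP j, hosting an
// aggregated nominal utilization V, as in Eq. (5):
//   P_{s,j}(V) = m_s * p^idle_{s,j} + (p_{s,j} - p^idle_{s,j}) * V / x_{s,j}
double islandPower(const std::vector<Opp>& opps, std::size_t j,
                   int cores, double V) {
    const Opp& o = opps[j];
    return cores * o.idleW + (o.busyW - o.idleW) * V / o.speed;
}
```

For instance, a 4-core island whose current OPP has x_{s,j} = 0.5, p_{s,j} = 1.0 W and p^idle_{s,j} = 0.2 W consumes 4 · 0.2 W when fully idle, and grows linearly with the hosted utilization up to the per-island limit, as discussed in Section 5.3.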
Fig. 2. Increase in average power consumption (on the Y axis) due to placing a new task on the big vs LITTLE core (different curves) as a function of its utilization U_3 (on the X axis), in a sample scenario.

5.3. Placement for multi-core islands

Focus on an island s with m_s cores. First, we observe that, as highlighted in Eq. (5), a multi-core island s has a power consumption that depends on the OPP j_s currently being used, and the aggregated nominal utilization V_s of all cores in the island. Therefore, while a single-core contribution to the overall power consumption P is well captured by a point on one of the segments in Fig. 1, the contribution of a whole island to P is captured by a point that stays on the same curves, but likely residing on the extension of the segment beyond the maximum speed x_{s,j} of the OPP j_s, up to a maximum per-island utilization V_s of m_s x_{s,j} (e.g., 4 times as much, in the case of our ODROID-XU3 platform with 4 cores per island). The average power consumption curve, for each island s and OPP j, as a function of the overall nominal utilization hosted on the island, will be denoted by P_{s,j}(·).

Whenever a task τ_i with nominal utilization U_i becomes ready, we need to compare what increase in the power consumption P would arise from hosting the task on each of the islands.

On an island s, τ_i can easily be hosted at the same frequency if there is at least a core h with nominal utilization that, increased by U_i, does not exceed the CPU speed x_{s,j_s} due to the current island frequency j_s (see Proposition 1). However, whenever more than one core satisfies this requirement, we propose to adopt a worst-fit (WF) placement strategy, that tries to fit the task into the least-loaded core, i.e., the one with the minimum overall nominal (and scaled) utilization among the hosted tasks. This choice is motivated by the need to try to keep each island OPP at a value that is as small as possible, and this is achieved by spreading the workload as evenly as possible across the cores of each island. Note that such a placement is also the one that minimizes the variance in the nominal utilization among cores in the same island. However, any placement over any other core in such a way that the total achieved nominal utilization on the core does not exceed x_{s,j_s} brings the same identical increase in power consumption, so these are all equivalent solutions, including widely known alternatives like: first-fit (FF), choosing the first core with enough free utilization; or best-fit (BF), choosing the core with the smallest free utilization greater than, or equal to, the one of the task to be placed.

If we cannot place the task preserving the current island OPP j_s, we need to factor the increase in power consumption P due to each possible OPP switch. Here, our WF-based choice allows us to search for the minimum OPP j*_s with an associated speed greater than or equal to the sum of the nominal utilization U_{h′_s} of the least-loaded core h′_s, plus the one of τ_i:

j*_s = min_{j > j_s ∧ j ≤ k_s} { j | U_{h′_s} + U_i ≤ x_{s,j} }.   (6)

Putting together the building blocks from the above sections, we obtain the overall algorithm that is summarized in Algorithm 1, to be run every time a task enters the ready queue (task creation time or new job arrival). The procedure finds the core where the task fits according to Eq. (3) and for which ΔP_s is minimum. Notice that this pseudo-code is not efficient and it is presented this way for the sake of clarity, while its efficient implementation is discussed in Section 7.

Algorithm 1 Placement of a new task τ_i.
1:  // Return minimum OPP j of island s needed for per-core utilizations {U_i}
2:  procedure minOPP(island s, utilizations {U_i})
3:      return min { j ≤ k_s | x_{s,j} ≥ max_i {U_i} }
4:  end procedure
5:
6:  // Return island and core where to place τ_i with nominal utilization U_i
7:  procedure Place(utilization U_i)
8:      for each island s ∈ I do
9:          choose h*_s | V_{s,h*_s} = min_h {V_{s,h}}
10:         if V_{s,h*_s} + U_i ≤ x_s then
11:             set W := {V_{s,h}}_h where V_{s,h*_s} is increased by U_i
12:             set j* := minOPP(s, W)
13:             set ΔP_s := P_{s,j*}(V_s + U_i) − P_{s,j_s}(V_s)
14:         else
15:             set ΔP_s := +∞
16:         end if
17:     end for
18:     choose s* | ΔP_{s*} = min_{s∈I} {ΔP_s}
19:     return (s*, h*_{s*})
20: end procedure

When a task running on a core c_empty goes to sleep (or terminates) and its virtual time expires, a pull operation is needed. We distinguish pull operations into (i) pull of a task from another core of the same island; and (ii) pull of a task from the big to the LITTLE island. At the moment, a pull operation is attempted only if c_empty is left idle. The possibility to pull tasks also if c_empty is not idle is left as future work. Running tasks are not pulled, to avoid compromising their real-time guarantees, since pulling a task in the middle of a job may imply a non-negligible increase of its execution time, especially for a migration across islands.

In our proposed strategy, summarized in Algorithm 2, if c_empty is a LITTLE core, then we try first to pull a task from the big island. If this is not possible, or c_empty is a big core, then we try to pull from the same island (Lines 2–9). Assume by now that U_i^infl ≡ U_i; this will be clarified at the end of this section.

While pulling a task from the big island to the LITTLE core c_empty (Lines 10–29), we reduce as much as possible the big island utilization, while trying to leave the overall load across the cores balanced. To this end, we pull the ready task τ_i with the highest utilization U_i from the busiest big core c_max that fits inside the target CPU c_empty and that produces energy saving.
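Algorithm 1 can be transcribed almost literally into C++. The sketch below is our own simplified rendition, not the RTSim implementation: the Island struct is illustrative, minOPP takes the maximum per-core utilization directly instead of the vector W, and the linear scans mirror the deliberately inefficient pseudo-code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Simplified island model (hypothetical fields): per-OPP speeds x_{s,j}
// (increasing with j), per-OPP busy/idle powers, current OPP j_s and
// per-core aggregated utilizations V_{s,h}.
struct Island {
    std::vector<double> speed;  // x_{s,j}
    std::vector<double> busyW;  // p_{s,j}
    std::vector<double> idleW;  // p^idle_{s,j}
    std::size_t curOpp;         // j_s
    std::vector<double> coreU;  // V_{s,h}
};

// minOPP of Algorithm 1: smallest j whose speed fits the largest
// per-core utilization (speed.size() means "no OPP fits").
std::size_t minOpp(const Island& s, double maxCoreU) {
    for (std::size_t j = 0; j < s.speed.size(); ++j)
        if (s.speed[j] >= maxCoreU) return j;
    return s.speed.size();
}

// P_{s,j}(V) as in Eq. (5).
double power(const Island& s, std::size_t j, double V) {
    return s.coreU.size() * s.idleW[j] +
           (s.busyW[j] - s.idleW[j]) * V / s.speed[j];
}

// Place() of Algorithm 1: returns (island index, core index) minimizing
// the power increase ΔP_s, or (-1, -1) if the task fits nowhere.
std::pair<int, int> place(const std::vector<Island>& islands, double Ui) {
    double bestDp = std::numeric_limits<double>::infinity();
    std::pair<int, int> best{-1, -1};
    for (std::size_t s = 0; s < islands.size(); ++s) {
        const Island& isl = islands[s];
        // Worst-fit: the candidate core is the least-loaded one.
        std::size_t h = 0;
        double maxU = 0.0, V = 0.0;
        for (std::size_t c = 0; c < isl.coreU.size(); ++c) {
            V += isl.coreU[c];
            if (isl.coreU[c] < isl.coreU[h]) h = c;
            if (isl.coreU[c] > maxU) maxU = isl.coreU[c];
        }
        double newCoreU = isl.coreU[h] + Ui;
        if (newCoreU > isl.speed.back()) continue;  // no fit at max OPP
        std::size_t j = minOpp(isl, std::max(maxU, newCoreU));
        double dP = power(isl, j, V + Ui) - power(isl, isl.curOpp, V);
        if (dP < bestDp) { bestDp = dP; best = {int(s), int(h)}; }
    }
    return best;
}
```

Note how the worst-fit candidate is selected before evaluating ΔP_s, exactly as in Lines 9–13 of the pseudo-code, so that a possible OPP raise is charged to the placement decision.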
In case there is no big task satisfying the just-mentioned condition, or c_empty is a big core, then we pull from the busiest core of the same island as c_empty (Lines 30–49). This is done aiming at balancing the overall nominal utilization hosted on the source core (whose utilization decreases) and the destination one (whose utilization increases) of the same island, in order to maximize the possible OPP reduction. Therefore, as highlighted in Line 38, Algorithm 2 tries to pull the biggest task with U_i < U_{c_max}/2 in this case.

Note that, at any time t when we evaluate a pull operation, it is necessary to consider that the task being pulled might have already partially executed on the source CPU, so it has a leftover (nominal) WCET c_i ≤ C_i that has to be scheduled on the destination CPU within the scheduling deadline d_i > t already in place for the task. Therefore, it is safe to migrate the task only if its inflated utilization fits on c_empty, which also impacts on the minimum OPP needed on c_empty in order to perform the migration correctly:

U_i^infl ≜ c_i / (d_i − t) ≥ C_i / T_i.   (8)

Algorithm 2 Pull algorithm onto a core c_empty that was left idle.
1:  // B denotes big cores, L denotes LITTLE ones.
2:  // Find a task to pull onto the empty core c_empty from another core
3:  procedure Pull(c_empty)
4:      set result := pullFromBig(c_empty)
5:      if result is NULL then
6:          set result := pullFromSameIsland(c_empty)
7:      end if
8:      return result
9:  end procedure
10: // Return task to be migrated onto c_empty and big core where to pull it from, or NULL
11: procedure pullFromBig(c_empty)
12:     if c_empty ∈ L then
13:         choose c_max | U_{c_max} = max_{c∈B} {U_c}
14:         set R := list of ready tasks τ_i in c_max, sorted by decreasing utilization U_i
15:         for each τ_i ∈ R do
16:             if U_i^infl ≤ x_L then
17:                 set W_L := {V_{L,h}} where V_{L,c_empty} is set to U_i^infl
18:                 set j*_L := minOPP(L, W_L)
19:                 set W_B := {V_{B,h}} where V_{B,c_max} is decreased by U_i
20:                 set j*_B := minOPP(B, W_B)
21:                 set ΔP_s := P_{B,j*_B}(V_B − U_i) + P_{L,j*_L}(V_L + U_i) − (P_{B,j_B}(V_B) + P_{L,j_L}(V_L))
22:                 if ΔP_s < 0 then
23:                     return (τ_i, c_max)
24:                 end if
25:             end if
26:         end for
27:     end if
28:     return NULL
29: end procedure
30: // Return task to be migrated onto c_empty and same-island core where to pull it from
31: procedure pullFromSameIsland(c_empty)
32:     set s := island of core c_empty
33:     set S := cores in island s
34:     choose c_max | U_{c_max} = max_{c∈S} {U_c}
35:     if c_max == c_empty then
36:         return NULL
37:     end if
38:     set R := list of ready tasks τ_i in c_max with U_i < U_{c_max}/2, sorted by decreasing U_i
39:     for each τ_i ∈ R do
40:         if U_i^infl ≤ x_L then
41:             set W := {V_{s,h}} where V_{s,c_empty} = U_i^infl and V_{s,c_max} is decreased by U_i
42:             set j* := minOPP(s, W)
43:             if j* < j_s then
44:                 return (τ_i, c_max)
45:             end if
46:         end if
47:     end for
48:     return NULL
49: end procedure

This is why we need to use U_i^infl instead of U_i in the pseudo-code following Lines 16 and 40, to determine whether pulling τ_i from another core is safe. Estimating whether pulling a task is energetically convenient, either from the same island or from the big one, is done by finding the new speed of both islands after the pull and seeing if there is a reduction of energy consumption in the whole system. In these power consumption calculations, the real task utilization is used (as argument to P_{s,j}(·)), but the actual OPP to be used on the island of c_empty needs to account for the inflated nominal utilization.

Additional notes about the above algorithm are reported in Section 7, along with some implementation details.

6. Schedulable task sets

In this section, we present some of the theoretical conditions under which a set of real-time tasks scheduled according to the technique presented in Section 5 is expected to be schedulable, namely to never miss a deadline. We anticipate that our set of identified conditions is sufficient, not necessary, for schedulability of the task sets. This allows us to have quick and easy-to-compute admission tests to understand whether a new real-time task can be safely admitted into the system, guaranteeing timeliness of execution for all the admitted tasks.

First, we need a preliminary result that can be formulated with reference to a single island s of identical cores. Also, we start enumerating these conditions under simplifying assumptions:

A1 the operation of frequency switch for a given island takes a negligible time;

A2 the operation of context switch among tasks and with the idle task takes a negligible time on any of the cores;

A3 tasks are only pushed according to the rule in Algorithm 1; they are never pulled while running or ready to run.

At the end, we will discuss how to properly relax these assumptions without breaking the schedulability properties. In the following theorems and propositions, notice that the taskset Γ is ordered in decreasing nominal utilization order and, for instance, the highest-utilization task has utilization U_1 ≡ U^max, while the second highest-utilization task has utilization U_2 ≡ U^max2 ≤ U_1.

Proposition 2. A set of n tasks Γ = {1, . . . , n}, ordered from the maximum to the minimum nominal utilization, is schedulable by the technique presented in Section 5 on a platform having a multi-core island s with m_s cores if:

n ≤ m_s ⌊ x_s / U^max ⌋   (9)

where U^max ≜ max_{τ_i∈Γ} {U_i} is the maximum nominal utilization among all the tasks in Γ, and x_s is the speed of a core at the maximum island OPP k_s.

Proof. Whenever evaluating where to place a task, our proposed heuristic applies a worst-fit strategy, where each placement takes place on one of the least-loaded cores h*_s of the island (see Algorithm 1). If the current frequency does not give h*_s a sufficient computational capacity, we are ready to raise the island frequency to the minimum OPP such that the new task fits (its nominal utilization, plus the one of the core, do not exceed the maximum limit due to the OPP as from Proposition 1). Thanks to the assumptions A1 and A2, this can all be done in zero time, so a task is certainly schedulable if its utilization fits within the least-loaded core when bumped up at the maximum island OPP k_s. As the maximum utilization that can be safely hosted on any core at max OPP is x_s, our proposition proof is reduced to proving that this is always the case under the assumptions in Eq. (9).
This is trivially done by contradiction: assume we have a task τ_i that does not fit on any of the m_s cores of the island. This means that ∀h ∈ [1, . . . , m_s], U_{s,h} + U_i > x_s. However, as each task has a maximum utilization of U^max, then x_s < ∑_{j∈Γ_{s,h}} U_j + U_i ≤ U^max (|Γ_{s,h}| + 1), therefore x_s/U^max < |Γ_{s,h}| + 1 ⟹ |Γ_{s,h}| ≥ ⌊ x_s / U^max ⌋. Summing this up for all task-sets Γ_{s,h} of all cores, along with the new task τ_i to be placed, which all together constitute a partitioning of Γ, we obtain an evident contradiction to Eq. (9). □

In case of a particularly "nasty" task-set with only a heavyweight task with utilization U^max much higher than all other tasks, bringing down significantly the maximum number of tasks that could be hosted according to Proposition 2, we can identify a less pessimistic condition:

Proposition 3. A set of n tasks Γ = {1, . . . , n}, ordered from the maximum to the minimum nominal utilization, is schedulable by the technique presented in Section 5 on a platform having a multi-core island s with m_s cores if:

n ≤ 1 + ⌊ (x_s − U^max) / U^max2 ⌋ + (m_s − 1) ⌊ x_s / U^max2 ⌋   (10)

where U^max and U^max2 are the maximum and second-maximum nominal utilizations among all the tasks in Γ, and x_s is the speed of the core at the maximum island OPP k_s.

Proof. This is easily obtained again by contradiction, putting ourselves into a scenario where we have a task τ_i ≠ τ_1 whose utilization U_i does not fit onto any core. This time, however, we can put ourselves into a worst-case scenario where we have one of the cores, say h̃, hosting the heavyweight utilization U_1, and all other cores hosting tasks with a maximum utilization of U^max2. Therefore, for the core h̃ we have: x_s < U_{s,h̃} + U_i ≤ U_1 + (|Γ_{s,h̃}| − 1) U_2 + U_2 ≡ U_1 + |Γ_{s,h̃}| U_2 ⟹ (x_s − U_1)/U_2 < |Γ_{s,h̃}| ⟹ |Γ_{s,h̃}| ≥ ⌊ (x_s − U_1)/U_2 ⌋ + 1. For any other core h ≠ h̃, we have |Γ_{s,h}| ≥ ⌊ x_s/U_2 ⌋. Putting the tasks of all cores together, alongside the new task being placed τ_i, we conclude that the overall number of tasks must be

|Γ| = |Γ_{s,h̃}| + (m_s − 1) |Γ_{s,h}| + 1 ≥ ⌊ (x_s − U_1)/U_2 ⌋ + (m_s − 1) ⌊ x_s/U_2 ⌋ + 2,

which is in contradiction with Eq. (10). The case τ_i = τ_1 can be handled in a very similar way. □

Generalizing, we can design an admission test peeking at the 2nd, 3rd, . . . , kth maximum utilization, but the test loses its simplicity and its complexity grows in a combinatorial way. A discussion of these tests for SMP platforms can be found in Mascitti et al. (2020).

We focus now on the general problem of admitting a set of real-time tasks Γ into a big.LITTLE platform making use of the scheduling strategy presented in Section 5. First, we need to distinguish possible heavyweight tasks Γ_H, with a utilization too high to fit on any LITTLE core, from the other lightweight tasks Γ_L:

Γ_H ≜ {τ_i ∈ Γ | U_i > x_L};   Γ_L ≜ {τ_i ∈ Γ | U_i ≤ x_L}.   (11)

In our target scenarios, we do not expect to see many heavyweight tasks on the platform. However, whenever they are present, we assume to be able to spread them out evenly on the big cores, for simplicity in the analysis that follows. If heavyweight tasks can arrive dynamically, this means we should be ready to migrate one or more lightweight tasks, to ensure this property. However, we omit here further details for the sake of brevity.

Theorem 1. Given a big.LITTLE platform with m_L and m_B LITTLE and big cores respectively, and a set of real-time tasks Γ = Γ_H ∪ Γ_L partitioned as heavyweight and lightweight ones as from Eq. (11), with heavyweight tasks spread evenly on big cores and lightweight ones placed according to the technique proposed in Section 5, then we always manage to schedule them if the following conditions hold true:

|Γ_H| ≤ m_B ⌊ 1/U_B^max ⌋  ∨  |Γ_H| ≤ 1 + ⌊ (1 − U_B^max)/U_B^max2 ⌋ + (m_B − 1) ⌊ 1/U_B^max2 ⌋
|Γ_L| ≤ m_L ⌊ x_L/U_L^max ⌋ + ⌊ (1 − h·U_H^max)/U_L^max ⌋ (m_B − k) + ⌊ (1 − (h+1)·U_H^max)/U_L^max ⌋ k   (12)

where h ≜ ⌊ |Γ_H|/m_B ⌋ and k ≜ |Γ_H| mod m_B.

Proof sketch. The upper condition in Eq. (12) derives from the simple fact that heavyweight tasks can only be placed on big cores, so the results of Propositions 2 and 3 can be directly applied to the big island and the heavyweight tasks alone, observing that x_B = 1.

The lower condition is obtained as the sum of three terms, which are explained as follows. The first term accounts for the maximum number of lightweight tasks that can fit on the LITTLE island, again reusing the result in Proposition 2 (the one in Proposition 3 might be used as well; this is omitted for the sake of simplicity). The second and third terms refer to how many lightweight tasks can fit at most into the big island, using the residual space left by possible heavyweight tasks. Here, we need to consider how the heavyweight tasks are spread throughout the big cores. Whenever we have |Γ_H| heavyweight tasks, the worst-fit algorithm will cause h ≡ ⌊ |Γ_H|/m_B ⌋ tasks in Γ_H to be present in each big core; however, exactly k ≡ (|Γ_H| mod m_B) cores of the big island will host one more heavyweight task. Therefore, a corresponding maximum utilization of heavyweight tasks needs to be subtracted from the maximum possible big-core capacity, x_B = 1, to obtain the nominal utilization available for lightweight tasks to be hosted. □

7. Implementation details

BL-CBS has been implemented³ within RTSim (Palopoli et al., 2002), a portable, open-source, event-based simulator written in C++ that allows simulating the execution timing of real-time tasks running on multi-processor platforms using various schedulers. This section discusses a few implementation notes regarding an efficient and sound implementation of various critical steps of the algorithm introduced in Section 5.

³ Source code is available at https://gitlab.retis.santannapisa.it/a.mascitti/rtsim-efficient-public-grub-pa.

7.1. Push algorithm

First, in the push algorithm in Algorithm 1, we need to find the core with the minimum nominal utilization. Nowadays, big.LITTLE architectures present a small number of cores per island (typically 4), so it is easy to scan through the cores in an island for this purpose. However, for bigger islands of possible future platforms, we can use a min-heap to keep track of the lowest-utilization core of each island. With such a solution, we can quickly choose which core a newly arrived task fits into, with a logarithmic complexity in the number of per-island cores.
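The logarithmic-time tracking of the least-loaded core suggested above can be sketched with a sorted container. The fragment below is our own illustration (class and member names are hypothetical, not from RTSim); std::multiset plays the role of the min-heap, with the advantage that a core's key can also be updated when its utilization changes:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Tracks, for one island, the per-core nominal utilizations V_{s,h} in a
// sorted multiset of (utilization, core) pairs, so that the least-loaded
// core (the worst-fit candidate of Algorithm 1) is found in O(1) and
// each utilization update costs O(log m_s).
class IslandLoad {
public:
    explicit IslandLoad(int cores) : util_(cores, 0.0) {
        for (int h = 0; h < cores; ++h) byLoad_.insert({0.0, h});
    }
    int leastLoadedCore() const { return byLoad_.begin()->second; }
    double coreUtil(int h) const { return util_[h]; }
    // Apply a utilization change (task push: +U_i; task leaving: -U_i).
    void addUtil(int h, double delta) {
        byLoad_.erase(byLoad_.find({util_[h], h}));
        util_[h] += delta;
        byLoad_.insert({util_[h], h});
    }
private:
    std::vector<double> util_;
    std::multiset<std::pair<double, int>> byLoad_;
};
```

Here addUtil() would be called with +U_i when a task is pushed onto core h and with −U_i when it leaves, keeping leastLoadedCore() consistent with the worst-fit rule at all times.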
Second, in order to evaluate which move is the most convenient one from the average power consumption viewpoint, we needed to implement the function P_s(·), that is, the min-power curve obtained by combining all the "lowest power" segments highlighted in the plots of Fig. 1. From the in-kernel device tree, we can extract the table reporting, for each available island s and OPP j, the corresponding power consumption when computing, p_{s,j}, and when idle, p^idle_{s,j}, as introduced in Section 5.1. However, for computing the linear formula in Eq. (5), we need a data structure for retrieving the correct power terms corresponding to the minimum OPP able to host a given total nominal utilization. This is relatively expensive, so the search for the power terms related to a nominal utilization can be accelerated by keeping a sorted data structure, for example a balanced binary tree, allowing us to perform the look-up in O(log k_s), with k_s the number of available OPPs. Alternatively, with a little waste of memory, it is possible to store a sampled version of the same curve, with the utilization expressed, for example, in 1024ths, creating a table with 1024 entries, so that the same look-up can be performed in constant O(1) time instead.

7.2. Pull algorithm

The choice of which job to pull has an impact on how much frequencies can actually be reduced. In the proposed method, it is possible to choose a task from the big core with maximum utilization, if it fits into the destination core, as described in Section 5. This strategy allows achieving the best frequency reduction for the big island, since the core with maximum utilization is the one that constrains the island frequency.

For the just-mentioned choice, we need to find the busiest big core. Similarly to the above observation, this is not difficult for the very small islands of nowadays' big.LITTLE platforms. However, in case of a big island with a relatively high number of cores, it is always possible to implement a fast look-up by resorting to a max-heap with alteration operations having logarithmic complexity in the number of cores per island.

After having identified the busiest core, we need to find its biggest hosted task that fits into the destination core. If scanning through the real-time ready tasks of the found core needs acceleration, we can resort to a data structure sorted by (nominal) utilization, like a red-black tree, as already used in SCHED_DEADLINE for implementing the EDF scheduler queue.

8. Simulation results

8.1. Compared schedulers

We validated the BL-CBS algorithm described in Section 5 and measured its performance and energy consumption through the class EnergyMRTKernel in RTSim with respect to: (i) a variant of BL-CBS based on a first-fit placement strategy, called EDF-FF in what follows, implemented in the same EnergyMRTKernel class; (ii) a similar variant based on best-fit, called EDF-BF below; (iii) G-EDF, as available through the class MRTKernel_Linux5_3_11 in RTSim, simulating the behaviour of the mainline Linux kernel of a few years back, running SCHED_DEADLINE CBS reservations without GRUB nor power-awareness features; and (iv) MRTKernel_Linux5_3_11_GRUB_PA in RTSim, implementing the energy-related behaviour of GRUB-PA, i.e., decreasing the frequency to the minimum required to sustain the utilization of every CPU, and reflecting (part of) the behaviour of the implementation of GRUB-PA in the mainline Linux running SCHED_DEADLINE CBS reservations. In fact, a complete implementation of GRUB-PA would add the bandwidth reclaiming mechanism, which would not benefit our simulated scenarios, since task overruns are not considered.

Concerning our implementation of EDF-FF, this policy needs an ordering of the cores, so we consider the LITTLE cores as preceding the big ones. When a task τ_i with nominal utilization U_i arrives (i.e., a job begins), it is dispatched to the first core h where it fits in the mentioned ordering, raising its OPP if needed, i.e.: h = min{k | U_i ≤ x_{s_k} − V_{s_k,k}} (s_k denotes the island of core k). Note that, if U_i > x_L, then only a big core can be chosen. Whenever a core h is left idle, the first ready task fitting in h from the last non-idle core k > h where there is one such task, is pulled onto h, if any exists.

In our implementation of EDF-BF, when a task τ_i with nominal utilization U_i arrives (i.e., a job begins), it is placed on the core h, of the same island s where it was located, with the minimum residual capacity where it fits, raising the island OPP if needed: h ∈ s s.t. x_s − V_{s,h} = min_{k∈s} {x_s − V_{s,k} | U_i ≤ x_s − V_{s,k}}. However, if the current island has no core satisfying said condition, then the other island is tried. Under EDF-BF, pulls are disabled, i.e., when a job completes its execution or its virtual time expires, leaving its core idle, no pull from the other cores is performed.
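The two baseline placement rules just described can be condensed into a few lines. The sketch below is our own single-task rendition (the Core struct is illustrative, and EDF-BF's preference for the task's previous island is omitted for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A core as seen by the EDF-FF/EDF-BF baselines: the speed x_s of its
// island and its current aggregated utilization V_{s,h}. Cores are
// listed with the LITTLE ones preceding the big ones.
struct Core { double maxSpeed, util; };

// EDF-FF: first core, in the given ordering, where the task fits,
// i.e. h = min{ k | U_i <= x_{s_k} - V_{s_k,k} }; -1 if none fits.
int placeFirstFit(const std::vector<Core>& cores, double Ui) {
    for (std::size_t k = 0; k < cores.size(); ++k)
        if (Ui <= cores[k].maxSpeed - cores[k].util) return int(k);
    return -1;
}

// EDF-BF: core with the minimum residual capacity x_s - V_{s,h} among
// those where the task still fits; -1 if none fits.
int placeBestFit(const std::vector<Core>& cores, double Ui) {
    int best = -1;
    double bestResidual = 0.0;
    for (std::size_t k = 0; k < cores.size(); ++k) {
        double residual = cores[k].maxSpeed - cores[k].util;
        if (Ui <= residual && (best < 0 || residual < bestResidual)) {
            best = int(k);
            bestResidual = residual;
        }
    }
    return best;
}
```

The contrast with the worst-fit rule of BL-CBS is visible here: both baselines tend to pack tasks onto already-loaded cores, which is exactly what drives up the island OPP in the experiments discussed next.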
Fig. 3. Left: total energy consumption for different total utilizations, considering 10 experiments for each utilization. Right: energy consumption ratio between (i) G-EDF and BL-CBS; (ii) GRUB-PA and BL-CBS; (iii) EDF-BF and BL-CBS; and (iv) EDF-FF and BL-CBS, for different utilizations. Y axes are in logarithmic scale.
Fig. 4. Average frequency for different total utilizations for BL-CBS, EDF-BF, EDF-FF and GRUB-PA, for the big island (left) and the LITTLE island (right), considering
10 experiments for each utilization.
… rough granularity of 0.5 ms, in order to keep the generated hyperperiod within reasonable limits.

Each run has been repeated 10 times with different seed values for the taskset generator, running each taskset with either BL-CBS, EDF-BF, EDF-FF, GRUB-PA or G-EDF; statistics have then been computed on the resulting values of the various metrics of interest throughout the different runs.

Fig. 3 (left) depicts the obtained overall energy consumption throughout the runs (on the Y axis, in logarithmic scale) for each total nominal utilization (on the X axis, divided by 8), for the proposed technique, GRUB-PA, EDF-BF, EDF-FF and the G-EDF scheduler (different curves in the plot). Each reported point and its associated vertical bar represent the average energy consumption obtained over the 10 runs and its corresponding standard deviation. The energy consumption obtained with BL-CBS is consistently lower than the one obtained with G-EDF, EDF-BF, EDF-FF and GRUB-PA. This was expected with G-EDF, as BL-CBS on average keeps the frequencies of the islands at lower values than G-EDF, which constrains the two islands to their maximum frequencies. Also, BL-CBS consumes less than GRUB-PA for each total utilization, since at each job arrival BL-CBS chooses the core with the minimum power increase, while GRUB-PA picks the core with the latest-deadline task, without taking into account the consequent energy consumption. Moreover, GRUB-PA consumes less than G-EDF, since the latter keeps the cores at the highest frequencies. Finally, BL-CBS consumes less than EDF-FF, since the latter tends to load the first cores in the ordering as much as possible before using a subsequent empty one, causing high loads on the first core of an island and forcing its OPP to increase in the presence of idle cores on the same island. Similarly, BL-CBS consumes less than EDF-BF, since the latter does not spread the tasks on the cores of each island, and thus tasks tend to be dispatched on fewer cores of each island, which reduces the chances to lower the island OPP upon job completions. Generally speaking, as the total utilization grows, the power consumption of all the considered algorithms tends to grow, as the islands are kept at higher frequencies for longer times. Overall, EDF-BF, EDF-FF and G-EDF consume more than BL-CBS and GRUB-PA. BL-CBS achieves 15% of energy saving with respect to GRUB-PA, on average across all the performed experiments.

While Fig. 3 (left) takes into account experiments with different hyperperiods (which can be very different among the experiments), Fig. 3 (right) is hyperperiod-independent and shows the ratio (on the Y axis) of the overall energy consumption between (i) GRUB-PA and BL-CBS; (ii) EDF-BF and BL-CBS; (iii) EDF-FF and BL-CBS; and (iv) G-EDF and BL-CBS (different curves), over 10 experiments for each total nominal utilization (on the X axis, divided by 8). The vertical bars represent the minimum and the maximum ratios found among the experiments. While the ratio between GRUB-PA and BL-CBS is quite stable throughout the X axis, the ratio between G-EDF and BL-CBS is more marked and decreases with the utilizations. In case of the highest total nominal utilization, U = 5.6, we found only 2 experiments where BL-CBS consumes up to 20% more than GRUB-PA. As expected, the ratio between BL-CBS and GRUB-PA is the minimum, and it is similar to the one between BL-CBS and EDF-BF and between BL-CBS and EDF-FF, while the gap between BL-CBS and G-EDF is remarkable. Also, the ratios between BL-CBS and EDF-BF and between BL-CBS and EDF-FF are very close, because both fill fewer cores
Fig. 5. Comparison of power consumption over time for different utilizations (plots refer to just 1 run out of the 10 performed for each configuration).
0.25. Notice that the power consumptions of Fig. 5 also take into account the power consumed when the cores of both islands are idle, visible along the X axis.

The power savings just discussed, obtained with CPUs running at lower frequencies on average, impact the response times experienced by the real-time tasks. Fig. 6 reports the obtained experimental CDF of the response times relative to the periods, i.e., (f_{i,j} - r_{i,j})/T_i for all instances of all tasks, for the BL-CBS, GRUB-PA, EDF-BF, EDF-FF, and G-EDF schedulers (different curves). Job response times are higher with BL-CBS than with GRUB-PA, EDF-BF, EDF-FF, and G-EDF, where island frequencies are simply kept at the maximum. Since BL-CBS picks the core with the minimum power increase, while GRUB-PA simply picks the one with the latest deadline, cores tend to be less loaded and frequencies lower on average than with GRUB-PA, which increases the job response times of BL-CBS. Also, BL-CBS tends to spread the jobs over the cores of each island, while EDF-BF and EDF-FF prefer the cores already occupied, which increases the frequencies and lowers the job response times. Jobs take less time to terminate with GRUB-PA than with EDF-BF and EDF-FF, since GRUB-PA uses higher frequencies for total utilization U = 2.0 (which the CDF refers to), especially on the big island, as in Fig. 4.

Further experiments have been performed with randomly generated tasksets of 16 and 32 tasks and, for each configuration, the same total nominal utilization range. Results show a behaviour similar, in terms of energy saving, to the experiments presented above and depicted in Fig. 3, i.e., G-EDF has a much higher energy consumption than the other three algorithms, while BL-CBS consumes consistently less than GRUB-PA, EDF-BF, and EDF-FF for each total nominal utilization.

In our experimentation, all tasksets respecting the theoretical condition of Eq. (12) did not experience any deadline miss during their execution. Furthermore, the simulated tasksets also included many that did not respect said condition, which is clearly pessimistic, as its result is heavily affected by the biggest-utilization task in the set. In particular, among the tasksets with utilization from 0.25 onwards, many did not respect the theoretical schedulability condition, yet they did not experience any deadline miss throughout the run, as expected due to the pessimism of the theoretical analysis. The only exception has been the highest total utilization we considered (U = 5.6), where we found 0.53% of jobs having missed their deadline at the end of the simulation (for a taskset not respecting Eq. (12)).

Finally,^5 our implementation of the proposed algorithm, as provided in the RTSim simulator, has been measured to take on average 0.8 µs (and a maximum of 16.43 µs, when executing a routine to report that a job cannot be dispatched onto any core because the system is overloaded) when running on an Intel i7-8700 at 4.6 GHz with 16 GB of RAM.

^5 It would be very interesting to have an implementation of BL-CBS in a real kernel, such as in the SCHED_DEADLINE scheduler within the Linux kernel.

9. Conclusions and future work

In this paper, we have presented and simulated the Big-LITTLE Constant Bandwidth Server (BL-CBS), an adaptive partitioning approach to schedule real-time tasks in an energy-efficient way. The algorithm has been designed mainly for ARM big.LITTLE, exploiting the features of the underlying hardware architecture to provide both real-time guarantees and energy savings. It has been shown that BL-CBS allows for guaranteeing the timing constraints of real-time task sets that satisfy certain theoretical conditions. Simulations, based on experimental measurements made on the ODROID-XU3 board, show that the algorithm is actually promising, allowing for 15% of energy saving on average with respect to the state-of-the-art GRUB-PA.

Concerning possible lines of future work on the topic, we plan to increase the chances of migrations between the two islands and, possibly, to balance the load of the cores in even more situations, which opens many possibilities to decrease frequencies further. Moreover, the pessimism of our admission test can be reduced so that its usability can be expanded to a broader set of scenarios. We will also consider more complex task sets, with inter-task relationships (DAGs) and different workload types, and we plan to investigate how to modify the proposed mechanism in order to properly consider possible deep-idle states of the CPU. Finally, we plan to realize BL-CBS within the current SCHED_DEADLINE codebase in the Linux kernel, to perform further experimentation and validation using real application workloads on Linux/Android.

CRediT authorship contribution statement

Agostino Mascitti: Software, Writing - original draft, Data curation, Investigation. Tommaso Cucinotta: Supervision, Conceptualization, Methodology, Writing - original draft. Mauro Marinoni: Supervision, Conceptualization, Methodology. Luca Abeni: Supervision, Conceptualization, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abeni, L., Buttazzo, G., 1998. Integrating multimedia applications in hard real-time systems. In: Proc. 19th IEEE Real-Time Systems Symposium. pp. 4–13.

Abeni, L., Cucinotta, T., 2020. Adaptive partitioning of real-time tasks on multiple processors. In: Proc. 35th Annual ACM Symposium on Applied Computing. SAC'20, ACM, New York, NY, USA, ISBN: 9781450368667, pp. 572–579.

Andersson, B., Tovar, E., 2006. Multiprocessor scheduling with few preemptions. In: 12th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications. (ISSN: 2325-1271) pp. 322–334.

ARM, 2019. ARM Technologies: DynamIQ. https://www.arm.com/why-arm/technologies/dynamiq. (Accessed 4 November 2019).

Aydin, H., Melhem, R., Mossé, D., Mejía-Alvarez, P., 2004. Power-aware scheduling for periodic real-time tasks. IEEE Trans. Comput. 53 (5), 584–600.

Balsini, A., Cucinotta, T., Abeni, L., Fernandes, J., Burk, P., Bellasi, P., Rasmussen, M., 2019. Energy-efficient low-latency audio on Android. J. Syst. Softw. (ISSN: 0164-1212) 152, 182–195.

Balsini, A., Pannocchi, L., Cucinotta, T., 2016. Modeling and simulation of power consumption and execution times for real-time tasks on embedded heterogeneous architectures. In: Proc. International Workshop on Embedded Operating Systems. Torino, Italy.

Bambagini, M., Bertogna, M., Marinoni, M., Buttazzo, G., 2013. An energy-aware algorithm exploiting limited preemptive scheduling under fixed priorities. In: 2013 8th IEEE International Symposium on Industrial Embedded Systems. pp. 3–12.

Bambagini, M., Marinoni, M., Aydin, H., Buttazzo, G., 2016. Energy-aware scheduling for real-time systems: A survey. ACM Trans. Embed. Comput. Syst. 15 (1), 7.

Baruah, S., Carpenter, J., 2003. Multiprocessor fixed-priority scheduling with restricted interprocessor migrations. In: Proc. 15th Euromicro Conference on Real-Time Systems. pp. 195–202.

Burns, A., Davis, R.I., Wang, P., Zhang, F., 2012. Partitioned EDF scheduling for multiprocessors using a C=D task splitting scheme. Real-Time Syst. 48 (1), 3–33.

Casini, D., Biondi, A., Buttazzo, G., 2017. Semi-partitioned scheduling of dynamic real-time workload: A practical approach based on analysis-driven load balancing. In: 29th Euromicro Conference on Real-Time Systems. Dubrovnik, Croatia.

Cheramy, M., Hladik, P., Deplanche, A., Dube, S., 2014. Simulation of real-time scheduling with various execution time models. In: Proc. 9th IEEE International Symposium on Industrial Embedded Systems. pp. 1–4.
Chwa, H.S., Seo, J., Yoo, H., Lee, J., Shin, I., 2015. Energy and feasibility optimal global scheduling framework on big.LITTLE platforms. In: Proc. Real-Time Scheduling Open Problems Seminar. Lund, Sweden, pp. 1–11.

Colin, A., Kandhalu, A., Rajkumar, R., 2014. Energy-efficient allocation of real-time applications onto heterogeneous processors. In: Proc. 20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications. pp. 1–10.

Emberson, P., Stafford, R., Davis, R.I., 2010. Techniques for the synthesis of multiprocessor tasksets. In: Proc. 1st International Workshop on Analysis Tools and Methodologies for Embedded and Real-Time Systems. Brussels, Belgium, pp. 6–11.

Funk, S., 2004. EDF Scheduling on Heterogeneous Multiprocessors (Ph.D. thesis). University of North Carolina at Chapel Hill.

Guo, Z., Bhuiyan, A., Liu, D., Khan, A., Saifullah, A., Guan, N., 2019. Energy-efficient real-time scheduling of DAGs on clustered multi-core platforms. In: IEEE Real-Time and Embedded Technology and Applications Symposium. pp. 156–168.

Guo, Z., Bhuiyan, A., Saifullah, A., Guan, N., Xiong, H., 2017. Energy-efficient multi-core scheduling for real-time DAG tasks. In: 29th Euromicro Conference on Real-Time Systems. Dubrovnik, Croatia, pp. 156–168.

Imes, C., Hoffmann, H., 2015. Minimizing energy under performance constraints on embedded platforms: Resource allocation heuristics for homogeneous and single-ISA heterogeneous multi-cores. ACM SIGBED Rev. 11 (4), 49–54.

Li, T., Zhang, T., Yu, G., Song, J., Fan, J., 2019. Minimizing temperature and energy of real-time applications with precedence constraints on heterogeneous MPSoC systems. J. Syst. Archit. 98, 79–91.

Liu, D., Spasic, J., Chen, G., Stefanov, T., 2015. Energy-efficient mapping of real-time streaming applications on cluster heterogeneous MPSoCs. In: 13th IEEE Symposium on Embedded Systems for Real-Time Multimedia. ESTIMedia. pp. 1–10.

Liu, D., Spasic, J., Wang, P., Stefanov, T., 2016. Energy-efficient scheduling of real-time tasks on heterogeneous multicores using task splitting. In: IEEE 22nd International Conference on Embedded and Real-Time Computing Systems and Applications. pp. 149–158.

Mascitti, A., Cucinotta, T., Abeni, L., 2020. Heuristic partitioning of real-time tasks on multi-processors. In: 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing. ISORC. pp. 36–42. http://dx.doi.org/10.1109/ISORC49007.2020.00015.

Mascitti, A., Cucinotta, T., Marinoni, M., 2020. An adaptive, utilization-based approach to schedule real-time tasks for ARM big.LITTLE architectures. In: Proc. International Workshop on Embedded Operating Systems, vol. 17. ACM, New York, NY, USA, pp. 18–23.

Moulik, S., Devaraj, R., Sarkar, A., 2019. HEALERS: A heterogeneous energy-aware low-overhead real-time scheduler. IET Comput. Digit. Tech. 13 (6), 470–480.

Nogues, E., Pelcat, M., Menard, D., Mercat, A., 2016. Energy efficient scheduling of real-time signal processing applications through combined DVFS and DPM. In: 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, pp. 622–626.

Palopoli, L., Lipari, G., Abeni, L., Di Natale, M., Ancilotti, P., Conticelli, F., 2001. A tool for simulation and fast prototyping of embedded control systems. In: Proc. 2001 ACM SIGPLAN Workshop on Optimization of Middleware and Distributed Systems. Snow Bird, Utah, USA, pp. 73–81.

Palopoli, L., Lipari, G., Lamastra, G., Abeni, L., Bolognini, G., Ancilotti, P., 2002. An object-oriented tool for simulating distributed real-time control systems. Softw. Pract. Exp. 32 (9), 907–932.

Perret, Q., 2018. Energy aware scheduling. https://lwn.net/Articles/760647.

Pillai, A.S., Isha, T.B., 2013. ERTSim: An embedded real-time task simulator for scheduling. In: 2013 IEEE International Conference on Computational Intelligence and Computing Research. Chennai, India, pp. 1–4.

Pillai, P., Shin, K.G., 2001. Real-time dynamic voltage scaling for low-power embedded operating systems. In: Proc. 18th ACM Symposium on Operating Systems Principles. vol. 35. Banff, Canada, pp. 89–102.

Qin, Y., Zeng, G., Kurachi, R., Li, Y., Matsubara, Y., Takada, H., 2019a. Energy-efficient intra-task DVFS scheduling using linear programming formulation. IEEE Access 7, 30536–30547.

Qin, Y., Zeng, G., Kurachi, R., Matsubara, Y., Takada, H., 2019b. Execution-variance-aware task allocation for energy minimization on the big.LITTLE architecture. Sustain. Comput. Inform. Syst. 22, 155–166.

Saewong, S., Rajkumar, R., 2003. Practical voltage-scaling for fixed-priority RT-systems. In: Proc. 9th IEEE Real-Time and Embedded Technology and Applications Symposium. pp. 106–114.

Scordino, C., Abeni, L., Lelli, J., 2018. Energy-aware real-time scheduling in the Linux kernel. In: Proc. 33rd Annual ACM Symposium on Applied Computing. Pau, France, pp. 601–608.

Scordino, C., Abeni, L., Lelli, J., 2019. Real-time and energy efficiency in Linux: Theory and practice. ACM SIGAPP Appl. Comput. Rev. (ISSN: 1559-6915) 18 (4), 18–30.

Scordino, C., Lipari, G., 2004. Using resource reservation techniques for power-aware scheduling. In: Proc. 4th ACM International Conference on Embedded Software. EMSOFT '04, ACM, New York, NY, USA, ISBN: 1581138601, pp. 16–25.

Scordino, C., Lipari, G., 2006. A resource reservation algorithm for power-aware scheduling of periodic and aperiodic real-time tasks. IEEE Trans. Comput. 55 (12), 1509–1522.

Thakare, G.S., Deshmukh, P.R., 2017. EERTSS: An energy efficient real-time task scheduling simulator. In: 2017 International Conference on Computing, Communication, Control and Automation. pp. 1–4.

Thammawichai, M., Kerrigan, E.C., 2018. Energy-efficient real-time scheduling for two-type heterogeneous multiprocessors. Real-Time Syst. 54 (1), 132–165.

Zahaf, H., Lipari, G., Bertogna, M., Boulet, P., 2019. The parallel multi-mode digraph task model for energy-aware real-time heterogeneous multi-core systems. IEEE Trans. Comput. 68 (10), 1511–1524.

Zhu, Y., Mueller, F., 2004. Feedback EDF scheduling exploiting dynamic voltage scaling. In: 10th IEEE Real-Time and Embedded Technology and Applications Symposium. Toronto, Canada, pp. 84–93.

Zhu, Y., Mueller, F., 2007. Exploiting synchronous and asynchronous DVS for feedback EDF scheduling on an embedded platform. ACM Trans. Embed. Comput. Syst. (ISSN: 1539-9087) 7 (1), 3:1–3:26.