Adapting FreeRTOS For Multicores: An Experience Report
SUMMARY
Multicore processors are ubiquitous. Their use in embedded systems is growing rapidly, and given the constraints on uniprocessor clock speeds, their importance in meeting the demands of increasingly processor-intensive embedded applications cannot be overstated. To harness this potential, system designers need to
have available to them embedded operating systems with built-in multicore support for widely available
embedded hardware. This paper documents our experience of adapting FreeRTOS, a popular embedded
real-time operating system, to support multiple processors. A working multicore version of FreeRTOS that
is able to schedule tasks on multiple processors as well as provide full mutual-exclusion support for use in
concurrent applications is presented. Mutual exclusion is achieved in an almost completely platform-agnostic
manner, preserving one of FreeRTOS’s most attractive features: portability.
KEY WORDS: embedded systems; operating systems; multicore computing; task scheduling; software-
based mutual exclusion; field-programmable gate arrays
1. INTRODUCTION
FreeRTOS is a popular open-source embedded real-time operating system (RTOS) that has been
ported to over 30 different hardware platforms and receives over 75,000 downloads per year. The
aim of this work is to produce a version of FreeRTOS that supports multicore hardware, an increas-
ingly important requirement given that such hardware is now appearing widely even in embedded
devices. The software developed during the course of the work has been created to serve as a starting
point from which a full-featured version of FreeRTOS can be developed to provide a comprehensive
operating system (OS) solution for embedded devices with multiple processors. This paper, based
on the first author’s MSc thesis [1], documents our experience of the whole software development
process, including requirements, design choices, implementation details, critical evaluation, and the
lessons learned.
1.1. Motivation
Real-time software is fundamental to the operation of systems in which there exist requirements
to impose temporal deadlines and behave in a deterministic manner [2, p. 4]. Examples of such
systems include those responsible for controlling airbags in cars, control surfaces in aircraft, and
missile early warning alerts.
To prove that real-time systems meet their deadlines, it is necessary to identify their worst-case
performance scenarios. If systems become too complex, the cost of performing such an analysis can
become prohibitively high. Thus, simplicity is a core principle of the open-source embedded RTOS,
FreeRTOS [3]. Developing applications based on FreeRTOS involves leveraging an easy-to-use API
and a simple, low-footprint, real-time kernel. As with any RTOS, it must be understood that it serves
as a tool to be used correctly or incorrectly by system developers; an RTOS is not a magic elixir
from which normal software automatically derives real-time properties.
Indeed, it is important to make the point that processing throughput (the amount of processing a
system can do per unit of time) does not define a real-time system [4, 5]. A hard real-time system
that can guarantee its deadlines will be met becomes no ‘more’ real-time with additional processing
throughput, although it may well become faster. The motivation for extending FreeRTOS to support
multicore hardware is thus not to change the way in which it handles task deadlines or exhibits
predictable properties, but rather to meet the realities of the ever-increasing demands being made
of embedded systems. For real-time and non-real-time applications alike, being able to run tasks
concurrently promises performance advantages not possible on single-core architectures because
of the limit processor manufacturers have encountered in maintaining increases in clock speeds on
individual chips.
Perhaps the most striking examples of this in embedded systems can be found in the incredi-
ble explosion of processing power, and specifically the recent mainstream adoption of multicore
architectures, on smartphones [6]. It is clear that all embedded real-time systems must adapt in the
same way as those in the consumer market. Andrews et al. [7] contended that RTOS designers have
been ‘fighting’ Moore’s law and must now instead look to make use of processor advancements
to avoid resorting to ever more complex software solutions designed to squeeze the required real-time performance from slower hardware, along with the maintenance problems that such solutions bring. More fundamentally, a lack of action has the potential to cause the usefulness of
embedded real-time systems to hit a wall. What will the system designers do when future iterations
of the application must perform more tasks, process larger data, and provide faster response times
without having access to matching improvements in single-core hardware? Migration to multicores
is inevitable.
1.2. Method
This paper documents the process of modifying the MicroBlaze FreeRTOS port to run tasks
concurrently on multiple processors. MicroBlaze is a soft processor architecture implemented
through the use of field-programmable gate arrays (FPGAs), special integrated circuits whose
configuration is changeable ‘in the field’. This technology allows for the rapid design and imple-
mentation of customised hardware environments, allowing complete control over the processing
logic as well as system peripherals such as input/output (I/O) devices and memory. Because of
how configurable and relatively inexpensive FPGAs are, they provide an attractive way to develop
highly parallel hardware solutions, with which a multicore embedded RTOS would be incredibly
useful in providing a platform to abstract both the real-time scheduling behaviour and the underlying
hardware away from application code.
Although MicroBlaze has been chosen as the test bed, it is important to point out that the majority
of the multicore modifications made to FreeRTOS are fully general, not specific to any particular
platform. In fact, making the modifications easily portable to a range of different multicore architec-
tures is one of the primary design goals in this work. Portability is, after all, one of the main features
of the FreeRTOS design, and this work aims to preserve it.
With that said, the MicroBlaze platform does have attractions. MicroBlaze FPGA designs are
readily available and can be implemented with little effort. And by being able to strip down the
hardware configuration (for example by disabling processor caching and limiting the number of
hardware peripherals), it is possible to greatly reduce the complexity of the software components
that interact with the hardware at a low level, thus reducing implementation time. Another attraction
is that the number of processor cores need not, in principle, be limited to just two: modern FPGAs
are capable of hosting tens of MicroBlaze cores.
1.3. Outline
Section 2 discusses the background issues, technologies, and research surrounding this topic.
Section 3 details and analyses the requirements. Section 4 explains the system design, detailing
significant algorithms and other key design decisions. Section 5 discusses the components of the
system, how they work, and the justifications for the exact method of their implementation. Section 6
evaluates the adequacy of the implementation with regard to the defined requirements. Section 7
summarises the work, recapping the main lessons learned, and suggests avenues for future work.
2. BACKGROUND
Disabling interrupts is all that is necessary to ensure exclusive access because the only way
another task may run and therefore gain access to the shared variables on a single-core processor is
if the scheduler interrupts the current task and resumes another.
However, on a multicore processor, this is not the case. Because all cores execute instructions at
the same time, the potential for the simultaneous use of shared resources is not connected to the
swapping of tasks. To protect against this, an alternative mechanism for creating a critical section
that works across all cores in the system is required.
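On a single core, the interrupt-disabling pattern is exposed by FreeRTOS as the taskENTER_CRITICAL/taskEXIT_CRITICAL macros, which the portable layer typically maps onto interrupt disable and enable. A minimal sketch (the variable and function names here are illustrative, not from the FreeRTOS sources):

    #include "FreeRTOS.h"
    #include "task.h"

    static volatile unsigned long ulSharedCounter;

    void vIncrementShared( void )
    {
        taskENTER_CRITICAL();   /* disables interrupts on this core */
        ulSharedCounter++;      /* safe against pre-emption, but on one core only */
        taskEXIT_CRITICAL();    /* re-enables interrupts */
    }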
2.2.1. Built-in atomicity. Some processor architectures provide built-in operations that are atomic
(indivisible) and that work across all cores. For example, on x86, the xchg instruction can be used
to exchange a register value with one at a specified memory location [8]. Because this exchange is
guaranteed to be atomic across all cores, it can be used to implement mutual-exclusion primitives
such as semaphores.
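As an illustration of how an atomic exchange yields one such primitive, a simple spinlock, the following sketch uses GCC's __atomic_exchange_n built-in as a portable stand-in for the raw xchg instruction (the function names are ours, not from any particular library):

    static volatile int lock = 0;   /* 0 = free, 1 = held; use as spin_acquire( &lock ) */

    void spin_acquire( volatile int *l )
    {
        /* Atomically swap 1 into the lock; if the previous value was
           already 1, another core holds it, so keep trying. */
        while( __atomic_exchange_n( l, 1, __ATOMIC_ACQUIRE ) == 1 )
            ;                       /* busy wait */
    }

    void spin_release( volatile int *l )
    {
        __atomic_store_n( l, 0, __ATOMIC_RELEASE );  /* mark the lock free */
    }

The guarantee that no two cores can observe the same 'old' value of 0 is exactly what makes the exchange suitable as a mutual-exclusion primitive.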
MicroBlaze also provides instructions that allow atomic operations on memory (namely lwx and
swx), but a core executing these instructions does not co-ordinate its actions with other cores in
the processor. It thus does not provide atomicity across multiple cores, making these instructions
inappropriate for implementing multicore mutual exclusion.
2.2.2. Mutex peripheral. To enable mutual exclusion on multicore MicroBlaze processors at the
hardware level, Xilinx provide a ‘mutex peripheral’ that can be programmed into their FPGAs
[9]. This peripheral is configured at design time with various parameters, including the quantity
of desired mutex objects. Software running on the processor cores can then interface with the mutex
peripheral to request that a mutex be locked or released.
Rather than simply being hardware specific (after all, the x86 atomic instructions are also
hardware specific), this method of implementing mutual exclusion is configuration specific. The
mutex peripheral is a prerequisite to supporting mutual exclusion using this method and requires
additional FPGA space as well as software specific to the peripheral to operate. In addition, because
certain options (such as the number of supported mutexes) must be configured at the hardware level,
related changes postimplementation may make maintaining the system harder.
competing process’s write to the turn variable takes effect, the loop expression will no longer
evaluate to true for the first process. The first process will then exit the loop and be within the criti-
cal section. Once finished, the process will set its ‘intention’ in the Q array to 0 (denoting false), and
a competing process busy waiting in the while loop will detect this and enter the critical section
itself. Note that all elements in the Q array are initialised to 0 to indicate that in the initial state of
the system, no process intends to enter the critical section.
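For reference, the two-process algorithm just described can be sketched as follows (a paraphrase of the above description, with memory-ordering concerns deferred to Section 2.2.5):

    static volatile int Q[ 2 ] = { 0, 0 };   /* intention flags, initially 0 */
    static volatile int turn;

    void enter_critical( int self )           /* self is 0 or 1 */
    {
        int other = 1 - self;
        Q[ self ] = 1;                        /* declare intention */
        turn = other;                         /* give way to the competitor */
        while( Q[ other ] && turn == other )
            ;                                 /* busy wait */
    }

    void exit_critical( int self )
    {
        Q[ self ] = 0;                        /* withdraw intention */
    }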
Figure 2. Implementing an n-process lock using n − 1 instances of a two-process lock. Arrows represent processes. The winning process must acquire (or pass through) log2(n) locks. In this example, n = 8.
to enter stage j. The idea is that, at each stage j, a single process will be blocked (namely the one specified by TURN[j]) and prevented from proceeding to the next stage unless all other processes are currently working at a stage earlier than j. After n − 1 stages, only one process can remain, and it may enter the critical section.
An inefficiency of Peterson’s generalised algorithm, first highlighted by Block and Woo [13], is
caused by the fact that each process must cross n − 1 stages, meaning that even processes with
no interest in the critical section must have their elements in Q checked by a process wishing to
enter. Block and Woo’s solution, as presented by Alagarsamy [11], is shown in Figure 4. The main
idea is that if only m out of the n processes are competing at a particular time, then the process at
stage m can enter the critical section (rather than having to wait until it reaches stage n as in Peter-
son’s algorithm). Note that in Block and Woo’s algorithm, the Q array simply holds 0 or 1 values
denoting whether a process wishes to enter the critical section (like in the two-process version of
Peterson’s algorithm).
As Alagarsamy explains, Peterson’s generalised algorithm has another limitation: it does not
enforce any upper bound on the number of ‘bypasses’ that can occur. A bypass is a situation in
which a process that is made to busy wait at the first stage is ‘overtaken’ by other processes before
it can progress to a higher stage [11]. This can occur because a process blocked at the first stage,
regardless of the duration for which it has been blocked, is at the same stage as a process show-
ing interest in the critical section for the first time and can thus be ‘beaten’ to the second stage. A
process that is continually overtaken in this manner can thus find itself perpetually denied access to
the critical section.
Whereas Block and Woo’s algorithm does bound the number of bypasses at n(n − 1)/2 [11], Alagarsamy introduced a variant, shown in Figure 5, that promises an ‘optimal’ n − 1 bypasses.
Alagarsamy’s solution works by having processes ‘promote’ others at lower stages when they enter
the critical section. This ensures that any process blocked at any stage will be guaranteed to move to
the next once a process enters the critical section. The code to achieve promotion is visible in lines
7–9 of Figure 5. However, there is a danger with such promotion: it may cause more than one pro-
cess to reach a stage high enough to enter the critical section (the number of competing processes goes
down at the same time that processes are promoted). To address this issue, Alagarsamy reintroduced
the ‘wait until’ condition of Peterson’s algorithm into Block and Woo’s algorithm to give lines 1–5
of Figure 5.
2.2.5. Memory barriers. A challenge that exists with software-based mutual exclusion arises from
the fact that processors will often reorder memory accesses internally, for example to improve
performance. This behaviour cannot be controlled at design time and must be mitigated using
a ‘memory barrier’. A memory barrier is a point in a program, represented by the issuing of a
special-purpose instruction, after which the processor must guarantee that all outstanding memory
operations are complete. On x86, the mfence instruction performs this function; on MicroBlaze,
the equivalent is the mbar instruction. If we use these instructions strategically, it should be possible
to avoid the effects of reordering on the algorithms.
Although software-based mutual exclusion arguably introduces certain inefficiencies into the
implementation of synchronisation features, the benefits of providing largely hardware-agnostic
mutual exclusion cannot be ignored, particularly in a system ported to as many platforms as
FreeRTOS.
2.3. FreeRTOS
FreeRTOS is an RTOS written in C. As of August 2011, it has been ported to 27 different archi-
tectures [3]. Perhaps the most impressive aspect of FreeRTOS is that which has enabled its use on
so many different platforms: its division into two architectural layers, the ‘hardware-independent’
layer, responsible for performing the majority of OS functions, and the ‘portable’ layer, responsible
for performing hardware-specific processing (such as context switching). The hardware-independent
layer is the same across all ports—it expects a standard set of functions to be exposed by the portable
layer so that it can delegate platform-specific processing to it, regardless of which hardware the port
is intended for.
2.3.1. Overview. The division between the hardware-independent and portable layers is expressed
in terms of the source files in Figure 6.‡ The two required files, ‘list.c’ and ‘tasks.c’, provide the
minimum high-level functionality required for managing and scheduling tasks. The function of each
source file is explained as follows:§
croutine.c Provides support for ‘coroutines’. These are very limited types of tasks that are more
memory efficient.‡
list.c Implements a list data structure for use by the scheduler in maintaining task queues.
queue.c Implements a priority queue data structure accessible to application code for use in
message passing.‡
tasks.c Provides task management functionality to both application tasks and the portable layer.
timers.c Provides software-based timers for use by application tasks.‡
port.c Exposes the standard portable API required by the hardware-independent layer.
heap.c Provides the port-specific memory allocation and de-allocation functions. This is
explained later in more detail.
Note that the application code is not illustrated in Figure 6. This includes a global entry point to
the system, taking the form of a main function, and is responsible for creating all the initial appli-
cation tasks and starting the FreeRTOS scheduler. Note that a FreeRTOS task is most analogous
to a thread in a conventional OS: it is the smallest unit of processing within FreeRTOS. Indeed,
‡ Header files are not included in this illustration.
§ Components marked ‡ are excluded from the scope of this work.
because FreeRTOS distributions are typically highly specialised, no distinction is made between tasks and processes.
2.3.3. The life of a task. When a task is created using the xTaskGenericCreate function,
FreeRTOS first allocates memory for the task. This involves allocating memory both for the task’s
stack (the size of which can be specified in the function call) and for the task control block (TCB)
data structure, which, among other things, stores a pointer to the task’s stack, its priority level, and
also a pointer to its code. The only hard design-time restriction on this code is that its variables must
fit within the stack size specified when the task is created. A task should also not end without first
deleting itself using the vTaskDelete API function to free up the memory it uses (Figure 7).
A task can be defined using the portTASK_FUNCTION macro.
Figure 7. Pseudo-UML sequence diagram showing a task’s creation and execution.
This macro accepts the name of the task function as the first parameter and the name of the task
‘parameter’ argument as the second (used for passing information to the task when it first starts).
Expanded when using the MicroBlaze port, the preceding macro looks as follows:
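(The definition below follows the standard form used by stock FreeRTOS ports; the MicroBlaze port’s expansion is assumed to match it.)

    #define portTASK_FUNCTION( vFunction, pvParameters ) \
        void vFunction( void *pvParameters )

    /* A task defined as
           portTASK_FUNCTION( vMyTask, pvParameters ) { ... }
       therefore expands to
           void vMyTask( void *pvParameters ) { ... }            */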
This macro is defined in the portable layer, and its use improves the portability of application code by ensuring that task function signatures are always correct for the platform on which the code is compiled.
The memory allocation for a task is delegated to the bundled memory manager, which is a
component of the portable layer. The memory manager is expected to expose two functions:
pvPortMalloc (for memory allocation) and vPortFree (for memory de-allocation). How
these functions manage the memory internally is up to the specific implementation: included with
FreeRTOS are three different memory managers, each with a different behaviour. The first and most
simple allows memory allocation but not de-allocation; the second permits releasing without com-
bining small adjacent blocks into larger contiguous ones; and the third relies on the compiler’s own
malloc and free implementations but does so in a thread-safe way.
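As a sketch of the third behaviour, in the style of FreeRTOS’s bundled heap_3.c, which wraps the compiler’s allocator and suspends the scheduler to achieve thread safety:

    #include <stdlib.h>
    #include "FreeRTOS.h"
    #include "task.h"

    void *pvPortMalloc( size_t xWantedSize )
    {
        void *pvReturn;
        vTaskSuspendAll();              /* no context switch while inside malloc */
        pvReturn = malloc( xWantedSize );
        xTaskResumeAll();
        return pvReturn;
    }

    void vPortFree( void *pv )
    {
        if( pv != NULL )
        {
            vTaskSuspendAll();
            free( pv );
            xTaskResumeAll();
        }
    }

Note that suspending the scheduler only guards against pre-emption on a single core; as discussed later, the multicore kernel must additionally protect allocation with a cross-core critical section.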
Once allocation is done, xTaskGenericCreate then delegates the initialisation of the stack
area to the portable layer using pxPortInitialiseStack. The purpose of this function is to
prepare the stack area of the newly created task in such a way that it can be started by the context-
switching code as if it was a suspended task being resumed. Once created, the task is then added to
the ready queue in preparation for it to be scheduled at a later time. When no tasks run, a system-
created idle task takes control of the processor until it is swapped out. This can occur either as a
result of a context switch or because the idle task is configured to yield automatically by setting the
configIDLE_SHOULD_YIELD macro to 1.
Eventually, the main function (referred to as the ‘stub’) will start the FreeRTOS scheduler using
vTaskStartScheduler in the hardware-independent layer, although the notion of a scheduler
entirely provided by FreeRTOS is not accurate. Scheduling, by its very nature, straddles the line
between the hardware-independent and portable layers. On the one hand, it is heavily dependent on
the handling of interrupts (when in pre-emptive mode) and the performance of context switching,
things firmly in the realm of the portable layer. On the other, it involves making high-level decisions
about which tasks should be executed and when. This is certainly something that would not be worth
re-implementing in each portable layer.
Once the FreeRTOS scheduler has been started by the application code using the hardware-
independent layer, the xPortStartScheduler function in the portable layer is called. This then
configures the hardware clock responsible for generating tick interrupts. This allows FreeRTOS to
keep track of time and, when in pre-emptive mode, to trigger task rescheduling.
On MicroBlaze, _interrupt_handler is the function to which the processor jumps when
an interrupt occurs. This calls vPortSaveContext, which transfers the contents of the proces-
sor’s registers, including the address at which to continue execution, into the memory. Once this
returns, it clears the yield flag (a flag denoting whether or not a task should yield control of the
processor) and then calls the low-level handler XIntc_DeviceInterruptHandler. This is
provided by Xilinx in a small support library. Its purpose is to perform the necessary housekeeping
to ensure that interrupts are correctly acknowledged. It also calls an optionally defined ‘callback’
function associated with each interrupt. For clock interrupts, this is configured by the portable layer
to be the vTickISR function, which simply increments the system tick counter and, if in pre-
emptive mode, sets the yield flag. Control then returns to _interrupt_handler, which tests
the yield flag. If the flag indicates that a yield should occur, then vTaskSwitchContext is
called in the hardware-independent layer, which selects a task to ‘swap in’ by pointing to it using
the pxCurrentTCB pointer. vPortRestoreContext is then called, which restores the con-
text of the task denoted by pxCurrentTCB. Note that if the yield flag indicates that a yield should
not occur, vTaskSwitchContext is not called and the task that was interrupted is resumed by
vPortRestoreContext.
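The control flow just described can be summarised in C as follows; this is a simplified rendering under assumed names (xYieldFlag, the device ID constant), not the port’s literal code:

    volatile portBASE_TYPE xYieldFlag;   /* set by vTickISR in pre-emptive mode */

    void _interrupt_handler( void )
    {
        vPortSaveContext();              /* registers -> current task's stack */
        xYieldFlag = pdFALSE;            /* clear the yield flag */

        /* Xilinx low-level handler: acknowledges the interrupt and invokes
           the registered callback (vTickISR for clock interrupts). */
        XIntc_DeviceInterruptHandler( ( void * ) XPAR_INTC_0_DEVICE_ID );

        if( xYieldFlag != pdFALSE )
        {
            vTaskSwitchContext();        /* repoint pxCurrentTCB at the next task */
        }
        vPortRestoreContext();           /* resume the task at pxCurrentTCB */
    }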
2.3.4. Interrupt controller. The MicroBlaze port requires an interrupt controller to be included in
the hardware design. An interrupt controller accepts multiple interrupt sources as input, forward-
ing received interrupts to a processor in a priority order. The main reason this is necessary on
MicroBlaze is that the architecture only has one interrupt line [15]—the interrupt controller is relied
upon to receive and queue interrupts before the processor is notified. As a result, the portable layer
must configure and enable the processor’s interrupt controller before any tasks can be started. This
is performed by applications in their main function by calling xPortSetupHardware.
3. OBJECTIVES
3.1.2. REQ-02: Operating system components must be implicitly thread safe. The OS components
will utilise mutual-exclusion features automatically when required to ensure thread safety and will
not expose such concerns to application tasks.
3.1.3. REQ-03: Application tasks must have access to an API providing mutual exclusion. Tasks
must be able to access shared resources in a thread-safe manner without implementing mutual-
exclusion solutions themselves and thus must be able to synchronise using an API exposed by
FreeRTOS. Using no more than two API calls, application tasks will be able to enter and exit a
critical region for protected access to a named resource.
3.1.4. REQ-04: The MicroBlaze port of FreeRTOS must be modified to be compatible with the
multicore-enabled hardware-independent layer. To run the modified version of the hardware-
independent layer on bare hardware (i.e. without using a simulated environment), the MicroBlaze
portable layer developed by Tyrel Newton will be modified to support a multicore architecture and
made compatible with the changes in the hardware-independent layer.
3.1.5. REQ-05: Example applications must be created to demonstrate concurrency and synchroni-
sation. To demonstrate the use of concurrency and synchronisation implemented by the FreeRTOS
modifications, appropriate applications will be created to exemplify the features added.
3.1.6. REQ-06: Modifications made to FreeRTOS must scale for use with n cores. It would be
unnecessarily limiting to restrict multicore support in FreeRTOS to merely support for dual-core
processors. Modifications should scale transparently for use on processors with more than two cores.
3.2.2. REQ-08: Application development must not be made more difficult. One of the virtues of
FreeRTOS is its simplicity. To minimise the effect on application development time, it is important
that this aspect of the OS is preserved through the modifications made in this work. For example,
converting a task of between 250 and 350 lines in length designed for the unmodified version of FreeRTOS into one compatible with the modifications made in this work must take a developer familiar with both versions no longer than 5 min.
3.2.3. REQ-09: The results of this work should be reproducible. Complex hardware requirements
have the potential to make any results difficult to reproduce. Thus, the hardware design used must
be reproducible using the Xilinx Platform Studio (XPS) base system builder (an automated design
helper) with as few manual changes required as possible. The time required to reproduce the mod-
ified FreeRTOS system developed in this work will not exceed an hour for a competent software
engineer, with detailed instructions, the required apparatus, and no experience of hardware design.
4. DESIGN
4.1. Hardware
A fundamental design decision that spans the hardware and software aspects of this work is that
of using a symmetric multiprocessing (SMP) architecture in which a single instance of the OS is
stored in shared memory and accessed by all cores. The alternative option (asymmetric multipro-
cessing, AMP) is for each core to have its own instance of the OS running in its own private memory
area (Figure 8).
One attraction of AMP is a reduction in the need for synchronisation as less memory needs to
be shared between cores. Another is increased scope for parallelism because cores can in principle
fetch instructions simultaneously from parallel memory units. However, any cross-task communica-
tion still requires full concurrent synchronisation. And to truly benefit from multiple cores, an AMP
system would need a way to transfer tasks (and their associated memory) from one core to another.
This is because it must be possible for waiting tasks to be executed by any core, regardless of which
core’s memory area the task first allocated its memory on. Apart from implying a degree of com-
plexity in the context-switching code, there are performance issues with this as well. In pre-emptive
mode, context switches happen very frequently (many times per second), and the implications for speed of moving even moderate quantities of memory this often are significant.
An SMP architecture avoids these issues. The queue of waiting tasks can be accessed by all cores,
so each core can execute any task regardless of whether or not it initially allocated the task’s memory.
To prevent one core from simultaneously allocating the same memory as another, synchronisation
is used to protect the memory allocation code with a critical section. Furthermore, because the SMP
Figure 8. Hardware design. DDR RAM, double-data-rate random-access memory; BRAM, block
random-access memory.
architecture requires less memory, it is arguably more appropriate for the real-time domain where
memory constraints are often tight.
The only exception to the use of shared memory is in the initial startup stage of the system, termed
the ‘core ignition’. This is a necessary bootstrapping process that puts the cores into a state in which
they are ready to begin executing the OS and tasks symmetrically.
To provide support for pre-emptive scheduling, timer and interrupt controller peripherals are pro-
vided for each core. The timer sends interrupt signals at regular intervals, which are received by
the interrupt controller. This then queues them until the core is ready to respond, at which point the
core begins executing its dedicated interrupt handler as registered during the hardware configura-
tion on startup. Unfortunately, the XPS tools do not allow interrupt controllers to be shared, meaning
that separate timers and interrupt controllers for each processing core are included in the hardware
design. Whether or not this restriction is artificially imposed by the automated XPS components
or inherent to MicroBlaze has not been explored: as outlined by requirement REQ-09, the priority
was to keep the design simple and, in particular, ensure that it varied as little as possible from the
stock dual-core design provided by Xilinx. This also explains the use of separate ‘buses’ (lines of
communication between cores and peripherals) rather than a shared bus for access to memory.
Figure 9. Memory model (not to scale). DDR RAM, double-data-rate random-access memory; BRAM,
block random-access memory.
application entry point (the main function), located alongside FreeRTOS and application tasks in
a large shared random-access memory (RAM), and performs application-specific initialisation such
as the creation of tasks. It also performs some hardware configuration, such as the set-up of the
master core’s interrupt controller and hardware timer, the drivers for which are provided by Xilinx
and reside in the core’s private BRAM. It then starts the FreeRTOS scheduler and begins executing
the first task.
The code for the slave core performs the hardware configuration of the slave core’s own peripher-
als, as well as uses a small area of shared RAM (the ignition communication block) to communicate
with the master core to determine when the FreeRTOS scheduler has been started, as it is only after
this happens that the slave core can begin executing the FreeRTOS code in shared memory and
scheduling tasks itself. At this point, the cores are executing symmetrically, and the code in BRAM
is never executed again with one exception: when an interrupt is fired, a small interrupt handler
in the slave core’s BRAM runs and immediately passes execution to the master interrupt handler
in FreeRTOS.
¶ Note that for slave cores, the mainline stack is located in BRAM and is not illustrated.
Figure 10. Obtaining the current task’s task control block (TCB) in multicore FreeRTOS.
identify it in queues (when waiting to be scheduled, for example) and store and retrieve pertinent
details about it, such as its priority. In single-core FreeRTOS, a pointer called pxCurrentTCB
exists to identify the location of the TCB for the task currently being executed by the processor.
This is extremely useful: when the scheduler comes to switch tasks, it can simply refer to this
pointer to know which task to swap out and then, once a new task is swapped in, merely assign the
address of the new task’s TCB to the pointer before restoring its context. Whenever something about
the currently executing task needs to be queried or updated, FreeRTOS merely needs to refer to the
TCB via this pointer.
Extending this for use with multiple cores is straightforward by turning pxCurrentTCB from
a pointer to a TCB into an array of pointers to TCBs. The size of the array is equal to the number
of processor cores, the zero-based index of each core serving as a key to the TCB of the task it is
executing. This design does make an assumption about the hardware. For the relevant code to access
the element of the TCB array corresponding to the current core, there must be some way of uniquely
identifying the core (Figure 10).
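A sketch of this arrangement (the macro and function names here are assumptions, not necessarily those of the implementation):

    typedef struct tskTaskControlBlock tskTCB;         /* defined in tasks.c */

    extern unsigned long ulPortGetCurrentCPU( void );  /* returns this core's ID */

    /* One current-TCB pointer per core, indexed by core ID. */
    volatile tskTCB *pxCurrentTCBs[ portNUM_PROCESSORS ];

    /* Kernel code then addresses 'the current TCB' via this core's slot. */
    #define pxCurrentTCB  ( pxCurrentTCBs[ ulPortGetCurrentCPU() ] )

Here ulPortGetCurrentCPU() stands for whatever portable-layer routine returns the unique ID of the executing core (Section 5.1 describes how this is obtained on MicroBlaze).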
Setting an affinity [...] for a thread can result in threads receiving less processor time, as the system is
restricted from running the threads on certain processors. In most cases, it is better to let the system
select an available processor [16].
The same is true of multicore FreeRTOS. However, because a multicore environment requires an idle task for each core, there must be some mechanism to ensure that each idle task is scheduled only on its own core. Certain applications may also benefit from being able
to tie a task to a core. For example, in our hardware design, only the first core is connected to the
primary serial port. Thus, to send data through this port, a task must be running on the first core. It
may also be, for example, that in the presence of caches, core affinity could be used to improve the
cache hit rate.
This feature raises important design implications: tasks must have an awareness of their affinity
for the duration of their life; as well as expressing affinity, there must be a way to express a lack
of affinity; and the scheduler must reflect a task’s affinity in an appropriate way. Whereas the latter
is discussed in detail in the following section (Section 4.6), the former two issues are closely related
with the indexing of processor cores in the pxCurrentTCBs array. The affinity of a task can
simply be the integer identifier of the core to which it is bound. A lack of affinity can simply
be expressed by a constant greater than the highest TCB array index. Because the TCB array is
zero-based, an obvious choice for this constant is the number of cores in the system.
By amending the TCB data structure to include a member called uxCPUAffinity, each task
can be associated with a core ID (or the ‘no-affinity’ constant). The process of using FreeRTOS
from an application’s perspective does not need to change much at all: when creating a task, the
application must either specify a core index to assign the created task an affinity or specify the
portNO_SPECIFIC_PROCESSOR constant defined by the portable layer.
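For illustration, task creation might then look as follows; the exact signature of the creation call in the modified API (here, an affinity argument appended to xTaskCreate) is an assumption:

    /* Bound to core 0, e.g. because only core 0 drives the serial port. */
    xTaskCreate( vSerialTask, "serial", configMINIMAL_STACK_SIZE, NULL,
                 tskIDLE_PRIORITY + 1, NULL, 0 );

    /* No affinity: may be scheduled on any core. */
    xTaskCreate( vWorkerTask, "worker", configMINIMAL_STACK_SIZE, NULL,
                 tskIDLE_PRIORITY + 1, NULL, portNO_SPECIFIC_PROCESSOR );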
4.6. Scheduling
The scheduler is the FreeRTOS component that must actually act upon the affinity values by select-
ing the appropriate task for execution on a particular processor core. The affinity-aware scheduler
behaves as follows:
1. Obtain the current core ID.
2. Enter a critical section (because global task lists will be manipulated).
3. Retrieve the next task in the highest-priority nonempty ready list that is not executing on any
core. If there is no task matching these criteria in the highest-priority list, repeat for the list
with the next lowest priority.
4. If no affinity is set for the retrieved task in the ready list, select the task for scheduling on the
current core.
5. Otherwise, and if the affinity of the retrieved task is set to the current core ID, select the task
for scheduling on the current core.
6. If no new task was selected for scheduling, select the previously executing task.
7. Exit the critical section.
Thus, a task is only scheduled on a core when the following occur: (a) it is not executing on
any other processor; (b) it either has no core affinity or has a core affinity indicating that it should
execute on the current core; and (c) it is the highest-priority task satisfying conditions (a) and (b).
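The selection logic (steps 3 to 6) can be sketched as below; this is an illustration of the policy, not the implementation, and the data structures are simplified accordingly:

    #include <stddef.h>

    #define NUM_CORES       2
    #define NUM_PRIORITIES  4
    #define NO_AFFINITY     NUM_CORES       /* one past the highest core index */

    typedef struct Task {
        int affinity;                       /* core index, or NO_AFFINITY */
        int executingOnCore;                /* core index, or -1 if not running */
        struct Task *next;                  /* next task in the same ready list */
    } Task;

    static Task *readyLists[ NUM_PRIORITIES ];   /* shared by all cores */

    Task *selectNextTask( int core, Task *previous )
    {
        /* Scan from the highest priority downwards... */
        for( int p = NUM_PRIORITIES - 1; p >= 0; p-- )
        {
            for( Task *t = readyLists[ p ]; t != NULL; t = t->next )
            {
                /* ...for a task not running elsewhere whose affinity
                   permits this core. */
                if( t->executingOnCore == -1 &&
                    ( t->affinity == NO_AFFINITY || t->affinity == core ) )
                {
                    return t;
                }
            }
        }
        return previous;                    /* step 6: keep the previous task */
    }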
4.7. Synchronisation
As discussed in Section 2.2, software-based synchronisation solutions provide a viable alternative
to relying on hardware components for the implementation of mutual exclusion, easing portabil-
ity. Although this option comes with an inevitable overhead, it does not preclude the addition of
alternative platform-specific synchronisation in the portable layer if desired (Figure 11).
The synchronisation features are implemented in software using Alagarsamy’s improved version
of Peterson’s algorithm, discussed in Section 2.2.3. There is still, however, a practical consideration
that must be addressed: it is important to allow applications to easily specify a custom synchronisa-
tion target, as it would be very inefficient for one system-wide critical section to be used to protect
all resources.
One simple option is to use an integer as a target: on entry to a critical section, a task provides
an integer parameter in a predefined range to uniquely identify a synchronisation target. Other tasks
will only be prevented from entering critical sections if they attempt to use the same integer param-
eter in their entry call. It is possible to modify Alagarsamy’s algorithm to do this quite easily: by
placing an upper-bound restriction on the number of named mutexes used by the system (defined in
the portable layer), the Q and turn arrays can be given an extra dimension indexed by the name
of the mutex. Note that whether the actual target is shared memory, an I/O peripheral or something
else is application specific and not relevant to FreeRTOS: the tasks themselves are responsible for
enforcing the association between the integer and the resource it represents.
A benefit of this approach is that it avoids the need for dynamic memory allocation, which can
be slow and non-deterministic. It also allows the memory usage of the synchronisation code to be
clearly limited to a predefined size at design time by the modification of the portMAX_TASKS
and portMAX_MUTEXES portable layer macros. At the same time, this pre-allocation means that
memory usage may be higher than if the space for the synchronisation variables was dynamically
allocated. For example, if we use a long integer as the portBASE_TYPE, the turn array alone
would require 440 B of memory for the whole life of the system in a configuration supporting up to
10 tasks and 10 mutexes, regardless of how many tasks would actually use the named mutexes and
how regularly they would do so. Nevertheless, in the interests of preserving the software’s real-time
properties and keeping the modifications as simple as possible, the first method is used to support
named mutexes.
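As a concrete sketch of the static layout this implies (array names assumed; the extra entry per mutex in the turn array allows one-based stage indexing, consistent with the figure above, since 10 × 11 × 4 B = 440 B):

    static volatile portBASE_TYPE xQ[ portMAX_MUTEXES ][ portMAX_TASKS ];
    static volatile portBASE_TYPE xTurn[ portMAX_MUTEXES ][ portMAX_TASKS + 1 ];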
The mutex API is divided into two: the first part is for synchronisation by tasks using the functions
vTaskAcquireNamedMutex and vTaskReleaseNamedMutex, whereas the second part is
for synchronisation by the kernel using the vCPUAcquireMutex and vCPUReleaseMutex
functions. There are several differences, the most important of which is the component on whose
behalf the mutex is acquired. When tasks acquire mutexes, they do so on their own behalf—the
mutex will only be released when the same task releases it (tasks are identified by a unique ID
assigned on their creation and stored in their TCB, a handle to which is passed to the mutex acquisition
and release functions).
However, when the kernel acquires a mutex, it does so on behalf of the current processor core.
This is to allow mutexes to be acquired outside of the context of a task, for example when the ini-
tialisation code calls the FreeRTOS API before the scheduler has been started. Unfortunately, this
means that whereas threads of execution running on other cores will be unable to acquire the mutex,
other threads that become scheduled on the same core (as a result of a context switch, for exam-
ple) will have access to the critical section. Much of the kernel code that makes use of this executes
while interrupts are disabled, meaning that there is no possibility of such a context switch occurring.
However, for some code, such as the memory allocation and de-allocation functions, interrupts must
be explicitly disabled (if enabled to begin with) after entering the critical section to create a ‘full’
lock and guard against other threads running on the same core.
There are also other design differences between the two synchronisation implementations:
1. Whereas the task synchronisation relies on an integer to identify mutexes to simplify work for
application developers, the kernel synchronisation uses a pointer to global variables for use
by the synchronisation algorithm. This is because the kernel synchronisation does not need
to be as flexible, with all mutexes predefined at design time. Identifying these using pointers
simplifies the synchronisation code and improves its clarity.
2. To preserve the semantics of single-core FreeRTOS, the kernel synchronisation explicitly
supports critical section nesting. This allows the same critical section to be entered multiple
times by the same thread of execution without first being exited as long as, eventually, each
entry call is matched by an exit call. For the sake of simplicity, nesting is not supported in
task synchronisation.
5. IMPLEMENTATION
5.1. Hardware
The Xilinx stock dual-core MicroBlaze configuration serves as the basis of the implemented hard-
ware design. This includes two MicroBlaze cores running at a clock speed of 125 MHz with
instruction and data caching disabled for simplicity. Additionally, several peripherals are included:
(i) one 16-kB private BRAM peripheral for each core, used to store core-specific code, such as
the ignition communication block; (ii) one serial port I/O peripheral for each core; and (iii) one
shared 256 MB double-data-rate RAM peripheral where all OS and application code, as well as working memory, is stored. The double-data-rate RAM also serves as the medium for cross-task
communication.
In addition, the special-purpose read-only basic processor version register is enabled for both
cores. This is a register that can be assigned a constant value at design time and interrogated by
software at runtime, using the dedicated mfs MicroBlaze instruction, to determine the ID of the
processor that executed it.
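A plausible sketch of the corresponding software side follows; the mfs instruction and the rpvr0 register name are MicroBlaze standard, but the field layout (and hence the mask) is a property of this particular hardware design and is an assumption here:

    /* Read PVR0 with the mfs instruction and extract the core ID. */
    static inline unsigned long ulPortGetCurrentCPU( void )
    {
        unsigned long ulPVR;
        __asm__ volatile ( "mfs %0, rpvr0" : "=r"( ulPVR ) );
        return ulPVR & 0xFFu;   /* core ID assumed to occupy the low byte */
    }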
the slave core’s handler (r18) is first saved to shared memory in the interruptBridgeSwap
member of the ignition communication block before jumping to the master handler (Figure 13). The
master handler then includes the code to check whether it is running on the slave core and if so
restores the register (r18) with this saved value before continuing to save the task context.
5.3. Scheduler
The multicore environment imposes two additional constraints on the scheduler for a task to be
scheduled on a core: (i) the task must not be executing on any other core; and (ii) the task must
either have no core affinity or have an affinity for the core in question. Although significant modi-
fications have been required to accommodate these constraints, the solution has been implemented
with the intention of producing semantics that follow logically from those found in single-core
FreeRTOS. Simply put, the scheduler will always select the available task with the highest priority
for scheduling. Using the task ready lists to do this in a multicore environment is not as simple as in
a single-core environment.
The way this is implemented in the multicore scheduler is best demonstrated using an example.
Figure 14 illustrates a hypothetical set of ready lists in a system configured to use four different
priorities, with priority 0 assigned as the idle priority. The highest-priority nonempty list displayed
is of priority 2. Assume that a yield occurs on core 0 and the scheduler is invoked to determine
which task to next schedule on the core. The scheduler first considers the priority 2 list but finds that
task A is already executing on core 1. It then considers task B but cannot select it for scheduling
because of the fact that it has an affinity for core 1. At this point, the scheduler must consider tasks
of lower priority; otherwise, it will be unable to give core 0 any more tasks to execute. As such, it
moves down one priority level to the priority 1 list—it retrieves task C from this list. This task has
no core affinity and is not being executed by any other core, allowing the scheduler to select it for
execution. If task C for some reason could not be selected for execution, the scheduler would move
down to the priority 0 list at which it would be guaranteed to select the idle task with affinity for
core 0.
Once a task is selected for execution, the scheduler passes control to the portable layer. The
element for the current core in the pxCurrentTCBs array is then used to swap the selected task in,
Figure 15. Getting the task control block of the task to swap in.
as coded in Figure 15. The core ID is retrieved from the processor version register (lines 3 and
4) and is then multiplied by the size of a pointer (line 5). This is used to retrieve the element of
pxCurrentTCBs corresponding to the current core (line 7), after which the first word in the TCB
(the stack pointer) is read into r1. The part of the stack containing the contents of the processor core
registers is then loaded into each register, and the processor core is made to jump to the location in
memory as specified by the program counter saved when the task was swapped out.
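In C terms, the lookup amounts to the following (a paraphrase of the assembly, with assumed names):

    /* Index the TCB array by core ID; the first TCB word is the saved
       stack pointer, from which the task's context is then restored. */
    tskTCB *pxTCB = pxCurrentTCBs[ ulPortGetCurrentCPU() ];
    portSTACK_TYPE *pxTopOfStack = *( portSTACK_TYPE ** ) pxTCB;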
5.4. Synchronisation
Memory access reordering presents a particularly significant problem for software mutual-exclusion
algorithms, as they rely on memory reads and writes occurring in the exact order specified (as
explained in Section 2.2.3). Memory barriers are used to force the processor to execute the memory
accesses in the order defined in the code. Whenever one is required, the vPortMemoryBarrier
function defined in the portable layer can be called to create one. It is possible to demonstrate the
importance of the memory barriers by commenting out the code within this function in the multicore
portable layer and running the sync test demo application (explained in Section 6.1). The result is a
failure of the synchronisation code, causing deadlock.
Because the details of how MicroBlaze reorders instructions are not well documented, the strategy
for using memory barriers in the mutual-exclusion algorithm has been to have them protect all shared
memory accesses. This consists of creating a memory barrier after each memory write and before
each memory read of a global variable.
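A sketch of this strategy follows; the barrier placement mirrors the rule just stated, and "mbar 0" is the instruction's full-barrier form:

    static inline void vPortMemoryBarrier( void )
    {
        __asm__ volatile ( "mbar 0" ::: "memory" );
    }

    /* Write side: make the shared write visible before proceeding. */
    static void vSetFlag( volatile portBASE_TYPE *pxFlag )
    {
        *pxFlag = 1;
        vPortMemoryBarrier();
    }

    /* Read side: complete outstanding accesses before the shared read. */
    static portBASE_TYPE xReadFlag( const volatile portBASE_TYPE *pxFlag )
    {
        vPortMemoryBarrier();
        return *pxFlag;
    }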
Making the API easy to use is also very important. As such, mutex acquisition and release are
performed with just two simple function calls, illustrated in Figure 16. The call to the function
vTaskAcquireNamedMutex passes in the task’s handle (a pointer to its TCB) and the ‘value’
of the mutex (the integer identifying the resource being locked). Once control returns from this
function, the mutex has been acquired, and all subsequent code is executed within the named mutex’s
critical section until vTaskReleaseNamedMutex is called and returns. Notice the use of the
pvParameters argument to allow the task to refer to its handle: unlike in the original version of
FreeRTOS in which task parameters are simply user-defined pointers, in multicore FreeRTOS, they
are always supplied by the system as a pointer to a systemTaskParameters struct that provides
access to the task’s handle. User-defined parameters are passed as void pointers in a member of this
struct and can be accessed as follows:
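(The member names of the systemTaskParameters struct below, and the integer chosen to name the mutex, are assumptions based on the description above.)

    void vMyTask( void *pvParameters )
    {
        systemTaskParameters *pxParams = ( systemTaskParameters * ) pvParameters;
        void *pvUserArg = pxParams->pvUserParameters;        /* assumed member name */

        vTaskAcquireNamedMutex( pxParams->xTaskHandle, 3 );  /* named mutex '3' */
        /* ... code here executes inside the named critical section ... */
        vTaskReleaseNamedMutex( pxParams->xTaskHandle, 3 );
    }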
6. EVALUATION
the check task is notified that it is still alive. When the critical section is exited, the check task
is notified of the number of runs it has executed so far; when finished, it notifies the check task
of this as well.
2. Core 1 worker performs the same function as the core 0 worker, except that it has an affinity
for core 1. This task will run concurrently with the core 0 worker.
3. Check task monitors the progress of the two worker tasks and is responsible for displaying
program output over the serial port and is thus given an affinity for core 0. Approximately
every hundred runs, it updates the user as to the status of the worker tasks. If either of the
worker tasks fail to check in within 5000 ticks of the previous check in, the task warns that
deadlock may have occurred. If the task receives a notification from one of the worker tasks
that there has been a synchronisation failure, a warning to this effect is passed on to the user.
When both worker tasks are finished, this task outputs a summary of the results.
The conc test is simpler. It consists of three tasks as well, with two worker tasks assigned an
affinity for different cores. They perform memory writes a predefined number of times, and each notifies the check task when it has finished. The check task then displays the execution time in ticks and seconds. Three versions of the tests were used: short run (one million runs),|| long
run (10 million runs), and very long run (100 million runs). An equivalent version of the conc
test application has been created for the single-core version of FreeRTOS. Table I compares the
execution time of the two versions. The short-run test shows an improvement factor of 1.6; the
long-run one an improvement factor of 1.85; and the very long-run one an improvement factor of
nearly 1.87. The exact relationship between the efficiencies provided by multicore FreeRTOS and
the elapsed execution time has not been investigated further, although these results show that for highly parallel tasks, very significant execution time improvements are possible with multicore FreeRTOS.
Applications have also been created to test the scheduler, as well as various API calls. These tests
are explained in the associated MSc thesis [1]. Some optional elements of the FreeRTOS API have
not been tested. The relevant code is located in tasks.c and is marked with comments reading
‘NOT FULLY MULTICORE TESTED’.
|| ‘Runs’ in this context refers to the number of memory operations performed by the test.
6.4. Concurrency
At the most basic level, the data structure multiplicity was key in providing the kernel code with
‘awareness’ of multiple cores. If the TCBs relating to executing tasks are indexed by the core ID
and the modification of this structure is restricted to elements relating to the core on which the code
performing the modification is running, the kernel is able to manipulate the TCBs of executing cores
without the need for mutual exclusion. Indeed, this principle has also been applied in the portable
layer: the ‘yield flag’ indicating the desire of a processor core to perform a context switch has been
modified in the same way.
Considering that performance optimisation was not within the scope of this paper, the execution time results are good: the highest execution time improvements (achievable after the first 7 min of execution) approach the theoretical maximum. Although the cause of the
relationship between the performance improvements and the elapsed execution time is not fully
understood, it seems that the code likely to most improve performance resides within the sched-
uler, in which a high proportion of time can be spent waiting for access to critical sections. This
could be mitigated to some extent by a flat reduction in the code covered by critical sections and by
increasing the specificity of the synchronisation targets. For example, the current implementation
uses a single mutex object to denote the ‘scheduler’s lock’, but this could be expanded into two
objects responsible for locking access to priority lists and the currently executing task array sepa-
rately. Although arguably a symptom of the decision to share the priority lists among all cores, it
was implemented in this way to make changing the core on which a task is executed easier. Without
this, it is likely that the synchronisation problem would rear its head elsewhere, as at some point,
a critical section would be required to negotiate the selection of a task from another core’s list
for execution.
Additionally, when checking if a task is being executed by another core, there is no distinction
made by the multicore scheduler between a task executing and a task waiting to enter the scheduler’s
critical section. It could be that a lower-priority task, having yielded and entered the scheduler’s
critical section, might prevent a higher-priority task from being scheduled because it perceives
it to be executing on another core. Although this perception is technically correct, it arguably
gives rise to an inefficiency: the task is so close to nonexecution (having already been swapped out and simply waiting for the scheduler to select another task to swap in) that treating it
differently from a task that is not executing at all is somewhat arbitrary yet prevents it from being
scheduled. This could potentially be addressed by notifying the scheduler when a task waiting
to enter its critical section starts, allowing it to reschedule a higher-priority task waiting on the
critical section on another core rather than ignore it. Care would have to be taken to ensure that if
another core did ‘take’ a task in this way, the core originally executing the task did not attempt to
resume it.
by a counter, and increments the counter so that the next interrupt will be fed to the next core in a
round-robin fashion. This contrasts with our more traditional approach in which each core has its
own interrupt timer and is a nice demonstration of the flexibility afforded by reconfigurable
hardware. Unlike in our work, idle tasks are treated as special cases by the scheduler, and the general
concept of core affinity is not introduced.
To implement mutual exclusion, Huerta et al. used a mutex peripheral connected to the
MicroBlaze bus. Although this is both hardware specific and configuration specific (the peripheral
and its address are required), mutex acquisition is carefully abstracted by a macro that can easily
be modified to suit different designs. There is, however, no portable mutual-exclusion mechanism
by default, as in our approach.
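The flavour of that abstraction can be conveyed by a sketch; the base address and register layout
below are invented for illustration and do not correspond to the actual XPS mutex peripheral [9].

#include <stdint.h>

#define HW_MUTEX_BASE  0x80000000UL                      /* hypothetical address */
#define HW_MUTEX_REG   ( *( volatile uint32_t * ) HW_MUTEX_BASE )

/* Invented semantics: writing (core ID + 1) requests ownership, reading
   back returns the current owner, and writing 0 releases the mutex.    */
#define portACQUIRE_HW_MUTEX( xCoreId )                  \
    do { HW_MUTEX_REG = ( xCoreId ) + 1U; }              \
    while( HW_MUTEX_REG != ( ( xCoreId ) + 1U ) )

#define portRELEASE_HW_MUTEX( xCoreId )  ( HW_MUTEX_REG = 0U )

Because the hardware details are confined to two macros, retargeting the design to a different
peripheral (or to a software algorithm) means editing only these definitions.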
Acquiring a mutex is one problem; using it to achieve thread safety is another. Here, the
Xilkernel design makes for a very neat solution. Every kernel-level routine available to a user
thread, including the scheduler, has a common entry point in the form of a system call. This means
that a single mutex acquisition in the system-call entry code is enough to obtain thread safety
between cores for the whole API. Although this approach is pessimistic, in the sense that many
system calls could actually run safely in parallel, a safe-by-default design is easier to get right than
having to locate every point in the kernel where a mutex is strictly necessary. Because API functions
for communication and synchronisation also take the form of system calls, they inherit thread safety
in the same way.
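The pattern is easy to sketch; the dispatcher, lock, and table names below are illustrative rather
than taken from Xilkernel’s source.

#include <stdint.h>

static volatile uint32_t ulKernelLock = 0U;   /* one lock guards the whole kernel */

static void vKernelLock( void )
{
    while( __atomic_exchange_n( &ulKernelLock, 1U, __ATOMIC_ACQUIRE ) != 0U )
    {
        /* Spin; any inter-core mutex would do here. */
    }
}

static void vKernelUnlock( void )
{
    __atomic_store_n( &ulKernelLock, 0U, __ATOMIC_RELEASE );
}

typedef int32_t ( *syscall_fn_t )( void *pvArgs );
extern syscall_fn_t xSyscallTable[];          /* kernel services, populated elsewhere */

/* Common entry point: one lock/unlock pair covers every kernel API. */
int32_t lSystemCallEntry( uint32_t ulNumber, void *pvArgs )
{
    int32_t lResult;

    vKernelLock();
    lResult = xSyscallTable[ ulNumber ]( pvArgs );
    vKernelUnlock();

    return lResult;
}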
7.1. Conclusion
With no multicore version of FreeRTOS currently available, the modifications required for the OS to
schedule tasks on multiple processors were extensive, and consideration had to be given at each level
of design to the interplay between the system’s hardware and software requirements. For example,
a principal high-level design decision was that the system would use an SMP architecture, in which
all processors execute the same ‘instance’ of the OS. A major advantage of this is that only one
instance of the OS is required in memory, making the overall memory footprint much lower than in
an equivalent asymmetric model (in which each processor executes its own instance of the OS). It
also helped keep the scheduler modifications simple, avoiding the onerous task of having to copy
task stacks between private memory areas when moving a task from one processor core to another.
However, this had implications for the hardware design: for the OS to be aware of the processor
on which it is currently executing, special-purpose read-only registers had to be configured to allow
the processors to identify themselves to the software. Indeed, the entire notion of recognising ‘the
current processor’ had to be built into FreeRTOS. As well as being necessary for obvious things
such as ensuring that the scheduler selects a task for execution on the correct core, the processor ID
is used as an index into core-specific data, which allows much of the kernel to avoid the
complications (and performance penalties) of synchronisation. It also allowed the implementation
of core affinity, the ability to bind a task to a core for the duration of its life. Born of the need to
ensure that each idle task only ever executes on one processor core, core affinity also allows
applications to be tuned for optimal performance or to accommodate core-specific hardware. For
example, a quirk of the hardware design implemented as part of this project was that only the first
core had access to the serial port, meaning that tasks needing to communicate with the development
PC had to be given affinity for it.
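The following sketch conveys how the processor ID underpins lock-free core-local state and core
affinity; portGET_CORE_ID() stands in for the platform-specific register read, and the trimmed-down
task control block and helper names are invented for the example.

#include <stdint.h>

#define configNUM_CORES  2
#define tskNO_AFFINITY   ( ( uint32_t ) -1 )

/* Stand-in for the port layer's read of the identification register. */
extern uint32_t portGET_CORE_ID( void );

typedef struct TCB
{
    uint32_t uxCoreAffinity;   /* tskNO_AFFINITY, or the number of the bound core */
} TCB_t;

/* Core-specific kernel state, indexed by core ID.  No locking is needed
   because only the owning core ever touches its own slot.              */
static TCB_t * volatile pxCurrentTCBs[ configNUM_CORES ];

void vSetCurrentTask( TCB_t *pxTCB )
{
    pxCurrentTCBs[ portGET_CORE_ID() ] = pxTCB;   /* core local, lock free */
}

/* A task may run on this core if it is unbound or bound to this core. */
int xAffinityPermitsThisCore( const TCB_t *pxTCB )
{
    const uint32_t uxCore = portGET_CORE_ID();
    return ( pxTCB->uxCoreAffinity == tskNO_AFFINITY ) ||
           ( pxTCB->uxCoreAffinity == uxCore );
}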
Core affinity, like all the properties of multicore execution, had to be reflected in the scheduler.
Although the simplicity of the original FreeRTOS scheduler is very attractive, the realities of
multicore meant confronting inevitable complications. Given the constraint of core affinities and
the possibility that tasks in the shared ready lists might already be executing on other cores when
retrieved by the scheduler, careful consideration had to be given to how the scheduler should be
modified to accommodate these things while remaining, semantically, a spiritual successor to the
original code. By remaining a fixed-priority scheduler, but with a remit to look beyond tasks with
the ‘top ready’ priority when unable to select anything for execution, the multicore scheduling
algorithm exhibits behaviour that is a natural extension of the original.
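The extended selection policy can be sketched as follows; the singly linked ready lists and field
names are simplified stand-ins for the FreeRTOS equivalents, and the whole routine is assumed to
run inside the scheduler’s critical section.

#include <stddef.h>
#include <stdint.h>

#define configMAX_PRIORITIES  8
#define tskNO_AFFINITY        ( ( uint32_t ) -1 )

typedef struct TCB
{
    struct TCB *pxNext;           /* next task at the same priority   */
    uint32_t    uxCoreAffinity;   /* tskNO_AFFINITY, or a core number */
    int32_t     lExecutingOnCore; /* -1 if not currently executing    */
} TCB_t;

static TCB_t *pxReadyLists[ configMAX_PRIORITIES ];   /* shared by all cores */

static TCB_t *pxSelectTaskForCore( uint32_t uxCore )
{
    int32_t lPriority;
    TCB_t *pxTCB;

    /* Start at the 'top ready' priority and, unlike the original scheduler,
       keep descending when nothing at that level can run on this core.     */
    for( lPriority = configMAX_PRIORITIES - 1; lPriority >= 0; lPriority-- )
    {
        for( pxTCB = pxReadyLists[ lPriority ]; pxTCB != NULL; pxTCB = pxTCB->pxNext )
        {
            if( pxTCB->lExecutingOnCore != -1 )
            {
                continue;   /* already running on another core */
            }
            if( ( pxTCB->uxCoreAffinity != tskNO_AFFINITY ) &&
                ( pxTCB->uxCoreAffinity != uxCore ) )
            {
                continue;   /* bound to a different core */
            }
            return pxTCB;   /* highest-priority runnable task for this core */
        }
    }

    return NULL;   /* nothing runnable; the core-bound idle task will run */
}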
Implementing mutual exclusion correctly was essential, but MicroBlaze does not currently
support any built-in synchronisation features that work across multiple processors [18, p. 227].
Although an additional mutex peripheral could have been included in the hardware design, it was
decided that, because of FreeRTOS’s popularity on so many different platforms, the kernel itself
should provide a platform-agnostic implementation. Thanks to the collective work of Peterson [10],
Block and Woo [13], and Alagarsamy [11], an optimal and elegant algorithm was available to do
just this. Using only shared memory and memory barriers, we have produced a tentative
implementation that is integrated into the FreeRTOS kernel and available to application tasks and
OS components alike.
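For two cores, the underlying idea reduces to Peterson’s classic algorithm [10]. The sketch below
uses GCC atomic fences where the real implementation would use the port’s memory barrier macro;
the generalised n-process scheme of Block and Woo [13], as refined by Alagarsamy [11], follows
the same shared-memory pattern.

#include <stdbool.h>
#include <stdint.h>

static volatile bool xInterested[ 2 ];   /* does each core want the lock? */
static volatile uint32_t ulTurn;         /* whose turn is it to wait?     */

void vPetersonLock( uint32_t ulCore )    /* ulCore is 0 or 1 */
{
    const uint32_t ulOther = 1U - ulCore;

    xInterested[ ulCore ] = true;
    ulTurn = ulOther;                            /* concede the tie-break  */
    __atomic_thread_fence( __ATOMIC_SEQ_CST );   /* publish before testing */

    /* Wait while the other core wants the lock and holds the tie-break. */
    while( xInterested[ ulOther ] && ( ulTurn == ulOther ) )
    {
        __atomic_thread_fence( __ATOMIC_ACQUIRE );
    }
}

void vPetersonUnlock( uint32_t ulCore )
{
    __atomic_thread_fence( __ATOMIC_RELEASE );
    xInterested[ ulCore ] = false;
}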
7.2.3. Scheduler optimisation. As discussed in Section 6.4, it may be possible to improve the
scheduler’s runtime performance. Analysing in detail how the scheduling code can be improved
to reduce the amount of time that processor cores spend waiting for access to a critical section,
while maintaining thread safety, would be essential to improving the performance of the system.
7.2.4. Extending multicore support. The modifications in this project have been limited to the
minimum set of FreeRTOS components and have only been implemented for the configuration
specified in the source code (the code excluded by configuration macros has not been modified).
Extending the multicore support to include not only all of FreeRTOS but also additional peripherals
(such as networking components) and third-party software libraries (such as those providing video
encoding) would make an excellent candidate for future work. Similarly, the creation of multicore
versions of the other available FreeRTOS portable layers would also have the potential to be
extremely useful to the FreeRTOS community. Fully testing the modifications on systems with
n cores would also be highly desirable, particularly as this would open up the possibility of
implementing massively parallel applications, especially on FPGA platforms.
REFERENCES
1. Mistry J. FreeRTOS and multicore. MSc Thesis, Department of Computer Science, University of York,
September 2011.
2. Laplante PA. Real-time Systems Design and Analysis. Wiley-IEEE: New York, 2004.
3. Real Time Engineers Limited. Introduction to FreeRTOS, August 2011. Available from: http://www.freertos.org/
[last accessed August 2011].
4. Stankovic JA. Misconceptions about real-time computing: a serious problem for next-generation systems. Computer
Journal 1988; 21(10):10–19.
5. ChibiOS/RT. RTOS concepts, August 2011. Available from: http://www.chibios.org/ [last accessed August 2011].
6. Lomas N. Dual-core smartphones: the next mobile arms race, August 2011. Available from: http://www.silicon.
com/technology/mobile/2011/01/12/dual-core-smartphones-the-next-mobile-arms-race-39746799/ [last accessed
August 2011].
7. Andrews D, Bate I, Nolte T, Otero-Perez C, Petters SM. Impact of embedded systems evolution on RTOS use and
design. Proceedings of the 1st Workshop on Operating System Platforms for Embedded Real-time Applications,
Palma de Mallorca, Balearic Islands, Spain, 2005; 13–19.
8. Chynoweth M, Lee MR. Intel Corporation. Implementing scalable atomic locks for multi-core Intel EM64T and
IA32 architectures, August 2011. Available from: http://software.intel.com/en-us/articles/implementing-scalable-
atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/ [last accessed August 2011].
9. Xilinx Inc. XPS mutex documentation, August 2011. Available from: http://www.xilinx.com/support/documentation/
ip_documentation/xps_mutex.pdf [last accessed August 2011].
10. Peterson GL. Myths about the mutual exclusion problem. Information Processing Letters 1981; 12(3):115–116.
11. Alagarsamy K. A mutual exclusion algorithm with optimally bounded bypasses. Information Processing Letters
2005; 96(1):36–40.
12. Hofri M. Proof of a mutual exclusion algorithm—a classic example. ACM SIGOPS Operating Systems Review 1990;
24(1):18–22.
13. Block K, Woo TK. A more efficient generalization of Peterson's mutual exclusion algorithm. Information Processing
Letters 1990; 35(5):219–222.
14. Real Time Engineers Limited. Task priorities, August 2011. Available from: http://www.freertos.org/a00015.html#
TaskPrior [last accessed August 2011].
15. Agron J. How to create and program interrupt-based systems, August 2011. Available from: https://wiki.ittc.ku.edu/
ittc/images/d/df/Edk_interrupts.pdf [last accessed August 2011].
16. Microsoft. SetThreadAffinityMask Function, August 2011. Available from: http://msdn.microsoft.com/en-us/library/
ms686247%28v=vs.85%29.aspx [last accessed August 2011].
17. Huerta P, Castillo J, Sánchez C, Martínez JI. Operating system for symmetric multiprocessors on FPGA. Proceedings
of the 2008 International Conference on Reconfigurable Computing and FPGAs, Cancun, Mexico, 2008; 157–162.
18. Xilinx Inc. MicroBlaze processor reference guide, January 2013. Available from: http://www.xilinx.com/support/
documentation/sw_manuals/mb_ref_guide.pdf [last accessed August 2011].