
8th IEEE International Symposium on Industrial Embedded Systems (SIES 2013)

Towards Transparent Parallel/Distributed Support for Real-Time Embedded Applications

Ricardo Garibay-Martínez, Luis Lino Ferreira, Cláudio Maia and Luís Miguel Pinho
CISTER/INESC-TEC, ISEP
Polytechnic Institute of Porto, Portugal
{rgmz, llf, crrm, lmp}@isep.ipp.pt

Abstract—An increasing number of real-time embedded applications present high computation requirements which need to be realized within strict time constraints. Simultaneously, architectures are becoming more and more heterogeneous, programming models are having difficulty in scaling or stepping outside of a particular domain, and programming such solutions requires detailed knowledge of the system and the skills of an experienced programmer. In this context, this paper advocates the transparent integration of a parallel and distributed execution framework, capable of meeting real-time constraints, based on the OpenMP programming model and using MPI as the distribution mechanism. The paper also introduces our modified implementation of the GCC compiler, enabled to support such parallel and distributed computations, which is evaluated through a real implementation. This evaluation gives important hints towards the development of the parallel/distributed fork-join framework for supporting real-time embedded applications.

Keywords—parallel execution, real-time, distributed embedded systems, compiler support, GCC, OpenMP, MPI.

I. INTRODUCTION

Real-time embedded systems are present in our everyday life. These systems range from safety-critical ones to entertainment and domestic applications, presenting a very diverse set of requirements. Although diverse, in all these areas modern real-time applications are becoming larger and more complex, thus demanding more and more computing resources.

By using parallel computation models, the time required for processing computationally intensive applications can be reduced, therefore gaining flexibility. This is a known solution in areas that require high-performance computing power, and real-time systems are not the exception. Therefore, the real-time community has been making a large effort to extend real-time tools and methods to multi-cores [1], and lately to further extend them considering the use of parallel models [2]. Nevertheless, these parallel models do not take into consideration heterogeneous architectures, which mix both data-sharing and message-passing models. In [3], we introduced a solution for parallelising and distributing workloads between neighbouring nodes based on a hybrid approach of OpenMP [4] and Message Passing Interface (MPI) programs [5], which can be used in this context. Furthermore, we presented a timing model, which enables structured reasoning on the timing behaviour of such hybrid parallel/distributed programs.

However, one main disadvantage of the use of such approaches comes from the programmer's point of view. Coding parallel programs is not a straightforward task, even more so when programming for distributed memory; including real-time constraints adds even more programming complexity. This complexity usually requires detailed knowledge of the system and the skills of an experienced programmer; constraints that may not always be affordable (in cost or time).

In this context, this paper presents a transparent parallel/distributed fork-join execution model intended to support real-time constraints. Furthermore, we introduce our modified implementation of the GNU Compiler Collection (GCC) compiler [6], enabled to support parallel and distributed computations in a transparent manner. We also show, through a real implementation, how the execution times of parallel/distributed applications can be reduced by following our execution model. We also derive some conclusions on the importance of considering the transmission time (e.g. the implicit transmission delay) when developing distributed applications.

II. PARALLEL/DISTRIBUTED REAL-TIME EXECUTION MODEL

The parallel/distributed fork-join model for distributed real-time systems was introduced in [3]. One of our main goals is to be able to model the parallelisation and distribution of computations of real-time tasks. Thus, the generic operation of the local process/thread which performs a parallel/distributed fork-join execution is as follows: i) the local process/thread initializes the distributed environment (e.g. by calling MPI_Init()); ii) it determines the data to be sent and sends it using a communication mechanism (e.g. by using MPI_Send()); iii) the data gets transmitted through the network with a certain implicit delay; iv) the data is received on the remote neighbour node (e.g. by using MPI_Recv()) and processed (in parallel if more than one core is available in the remote node); v) when the execution in the remote node is finished, the results are sent back through the network to the local process/thread; vi) finally, the results are gathered and the final result is produced.
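
As an illustration of steps i)-vi), the following sketch shows how a local process could off-load half of a workload to a neighbouring node with plain MPI calls. It is not taken from the paper: the buffer size, message tags and the work() routine are placeholders.

/* Illustrative sketch of fork-join steps i)-vi); buffer size, tags and
   work() are placeholders, not part of the framework described here. */
#include <mpi.h>

#define N 1000

static void work(double *chunk, int n)   /* placeholder per-node workload */
{
    for (int i = 0; i < n; i++)
        chunk[i] *= 2.0;
}

int main(int argc, char **argv)
{
    double data[N] = { 0 };
    int rank;

    MPI_Init(&argc, &argv);                              /* i)  initialise environment   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                                     /* local (forking) process      */
        MPI_Send(data, N / 2, MPI_DOUBLE, 1, 0,          /* ii) send the remote share    */
                 MPI_COMM_WORLD);                        /* iii) implicit network delay  */
        work(data + N / 2, N / 2);                       /* process the local share      */
        MPI_Recv(data, N / 2, MPI_DOUBLE, 1, 1,          /* vi) gather remote results    */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                              /* remote neighbour node        */
        MPI_Recv(data, N / 2, MPI_DOUBLE, 0, 0,          /* iv) receive and process      */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        work(data, N / 2);                               /*     (in parallel if possible)*/
        MPI_Send(data, N / 2, MPI_DOUBLE, 0, 1,          /* v)  send results back        */
                 MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}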

To implement this model, we propose the introduction of a new #pragma omp parallel distributed for construct in OpenMP, which removes the burden of coding the parallel/distributed application from the programmer. Also, a deadline() clause can be associated with this pragma, passing as parameter the number of milliseconds within which the application is expected to finish its parallel/distributed part of the code. Figure 1 depicts a fragment of code using this pragma, indicating that the code embraced within the for loop can be executed on distributed nodes within no more than 200 milliseconds.

1. #pragma omp parallel distributed for deadline(200) num_threads(3)
2. for (i = 0; i < 4; i++) {
3.     loopCode();
4. }

Figure 1. #pragma omp parallel distributed for deadline pragma example.

Figure 2 is a timeline representation of the execution of a parallel/distributed fork-join, where the horizontal lines represent threads/processes and the vertical lines represent forks and joins. In this case, the main thread splits into three threads (creates a team of OpenMP threads): two are executed in parallel on the local node and another is, mainly, executed on a remote node; hereafter we call this kind of execution remote execution. Furthermore, we also assume that it is possible to split the remote execution into two threads, one for each core on the remote node. This is done automatically by the framework. Thus, taking as example the code in Figure 1, the timeline in Figure 2 assumes that the two local threads execute one for-loop iteration each, while the distributable thread executes the remaining two iterations. That thread is hosted on a remote node and further split into two threads, each executing one iteration of the for loop. Figure 2 also accounts for the inherent delay of transmitting and receiving code and data, i.e. the send and receive transmission delays. Note that in this case the MPI framework places the code on the distributed nodes and starts the remote programs.

Also, it is important to notice that the fork-join execution is divided into a sequence of serial and parallel segments (e.g. parallel regions in OpenMP), which implies precedence constraints between the segments of a fork-join task. Furthermore, the execution of a fork-join real-time task has an associated worst-case execution time, a task period, which indicates the minimum interval at which the task is periodically invoked, and a task deadline, which defines the maximum time the task may take to complete.

Figure 2. Timeline of a real-time task using the parallel/distributed fork-join model.

III. TOWARDS PARALLEL/DISTRIBUTED REAL-TIME COMPILER SUPPORT

A. GCC Compiler and OpenMP

GCC is structured in three modules: the front-end (also known as the parser), which is responsible for identifying and validating the source code (e.g. lexical analysis); the middle-end, which has the main objective of simplifying and optimizing the code; and the back-end, which is in charge of transforming the final optimised code into assembly code, taking into consideration the destination platform. The modifications required for implementing our parallel/distributed model only require changes in the GCC front-end and middle-end.

The front-end parses the code in a recursive manner and performs sanity checks of the code. The main objective is to identify the language keywords that affect the execution of the program (including the OpenMP keywords). It is in this stage that the new distributed clause is added to the existing set of OpenMP clauses. The result of applying the parsing process is an intermediate code called GENERIC, which is later propagated to the middle-end for further processing.

The middle-end transforms the GENERIC code into the intermediate representation called GIMPLE; this process is usually referred to as the "gimplification" step. During this step all implicit data-sharing clauses are made explicit and atomic directives are transformed into the corresponding atomic update functions. Similarly to the front-end, the middle-end starts a process of simplification and optimization of the code. Each optimization or simplification is defined as a pass.

The pass manager is in charge of calling the set of passes which are included in the passes.c source file. One of those passes is the OpenMP lowering pass. The objective of the OpenMP lowering pass is to identify the OpenMP clauses, transform them into equivalent simplified code (GIMPLE code), and initialize the auxiliary variables of the expansion phase [7]. After the lowering pass, the pass manager invokes the OpenMP expansion pass. The expansion pass is in charge of outlining the OpenMP parallel regions and introducing the built-in function to be called by the libgomp library. libgomp is the GNU implementation of the OpenMP API for shared-memory platforms.
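
To make the effect of the expansion pass more concrete, the hand-written sketch below approximates how a parallel region is outlined into a separate function that is handed to libgomp; the real GCC-generated code uses different (compiler-internal) names and also emits work-sharing calls for the associated for loop, which are omitted here.

/* Hand-written approximation of the outlining done by the OpenMP expansion
   pass; not actual GCC output. Work-sharing (omp for) calls are omitted. */
extern void GOMP_parallel_start(void (*fn)(void *), void *data, unsigned num_threads);
extern void GOMP_parallel_end(void);

struct omp_data_s { int *a; int n; };         /* captured shared variables */

static void region_fn(void *arg)              /* outlined parallel region  */
{
    struct omp_data_s *d = arg;
    for (int i = 0; i < d->n; i++)
        d->a[i] += 1;
}

void run_region(int *a, int n)
{
    struct omp_data_s d = { a, n };
    GOMP_parallel_start(region_fn, &d, 0);    /* fork the team (0 = default size)       */
    region_fn(&d);                            /* the master thread also runs the region */
    GOMP_parallel_end();                      /* join the team                           */
}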

In our implementation, a new function has been created in libgomp. The GOMP_Distributed_Parallel_start() function initialises the MPI environment; from that point on, the code is simultaneously executed by all MPI processes. The number of created MPI processes is equal to the number of hosts in the mpd.hosts configuration file of MPI, which indicates the number of nodes in the distributed system (e.g. a cluster of embedded computing devices). All MPI processes receive the data to execute using MPI broadcast primitives. In our current implementation the workload is evenly divided among the nodes in the system.

After all processes have received their corresponding data, each process creates a team of OpenMP threads. By default, the team has the same number of threads as the number of processors/cores in the node on which it is going to be executed.
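
The paper does not list the body of GOMP_Distributed_Parallel_start(), so the sketch below only illustrates the behaviour just described: MPI initialisation, broadcast of the input data, an even division of the iterations among the nodes, and the creation of an OpenMP team on each node. Apart from the function name, every identifier and parameter is an assumption made for illustration.

/* Illustrative sketch of the described behaviour of
   GOMP_Distributed_Parallel_start(); not the actual libgomp code. */
#include <mpi.h>

void GOMP_Distributed_Parallel_start(double *data, int n_iter,
                                     void (*body)(double *item))
{
    int rank, nodes, initialized;

    MPI_Initialized(&initialized);
    if (!initialized)
        MPI_Init(NULL, NULL);          /* one MPI process per host listed in mpd.hosts */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    /* every process receives the data through a broadcast primitive */
    MPI_Bcast(data, n_iter, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* workload evenly divided among the nodes in the system */
    int chunk = n_iter / nodes;
    int lo = rank * chunk;
    int hi = (rank == nodes - 1) ? n_iter : lo + chunk;

    /* team of OpenMP threads: by default one thread per core on this node */
    #pragma omp parallel for
    for (int i = lo; i < hi; i++)
        body(&data[i]);
}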

Once the team of threads has been created, the workload can be re-assigned by using the standard work-sharing mechanisms implemented by libgomp. Then, whenever the execution of a parallel region is finalised, the execution of the OpenMP team of threads and of the MPI processes is automatically terminated by calling the proper termination routine.

B. Towards a Real-time libgomp Implementation

Currently, we have an implementation which is able to parallelise and distribute computing workloads transparently. However, our final objective is to be able to support real-time processes/threads based on the OpenMP programming model.

In order to support real-time processes/threads, an underlying real-time operating system must be used. We are mainly interested in the use of two available real-time kernels for Linux: SCHED_DEADLINE [8] and SCHED_RTWS [9]. The first implements partitioned, global and clustered scheduling algorithms that allow the cohabitation of hard and soft real-time tasks. The second implements a combination of Global EDF and a priority-based work-stealing algorithm, which allows executing parallel real-time tasks on more than one core at the same time instant whenever an idle core is available.

libgomp implements threads by using the POSIX threads library [10] (also known as pthreads). This is done by calling the gomp_team_start() function. This function initializes the data structures and passes the required parameters to the pthread_create() function. It is at this point that we can use the pthread_create() function to create the new threads according to the real-time parameters extracted from the deadline() clause. The implementation of the deadline() clause can be done by following a procedure similar to the one used when implementing the distributed clause.
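
As a sketch of how the value of the deadline() clause could be mapped onto real-time thread parameters, the fragment below uses the SCHED_DEADLINE sched_setattr() syscall interface that later became part of mainline Linux; it is only illustrative, since the kernels referenced in [8] and [9] expose their own interfaces and the actual integration point would be inside gomp_team_start().

/* Illustrative only: gives the calling thread a deadline (in ms) using the
   mainline SCHED_DEADLINE interface; in libgomp this would be applied to
   each thread created through gomp_team_start()/pthread_create(). */
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>                /* SCHED_DEADLINE policy number */

struct sched_attr {                     /* layout from the SCHED_DEADLINE documentation */
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;             /* ns */
    uint64_t sched_deadline;            /* ns */
    uint64_t sched_period;              /* ns */
};

static int set_thread_deadline(uint64_t runtime_ms, uint64_t deadline_ms)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  = runtime_ms  * 1000000ULL,
        .sched_deadline = deadline_ms * 1000000ULL,
        .sched_period   = deadline_ms * 1000000ULL,
    };
    /* 0 = calling thread; there is no glibc wrapper, so use the raw syscall */
    return syscall(SYS_sched_setattr, 0, &attr, 0);
}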

In the following section, we present the performance evaluation of our distributed clause implementation, i.e. of the non-real-time version of the software. Although the real-time support is still being implemented, the results help us to confirm that we are following the correct path for supporting real-time processes/threads based on the OpenMP programming model.

IV. EXPERIMENTAL EVALUATION

The experiments reported in this paper were conducted on a cluster of 5 identical machines. Each machine is equipped with a dual-core Intel(r) Celeron(r) CPU with a clock rate of 1.9 GHz and 2 GB of memory. Communications were supported by an 8-port 100 Mbps n-way fast Ethernet switch.

The experiments are based on the execution of the #pragma omp parallel distributed for pragma with a variable number of iterations. This number of iterations represents different sizes of a synthetic workload that needs to be processed within some time constraints. For example, in Figure 3, the iterations counted by the variable N_ITER are to be divided among the nodes/cores in the system. Each iteration takes, on average, approximately 0.016 ms to execute. The iterations inside the for loop are considered to be independent of each other, and therefore each iteration only produces changes on an independent part of the data.

The work-sharing mechanism used for these experiments divides the iterations contained inside the for loop between the MPI processes (one process per node), and they are further split over the number of OpenMP threads (one thread per core). We chose this approach since it is one of the most general work-sharing mechanisms used for the Single Program Multiple Data (SPMD) paradigm. On the other hand, it has the disadvantage of not being suitable for handling more dynamic patterns of parallelism (e.g. variable real-time patterns).

1. #pragma omp parallel distributed for
2. for (i = 0; i < N_ITER; i++) {
3.     loopCode();
4. }

Figure 3. Distributed for clause example.

The data collected relates to experiments with different numbers of iterations (100, 1000 and 10000) – the variable N_ITER in Figure 3. Each experiment consists of averaging the execution time of the distributed parallel loop over one thousand measurements, i.e. measuring the time from before the execution of line 1 to after line 4. When conducting the experiments, we were interested in estimating the reduction of the execution time as a function of the number of utilised nodes, and in estimating the maximum measured execution time.
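
A minimal sketch of how one such sample could be taken is shown below (the timer placement mirrors the "before line 1 / after line 4" description; loopCode() is the synthetic workload and the pragma requires our modified GCC):

/* Placeholder sketch of a single timing sample around the distributed loop. */
#include <time.h>

#define N_ITER 1000
extern void loopCode(void);              /* synthetic per-iteration workload */

double one_sample_ms(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);          /* before line 1 */
    #pragma omp parallel distributed for
    for (int i = 0; i < N_ITER; i++) {
        loopCode();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);          /* after line 4 */

    return (t1.tv_sec - t0.tv_sec) * 1e3 +
           (t1.tv_nsec - t0.tv_nsec) / 1e6;       /* elapsed milliseconds */
}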

Figure 4 depicts the average execution time and the maximum measured execution time of three experiments, having 100, 1000 and 10000 iterations, respectively. It is important to note that the vertical axis in Figure 4 has a logarithmic scale. It is also possible to observe that the execution time (numerically presented above the bars) and the maximum measured execution time (represented as error bars) are reduced in an almost linear way (considering the logarithmic scale) for the cases of 10000 and 1000 iterations. This is the expected speedup for parallel programs when the computations realised on different nodes are independent. However, this is not the case for the execution with 100 iterations, where the execution times remain almost the same regardless of the number of nodes utilised during execution. The reason for this behaviour is that the time saved on the parallel task execution is consumed by transmission time.

Figure 4. Execution times with 100, 1000 and 10000 iterations.

Also, the average transmission time and the maximum measured transmission time are important parameters to take into account when scheduling parallel tasks in a distributed environment, because they add a considerable extra time (delay) to the total execution, which is usually not considered when scheduling tasks on shared-memory platforms. For example, the size of the data to be transmitted is proportional to the number of iterations contained in the N_ITER variable. During the initial fork operation the data size to be transmitted is approximately 6.25 Kb, 62.5 Kb and 625 Kb when the N_ITER variable is equal to 100, 1000 and 10000 iterations, respectively. The same amount of data is later transmitted back to the main node during the join operation. Figure 5 shows the total transmission time (i.e. the sum of the fork and join transmission delays depicted in Figure 2) for the different values of N_ITER mentioned above.
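
For reference, and assuming the "Kb" figures above denote kilobytes, this corresponds to roughly 64 bytes of payload per loop iteration (e.g. 6.25 KB / 100 iterations ≈ 64 B), which is why the transmitted volume grows linearly with N_ITER while the per-node computation shrinks as more nodes are added.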

The results in Figure 5 show that the number of nodes in the cluster does not affect the transmission times. The reason is that the size of the data to be transmitted is constant and is transmitted using MPI broadcast. The latter holds because the workload to be transmitted is considerably small in comparison with the total capacity of the network. It is also important to note that these measurements were done in a closed network where the only traffic is related to our experiments.

Figure 5. Transmission times with 100, 1000 and 10000 iterations.

Therefore, we can confirm that as the execution time increases, the transmission time becomes more negligible. Thus, in order to correctly address parallel/distributed applications, a trade-off between local/distributed computation (execution time) and the cost of transmitting computations (transmission time) over networked devices needs to be carefully considered. This is one of the main differences of our model when compared to pure shared-memory approaches. By observing the preliminary results in Figures 4 and 5, we can derive two remarks from these plots.

The first remark is that the total execution time needs to be pondered in order to decide whether it is worth applying a workload distribution or not. For example, when running applications with execution times of a few milliseconds, the total execution time becomes more susceptible to delays (e.g. the implicit transmission delay); therefore, it may not be worthwhile to perform workload distribution. It is possible to observe this effect in Figure 4 for the case of 100 iterations. The second remark is that the existing parallel/distributed algorithms implemented by OpenMP and MPI are not suitable for handling real-time workloads, since no resources are reserved in the nodes nor in the network. Therefore, this opens opportunities to explore different and more complex work-sharing mechanisms and scheduling algorithms.
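
One illustrative way of stating the first remark quantitatively (this inequality is ours, it is not part of the paper's timing model) is that distributing a region whose sequential execution time is C over m nodes only pays off when

\[ \frac{C}{m} + d_{tx} < C \]

where d_tx denotes the total fork plus join transmission delay. For the 100-iteration case C is only about 100 × 0.016 ms = 1.6 ms, so even a millisecond-scale d_tx cancels most of the gain, which matches the almost constant execution times observed in Figure 4 for that case.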

V. CONCLUSIONS AND FUTURE WORK

This paper presented the parallel/distributed fork-join real-time execution model. We proposed automatic code generation by modifying the GCC compiler and enabling it to transparently support MPI messages based on the OpenMP programming model. To do so, we proposed an extension to the OpenMP specification by adding the #pragma omp parallel distributed for construct. We also described the modifications made to the GCC compiler to support transparent parallel/distributed executions.

We are currently working on the implementation and integration of the deadline() clause into libgomp. This will allow libgomp to generate real-time threads. We also plan to combine libgomp with the new real-time SCHED_RTWS scheduler for Linux. Then, it will be possible to execute parallel/distributed threads on a real-time kernel and therefore to have real-time guarantees on the execution of parallel threads.

ACKNOWLEDGMENTS

This work was partially supported by National Funds through FCT (Portuguese Foundation for Science and Technology), by ERDF (European Regional Development Fund) through COMPETE (Operational Programme 'Thematic Factors of Competitiveness'), within the VipCore project, ref. FCOMP-01-0124-FEDER-015006, and by ESF (European Social Fund) through POPH (Portuguese Human Potential Operational Program), under PhD grant SFRH/BD/71562/2010.

REFERENCES

[1] R. I. Davis and A. Burns, "A survey of hard real-time scheduling for multiprocessor systems," ACM Computing Surveys, vol. 43, no. 35, pp. 1–44, 2011.
[2] A. Saifullah, K. Agrawal, C. Lu and C. Gill, "Multi-core Real-Time Scheduling for Generalized Parallel Task Models," in Proc. of the 32nd IEEE Real-Time Systems Symposium (RTSS 2011), 2011.
[3] R. Garibay-Martinez, L. L. Ferreira and L. M. Pinho, "A framework for the development of parallel and distributed real-time embedded systems," in Proc. of the 38th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA 2012), 2012.
[4] OpenMP Architecture Review Board, "OpenMP application program interface V3.1, July 2011," www.openmp.org/wp/openmp-specifications/, online: last accessed April 2012.
[5] Message Passing Interface Forum, "MPI: A Message-Passing Interface standard, version 2.2," http://www.mpi-forum.org/docs/docs.html, online: last accessed April 2012.
[6] GCC Internals, http://gcc.gnu.org/onlinedocs/gccint/, online: last accessed September 2012.
[7] D. Novillo, "OpenMP and automatic parallelization in GCC," in Proceedings of the GCC Developers' Summit, June 2006.
[8] N. Manica, L. Abeni, L. Palopoli, D. Faggioli and C. Scordino, "Schedulable device drivers: Implementation and experimental results," in Proceedings of the 6th International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT 2010), 2010.
[9] L. Nogueira, J. C. Fonseca, C. Maia and L. M. Pinho, "Dynamic Global Scheduling of Parallel Real-Time Tasks," in Proc. of the 10th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC'12), 2012.
[10] POSIX Threads Programming, https://computing.llnl.gov/tutorials/pthreads/, online: last accessed September 2012.

