Exploring Lua For Concurrent Programming (Alexandre Skyrme, Roberto Ierusalimschy) (2008)
21 (2008), 3556-3572
submitted: 16/4/08, accepted: 5/6/08, appeared: 1/12/08 © J.UCS
Alexandre Skyrme
(Pontifical Catholic University of Rio de Janeiro (PUC–Rio)
Rio de Janeiro, Brazil
[email protected])
Noemi Rodriguez
(Pontifical Catholic University of Rio de Janeiro (PUC–Rio)
National Education and Research Network (RNP)
Rio de Janeiro, Brazil
[email protected])
Roberto Ierusalimschy
(Pontifical Catholic University of Rio de Janeiro (PUC–Rio)
Rio de Janeiro, Brazil
[email protected])
1 Introduction
Object Monitors [Caromel, Mateus and Tanter 2004], and the Concurrency and
Coordination Runtime [Chrysanthakopoulos and Singh 2005].
Since 2003, the Lua programming language [Ierusalimschy et al. 1996, Ierusalimschy et al. 2006, Ierusalimschy et al. 2007] has featured coroutines, which enable collaborative multithreading. However, a common criticism of coroutines is that they cannot exploit hardware parallelism, such as that provided by multi-core processors. In 2006, Ierusalimschy [Ierusalimschy 2006] proposed the use of multiple
independent states in Lua to implement Lua processes, based on some form of
message passing. In this paper we advance that proposal, building a complete
library for concurrent programming in Lua based on message passing over chan-
nels. As we will see, the resulting library showed encouraging performance results
even when running hundreds of thousands of simultaneous processes.
The rest of this paper is organized as follows. In section 2 we point out some
of the downsides of multithreading and present the model which we chose to
explore for concurrent programming in Lua. In section 3 we describe how we
implemented this model and in section 4 we present some results of a perfor-
mance evaluation of the implementation. Finally, in section 5, we draw some
conclusions.
Because they have independent resources, we call each thread in the library a Lua process. We create Lua processes through calls to the luaproc.newproc function. As user threads, Lua processes are scheduled exclusively by a scheduler that runs in user space, with no direct relation to operating system processes or other kernel-scheduled entities.
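A Lua process is spawned by handing its code to luaproc.newproc (a sketch; in this version of the library, process code is passed as a string of Lua source):

```lua
local luaproc = require("luaproc")

-- creates a new Lua process with its own independent state; it is
-- queued for execution by the user-space scheduler, not by the OS
luaproc.newproc([[
  print("hello from a Lua process")
]])

luaproc.exit()   -- wait for all Lua processes to finish
```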
Communication between Lua processes occurs exclusively through message
passing. On the one hand, communication via message passing can be slower than shared memory. On the other hand, the absence of shared memory avoids the performance and complexity penalties associated with shared-memory synchronization primitives. Moreover, programs can use the same communication model both for processes within a single machine and for processes in a distributed environment.
As their names imply, the luaproc.send function sends messages and the luaproc.receive function receives them. Message addressing is based on channels. Channels must be explicitly created with the luaproc.newchannel function and destroyed with the luaproc.delchannel function. A channel is an entity in its own right, with no direct relation to any Lua process. Each channel is named by a string, which must be passed as a parameter to luaproc.newchannel. Any process may send to or receive from any channel: knowing a channel's name suffices to use it.
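Channel-based addressing can be sketched as follows (assuming the functions described above; return values and error handling are omitted):

```lua
local luaproc = require("luaproc")

luaproc.newchannel("jobs")           -- channels are created by name

-- any process that knows the name "jobs" may use the channel
luaproc.newproc([[
  local msg = luaproc.receive("jobs")
  print("received:", msg)
]])

luaproc.newproc([[
  luaproc.send("jobs", "task #1")
]])

luaproc.exit()
-- when no longer needed: luaproc.delchannel("jobs")
```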
Each message carries a tuple of atomic Lua values: strings, numbers, or
booleans. More complex types must be encoded in some form. For instance,
it is easy in Lua to serialize data [Ierusalimschy 2006], that is to convert it into
a stream of bytes or characters, in order to save it in a file, send it through
a network connection or, in this case, send it in a message. Structured values
can easily be encoded as a piece of Lua code (as a string) that, when executed,
reconstructs that value in the receiver.
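For instance, a small table with atomic fields can be encoded as a chunk of Lua code and rebuilt by the receiver (a minimal sketch for flat tables with identifier-style string keys; this is our own illustrative helper, not part of the library):

```lua
-- encode a flat table of atomic values as a piece of Lua code
local function serialize(t)
  local parts = {}
  for k, v in pairs(t) do
    local val = (type(v) == "string") and string.format("%q", v)
                                      or tostring(v)
    parts[#parts + 1] = k .. " = " .. val   -- assumes identifier keys
  end
  return "return {" .. table.concat(parts, ", ") .. "}"
end

local chunk = serialize({ x = 10, name = "job" })  -- fits in a message

-- the receiver runs the chunk to reconstruct the value
local copy = (loadstring or load)(chunk)()
print(copy.x, copy.name)
```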
The luaproc.send operation is blocking: it returns only after another Lua process has received its message on the target channel, or when that channel turns out not to exist. Until one of these two conditions holds, the sending Lua process remains blocked.
The luaproc.receive function, on the other hand, can be either blocking or
non-blocking, depending on a parameter. A blocking receive behaves similarly
to a blocking send: it only returns after matching with a send operation on that
channel, or if the channel does not exist. The non-blocking receive operation,
in contrast, always returns immediately; its result indicates whether it got any
message.
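Inside a Lua process, the two modes might be used as follows (a sketch; the exact form of the non-blocking parameter, here a boolean flag, is an assumption based on the description above):

```lua
-- blocking receive: suspends this Lua process until a matching send
-- occurs on the channel (or the channel is destroyed)
local msg = luaproc.receive("events")

-- non-blocking receive: returns immediately; the result tells
-- whether a message was actually obtained
local msg2 = luaproc.receive("events", true)
if msg2 then
  print("got:", msg2)
else
  print("no message pending")
end
```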
The reason we opted for a blocking send operation is that it provides a simpler, more deterministic programming model: when a call to luaproc.send returns successfully, it is possible to assert that the message was received.
3 Model Implementation
the interpreter’s state and keeps track of functions and global variables, among
other information related to the interpreter.
Once Lua code has been loaded into a Lua state, its execution can be controlled through functions provided by Lua's C API. Control takes place as if the Lua code were executed as a coroutine. Therefore, even if the Lua code does not include explicit calls to Lua's standard coroutine-handling functions, its execution can be suspended and resumed through C functions. This feature is essential for controlling the execution of Lua processes.
Each Lua process comprises an independent Lua state, where the process code is loaded at process creation. The independence between Lua states ensures the absence of shared memory between Lua processes and helps enforce message passing as the means for interprocess communication. The remaining structure used to implement a Lua process is compact, with few members other than its Lua state. The relevant members include the process execution state (idle, ready, blocked, or finished) and the number of arguments to be used when resuming its execution if it is blocked.
No unique process identifier (PID) is included since there is no fixed relation
between workers and processes.
Even though the creation of a Lua state is a cheap operation, loading all
standard Lua libraries can take more than ten times the time required to create
a state [Ierusalimschy 2006]. Thus, to reduce the cost of creating Lua processes,
only the basic standard library and our own library are automatically loaded
into each new Lua process. The remaining standard libraries (io, os, table, string,
math, and debug) are pre-registered and can be loaded with a standard call to
Lua’s require function.
Our library also offers a facility for recycling Lua processes, optionally activated through a call to the luaproc.recycle function. Recycling consists of reusing the states of finished Lua processes to execute new processes. Instead of being destroyed when its process finishes, a state can be stored for reuse. A new Lua process can then be created by loading its code into a recycled state, eliminating the costs of creating a new state and loading libraries.
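Recycling might be enabled as follows (the numeric limit matches the recycle limit discussed in section 4; the exact signature of luaproc.recycle is an assumption):

```lua
local luaproc = require("luaproc")

luaproc.recycle(50)   -- keep up to 50 finished states for reuse

for i = 1, 1000 do
  -- beyond the first creations, most of these calls can reuse a
  -- recycled state, skipping state creation and library loading
  luaproc.newproc([[ print("done") ]])
end

luaproc.exit()
```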
3.2 Scheduler
Workers are simply kernel threads, managed with the POSIX Threads library (pthreads), which run the following cycle: retrieve the first Lua process from the ready queue; execute the Lua code associated with that process until it finishes, blocks, or yields; and take the appropriate action depending on the outcome. Workers can be created and destroyed at run time through the API functions luaproc.createworker and luaproc.destroyworker.
If the execution of a Lua process ends because the Lua code related to the
process has finished normally, the worker closes the corresponding Lua state and
destroys the process. If, during the execution of a Lua process, a call is made
to the standard Lua function coroutine.yield, the worker simply reinserts the
process at the end of the ready queue. This suspends the process execution and
allows other processes to execute, which is the expected behavior of a yield. If
the execution of a Lua process results in an unexpected error, the worker prints
an error message, closes the corresponding Lua state and destroys the process.
Since there is a single ready queue, all workers take Lua processes from the same queue. This means shared-memory synchronization primitives had to be used to serialize access to the queue. To that end, we used condition variables and mutexes, both provided by the POSIX Threads library (pthreads).
Lua uses a virtual stack to pass values to and from C. Each element in this
stack represents a Lua value. Calls from Lua to functions implemented in C use
the virtual stack to pass function arguments. Likewise, these C functions use
the virtual stack to pass results back to Lua. Therefore, passing messages in
our library simply implies copying data from the sender’s virtual stack to the
receiver’s virtual stack.
In our library, a Lua process can only have its execution blocked in two distinct
situations:
1. when it calls the blocking receive function with a channel where there are
no processes waiting to send, that is, when an attempt to receive a mes-
sage occurs without a previous corresponding attempt to send to the same
channel;
2. when it calls the send function with a channel where there are no processes
waiting to receive, that is, when an attempt to send a message occurs without
a previous corresponding attempt to receive from the same channel.
When a Lua process blocks, the worker adds it to the corresponding channel’s
queue and gets another process from the ready queue in order to run it. A blocked
Lua process is unblocked only if there is a matching call on the same channel or if
the channel where it is blocked is destroyed. When such a matching call happens,
the same worker that is executing the process that made the call removes the
blocked process from the channel queue, copies message data between virtual
stacks, and places the unblocked process at the end of the ready queue.
To keep track of Lua processes that are blocked trying to communicate, each
channel has two distinct queues (FIFO): one holds processes blocked when trying
to send messages to the channel and another holds processes blocked when trying
to receive messages from the channel. At most one of these queues can be non-empty at any given time; otherwise, a process from one queue could be matched with a process from the other.
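A complete program exercising these mechanisms can be sketched as follows (a sketch assuming the API of section 2; exact signatures may differ):

```lua
local luaproc = require("luaproc")

luaproc.createworker()          -- one extra kernel thread (worker)

luaproc.newproc([[              -- main Lua process
  luaproc.newchannel("greetings")

  luaproc.newproc([=[
    luaproc.send("greetings", "hello world")
  ]=])

  luaproc.newproc([=[
    local msg = luaproc.receive("greetings")
    print(msg)
  ]=])
]])

luaproc.exit()  -- wait until all Lua processes have finished
```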
As we can see, the program begins by loading our library with the standard Lua require function. Then, it creates an additional worker and a main Lua process that will hold the remainder of our application. This main Lua process creates a channel and two additional Lua processes. While one of these processes sends a message on the channel, the other receives the message and then prints it. We ensure that our application will not exit before all Lua processes have completed their execution by calling the luaproc.exit function, which simply prevents workers from exiting while there are unfinished Lua processes.
4 Performance Evaluation
offer scalability that allows for massive concurrency. Therefore, we opted not to
present them in this work.
In this test we measure the memory usage of process creation. As in the previous test, we first create a main Lua process and then, from within it, create equal numbers of communication channels and Lua processes; each process waits for a message that is only sent after all processes have been spawned. For this test, however, we introduced delays immediately before and after creating the channels and processes, to allow external memory-usage measurement with Linux's pmap command, which maps virtual memory usage per process.
We ran an analogous test with Erlang. As before, we created a certain number of Erlang processes that waited for a message sent only after all processes had been created; we did not create or use communication channels. For this test, too, we introduced delays immediately before and after creating the processes, measuring memory usage by the same means described above.
[Figure: memory usage for 5,000 to 50,000 processes — Lua, compiled Erlang, and interpreted Erlang]
4.3 Communication
Message passing is the intended way for Lua processes to communicate and
synchronize, therefore it is important to evaluate how it performs. In this test we
sequentially send and receive messages of different sizes and measure execution
time. First, the message contents are read from a file composed of copies of the
same string separated by newlines. Then, a main Lua process that will host the
remainder of our application’s code is created. Next, from within the main Lua
process, a communication channel is created and a new Lua process, whose sole
purpose is to receive messages, is spawned. Finally, the main Lua process sends
the same message sequentially, 1,000 times, to the second Lua process.
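The structure of this test can be sketched as follows (a sketch; timing instrumentation and file reading are omitted, and the channel name is ours):

```lua
local luaproc = require("luaproc")

luaproc.newproc([[
  luaproc.newchannel("bench")

  -- receiver process: consumes exactly 1,000 messages
  luaproc.newproc([=[
    for i = 1, 1000 do
      luaproc.receive("bench")
    end
  ]=])

  -- sender: the message would be read from the test file; here we
  -- just build one of a given size
  local msg = string.rep("x", 10000)
  for i = 1, 1000 do
    luaproc.send("bench", msg)   -- blocks until the receiver matches
  end
]])

luaproc.exit()
```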
We conducted a similar test using Erlang. As in the previous tests, the main difference between our Lua code and our Erlang code was that the latter needs no communication channels. Apart from that, the Erlang code also differs slightly, since a few additional messages must be sent to inform the sender process of the receiver's process identifier (PID) and to ensure proper synchronization.
Figure 3 shows the total execution times for sending and receiving messages
of increasing sizes using our library and Erlang. Once again, we present both the
results for interpreted and compiled Erlang code.
As we can see in the figure, our library presented good communication perfor-
mance, with execution times below 0.1s to send messages with up to 10,000 bytes.
Erlang, in turn, presented better performance when interpreted, rather than
compiled. It presented almost constant execution times when compiled, which
suggests it relies on an O(1) operation to perform message passing, such as copy-
ing a pointer that points to a shared memory address that holds message data.
[Figure 3: total execution times for sending and receiving 1,000 messages of 1 to 10,000 bytes — Lua, compiled Erlang, and interpreted Erlang]
In this test, we evaluate how valuable the state-recycling feature described in section 3 can be when used under the right circumstances. This time, we created a fixed number of Lua processes sequentially and had them print a simple message to standard output before exiting. No interprocess communication was used, so no communication channels were created. The low individual execution time allowed a better evaluation of the process-recycling feature.
We changed the recycled process limit and measured the total execution time
for creating and running all the Lua processes. The recycled process limit simply
determines the maximum number of Lua states from finished Lua processes
which are stored for further recycling when new processes are created. If this
limit is set to zero, the recycling feature is disabled and no Lua states from
finished Lua processes are kept. If it is set to n, at any given time there will be
at most n stored Lua states from finished Lua processes.
An altered version of the library was used for this test, solely to report how many processes were created with recycled Lua states. Figure 4 shows the total execution times for creating 100,000 Lua processes using different recycle limits, with standard output redirected to the null device, along with the recycle counts.
As the figure shows, the process recycling feature can offer significant performance gains.
[Figure 4: total execution times for creating 100,000 Lua processes with recycle limits 0, 1, 10, 100, and 1000; for nonzero limits, between 99,819 and 99,936 processes were created with recycled states]
4.5 Parallelism
The second module is responsible for coordinating job distribution and for
centralizing results. It reads the patterns from a file, sends them to the searchers
and then starts to progressively distribute target file names to searchers. It also
receives results from searchers and notifies them when all target files have been
searched.
The third module is the searcher; it is responsible for searching the target files for patterns. Each searcher receives a single file name at a time and sends its results back to the coordinator only after processing the whole file. The results consist of the lines of the target file that matched any of the patterns.
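The coordinator/searcher interaction can be sketched as follows (a simplified sketch; pattern distribution is omitted, and the channel names, file names, and termination protocol are our assumptions):

```lua
local luaproc = require("luaproc")

luaproc.newchannel("jobs")
luaproc.newchannel("results")

-- searchers: each takes one file name at a time and reports back
for i = 1, 5 do
  luaproc.newproc([[
    while true do
      local name = luaproc.receive("jobs")
      if name == "done" then break end
      -- search 'name' for the patterns here, then send the
      -- matching lines back to the coordinator
      luaproc.send("results", name .. ": <matching lines>")
    end
  ]])
end

-- coordinator: distributes file names, collects results, then
-- notifies the searchers that all files have been searched
luaproc.newproc([[
  local files = { "f1.txt", "f2.txt" }      -- illustrative names
  for _, name in ipairs(files) do
    luaproc.send("jobs", name)
  end
  for i = 1, #files do
    print(luaproc.receive("results"))
  end
  for i = 1, 5 do luaproc.send("jobs", "done") end
]])

luaproc.exit()
```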
Unlike the others, this test was carried out on a computer with four dual-core 2.2 GHz AMD Opteron processors (eight processor cores in total) and 32 GB of RAM. Its operating system was also Linux, this time the CentOS 5.1 distribution, with standard kernel 2.6.18-53.1.6.el5xen #1 SMP and Native POSIX Threads library (NPTL) 2.5.
Initially, six workers were used to run the parallel version of the application
in order to stimulate parallelism and reduce concurrency in the execution of one
coordinator and five searcher Lua processes. Next, still using the parallel version
of the application, just a single worker was used, in order to allow for a more
balanced comparison with the serial version. The pattern file used throughout the
test was the same and it contained 25 lines, with one string per line. The target
files were copies of a single file, which included 6,605,423 lines and 2,147,483,849
bytes (around 2 GB). Results are shown in figure 5.
The results indicate that, as expected, exploiting parallelism on multiprocessor environments can yield proportional reductions in execution time. When using the serial version of the application, or the parallel version with a single worker (kernel thread), execution time increased almost linearly with the number of target files. On the other hand, when the parallel version ran with six workers, execution times for one and for five target files were almost the same, which strongly suggests that while one worker acted as the coordinator, the other five acted as searchers and processed the five target files in parallel. Still regarding the parallel version with six workers, it is worth noting that, again as expected, execution time increased linearly when the number of target files grew from five to ten.
Finally, results also show an almost insignificant difference in execution times
between the serial version of the application, which uses only standard Lua
libraries, and the parallel version, which uses our library, when it ran with a
single worker.
[Figure 5: execution times for 1, 5, and 10 target files — serial version, parallel version with 1 worker, and parallel version with 6 workers]
5 Conclusion
of Lua processes.
The use of the POSIX Threads library (pthreads) as a means to benefit from kernel threads allowed parallelism to be exploited through the underlying operating system. Paradoxically, however, it also significantly increased development complexity, mostly due to the need to handle the typical obstacles of preemptive multithreading with shared memory.
The difficulties we experienced while implementing the chosen model confirm the criticism of preemptive multithreading with shared memory and reinforce the need for new approaches to concurrent programming. The limitations of preemptive multithreading with shared memory, in particular its development complexity, create difficulties even when it is used only as a building block for alternative solutions.
This work exhausts neither the investigation of the chosen model for concurrent programming in Lua nor the exploration of alternative approaches to concurrent programming. Our library could be improved with new functionality, and it could be further evaluated by developing more complex, "real-world" applications, combined with a more extensive performance evaluation. In addition, the usability of our library, which we intuitively believe to be better than that of other libraries, still lacks proper testing. Nevertheless, the results presented in this work represent an important step towards enabling further contributing efforts.
References
[Andrews and Schneider 1983] Andrews, G. R., Schneider, F. B.: “Concepts and No-
tations for Concurrent Programming”; ACM Comput. Surv., 15, 1 (1983), 3–43.
[Armstrong 1996] Armstrong, J.: “Erlang - a Survey of the Language and its Industrial
Applications”; INAP’96 — The 9th Exhibitions and Symposium on Industrial
Applications of Prolog, Hino, Tokyo, Japan (1996), 16–18.
[Armstrong 2007] Armstrong, J.: “Programming Erlang”; Pragmatic Bookshelf (2007),
ISBN 193435600X.
[Benton, Cardelli and Fournet 2002] Benton, N., Cardelli, L., and Fournet, C.: “Mod-
ern Concurrency Abstractions for C#”; ECOOP ’02: Proceedings of the 16th Euro-
pean Conference on Object-Oriented Programming, Springer–Verlag (2002), ISBN
3-540-43759-2, 415–440.
[Caromel, Mateus and Tanter 2004] Caromel, D., Mateu, L., and Tanter, E.: “Sequen-
tial Object Monitors”; ECOOP 2004 – Object-Oriented Programming, 18th Euro-
pean Conference, Springer-Verlag, 3086, 316–340.
[Chrysanthakopoulos and Singh 2005] Chrysanthakopoulos, G., and Singh, S.: “An
Asynchronous Messaging Library for C#”; Synchronization and Concurrency in
Object-Oriented Languages (SCOOL), OOPSLA 2005 Workshop, San Diego, Cal-
ifornia, USA.
[Dijkstra 1975] Dijkstra, E. W.: “Guarded commands, nondeterminacy and formal
derivation of programs”; Commun. ACM 18, 8 (1975), 453–457.
[Dijkstra 1983] Dijkstra, E. W.: “The structure of THE - multiprogramming system”;
Commun. ACM 26, 1 (1983), 49–52.