Programming Memory-Constrained Networked Embedded Systems

Doctoral Thesis
SICS Dissertation Series 47

Adam Dunkels
February 2007
Abstract

Ten years after the Internet revolution, we are standing on the brink of another
revolution: networked embedded systems that connect the physical world with
computers, enabling new applications ranging from environmental monitoring
and wildlife tracking to improvements in health care and medicine.
Only 2% of all microprocessors sold today are used in PCs; the remaining
98% are used in embedded systems. The microprocessors used in embedded
systems have much smaller amounts of memory than PCs. An embedded
system may have as little as a few hundred bytes of memory, millions of
times less than the memory in a modern PC. These memory constraints make
programming embedded systems a challenge.
This thesis focuses on three topics pertaining to programming memory-
constrained networked embedded systems: the use of the TCP/IP protocol suite
even in memory-constrained networked embedded systems; simplifying event-
driven programming of memory-constrained systems; and dynamic loading of
program modules in an operating system for memory-constrained devices. I
show that the TCP/IP protocol stack can, contrary to previous belief, be used
in memory-constrained embedded systems, but that a small implementation has
a lower network throughput. I present a novel programming mechanism called
protothreads that is intended to replace state machine-based event-driven pro-
grams. Protothreads provide a conditional blocking wait mechanism on top of
event-driven systems with a much smaller memory overhead than full multi-
threading; each protothread requires only two bytes of memory. I show that
protothreads significantly reduce the complexity of event-driven programming
for memory-constrained systems. Of seven state machine-based programs
rewritten with protothreads, almost all explicit states and state transitions could
be removed. Protothreads also reduced the number of lines of code by 31%
on average. The execution time overhead of protothreads is on the order of a
few processor cycles.
Adam Dunkels
Stockholm, January 8, 2007
Acknowledgements
I first and foremost thank my colleague Thiemo Voigt, who has also been the
co-adviser for this thesis, for all the moral support over the past few years, for
being the committed person that he is, and for being genuinely fun to work
with. Working with this thesis would have been considerably less enjoyable if
it had not been for Thiemo. Thiemo and I have been working during the final
hours before paper submission deadlines, sometimes as late, or as early, as
6 AM. Thiemo has also gone out of his way to take care of distracting
duties, thus allowing me to focus on doing the research for this thesis.
I am grateful to Mats Björkman, my university adviser for this thesis, for
being a stimulating person and for the smooth PhD process. Seemingly big
problems have always turned into non-problems after a discussion with Mats.
I am also very grateful to the inspiring Juan Alonso. Juan started the
DTN/SN project at SICS within which most of the work in this thesis was
done. I also would like to thank Henrik Abrahamsson for being a good friend
and stimulating discussion partner on subjects ranging from the craft of sci-
ence and research to cars and culinary culture. I am also very happy to work
with the great members of our Networked Embedded Systems Group at SICS:
Joakim Eriksson, Niclas Finne, and Fredrik Österlind. An equally skilled and
dedicated group of people is very hard to find. Many thanks also to Björn
Grönvall for taking on much of the work of writing project deliverables as well
as porting Contiki to new platforms. Thanks also to Sverker Janson, labora-
tory manager of the Intelligent Systems Laboratory at SICS, for his inspiring
leadership and for his support. Many thanks to all the people at SICS for creat-
ing a stimulating work environment; Lars Albertsson, Frej Drejhammar, Karl-
Filip Faxén, Anders Gunnar, Ali Ghodsi, Kersti Hedman, Janusz Launberg,
Ian Marsh, Mikael Nehlsen, Martin Nilsson, L-H Orc Lönn, Tony Nordström,
Carlo Pompili, Babak Sadighi, and Karl-Petter Åkesson, just to name a few.
Many thanks to Oliver Schmidt for our cooperation on protothreads and his
porting and maintaining of Contiki, for always being a very sharp discussion
partner, and for being a good person to work with.
Thanks also to the great master's thesis students with whom I have been in-
volved during this work: Max Loubser, Shujuan Chen, Zhitao He, and Nicolas
Tsiftes. Thanks also to Muneeb Ali for his fruitful research visit at SICS.
My thanks also go out to the hundreds of people I have been in contact with
regarding my software over the past few years. I have gotten many warm
words, good comments on my software, bug fixes and patches, as well as new
modules and ports to new architectures. I have gotten so many e-mails that I
unfortunately have only been able to answer a fraction of them.
I am also deeply grateful to the people at Luleå University of Technology
for teaching me the basic aspects of computer science. Lars-Gunnar Taube for
introducing me to the secrets of computing many years ago; Håkan Jonsson for
his introduction to the interesting world of functional programming; Leif Ku-
soffsky for his imperative programming laboratory assignments that taught me
how to write virtual machines and how to develop compilers for object-oriented
languages; Lennart Andersson for giving me the extremely important insight
that external and internal data representation need not be the same, when we
were instructed to not use a two-dimensional array to represent the spreadsheet
data in the VisiCalc-clone we developed as a laboratory assignment; Mikael
Degermark and Lars-Åke Larzon for sparking my interest in computer com-
munications; and Olov Schelén and Mathias Engan for teaching me how to
read and review scientific papers.
Thanks also go to my mother Kerstin for being supportive throughout my
education and research career, for taking interest in my research work, and for
reading and commenting on this thesis. I will also forever be in debt to my late
father Andrejs, who taught me the skills of everything from living and laughing
to mathematics and music.
Finally, I am extremely fortunate to have been blessed with such a loving
family: my wife Maria, our sons Morgan and Castor, and one whom we look
forward to meeting a few months from now. Maria has supported me throughout
the work with this thesis, taken interest in my work, listened to and helped
improve my research presentations, and endured all my research ramblings at
home.
The work in this thesis was supported in part by VINNOVA, Ericsson, SITI, SSF,
the European Commission under the Information Society Technology priority
within the 6th Framework Programme, the European Commission’s 6th Frame-
work Programme under contract number IST-004536, and the Swedish Energy
Agency. Special thanks to Bo Dahlbom and the Swedish Energy Agency for
funding the final writing up of this thesis. The Swedish Institute of Computer
Science is sponsored by TeliaSonera, Ericsson, SaabTech, FMV, Green Cargo,
ABB, and Bombardier Transportation AB.
Included Papers
This thesis consists of a thesis summary and five papers that are all published
in peer-reviewed conference and workshop proceedings. Throughout the thesis
summary the papers are referred to as Paper A, B, C, D, and E.
Paper A Adam Dunkels. Full TCP/IP for 8-bit architectures. In Proceedings
of The First International Conference on Mobile Systems, Applications,
and Services (ACM MobiSys 2003), San Francisco, USA, May 2003.
Paper B Adam Dunkels, Björn Grönvall, and Thiemo Voigt. Contiki - a
Lightweight and Flexible Operating System for Tiny Networked Sen-
sors. In Proceedings of the First IEEE Workshop on Embedded Net-
worked Sensors (IEEE Emnets 2004), Tampa, Florida, USA, November
2004.
Paper C Adam Dunkels, Oliver Schmidt, and Thiemo Voigt. Using pro-
tothreads for sensor node programming. In Proceedings of the Workshop
on Real-World Wireless Sensor Networks (REALWSN 2005), Stockholm,
Sweden, June 2005.
Paper D Adam Dunkels, Oliver Schmidt, Thiemo Voigt, and Muneeb Ali.
Protothreads: Simplifying event-driven programming of memory-
constrained embedded systems. In Proceedings of the 4th Interna-
tional Conference on Embedded Networked Sensor Systems (ACM Sen-
Sys 2006), Boulder, Colorado, USA, November 2006.
Paper E Adam Dunkels, Niclas Finne, Joakim Eriksson, and Thiemo Voigt.
Run-time dynamic linking for reprogramming wireless sensor networks.
In Proceedings of the 4th International Conference on Embedded Net-
worked Sensor Systems (ACM SenSys 2006), Boulder, Colorado, USA,
November 2006.
Contents

I Thesis Summary

1 Introduction
   1.1 Wireless Sensor Networks
   1.2 Programming Memory-Constrained Embedded Systems
   1.3 Research Approach and Method
   1.4 Research Issues
       1.4.1 TCP/IP for Memory-Constrained Systems
       1.4.2 Protothreads and Event-Driven Programming
       1.4.3 Dynamic Module Loading
   1.5 Thesis Structure
4 Related Work
   4.1 Small TCP/IP Implementations
Bibliography

II Papers

7 Paper A: Full TCP/IP for 8-Bit Architectures
   7.1 Introduction
   7.2 TCP/IP overview
   7.3 Related work
   7.4 RFC-compliance
   7.5 Memory and buffer management
   7.6 Application program interface
   7.7 Protocol implementations
       7.7.1 Main control loop
       7.7.2 IP — Internet Protocol
       7.7.3 ICMP — Internet Control Message Protocol
       7.7.4 TCP — Transmission Control Protocol
   7.8 Results
       7.8.1 Performance limits
       7.8.2 The impact of delayed acknowledgments
       7.8.3 Measurements
       7.8.4 Code size
   7.9 Future work
   7.10 Summary and conclusions
   7.11 Acknowledgments
   Bibliography
8 Paper B: Contiki - a Lightweight and Flexible Operating System for Tiny Networked Sensors
   8.1 Introduction
       8.1.1 Downloading code at run-time
       8.1.2 Portability
       8.1.3 Event-driven systems
   8.2 Related work
   8.3 System overview
   8.4 Kernel architecture
       8.4.1 Two level scheduling hierarchy
       8.4.2 Loadable programs
       8.4.3 Power save mode
   8.5 Services
       8.5.1 Service replacement
   8.6 Libraries
   8.7 Communication support
   8.8 Preemptive multi-threading
   8.9 Discussion
       8.9.1 Over-the-air programming
       8.9.2 Code size
       8.9.3 Preemption
       8.9.4 Portability
   8.10 Conclusions
   Bibliography
9 Paper C: Using Protothreads for Sensor Node Programming
   9.1 Introduction
   9.2 Motivation
   9.3 Protothreads
       9.3.1 Protothreads versus events
       9.3.2 Protothreads versus threads
       9.3.3 Comparison
       9.3.4 Limitations
       9.3.5 Implementation
   9.4 Related Work
   9.5 Conclusions
   Bibliography
10 Paper D: Protothreads: Simplifying Event-Driven Programming of Memory-Constrained Embedded Systems
   10.1 Introduction
   10.2 Protothreads
       10.2.1 Scheduling
       10.2.2 Protothreads as Blocking Event Handlers
       10.2.3 Example: Hypothetical MAC Protocol
       10.2.4 Yielding Protothreads
       10.2.5 Hierarchical Protothreads
       10.2.6 Local Continuations
   10.3 Memory Requirements
   10.4 Replacing State Machines with Protothreads
   10.5 Implementation
       10.5.1 Prototype C Preprocessor Implementations
       10.5.2 Memory Overhead
       10.5.3 Limitations of the Prototype Implementations
       10.5.4 Alternative Approaches
   10.6 Evaluation
       10.6.1 Code Complexity Reduction
       10.6.2 Memory Overhead
       10.6.3 Run-time Overhead
   10.7 Discussion
   10.8 Related Work
   10.9 Conclusions
   Bibliography
Thesis Summary
Chapter 1
Introduction
Twenty years ago the computer revolution put PCs in offices and homes
throughout large parts of the western world. Ten years later the Internet revolu-
tion connected the computers together in a world-spanning communication net-
work. Today, we stand on the brink of the next revolution: networked embed-
ded systems that connect the physical world with computers, enabling
a large variety of applications such as health and heart resuscitation monitor-
ing [8, 69], wildlife tracking and volcano monitoring [35, 48, 70], building
structure monitoring [39], building automation [63], and carbon dioxide mon-
itoring in relation to global warming [41].
It is difficult to estimate the total number of embedded systems in the world
today, but it is possible to get a grasp of the magnitude of the area by looking
at sales figures for microprocessors. We might expect that PCs account for
the bulk of microprocessors because of their widespread use. However, PCs
account for only a very small part of the microprocessor market. In 2002, only
2% of all microprocessors sold were used in PCs [66]. The remaining 98% of
all microprocessors were sold for use in various types of embedded systems.
Embedded systems typically have much less memory than general-purpose
PCs. A normal PC sold in late 2006 had thousands of millions bytes of random
access memory (RAM). This is many million times larger than the RAM size
in many embedded systems; the microprocessors in many embedded systems
have as little as a few hundred or a few thousand bytes of RAM.
It is difficult to estimate the typical or average memory size in embedded
systems today, but again we can get a grasp of the magnitude by looking at
the microprocessor sales figures. Of the total number of microprocessors sold
3
4 Chapter 1. Introduction
in 2002, over 90% were significantly smaller in terms of memory size than a
modern PC [66]. In fact, over 50% of all microprocessors were so-called 8-
bit processors, which typically can address at most 65536 bytes of
memory. Because of cost constraints, most of those microprocessors are likely
to have considerably less memory than the maximum amount. For example,
in 2004 the price of the popular Texas Instruments’ MSP430 FE423 micropro-
cessor was about 20% higher with 1024 bytes of RAM ($5.95) than the same
microprocessor with 256 bytes of RAM ($4.85) [31]. While the price of on-
chip memory is likely to decrease in the future, the number of applications for
microprocessors will increase. As these applications will require microproces-
sors with an even lower per-unit cost than today, future microprocessor models
are likely to have similar memory configurations as today’s microprocessors
but at lower per-unit prices.
Most microprocessors in embedded systems are programmed in the C pro-
gramming language [22]. Their memory constraints make programming them
a challenge. Programmers that program general-purpose computers or PCs sel-
dom have to take memory limitations into consideration because of the large
amounts of memory available. Moreover, PC microprocessor hardware makes
techniques such as virtual memory possible, making the memory available
to the programmer almost limitless. In contrast, the small amounts of mem-
ory require embedded systems programmers to always be aware of memory
limitations when programming their systems. Also, most microprocessors for
embedded systems do not have the ability to extend the physical memory with
virtual memory. In this thesis I use the term memory constrained for systems
where the programmer explicitly must take memory limitations into consider-
ation when programming the system.
Many embedded systems today communicate with each other. Examples
include base stations for mobile telephony, wireless car keys, point of sale
terminals, and data logging equipment in trucks. The embedded systems com-
municate both with other embedded systems and general-purpose computers
using a network. In this thesis I call such systems networked embedded sys-
tems. Networked embedded systems are, just like ordinary embedded systems,
often memory constrained.
1.1 Wireless Sensor Networks

Wireless sensor networks are composed of small networked embedded
systems, equipped with sensors, that form a wireless network through which
sensor readings are transmitted [61]. Each sensor node is a networked embed-
ded system. Sensor data is relayed from sensor node to sensor node towards
a base station. If a node should break, sensor network routing protocols may
reroute the data around the broken node.
A wireless sensor network may consist of up to thousands of sensor nodes.
Because of the potentially large scale of the sensor network, the individual sensors
must be small, low cost, and expendable. For this reason, the sensor nodes used
in wireless sensor networks typically have memory-constrained microproces-
sors. Commercially available sensor nodes have between 2 and 10 kilobytes of
RAM [1, 59, 62]. Moreover, for sensor networks to run for extended periods
of time, the energy consumption of both individual sensor nodes and of the
network as a whole is of primary importance. Thus energy consumption is an
important performance metric for sensor networks.
1.3 Research Approach and Method

My research approach is pragmatic: I have written my software in the C
programming language, the most commonly used programming language
for embedded systems [22], and I have made it a point to make my software
work on a wide range of embedded systems platforms. An alternative approach
would have been to use uncommon programming languages, develop new pro-
gramming languages, or develop new hardware platforms. However, this prag-
matic approach has enabled me to interact with embedded systems develop-
ers working with actual embedded systems and products which would have
been very difficult if I had not used the C programming language. Interact-
ing with embedded systems developers has given me insights into many of the
actual problems that exist in embedded systems programming and has forced
me to build systems that actually work. Moreover, this pragmatic approach
also makes the research directly accessible to practitioners. This is one of the
reasons behind the large impact of this thesis.
The research method employed in this thesis has been the method of com-
puter systems research [16]: I have built computer systems and conducted ex-
periments with them in order to evaluate a specific method, tool, mechanism,
or implementation technique. The systems I have built are software systems:
two TCP/IP stacks for memory-constrained systems, lwIP and uIP, and an op-
erating system for memory-constrained systems, Contiki.
My research work has typically gone through two phases, one exploratory
phase and one confirmatory phase. In the exploratory phase I have been writing
computer programs, either as part of another research project or for personal
enjoyment. When programming I have come up with an interesting idea and
have become interested in finding out whether the idea is good or not. To test
the idea I have formulated an initial hypothesis to verify or falsify. The work
has then entered the confirmatory phase. In the confirmatory phase I test the
hypothesis that I have developed during the exploratory phase. For the purpose
of testing the hypothesis I have built a software system to carry out experi-
ments that either verify or falsify my hypothesis. I have then conducted the
experiments and evaluated the results. If the experiments have not supported
my hypothesis I have either revised the hypothesis and rebuilt my system, or
have abandoned the hypothesis and continued the exploratory phase.
Chapter 2
Scientific Contributions and Impact
2.2 Impact
The impact of the research in this thesis has been, and continues to be, large.
The lwIP and uIP software developed as part of this thesis has been adopted
by well over one hundred companies world-wide in a wide variety of embed-
ded devices. Examples include satellite systems, oil boring and oil pipeline
equipment, TV transmission devices, equipment for color post-processing of
movies, world-wide container monitoring and security systems, switches and
routers, network cameras, and BMW racing car engines. The software is also
used in system development kits from hardware manufacturers including Ana-
log Devices, Altera, and Xilinx, which greatly increases the dissemination of
the software. Articles in professional embedded systems developer magazines
have been written, by others, on porting the uIP software for new hardware
platforms [9]. The software is also covered in books on embedded systems and
networking [34, 50] and is included in embedded operating systems [2, 14, 65].
The lwIP and uIP homepages have for a few years been among the top five hits
for Google searches such as TCP/IP stack and embedded TCP/IP.
The Contiki operating system has become a well-known operating sys-
tem in the wireless sensor network community and is used by several research
projects in the area. Many ideas from Contiki such as dynamic module load-
ing and the optional multi-threading library have also been adopted by other
operating systems for wireless sensor networks [26, 51]. The dynamic loader
mechanism in Contiki has also been investigated for use by Ericsson Mobile
Platforms as part of the hardware platform used in many of today’s 3G mobile
telephones.
Protothreads are currently used by numerous embedded software devel-
opers and have been recommended twice in acclaimed embedded developer
Jack Ganssle’s Embedded Muse newsletter [20]. Protothreads have also
been ported, by others, to other programming languages and operating sys-
tems [40, 58].
The papers and the software in this thesis are used in advanced courses on
embedded systems and sensor networks at many universities and institutions
throughout the world. The papers are cited by many scientific papers in the
area of wireless sensor networks. In early 2007 the Google Scholar citation
index [3] reports a total of 121 citations of the first three papers in this thesis.
The last two papers had not yet been indexed by Google Scholar.
Chapter 3
Summary of the Papers and Their Contributions
The thesis is a collection of five papers, Paper A, B, C, D, and E. All papers are
published in peer-reviewed conference and workshop proceedings. Conference
proceedings are the primary venue for scientific publishing in the area of this
thesis. I have presented all papers at the conferences and workshops at which
they appeared.
Papers A, D, and E are published at top-tier conferences, ACM MobiSys
2003 and ACM SenSys 2006. ACM MobiSys is a high-quality single track
conference. ACM SenSys is regarded as the most prestigious conference in the
area of wireless sensor networks. Paper B was published in the first edition
of a now-established high-quality workshop, IEEE EmNets 2004. Paper C
was published at the REALWSN 2005 workshop on real-world wireless sensor
networks.
Paper A presents and evaluates the lwIP and uIP TCP/IP stacks. The event-
driven nature of uIP forms the basis of the Contiki operating system introduced
in Paper B. Paper B presents Contiki and the dynamic module loading mech-
anism and the per-process optional multi-threading in Contiki. The dynamic
loading mechanism in Contiki is further developed and evaluated in Paper E.
The multi-threading mechanism in Contiki, presented in Paper B, is the first
step towards the protothreads mechanism that I introduce in Paper C. Paper
D refines, extends, and evaluates the protothreads mechanism. Paper C also
includes a qualitative comparison between protothreads, events, and threads
that is not included in Paper D. Papers B, C, D, and E show how the research
Chapter 4
Related Work

4.1 Small TCP/IP Implementations

Unlike many of the tailored TCP/IP implementations for small systems, the
uIP and the lwIP stacks support both IP fragment reassembly and a variable
maximum TCP segment size.
In addition to the TCP/IP implementations for small embedded systems,
there is a large class of TCP/IP implementations for embedded systems with
less constraining limitations. Typically, such implementations are based on the
TCP/IP implementation from the BSD operating system [52]. These imple-
mentations do not suffer from the same problems as the tailored implementa-
tions. However, such implementations require too large amounts of resources
to be feasible for memory-constrained embedded systems. Such implementa-
tions are typically orders of magnitude larger than uIP.
SOS [26], which was published after Contiki, is similar in design
to Contiki: SOS consists of a small kernel and dynamically-loaded modules.
However, unlike Contiki, SOS uses position-independent code to achieve relo-
cation, and jump tables for application programs to access the operating system
kernel. Application programs can register function pointers with the operat-
ing system for performing inter-process communication. Position independent
code is not available for all platforms, however, which limits the applicability
of this approach.
Examples of stack-based virtual machines for sensor networks are DVM [7]
and my CVM Contiki virtual machine (Paper E). There are also Java-based
virtual machines for sensor networks apart from the Java VM in Paper E, such
as the VM⋆ system [38].
Chapter 5
Conclusions and Future Work
5.1 Conclusions
In this thesis I investigate three aspects of programming memory-constrained
networked embedded systems: the use of the TCP/IP protocol stack for
memory-constrained systems; the novel protothread programming abstraction
for replacing state machine-based programs in event-driven systems; and the use of dy-
namic loading and linking of native code in an operating system for memory-
constrained embedded systems. A general theme throughout this work is
how applicable standard or general-purpose mechanisms and methods are to
memory-constrained systems. I have identified and quantified trade-offs in
both the use of general-purpose mechanisms for memory-constrained systems
and the use of a general-purpose programming language for reducing the com-
plexity of memory-efficient programming for memory-constrained systems.
In this thesis I show that the standard TCP/IP protocol stack can be im-
plemented efficiently enough to be usable even in memory-constrained sys-
tems, but that such an implementation leads to a significant decrease in net-
work throughput. Furthermore, the results show that protothreads reduce the
complexity of state machine-based event-driven programs, while having a very
small memory and run-time overhead. Finally, the Contiki operating system
shows that dynamic loading and run-time dynamic linking of native code is a
feasible mechanism for memory-constrained networked embedded systems.
There are at least two conclusions that can be drawn from my research.
First, in many cases it is possible to use standard protocols and mechanisms
developed for general-purpose computers even in memory-constrained embed-
ded systems. However, there are trade-offs in terms of both memory footprint
and energy. Second, protothreads show that it is possible to combine features
from multi-threading and event-driven programming: sequential programming
from the multi-threaded model and a small memory overhead from the event-
driven model. Since a slightly limited version of protothreads can be imple-
mented in the general-purpose C programming language, it is possible to do
memory-efficient sequential programming in C without requiring the use of a
special-purpose programming language. However, there are trade-offs due to
the limitations of the C-based implementation of protothreads.
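To make the mechanism concrete, the following is a minimal sketch of the
C preprocessor implementation idea from Papers C and D: the local contin-
uation is stored as a two-byte line number, and a blocking wait expands to a
case label inside a switch statement. The macro names follow the protothreads
API, but the sketch is simplified, and it exhibits the C limitation alluded to
above, since two wait statements must not share a source line.

    #include <stdio.h>

    /* The local continuation: a two-byte line number. */
    struct pt { unsigned short lc; };

    #define PT_WAITING 0
    #define PT_ENDED   1

    #define PT_INIT(pt)          (pt)->lc = 0
    #define PT_BEGIN(pt)         switch((pt)->lc) { case 0:
    #define PT_WAIT_UNTIL(pt, c) (pt)->lc = __LINE__; case __LINE__: \
                                 if(!(c)) return PT_WAITING
    #define PT_END(pt)           } (pt)->lc = 0; return PT_ENDED

    static int counter;

    /* A protothread that blocks until the counter passes a threshold. */
    static int example(struct pt *pt)
    {
      PT_BEGIN(pt);
      PT_WAIT_UNTIL(pt, counter > 3);  /* "blocks" across invocations */
      printf("counter reached %d\n", counter);
      PT_END(pt);
    }

    int main(void)
    {
      struct pt pt;
      PT_INIT(&pt);
      while(example(&pt) != PT_ENDED) {
        counter++;                     /* the event that unblocks the wait */
      }
      return 0;
    }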
5.2 Future Work

It may be possible to apply the single-buffer management scheme used in uIP
to other types of network protocols, such as power cycling MAC pro-
tocols for sensor networks. By using a single buffer, rather than a multi-buffer
management scheme, it might be possible to reduce both the data memory re-
quirements and the code footprint. However, additional mechanisms might be
needed to ensure that no packet queuing is required.
The research in this thesis suggests that there is a relation between imple-
mentation complexity in terms of memory footprint and the achievable appli-
cation performance; the smaller uIP implementation achieves much lower net-
work throughput than the larger lwIP implementation. A system designer that
requires high throughput can choose the larger lwIP stack, whereas a system
designer that requires a low memory footprint, but does not have any through-
put requirements, can choose the smaller uIP stack. It would be interesting to
investigate if this relation exists in other types of protocols as well. For exam-
ple, many protocols for distributing software updates through sensor networks
have been developed [30, 33, 45, 55, 64], each having different properties in
terms of both performance and footprint. A system designer would want to be
able to choose how much system resources to commit to the software update
feature and what kind of performance to expect from the feature, depending on
the requirements of particular deployments. By implementing a set of software
dissemination protocols and measuring their memory footprint and achievable
throughput it might be possible to quantify trade-offs between memory foot-
print and performance.
The systems in this thesis are built using a “bottom-up” approach rather
than a “top-down” approach. Instead of breaking down the system into smaller
components that are independently implemented and composed into a com-
plete system, I have started from the bottom with a set of simple building
blocks from which I have built a system that is similar to, but not always
equivalent to, a system built from the top down. It would be interesting to use
this approach for building new programming mechanisms such as abstractions
for programming of aggregates of memory-constrained networked embedded
systems, so-called macro-programming. Other approaches towards building
macro-programming systems have started by defining the programming inter-
face and decomposing this into smaller modules that are executed by each indi-
vidual node, e.g. the work by Newton et al. [56, 57] and Gummadi et al. [24].
By instead building upwards from a set of small and simple node-level abstrac-
tions to an abstract interface for network-level programming it may be possible
to build macro-programming systems that are more efficient in terms of both
implementation complexity and network communication than systems built in
a top-down fashion.
Chapter 6
Other Software and Publications
6.1 Software
In addition to the software I have developed as part of this thesis I have written
a number of other programs that are relevant to the research in this thesis.
VNC client, SMTP client, HTTP client, IRC client, and web browser
Proof-of-concept implementations of a number of application-layer
protocols built on top of uIP and Contiki. The purpose was to investigate
if the event-driven interface of uIP was expressive enough to implement
complex application layer protocols. However, it turned out that many of
the protocols required complex state machines on top of the event-driven
interface of uIP. This insight led me to develop protothreads. The
source code is available at http://www.sics.se/contiki/.
6.2 Publications
The following peer-reviewed publications were authored or co-authored by me
during the work on this thesis but are not included in the thesis.
• Helena Rivas, Thiemo Voigt, and Adam Dunkels. A simple and efficient
method to mitigate the hot spot problem in wireless sensor networks. In
Workshop on Performance Control in Wireless Sensor Networks, Coim-
bra, Portugal, May 2006.
• Muneeb Ali, Umar Saif, Adam Dunkels, Thiemo Voigt, Kay Römer,
Koen Langendoen, Joseph Polastre, and Zartash Afzal Uzmi. Medium
access control issues in sensor networks. ACM SIGCOMM Computer
Communication Review, April 2006.
• Adam Dunkels, Richard Gold, Sergio Angel Marti, Arnold Pears, and
Mats Uddenfeldt. Janus: An architecture for flexible access to sensor
networks. In First International ACM Workshop on Dynamic Intercon-
nection of Networks (DIN’05), Cologne, Germany, September 2005.
6.2 Publications 33
• Hartmut Ritter, Jochen Schiller, Thiemo Voigt, Adam Dunkels, and Juan
Alonso. Experimental Evaluation of Lifetime Bounds for Wireless Sen-
sor Networks. In Proceedings of the Second European Workshop on
Sensor Networks (EWSN2005), Istanbul, Turkey, January 2005.
• Adam Dunkels, Thiemo Voigt, Niclas Bergman, and Mats Jönsson. The
Design and Implementation of an IP-based Sensor Network for Intru-
sion Monitoring. In Swedish National Computer Networking Workshop,
Karlstad, Sweden, November 2004.
• Adam Dunkels, Thiemo Voigt, Juan Alonso, and Hartmut Ritter. Dis-
tributed TCP caching for wireless sensor networks. In Proceedings of
the Third Annual Mediterranean Ad Hoc Networking Workshop (Med-
HocNet 2004), June 2004.
• Thiemo Voigt, Hartmut Ritter, Jochen Schiller, Adam Dunkels, and Juan
Alonso. Solar-aware Clustering in Wireless Sensor Networks. In Pro-
ceedings of the Ninth IEEE Symposium on Computers and Communica-
tions, June 2004.
• Juan Alonso, Adam Dunkels, and Thiemo Voigt. Bounds on the energy
consumption of routings in wireless sensor networks. In Proceedings
of the 2nd WiOpt, Modeling and Optimization in Mobile, Ad Hoc and
Wireless Networks, Cambridge, UK, March 2004.
• Adam Dunkels, Thiemo Voigt, and Juan Alonso. Making TCP/IP Viable
for Wireless Sensor Networks. In Proceedings of the First European
Workshop on Wireless Sensor Networks (EWSN 2004), work-in-progress
session, Berlin, Germany, January 2004.
34 Chapter 6. Other Software and Publications
• Adam Dunkels, Thiemo Voigt, Juan Alonso, Hartmut Ritter, and Jochen
Schiller. Connecting Wireless Sensornets with TCP/IP Networks. In
Proceedings of the Second International Conference on Wired/Wireless
Internet Communications (WWIC2004), Frankfurt (Oder), Germany,
February 2004.
• Laura Marie Feeney, Bengt Ahlgren, Assar Westerlund, and Adam
Dunkels. Spontnet: Experiences in configuring and securing small ad
hoc networks. In Proceedings of The Fifth International Workshop on
Network Appliances (IWNA5), Liverpool, UK, October 2002.
Bibliography
[9] D. Barnett and A. J. Massa. Inside the uIP Stack. Dr. Dobb’s Journal,
February 2005.
[11] J. Bentham. TCP/IP Lean: Web servers for embedded systems. CMP
Books, October 2000.
[19] C. Frank and K. Römer. Algorithms for generic role assignment in wire-
less sensor networks. In Proceedings of the 3rd international conference
on Embedded networked sensor systems (SenSys ’05), San Diego, Cali-
fornia, USA, November 2005.
[22] J. Grenning. Why are you still using C? Embedded Systems Design, April
2003.
[28] T. F. Herbert. The Linux TCP/IP Stack: Networking For Embedded Sys-
tems. Charles River Media, 2004.
[58] J. Paisley and J. Sventek. Real-time detection of grid bulk transfer traffic.
In Proceedings of the 10th IEEE/IFIP Network Operations Management
Symposium, Vancouver, Canada, April 2006.
[61] K. Römer and F. Mattern. The design space of wireless sensor networks.
IEEE Wireless Communications, 11(6):54–61, December 2004.
[66] J. Turley. The Two Percent Solution. Embedded Systems Design, Decem-
ber 2002.
[67] R. von Behren, J. Condit, and E. Brewer. Why events are a bad idea (for
high-concurrency servers). In Proceedings of the 9th Workshop on Hot
Topics in Operating Systems, Lihue (Kauai), Hawaii, USA, May 2003.
[68] M. Welsh and G. Mainland. Programming Sensor Networks Using Ab-
stract Regions. In Proceedings of ACM/Usenix Networked Systems De-
sign and Implementation (NSDI’04), San Francisco, California, USA,
March 2004.
[69] M. Welsh, D. Myung, M. Gaynor, and S. Moulton. Resuscitation moni-
toring with a wireless sensor network. In Circulation 108:1037: Journal
of the American Heart Association, Resuscitation Science Symposium.,
October 2003.
[70] G. Werner-Allen, K. Lorincz, J. Johnson, J. Lees, and M. Welsh. Fidelity
and yield in a volcano monitoring sensor network. In Proceedings of the
7th USENIX Symposium on Operating Systems Design and Implementa-
tion 2006, Seattle, November 2006.
[71] K. Whitehouse, C. Sharp, E. Brewer, and D. Culler. Hood: a neighbor-
hood abstraction for sensor networks. In Proceedings of the 2nd interna-
tional conference on Mobile systems, applications, and services (MobiSys
’04), Boston, MA, USA, June 2004.
[72] Q. Xie, J. Liu, and P. H. Chou. Tapper: a lightweight scripting engine
for highly constrained wireless sensor nodes. In Proceedings of the fifth
international conference on Information processing in sensor networks
(IPSN ’06), Poster session, Nashville, Tennessee, USA, 2006.
II
Papers
Chapter 7
Paper A:
Full TCP/IP for 8-Bit Architectures
Abstract
7.1 Introduction
With the success of the Internet, the TCP/IP protocol suite has become a global
standard for communication. TCP/IP is the underlying protocol used for web
page transfers, e-mail transmissions, file transfers, and peer-to-peer network-
ing over the Internet. For embedded systems, being able to run native TCP/IP
makes it possible to connect the system directly to an intranet or even the global
Internet. Embedded devices with full TCP/IP support will be first-class net-
work citizens, thus being able to fully communicate with other hosts in the
network.
Traditional TCP/IP implementations have required far too many resources
both in terms of code size and memory usage to be useful in small 8 or 16-
bit systems. Code size of a few hundred kilobytes and RAM requirements of
several hundreds of kilobytes have made it impossible to fit the full TCP/IP
stack into systems with a few tens of kilobytes of RAM and room for less than
100 kilobytes of code.
TCP [21] is both the most complex and the most widely used of the trans-
port protocols in the TCP/IP stack. TCP provides reliable full-duplex byte
stream transmission on top of the best-effort IP [20] layer. Because IP may
reorder or drop packets between the sender and the receiver, TCP has to im-
plement sequence numbering and retransmissions in order to achieve reliable,
ordered data transfer.
We have implemented two small generic and portable TCP/IP implemen-
tations, lwIP (lightweight IP) and uIP (micro IP), with slightly different design
goals. The lwIP implementation is a full-scale but simplified TCP/IP imple-
mentation that includes implementations of IP, ICMP, UDP and TCP and is
modular enough to be easily extended with additional protocols. lwIP has sup-
port for multiple local network interfaces and has flexible configuration options
which makes it suitable for a wide variety of devices.
The uIP implementation is designed to have only the absolute minimal set
of features needed for a full TCP/IP stack. It can only handle a single network
interface and does not implement UDP, but focuses on the IP, ICMP and TCP
protocols.
Both implementations are fully written in the C programming language.
We have made the source code available for both lwIP [7] and uIP [8]. Our
implementations have been ported to numerous 8- and 16-bit platforms such
as the AVR, H8S/300, 8051, Z80, ARM, M16c, and the x86 CPUs. Devices
running our implementations have been used in numerous places throughout
the Internet.
We have studied how the code size and RAM usage of a TCP/IP implemen-
tation affect the features of the TCP/IP implementation and the performance of
the communication. We have limited our work to studying the implementa-
tion of TCP and IP protocols and the interaction between the TCP/IP stack and
the application programs. Aspects such as address configuration, security, and
energy consumption are out of the scope of this work.
The main contribution of our work is that we have shown that it is possible
to implement a full TCP/IP stack that is small enough in terms of code size and
memory usage to be useful even in limited 8-bit systems.
Recently, other small implementations of the TCP/IP stack have made it
possible to run TCP/IP in small 8-bit systems. Those implementations are of-
ten heavily specialized for a particular application, usually an embedded web
server, and are not suited for handling generic TCP/IP protocols. Future em-
bedded networking applications such as peer-to-peer networking require that
the embedded devices are able to act as first-class network citizens and run a
TCP/IP implementation that is not tailored for any specific application.
Furthermore, existing TCP/IP implementations for small systems assume
that the embedded device always will communicate with a full-scale TCP/IP
implementation running on a workstation-class machine. Under this assump-
tion, it is possible to remove certain TCP/IP mechanisms that are very rarely
used in such situations. Many of those mechanisms are essential, however, if
the embedded device is to communicate with another equally limited device,
e.g., when running distributed peer-to-peer services and protocols.
This paper is organized as follows. After a short introduction to TCP/IP
in Section 7.2, related work is presented in Section 7.3. Section 7.4 discusses
RFC standards compliance. How memory and buffer management is done in
our implementations is presented in Section 7.5 and the application program
interface is discussed in Section 7.6. Details of the protocol implementations
are given in Section 7.7, and Section 7.8 comments on the performance and max-
imum throughput of our implementations, presents throughput measurements
from experiments and reports on the code size of our implementations. Section
7.9 gives ideas for future work. Finally, the paper is summarized and concluded
in Section 7.10.
7.2 TCP/IP overview

[Figure 7.1: Incoming packets from the network device pass through the
TCP/IP stack and are delivered to the applications.]

The TCP/IP stack demultiplexes incoming packets among the open
connections. Before the data is delivered to the application, TCP sorts the
packets so that they appear in the order they were sent. The TCP/IP stack will
also send acknowledgments for the received packets.
Figure 7.1 shows how packets come from the network device, pass through
the TCP/IP stack, and are delivered to the actual applications. In this example
there are five active connections, three that are handled by a web server ap-
plication, one that is handled by the e-mail sender application, and one that is
handled by a data logger application.
[Figure 7.2: Outgoing data from the applications is collected by the TCP/IP
stack into packets before being handed to the network interface.]
A high level view of the output processing can be seen in Figure 7.2. The
TCP/IP stack collects the data sent by the applications before it is actually
sent onto the network. TCP has mechanisms for limiting the amount of data
that is sent over the network, and each connection has a queue on which the
data is held while waiting to be transmitted. The data is not removed from
the queue until the receiver has acknowledged the reception of the data. If no
acknowledgment is received within a specific time, the data is retransmitted.
Data arrives asynchronously from both the network and the application,
and the TCP/IP stack maintains queues in which packets are kept waiting for
service. Because packets might be dropped or reordered by the network, in-
coming packets may arrive out of order. Such packets have to be queued by the
TCP/IP stack until a packet that fills the gap arrives. Furthermore, because TCP
limits the rate at which data can be transmitted over each TCP connection,
application data might not be immediately sent out onto the network.
The full TCP/IP suite consists of numerous protocols, ranging from low
level protocols such as ARP which translates IP addresses to MAC addresses,
to application level protocols such as SMTP that is used to transfer e-mail.
We have concentrated our work on the TCP and IP protocols and will refer
to upper layer protocols as “the application”. Lower layer protocols are often
implemented in hardware or firmware and will be referred to as “the network
device”, which is controlled by the network device driver.
TCP provides a reliable byte stream to the upper layer protocols. It breaks
the byte stream into appropriately sized segments and each segment is sent in
its own IP packet. The IP packets are sent out on the network by the network
device driver. If the destination is not on the physically connected network,
the IP packet is forwarded onto another network by a router that is situated
between the two networks. If the maximum packet size of the other network
is smaller than the size of the IP packet, the packet is fragmented into smaller
packets by the router. If possible, the size of the TCP segments is chosen so
that fragmentation is minimized. The final recipient of the packet will have
to reassemble any fragmented IP packets before they can be passed to higher
layers.
7.3 Related work

Because congestion is caused by the number of packets in the network, and
not the size of these packets, even small 8-bit systems
are able to produce enough traffic to cause congestion. A TCP/IP implementa-
tion lacking congestion control mechanisms should not be used over the global
Internet as it might contribute to congestion collapse [9].
Texas Instruments' MSP430 TCP/IP stack [6] and the TinyTCP code [4]
use another common simplification in that they can handle only one TCP con-
nection at a time. While this is a sensible simplification for many applications,
it seriously limits the usefulness of the TCP/IP implementation. For example,
it is not possible to communicate with two simultaneous peers with such an
implementation. The CMX Micronet stack [27] uses a similar simplification in
that it sets a hard limit of 16 on the maximum number of connections.
Yet another simplification that is used by LiveDevices Embedinet imple-
mentation [12] and others is to disregard the maximum segment size that a
receiver is prepared to handle. Instead, the implementation will send segments
that fit into an Ethernet frame of 1500 bytes. This works in many cases
because many hosts are able to receive packets that are 1500 bytes
or larger. Communication will fail, however, if the receiver is a system with
limited memory resources that is not able to handle packets of that size.
Finally, the most common simplification is to leave out support for re-
assembling fragmented IP packets. Even though fragmented IP packets are
quite infrequent [25], there are situations in which they may occur. If packets
travel over a path which fragments the packets, communication is impossible
if the TCP/IP implementation is unable to correctly reassemble them. TCP/IP
implementations that are able to correctly reassemble fragmented IP packets,
such as the Kadak KwikNET stack [22], are usually too large in terms of code
size and RAM requirements to be practical for 8-bit systems.
7.4 RFC-compliance
The formal requirements for the protocols in the TCP/IP stack are specified in a
number of RFC documents published by the Internet Engineering Task Force,
IETF. Each of the protocols in the stack is defined in one or more RFC documents
and RFC1122 [2] collects all requirements and updates the previous RFCs.
The RFC1122 requirements can be divided into two categories: those that
deal with the host to host communication and those that deal with communica-
tion between the application and the networking stack. An example of the first
kind is “A TCP MUST be able to receive a TCP option in any segment” and an
example of the second kind is “There MUST be a mechanism for reporting soft
TCP error conditions to the application”.

7.5 Memory and buffer management
After the headers are written, the stack passes the buffers to the network de-
vice driver. The buffers are not deallocated when the device driver is finished
sending the data, but held on a retransmission queue. If the data is lost in the
network and has to be retransmitted, the buffers on the retransmission queue will
be retransmitted. The buffers are not deallocated until the data is known to be
received by the peer. If the connection is aborted because of an explicit request
from the local application or a reset segment from the peer, the connection’s
buffers are deallocated.
In uIP, the same global packet buffer that is used for incoming packets is
also used for the TCP/IP headers of outgoing data. If the application sends
dynamic data, it may use the parts of the global packet buffer that are not used
for headers as a temporary storage buffer. To send the data, the application
passes a pointer to the data as well as the length of the data to the stack. The
TCP/IP headers are written into the global buffer and once the headers have
been produced, the device driver sends the headers and the application data
out on the network. The data is not queued for retransmissions. Instead, the
application will have to reproduce the data if a retransmission is necessary.
The total amount of memory usage for our implementations depends heav-
ily on the applications of the particular device in which the implementations
are to be run. The memory configuration determines both the amount of traffic
the system should be able to handle and the maximum amount of simultaneous
connections. A device that will be sending large e-mails while at the same time
running a web server with highly dynamic web pages and multiple simultane-
ous clients, will require more RAM than a simple Telnet server. It is possible
to run the uIP implementation with as little as 200 bytes of RAM, but such
a configuration will provide extremely low throughput and will only allow a
small number of simultaneous connections.
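Such a small-memory configuration might be expressed through uIP's
compile-time options. UIP_CONNS and UIP_BUFSIZE are the option names
used in uIP's uipopt.h, while the values below are purely illustrative:

    /* A deliberately small uIP memory configuration: one global packet
       buffer and very few simultaneous connections. Illustrative values. */
    #define UIP_BUFSIZE 400   /* size of the single global packet buffer  */
    #define UIP_CONNS   2     /* maximum number of simultaneous connections */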
7.7 Protocol implementations

7.7.1 Main control loop

If a packet has arrived, the input handler of the TCP/IP stack is invoked. The
input handler function will never block, but will return at once. When it returns,
the stack or the application for which the incoming packet was intended may
have produced one or more reply packets which should be sent out. If so, the
network device driver is called to send out these packets.
Periodic timeouts are used to drive TCP mechanisms that depend on timers,
such as delayed acknowledgments, retransmissions and round-trip time estima-
tions. When the main control loop infers that the periodic timer should fire, it
invokes the timer handler of the TCP/IP stack. Because the TCP/IP stack may
perform retransmissions when dealing with a timer event, the network device
driver is called to send out the packets that may have been produced.
This is similar to how the BSD implementations drive the TCP/IP stack,
but BSD uses software interrupts and a task scheduler to initiate input handlers
and timers. In our limited system, we do not depend on such mechanisms being
available.
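The loop described above might look as follows for uIP. This is a sketch
under assumptions: uip_input() and uip_periodic() are the uIP processing
entry points, and uip_len and UIP_CONNS are uIP globals, while
network_device_read(), network_device_send() and periodic_timer_expired()
stand in for a hypothetical network device driver and timer.

    #include "uip.h"

    /* Hypothetical driver functions: read a packet into the global uip_buf
       buffer, returning its length (0 if none has arrived), and transmit
       the uip_len bytes currently held in uip_buf. */
    unsigned int network_device_read(void);
    void network_device_send(void);
    int periodic_timer_expired(void);

    void main_control_loop(void)
    {
      int i;

      for(;;) {
        /* Run the input handler on any incoming packet; it never blocks,
           and a possible reply packet is left in uip_buf. */
        uip_len = network_device_read();
        if(uip_len > 0) {
          uip_input();
          if(uip_len > 0) {
            network_device_send();
          }
        } else if(periodic_timer_expired()) {
          /* Drive retransmissions, delayed acknowledgments and round-trip
             time estimation for each connection. */
          for(i = 0; i < UIP_CONNS; i++) {
            uip_periodic(i);
            if(uip_len > 0) {
              network_device_send();
            }
          }
        }
      }
    }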
7.7.2 IP — Internet Protocol

IP fragment reassembly
In both lwIP and uIP, IP fragment reassembly is implemented using a sepa-
rate buffer that holds the packet to be reassembled. An incoming fragment is
copied into the right place in the buffer and a bit map is used to keep track
of which fragments have been received. Because the first byte of an IP frag-
ment is aligned on an 8-byte boundary, the bit map requires a small amount of
memory. When all fragments have been reassembled, the resulting IP packet
is passed to the transport layer. If all fragments have not been received within
a specified time frame, the packet is dropped.
The current implementation only has a single buffer for holding packets
to be reassembled, and therefore does not support simultaneous reassembly of
more than one packet. Since fragmented packets are uncommon, we believe this
to be a reasonable decision. Extending our implementation to support multiple
buffers would be straightforward, however.
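As an illustration of the bookkeeping involved, the sketch below copies a
fragment into a fixed-size reassembly buffer and marks the received 8-byte
blocks in a bit map; because IP fragment offsets are expressed in 8-byte units,
one bit per block suffices. All names and the buffer size are illustrative, not
the actual uIP or lwIP definitions, and the sketch handles only a single packet,
matching the design decision described above.

    #include <stdint.h>
    #include <string.h>

    #define REASM_BUFSIZE 600                /* illustrative buffer size */

    static uint8_t reasm_buf[REASM_BUFSIZE]; /* packet being reassembled */
    static uint8_t reasm_bitmap[(REASM_BUFSIZE / 8 + 7) / 8]; /* 1 bit per
                                                8-byte block of the buffer */

    /* Copy one fragment into place. offset8 is the fragment offset from
       the IP header, counted in 8-byte units as IP specifies. */
    static void reasm_add_fragment(const uint8_t *data, int len, int offset8)
    {
      int i;
      int first = offset8;
      int last = offset8 + (len + 7) / 8;

      if(offset8 * 8 + len > REASM_BUFSIZE) {
        return;                              /* fragment does not fit: drop */
      }
      memcpy(&reasm_buf[offset8 * 8], data, len);
      for(i = first; i < last; i++) {
        reasm_bitmap[i / 8] |= 1 << (i & 7); /* mark block as received */
      }
    }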
7.7.3 ICMP — Internet Control Message Protocol

Since only the ICMP echo message is implemented, there is no support for
Path MTU discovery or ICMP redirect messages. Neither of these is strictly
required for interoperability; they are performance enhancement mechanisms.
7.7.4 TCP — Transmission Control Protocol

Listening connections
TCP allows a connection to listen for incoming connection requests. In our
implementations, a listening connection is identified by the 16-bit port number
and incoming connection requests are checked against the list of listening con-
nections. This list of listening connections is dynamic and can be altered by
the applications in the system.
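As a sketch, assuming the uip_listen() call and the HTONS() macro from the
uIP API (port numbers are kept in network byte order), an application adds a
listening port like this:

    #include "uip.h"

    static void start_listening(void)
    {
      /* Add TCP port 80 to the dynamic list of listening connections;
         incoming connection requests are checked against this list. */
      uip_listen(HTONS(80));
    }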
Sending data
When sending data, an application will have to check the number of available
bytes in the send window and adjust the number of bytes to send accordingly.
The size of the send window is dictated by the memory configuration as well
as the buffer space announced by the receiver of the data. If no buffer space is
available, the application has to defer the send and wait until later.
Buffer space becomes available when an acknowledgment from the re-
ceiver of the data has been received. The stack informs the application of this
event, and the application may then repeat the sending procedure.
Sliding window
Most TCP implementations use a sliding window mechanism for sending data.
Multiple data segments are sent in succession without waiting for an acknowl-
edgment for each segment.
The sliding window algorithm uses a lot of 32-bit operations and because
32-bit arithmetic is fairly expensive on most 8-bit CPUs, uIP does not im-
plement it. Also, uIP does not buffer sent packets and a sliding window im-
plementation that does not buffer sent packets will have to be supported by a
complex application layer. Instead, uIP allows only a single TCP segment per
connection to be unacknowledged at any given time. lwIP, on the other hand,
implements TCP’s sliding window mechanism using output buffer queues and
therefore does not add additional complexity to the application layer.
It is important to note that even though most TCP implementations use
the sliding window algorithm, it is not required by the TCP specifications.
Removing the sliding window mechanism does not affect interoperability in
any way.
Retransmissions
Retransmissions are driven by the periodic TCP timer. Every time the periodic
timer is invoked, the retransmission timer for each connection is decremented.
If the timer reaches zero, a retransmission should be made.
The actual retransmission operation is handled differently in uIP and in
lwIP. lwIP maintains two output queues: one holds segments that have not
yet been sent, the other holds segments that have been sent but not yet been
acknowledged by the peer. When a retransmission is required, the first segment
on the queue of segments that has not been acknowledged is sent. All other
segments in the queue are moved to the queue with unsent segments.
As uIP does not keep track of packet contents after they have been sent
by the device driver, uIP requires that the application takes an active part in
performing the retransmission. When uIP decides that a segment should be
retransmitted, it calls the application with a flag set indicating that a retransmis-
sion is required. The application checks the retransmission flag and produces
the same data that was previously sent. From the application’s standpoint, per-
forming a retransmission is not different from how the data originally was sent.
Therefore the application can be written in such a way that the same code is
used both for sending data and retransmitting data. Also, it is important to note
that even though the actual retransmission operation is carried out by the ap-
plication, it is the responsibility of the stack to know when the retransmission
should be made. Thus the complexity of the application does not necessarily
increase because it takes an active part in doing retransmissions.
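The resulting application structure can be sketched as follows, reusing the senddata() sketch from above. The test macros uip_rexmit(), uip_acked(), uip_connected(), and uip_poll() are part of the uIP API; advance_send_state() is a hypothetical application function.

    /* Sketch of an application handler in which the same code path
       serves both first-time sends and retransmissions. */
    #include "uip.h"

    static void advance_send_state(void);  /* hypothetical */

    static void app_call(void)
    {
      if(uip_acked()) {
        advance_send_state();   /* previous segment arrived: move on */
      }
      if(uip_connected() ||     /* connection just established */
         uip_rexmit()    ||     /* stack requests a retransmission */
         uip_acked()     ||     /* previous data acked: next chunk */
         uip_poll()) {          /* periodic poll: new data may be sent */
        senddata();             /* regenerates the same data on rexmit */
      }
    }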
Flow control
Congestion control
Urgent data
TCP’s urgent data mechanism provides an application-to-application notifica-
tion mechanism, which can be used by an application to mark parts of the data
stream as being more urgent than the normal stream. It is up to the receiving
application to interpret the meaning of the urgent data.
In many TCP implementations, including the BSD implementation, the ur-
gent data feature increases the complexity of the implementation because it
requires an asynchronous notification mechanism in an otherwise synchronous
API. As our implementations already use an asynchronous event-based API,
the implementation of the urgent data feature does not lead to increased com-
plexity.
Connection state
Each TCP connection requires a certain amount of state information in the
embedded device. Because the state information uses RAM, we have aimed
towards minimizing the amount of state needed for each connection in our
implementations.
The uIP implementation, which does not use the sliding window mech-
anism, requires far less state information than the lwIP implementation. The
sliding window implementation requires that the connection state includes sev-
eral 32-bit sequence numbers, not only for keeping track of the current se-
quence numbers of the connection, but also for remembering the sequence
numbers of the last window updates. Furthermore, because lwIP is able to
handle multiple local IP addresses, the connection state must include the lo-
cal IP address. Finally, as lwIP maintains queues for outgoing segments, the
memory for the queues is included in the connection state. This makes the
state information needed for lwIP nearly 60 bytes larger than that of uIP, which
requires only 30 bytes per connection.
7.8 Results
7.8.1 Performance limits
In TCP/IP implementations for high-end systems, processing time is domi-
nated by the checksum calculation loop, the operation of copying packet data
and context switching [15]. Operating systems for high-end systems often have
multiple protection domains for protecting kernel data from user processes and
user processes from each other. Because the TCP/IP stack is run in the ker-
nel, data has to be copied between the kernel space and the address space of
the user processes and a context switch has to be performed once the data has
been copied. Performance can be enhanced by combining the copy operation
with the checksum calculation [19]. Because high-end systems usually have
numerous active connections, packet demultiplexing is also an expensive oper-
ation [17].
A small embedded device does not have the processing power required for
multiple protection domains or for running a multitasking operating
system. Therefore there is no need to copy data between the TCP/IP stack and
the application program. With an event-based API there is no context switch
between the TCP/IP stack and the applications.
In such limited systems, the TCP/IP processing overhead is dominated by
the copying of packet data from the network device to host memory, and check-
sum calculation. Apart from the checksum calculation and copying, the TCP
processing done for an incoming packet involves only updating a few counters
and flags before handing the data over to the application. Thus an estimate
of the CPU overhead of our TCP/IP implementations can be obtained by cal-
culating the amount of CPU cycles needed for the checksum calculation and
copying of a maximum sized packet.
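For reference, the checksum loop in question is the standard Internet checksum (RFC 1071) one's-complement sum; a straightforward byte-oriented version looks like this (a sketch, not the optimized code from the implementations):

    /* Standard Internet checksum (RFC 1071), byte-oriented sketch. */
    static unsigned short chksum(const unsigned char *data, int len)
    {
      unsigned long sum = 0;

      while(len > 1) {
        sum += (data[0] << 8) | data[1];    /* add 16-bit words */
        data += 2;
        len -= 2;
      }
      if(len > 0) {
        sum += data[0] << 8;                /* pad odd trailing byte */
      }
      while(sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16); /* fold in the carries */
      }
      return (unsigned short)~sum;
    }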
7.8.3 Measurements
For our experiments we connected a 450 MHz Pentium III PC running
FreeBSD 4.7 to an Ethernut board [16] through a dedicated 10 megabit/second
Ethernet network. The Ethernut board is a commercially available embedded
system equipped with a RealTek RTL8019AS Ethernet controller, an Atmel
Atmega128 AVR microcontroller running at 14.7456 MHz with 128 kilobytes
of flash ROM for code storage and 32 kilobytes of RAM. The FreeBSD host
was configured to run the Dummynet delay emulator software [24] in order
to facilitate controlled delays for the communication between the PC and the
embedded system.
In the embedded system, a simple web server was run on top of the uIP and
lwIP stacks. Using the fetch file retrieval utility, a file consisting of null bytes
was downloaded ten times from the embedded system. The reported through-
put was logged, and the mean throughput of the ten downloads was calculated.
By redirecting file output to /dev/null, the file was immediately discarded
by the FreeBSD host. The file size was 200 kilobytes for the uIP tests, and 200
megabytes for the lwIP tests. The size of the file made it impossible to keep it
all in the memory of the embedded system. Instead, the file was generated by
the web server as it was sent out on the network.
The total TCP/IP memory consumption in the embedded system was var-
ied by changing the send window size. For uIP, the send window was varied
between 50 bytes and the maximum possible value of 1450 bytes in steps of
50 bytes. The send window configuration translates into a total RAM usage of
between 400 bytes and 3 kilobytes. The lwIP send window was varied between
500 and 11000 bytes in steps of 500 bytes, leading to a total RAM consumption
of between 5 and 16 kilobytes.
Figure 7.4 shows the mean throughput of the ten file downloads from the
web server running on top of uIP, with an additional 10 ms delay created by
the Dummynet delay emulator. The two curves show the measured through-
put with the delayed acknowledgment algorithm disabled and enabled at the
receiving FreeBSD host, respectively. The performance degradation caused by
the delayed acknowledgments is evident.

Figure 7.4: uIP throughput as a function of send window size, with delayed
acknowledgments disabled and enabled at the receiver.
Figure 7.5 shows the same setup, but without the 10 ms emulated delay.
The lower curve, showing the throughput with delayed acknowledgments en-
abled, is very similar to the lower one in Figure 7.4. The upper curve, however,
does not show the same linear relation as the previous figure, but shows an
increasing throughput where the increase declines with increasing send win-
dow size. One explanation for the declining increase of throughput is that the
round-trip time increases with the send window size because of the increased
per-packet processing time. Figure 7.6 shows the round-trip time as a func-
tion of packet size. These measurements were taken using the ping program
and therefore include the cost for the packet copying operation twice: once for
packet input and once for packet output.
The throughput of lwIP shows slightly different characteristics. Figure 7.7
shows three measured throughput curves, without emulated delay, and with
emulated delays of 10 ms and 20 ms. For all measurements, the delayed ac-
knowledgment algorithm is enabled at the FreeBSD receiver. We see that for
small send window sizes, lwIP also suffers from the delayed acknowledgment
throughput degradation. With a send window larger than two maximum TCP
segment sizes (3000 bytes), lwIP is able to send out two TCP segments per
round-trip time and thereby avoids the delayed acknowledgments throughput
degradation. Without emulated delay, the throughput quickly reaches a maxi-
mum of about 415 kilobytes per second. This limit is likely to be the processing
limit of the lwIP code in the embedded system and therefore is the maximum
possible throughput for lwIP in this particular system.
The maximum throughput with emulated delays is lower than without delay
emulation, and the similarity of the two curves suggests that the throughput
degradation could be caused by interaction with the Dummynet software.
Figure 7.6: Round-trip time as a function of packet size.
Compared to the uIP implementation, lwIP has significantly more complex
buffer and memory management. Since lwIP can handle packets that span several buffers, the
checksum calculation functions in lwIP are more complex than those in uIP.
The support for dynamically changing network interfaces in lwIP also con-
tributes to the size increase of the IP layer because the IP layer has to manage
multiple local IP addresses. The IP layer in lwIP is further made larger by
the fact that lwIP has support for UDP, which requires that the IP layer is able to
handle broadcast and multicast packets. Likewise, the ICMP implementation
in lwIP has support for UDP error messages which have not been implemented
in uIP.
The TCP implementation in lwIP is nearly twice as large as the full IP,
ICMP and TCP implementation in uIP. The main reason for this is that lwIP
implements the sliding window mechanism, which requires a large amount of
buffer and queue management code.
Figure 7.7: lwIP sending data with and without emulated delays.
7.11 Acknowledgments
Many thanks go to Martin Nilsson, who has provided encouragement and been
a source of inspiration throughout the preparation of this paper. Thanks also go
to Deborah Wallach for comments and suggestions, the anonymous reviewers
whose comments were highly appreciated, and to all who have contributed
bugfixes, patches and suggestions to the lwIP and uIP implementations.
Bibliography
[1] J. Bentham. TCP/IP Lean: Web servers for embedded systems. CMP
Books, October 2000.
[5] Atmel Corporation. Embedded web server. AVR 460, January 2001.
Available from www.atmel.com.
[8] A. Dunkels. uIP - a TCP/IP stack for 8- and 16-bit microcontrollers. Web
page. 2003-10-21.
URL: http://dunkels.com/adam/uip/
[9] S. Floyd and K. Fall. Promoting the use of end-to-end congestion control
in the internet. IEEE/ACM Transactions on Networking, August 1999.
[28] The GCC Team. The GNU compiler collection. Web page. 2002-10-14.
URL: http://gcc.gnu.org/
Chapter 8
Paper B:
Contiki - a Lightweight and
Flexible Operating System
for Tiny Networked Sensors
Abstract
8.1 Introduction
Wireless sensor networks are composed of large numbers of tiny sensor devices
with wireless communication capabilities. The sensor devices autonomously
form networks through which sensor data is transported. The sensor devices
are often severely resource constrained. An on-board battery or solar panel
can only supply limited amounts of power. Moreover, the small physical size
and low per-device cost limit the complexity of the system. Typical sensor
devices [1, 2, 5] are equipped with 8-bit microcontrollers, code memory on
the order of 100 kilobytes, and less than 20 kilobytes of RAM. Moore’s law
predicts that these devices can be made significantly smaller and less expensive
in the future. While this means that sensor networks can be deployed to greater
extents, it does not necessarily imply that the resources will be less constrained.
For the designer of an operating system for sensor nodes, the challenge lies
in finding lightweight mechanisms and abstractions that provide a rich enough
execution environment while staying within the limitations of the constrained
devices. We have developed Contiki, an operating system for such
constrained environments. Contiki provides dynamic loading and unloading of
individual programs and services. The kernel is event-driven, but the system
supports preemptive multi-threading that can be applied on a per-process basis.
Preemptive multi-threading is implemented as a library that is linked only with
programs that explicitly require multi-threading.
Contiki is implemented in the C language and has been ported to a number
of microcontroller architectures, including the Texas Instruments MSP430 and
the Atmel AVR. We are currently running it on the ESB platform [5]. The ESB
uses the MSP430 microcontroller with 2 kilobytes of RAM and 60 kilobytes
of ROM running at 1 MHz. The microcontroller has the ability to selectively
reprogram parts of the on-chip flash memory.
The contributions of this paper are twofold. Our first contribution is that
we show the feasibility of loadable programs and services even in a constrained
sensor device. The possibility to dynamically load individual programs leads
to a very flexible architecture, which still is compact enough for resource con-
strained sensor nodes. Our second contribution is more general in that we show
that preemptive multi-threading does not have to be implemented at the lowest
level of the kernel but that it can be built as an application library on top of an
event-driven kernel. This allows for thread-based programs running on top of
an event-based kernel, without the overhead of reentrancy or multiple stacks in
all parts of the system.
8.1.2 Portability
As the number of different sensor device platforms increases (e.g. [1, 2, 5]),
it is desirable to have a common software infrastructure that is portable across
hardware platforms. The currently available sensor platforms carry completely
different sets of sensors and communication devices. Due to the application
specific nature of sensor networks, we do not expect that this will change in
the future. The single unifying characteristic of today’s platforms is the CPU
architecture which uses a memory model without segmentation or memory
protection mechanisms. Program code is stored in reprogrammable ROM and
data in RAM. We have designed Contiki so that the only abstraction provided
by the base system is CPU multiplexing and support for loadable programs
and services. As a consequence of the application specific nature of sensor
networks, we believe that other abstractions are better implemented as libraries
or services, and we therefore provide mechanisms for dynamic service management.
Our experiences with using the system are discussed in Section 8.9. Finally, the
paper is concluded in Section 8.10.
EEPROM, from where it can be burned into flash ROM. Due to the multi-
threaded semantics, every Mantis program must have stack space allocated
from the system heap, and locking mechanisms must be used to achieve mu-
tual exclusion of shared variables. In contrast, Contiki uses an event based
scheduler without preemption, thus avoiding allocation of multiple stacks and
locking mechanisms. Preemptive multi-threading is provided by a library that
can be linked with programs that explicitly require it.
The preemptive multi-threading in Contiki is similar to fibers [4] and
the lightweight fibers approach by Welsh and Mainland [23]. Unlike the
lightweight fibers, Contiki does not limit the number of concurrent threads to
two. Furthermore, unlike fibers, threads in Contiki support preemption.
Like Exokernel [11] and Nemesis [16], Contiki tries to reduce the number
of abstractions that the kernel provides to a minimum [10]. Abstractions are
instead provided by libraries that have nearly full access to the underlying hard-
ware. While Exokernel strove for performance and Nemesis aimed at quality
of service, the purpose of the Contiki design is to reduce size and complexity,
as well as to preserve flexibility. Unlike Exokernel, Contiki does not support any
protection mechanisms, since the hardware for which Contiki is designed does
not support memory protection.
Figure 8.1: The partitioning into core and loaded programs in ROM and RAM.

The partitioning into core and loaded programs is made at compile time
and is specific to the deployment in which Contiki is used. Typically, the core
consists of the Contiki kernel, the program loader, the most commonly used
parts of the language run-time and support libraries, and a communication stack
with device drivers for the communication hardware. The core is compiled into
a single binary image that is stored in the devices prior to deployment. The core
is generally not modified after deployment, even though it should be noted that
it is possible to use a special boot loader to overwrite or patch the core.
Programs are loaded into the system by the program loader. The program
loader may obtain the program binaries either by using the communication
stack or by using directly attached storage such as EEPROM. Typically, pro-
grams to be loaded into the system are first stored in EEPROM before they are
programmed into the code memory.
As shown in Section 8.8, however, event handlers may use internal mechanisms
to achieve preemption.
The kernel supports two kinds of events: asynchronous and synchronous
events. Asynchronous events are a form of deferred procedure call: asyn-
chronous events are enqueued by the kernel and are dispatched to the target
process some time later. Synchronous events are similar to asynchronous events
but immediately cause the target process to be scheduled. Control returns to the
posting process only after the target has finished processing the event. This can
be seen as an inter-process procedure call and is similar to the door abstraction
used in the Spring operating system [14].
In addition to the events, the kernel provides a polling mechanism. Polling
can be seen as high priority events that are scheduled in-between each asyn-
chronous event. Polling is used by processes that operate near the hardware
to check for status updates of hardware devices. When a poll is scheduled, all
processes that implement a poll handler are called, in order of their priority.
The Contiki kernel uses a single shared stack for all process execution.
The use of asynchronous events reduces stack space requirements, as the stack
is rewound between each invocation of an event handler.
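The kernel structure described above can be sketched in C as follows; this is a minimal illustration rather than the actual Contiki kernel, and all names are ours.

    /* Sketch of an event kernel: a FIFO event queue plus polling. */
    struct event {
      void (*handler)(void *data);    /* target process' handler */
      void *data;
    };

    #define QUEUE_LEN 8
    static struct event queue[QUEUE_LEN];
    static int q_head, q_len;

    static int post_event(void (*handler)(void *), void *data)
    {
      if(q_len == QUEUE_LEN) {
        return -1;                    /* queue full */
      }
      queue[(q_head + q_len) % QUEUE_LEN].handler = handler;
      queue[(q_head + q_len) % QUEUE_LEN].data = data;
      q_len++;
      return 0;
    }

    static void do_polls(void)
    {
      /* call the poll handler of every process that has one,
         in order of priority (omitted in this sketch) */
    }

    static void kernel_loop(void)
    {
      struct event e;

      for(;;) {
        do_polls();                   /* high-priority polling first */
        if(q_len > 0) {
          e = queue[q_head];          /* dequeue next asynchronous event */
          q_head = (q_head + 1) % QUEUE_LEN;
          q_len--;
          e.handler(e.data);          /* runs to completion, shared stack */
        } else {
          /* queue empty: a power-save mechanism may halt the CPU here */
        }
      }
    }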
Figure 8.2: An application process calls a service through a service interface
stub; the service interface, with its version number and function pointers, is
implemented by the service process.
In sensor networks, being able to power down the node when the network is
inactive is often necessary to reduce energy consumption. Power conser-
vation mechanisms depend on both the applications [18] and the network pro-
tocols [20]. The Contiki kernel contains no explicit power save abstractions,
but lets the application specific parts of the system implement such mech-
anisms. To help the application decide when to power down the system, the
event scheduler exposes the size of the event queue. This information can be
used to power down the processor when there are no events scheduled. When
the processor wakes up in response to an interrupt, the poll handlers are run to
handle the external event.
8.5 Services
In Contiki, a service is a process that implements functionality that can be used
by other processes. A service can be seen as a form of a shared library. Ser-
vices can be dynamically replaced at run-time and must therefore be dynam-
ically linked. Typical examples of services include communication protocol
stacks, sensor device drivers, and higher level functionality such as sensor data
handling algorithms.
Services are managed by a service layer conceptually situated directly next
to the kernel. The service layer keeps track of running services and provides a
way to find installed services. A service is identified by a textual string that de-
scribes the service. The service layer uses ordinary string matching to query
installed services.
A service consists of a service interface and a process that implements the
interface. The service interface consists of a version number and a function
table with pointers to the functions that implement the interface.
Application programs using the service use a stub library to communicate
with the service. The stub library is linked with the application and uses the
service layer to find the service process. Once a service has been located, the
service stub caches the process ID of the service process and uses this ID for
all future requests.
Programs call services through the service interface stub and need not be
aware of the fact that a particular function is implemented as a service. The first
time the service is called, the service interface stub performs a service lookup in
the service layer. If the specified service exists in the system, the lookup returns
a pointer to the service interface. The version number in the service interface
is checked against the version of the interface stub. In addition to the version
number, the service interface contains pointers to the implementation of all
service functions. The function implementations are contained in the service
process. If the version number of the service stub matches the number in the
service interface, the interface stub calls the implementation of the requested
function.
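In C, such a service interface boils down to a struct holding a version number and function pointers. The sketch below illustrates the idea; the names are ours rather than Contiki's actual declarations, and for brevity the stub caches the interface pointer rather than a process ID.

    /* Sketch of a service interface and a stub call. */
    struct service_interface {
      unsigned char version;        /* interface version number */
      int (*function1)(int arg);    /* implemented by the service process */
      int (*function2)(void);
    };

    /* Hypothetical lookup in the service layer, by textual name. */
    extern struct service_interface *service_lookup(const char *name);

    #define INTERFACE_VERSION 1

    /* The stub looks the service up once, checks the version, and
       then calls through the function pointer table. */
    int stub_function1(int arg)
    {
      static struct service_interface *iface;

      if(iface == 0) {
        iface = service_lookup("communication");
      }
      if(iface == 0 || iface->version != INTERFACE_VERSION) {
        return -1;                  /* service missing or incompatible */
      }
      return iface->function1(arg);
    }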
8.6 Libraries
The Contiki kernel only provides the most basic CPU multiplexing and event
handling features. The rest of the system is implemented as system libraries
that are optionally linked with programs. Programs can be linked with libraries
in three different ways. First, programs can be statically linked with libraries
that are part of the core. Second, programs can be statically linked with li-
braries that are part of the loadable program. Third, programs can call services
implementing a specific library. Libraries that are implemented as services can
be dynamically replaced at run-time.
Typically, run-time libraries such as often-used parts of the language run-
time libraries are best placed in the Contiki core. Rarely used or application
specific libraries, however, are more appropriately linked with loadable pro-
grams. Libraries that are part of the core are always present in the system and
do not have to be included in loadable program binaries.
As an example, consider a program that uses the memcpy() and atoi()
functions to copy memory and to convert strings to integers, respectively. The
memcpy() function is a frequently used C library function, whereas atoi()
is used less often. Therefore, in this particular example, memcpy() has been
included in the system core but not atoi(). When the program is linked to
produce a binary, the memcpy() function will be linked against its static ad-
dress in the core. The object code for the part of the C library that implements
the atoi() function must, however, be included in the program binary.
mt_yield();
    Yield from the running thread.

mt_wait(event, dataptr);
    Wait for an event to be posted to the running thread.

mt_exit();
    Exit the running thread.

mt_exec(thread);
    Execute the specified thread until it yields or is preempted.
The mt_exec() function performs the actual scheduling of a thread and is called
from an event handler.
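A sketch of that pattern follows; mt_exec() is the library call listed above, while the thread variable and the event handler are ours.

    /* Sketch: an event handler that drives a preemptible thread. */
    static struct mt_thread computation;   /* thread control block */

    static void computation_event_handler(void)
    {
      /* Run the thread until it yields, blocks, or is preempted;
         the handler then completes and the kernel regains control. */
      mt_exec(&computation);
    }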
8.9 Discussion
We have used the Contiki operating system to implement a number of sen-
sor network applications such as multi-hop routing, motion detection with dis-
tributed sensor data logging and replication, and presence detection and notifi-
cation.
top of the system. Table 8.1 shows the compiled code size and the RAM us-
age of the Contiki system compiled for two architectures: the Texas Instru-
ments MSP430 and the Atmel AVR. The numbers report the size of both core
components and an example application: a sensor data replicator service. The
replicator service consists of the service interface stub for the service as well as
the implementation of the service itself. The program loader is currently only
implemented on the MSP430 platform.
The code size of Contiki is larger than that of TinyOS [15], but smaller than
that of the Mantis system [3]. Contiki’s event kernel is significantly larger than
that of TinyOS because of the different services provided. While the TinyOS
event kernel only provides a FIFO event queue scheduler, the Contiki kernel
supports both FIFO events and poll handlers with priorities. Furthermore,
the flexibility in Contiki requires more run-time code than for a system like
TinyOS, where compile time optimization can be done to a larger extent.
Figure 8.5: Round-trip time, in milliseconds, as a function of time in seconds.
8.9.3 Preemption
The purpose of preemption is to facilitate long running computations while be-
ing able to react on incoming events such as sensor input or incoming commu-
nication packets. Figure 8.5 shows how Contiki responds to incoming packets
during an 8 second computation running in a preemptible thread. The curve
is the measured round-trip time of 200 “ping” packets of 40 bytes each. The
computation starts after approximately 5 seconds and runs until 13 seconds
have passed. During the computation, the round-trip time increases slightly
but the system is still able to produce replies to the ping packets.
The packets are sent over a 57600 baud serial line with a spacing of 200
ms from a 1.4 GHz PC to an ESB node running Contiki. The packets are trans-
mitted over a serial line rather than over the wireless link in order to avoid radio
effects such as bit errors and MAC collisions. The computation consists of an
arbitrarily chosen sequence of multiplications and additions that are repeated
for about 8 seconds. The cause for the increase in round-trip time during the
computation is the cost of preempting the computation and restoring the kernel
context before the incoming packet can be handled. The jitter and the spikes of
about 0.3 milliseconds seen in the curve can be attributed to activity in other
parts of the system.
8.9.4 Portability
We have ported Contiki to a number of architectures, including the Texas In-
struments MSP430 and the Atmel AVR. Others have ported the system to the
Hitachi SH3 and the Zilog Z80. The porting process consists of writing the
boot up code, device drivers, the architecture specific parts of the program
loader, and the stack switching code of the multi-threading library. The kernel
and the service layer do not require any changes.
Since the kernel and the service layer do not require any changes, an opera-
tional port can be tested after the first I/O device driver has been written. The
Atmel AVR port was made by ourselves in a couple of hours, with the help of
publicly available device drivers. The Zilog Z80 port was made by a third party,
in a single day.
8.10 Conclusions
We have presented the Contiki operating system, designed for memory con-
strained systems. In order to reduce the size of the system, Contiki is based
on an event-driven kernel. The state-machine driven programming of event-
driven systems can be hard to use and has problems with handling long running
computations. Contiki provides preemptive multi-threading as an application
library that runs on top of the event-driven kernel. The library is optionally
linked with applications that explicitly require a multi-threaded model of com-
putation.
A running Contiki system is divided into two parts: a core and loaded pro-
grams. The core consists of the kernel, a set of base services, and parts of the
language run-time and support libraries. The loaded programs can be loaded
and unloaded individually, at run-time. Shared functionality is implemented
as services, a form of shared libraries. Services can be updated or replaced
individually, which leads to a very flexible structure.
We have shown that dynamic loading and unloading of programs and ser-
vices is feasible in a resource constrained system, while keeping the base sys-
tem lightweight and compact. Even though our kernel is event-based, preemp-
tive multi-threading can be provided at the application layer on a per-process
basis.
Because of its dynamic nature, Contiki can be used to multiplex the hard-
ware of a sensor network across multiple applications or even multiple users.
This does, however, require ways to control access to the reprogramming facil-
ities. We plan to continue our work in the direction of operating system support
for secure code updates.
Bibliography
[12] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. The
nesC language: A holistic approach to networked embedded systems. In
Proc. SIGPLAN’03, 2003.
[17] P. Levis and D. Culler. Maté: A tiny virtual machine for sensor networks.
In Proc. ASPLOS-X, October 2002.
Paper C:
Using Protothreads for
Sensor Node Programming
Adam Dunkels, Oliver Schmidt, and Thiemo Voigt. Using protothreads for
sensor node programming. In Proceedings of the Workshop on Real-World
Wireless Sensor Networks (REALWSN 2005), Stockholm, Sweden, June 2005.
© 2005 Swedish Institute of Computer Science.
Abstract
Wireless sensor networks consist of tiny devices that usually have severe re-
source constraints in terms of energy, processing power and memory. In order
to work efficiently within the constrained memory, many operating systems
for such devices are based on an event-driven model rather than on traditional
multi-threading. While event-driven systems allow for reduced memory us-
age, they require programs to be developed as explicit state machines. Since
implementing programs using explicit state machines is hard, developing and
maintaining programs for event-driven systems is typically more difficult than
for multi-threaded ones.
In this paper, we introduce protothreads, a programming abstraction for
event-driven sensor network systems. Protothreads simplify implementation of
high-level functionality on top of event-driven systems, compared to traditional
methods.
9.1 Introduction
Wireless sensor networks consist of tiny devices that usually have severe re-
source constraints in terms of energy, processing power and memory. Most
programming environments for wireless sensor network nodes today are
based on an event-triggered programming model rather than traditional multi-
threading. In TinyOS [7], the event-triggered model was chosen over a multi-
threaded model due to the memory overhead of threads. According to Hill et
al. [7]:
“In TinyOS, we have chosen an event model so that high levels
of concurrency can be handled in a very small amount of space. A
stack-based threaded approach would require that stack space be
reserved for each execution context.”
While the event-driven model and the threaded model can be shown to be
equivalent [9], programs written in the two models typically display differ-
ing characteristics [1]. The advantages and disadvantages of the two models
are a debated topic [11, 14].
In event-triggered systems, programs are implemented as event handlers.
Event handlers are invoked in response to external or internal events, and run to
completion. An event handler typically is a programming language procedure
or function that performs an explicit return to the caller. Because of the run-to-
completion semantics, an event-handler cannot execute a blocking wait. With
run-to-completion semantics, the system can utilize a single, shared stack. This
reduces the memory overhead over a multi-threaded system, where memory
must be allocated for a stack for each running program.
The run-to-completion semantics of event-triggered systems makes imple-
menting certain high-level operations a complex task. When an operation can-
not complete immediately, the operation must be split across multiple invo-
cations of the event handler. Levis et al. [10] refer to this as a split-phase
operation. In the words of Levis et al.:
“This approach is natural for reactive processing and for in-
terfacing with hardware, but complicates sequencing high-level
operations, as a logically blocking sequence must be written in
a state-machine style.”
In this paper, we introduce the notion of using protothreads [3, 6] as a
method to reduce the complexity of high-level programs in event-triggered sen-
sor node systems. We argue that protothreads can reduce the number of explicit
state machines used in event-driven sensor node programs.
9.2 Motivation
To illustrate how high-level functionality is implemented using state machines,
we consider a hypothetical energy-conservation mechanism for wireless sensor
nodes. The mechanism switches the radio on and off at regular intervals. The
mechanism works as follows:
5. Wait for t_sleep milliseconds. If the radio could not be turned off before
t_sleep milliseconds because of remaining communication, do not turn the
radio off at all.
enum {
  ON,
  WAITING,
  OFF
} state;

void radio_wake_eventhandler() {
  switch(state) {
  case OFF:
    if(timer_expired(&timer)) {
      radio_on();
      state = ON;
      timer_set(&timer, T_AWAKE);
    }
    break;
  case ON:
    if(timer_expired(&timer)) {
      timer_set(&timer, T_SLEEP);
      if(!communication_complete()) {
        state = WAITING;
      } else {
        radio_off();
        state = OFF;
      }
    }
    break;
  case WAITING:
    if(communication_complete()
       || timer_expired(&timer)) {
      state = ON;
      timer_set(&timer, T_AWAKE);
    } else {
      radio_off();
      state = OFF;
    }
    break;
  }
}

PT_THREAD(radio_wake_thread(struct pt *pt)) {
  PT_BEGIN(pt);

  while(1) {
    radio_on();
    timer_set(&timer, T_AWAKE);
    PT_WAIT_UNTIL(pt, timer_expired(&timer));

    timer_set(&timer, T_SLEEP);
    if(!communication_complete()) {
      PT_WAIT_UNTIL(pt, communication_complete()
                        || timer_expired(&timer));
    }

    if(!timer_expired(&timer)) {
      radio_off();
      PT_WAIT_UNTIL(pt, timer_expired(&timer));
    }
  }

  PT_END(pt);
}

Figure 9.1: The radio sleep cycle implemented with events (top) and with pro-
tothreads (bottom).
9.3 Protothreads
Protothreads [6] are an extremely lightweight stackless type of threads, de-
signed for severely memory constrained systems. Protothreads provide condi-
tional blocking waits on top of an event-driven system, without the overhead
of per-thread stacks. The purpose of protothreads is to implement sequential
flow of control without complex state machines or full multi-threading.
We developed protothreads in order to deal with the complexity of explicit
state machines in the event-driven uIP TCP/IP stack [4]. For uIP, we were
able to substantially reduce the number of state machines and explicit states
used in the implementations of a number of application level communication
protocols. For example, the uIP FTP client could be simplified by completely
removing the explicit state machine, and thereby reducing the number of ex-
plicit states from 20 to one.
Figure 9.2: State machine realization of the radio sleep cycle protocol.
9.3.3 Comparison
Table 9.1 summarizes the features of protothreads and compares them with the
features of events and threads. The names of the features are from [1].
Feature                 Events   Threads   Protothreads
Control structures      No       Yes       Yes
Debug stack retained    No       Yes       Yes
Implicit locking        Yes      No        Yes
Preemption              No       Yes       No
Automatic variables     No       Yes       No
radio_wake_thread(struct pt *pt) {
  switch(pt->lc) {
  case 0:

  while(1) {
    radio_on();
    timer_set(&timer, T_AWAKE);
    pt->lc = 8;
  case 8:
    if(!timer_expired(&timer)) {
      return;
    }

    timer_set(&timer, T_SLEEP);
    if(!communication_complete()) {
      pt->lc = 13;
  case 13:
      if(!(communication_complete() ||
           timer_expired(&timer))) {
        return;
      }
    }

    if(!timer_expired(&timer)) {
      radio_off();
      pt->lc = 18;
  case 18:
      if(!timer_expired(&timer)) {
        return;
      }
    }
  }
  }
}

Figure 9.3: C switch statement expansion of the protothreads code in Figure 9.1.
Debug stack retained. Because of the manual stack management and the free
flow of control in the event-driven model, debugging is difficult, as the
sequence of calls is not saved on the stack [1]. With both threads and
protothreads, the full call stack is available for debugging.
Preemption. Because both the event-driven model and protothreads use a single
stack, preemption is not possible within these models.
Automatic variables. Since the threaded model allocates a stack for each
thread, automatic variables—variables with function local scope auto-
matically allocated on the stack—are retained even when the thread
blocks. Both the event-driven model and protothreads use a single shared
stack for all active programs, and rewind the stack every time a program
blocks. Therefore, with protothreads, automatic variables are not saved
across a blocking wait. This is discussed in more detail below.
9.3.4 Limitations
While protothreads allow programs to take advantage of some of the bene-
fits of a threaded programming model, protothreads also impose some of the
limitations of the event-driven model. The most evident limitation from the
event-driven model is that automatic variables—variables with function-local
scope that are automatically allocated on the stack—are not saved across a
blocking wait. While automatic variables can still be used inside a protothread,
the contents of the variables must be explicitly stored before executing a wait
statement. The reason for this is that protothreads rewind the stack at every
blocking statement, and therefore potentially destroy the contents of variables
on the stack.
If an automatic variable is erroneously used after a blocking statement, the
C compiler is able to detect the problem. Typically a warning is produced, stat-
ing that the variable in question “might be used uninitialized in this function”.
While it may not be immediately apparent for the programmer that this warn-
ing is related to the use of automatic variables across a blocking protothreads
statement, it does provide an indication that there is a problem with the pro-
gram. Also, the warning indicates the line number of the problem which assists
the programmer in identifying the problem.
The limitation on the use of automatic variables can be handled by using
an explicit state object, much in the same way as is done in the event-driven
model. The state object is a chunk of memory that holds the contents of all
automatic variables that need to be saved across a blocking statement. It is,
however, the responsibility of the programmer to allocate and maintain such a
state object.
It should also be noted that protothreads do not limit the use of static lo-
cal variables. Static local variables are variables that are local in scope but
allocated in the data section. Since these are not placed on the stack, they are
not affected by the use of blocking protothreads statements. For functions that
do not need to be re-entrant, using static local variables instead of automatic
variables can be an acceptable solution to the problem.
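The pitfall and the workaround can be illustrated as follows, using the protothreads API from Section 9.3; compute(), condition(), and use() are hypothetical helpers.

    /* Sketch of the automatic-variable pitfall. */
    PT_THREAD(example(struct pt *pt)) {
      int a;          /* automatic: lives on the shared stack */
      static int s;   /* static: lives in the data section */

      PT_BEGIN(pt);

      a = compute();
      s = compute();

      PT_WAIT_UNTIL(pt, condition());  /* stack rewound while blocked */

      use(a);         /* WRONG: a is not retained across the wait */
      use(s);         /* correct: s keeps its value */

      PT_END(pt);
    }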
9.3.5 Implementation
Protothreads are based on a low-level mechanism that we call local continu-
ations [6]. A local continuation is similar to ordinary continuations [12], but
does not capture the program stack. Local continuations can be implemented
in a variety of ways, including using architecture specific machine code, C-
compiler extensions, and a non-obvious use of the C switch statement. In this
paper, we concentrate on the method based on the C switch statement.
A local continuation supports two operations: it can be either set or re-
sumed. When a local continuation is set, the state of the function—all CPU
registers including the program counter but excluding the stack—is captured.
When the same local continuation is resumed, the state of the function is reset
to what it was when the local continuation was set.
A protothread consists of a single local continuation. The protothread's lo-
cal continuation is set before each conditional blocking wait. If the condition
is false and the wait is to be performed, the protothread executes an explicit
return statement, thus returning to the caller. The next time the protothread is
called, it resumes the local continuation that was previously set. This will ef-
fectively cause the program to jump to the conditional blocking wait statement.
The condition is re-evaluated and, once the condition is true, the protothread
continues down through the function.
Figure 9.4: The local continuation resume and set operations implemented us-
ing the C switch statement.
usage. Their work is different from protothreads in that OSM requires sup-
port from an external OSM compiler to produce the resulting C code, whereas
protothreads only make use of the regular C preprocessor.
9.5 Conclusions
Many operating systems for wireless sensor network nodes are based on an
event-triggered programming model. In order to implement high-level opera-
tions under this model, programs have to be written as explicit state machines.
Software implemented using explicit state machines is often hard to under-
stand, debug, and maintain.
We have presented protothreads as a programming abstraction that re-
duces the complexity of implementations of high-level functionality for event-
triggered systems. With protothreads, programs can perform blocking waits on
top of event-triggered systems with run-to-completion semantics.
Acknowledgments
This work was partly financed by VINNOVA, the Swedish Agency for Inno-
vation Systems, and the European Commission under contract IST-004536-
RUNES.
Bibliography
[8] O. Kasten and K. Römer. Beyond event handlers: Programming wire-
less sensors with attributed state machines. In The Fourth International
Conference on Information Processing in Sensor Networks (IPSN), Los
Angeles, USA, April 2005.
[9] H. C. Lauer and R. M. Needham. On the duality of operating systems
structures. In Proc. Second International Symposium on Operating Sys-
tems, October 1978.
[10] P. Levis, S. Madden, D. Gay, J. Polastre, R. Szewczyk, A. Woo,
E. Brewer, and D. Culler. The Emergence of Networking Abstractions
and Techniques in TinyOS. In Proceedings of ACM/Usenix Networked
Systems Design and Implementation (NSDI’04), San Francisco, Califor-
nia, USA, March 2004.
[11] J. K. Ousterhout. Why threads are a bad idea (for most purposes). Invited
Talk at the 1996 USENIX Technical Conference, 1996.
[14] R. von Behren, J. Condit, and E. Brewer. Why events are a bad idea (for
high-concurrency servers). In Proceedings of the 9th Workshop on Hot
Topics in Operating Systems, Lihue (Kauai), Hawaii, USA, May 2003.
Chapter 10
Paper D:
Protothreads: Simplifying
Event-Driven Programming
of Memory-Constrained
Embedded Systems
Adam Dunkels, Oliver Schmidt, Thiemo Voigt, and Muneeb Ali. Protothreads:
Simplifying event-driven programming of memory-constrained embedded sys-
tems. In Proceedings of the 4th International Conference on Embedded Net-
worked Sensor Systems (ACM SenSys 2006), Boulder, Colorado, USA, Novem-
ber 2006.
© 2006 Association for Computing Machinery.
Abstract
10.1 Introduction
Event-driven programming is a common programming model for memory-
constrained embedded systems, including sensor networks. Compared to
multi-threaded systems, event-driven systems do not need to allocate mem-
ory for per-thread stacks, which leads to lower memory requirements. For this
reason, many operating systems for sensor networks, including TinyOS [19],
SOS [17], and Contiki [12] are based on an event-driven model. According
to Hill et al. [19]: “In TinyOS, we have chosen an event model so that high
levels of concurrency can be handled in a very small amount of space. A stack-
based threaded approach would require that stack space be reserved for each
execution context.” Event-driven programming is also often used in systems
that are too memory-constrained to fit a general-purpose embedded operating
system [28].
An event-driven model does not support a blocking wait abstraction. There-
fore, programmers of such systems frequently need to use state machines to im-
plement control flow for high-level logic that cannot be expressed as a single
event handler. Unlike state machines that are part of a system specification, the
control-flow state machines typically have no formal specification, but are cre-
ated on-the-fly by the programmer. Experience has shown that the need for ex-
plicit state machines to manage control flow makes event-driven programming
difficult [3, 25, 26, 35]. In the words of Levis et al. [26]: “This approach is
natural for reactive processing and for interfacing with hardware, but compli-
cates sequencing high-level operations, as a logically blocking sequence must
be written in a state-machine style.” In addition, popular programming lan-
guages for tiny embedded systems such as the C programming language and
nesC [15] do not provide any tools to help the programmer manage the imple-
mentation of explicit state machines.
In this paper we study how protothreads, a novel programming abstrac-
tion that provides a conditional blocking wait operation, can be used to reduce
the number of explicit state machines in event-driven programs for memory-
constrained embedded systems.
The contribution of this paper is that we show that protothreads simplify
event-driven programming by reducing the need for explicit state machines.
We show that the protothreads mechanism is simple enough that a prototype
implementation of the protothreads mechanism can be done using only C lan-
guage constructs, without any architecture-specific machine code. We have
previously presented the ideas behind protothreads in a position paper [13].
In this paper we significantly extend our previous work by refining the pro-
tothreads mechanism.
10.2 Protothreads
Protothreads are a novel programming abstraction that provides a conditional
blocking wait statement, PT_WAIT_UNTIL(), that is intended to simplify
event-driven programming for memory-constrained embedded systems. The
operation takes a conditional statement and blocks the protothread until the
statement evaluates to true. If the conditional statement is true the first time
the protothread reaches the PT_WAIT_UNTIL(), the protothread continues to
execute without interruption. The PT_WAIT_UNTIL() condition is evaluated
each time the protothread is invoked. The PT_WAIT_UNTIL() condition can
be any conditional statement, including complex Boolean expressions.
A protothread is stackless: it does not have a history of function invoca-
tions. Instead, all protothreads in a system run on the same stack, which is
rewound every time a protothread blocks.
A protothread is driven by repeated calls to the function in which the pro-
tothread runs. Because they are stackless, protothreads can only block at the
top level of the function. This means that it is not possible for a regular func-
tion called from a protothread to block inside the called function - only explicit
PT_WAIT_UNTIL() statements can block. The advantage of this is that the
programmer is always aware of which statements may potentially block.
Nevertheless, it is possible to perform nested blocking by using hierarchical
protothreads as described in Section 10.2.5.
The beginning and the end of a protothread are declared with
PT_BEGIN and PT_END statements. Protothread statements, such as the
PT_WAIT_UNTIL() statement, must be placed between the PT_BEGIN and
PT_END statements. A protothread can exit prematurely with a PT_EXIT state-
ment. Statements outside of the PT_BEGIN and PT_END statements are not
part of the protothread, and the behavior of such statements is undefined.
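As a minimal illustration of this API (the driving loop and the check_event() helper below are ours, not part of the protothreads API):

    /* A minimal protothread that waits until a flag is set. */
    #include "pt.h"

    extern int check_event(void);   /* hypothetical event source */

    static struct pt example_pt;
    static int flag;

    static PT_THREAD(example(struct pt *pt))
    {
      PT_BEGIN(pt);

      /* Block here; the condition is re-evaluated on every call. */
      PT_WAIT_UNTIL(pt, flag != 0);

      /* Statements here run only once flag has been set. */

      PT_END(pt);
    }

    void driver(void)
    {
      PT_INIT(&example_pt);               /* once, before the first call */
      while(example(&example_pt) == PT_WAITING) {
        flag = check_event();             /* drive by repeated calls */
      }
    }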
Protothreads can be seen as a combination of events and threads. From
threads, protothreads have inherited the blocking wait semantics. From events,
protothreads have inherited the stacklessness and the low memory overhead.
The blocking wait semantics allow linear sequencing of statements in event-
driven programs. The main advantage of protothreads over traditional threads
is that protothreads are very lightweight: a protothread does not require its
own stack. Rather, all protothreads run on the same stack and context switch-
ing is done by stack rewinding. This is advantageous in memory constrained
systems, where a thread’s stack might use a large part of the available mem-
ory. For example, a thread with a 200 byte stack running on an MSP430F149
microcontroller uses almost 10% of the entire RAM. In contrast, the memory
overhead of a protothread is as low as two bytes per protothread and no addi-
tional stack is needed.
10.2.1 Scheduling
The protothreads mechanism does not specify any specific method to invoke or
schedule a protothread; this is defined by the system using protothreads. If a
protothread is run on top of an underlying event-driven system, the protothread
is scheduled whenever the event handler containing the protothread is invoked
by the event scheduler. For example, application programs running on top
of the event-driven uIP TCP/IP stack are invoked both when a TCP/IP event
occurs and when the application is periodically polled by the TCP/IP stack. If
the application program is implemented as a protothread, this protothread is
scheduled every time uIP calls the application program.
In the Contiki operating system, processes are implemented as protothreads
running on top of the event-driven Contiki kernel. A process’ protothread is
invoked whenever the process receives an event. The event may be a message
from another process, a timer event, a notification of sensor input, or any other
type of event in the system. Processes may wait for incoming events using the
protothread conditional blocking statements.
The protothreads mechanism does not specify how memory for holding the
state of a protothread is managed. As with the scheduling, the system using
protothreads decides how memory should be allocated. If the system will run a
predetermined amount of protothreads, memory for the state of all protothreads
can be statically allocated in advance. Memory for the state of a protothread
can also be dynamically allocated if the number of protothreads is not known in
advance. In Contiki, the memory for the state of a process’ protothread is held
in the process control block. Typically, a Contiki program statically allocates
memory for its process control blocks.
In general, protothreads are reentrant. Multiple protothreads can be running
the same piece of code as long as each protothread has its own memory for
keeping state.
Figure 10.1: The radio sleep cycle implemented with events, in pseudocode.
Figure 10.2: The radio sleep cycle implemented with protothreads, in pseu-
docode.
allow the radio to be turned off as often as possible in order to reduce the over-
all energy consumption of the device. Many MAC protocols therefore have
scheduled sleep cycles when the radio is turned off completely.
The hypothetical MAC protocol used here is similar to the T-MAC proto-
col [34] and switches the radio on and off at scheduled intervals. The mecha-
nism is depicted in Figure 10.3 and can be specified as follows:
Figure 10.3: The radio sleep cycle of the example MAC protocol: the radio
is on for t_awake and off for t_sleep, but is kept on while communication
remains, for at most t_wait_max.
Figure 10.4: State machine realization of the radio sleep cycle of the example
MAC protocol.
Figure 10.2 shows the resulting pseudocode. We see that the code is
shorter than the event-driven version from Figure 10.1 and that the code more
closely follows the specification of the mechanism.
reliable_send(message):
  rxtimer: timer
  PT_BEGIN
  do
    rxtimer ← t_retransmission
    send(message)
    PT_WAIT_UNTIL(ack_received() or expired(rxtimer))
  until (ack_received())
  PT_END
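Translated into C with the same protothread operations, the sketch below assumes send(), ack_received(), the timer functions, and the T_RETRANSMISSION constant as primitives of the surrounding system:

    /* Sketch of the reliable-send protothread from the pseudocode. */
    static PT_THREAD(reliable_send(struct pt *pt, char *message))
    {
      static struct timer rxtimer;  /* static: retained across waits */

      PT_BEGIN(pt);

      do {
        timer_set(&rxtimer, T_RETRANSMISSION);
        send(message);
        PT_WAIT_UNTIL(pt, ack_received() || timer_expired(&rxtimer));
      } while(!ack_received());

      PT_END(pt);
    }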
This state can then later be restored with the resume operation. The state cap-
tured by a local continuation does not include the history of functions that have
called the function in which the local continuation was set. That is, the lo-
cal continuation does not contain the stack, but only the state of the current
function.
A protothread consists of a function and a single local continuation. The
protothread’s local continuation is set before each PT WAIT UNTIL() state-
ment. If the condition is false and the wait is to be performed, the protothread
is suspended by returning control to the function that invoked the protothread’s
function. The next time the protothread function is invoked, the protothread
resumes the local continuation. This effectively causes the program to execute
a jump to the conditional blocking wait statement. The condition is reevaluated
and the protothread either blocks or continues its execution.
Figure 10.6: The stack memory requirements for three event handlers, the three
event handlers rewritten with protothreads, and the equivalent functions run-
ning in three threads. Event handlers and protothreads run on the same stack,
whereas each thread runs on a stack of its own.
a sequence:

  PT_BEGIN
  (* ... *)
  PT_WAIT_UNTIL(cond1)
  (* ... *)
  PT_END

an iteration:

  PT_BEGIN
  (* ... *)
  while (cond1)
    PT_WAIT_UNTIL(cond1 or cond2)
  (* ... *)
  PT_END

a selection:

  PT_BEGIN
  (* ... *)
  if (condition)
    PT_WAIT_UNTIL(cond2a)
  else
    PT_WAIT_UNTIL(cond2b)
  (* ... *)
  PT_END
Figure 10.10: The state machine from the example radio sleep cycle mecha-
nism with the iteration and sequence patterns identified.
sleep cycle of the example MAC protocol in Section 10.2.3, with the itera-
tion and sequence state machine patterns identified. From this analysis the
protothreads-based code in Figure 10.2 can be written.
10.5 Implementation
struct pt { lc_t lc; };
#define PT_WAITING 0
#define PT_EXITED 1
#define PT_ENDED 2
#define PT_INIT(pt) LC_INIT(pt->lc)
#define PT_BEGIN(pt) LC_RESUME(pt->lc)
#define PT_END(pt) LC_END(pt->lc); \
return PT_ENDED
#define PT_WAIT_UNTIL(pt, c) LC_SET(pt->lc); \
if(!(c)) \
return PT_WAITING
#define PT_EXIT(pt) return PT_EXITED
C Switch Statement
The main problem with the GCC C extension-based implementation of local
continuations is that it only works with a single C compiler: GCC. We next
show an implementation using only standard ANSI C constructs which uses
the C switch statement in a non-obvious way.
Figure 10.15 shows local continuations implemented using the C switch
statement. LC_RESUME() is an open switch statement, with a case 0:
immediately following it. The case 0: makes sure that the code after
the LC_RESUME() statement is always executed when the local continua-
tion has been initialized with LC_INIT(). The implementation of LC_SET()
uses the standard __LINE__ macro. This macro expands to the line number in
the source code at which the LC_SET() macro is used. The line number is
used as a unique identifier for each LC_SET() statement. The implementation
of LC_END() is a single right curly bracket that closes the switch statement
opened by LC_RESUME().
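Concretely, the local continuation operations just described can be realized with a handful of C macros; the following sketch follows that description, with lc_t assumed to be a short integer type.

    typedef unsigned short lc_t;

    #define LC_INIT(s)   s = 0;                        /* start fresh */
    #define LC_RESUME(s) switch(s) { case 0:           /* open switch */
    #define LC_SET(s)    s = __LINE__; case __LINE__:  /* resume point */
    #define LC_END(s)    }                             /* close switch */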
To better illustrate how the C switch-based implementation works, Fig-
ure 10.16 shows how a short protothreads-based program is expanded by the C
preprocessor. We see that the resulting code is fairly similar to how the explicit
state machine was implemented in Figure 10.1. However, when looking closer
at the expanded C code, we see that the case 8: statement on line 7 appears
inside the do-while loop, even though the switch statement appears outside of
the do-while loop. This may seem surprising at first, but is in fact valid ANSI
C code. This use of the switch statement was likely first publicly
described by Duff as part of Duff's Device [8]. The same technique has later
been used to implement coroutines in C.

Figure 10.16: Expanded C code with local continuations implemented with the
C switch statement.
Automatic Variables
In the C-based implementations, automatic variables are not preserved across
a blocking wait: when the protothread blocks, its function returns and the
contents of the stack frame are lost. Variables that must retain their values
across a PT_WAIT_UNTIL() statement therefore have to be declared static.
Assembly Language
We have found that for some combinations of processors and C compilers it
is possible to implement protothreads and local continuations in assembly
language. The set operation of the local continuation is then implemented as
a C function that captures the return address from the stack and stores it in
the local continuation, along with any callee-save registers. Conversely, the resume
operation would restore the saved registers from the local continuation and
perform an unconditional jump to the address stored in the local continuation.
The obvious problem with this approach is that it requires a porting effort for
every new processor and C compiler. Also, since both a return address and
a set of registers need to be stored in the local continuation, its size grows.
However, we found that the largest problem with this approach is that some C
compiler optimizations will make the implementation difficult. For example,
we were not able to produce a working implementation with this method for
the Microsoft Visual C++ compiler.
Stackful Approaches
By letting each protothread run on its own stack it would be possible to im-
plement the full protothread mechanism, including storage of automatic vari-
ables across a blocking wait. With such an implementation the stack would
be switched to the protothread’s own stack by the PT BEGIN operation and
switched back when the protothread blocks or exits. This approach could be
implemented with a coroutine library or the multi-threading library of Contiki.
However, this implementation would result in a memory overhead similar to
that of multi-threading because each invocation of a protothread would require
the same amount of stack memory as the equivalent protothread running in
a thread of its own due to the stack space required by functions called from
within the protothread.
10.6 Evaluation
To evaluate protothreads we first measure the reduction in code complexity
that protothreads provide by reimplementing a set of event-driven programs
with protothreads and measure the complexity of the resulting code. Second,
we measure the memory overhead of protothreads compared to the memory
overhead of an event-driven state machine. Third, we compare the execution
time overhead of protothreads with that of event-driven state machines.
By rewriting these programs with protothreads, we were able to entirely remove the explicit state machines for
most programs. For all programs, protothreads significantly reduce the number
of state transitions and lines of code.
The reimplemented programs have undergone varying amounts of testing.
The Contiki code propagation, the TR1001 low-level radio driver, and the uIP
SMTP client are well tested and are currently used on a daily basis in live
systems; XNP and TinyDB have been verified to be working but not heavily
tested; and the CC1000 drivers have been tested and run in simulation.
Furthermore, we have anecdotal evidence to support our hypothesis that
protothreads are an alternative to state machines for embedded software devel-
opment. The protothreads implementations have for some time been available
as open source on our web page [9]. We know that at least ten embedded sys-
tems developers have successfully used protothreads to replace state machines
for embedded software development. Also, our protothreads code has twice
been recommended by experienced embedded developers in Jack Ganssle’s
embedded development newsletter [14].
XNP
XNP [20] is one of the in-network programming protocols used in
TinyOS [19]. XNP downloads a new system image to a sensor node and writes
the system image to the flash memory of the device. XNP is implemented on
top of the event-driven TinyOS. Therefore, any operations in XNP that would
be blocking in a threaded system have to be implemented as state machines.
We chose XNP because it is a relatively complex program implemented on top
of an event-driven system. The implementation of XNP has previously been
analyzed by Jeong [20], which assisted us in our analysis. The implementation
of XNP consists of a large switch statement with 25 explicit states, encoded
as defined constants, and 20 state transitions. To analyze the code, we identi-
fied the state transitions from manual inspection of the code inside the switch
statement.
Since the XNP state machine is implemented as one large switch statement,
we expected it to be a single, complex state machine. But, when drawing the
state machine from analysis of the code, it turned out that the switch statement
in fact implements five different state machines. The entry points of the state
machines are not immediately evident from the code, as the state of the state
machine was changed in several places throughout the code.
The state machines we found during the analysis of the XNP program are
shown in Figure 10.17. For reasons of presentation, the figure does not show
[Figure 10.17 diagram: five state machines whose states include DL_START, DL_START0, DL_START2, DL_SRECWRITE, UP_SRECWRITE, REQ_CIDMISSING, GET_CIDMISSING, DL_END_SIGNAL, EEFLASH_WRITEDONE, ISP_REQ1, GET_DONE, DL_FAIL, and DL_FAIL_SIGNAL.]
Figure 10.17: XNP state machines. The names of the states are from the code.
The IDLE and ACK states are not shown.
the IDLE and ACK states. Almost all states have transitions to one of these
states. If an XNP operation completes successfully, the state machine goes
into the ACK state to transmit an acknowledgment over the network. The IDLE
state is entered if an operation ends with an error, and when the acknowledg-
ment from the ACK state has been transmitted.
In the figure we clearly see many of the state machine patterns from Fig-
ure 10.7. In particular, the sequence pattern is evident in all state machines. By
using the techniques described in Section 10.4 we were able to rewrite all state
machines into protothreads. Each state machine was implemented as its own
protothread.
The IDLE and ACK states are handled in a hierarchical protothread. A
separate protothread is created for sending the acknowledgment signal. This
protothread is spawned from the main protothread every time the program logic
dictates that an acknowledgment should be sent.
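This hierarchical structure can be sketched as follows, using a spawn idiom in the style of the PT_SPAWN macro found in the protothreads library; the XNP-specific function names are illustrative assumptions.

static struct pt ack_pt;

/* Child protothread: transmits an acknowledgment over the network. */
static int send_ack_thread(struct pt *pt)
{
  PT_BEGIN(pt);
  start_ack_transmission();                    /* illustrative function */
  PT_WAIT_UNTIL(pt, ack_transmission_done());  /* illustrative condition */
  PT_END(pt);
}

/* Parent protothread: spawns the child whenever an ack should be sent. */
static int xnp_thread(struct pt *pt)
{
  PT_BEGIN(pt);
  /* ... protocol logic ... */
  PT_SPAWN(pt, &ack_pt, send_ack_thread(&ack_pt)); /* blocks until child ends */
  /* ... more protocol logic ... */
  PT_END(pt);
}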
TinyDB
TinyDB [27] is a small database engine for the TinyOS system. With TinyDB,
a user can query a wireless sensor network with a database query language
similar to SQL. TinyDB is one of the largest TinyOS programs available.
In TinyOS long-latency operations are split-phase [15]. Split-phase opera-
tions consist of two parts: a request and a completion event. The request com-
pletes immediately, and the completion event is posted when the operation has
completed. TinyDB contains a large number of split-phase operations. Since
programs written for TinyOS cannot perform a blocking wait, many complex
operations in TinyDB are encoded as explicit state machines.
[Figure diagram: one of the TinyDB state machines, with states WRITE_NEXT_BUFFER, READ_QUERY, READ_BUFFER, READ_FIELD_LEN, ALLOC_FIELD_DATA, and READ_FIELD_DATA.]
Low-level radio protocol drivers
Many radio transceivers used in sensor nodes provide only a low-level
interface, so radio protocol functionality, such as packet framing, header
parsing, and the MAC protocol, must be implemented in software.
We analyze and rewrite CC1000 drivers from the Mantis OS [2] and from
SOS [17], as well as the TR1001 driver from Contiki [12]. All drivers are
implemented as explicit state machines. The state machines run in the interrupt
handlers of the radio interrupts.
The CC1000 driver in Mantis has two explicit state machines: one for han-
dling and parsing incoming bytes and one for handling outgoing bytes. In
contrast, both the SOS CC1000 driver and the Contiki TR1001 drivers have
only one state machine that parses incoming bytes. The state machine that
handles transmissions in the SOS CC1000 driver is shown in Figure 10.19.
The structures of the SOS CC1000 driver and the Contiki TR1001 driver are
very similar.
TXSTATE_PREAMBLE
TXSTATE_SYNC
TXSTATE_PREHEADER
TXSTATE_HEADER
TXSTATE_DATA
TXSTATE_CRC
TXSTATE_FLUSH
TXSTATE_WAIT_FOR_ACK
TXSTATE_READ_ACK
TXSTATE_DONE
Figure 10.19: Transmission state machine from the SOS CC1000 driver.
With protothreads we could replace most parts of the state machines. How-
ever, for both the SOS CC1000 driver and the Contiki TR1001 drivers, we kept
a top-level state machine. The reason for this is that those state machines were
not used to implement control flow. The top-level state machine in the SOS
CC1000 driver controlled if the driver was currently transmitting or receiving
a packet, or if it was finding a synchronization byte.
Contiki
The Contiki operating system [12] for wireless sensor networks is based on an
event-driven kernel, on top of which protothreads provide a thread-like pro-
gramming style. The first version of Contiki was developed before we intro-
duced protothreads. After developing protothreads, we found that they reduced
the complexity of writing software for Contiki.
For the purpose of this paper, we measure the implementation of a dis-
tribution program for distributing and receiving binary code modules for the
Contiki dynamic loader [11]. The program was initially implemented with-
out protothreads but was later rewritten when protothreads were introduced to
Contiki. The program can be in one of three modes: (1) receiving a binary
module from a TCP connection and loading it into the system, (2) broadcasting
the binary module over the wireless network, and (3) receiving broadcasts of a
binary module from a nearby node and loading it into memory.
When rewriting the program with protothreads, we removed most of the
explicit state machines, but kept four states. These states keep track of which
mode the program is in: whether it is receiving or broadcasting a binary module.
Results
The results of reimplementing the programs with protothreads are presented in
Table 10.1. The lines of code reported in the table are those of the rewritten
functions only. We see that in all cases the number of states, state transitions,
and lines of code were reduced by rewriting the programs with protothreads. In
most cases the rewrite completely removed the state machine. The total average
reduction in lines of code is 31%. For the programs rewritten by applying
the replacement method from Section 10.4 (XNP, TinyDB, and the CC1000
drivers) the average reduction is 23% and for the programs that were rewritten
from scratch (the TR1001 driver, the uIP SMTP client, and the Contiki code
propagation program) the average reduction is 41%.

Table 10.1: The number of explicit states, explicit state transitions, and lines
of code before and after rewriting with protothreads.
Table 10.2 shows the compiled code size of the rewritten functions when
written as a state machine and with protothreads. We see that the code size
increases in most cases, except for the Contiki code propagation program. The
average increase for the programs where the state machines were replaced with
protothreads by applying the method from Section 10.4 is 14%. The Con-
tiki TR1001 driver is only marginally larger when written with protothreads.
The uIP SMTP client, on the other hand, is significantly larger when written
with protothreads rather than with a state machine. The reason for this is that
the code for creating SMTP message strings could be optimized through code
reuse in the state machine-based implementation, something which was not
possible in the protothreads-based implementation.
Table 10.2: Code size before and after rewriting with protothreads.
Table 10.3: Memory overhead in bytes for the Contiki TR1001 driver and the
Contiki code propagation on the MSP430, implemented with a state machine,
a protothread, and a thread.

                           State machine   Protothread   Thread
Contiki TR1001 driver                  1             2       18
Contiki code propagation               1             2       34
Table 10.4: Machine code instructions overhead for a state machine, a pro-
tothread, and a yielding protothread.
Table 10.5: Mean execution time in milliseconds and processor cycles for a
single invocation of the TR1001 input driver under Contiki on the MSP430
platform.
The machine code overhead of protothreads over a state machine is very small: three machine code in-
structions for the MSP430 and 11 for the AVR. In comparison, the number of
instructions required to perform a context switch in the Contiki implementation
of cooperative multi-threading is 51 for the MSP430 and 80 for the AVR.
The additional instructions for the protothread in Table 10.4 are caused
by the extra case statement that is included in the implementation of the
PT_BEGIN operation.
The execution time overhead of protothreads, shown in Table 10.5, is on the order of a few processor cycles per invocation for the GCC C extension-based implementation and just over
ten cycles per invocation for the C switch-based implementation. The results
are consistent with the machine code overhead in Table 10.4. We also mea-
sured the execution time of the radio driver rewritten with cooperative multi-
threading and found it to be approximately three times larger than that of the
protothread-based implementation because of the overhead of the stack switch-
ing code in the multi-threading library.
Because of the low execution time overhead of protothreads we conclude
that protothreads are usable even for interrupt handlers with tight execution
time constraints.
10.7 Discussion
We have been using the prototype implementations of protothreads described
in this paper in Contiki for two years and have found that the biggest problem
with the prototype implementations is that automatic variables are not pre-
served across a blocking wait. Our workaround is to use static local variables
rather than automatic variables inside protothreads. While the use of static lo-
cal variables may be problematic in the general case, we have found it to work
well for Contiki because of the small scale of most Contiki programs. Also,
as many Contiki programs do not need to be reentrant, the use of static local
variables works well.
Code organization is different for programs written with state machines
and with protothreads. State machine-based programs tend to consist either of
a single large function containing a large state machine or of many small func-
tions where the state machine is difficult to find. In contrast, protothreads-
based programs tend to be based around a single protothread function that con-
tains the high-level logic of the program. If the underlying event system calls
different functions for every incoming event, a protothreads-based program
typically consists of a single protothread function and a number of small event
handlers that invoke the protothread when an event occurs.
10.9 Conclusions
We present protothreads, a novel abstraction for memory-constrained embed-
ded systems. Due to memory constraints, such systems are often based on an
event-driven model. Experience has shown that event-driven programming is
difficult because the lack of a blocking wait abstraction forces programmers to
implement control flow with state machines.
Protothreads simplify programming by providing a conditional blocking
wait operation, thereby reducing the need for explicit state machines. Pro-
tothreads are inexpensive: the memory overhead is only two bytes per pro-
tothread.
We develop two prototype protothreads implementations using only the C pre-
processor and evaluate the usefulness of protothreads by reimplementing some
widely used event-driven programs using protothreads. Our results show that
for most programs the explicit state machines could be entirely removed. Fur-
thermore, protothreads significantly reduce the number of state transitions and
lines of code.
Acknowledgments
This work was partly financed by VINNOVA, the Swedish Agency for Inno-
vation Systems, and the European Commission under contract IST-004536-
RUNES. Thanks go to Kay Römer and Umar Saif for reading and suggesting
improvements on drafts of this paper, and to our paper shepherd Philip Levis
for his many insightful comments that significantly helped to improve the pa-
per.
Adam Dunkels, Niclas Finne, Joakim Eriksson, and Thiemo Voigt. Run-time
dynamic linking for reprogramming wireless sensor networks. In Proceedings
of the 4th International Conference on Embedded Networked Sensor Systems
(ACM SenSys 2006), Boulder, Colorado, USA, November 2006.
© 2006 Association for Computing Machinery.
Abstract
From experience with wireless sensor networks it has become apparent that
dynamic reprogramming of the sensor nodes is a useful feature. The resource
constraints in terms of energy, memory, and processing power make sensor
network reprogramming a challenging task. Many different mechanisms for
reprogramming sensor nodes have been developed ranging from full image
replacement to virtual machines.
We have implemented an in-situ run-time dynamic linker and loader that
uses the standard ELF object file format. We show that run-time dynamic link-
ing is an effective method for reprogramming even resource constrained wire-
less sensor nodes. To evaluate our dynamic linking mechanism we have im-
plemented an application-specific virtual machine and a Java virtual machine
and compare the energy cost of the different linking and execution models. We
measure the energy consumption and execution time overhead on real hardware
to quantify the energy costs for dynamic linking.
Our results suggest that while in general the overhead of a virtual machine
is high, a combination of native code and virtual machine code provides good
energy efficiency. Dynamic run-time linking can be used to update the native
code, even in heterogeneous networks.
11.1 Introduction
Wireless sensor networks consist of a collection of programmable radio-
equipped embedded systems. The behavior of a wireless sensor network is
encoded in software running on the wireless sensor network nodes. The soft-
ware in deployed wireless sensor network systems often needs to be changed,
both to update the system with new functionality and to correct software bugs.
For this reason, dynamic reprogramming of wireless sensor networks is an
important feature. Furthermore, when developing software for wireless sen-
sor networks, being able to update the software of a running sensor network
greatly helps to shorten the development time.
The limitations of communication bandwidth, the limited energy of the
sensor nodes, the limited sensor node memory, which is typically on the order
of a few thousand bytes, the absence of memory mapping hardware, and
the limited processing power make reprogramming of sensor network nodes
challenging.
Many different methods for reprogramming sensor nodes have been de-
veloped, including full system image replacement [14, 16], approaches based
on binary differences [15, 17, 31], virtual machines [18, 19, 20], and loadable
native code modules in the first versions of Contiki [5] and SOS [12]. These
methods are either inefficient in terms of energy or require non-standard data
formats and tools.
The primary contribution of this paper is that we investigate the use of stan-
dard mechanisms and file formats for reprogramming sensor network nodes.
We show that in-situ dynamic run-time linking and loading of native code us-
ing the ELF file format, which is a standard feature on many operating systems
for PC computers and workstations, is feasible even for resource-constrained
sensor nodes. Our secondary contribution is that we measure and quantify the
energy costs of dynamic linking and execution of native code and compare it to
the energy cost of transmission and execution of code for two virtual machines:
an application-specific virtual machine and the Java virtual machine.
We have implemented a dynamic linker in the Contiki operating system that
can link, relocate, and load standard ELF object code files. Our mechanism is
independent of the particular microprocessor architecture on the sensor nodes
and we have ported the linker to two different sensor node platforms with only
minor modifications to the architecture dependent module of the code.
To evaluate the energy costs of the dynamic linker we implement an ap-
plication specific virtual machine for Contiki together with a compiler for a
subset of Java. We also adapt the Java virtual machine from the lejOS sys-
tem [8] to run under Contiki. We measure the energy cost of reprogramming
and executing a set of programs using dynamic linking of native code and the
two virtual machines. Using the measurements and a simple energy consump-
tion model we calculate break-even points for the energy consumption of the
different mechanisms. Our results suggest that while the execution time over-
head of a virtual machine is high, a combination of native code and virtual
machine code may give good energy efficiency.
The remainder of this paper is structured as follows. In Section 11.2 we
discuss different scenarios in which reprogramming is useful. Section 11.3
presents a set of mechanisms for executing code inside a sensor node and in
Section 11.4 we discuss loadable modules and the process of linking, relocat-
ing, and loading native code. Section 11.5 describes our implementation of
dynamic linking and our virtual machines. Our experiments and results are
presented in Section 11.6 and discussed in Section 11.7. Related work
is reviewed in Section 11.8. Finally, we conclude the paper in Section 11.9.
When the fire detection application has detected a fire, the fire fighters
might want to run a search and rescue application as well as a fire tracking
application. While it may be possible to host these particular applications on each
node despite the limited memory of the sensor nodes, this approach is not scal-
able [9]. In this scenario, replacing the application on the sensor nodes leads
to a more scalable system.
11.2.6 Summary
Table 11.1 compares the different scenarios and their properties. Update frac-
tion refers to how much of the system needs to be updated for each update;
update level, to the levels of the system at which updates are likely to occur;
and program longevity, to how long an installed program is expected to
reside on the sensor node.
Full Image Replacement
The most common way to update software in embedded systems and sensor
networks is to compile a complete new binary image of the software together
with the operating system and overwrite the existing system image of the sen-
sor node. This is the default method used by the XNP and Deluge network
reprogramming software in TinyOS [13].
The full image replacement does not require any additional processing of
the loaded system image before it is loaded into the system, since the loaded
image resides at the same, known, physical memory address as the previous
system image. For some systems, such as the Scatterweb system code [33], the
system contains both an operating system image and a small set of functions
that provide functionality for loading new operating system images. A new
operating system image can overwrite the existing image without overwriting
the loading functions. The addresses of the loading functions are hard-coded
in the operating system image.
Diff-based Approaches
Often a small update in the code of the system, such as a bugfix, will cause
only minor differences between the new and old system image. Instead of
distributing a new full system image the binary differences, deltas, between
the modified and original binary can be distributed. This reduces the amount
of data that needs to be transferred. Several types of diff-based approaches
have been developed [15, 17, 31] and it has been shown that the size of the
deltas produced by the diff-based approaches is very small compared to the
full binary image.
[Figure diagram: a pre-linked module referring to core functions by physical address, e.g. memcpy() at 0x0237 and radio_send() at 0x1720.]
Figure 11.1: The difference between a pre-linked module and a module with
dynamic linking information: the pre-linked module contains physical ad-
dresses whereas the dynamically linked module contains symbolic names.
Linking can be done either when the module is compiled or when the
module is loaded. We call the former approach pre-linking and the latter dy-
namic linking. A pre-linked module contains the absolute physical addresses
of the referenced functions or variables whereas a dynamically linked module
contains the symbolic names of all system core functions or variables that are
referenced in the module. This information increases the size of the dynam-
ically linked module compared to the pre-linked module. The difference is
shown in Figure 11.1. Dynamic linking has not previously been considered for
wireless sensor networks because of the perceived run-time overhead in
terms of execution time, energy consumption, and memory requirements.
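To make the difference concrete, the following sketch contrasts how the two module formats encode a reference to a core function; the structures are illustrative assumptions and do not reflect the actual ELF or Contiki data layouts.

#include <stdint.h>

/* Pre-linked module: the reference already holds the physical address
   of the core function, e.g. radio_send() at 0x1720. */
struct prelinked_ref {
  uint16_t physical_addr;   /* fixed when the module is compiled */
};

/* Dynamically linked module: the reference carries the symbolic name,
   resolved against the core symbol table when the module is loaded. */
struct dynlinked_ref {
  const char *symbol;       /* e.g. "radio_send" */
};

uint16_t symtab_lookup(const char *symbol);  /* resolves name to address */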
The machine code in the module usually contains references not only to
functions or variables in the system, but also to functions or variables within the
module itself. The physical address of those functions will change depending
on the memory address at which the module is loaded in the system. The
addresses of the references must therefore be updated to the physical address
that the function or variable will have when the module is loaded. This
process is called relocation.
A drawback of the ELF format is that it is designed to work on 32-bit and
64-bit architectures. This causes all ELF data structures to be defined with
32-bit data types. For 8-bit or 16-bit targets the high 16 bits of these
fields are unused.
To quantify the overhead of the ELF format we devise an alternative to
the ELF object code format that we call CELF - Compact ELF. A CELF file
contains the same information as an ELF file, but represented with 8 and 16-bit
datatypes. CELF files typically are half the size of the corresponding ELF file.
The Contiki dynamic loader is able to load CELF files and a utility program is
used to convert ELF files to CELF files.
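The effect of narrowing the data types can be sketched with a relocation entry; the Elf32_Rel layout follows the ELF specification [3], while the CELF counterpart below is an illustrative assumption rather than the actual CELF layout.

#include <stdint.h>

typedef struct {
  uint32_t r_offset;   /* 32-bit offset; the high 16 bits are unused on
                          8-bit and 16-bit targets */
  uint32_t r_info;
} Elf32_Rel;           /* 8 bytes */

typedef struct {
  uint16_t r_offset;   /* 16 bits suffice for a 16-bit address space */
  uint16_t r_info;
} Celf_Rel;            /* 4 bytes: half the size */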
It is possible to further compress CELF files using lossless data compres-
sion. However, we leave the investigation of the energy-efficiency of this ap-
proach to future work.
The drawback of the CELF format is that it requires a special conversion
utility for creating the CELF files. This makes the CELF format less
attractive for use in many real-world situations.
11.5 Implementation
We have implemented run-time dynamic linking of ELF and CELF files in
the Contiki operating system [5]. To evaluate dynamic linking we have im-
plemented an application specific virtual machine for Contiki together with a
compiler for a subset of Java, and have ported a Java virtual machine to Contiki.
[Figure diagram: ROM and RAM each partitioned into the Contiki core (including device drivers, language run-time, symbol table, and dynamic linker) and the loaded programs.]
Figure 11.2: Partitioning in Contiki: the core and loadable programs in RAM
and ROM.
Programs can be loaded and unloaded at run-time, without the need for a reboot.
While it is possible to replace the core at run-time by running a special
loadable program that overwrites the current core and reboots the system, ex-
perience has shown that this feature is not often used in practice.
Through the Contiki file system (CFS) interface, the dynamic linker can read
ELF/CELF files stored in RAM, on-chip flash ROM, external EEPROM, or external
ROM without modification. Since all file access to the ELF/CELF file is made
through the CFS,
the dynamic linker does not need to concern itself with low-level filesystem
details such as wear-leveling or fragmentation [4] as this is better handled by
the CFS.
The dynamic linker performs four steps to link, relocate and load an
ELF/CELF file. The dynamic linker first parses the ELF/CELF file and ex-
tracts relevant information about where in the ELF/CELF file the code, data,
symbol table, and relocation entries are stored. Second, memory for the code
and data is allocated from flash ROM and RAM, respectively. Third, the code
and data segments are linked and relocated to their respective memory loca-
tions, and fourth, the code is written to flash ROM and the data to RAM.
Currently, memory allocation for the loaded program is done using a sim-
ple block allocation scheme. More sophisticated allocation schemes will be
investigated in the future.
Relocation entries may also be relative to the data, BSS, or code segment
in the ELF/CELF file. In that case no symbol is associated with the reloca-
tion entry. For such entries the dynamic linker calculates the address that the
segment will have when the program has been loaded, and uses that address to
patch the code or data.
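For such segment-relative entries, the patching step amounts to adding the segment's load address to the stored offset; the sketch below is illustrative, not the actual Contiki relocation code.

#include <stdint.h>

enum segment { SEG_TEXT, SEG_DATA, SEG_BSS };

struct reloc {
  uint16_t offset;     /* position in the text segment to patch */
  enum segment seg;    /* segment the entry is relative to */
  uint16_t addend;     /* offset within that segment */
};

/* base[] holds the addresses allocated for each segment. */
void patch_segment_relative(uint8_t *text, const uint16_t base[],
                            const struct reloc *r)
{
  uint16_t addr = base[r->seg] + r->addend;  /* address after loading */
  text[r->offset] = addr & 0xff;             /* 16-bit little-endian patch */
  text[r->offset + 1] = addr >> 8;
}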
Loading
When the linking and relocating is completed, the text and data have been re-
located to their final memory position. The text segment is then written to flash
ROM, at the location that was previously allocated. The memory allocated for
the data and BSS segments is used as intermediate storage for transferring
text segment data from the ELF/CELF file before it is written to flash ROM.
Finally, the memory allocated for the BSS segment is cleared, and the contents
of the data segment are copied from the ELF/CELF file.
When the dynamic linker has successfully loaded the code and data segments,
Contiki starts executing the program.
The loaded program may replace an already running Contiki service. If the
service that is to be replaced needs to pass state to the newly loaded service,
Contiki supports the allocation of an external memory buffer for this purpose.
However, experience has shown that this mechanism is very rarely used in
practice, and it is likely to be removed in future versions of Contiki.
Portability
Since the ELF/CELF format is the same across different platforms, we de-
signed the Contiki dynamic linker to be easily portable to new platforms. The
loader is split into one architecture specific part and one generic part. The
generic part parses the ELF/CELF file, finds the relevant sections of the file,
looks up symbols from the symbol table, and performs the generic relocation
logic. The architecture-specific part does only three things: it allocates ROM
and RAM, writes the linked and relocated binary to flash ROM, and interprets
the relocation types in order to modify machine code instructions that need
adjustment because of relocation.
Alternative Designs
The Contiki core symbol table contains all externally visible symbols in the
Contiki core. Many of the symbols may never need to be accessed by loadable
programs, thus causing ROM overhead. An alternative design would be to let
the symbol table include only a handful of symbols, entry points, that define
the only ways for an application program to interact with the core. This would
lead to a smaller symbol table, but would also require a detailed specification
of which entry points should be included in the symbol table. The main
reason why we did not choose this design, however, is that we wish to be able to
replace modules at any level of the system. For this reason, we chose to provide
the same set of symbols to an application program as it would have had, had
it been compiled directly into the core. However, we are continuing to
investigate this alternative design for future versions of the system.
11.6 Evaluation
To evaluate dynamic linking of native code we compare the energy costs of
transferring, linking, relocating, loading, and executing a native code module
in ELF format using dynamic linking with the energy costs of transferring,
loading, and executing the same program compiled for the CVM and the Java
virtual machine. We devise a simple model of the energy consumption of the
reprogramming process. Thereafter we experimentally quantify the energy and
memory consumption as well as the execution overhead for the reprogram-
ming, the execution methods and the applications. We use the results of the
measurements as input into the model which enables us to perform a quantita-
tive comparison of the energy-efficiency of the reprogramming methods.
We use the ESB board [33] and the Telos Sky board [29] as our experi-
mental platforms. The ESB is equipped with an MSP430 microcontroller with
2 kilobytes of RAM and 60 kilobytes of flash ROM, an external 64 kilobyte
EEPROM, and a TR1001 radio transceiver.
PROCESS_THREAD(blinker_process, ev, data)
{
  static struct etimer t;
  PROCESS_BEGIN();
  etimer_set(&t, CLOCK_SECOND);
  while(1) {
    leds_on(LEDS_GREEN);
    PROCESS_WAIT_UNTIL(etimer_expired(&t));
    etimer_reset(&t);
    leds_off(LEDS_GREEN);
    PROCESS_WAIT_UNTIL(etimer_expired(&t));
    etimer_reset(&t);
  }
  PROCESS_END();
}
Figure 11.3: Example Contiki program that toggles the LEDs every second.
We use three Contiki programs to measure the energy efficiency and exe-
cution overhead of our different approaches. Blinker, the first of the three pro-
grams, is shown in Figure 11.3. It is a simple program that toggles the LEDs
every second. The second program, Object Tracker, is an object tracking ap-
plication based on abstract regions [35]. To allow running the programs both
as native code, as CVM code, and as Java code we have implemented these
programs both in C and Java. A schematic illustration of the C implementation
is in Figure 11.4. To support the object tracker program, we implemented a
subset of the abstract regions mechanism in Contiki. The Java and CVM ver-
sions of the program call native code versions of the abstract regions functions.
The third program is a simple 8 by 8 vector convolution calculation.
PROCESS_THREAD(object_tracker_process, ev, data)
{
  static struct etimer t;
  static int value, max, sum, sum_x, sum_y, centroid_x, centroid_y;
  PROCESS_BEGIN();
  while(1) {
    value = pir_sensor.value();
    region_put(reading_key, value);
    region_put(reg_x_key, value * loc_x());
    region_put(reg_y_key, value * loc_y());
    if(value > threshold) {
      max = region_max(reading_key);
      if(max == value) {
        sum = region_sum(reading_key);
        sum_x = region_sum(reg_x_key);
        sum_y = region_sum(reg_y_key);
        centroid_x = sum_x / sum;
        centroid_y = sum_y / sum;
        send(centroid_x, centroid_y);
      }
    }
    etimer_set(&t, PERIODIC_DELAY);
    PROCESS_WAIT_UNTIL(etimer_expired(&t));
  }
  PROCESS_END();
}

Figure 11.4: Schematic illustration of the C implementation of the object
tracking program.
E = Ep + Es + El + Ef
where Ep is the energy spent in transferring the object over the network, Es
the energy cost of storing the object on the device, El the energy consumed
by linking and relocating the object, and Ef the energy required for storing
the linked program in flash ROM. We use a simplified model of the network
propagation energy where we assume a propagation protocol where the energy
consumption Ep is proportional to the size of the object to be transferred. For-
mally,
Ep = Pp so
Figure 11.5: Current draw for receiving 1000 bytes with the TR1001.
In this equation, so is the size of the object file to be transferred and Pp is
a constant scale factor that depends on the network protocol used to transfer
the object. We
use similar equations for Es (energy for storing the binary) and El (energy for
linking and relocating). The equation for Ef (the energy for loading the binary
to ROM) contains the size of the compiled code size of the program instead of
the size of the object file. This model is intentionally simple and we consider
it good enough for our purpose of comparing the energy-efficiency of different
reprogramming schemes.
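Written out in full, the model amounts to the following sketch, where Ps, Pl, and Pf are per-byte scale factors for storing, linking, and flashing, in line with the equations above; the constants are placeholders, not measured values.

/* Total reprogramming energy: so is the object file size, sc the size
   of the compiled code contained in it. */
double reprogramming_energy(double so, double sc, double Pp,
                            double Ps, double Pl, double Pf)
{
  double Ep = Pp * so;   /* receiving the object over the network */
  double Es = Ps * so;   /* storing the object on the device */
  double El = Pl * so;   /* linking and relocating the object */
  double Ef = Pf * sc;   /* writing the linked code to flash ROM */
  return Ep + Es + El + Ef;
}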
Figure 11.6: Current draw for receiving 1000 bytes with the CC2420.
The TR1001 provides a bit-level interface to the CPU, whereas the CC2420
operates at the packet level.
Since the TR1001 operates at the bit-level, the communication speed of the
TR1001 is determined by the CPU. We use a data rate of 9600 bits per second.
The CC2420 has a data rate of 250 kilobits per second, but also incurs some
protocol overhead as it provides a more high-level interface.
Figures 11.5 and 11.6 show the current draw from receiving 1000 bytes
of data with the TR1001 and CC2420 radio transceivers. These measurements
constitute a lower bound on the energy consumption for receiving data over the
radio, as they do not include any control overhead caused by a code propagation
protocol. Nor do they include any packet headers. An actual propagation pro-
tocol would incur overhead because of both packet headers and control traffic.
For example, the Deluge protocol has a control packet overhead of approxi-
mately 20% [14]. This overhead is derived from the total number of control
packets and the total number of data packets in a sensor network. The average
overhead in terms of number of excessive data packets received is 3.35 [14]. In
addition to the actual code propagation protocol overhead, there is also over-
head from the MAC layer, both in terms of packet headers and control traffic.
The TR1001 provides a low-level interface to the CPU, which enabled us
to measure only the current draw of the receiver. We first measured the time
required for receiving one byte of data from the radio. To produce the graph
Table 11.2: Lower bounds on the time and energy consumption for receiving
1000 bytes with the TR1001 and CC2420 transceivers. All values are rounded
to two significant digits.
in the figure, we measured the current draw of an ESB board which we had
programmed to turn on receive mode and busy-wait for the time corresponding
to the reception time of 1000 bytes.
When measuring the reception current draw of the CC2420, we could not
measure the time required for receiving one byte because the CC2420 does not
provide an interface at the bit level. Instead, we used two Telos Sky boards and
programmed one to continuously send back-to-back packets with 100 bytes
of data. We programmed the other board to turn on receive mode when the
on-board button was pressed. The receiver would receive 1000 bytes of data,
corresponding to 10 packets, before turning the receiver off. We placed the two
boards next to each other on a table to avoid packet drops. We produced the
graph in Figure 11.6 by measuring the current draw of the receiver Telos Sky
board. To ensure that we did not get spurious packet drops, we repeated the
measurement five times without obtaining differing results.
Table 11.2 shows the lower bounds on the time and energy consumption
for receiving data with the TR1001 and CC2420 transceivers. The results show
that while the current draw of the CC2420 is higher than that of the TR1001, the
energy efficiency in terms of energy per byte of the CC2420 is better because
of the shorter time required to receive the data.
Figure 11.7: Current draw for writing the Blinker ELF file to EEPROM (0 -
0.166 s), linking and relocating the program (0.166 - 0.418 s), writing the re-
sulting code to flash ROM (0.418 - 0.488 s), and executing the binary (0.488
s and onward). The current spikes delimit the three steps and are intention-
ally caused by blinking on-board LEDs. The high energy consumption when
executing the binary is caused by the green LED.
To measure the loading costs, the ELF file was first copied into an on-board
EEPROM, from where the Contiki dynamic linker linked and relocated the ELF
file before it loaded the program into flash ROM.
Figure 11.7 shows the current draw when loading the Blinker program, and
Figure 11.8 shows the current draw when loading the Object Tracker program.
The current spikes seen in both graphs are intentionally caused by blinking the
on-board LEDs. The spikes delimit the four different steps that the loader is
going through: copying the ELF object file to EEPROM, linking and relocating
the object code, copying the linked code to flash ROM, and finally executing
the loaded program. The current draw of the green LED is slightly above 8
mA, which causes the high current draw when executing the blinker program
(Figure 11.7). Similarly, when the object tracking application starts, it turns
on the radio for neighbor discovery. This causes the current draw to rise to
around 6 mA in Figure 11.8, and matches the radio current measurements in
Figure 11.5.
Figure 11.8: Current draw for writing the Object Tracker ELF file to EEPROM
(0 - 0.282 s), linking and relocating the program (0.282 - 0.882 s), writing the
resulting code to flash ROM (0.882 - 0.988 s), and executing the binary (0.988
s and onward). The current spikes delimit the three steps and are intentionally
caused by blinking on-board LEDs. The high current draw when executing the
binary comes from the radio being turned on.

Table 11.3 shows the energy consumption of loading and linking the Blinker
program. The energy was obtained by integrating the current curve in
Figure 11.7 and multiplying it by the voltage used in our experiments (4.5 V).
We see that the linking and relocation step is the most expensive in terms of
energy. It is also the longest step.

Table 11.3: Measured energy consumption of the storing, linking and loading
of the 1056 bytes large Blinker binary and the 1824 bytes large Object Tracker
binary. The size of the Blinker code is 130 bytes and the size of the Object
Tracker code is 344 bytes.
To evaluate the energy overhead of the ELF file format, we compare the
energy consumption for receiving four different Contiki programs using the
ELF and CELF formats. In addition to the two programs from Figures 11.3
and 11.4 we include the code for the Contiki code propagation mechanism and
a network publish/subscribe program that performs periodic flooding and con-
verging of information. The two latter programs are significantly larger. We
calculate an estimate of the required energy for receiving the files by using the
measured energy consumption of the CC2420 radio transceiver and multiply it
by the average overhead by the Deluge code propagation protocol, 3.35 [14].
The results are listed in Table 11.4 and show that radio reception is more energy
consuming than linking and loading a program, even for a small program. Fur-
thermore, the results show that the relative average size and energy overhead
for ELF files compared to the code and data contained in the files is approxi-
mately 4 whereas the relative CELF overhead is just under 2.
Table 11.4: The overhead of the ELF and CELF file formats in terms of bytes
and estimated reception energy for four Contiki programs. The reception en-
ergy is the lower bound of the radio reception energy with the CC2420 chip,
multiplied by the average Deluge overhead (3.35).
The native C code was compiled with the MSP430 port of GCC version 3.2.3.
The MSP430 digitally-controlled oscillator was set to clock the CPU at a speed
of 2.4576 MHz. We measured the execution time of the three implementations
using the on-chip timer A1 that was set to generate a timer interrupt 1000 times
per second. The execution times are averaged over 5000 iterations of the object
tracking program.
The results in Table 11.6 show the execution time of one run of the object
tracking application from Figure 11.4. The execution time measurements are
averaged over 5000 runs of the object tracking program. The energy consump-
tion is calculated by multiplying the execution time with the average energy
consumption when a program is running with the radio turned off. The table
shows that the overhead of the Java virtual machine is higher than that of the
CVM, which in turn is higher than the execution overhead of the native C code.
All three implementations of the tracker program use the same abstract
regions library which is compiled as native code. Thus much of the execution
time in the Java VM and CVM implementations of the object tracking program
is spent executing the native code in the abstract regions library. Essentially,
the virtual machine simply acts as a dispatcher of calls to various native
functions.

Table 11.5: Memory requirements, in bytes. The ROM size for the dynamic
linker includes the symbol table. The RAM figures do not include memory for
programs running on top of the virtual machines.

Table 11.6: Execution times and energy consumption of one iteration of the
tracking program.

For programs that spend a significant part of their time executing virtual
machine code the relative execution times are significantly higher for the vir-
tual machine programs. To illustrate this, Table 11.7 lists the execution times
of a convolution operation of two vectors of length 8. Convolution is a com-
mon operation in digital signal processing where it is used for algorithms such
as filtering or edge detection. We see that the execution time of the program
running on the virtual machines is close to ten times that of the native program.
Using our model from Section 11.6.1 and the results from the above measure-
ments, we can calculate approximations of the energy consumption for distri-
bution, reprogramming, and execution of native and virtual machine programs
in order to compare the methods with each other. We set Pp , the scale factor
of the energy consumption for receiving an object file, to the average Deluge
overhead of 3.35.
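The break-even points in the figures below follow from equating, for any two methods, the one-time reprogramming cost plus the accumulated per-iteration execution cost; a sketch with placeholder arguments:

/* Iteration count n at which method A's total energy equals method B's:
   load_a + n * exec_a = load_b + n * exec_b. */
double break_even_iterations(double load_a, double exec_a,
                             double load_b, double exec_b)
{
  return (load_b - load_a) / (exec_a - exec_b);
}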
[Graph: consumed energy (mJ) versus number of program iterations for the Java VM, ELF, CVM, and CELF methods.]
Figure 11.9: Break-even points for the object tracking program implemented
with four different linking and execution methods.
[Graph: consumed energy (mJ) versus number of program iterations for the Java VM, ELF, CVM, and CELF methods.]
Figure 11.10: Break-even points for the vector convolution implemented with
four different linking and execution methods.
Table 11.10: Number of lines of code for the dynamic linker and the
microcontroller-specific parts.
11.6.6 Portability
Because of the diversity of sensor network platforms, the Contiki dynamic
linker is designed to be portable between different microcontrollers. The dy-
namic linker is divided into two modules: a generic part that parses and ana-
lyzes the ELF/CELF that is to be loaded, and a microcontroller-specific part
that allocates memory for the program to be loaded, performs code and data
relocation, and writes the linked program into memory.
To evaluate the portability of our design we have ported the dynamic linker
to two different microcontrollers: the TI MSP430 and the Atmel AVR. The TI
MSP430 is used in several sensor network platforms, including the Telos Sky
and the ESB. The Atmel AVR is used in the Mica2 motes.
Table 11.10 shows the number of lines of code needed to implement each
module. The dramatic difference between the MSP430-specific module and
the AVR-specific module is due to the different addressing modes used by the
machine code of the two microcontrollers. While the MSP430 has only one
addressing mode, the AVR has 19 different addressing modes. Each addressing
mode must be handled differently by the relocation function, which leads to a
larger amount of code for the AVR-specific module.
11.7 Discussion
Standard file formats. Our main motivation behind choosing the ELF for-
mat for dynamic linking in Contiki was that the ELF format is a standard file
format. Many compilers and utilities, including all GCC utilities, are able to
produce and handle ELF files. Hence no special software is needed to com-
pile and upload new programs into a network of Contiki nodes. In contrast,
FlexCup [27] and diff-based approaches require specially crafted
utilities to produce the metadata or diff scripts needed for uploading software.
These special utilities also need to be maintained and ported to the full range
of development platforms used for software development for the system.
Operating system support. Dynamic linking of ELF files requires sup-
port from the underlying operating system and cannot be done on monolithic
operating systems such as TinyOS. This is a disadvantage of our approach. For
monolithic operating systems, an approach such as FlexCup is better suited.
Heterogeneity. With diff-based approaches a binary diff is created either
at a base station or by an outside server. The server must have knowledge of
the exact software configuration of the sensor nodes on which the diff script
is to be run. If sensor nodes are running different versions of their software,
diff-based approaches do not scale.
Specifically, in many of our development networks we have witnessed a
form of micro heterogeneity in the software configuration. Many sensor nodes,
which have been running the exact same version of the Contiki operating sys-
tem, have had small differences in the address of functions and variables in
the core. This micro heterogeneity comes from the different core images being
compiled by different developers, each having slightly different versions of the
C compiler, the C library and the linker utilities. This results in small varia-
tions of the operating system image depending on which developer compiled
the operating system image. With diff-based approaches micro heterogeneity
poses a big problem, as the base station would have to be aware of all the small
differences between each node.
Combination of native and virtual machine code. Our results suggest
that a combination of native and virtual machine code is an energy efficient
alternative to pure native code or pure virtual machine code approaches. The
dynamic linking mechanism can be used to update the native code that the
virtual machine code calls through the native code interfaces of the virtual
machines.
11.8 Related Work
There is a trade-off between running virtual machine code that is inexpensive
to update and running native code that executes more efficiently but requires
more costly updates. This trade-off has been further discussed by Levis and
Culler [19] who implemented the Maté virtual machine designed to both sim-
plify programming and to leverage energy-efficient large-scale software up-
dates in sensor networks. Maté is implemented on top of TinyOS.
Levis and Culler later enhanced Maté with application specific virtual ma-
chines (ASVMs) [20]. They address the main limitations of Maté: flexibility,
concurrency and propagation. Whereas Maté was designed for a single ap-
plication domain only, ASVM supports a wide range of application domains.
Further, instead of relying on broadcasts for code propagation as Maté does,
ASVM uses the trickle algorithm [21].
The MagnetOS [23] system uses the Java virtual machine to distribute ap-
plications across an ad hoc network of laptops. In MagnetOS, Java applica-
tions are partitioned into distributed components. The components transpar-
ently communicate by raising events. Unlike Maté and Contiki, MagnetOS
targets platforms larger than sensor nodes, such as PocketPC devices. Sensor-
Ware [1] is another script-based proposal for programming nodes that targets
larger platforms. VM* is a framework for runtime environments for sensor
networks [18]. Using this framework Koshy and Pandey have implemented a
subset of the Java Virtual Machine that enables programmers to write applica-
tions in Java, and access sensing devices and I/O through native interfaces.
Mobile agent-based approaches extend the notion of injected scripts by de-
ploying dynamic, localized and intelligent mobile agents. Using mobile agents,
Fok et al. have built the Agilla platform that enables continuous reprogram-
ming by injecting new agents into the network [9].
TinyOS uses a special description language for composing a system of
smaller components [10], which are statically linked with the kernel into a
complete image of the system. After linking, modifying the system is not possi-
ble [19] and hence TinyOS requires the whole image to be updated even for
small code changes.
Systems that offer loadable modules besides Contiki include SOS [12] and
Impala [24]. Impala features an application updater that enables software up-
dates to be performed by linking in updated modules. Updates in Impala are
coarse-grained since cross-references between different modules are not possi-
ble. Also, the software updater in Impala was only implemented for much more
resource-rich hardware than our target devices. The design of SOS [12] is very
similar to the Contiki system: SOS consists of a small kernel and dynamically-
loaded modules. However, SOS uses position independent code to achieve
relocation and jump tables for application programs to access the operating
system kernel. Application programs can register function pointers with the
operating system for performing inter-process communication. Position in-
dependent code is not available for all platforms, however, which limits the
applicability of this approach.
11.9 Conclusions
We have presented a highly portable dynamic linker and loader that uses the
standard ELF file format and compared the energy-efficiency of run-time dy-
namic linking with an application specific virtual machine and a Java virtual
machine. We show that dynamic linking is feasible even for constrained sensor
nodes.
Our results also suggest that a combination of native and virtual machine
code provides an energy efficient alternative to pure native code or pure virtual
machine approaches. The native code that is called from the virtual machine
code can be updated using the dynamic linker, even in heterogeneous systems.
Acknowledgments
This work was partly financed by VINNOVA, the Swedish Agency for Inno-
vation Systems, and the European Commission under contract IST-004536-
RUNES. Thanks to our paper shepherd Feng Zhao for reading and commenting
on the paper.
Bibliography
[3] TIS Committee. Tool Interface Standard (TIS) Executable and Linking
Format (ELF) Specification Version 1.2, May 1995.
[4] H. Dai, M. Neufeld, and R. Han. Elf: an efficient log-structured flash file
system for micro sensor nodes. In SenSys, pages 176–187, 2004.
[5] A. Dunkels, B. Grönvall, and T. Voigt. Contiki - a lightweight and flexible
operating system for tiny networked sensors. In Proceedings of the First
IEEE Workshop on Embedded Networked Sensors (IEEE Emnets ’04),
Tampa, Florida, USA, November 2004.
[6] A. Dunkels, O. Schmidt, T. Voigt, and M. Ali. Protothreads: Simpli-
fying event-driven programming of memory-constrained embedded sys-
tems. In Proceedings of the 4th International Conference on Embedded
Networked Sensor Systems, SenSys 2006, Boulder, Colorado, USA, 2006.
[7] D. Estrin (editor). Embedded everywhere: A research agenda for net-
worked systems of embedded computers. National Academy Press, 1st
edition, October 2001. ISBN: 0309075688
[8] G. Ferrari, J. Stuber, A. Gombos, and D. Laverde, editors. Programming
Lego Mindstorms with Java with CD-ROM. Syngress Publishing, 2002.
ISBN: 1928994555
[9] C. Fok, G. Roman, and C. Lu. Rapid development and flexible deploy-
ment of adaptive wireless sensor network applications. In Proceedings
of the 24th International Conference on Distributed Computing Systems,
Tokyo, Japan, June 2005.
[19] P. Levis and D. Culler. Maté: A tiny virtual machine for sensor networks.
In Proceedings of ASPLOS-X, San Jose, CA, USA, October 2002.
[22] J. Lilius and I. Paltor. Deeply embedded Python, a virtual machine for
embedded systems. Web page, visited 2006-04-06.
URL: http://www.tucs.fi/magazin/output.php?ID=2000.N2.LilDeEmPy
[23] H. Liu, T. Roeder, K. Walsh, R. Barr, and E. Gün Sirer. Design and
implementation of a single system image operating system for ad hoc
networks. In MobiSys, pages 149–162, 2005.