
Helsinki University of Technology

Department of Computer Science and Engineering


Software Technology Laboratory
Espoo 2007 TKO-C94/07

Proceedings of Seminar on
Energy-Aware Software

Vesa Hirvisalo (ed.)

TEKNILLINEN KORKEAKOULU
TEKNISKA HÖGSKOLAN
HELSINKI UNIVERSITY OF TECHNOLOGY
TECHNISCHE UNIVERSITÄT HELSINKI
UNIVERSITE DE TECHNOLOGIE D’HELSINKI
Helsinki University of Technology
Department of Computer Science and Engineering
Software Technology Laboratory
P.O. Box 5400
FIN-02015 HUT
Espoo
Finland

Keywords: embedded system, energy

Copyright c 2007 Helsinki University of Technology

ISBN: 978-951-22-8871-7
ISSN: 1239-6907
Preface
The Software Technology Laboratory at Helsinki University of Technology organized a
seminar on energy-aware software during spring 2007. The goal of the seminar was to
take a look at the current status of the field. These proceedings include some of the work
presented in the seminar.

Espoo, June 2007


Seminar chair

Vesa Hirvisalo

Contents

Hardware Accelerators in Embedded Systems
  Lari Ahti 1

Harnessing the Power of Software – A Survey on Energy-Aware Interfaces
  Gerard Bosch 6

Compiler Memory Energy Optimizations
  Peter Majorin 13

Measuring the CPU Energy Consumption of a Modern Mobile Device
  Antti P. Miettinen 20

Energy-Aware Scheduling
  Juhani Peltonen 26

Instruction-Level Energy Consumption
  Kristian Söderblom 33

Energy Accounting
  Timo Töyry 38

Hardware Accelerators in Embedded Systems

Lari Ahti
Helsinki University of Technology
Software Technology Laboratory
[email protected]

Abstract

Hardware acceleration is the use of additional hardware to perform some function more efficiently than a more general hardware implementation could. Especially in wireless embedded systems, complicated communication algorithms demand high performance from the system on chip. In battery-powered devices, such algorithms are therefore often implemented with hardware accelerators in order to meet both energy consumption and performance requirements.

One application that takes advantage of hardware accelerators is software defined radio, a solution targeted at the problems caused by fixed hardware in mobile communication devices. Software defined radio is a layer that performs highly demanding signal processing and yet remains reprogrammable after manufacturing. Reprogrammability has several advantages: support for multiple protocols, faster time to market, higher chip volumes and support for late implementation changes.

1. Introduction

This paper is divided into two sections: the first gives an introduction to different hardware acceleration techniques, and the second concentrates on one specific area of hardware acceleration, software defined radio (SDR). The paper focuses on embedded systems, because hardware accelerators play a crucial role in many embedded applications.

2. Hardware acceleration

This section gives the motivation for hardware acceleration and a brief introduction to hardware acceleration techniques. Hardware and software techniques are discussed in separate subsections.

2.1. Motivation

The motivation for hardware acceleration is usually a performance increase, but energy consumption can also be a motivator, especially in embedded systems. Hardware acceleration changes the behaviour of the system at least at the hardware level, but it may also affect the software implementation. [7]

A hardware acceleration implementation is usually a trade-off between three main factors: performance, energy consumption and flexibility. Other factors that may affect the choice of a specific technique are, for example, manufacturing price, design costs and time to market. Especially in battery-powered devices, energy consumption constrains the other factors. Graphics accelerators, which are designed to render real-time graphics, are another example of hardware acceleration; in their case the performance of the accelerator is maximized while the other factors are constrained.

2.2. Hardware techniques

This section discusses several hardware techniques for hardware acceleration. Most hardware accelerators decrease the flexibility of the system in order to gain performance or reduce energy consumption. A hardware accelerator usually co-operates with a general-purpose processor that controls the accelerator. Acceleration hardware can be divided into several groups that vary in performance, energy consumption and flexibility. Some of the most common hardware accelerator implementations for embedded systems are:

• Application-specific integrated circuits (ASIC) are integrated circuits designed for accelerating some particular function. They provide highly tuned performance for a certain task, but design costs are high and time to market is longer than with other hardware solutions. The lack of ability to reprogram ASICs after manufacturing is a further disadvantage.
• Field programmable gate arrays (FPGA) are generally slower than ASICs and tend to use more power, but they can be made reprogrammable and the design process is usually easier.

• Digital signal processors (DSP) provide an efficient mechanism for handling real-time digital signal processing. DSPs are designed to handle streamed data and have a specialized instruction set architecture for signal-processing functions.

[7]

There are also other types of accelerators, and in many cases a combination of several accelerators is used. For example, a graphics accelerator is usually a combination of several hardware accelerators located on a single chip.

2.3. Software techniques

In many cases hardware acceleration also affects the software of the device, as a new hardware configuration is introduced. Software techniques provide methods for taking advantage of the hardware configuration of the device. Generally, the software implementations can be divided into three methods: instruction level acceleration, function level acceleration and architectural enhancements.

Instruction level acceleration means that the most common instruction sequences are implemented with specialized instructions. In some cases instruction level parallelism can be used to increase the performance of the system. This method limits the flexibility of the system while increasing performance. Specialized instructions require extensions to the instruction set architecture, and therefore software compilers may also require extensions.

Function level acceleration means that an entire algorithm is replaced with specific hardware. Function level acceleration typically results in a more efficient solution than instruction level acceleration, but the fixed hardware limits the flexibility of the system. A function level accelerator can often execute in parallel with other hardware, which increases performance even more but can introduce new problems. Because of the fixed hardware, the algorithm of a function level accelerator must be known before manufacturing, and reprogramming the accelerator may be difficult or even impossible afterwards.

Architectural enhancements are targeted at increasing the performance of the existing hardware by adding more execution units or implementing parallel execution of instructions. This method requires no instruction set architecture extensions, but the performance increase is usually smaller than with the other acceleration methods.

[7]

2.4. On-Chip Communication

So far separate processing units have been covered, but communication between elements must also be taken into account when discussing hardware acceleration. Figure 1 presents the power consumption distribution in an embedded processor system. On-chip bus power consumption represents about 15% of the total power consumption of all elements.

Figure 1. Power consumption of system on chip components at 200 MHz frequency [3]

Several methods for increasing the power efficiency of on-chip communication have been presented. For example, bus encoding, segmented bus design, interface power management and traffic sequencing are techniques designed to reduce the overall power consumption of the communication bus. A completely different approach to on-chip communication is "network on-chip" technology, which replaces bus wiring with a network between processing elements.

2.4.1 Bus encoding

In bus encoding, the data transferred through the bus is encoded to gain power savings. Bus encoding can also be used to address other problems such as delay and crosstalk reduction. [5] Several encoding schemes have been developed, and encoding can take advantage of bus usage characteristics: for example, address bus encoding can reduce switching activity on the bus by exploiting the spatial and temporal locality of bus accesses. [9] Bus encoding can also increase the complexity of the bus, which may decrease overall performance or add power consumption to the system. [3]

2.4.2 Segmented busses

In normal buses, each signal transmitted onto the bus is broadcast to all possible bus slaves. This increases the power consumption of the bus, as unnecessary work is done to deliver the signal. A segmented bus architecture addresses this problem by including additional logic in the bus. This logic controls the transmission of signals and reduces power consumption by
selecting the target of each signal. The technique can introduce additional delay and power consumption due to the control logic in the bus. [3]

2.4.3 Power Management

Power management reduces power consumption by limiting the performance of the managed elements when possible. For example, idle slave interfaces on an on-chip bus can be shut down and restarted on demand. Waking a slave interface may introduce additional delay to the bus, so effective power management logic is essential; for example, idle times may be predicted when deciding whether to shut down a slave interface or leave it idle. The power savings of this technique depend on the system-on-chip architecture and the power management logic. [3]

2.4.4 Traffic Sequencing

When multiple masters are attached to a bus, traffic sequencing can reduce the switching activity on the bus and thus its overall power consumption. This method does not necessarily require new hardware, because it can be implemented in the applications run on the system. Another option is to create a more sophisticated bus protocol that lets the bus give feedback about its current state. [3]

2.4.5 Network-On-Chip (NOC)

The previous subsections described different techniques for improving the power efficiency of bus-based on-chip communication. Network-on-chip is a new approach to communication between processing elements, targeted at avoiding the common problems of buses and dedicated wiring. The idea in NOC is to replace ad-hoc global wiring structures with an interconnection network. The approach is especially beneficial in architectures where thermal issues limit the overall performance of the system on chip. So far commercial implementations are still based on bus architectures, but NOC implementations may be developed in the future. [2]

3. Software Defined Radio

Wireless communication is a field where hardware accelerators are commonly used to perform signal processing tasks. Battery-powered devices require energy consumption to be as small as possible, but complicated signal processing algorithms require high computational power. To overcome this problem, hardware accelerators are used to perform the computationally demanding tasks with smaller energy consumption than more general-purpose hardware. The design and verification of fixed hardware systems is difficult, because each implementation is suited to a specific task. Recurring costs are also high, because the reuse of previous designs is difficult. Fixed hardware may also cause problems if a hardware defect is detected after manufacturing, due to the lack of reprogrammability.

Software defined radio is a solution targeted at the problems caused by fixed hardware. Software defined radio is a layer that performs highly demanding signal processing and yet remains reprogrammable after manufacturing. Reprogrammability has several advantages: support for multiple protocols, faster time to market, higher chip volumes and support for late implementation changes. [4]

3.1. Motivation

The motivation for software defined radio is to find an energy-efficient way to handle multiple computationally demanding protocols in order to meet the communication requirements of wireless battery-powered devices. Historically, application processing requirements have been the main focus of system architecture design, but the emphasis has shifted to network protocols and signal processing. [4] Currently several different protocols are needed to cover all wireless network types. Common modern wireless protocols are shown in Figure 2.

Figure 2. Modern wireless communication protocols. [4]

The problem with these protocols is that their computational requirements are much higher than the capabilities of modern DSP processors. Current DSP processors can perform about 10 Mops/mW, while modern wireless protocols require about 100 Mops/mW. [4] To increase the performance of the system, hardware accelerators are needed.
Figure 3. Computational power relative to energy consumption in some hardware implementations. [4]

3.2. Architecture suggestions

Several different software defined radio architectures have been suggested. It is interesting to note that the solutions are very different from each other. A few interesting types of software defined radio architecture are:

• Hybrid-SIMD based architecture. This architecture consists of separate scalar and vector processors. The scalar processor controls the signal processing and the vector processors perform the actual processing. Examples of such architectures are SODA [4] and SandBlaster [6].

• FPGA based architecture. Many solutions take advantage of FPGAs, as they provide a decent performance increase and are reprogrammable. The solutions usually include a processor that controls the processing. Difficulties arise from the need to meet real-time constraints. An example of such an architecture is picoArray [1].

• Heterogeneous architecture. Some solutions use several heterogeneous processing elements to satisfy the computational requirements. Each processing element is designed for a specific communication signal processing algorithm. This approach limits the flexibility of the system but is very efficient for its special purpose. Workload distribution among the processing elements is also difficult due to the heterogeneous architecture, and each processing element type must be able to handle its worst-case workload.

• VLIW DSP architecture. Very long instruction word (VLIW) DSPs can achieve high performance by executing several instructions in parallel. The idea in VLIW is that several instructions are combined into a single instruction word at compile time and then executed in parallel in hardware. The energy consumption of instruction execution is higher than with other architectures, which also limits the overall performance of the architecture. Current DSP implementations do not satisfy the high computational requirements of modern wireless algorithms, and thus the architecture requires some type of hardware accelerator for additional performance. One example of a VLIW DSP processor is the Texas Instruments TMS320C64x DSP. [8]

4. Summary

Many embedded systems require high computational performance and extremely low power consumption. Hardware accelerators are used to provide additional performance for the system on chip. A hardware accelerator performs some function more energy-efficiently than a general-purpose processor. Accelerators usually limit the flexibility of the system by adding more task-specific hardware.

Many modern wireless battery-powered devices use hardware accelerators to meet their computational requirements within a limited energy capacity. Historically, hardware accelerator implementations were created with task-specific hardware that is not very flexible. Current mobile devices implement a wide range of wireless protocols, which makes complicated fixed-hardware implementations difficult. A more dynamic approach is required to create more generic processing elements that can be reprogrammed after manufacturing.

Software defined radio is a solution to the problems caused by fixed hardware implementations in mobile devices. The idea in SDR is to create a layer that can handle several complicated communication algorithms with limited energy resources. Hardware accelerators are used to gain performance without significantly increasing the power consumption of the device. Many SDR architectures have been proposed, but consumer products with SDR implementations are not yet available.

References

[1] R. Baines and D. Pulley. Software defined baseband processing for 3G basestations. In 4th International Conference on 3G Mobile Communication Technologies, pages 123–127, 2003.

[2] William J. Dally and Brian Towles. Route packets, not wires: on-chip interconnection networks. In DAC '01:
Proceedings of the 38th conference on Design automa-
tion, pages 684–689, New York, NY, USA, 2001. ACM
Press.
[3] Kanishka Lahiri and Anand Raghunathan. Power anal-
ysis of system-level on-chip communication architec-
tures. In CODES+ISSS ’04: Proceedings of the 2nd
IEEE/ACM/IFIP international conference on Hard-
ware/software codesign and system synthesis, pages
236–241, New York, NY, USA, 2004. ACM Press.
[4] Yuan Lin, Hyunseok Lee, Mark Woh, Yoav Harel,
Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti, and
Krisztian Flautner. Soda: A low-power architecture for
software radio. In ISCA ’06: Proceedings of the 33rd
annual international symposium on Computer Archi-
tecture, pages 89–101, Washington, DC, USA, 2006.
IEEE Computer Society.
[5] Chittarsu Raghunandan, K. S. Sainarayanan, and M. B.
Srinivas. Bus-encoding technique to reduce delay,
power and simultaneous switching noise (ssn) in rlc
interconnects. In GLSVLSI ’07: Proceedings of the
17th great lakes symposium on Great lakes symposium
on VLSI, pages 371–376, New York, NY, USA, 2007.
ACM Press.
[6] Michael Schulte, John Glossner, Sanjay Jinturkar,
Mayan Moudgill, Suman Mamidi, and Stamatis Vas-
siliadis. A low-power multithreaded processor for soft-
ware defined radio. J. VLSI Signal Process. Syst., 43(2-
3):143–159, 2006.
[7] Eric Tell. Design of Programmable Baseband Processors. PhD thesis, Linköping Studies in Science and Technology, 2005.
[8] Texas Instruments. TMS320C64x DSP Generation,
2003. http://www.softier.com/pdf/sprt236a.pdf.
[9] Feng Wang, Yuan Xie, N. Vijaykrishnan, and M. J. Ir-
win. On-chip bus thermal analysis and optimization.
In DATE ’06: Proceedings of the conference on De-
sign, automation and test in Europe, pages 850–855,
3001 Leuven, Belgium, Belgium, 2006. European De-
sign and Automation Association.
Harnessing the Power of Software –
A Survey on Energy-Aware Interfaces
Gerard Bosch i Creus
Nokia Research Center
P.O. Box 407
FIN-00045 Nokia Group (Finland)
[email protected]

Abstract—Mobile devices are becoming increasingly complex and software-intensive. The platforms supporting these devices need to cope with the growing complexity while maximizing the use of the underlying hardware. At the same time, manufacturers need to ensure that the battery life is not an encumbrance to user experience. Since battery technology does not have the capability to provide the necessary improvements in energy density, the next solution is to make devices more energy-efficient. Software can be regarded as the ultimate power consumer and therefore making more energy-efficient software can have a significant impact on the battery life of mobile devices. This paper focuses on software adaptations for energy efficiency, and specifically the interfaces to the operating system (OS). I present a survey of the previous research on the area as well as a review of the power management functionality offered by leading mobile OSs. There is a clear gap between the state of the art and the state of the practice. Bridging this gap has the potential to provide considerable energy savings and give some relief to the energy consumption problem.

I. INTRODUCTION

The exploding number of features is rapidly adding to the amount of processing power and related hardware needed for their implementation. Consumers desire more performance, more impressive multimedia, faster data connections, and better usability. As a result, devices are getting more power-hungry to the point where power consumption and thermal issues become seriously limiting factors.

Power management is required because mobile phones are battery-operated devices and run on a limited power supply. Additionally, phones are becoming smaller in physical size, which can make excessive power consumption heat them up more easily. Battery technology improves at a steady rate, but is not able to keep pace with the continuous upscaling of processing performance and resource usage. Current battery technology cannot offer the energy densities required to make the power consumption problem disappear. Given that battery technology does not seem to provide the necessary improvements regarding energy and power management, the next solution is to try improving the phone platforms so that the desired features can be implemented at a much lower cost in energy consumption.

Possible solutions can be addressed using both hardware and software approaches. The hardware approach is often emphasized since hardware is the part physically draining energy from the battery. From this viewpoint, it makes sense to focus on hardware optimizations. However, since the only mission of hardware is actually to fulfill software needs, one can argue that software is the ultimate consumer of energy and therefore the focus should be on software optimizations [1].

Extensive research has been produced on both hardware and software sides. The approaches should not be considered mutually exclusive, but rather synergistic in nature. Hardware should ideally provide an optimal trade-off between energy and other non-functional attributes such as performance. On the other side, software should strive to use those hardware pathways offering the optimal trade-offs for the application at hand. Most of the software is constructed with the help of supporting software development tools like compilers that may prioritize attributes such as speed or memory footprint at the expense of energy efficiency [2]. However, there seems to be a growing understanding that the applications and their interaction with the underlying platform play a crucial role in power management [3], [4], [5], [6].

This paper presents a review of the available literature on software adaptations for energy efficiency. I focus on the layers ranging from the application software to middleware and its interactions with the underlying operating system (OS). In addition, I present a review of the available mechanisms on several mainstream OSs for mobile devices.

II. TACKLING THE RIGHT ABSTRACTION LEVEL

Software may be addressed at different levels of abstraction. Typically, the higher the abstraction level, the more impact any modifications will have on non-functional quality attributes. Figure 1 shows the impact on energy consumption for modifications at different levels of abstraction. This section reviews the different possibilities in this regard.

A. Architectural level

Software architecture is the highest level of abstraction in software design. Architects decide different high-level features and characteristics of software components at this level. Software is typically architected before proceeding onto more detailed design, although iterative software development processes imply a more cyclical nature for architecting.

The wide view that software architecture provides may offer unique opportunities for energy optimizations. Tan et al. [3]
propose a tool for energy optimizations at the architectural level. Their approach requires accurate modeling in the form of software architecture graphs (SAGs). Using SAGs as an input, the tool exploits several aspects of software like temporal and sequential cohesion or IPC merging and replacement, among others. The authors claim the approach provides up to 66.1% power reductions.

However, such an approach requires a fairly accurate architectural description of the actual software that in some cases falls rather close to the implementation. Such accurate modeling may not be available at the time when the architecture is being prepared. Additionally, the more the architecture and the actual implementation diverge, the more the results from this approach will suffer. Unfortunately, practice shows that software rarely preserves the designed architecture in its entirety.

Fig. 1. Abstraction levels and their impact on energy consumption [3]

B. Instruction level

Compilers handle most of the translation from program structure to actual platform-specific instructions except for a few cases where manual optimization is required. Most developers are used to working instead at the logical level since software may be targeted for several different platforms. Nevertheless, critical code may require manual optimizations and therefore developers undertaking such an activity should be aware of the different trade-offs at this level between performance and energy consumption.

Previous studies show that the right choice of instruction can have a strong impact on energy consumption. Simunic et al. [2] claim savings of up to 90% by modifying specific constructs on an ARM processor. For example, unsigned integers are more efficient than their signed counterpart, and 32-bit integers should be preferred over shorter types as they are much more energy-efficient. Recursive algorithms can be more energy-efficient than their equivalent iterative version in certain cases [2].

C. Logical level

Developers and platform providers interface at the logical level. This is the level at which most of the implementation takes place. Platform providers typically offer software development kits with a set of application programming interfaces that developers use to implement their software on a given platform. Software comes from a variety of providers: smartphone manufacturers, open-source contributors, tool vendors, and other third-parties. Together, these individuals represent a vast community with high impact on the energy expenditure of the software they create. It is in the interest of device manufacturers to assist developers on energy efficiency issues, since bad user experience related to poor battery life can have a strong impact on brand perception.

Manufacturers can provide assistance in the form of development tools targeting energy optimizations or by providing interfaces that developers can use to increase energy efficiency. Previous studies have recognized these two areas to be of vital importance in software development [6]. In the following, I focus on the interfaces that developers can use to produce more energy-efficient software.

III. PROPOSED SOLUTIONS

Several authors have previously addressed the topic of energy-aware interfaces between the OS and software running on top of it. This section reviews some proposed collaboration mechanisms that applications and middleware could exploit for greater energy efficiency.

A. Adaptive applications

The concept of adaptive applications is an integral part of the Odyssey framework [7], [4]. Odyssey is a resource management framework and in this sense it is not specifically focusing on energy. Odyssey tackles energy along with other resources such as network bandwidth, CPU, or disk cache space. Application adaptation is tightly bound to the concept of fidelity, in which applications adapt their quality of service (QoS) to their allowed resource usage levels.

Odyssey emphasizes the necessary balance between application diversity and concurrency. In this context, diversity refers to the vast differences in application software and their resource usage, and concurrency refers to the capability of satisfying the resource usage of simultaneously executing applications. The level of OS involvement is directly correlated to this balance. Application diversity is best represented by giving applications direct control over resources. On the other hand, application concurrency is best achieved by keeping resource control at the OS level, making resource management totally transparent to applications. Figure 2 presents the Odyssey architecture.

Odyssey takes a balanced approach by letting applications negotiate their resource usage with the system. Negotiation happens through a set of functions to request a certain resource level. The framework is responsible for enforcing the agreed resource levels. Odyssey monitors resource usage and notifies applications when a certain level can not be guaranteed anymore. In that case, applications must modify their resource levels. Diversity is supported by letting applications decide the mappings between resource levels and fidelity. In other words, Odyssey implements an admission control mechanism typical of real-time operating systems. However, Odyssey tackles resources at a much higher level of abstraction.
TABLE I
A PPLICATION CHARACTERIZATION FOR NETWORKING [10]

Parameter Tiny Small Medium Large Huge


Received SSH Browser Stream Download
packet size Stream
NFS
Sent packet NFS
size
Received/sent SSH Stream Download
ratio
Fig. 2. Odyssey framework architecture [7] Inactive/active Download NFS SSH Browser
ratio
Inactive Stream Browser
period jitter
The Odyssey framework has been validated with different applications [5], [8], [6]. Application adaptation can have a significant impact on power consumption. Flinn and Satyanarayanan achieved up to 72% savings in power consumption using application adaptation [5]. However, they acknowledge that the savings vary according to the degree to which applications lend themselves to fidelity management.

Server-side collaboration may be required for maximum energy savings in some applications. However, fidelity management is handled entirely on the client side. For example, a streaming video player may request the server side to change to a lower-quality stream. Neglecting the server side will typically result in suboptimal savings.

B. Ghost hints

Anand et al. [9] propose the notion of ghost hints to increase energy efficiency. Their approach is based around the concept of optimizing data access paths for applications. The authors assume that applications may fetch required data through several devices, such as the hard disk or the network interface. These devices may in turn implement several states with different performance-to-power ratios. The operating system is responsible for steering the different device states in the most energy-efficient manner. However, optimally selecting the most efficient device states requires knowledge about the future use of devices, which is not available. For example, a single access may not warrant a spin-up for a hard disk, but the transition would pay off in the case of multiple accesses.

Applications query device power states from the OS. When accessing data, applications compute the best path to access data in the current device situation, as well as the optimal path had the power state levels been different. When applications detect that an access could have been improved by having a device in a different state, they issue a ghost hint to that device's power manager. If the power manager receives several ghost hints in succession, it may consider changing the power state for that device. Without ghost hints, the device would not be used at all and thus the power manager would be totally oblivious to the suboptimal device state allocation. In other words, ghost hints provide a more proactive approach to device power management by allowing for application input. Applications are in the best situation to offer such hints because they know the acceptable power-to-performance trade-offs for the user. Applications need to balance energy efficiency with access performance, which is typically dictated by the user expectations on battery life.

C. Application characterization

Weissel et al. [10] propose the concept of application characterization to adapt the system to the application needs. Modifying application code to provide input to the system for power management may not always be feasible: the source code or the resources to make the necessary modifications may be unavailable. Instead, the authors propose that application type may be inferred from run-time access characteristics. Weissel et al. show that network traffic characteristics can be used to identify different applications and steer the WLAN interface power modes accordingly. Their approach requires determining the traffic parameters most indicative of application type. Table I shows some of the correlations between traffic characteristics and applications [10]. The authors claim that in the context of traffic characteristics, most applications can be classified in three groups: interactive, non-interactive with strong performance requirements, and streaming applications.

Different application types have different latency and bandwidth requirements and expectations. For example, web browsing can tolerate much higher latency than NFS directory listing without a negative impact on user experience. These requirements and expectations can then be mapped to the available power states with different power-to-performance ratios.

D. Idle period notifications

Idle periods provide opportunities for more aggressive power management, since devices can then be transitioned to sleep modes and more energy-conservative states. Heath et al. [11], [12] propose increasing the active run-lengths in order to maximize idle periods. In addition, notifying the OS about the length of upcoming idle periods allows for improved power state steering. Since power state transitions carry an inherent cost with them (in terms of time and energy), the system needs to decide whether such transitions are worthwhile based on a prediction of the idle period duration. Idle period notifications allow for more accurate idle time calculations than other typically used predictive methods.

The authors argue that application modifications are necessary to obtain maximum energy savings. However, being able to accurately predict idle period duration may require extensive knowledge about the platform characteristics as well as compiler support.
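The transition-cost trade-off above can be made concrete with a standard break-even computation: sleeping pays off only if the predicted idle period exceeds the point where the energy saved in the low-power state outweighs the energy spent on the round-trip transition. A minimal sketch; the unit choices and function names are illustrative, not taken from [11], [12]:

```c
/* Break-even time for entering a low-power state. Powers are in mW,
 * the round-trip transition energy in uJ, times in ms (mW * ms = uJ).
 * Staying put costs idle_mw * t; sleeping costs trans_uj + sleep_mw * t. */
double break_even_ms(double idle_mw, double sleep_mw, double trans_uj)
{
    return trans_uj / (idle_mw - sleep_mw);  /* assumes idle_mw > sleep_mw */
}

/* Decide on a transition given a predicted idle period, e.g. one
 * supplied by an application through an idle period notification. */
int worth_sleeping(double predicted_idle_ms,
                   double idle_mw, double sleep_mw, double trans_uj)
{
    return predicted_idle_ms > break_even_ms(idle_mw, sleep_mw, trans_uj);
}
```

For example, with a 1000 mW idle state, a 100 mW sleep state and a 90000 uJ transition cost, the break-even point is 100 ms: only idle periods predicted to be longer than that justify the transition.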
IV. O PERATING SYSTEM INTERFACES Fig. 3. Requester-aware power management
Operating systems and their interfaces are a critical area for power management research [6], [8]. Traditionally, operating systems have focused on other non-functional quality attributes at the expense of energy efficiency, such as performance or reliability. Especially in the case of mobile OSs, power is becoming an increasingly important resource and therefore mobile manufacturers are starting to demand better power management capabilities.

Typically, OS interfaces either leave applications out of the power management equation entirely, or tend to give applications an excessive amount of control over resources. Section III discusses part of this problem related to supporting application diversity and concurrency, and the level of OS involvement. Yung-Hsiang et al. [1], [13] identify two main categories for typical OS power management strategies, namely autonomous and requester-controlled. Autonomous strategies do not distinguish among requesters, and manage power in response to observed requests at the device level. On the other hand, requester-controlled approaches give requesters direct control over the power states for different devices. In practice, requester-controlled approaches are heavily based on ACPI [14] and the notion of power states. The authors argue that autonomous power management cannot reflect the diverse power consumption patterns and performance requirements of different applications. Requester-controlled alternatives do not offer clear distinctions for power states, making power state decisions impractical at the requester level. In addition, a lack of agreement on the power state for a certain device may damage hardware when multiple requesters are involved.

Yung-Hsiang et al. propose using requester-aware power management, a mixed approach better supporting both application diversity and concurrency. The main idea is to allow requesters to affect but not control the device power states. Power management should be as transparent as possible for developers, in the same manner that OS interfaces abstract away relatively complex domains such as memory or file management. In this spirit, the proposed system offers a very simple interface. Applications specify their performance requirements for specific devices to a central scheduler, which also monitors device requests. With this information, the scheduler is able to predict future resource usage and upcoming idle periods, which the power manager uses to set device power states in the most energy-efficient manner. In addition, this scheme allows the scheduler to cluster idle periods from different requesters, thus maximizing power saving opportunities. Figure 3 presents the requester-aware power management approach.

In the following I review the power management interfaces offered by several mainstream mobile OSs.

A. Symbian OS

Symbian OS is the leading smartphone OS in terms of market share. Nokia uses Symbian OS as the foundation of S60, its own smartphone platform. Symbian OS includes a power management framework, which mostly focuses on the interfaces toward hardware devices [15].

The Symbian power management framework is based on the concepts of power model, power handlers, and power resources. The framework provides a clear distinction between policy and mechanism. Power handlers represent devices requiring power management, i.e. with a significant power drain on the system. The power model provides the policy for device power management. Figure 4 shows the relationships between these components.

Typically, device drivers provide an associated power handler if the device exhibits a considerable power consumption. The power model may require power handlers to take action in response to a system power event. Power handlers may also request system power state changes on behalf of their associated device. The power model powers up and down the shared power inputs for the different power handlers on demand, and informs power handlers when devices are required to power up and down, or perform an emergency shutdown. Power handlers report their power consumption to the power model, which the kernel then adds together with the CPU drain to obtain an approximate picture of the overall power consumption.

The power framework defines a set of power states that devices should implement. Table II presents the different device power states in Symbian.

In practice, the Restart state is not used since devices are expected to maintain their internal state. Transitions between the Idle and On states are assumed to be fast and to imply no data loss.

There is one power model in the system, responsible for controlling the power behavior of the hardware components. The power model is aware of each power handler in the system
TABLE II
DEVICE POWER STATES IN SYMBIAN OS

On: Full power
Idle: Low power mode and inactive. The device can still respond to interrupts or external events
Standby: Inactive with internal state maintained
Restart: Inactive with lost internal state

Fig. 4. Symbian OS power management architecture: power handlers (e.g. GPS, LCD, and SD card) are attached to their logical device drivers and coordinated by the power model.

Fig. 5. Windows Mobile power management architecture: the Power Manager sends power state notifications to applications and device drivers through notification message queues, and exposes PM APIs.

TABLE III
DEVICE POWER STATES IN WINDOWS MOBILE V5

Full On (D0): Full power
Low On (D1): Fully functional at lower power or performance
Standby (D2): Partial power, standing by or wakeup request
Sleep (D3): Sleeping, minimal power needed to initiate wakeup
Off (D4): Totally off
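Since the states in Table III are ordered from most capable (D0) to off (D4), arbitrating among multiple constraints reduces to simple comparisons on the state number: a device should be no more capable than a system-imposed upper limit allows, and no less capable than the strictest minimum-capability constraint requires. A hypothetical sketch; the function and parameter names are invented for illustration:

```c
/* Device power states from Table III, ordered by decreasing capability. */
typedef enum { D0, D1, D2, D3, D4 } dx_state;

/* Clamp a requested state so it is no more capable than most_capable
 * (a system-imposed limit) and no less capable than least_capable
 * (e.g. a limit requested on behalf of applications). Assumes
 * most_capable <= least_capable in the D-numbering. */
dx_state clamp_state(dx_state requested,
                     dx_state most_capable, dx_state least_capable)
{
    if (requested < most_capable)  return most_capable;
    if (requested > least_capable) return least_capable;
    return requested;
}
```

For instance, a driver requesting D4 while applications insist on at least D3 capability would be held at D3.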

and their requirements, and adjusts the system power states in response to power events. Power events may be caused by powering the system on and off and by changes to the state of shared power resources. The system needs to behave gracefully also in the event of a critical power failure.

Power resources are powered up and down as needed, when requested by their power handlers. Power resources use their power handlers to specify power requirements to the power model. The power model changes the system power state in response to the cumulative power requirements of all power handlers.

B. Windows Mobile

The power management infrastructure in Windows Mobile v5 is built around the Power Manager (PM) component [16]. The PM provides interfaces at the application, system and driver level for developers, with the goal of extending battery life. The PM is based on the notion of power states, making a clear separation between system and device (driver) power states. Both application and driver developers are actively encouraged to make use of the PM interfaces to control devices. The PM uses a publish-subscribe pattern to update software of impending changes in power states. Figure 5 presents the Windows Mobile power management architecture.

The PM is designed around the concept of power states and a clear division between system and device states. Devices are expected to implement a set of well-defined power states, independent of the states defined by the ACPI standard [14]. Manufacturers define system power states that provide an upper limit state, the ceiling, for all devices. Applications can dictate the bottom limit state, the floor, for a particular device through the PM interface. Devices are allowed to manage their own power states between the set ceiling and floor levels.

System power states are named collections of device power states, such as Battery, Docked or UserIdle. There is no limit on the number of system power states, and state transitions do not need to be linear. Device power states follow a clearly specified hierarchy. Table III presents the device power state hierarchy. States D0 and D1 should be fully functional from a user perspective, and higher-numbered states typically consume less power. Power management can increase device driver complexity considerably. For example, drivers may need to behave differently upon receiving a power-down event while in a state other than D4. Dependencies on other devices (and their power states) will influence the course of action for device drivers.

C. Darwin

Darwin is the open-source UNIX-based foundation of Apple's Mac OS X. The basic concepts underlying Darwin power management are hierarchies of devices (referred to as power domains in Darwin) and their supervising entities, policy makers and power controllers [17], [18], [19].

The fundamental entity in Darwin power management is the device, a hardware component the power of which can be adjusted independently of system power. A device may have different power states associated with it, at least two: on and off. Darwin associates several attributes with the power state of each device:
• Power used by the device in that state
• Capabilities of the device in that state
• Power required to move the device to the next higher state (currently unused)
• Time required to move the device to that state (currently unused)

Another entity in Darwin power management is the power domain, a switchable source of power in the system providing power to one or more devices that are considered part of the power domain. Power domains dictate the highest power state for the devices they contain, and like devices, they have a range of power states, with at least on and off states supported. Power domains are hierarchical, with the top-level domain being the root power domain, which represents the main power of the system. Figure 6(a) presents an example of the power domain hierarchy.

Darwin defines two supervising entities for devices and power domains: policy makers and power controllers. Policy makers are the objects responsible for deciding when to change the power state of devices and power domains, based on different factors. The major factor in this decision is device idleness: when the system detects that a device is idle, it will try to reduce its power state. The entities responsible for implementing these changes are power controllers. Other factors that the policy makers take into account include the aggressiveness, which is defined by the user in different contexts (such as AC-plugged or running from batteries) or by the system itself (such as when the battery reaches a critically low state).

A power controller knows about the power states of a device and can steer the device between them. Additionally, it reports power-related information about a device to the policy maker to assist in decision-making. Figure 6 presents an overview of the interoperation between the different Darwin power management entities.

Fig. 6. Darwin power management: (a) power domain hierarchy (a root domain with card reader and hard disk child domains); (b) Darwin power management architecture (domains and devices are supervised by policy makers and power controllers).

There are a few fundamental differences between policy makers for devices and for power domains. First, policy makers for power domains do not alter the power consumption of devices, but the character of the power supplied to members of the domain. This may imply adjusting clock rates or voltages, for example. Secondly, policy makers for power domains do not base their decisions on idleness like device policy makers. Instead, they base their decisions on requests from their members (which include policy makers and power controllers). Therefore, policy makers are responsible for the power supplied to the domain they supervise, and they request the power state for the domain they belong to.

The power management architecture also provides support for notification of power state changes to interested entities. Power controllers are automatically notified of power state changes. Other interested objects may include driver objects, which need to subscribe to the notifications by implementing a couple of function callbacks. In addition, user processes may also request notifications of system and device power events.

V. DISCUSSION

There seems to be a clear gap between the state of the art and the state of the practice. While several studies show that application input is essential for effective power management, operating systems are slow to adopt the results of power management research.

Symbian OS offers a comprehensive framework for power management, yet fails to reflect application requirements, assuming that a hardware view offers a complete enough picture of the power requirements. This effectively turns Symbian power management into a reactive system rather than a proactive one, therefore missing power saving opportunities. It is a clear example of autonomous power management. In addition, while the framework is designed to provide an accurate picture of the device power consumption, this is rarely achieved in practice. Hardware may exhibit complex power consumption patterns and manufacturers tend to neglect power consumption notifications. Symbian OS power management is typically reduced to a device management framework for system startup and shutdown events.

Windows Mobile power management takes instead a requester-controlled approach, with the implied drawbacks. The framework relies on device drivers to take the most intelligent decisions regarding device power states, which may not always be possible due to their narrower scope with respect to the system. In addition, the abstraction level offered to applications is not adequate [1]. The system should be ultimately responsible for resource management, albeit through collaboration with applications.

Darwin provides a very simple approach to power management. Idleness and user-defined aggressiveness are the major factors governing policy making. However, I argue that waiting for an idleness timer to expire is a potential energy waste and should be avoided. Application inputs are not considered at all, with the implied loss of power-saving opportunities.

It seems that leading mobile operating systems do not exploit the advantages offered by application adaptation. This could provide significant power savings as exposed in Section III. Additional techniques such as idle period notification and
ghost hints could provide additional savings. Given that power efficiency is taking a front seat in mobile OS design, the techniques presented provide promising methods to reduce the burden on the battery.

VI. CONCLUSIONS AND FURTHER WORK

This paper has presented a survey of the published research on software adaptations for energy efficiency. In addition, I reviewed three mainstream mobile OSs and the power management functionality they offer. There seems to be a big gap between the state of the art and the state of the practice, as evidenced by the lack of application input to power management interfaces. Operating systems could greatly benefit from the inclusion of the reviewed techniques in their design, since power management seems to play an increasingly important role in driving OS design. Battery technology cannot provide the necessary improvements to increase or even maintain the battery life of mobile devices, which face an increasing amount of power-hungry features and hardware components. Software needs to become more energy-efficient, and collaboration with the OS is a promising area for achieving this goal.

REFERENCES

[1] Y.-H. Lu, L. Benini, and G. D. Micheli, "Requester-aware power reduction," in International Symposium on System Synthesis, Stanford University, September 2000, pp. 18-23. [Online]. Available: http://doi.acm.org/10.1145/501790.501796
[2] T. Simunic, L. Benini, and G. De Micheli, "Energy-efficient design of battery-powered embedded systems," in Proceedings of the International Symposium on Low-Power Electronics and Design (ISLPED'98), June 1998.
[3] T. K. Tan, A. Raghunathan, and N. Jha, "Software architectural transformations: A new approach to low energy embedded software," in Proceedings of the Conference on Design Automation and Test in Europe (DATE'03), 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1253742
[4] B. Noble, "System support for mobile, adaptive applications," IEEE Personal Communications, vol. 7, no. 1, pp. 44-49, February 2000. [Online]. Available: http://www.cs.cmu.edu/~coda/docdir/ieeepcs00.pdf
[5] J. Flinn and M. Satyanarayanan, "Energy-aware adaptation for mobile applications," in Proceedings of the Seventeenth Symposium on Operating System Principles (SOSP'99), December 1999. [Online]. Available: http://portal.acm.org/citation.cfm?doid=319151.319155
[6] C. Ellis, "The case for higher level power management," in Proceedings of the Seventh Workshop on Hot Topics in Operating Systems (HotOS'99), March 1999. [Online]. Available: http://www.cs.duke.edu/~carla/ellis.pdf
[7] B. D. Noble, M. Satyanarayanan, D. Narayanan, J. E. Tilton, J. Flinn, and K. R. Walker, "Agile application-aware adaptation for mobility," in Proceedings of the Sixteenth Symposium on Operating System Principles (SOSP'97), Saint Malo, France, 1997, pp. 276-287. [Online]. Available: http://portal.acm.org/citation.cfm?id=269005.266708
[8] C. Ellis, A. Lebeck, and A. Vahdat, "System support for energy management in mobile and embedded workloads: A white paper," Duke University, Department of Computer Science, Tech. Rep., October 1999. [Online]. Available: http://www.cs.duke.edu/~carla/research/whitepaper.pdf
[9] M. Anand, E. B. Nightingale, and J. Flinn, "Ghosts in the machine: Interfaces for better power management," in Proceedings of the Second International Conference on Mobile Systems, Applications, and Services (MOBISYS'04), June 2004. [Online]. Available: http://www.eecs.umich.edu/~anandm/mobisys.pdf
[10] A. Weissel, M. Faerber, and F. Bellosa, "Application characterization for wireless network power management," in Proceedings of the International Conference on Architecture of Computing Systems (ARCS'04), January 2004. [Online]. Available: http://citeseer.ist.psu.edu/649995.html
[11] T. Heath, E. Pinheiro, J. Hom, U. Kremer, and R. Bianchini, "Application transformations for energy and performance-aware device management," in Proceedings of the Eleventh Conference on Parallel Architectures and Compilation Techniques (PACT'02), September 2002. [Online]. Available: http://dx.doi.org/10.1109/PACT.2002.1106011
[12] ——, "Code transformations for energy-efficient device management," IEEE Transactions on Computers, vol. 53, no. 8, August 2004. [Online]. Available: http://www.cs.rutgers.edu/~ricardob/papers/tc04.pdf
[13] Y.-H. Lu, L. Benini, and G. D. Micheli, "Power-aware operating systems for interactive systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 2, April 2002. [Online]. Available: http://dx.doi.org/10.1109/92.994989
[14] Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba, Advanced Configuration and Power Interface Specification 3.0a, December 2005. [Online]. Available: http://www.acpi.info/DOWNLOADS/ACPIspec30a.pdf
[15] "Power management framework," Symbian Developer Library. [Online]. Available: http://www.symbian.com/developer/techlib/v70docs/sdl_v7.0/doc_source/baseporting/KernelProgramming/PowerManagementFramework/index.html
[16] J. Looney, "New Power Manager States in Windows Mobile V5 and How to Use Them," CLI327, Mobile & Embedded DevCon 2005. Presentation.
[17] "Power management," Apple Developer Connection. [Online]. Available: http://developer.apple.com/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/PowerMgmt/chapter_10_section_3.html
[18] "Power management for Macintosh; getting started," Apple Developer Connection. [Online]. Available: http://developer.apple.com/technotes/tn2002/tn2075.html
[19] "Technical Note TN2075: Power saving features for the PowerBook G4 computer," Apple Developer Connection. [Online]. Available: http://developer.apple.com/documentation/Hardware/Developer_Notes/Macintosh_CPUs-G4/PowerBook_G4Apr02/1Introduction/Power_Saving_Features.html#TPXREF115
Compiler memory energy optimizations

Peter Majorin
Software Technology Laboratory/TKK

Abstract

The memory subsystem is often the most energy-consuming part of the processor architecture of embedded systems. The most promising technique so far for obtaining large memory energy savings is to replace the processor caches close to the processor core with software-allocated on-chip memories, or to use the two together. Numerous studies have shown that scratchpads, which are software-controlled fast on-chip memories, can save significant energy compared to systems using only caches, even with simple allocation algorithms. In this paper, we give an overview of the state-of-the-art scratchpad allocation algorithms for single- and multiple-scratchpad configurations, briefly treat scratchpads used in multitasking environments, and discuss some complementary techniques to scratchpad allocation. We also give an overview of frameworks and compiler analyses designed for such memory optimizations.

1. Introduction

The memory subsystem is one of the most energy-consuming parts of the processor architecture of several embedded systems. Furthermore, instruction fetching alone takes a large part of the total energy consumed by a processor (typically 27-50%, depending on the processor). Data accessed by the processor from memory may consume up to 43% of the energy [15, 2]. Therefore, large energy savings can be obtained by designing an appropriate memory hierarchy that consumes as little energy as possible. As a rule of thumb, smaller memories consume less energy than larger ones, so we need to find ways of utilizing this small space as well as possible. Energy consumption in general is problematic in battery-operated embedded systems (especially portable ones), which usually have limited energy resources.

However, traditional compiler optimizations focus on speed and performance, and do not usually take into account the memory hardware. Optimizing for speed and code size is not the same as optimizing for energy. Therefore, different compiler optimization strategies targeted at energy savings are needed. In this paper we give a broad survey of the state-of-the-art memory energy optimizations developed so far.

The paper is organized as follows: we first go through on-chip memories and their relevance for memory optimizations, and then present some program analyses commonly needed in memory optimizations. Then we present a short survey of state-of-the-art scratchpad allocation algorithms, and the techniques behind the allocations. At last we present some research done on frameworks for compiler memory optimizations, and finally we conclude.

2. Low-power on-chip memories used for memory energy optimizations

We give here an overview of a few hardware memory components that can be used to save energy. Scratchpads and loop caches are the most common low-power memories found in the literature regarding memory optimizations, e.g. [17, 15, 6]. Although a usual processor cache is also a low-power on-chip memory, it is somewhat uninteresting from a compiler's point of view, because it is not usually controlled in software. However, caches cannot be ignored if present along with a scratchpad [20] for optimal energy savings.

2.1. Scratchpad

A scratchpad is a relatively small and fast on-chip memory (or part of it) that is accessed explicitly. It is also referred to as tightly-coupled memory (TCM), as it lies in the vicinity of the CPU. Scratchpad memory (SPM) can be used to store both data and code. Some modern architectures have separate scratchpad memories for instructions and data. The size of the scratchpad memory is of the same magnitude as a level 1 cache (in practice some kB or smaller).

Scratchpads consume significantly less energy than caches do, because they lack the control logic of caches, and because of this missing logic they are totally predictable. While caches also reduce energy consumption, they introduce predictability problems, which is an issue for real-time systems. The downside with scratchpads is that they must be allocated by the programmer or by a compiler or
optimizer tool in a compiler-controlled way. However, once allocation has been done properly, scratchpads have demonstrated their superiority over caches in energy consumption, predictability and performance, even with simple allocation algorithms.

Data transfer to and from a scratchpad can be performed by direct copying with usual CPU instructions (some overhead) or with hardware support (DMA), which is more efficient [9]. Operating system calls can be used to allocate scratchpads in a multitasking system; simpler systems can just access the hardware directly.

2.2. Loop cache

A loop cache [6] is another low-power on-chip memory, which is more limited than a cache and therefore consumes less energy. In contrast to scratchpads, loop caches can be hardware controlled, but they can only contain code. As the name implies, a loop cache is used to store loop code; the loop code must fit entirely into the loop cache to be effective. A loop cache can also be software controlled, in which case it is equivalent to a scratchpad that can only store instructions.

3. Compiler analyses for memory optimizations

3.1. Static and dynamic analyses

For memory optimizations, static and dynamic analyses can be used; both have their advantages and disadvantages, and at best they can be used to complement each other [5, 16].

Dynamic analyses cannot usually capture all possible program executions, but are easy to perform; the program to be profiled is just run with various inputs a number of times to obtain the program run-time behavior. In a particular run, everything about the program can be found out, including the memory accesses done and where program execution time is spent, but this information is obviously only valid for that particular run of the program.

Static analyses, on the other hand, attempt to find out program run-time behavior by considering all program executions at the same time. This makes static analyses sound (they can capture all program executions), but in practice program semantics must be approximated to make the analyses feasible, also resulting in inaccuracies in the results. Moreover, over a decade of research in automatic flow analysis shows that statically analyzing program behavior to obtain accurate information is a very difficult problem for larger programs.

But for memory optimizations, accurate knowledge about where program execution time is spent is essential, so program profiling should be used instead of inaccurate static analyses. To obtain results that generalize reasonably well over all program executions, many program runs have to be performed, which can be costly in terms of profiling time. Accurate profiling is needed because it is obvious that most program execution time is spent in loops, but this information is not enough: we need to know more precisely which loops are important and where exactly in the loops most time is spent (there is not much room in on-chip memories).

3.2. Loop analysis

Loop analysis [1] is a static analysis which compilers use to locate important code to be optimized. In the context of energy optimizations, loop headers can be used as a basis for growing traces (Section 3.3). If the on-chip memory is large enough, or the loop is small enough, entire loops can be placed in the on-chip memory at once.

3.3. Trace analysis

Traces [14] (frequently executed straight-line sequences of basic blocks) can be used in the context of memory optimizations. Traces can be generated from the profiling data and static loop analysis; the loop header provides a starting point to grow the trace from. A trace is terminated when its tail execution frequency decreases below a certain fixed threshold value as compared to the header execution frequency. An advantage with a trace is that it can cross procedure boundaries, so that opportunities for saving energy at the interprocedural level are not missed. Furthermore, the trace building must be tailored to a certain memory hierarchy; the size of a trace must not exceed the SPM, and caches must be taken into account if they are present [20].

3.4. Statistical measures

Instead of performing a structural analysis on code to identify loops and to build traces out of these, the authors in [9] suggest a novel heuristic they call concomitance. This is a statistical measure of the temporal correlation between blocks of instructions. The advantage with this method is that it can capture hot spots in the program without needing to identify the structure of the program. Traces are still needed as profiling data as with the other methods.

4. Scratchpad allocation algorithms

The allocation algorithms presented here optimize energy consumption with respect to average case energy consumption (ACEC); also other optimization criteria exist, such as worst case energy consumption (WCEC) [10]. Optimization with respect to energy consumption may result
in performance improvements as well; this is usually the case when doing memory optimizations, because low-power memories are usually faster as well.

The inputs for all the allocation algorithms are a power model for the instructions and the hot spots of the program to be optimized, in the form of basic blocks, procedures, loops, traces and global variables. The energy savings obtained by scratchpad allocation are often compared against a cache of a similar size, or against a static allocation (Section 4.1), if appropriate.

Dynamic data structures such as the stack and heap memory remain problematic, because their sizes are not usually known at compile-time. However, these issues have received some research so far [3].

4.1. Static Allocation

Static allocation has been studied extensively, e.g. [17]: the contents (what variables and code) of the scratchpad are loaded at the start of the program, and this allocation remains unchanged during program execution. The problem to solve is the integer knapsack problem: ILP (integer linear programming) or dynamic programming can be used to solve this problem optimally: the selected procedures, traces, loops and variables, based on profiling data and an energy model, are placed on the scratchpad so as to optimize its filling.

Static allocation still finds use in studies of more complex environments such as multi-banked and cache-aware scratchpad allocations. It also serves as a benchmark to compare the effectiveness of dynamic allocations against.

4.2. Dynamic Allocation

Dynamic allocation is harder than static allocation, but it has been demonstrated to save more energy than static approaches. In dynamic allocation, the allocation of the scratchpad can change during runtime. In contrast to static allocation, program points have to be identified in the program at which the scratchpad is loaded and evicted. Several different approaches have been proposed for dynamic allocation [19, 18, 9], most of them being heuristic methods.

The motivation for dynamic allocation can be seen in Table 1. Smaller SPM sizes save more energy, because each fetch costs less energy, but on the other hand a smaller SPM can hold less code or data. Dynamic allocation can therefore save more energy than static allocation, because it is able to utilize the limited storage space better. This is the case for larger programs that have several hot spots and alternate between them. The conclusion is that very small SPM sizes may save the most energy, but this is also program-dependent. We next present some state-of-the-art dynamic allocation methods.

Size (bytes)   fetch (SPM) (nJ)   fetch (I-cache) (nJ)
64             0.1803             0.2961
128            0.1888             0.3059
256            0.1980             0.4732
512            0.2188             0.4966
1024           0.2404             0.5233
2048           0.2748             0.5655
4096           0.3277             0.6351

Table 1. Energy consumed by an SPM vs. an I-cache (0.18 µm) of equal size, from the CACTI model [15].

The approach by Verma et al. [19] is based on ILP, and is solved in the following phases:

1. Determine candidate SP (scratchpad) objects as in static allocation (code and data)
2. Perform liveness analysis on the SP objects
3. Assignment of SP memory objects and their spill locations in code
4. Computation of memory addresses of the SP objects

Some of the above steps were approximated with heuristic methods, because they take a very long time to compute for larger programs if done with ILP. The authors report 26% average energy savings as compared to a static allocation method.

Here we see the typical structure of a dynamic scratchpad allocation algorithm: since the objects can be evicted from the scratchpad at any point in the program, we have to determine the live ranges of the objects to be able to reason about how long we must keep the objects on the scratchpad before we can evict them. We also have to determine the points in the program where to evict and load the scratchpad. Finally, we have to decide where in the scratchpad we put the loaded scratchpad object.

The approach by Janapsatya et al. [9] is based on a statistical method and considers only instructions. Rather than using a structural approach to identify loops and traces, they identify temporally correlated blocks of code directly from traces. The authors report 41.9% savings in energy when compared to a similar sized cache. Part of this large saving comes from their SPM controller and DMA support for scratchpad transfers.

The approach by Udayakumaran et al. [18] annotates the program CFG (control flow graph) with timestamps to reason about the eviction/placement strategy. Both code and
data objects are considered, as well as the stack. Because data is considered, the authors use a run-time disambiguator to correct memory references, which causes some overhead. Energy savings of 31.3% on average are reported as compared to a static method.

The approach by Ravindran et al. [15] considers only instructions, and uses an iterative liveness analysis of traces to hoist the allocation of traces upwards in the program CFG. This is because naively loading a trace at its entry, in all of its basic blocks, will result in a sub-optimal allocation strategy that can be improved upon. The advantage of this method is that it is a heuristic method and requires much less computation than solving an ILP problem.

The drawback of dynamic allocation is that it causes aliasing problems; memory references to the SP become invalid when its allocation is changed during program execution. Furthermore, dynamic allocation may also be a problem in strict real-time applications, because of the extra copy code inserted to handle the scratchpad at various program locations. In addition, if aliasing problems are addressed, a run-time disambiguator costs additional processor time. To avoid aliasing, but still get some benefits from dynamic allocation, data could be allocated statically and code dynamically, largely avoiding the aliasing problems (code called via function pointers still remains a problem).

4.3. Multi-banked scratchpad allocation

Dividing the memory hierarchy into several smaller memory banks, instead of using a single monolithic memory bank, has many advantages. Smaller banks consume less energy per access, and unused banks can be turned off to save static leakage energy. Furthermore, memory accesses can be made in parallel, giving additional performance [8]. Multi-banked scratchpad allocation has been studied in e.g. [22, 12].

The results of Wehmeyer et al. [22] show that using many smaller scratchpads instead of a large one becomes beneficial when a program is large enough to be able to utilize a bigger scratchpad size. Energy savings of up to 22% were reported as compared to a single scratchpad system, for a total SPM size of 32 kB in both cases.

Kandemir et al. [12] on the other hand focus on optimizing array accesses in loop nests in a multi-banked scratchpad system in order to minimize leakage current loss. The motivation for this is that the speed and density of CMOS transistors is expected to rise in the future, so that static leakage management is expected to become very important. Their method is based on optimizing bank locality, which means that successive SPM accesses should go to the same bank as much as possible, making it possible to put the other banks in a low-power idle state for as long a time as possible. Turning the memory banks on and off results in small performance penalties, but an average leakage energy saving of over 40% is achieved over all the benchmarked programs.

4.4. Scratchpad allocation in a multitasking environment

Scratchpad allocation in multitasking systems has also been considered in [13, 4].

The problem considered in the first paper is how to choose an appropriate static allocation for code and data for a set of statically scheduled processes on a single scratchpad. Furthermore, the execution time of the processes and their energy consumption is assumed to be known a priori. The goal is to minimize the energy consumption over the entire set of processes.

The allocation strategies considered are saving, non-saving and hybrid (which is a mixture of the two previous). The saving approach allocates the single scratchpad completely to the active process, while the non-saving approach divides the scratchpad evenly among all processes. The hybrid approach uses a common memory area for all processes, but also areas that remain dedicated to certain processes. This reduces the overhead of context switches by having a common region which can be used for shared data that all processes use frequently. Therefore, context switches matter only for the saving and hybrid approaches; they cause some overhead, but allow the scratchpad to be better utilized if it is small.

As a result, it was found that the non-saving approach worked best for large SPM sizes (1-4 kB), while the saving approach worked best for small SPM sizes (up to 512 B). The hybrid approach, on the other hand, worked well for all scratchpad sizes, but required the most computational time. Energy savings of 9-20% were reported as compared to a non-saving allocation that does not attempt to minimize the energy consumption over all processes.

The second paper [4] presents a novel way of using a scratchpad in a virtual memory system with an MMU (memory management unit) to store swapped-in pages in an SPM. Their page allocation algorithm considers only a single-process system and code allocation, but they state that their method is easily extended to a multi-process environment and data allocation. A 33% reduction in energy consumption is reported, as compared to a fully-cached configuration.

4.5. Complementary scratchpad optimizations

In this section we cover some complementary memory optimizations that can be used together with a scratchpad to save even more energy than with a scratchpad alone.

A hardware-controlled loop cache has been studied together with an Instruction Register File [7]. It was found
that allocating frequently used instructions into a register file along with a hardware-controlled loop cache can save more energy than using these in isolation. Here, we see that instead of a hardware-controlled loop cache, a scratchpad could be used, saving even more energy.

Processor caches have also been studied together with an SPM in [20], where the authors studied a memory hierarchy consisting of a scratchpad together with an I-cache, and a static allocation algorithm for code objects was used. It was found that using a scratchpad along with a cache can result in poor energy savings if the cache behavior is not taken into account, resulting in needless cache thrashing. Therefore, cache misses and hits need to be taken into account when deciding what code objects to place on the scratchpad. Their formulation of the problem is a nonlinear optimization problem, which contains a cache model represented as a conflict graph. This problem is then linearized and solved optimally and near-optimally as an ILP problem.

5. Compiler frameworks for memory optimizations

Recently, some proposals for compilation and simulation frameworks for energy-aware (memory) optimizations have been made in [23, 11, 21]. A central question that arises in such a framework is at what program level the optimizations are performed and what intermediate representations are used. In addition, an energy profiler is needed to evaluate the optimizations, and finally a code transformer to transform the program to use the energy optimizations. The energy profiler can further be split into a CPU simulator, an energy model, and a hardware model. Here, further consideration must be given to the hardware level at which the energy is modeled (e.g. instruction-level). The memory hierarchy, including caches, and also the CPU pipeline need to be modeled in a cycle-accurate way to obtain an accurate energy estimate for a program.

The authors in [21] present a memory aware C compilation framework (MACC). This framework contains a simulation and a compilation part, where both have access to an instruction-level energy database. The simulator models both the memory hierarchy (DRAM, cache and scratchpad) and the CPU in a cycle-accurate manner. The framework supports both static and dynamic scratchpad optimizations at the assembly/linker level. The authors found source-level optimizations (C code) to be useful and complementary to memory optimizations performed at the instruction level. The source-level optimizations that were considered included array partitioning and array tiling; these optimizations split larger arrays into smaller arrays (if needed) such that the smaller part can be allocated to a scratchpad memory. Such optimizations would be difficult to perform at the binary level, and they would be hardware dependent.

On the other hand, the work of [11] presents an energy-aware compilation framework (EAC) and focuses only on high-level energy optimizations, including memory optimizations. Array-dominated programs are common in DSP and multimedia applications, and large energy savings can be obtained by source-level optimizations. The authors take the view that simulation and profiling take too long, so the code is analyzed statically instead, taking technology parameters (memory model, buses etc.) as input. Validation of the methods is still performed with a simulator.

The previously mentioned frameworks have a drawback in that they are meant for research use, and are not yet mature for common use. In the author's opinion, the problem with these frameworks is also that they rely on previously developed components which do not necessarily fit well together, creating an unnecessarily complex tool chain with many intermediate formats. Also, integrating energy-awareness into a compiler architecture may not be a good idea in the long run, because it would tie the users to a specific compiler. Otherwise, energy-awareness would need to be integrated separately into each compiler.

6. Conclusions

Considerable research effort has recently (2001-2006) been invested in scratchpad allocation, and as a result this field begins to be mature for practical applications. Large energy savings using scratchpad-aware compilation were reported in many research papers. Energy savings of around 20-40% compared to a system with an equal-sized cache have been demonstrated for a single scratchpad system, depending on what allocation method was used.

Future research effort in this field should address the implementation of practical tools and frameworks that can be used to study the energy savings of various scratchpad allocation techniques in combination with other complementary (memory) energy optimizations. In particular, attention should be paid to what intermediate representations are needed in such tools and what their interfaces are. Furthermore, retargetability is in practice a very important issue, given the number of different development tools and hardware (DSPs and GPPs) used in embedded systems. As we have seen, energy-aware compilation frameworks have quite different requirements than a traditional compiler framework. Therefore, it could be a good idea to separate the energy awareness into different tools instead of trying to integrate these parts into a compiler framework. This is possible at least for memory optimizations at the binary/linker level, including some scratchpad allocation algorithms. Source-level memory optimizations are more difficult to separate from a compiler architecture, however.
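As a closing illustration, the integer knapsack formulation used for static allocation (Section 4.1) is compact enough to sketch. The following dynamic-programming version is illustrative only; the object names, sizes and energy figures are made up, and a real allocator would work on profiled energy values and also handle alignment and address assignment:

```python
def static_spm_allocation(objects, capacity):
    """Static scratchpad allocation as a 0/1 knapsack, solved with
    dynamic programming: pick the set of memory objects (code or data)
    that fits in the SPM and maximizes the total energy saving."""
    # best[c] = (total saving, chosen object names) using at most c bytes
    best = [(0, [])] * (capacity + 1)
    for name, size, saving in objects:
        new = list(best)
        for c in range(size, capacity + 1):
            cand = best[c - size][0] + saving
            if cand > new[c][0]:
                new[c] = (cand, best[c - size][1] + [name])
        best = new
    return best[capacity]

# Hypothetical profiling results: (object, size in bytes, energy saving in nJ)
objects = [("loop1", 512, 900), ("func_a", 1024, 1100), ("array_x", 768, 500)]
saving, placed = static_spm_allocation(objects, 1536)
```

For a 1536-byte scratchpad, the sketch selects loop1 and func_a, which exactly fill the SPM and maximize the combined saving; ILP solvers reach the same optimum for such small instances.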
References

[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques & Tools. Pearson Addison-Wesley, 2nd edition, 2007.

[2] Bruno Bouyssounouse and Joseph Sifakis. Embedded Systems Design: The ARTIST Roadmap for Research and Development, volume 3436 of Lecture Notes in Computer Science. Springer, 2005.

[3] Angel Dominguez. Heap Data Allocation to Scratch-Pad Memory in Embedded Systems. PhD dissertation, University of Maryland, College Park, Department of Electrical and Computer Engineering, 2007.

[4] Bernhard Egger, Jaejin Lee, and Heonshik Shin. Scratchpad Memory Management for Portable Systems with a Memory Management Unit. In Proceedings of the 7th International Conference on Embedded Software (EMSOFT'06), pages 321–330, Seoul, Korea, October 2006.

[5] Michael D. Ernst. Static and dynamic analysis: Synergy and duality. In WODA 2003: ICSE Workshop on Dynamic Analysis, pages 24–27, Portland, Oregon, USA, May 2003.

[6] Ann Gordon-Ross, Susan Cotterell, and Frank Vahid. Tiny Instruction Caches for Low Power Embedded Systems. ACM Transactions on Embedded Computing Systems, 2(4):449–481, November 2003.

[7] Stephen Hines, Gary Tyson, and David Whalley. Reducing Instruction Fetch Cost by Packing Instructions into Register Windows. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), pages 19–29, Barcelona, Spain, November 2005.

[8] Jason D. Hiser. Effective Algorithms for Partitioned Memory Hierarchies in Embedded Systems. PhD dissertation, University of Virginia, School of Engineering and Applied Science, May 2005.

[9] Andhi Janapsatya, Aleksandar Ignjatovic, and Sri Parameswaran. Exploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(8):816–829, August 2006.

[10] Ramkumar Jayaseelan, Tulika Mitra, and Xianfeng Li. Estimating the Worst-Case Energy Consumption of Embedded Software. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'06), pages 81–90, San Jose, California, USA, April 2006.

[11] I. Kadayif, M. Kandemir, G. Chen, N. Vijaykrishnan, M. J. Irwin, and A. Sivasubramaniam. Compiler-Directed High-Level Energy Estimation and Optimization. ACM Transactions on Embedded Computing Systems (TECS), 4(4):819–850, November 2005.

[12] Mahmut Kandemir, Mary Jane Irwin, Guilin Chen, and Ibrahim Kolcu. Compiler-Guided Leakage Optimization for Banked Scratch-Pad Memories. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(10):1136–1146, October 2005.

[13] Lars Wehmeyer, Urs Helmig, and Peter Marwedel. Scratchpad sharing strategies for multiprocess embedded systems: a first approach. In 3rd Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia'05), pages 115–120, September 2005.

[14] Lian Li and Jingling Xue. Trace-based leakage energy optimisations at link time. Journal of Systems Architecture, 53(1):1–20, January 2007.

[15] Rajiv A. Ravindran, Pracheeti D. Nagarkar, Ganesh S. Dasika, Eric D. Marsman, Robert M. Senger, and Scott A. Mahlke. Compiler Managed Dynamic Instruction Placement in a Low-Power Code Cache. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'05), pages 179–190, San Jose, California, USA, March 2005.

[16] S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, and P. Marwedel. Reducing Energy Consumption by Dynamic Copying of Instructions onto On-chip Memory. In Proceedings of the International Symposium on System Synthesis (ISSS), Kyoto, Japan, October 2002.

[17] S. Steinke, L. Wehmeyer, B.-S. Lee, and P. Marwedel. Assigning Program and Data Objects to Scratchpad for Energy Reduction. In Proceedings of the Design, Automation and Test in Europe Conference (DATE), Paris, France, March 2002.

[18] Sumesh Udayakumaran, Angel Dominguez, and Rajeev Barua. Dynamic Allocation for Scratch-Pad Memory Using Compile-Time Decisions. ACM Transactions on Embedded Computing Systems (TECS), 5(2):472–511, May 2006.

[19] Manish Verma and Peter Marwedel. Overlay Techniques for Scratchpad Memories in Low Power Embedded Processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(8):802–815, August 2006.

[20] Manish Verma, Lars Wehmeyer, and Peter Marwedel. Cache-Aware Scratchpad-Allocation Algorithms for Energy-Constrained Embedded Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(10):2035–2051, October 2006.

[21] Manish Verma, Lars Wehmeyer, Robert Pyka, Peter Marwedel, and Luca Benini. Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations. In Proceedings of the 6th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS'06), pages 279–288, Samos, Greece, July 2006.

[22] Lars Wehmeyer, Urs Helmig, and Peter Marwedel. Compiler-optimized Usage of Partitioned Memories. In Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI'04), pages 114–120, Munich, Germany, June 2004.

[23] Joseph Zambreno, Mahmut T. Kandemir, and Alok N. Choudhary. Enhancing Compiler Techniques for Memory Energy Optimizations. In Proceedings of the Second International Conference on Embedded Software (EMSOFT'02), pages 364–381, Grenoble, France, October 2002.
Measuring the CPU energy consumption of a modern mobile device

Antti P. Miettinen
[email protected]

Abstract

Modern mobile devices employ complex system-on-chip (SoC) processors as their main computing engine. Inside a SoC, several functional blocks can share the same external supply voltage line. This makes measurement based study of a single block, e.g. the main CPU, challenging, as the current drawn by an individual block cannot be measured directly.

The goal of this work was to characterize the feasibility of using simple board level current measurement instrumentation for studying the energy consumed by program code run on the ARM926 core inside the OMAP1710 SoC. The results indicate that by using carefully planned experimental setups, energy consumption related to the following factors can be studied at least qualitatively:

• instruction path bit switching
• data cache reads and writes
• multiplier usage and data dependency
• register bank bit switching

Figure 1: ARM926 core inside OMAP1710, picture from [1]

However, planning the test setups and performing the measurements is quite time consuming, and significant uncertainty remains in the error margins. SoC level instrumentation would certainly provide a much more reliable and powerful research tool.

1 Introduction

Figure 2: Embedded software development environment

The environment for developing software for an embedded hardware target often consists of

• a macro board hosting
  – the hardware components present in the actual end user device
  – additional trace and debug instrumentation

• in-circuit debug and trace hardware

• development host with
  – cross compilation tools for the target device
  – software tools interfacing with the debug and trace hardware

Traditionally the instrumentation to facilitate development time activities is focused on satisfying functional requirements, i.e. the normal software development task of making the software work. However, the instrumentation can also be useful for addressing nonfunctional requirements. The goal of this work is to explore the extent to which energy efficiency can be addressed with this kind of simple board level instrumentation.

2 Tools and methods

2.1 Test environment

The hardware used in this work consists of a development board hosting an OMAP1710 [1] processor and various peripherals, including

• LCD display
• USB device port
• serial console port
• MMC slot
• JTAG port [2]
• ETM port [3]
• jumpers in various voltage lines for connecting a shunt resistor for current measurement

The JTAG and ETM ports and the current measurement points are connected to a commercial debug, trace and measurement system developed by Lauterbach Datentechnik GmbH [4]. The system includes a PowerTrace module, a JTAG adapter, an ETM preprocessor module, a PowerIntegrator module and an Analog probe (containing an A/D converter). The system allows tracing and controlling the instruction execution of the ARM926 [5] core and simultaneous measurement of the supply line current with a maximum sampling frequency of 625 kHz.

Figure 3: Measurement setup

A Linux laptop (Ubuntu/feisty [6]) was used to host the Trace32 software that interfaces with the debug and trace hardware. The toolchain for compiling software for the ARM Linux based target environment was constructed with crosstool [7].

2.2 Workflow and methodology

For testing the energy consumption of different primitives it is necessary to be able to run test programs on the target hardware and trace their execution and power consumption. For this purpose simple C programs with inline assembly sections were constructed.

The test programs were run under the control of the Trace32 debugger. The operating system awareness feature of the Trace32 debugger allows breaking execution when a named process is executed by the Linux kernel. This provides a convenient method for enabling tracing and measurement specifically for the code sections of interest: when the test program is loaded, the trace buffer can be cleared and a breakpoint can be set at the end of the test code.

During the test run the ETM tracing was enabled and the Analog probe recorded the voltage drop over a shunt resistor placed in the supply line of the OMAP1710 processor. The supply voltage is constant, so the voltage drop gives a direct measure of the current drawn by the processor:

    I = U_R / R    (1)

Power can be obtained simply by multiplying (1) by the constant supply voltage:

    P = U I    (2)

and energy by further multiplying (2) by time:

    E_insn = P t_insn    (3)
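Equations (1)-(3) translate directly into code. The following sketch shows the post-processing from shunt-voltage samples to average power and energy; the shunt resistance, supply voltage and sample values below are illustrative numbers, not measured values from this work:

```python
def energy_from_shunt_samples(v_drop, r_shunt, v_supply, f_sample):
    """Apply equations (1)-(3): I = U_R / R, P = U * I, E = P * t,
    where t is the time span covered by the voltage-drop samples."""
    currents = [u / r_shunt for u in v_drop]        # eq. (1)
    powers = [v_supply * i for i in currents]       # eq. (2)
    avg_power = sum(powers) / len(powers)
    energy = avg_power * len(powers) / f_sample     # eq. (3)
    return avg_power, energy

# Illustrative values only: 0.1 ohm shunt, 1.5 V supply, 625 kHz sampling
p_avg, e_tot = energy_from_shunt_samples([0.010, 0.012, 0.011], 0.1, 1.5, 625e3)
```

In practice the samples would be the A/D readings selected in sync with the ETM trace, so that only the code section of interest contributes to the average.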
As the ETM trace provides a cycle accurate instruction execution trace, the energy per instruction can be calculated from the cycles per instruction (CPI) and the CPU clock speed:

    t_insn = CPI / f  ⇒  E_insn = P CPI / f    (4)

The ETM trace also provides a convenient method for choosing the current samples corresponding to execution of only the test code, as Trace32 is able to display the A/D samples in sync with the ETM trace.

Although the methodology seems trivial, the challenge lies in obtaining a power level that is representative of the factor of interest. As the OMAP1710 contains much more than just the ARM926, absolute power levels are not of interest. Another important issue is to assess the repeatability of the measurements because of unknown uncontrollable factors in the setup. A rough measure for the contribution of secondary factors can be obtained by measuring the power level when the ARM926 is in the wait-for-interrupt mode. This still includes a power contribution from the ARM core, but for the purpose of this work the ARM idle power is also a secondary factor.

As the sampling frequency of the A/D converter is orders of magnitude smaller than the clock speed of the circuit being measured, the only feasible way to study instruction level effects seems to be constructing test programs that exercise the phenomenon of interest for a sufficiently long time and using averaging to obtain a representative power value.

For estimating the error in the value obtained by averaging, the straightforward approach is to use e.g. the sample standard deviation. However, both the sample mean and standard deviation can give biased estimates. The current drawn by a digital circuit is by nature composed of current peaks. Sampling might get synchronized with the peaks, and therefore the average value could be biased.

For energy measurements it would make sense to perform the A/D conversion by integrating the current over the sampling period. This could be achieved by analog integrating circuitry, but implementing this was not feasible because of the time constraints of this work.

2.3 Test programs

2.3.1 General

As the goal of this work was simply to establish whether a given effect is measurable, the approach to the problem was more or less ad hoc. However, a brief look into the ARM9 microarchitecture [8] is useful for constructing feasible tests.

Figure 4: ARM926 overview from [9]

The ARM926EJ-S inside the OMAP1710 is a member of the ARM9 family of general-purpose microprocessors and is targeted at multitasking operating systems with full memory protection and virtual memory support. The caches and MMU can be significant power consumers and can be studied as primary factors, but must also be addressed as secondary factors when studying e.g. core internal factors. The JTAG and ETM instrumentation also consume power, so the test setup should strive to keep the contribution of those blocks as constant as possible.

The actual ARM9EJ-S [10] processor core implements the ARM v5TE architecture. This includes support for the ARM, Thumb and Jazelle instruction sets and DSP instructions. The pipeline consists of five stages:

• Instruction fetch
• Decode
• Instruction fetch
• Data read
• Data write
Figure 5: ARM9 pipeline from [8]
• Functional unit activity
• Register access
To limit the scope of this work, the test pro-
grams were constructed so that instruction and
data fetches occur always from cache, i.e. code se-
quences of test loops were kept smaller than instruc-
tion cache (32k) and data accesses were localized
to areas smaller than data cache (16k). To max-
imize the proportion of instructions that excercise
the factor under study test loops were contructed as
instruction sequences of 4096 instructions repeating
the effect being measured. The loop overhead was
below ten instructions so the loop overhead effect
on power should be quite small.

2.3.2 Instruction fetch


One could assume that for instruction fetch the
amount of alternating bits in consecutive instruc-
tion words would affect the amount of switching in
the instruction pipeline. Therefore, a feasible test
series would be a set of test cases where the number
of alternating bits between consecutive instructions
is varied and the power level for each case is mea-
Figure 6: ARM9 data path from [11] sured. However, the test should also be constructed
so that the amount of bit switching in the instruc-
tion words does not affect the switching activity in
• Execute the data path.
The nop instruction is probably the first thing
• Memory access that comes to mind. The ARM ISA does not actu-
ally define any dedicated nop instruction. Instead
• Register writeback
no-operation instruction execution can be achieved
The datapath consists of a register bank (contain- by e.g. moving the value from register zero to regis-
ing program counter) with three read ports and two ter zero. This allows constructing different instruc-
write ports. Execution units include e.g. multiplier, tion streams with no effect. However, it turns out
accumulator, shifter and ALU blocks. that the functionally no-operation mov rx,rx in-
The dynamic power consumption of a digital cir- struction is not short circuited in any way by the
cuit is relative to the switching capacitance: instruction decode/execution logic and does actu-
ally cause data path activity. This can be observed
P =α×C ×V2×f (5) by measuring the power level of mov r15,r15, i.e.
using the program counter as the operand regis-
where α is the switching activity. We can use ter. Accessing the program counter causes clearly
the overall architectural description of the ARM926 higher power level for the instruction stream than
core to guide our guesses as to what affects the using any other register.
switching activity related to executing different Fortunately most of ARM instructions can also
code sequences. From the above description we can be conditionally executed. The data path activity
assume for example that the following factors could can be prevented by using a condition code which
affect the energy consumed: never gets true during the test.
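The quantity varied in these tests, the number of alternating bits between consecutive instruction words, is simply the Hamming distance of the two encodings. A small sketch of the idea (the 32-bit base encoding and field offsets below are illustrative, not the actual ARM opcodes):

```python
def alternating_bits(word_a: int, word_b: int) -> int:
    """Hamming distance between two 32-bit instruction words."""
    return bin((word_a ^ word_b) & 0xFFFFFFFF).count("1")

# Varying only the operand-register fields of a hypothetical
# mov rx,rx encoding changes the toggle count between words.
BASE = 0x01A00000  # illustrative encoding, not a real opcode

def mov_rx_rx(reg: int) -> int:
    # destination field at bit 12, source field at bit 0 (assumed layout)
    return BASE | (reg << 12) | reg

stream = [mov_rx_rx(r) for r in (0, 1, 0, 3, 0, 15)]
toggles = [alternating_bits(a, b) for a, b in zip(stream, stream[1:])]
```

With register numbers 0 and 15 the toggle count reaches 8 bits, matching the 0-8 range exercised by the measurements.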

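Test loops of the shape described above are mechanical to generate. A sketch (the register choices, guard condition and label scheme are illustrative) that emits GNU-assembler text for one test case:

```python
def make_test_loop(insn: str = "movne r1, r1", repeats: int = 4096) -> str:
    """Emit an assembly test loop: one compare forces the 'ne' condition
    false forever, then `repeats` copies of the guarded instruction follow,
    so only fetch/decode activity varies while the data path stays quiet."""
    lines = ["    cmp r0, r0        @ sets Z, so 'ne' never fires"]
    lines.append("1:")
    lines.extend(f"    {insn}" for _ in range(repeats))
    lines.append("    b 1b              @ loop overhead: a single branch")
    return "\n".join(lines)
```

At 4096 four-byte instructions the body is 16k, which keeps it resident in the 32k instruction cache as required above.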
Two test cases were constructed for measuring the effect of bit toggling in instruction words. One test varies the operand register of a mov rx,rx instruction, allowing the number of alternating bits to be varied over the values 0, 2, 4 and 8. The other test varies the immediate value of a mov r0,#imm instruction, allowing bit variation between zero and eight bits. In both tests a condition code forced to false was used to eliminate the data path contribution.

2.3.3 Data access

Simple tests were constructed for exercising ldr and str instructions accessing repeatedly the same address. For isolating the data access power contribution from instruction execution, the test sequences were run with conditional execution with an always true condition and a never true condition.

2.3.4 Functional unit activity

For exercising the data path, the instruction path switching is the secondary factor to be eliminated. Instruction fetch bit switching variation is minimized by using long test code sequences of exactly the same instruction. Simple tests exercising the add and mul instructions were constructed, as well as mov with shift for exercising the shifter.

2.3.5 Register access

For testing the effect of bit switching in the register bank with minimal functional unit contribution, a test program moving values to r0 alternately from r1 and r2 was constructed. The test was repeated with different values stored to registers r1 and r2. As the actual code sequence does not change, only the values in the registers, the instruction path contribution should remain constant and the variation should be due to the datapath.

3 Results

3.1 General

As the measured absolute power levels are not of interest for this work and even absolute differences between different test cases can be misleading, choosing the metric to describe the energy consumption of different factors is not very straightforward. In the following presentation the energy and power values are given as the number of standard deviations above the idle level, where the standard deviation is taken from the idle power level measurement. The variation of the standard deviation was very small between different tests, so this unit closely matches the error margin of the different tests.

Figure 7: Instruction path bit switching effect (relative energy of the nop and move immediate test cases as a function of the number of alternating bits, 0-8)

3.2 Instruction fetch

Figure 7 shows the relative energy of the mov rx,rx and mov r0,#immediate instructions as a function of the number of alternating bits. As can be seen, the trend is very clear and the two test cases agree reasonably well. Using register r15 (the program counter) seems to deviate slightly from the other data points, which could suggest that the PC might be handled specially already in the instruction decode stage.

3.3 Data access

The relative energies for load and store instructions are shown below:

instruction    data        taken     not taken
ldr            all zeros   70 ± 1    39 ± 1
ldr            all ones    72 ± 1    39 ± 1
str            all zeros   63 ± 1    38 ± 1
str            all ones    63 ± 1    38 ± 1

As can be seen, the data cache access is clearly measurable and the difference between reads and writes is also clear. A surprising finding is that apparently all-ones data takes slightly more energy than all-zeros data (at least for reads). A similar phenomenon was observed in other tests too, but usually the effect was well below the error margin. In principle, the value of constant data should not affect the switching activity.
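The relative-energy unit used throughout these results can be written out directly; a sketch with made-up sample values (the numbers are purely illustrative, not measured data):

```python
from statistics import mean, stdev

def units_above_idle(test_samples, idle_samples):
    """Power level of a test case expressed as the number of idle-level
    standard deviations above the idle mean (the unit used in section 3)."""
    return (mean(test_samples) - mean(idle_samples)) / stdev(idle_samples)

# Purely illustrative sample values.
idle = [100.0, 101.0, 99.0, 100.0]
ldr_case = [170.2, 169.8, 170.0]
relative_energy = units_above_idle(ldr_case, idle)
```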

54
it seems that the only significant functional unit
52
energy-wise is the multiplier. The effects of e.g. bit
switching in the instruction path and in the register
50
bank as well as the data cache reads and writes are
48
clearly measurable but with the performed tests it
was not possible to measure the effects of e.g. shift
Energy

46 and addition.
44
The measurements performed in this work were
quite limited. For example comparing ARM and
42 Thumb execution was completely omitted as was
40
all testing related to the branch instructions. Also
only addition was tested as an arithmetic-logic op-
38
0 5 10 15 20 25 30
eration. Possible future work could include more
Number of alternating bits complete tests and comparisons to e.g. ARM11 and
Cortex architectures.
Figure 8: Register bank bit switching effect

References
3.4 Functional unit activity
[1] Texas instruments OMAP1710 overview.
The relative energies for the tested data processing
http://focus.ti.com/...
instructions are shown below:
instruction minimum maximum [2] Joint test action group, standard test ac-
mul 2 × ( 48 ± 1 ) 2 × ( 62 ± 1 ) cess port and boundary-scan architecture, ieee
add 38 ± 1 39 ± 1 1149.1. http://en.wikipedia.org/wiki/JTAG.
mov with shift 39 ± 1 39 ± 1
[3] Embedded trace macrocell architecture speci-
The instructions were run with different data val- fication. http://www.arm.com/pdfs/...
ues as operands and as can be seen multiplication
shows significant data dependecy. High cost of mul- [4] Lauterbach datentechnik GmbH debug and
tiplication is clearly measurable already before tak- trace products. http://www.lauterbach.com/.
ing CPI into account. On the other hand it seems
[5] ARM926EJ-S technical reference manual.
that adder and shifter consume neglible energy (the
http://arm.com/pdfs/DDI0198D 926 TRM.pdf.
power level difference to e.g. nop is below error
margin). [6] Ubuntu, community developed linux-based op-
erating system. http://www.ubuntu.com/.
3.5 Register access [7] Dan Kegel. Building and testing gcc/glibc
Figure 7 shows the relative energy of mov r0,r1, cross toolchains. http://kegel.com/crosstool/.
mov r0,r2 code sequence as a function of the num-
[8] The ARM9 family - high performance mi-
ber of alternating bits in the values in registers r1
croprocessors for embedded applications. In
and r2. As for instruction path, the trend is very
ICCD ’98: Proceedings of the International
clear.
Conference on Computer Design, page 230,
Washington, DC, USA, 1998. IEEE Computer
4 Conclusions and future Society.

work [9] ARM926EJ-S product overview.


http://www.arm.com/pdfs/DVI0035B 926 PO.pdf.
Before this work I had doubts about the feasibility
of using the SoC supply line current for measuring [10] ARM9EJ-S technical reference manual.
core level effects. However, the measurements indi- http://www.arm.com/pdfs/DDI0222B 9EJS r1p2.pdf.
cate that at least rudimentary analysis can be per- [11] ARM9E-S (rev 2) technical reference manual.
formed with carefully planned and performed tests. http://www.arm.com/pdfs/DDI0222B 9EJS r1p2.pdf.
Even though one needs to be careful about draw-
ing conclusions about even the relative differences

Energy-Aware Scheduling

Juhani Peltonen
Software Technology Laboratory/TKK
[email protected]

Abstract

Because of the increase in processing power and the slow development of battery technology in contrast to circuit technology, battery life has become the most limiting factor in mobile devices. One of the most energy-consuming parts is the processor. A common mechanism to lower the processor's energy consumption is dynamic voltage scaling (DVS), which scales both voltage and frequency. There exist numerous scheduling algorithms which try to minimize the processor's energy consumption. This paper considers several algorithms for energy-aware scheduling: EDF and RM for hard real-time systems, one interval and one stochastic scheduling algorithm for soft real-time systems, and two compiler-assisted algorithms and procrastination scheduling, also for soft real-time systems. The considered algorithms use dynamic voltage scaling to adjust the processor's performance.

1. Introduction

Due to the increase in processing power, battery life has become the most limiting factor in mobile devices. The part that accounts significantly for total energy consumption is the microprocessor. It is typical, however, that the peak performance of the processor is rarely needed, which means that most of the time a lower-performance processor would suffice. A common technique to lower the processor's energy consumption is dynamic voltage scaling (DVS).

DVS is a technique to lower the processor's performance. The idea is to run the processor at a lower speed by reducing both voltage and frequency. Because of the non-linear relationship between voltage and consumed energy, running at low speeds saves energy. Performance scaling, however, interferes with the real-time requirements of applications and can cause deadline misses. That is why care should be taken when the performance is scaled.

This paper concentrates on DVS scheduling algorithms. The focus is on real-time systems. The rest of the paper is organized as follows: section 2 gives some background on real-time systems and dynamic voltage scaling. Section 3 discusses the earliest deadline first and rate monotonic scheduling algorithms for hard real-time systems, into both of which DVS is integrated. Section 4 considers interval scheduling, in which the system load is monitored at fixed intervals. Section 5 discusses stochastic soft real-time scheduling for applications with statistical performance requirements. Section 6 discusses some compiler-assisted scheduling algorithms. Section 7 considers procrastination scheduling, which tries to take leakage current into consideration. Finally, section 8 discusses the feasibility of the different scheduling algorithms.

2. Background

2.1 Real-Time Systems

Real-time systems [9] are systems which run applications that have deadline requirements. The system should ensure that deadlines are not missed.

Real-time systems can be divided into two categories: hard and soft. Hard real-time systems differ from soft ones in that in hard real-time systems all deadlines must be met. In soft real-time systems, missing a deadline doesn't cause damage but decreases effectiveness. Soft real-time systems typically provide statistical performance guarantees (i.e. meeting deadlines with a certain probability). Multimedia applications are typical examples of applications with soft real-time requirements. For example, for an application which displays video it is not crucial to display every frame on time, although it is preferable.

The third group of applications, besides the ones which have either hard or soft real-time requirements, are best-effort ones. Best-effort applications have no performance requirements. They are not considered in this paper.

2.2 Dynamic Voltage Scaling

Dynamic voltage scaling is a common method to adjust the processor's energy consumption. It exploits the following
characteristics of CMOS-based circuits: the consumed energy is proportional to the square of the voltage, and the maximum frequency is determined by the voltage [2]. In other words, by using lower frequencies it is possible to use lower voltages, which leads to savings in energy. Although the processor's performance could be scaled by scaling just the frequency, the energy savings would likely be negligible because of the linear relationship between energy consumption and frequency.

Current processors (e.g. the Intel Pentium M [5]) have only some discrete speed settings. This means that it is not always possible to find the optimal speed. Usually, the speed to be used is acquired by rounding the calculated speed upwards to the closest available one. Some algorithms discussed later benefit from a small number of speed settings, in that they consume less energy than they would with more speed settings available. Some other algorithms behave just the opposite.

The usage of DVS implies that using a lower performance saves energy. However, as is shown in [11], this doesn't always hold: in some cases it is more energy efficient not to reduce the voltage below a certain point. In other words, more energy can sometimes be saved by running at a higher speed and then idling than by running at a lower speed and idling less. Also, as technology scales, leakage current becomes more and more significant, decreasing the effectiveness of DVS [3]. DVS scheduling algorithms can take the aforementioned facts into consideration to increase energy savings.

3. EDF and RM

This section discusses the approach to DVS scheduling described in [13], in which DVS was integrated into the earliest deadline first (EDF) and rate monotonic (RM) scheduling algorithms.

Earliest deadline first (EDF) [8] is a dynamic, preemptive scheduling algorithm for hard real-time systems. EDF assigns priorities to tasks according to their deadlines such that the task with the earliest deadline gets the highest priority and the task with the latest deadline gets the lowest priority. The task with the highest priority is executed. For a set of tasks to be schedulable, the EDF scheduler requires that the processor utilization factor does not exceed one, i.e. U = C1/P1 + · · · + Cn/Pn ≤ 1 [8], where Ci denotes the execution time, Pi the period and Ci/Pi the fraction of processor time spent in the execution of task i.

Rate monotonic (RM) [8] is a fixed-priority, preemptive scheduling algorithm which assigns priorities according to task periods. The task with the shortest period gets the highest priority. For a set of tasks to be schedulable, it must hold that U = C1/P1 + · · · + Cn/Pn ≤ n(2^(1/n) − 1).

The first approach taken in [13] was static voltage scaling, which simply selects the lowest possible execution speed with which a given task set remains schedulable. All tasks in the set are executed at the selected speed. The worst-case execution times (WCETs) [15] are scaled according to the new utilization factor U by the amount 1/U. The selected processor utilization is changed only when the task set is changed. The approach is simple, but it doesn't take into consideration the fact that tasks typically use fewer cycles than their worst cases require. Because it is typical that tasks finish earlier than their WCET-based deadlines, static scaling loses energy by executing tasks at a speed that is too high.

The second approach, cycle-conserving RT-DVS (real-time DVS), exploits the fact that the typical execution time of a task is generally much less than its WCET. The name, cycle-conserving, implies that instead of wasting cycles by idling, cycles are conserved by reducing performance. The idea of the algorithm for EDF is as follows: if a task Ti completes earlier than its WCET, the processor utilization is recomputed using the actual execution time cci in place of the WCET Ci. Because the new utilization factor is less than the previous one, there is a possibility to use lower performance. The performance scaling doesn't affect the schedulability of the task set until the task Ti gets re-scheduled, because the schedulability test continues to hold. When the task Ti gets re-scheduled, the actual execution time cci is replaced by the WCET Ci. At this point, the utilization can increase. Because several tasks can finish early, the cycle-conserving algorithm can gain significant energy savings.

Cycle-conserving RT-DVS for RM is a bit different because of the O(n²) (where n is the number of tasks to be scheduled) requirement for the schedulability test. The same approach that calculated a new utilization for cycle-conserving EDF could be used, though. The algorithm starts with a utilization obtained from the static scaling algorithm, which uses WCETs to calculate the utilization. After that, the work that should be accomplished before the next deadline is spread out between the current time and the next deadline. After a task has been executed, the remaining work is again spread out between the current time and the next deadline. This procedure is repeated.

The third approach, look-ahead RT-DVS EDF, tries to defer as much work as possible so that utilization values as low as possible can be used now. In contrast to cycle-conserving RT-DVS, look-ahead RT-DVS starts from low speeds instead of high ones. The idea of the algorithm is as follows: the algorithm pushes as much work over the next deadline as possible and computes the minimum number of cycles needed before the next deadline such that no future deadlines are missed. The execution speed is set just high enough to execute the selected number of cycles before the next deadline. Because the algorithm starts from a low utilization it is possible that high utilizations are re-
quired later. If tasks tend to finish early, it is likely that the highest utilizations are never needed.

Based on the simulation results in [13], energy savings were significant. The available voltage and frequency settings of the processor had a large impact on the effectiveness of the algorithms. Because look-ahead EDF tries to defer work, a large number of settings allowed the algorithm to match the desired performance level closely, which required high-performance settings later, increasing energy consumption. With a smaller number of settings, less work was deferred, requiring less high-performance processing later. Static and cycle-conserving EDF benefit from a large number of settings because the more settings there were, the more closely they could match the calculated performance. When applications used less computation time than their WCET, cycle-conserving RM didn't adapt that well to the situation. Instead, cycle-conserving and look-ahead EDF showed significant reductions in energy consumption. Simulation results showed that the look-ahead and cycle-conserving algorithms significantly reduce total energy consumption. Measurements on an actual platform showed energy consumption reduced by 20% to 40%.

4. Interval Scheduling

This section discusses two interval schedulers, PAST and AVGN, originally proposed by [14], and is based on the work in [4].

Interval schedulers work by performing predictions of future system loads and scalings at fixed intervals. At each interval, the scheduler predicts the processor utilization based on one or more preceding intervals. In PAST, the predicted load for the current interval is set to be as high as in the last interval. In AVGN, where N is the number of intervals to be averaged, a "weighted utilization" at time t, Wt, is calculated as a function of the utilization of the previous interval, Ut−1, and the previous weighted utilization, Wt−1, such that Wt = (N Wt−1 + Ut−1) / (N + 1) [4]. AVG0 corresponds to PAST.

Measurements in [4] were done using an actual test platform, the Itsy Pocket Computer, which ran a port of the Linux operating system. The Linux kernel was modified to support the tested algorithms. The tested applications were for web browsing, text reading, chess, and MPEG video and audio.

The best policy that [4] found was PAST (i.e. AVG0), which scaled frequency such that it selected only the minimum or maximum frequency. The minimum frequency was selected when the processor's utilization dropped below 93%. The maximum frequency was selected when utilization was greater than 98%. Other policies were examined with N > 0 and by scaling frequency by one step or by doubling or halving it. The reasons for PAST being the best one were that it never missed any deadlines and that it also saved energy (although the amount of saved energy was small). The other policies didn't give good results. A problem with AVGN was that the response time of the system increased with increased N. The greater the value of N was, the more time it took before the frequency was scaled when a change in the system load occurred.

PAST was later improved with PACE (Processor Acceleration to Conserve Energy) in [10]. PACE is not a complete DVS algorithm but a method to improve existing ones. In short, PACE changes the way tasks are scheduled to decrease expected energy consumption without changing performance. In the modified version of PAST, energy consumption dropped significantly compared to the original one.

5. Stochastic SRT Scheduling

In this section, stochastic soft real-time (SRT) scheduling is discussed. The discussion is based on [17].

GRACE-OS, the scheduler presented in [17], is a scheduler for mobile devices that primarily run multimedia applications. Instead of requiring worst-case execution times, GRACE-OS uses a probability distribution of the applications' cycle demands. The distribution is obtained at run-time. Although there are great variations in the cycle demands of applications, their probability distribution is stable (or changes slowly and smoothly), indicating that it is feasible to perform stochastic scheduling based on the demand distribution.

The GRACE-OS scheduler consists of three components: a profiler, a scheduler and a speed adaptor. The profiler constructs the probability distribution by monitoring the cycle usage of applications. The scheduler schedules tasks such that the given performance guarantees are met. The speed adaptor adjusts the CPU speed based on tasks' demands in order to save energy.

Cycles are allocated to applications by GRACE-OS as follows: for a task to meet its deadlines with probability ρ, the scheduler allocates C cycles such that the probability that the task doesn't use more than C cycles is at least ρ, i.e. P[X ≤ C] ≥ ρ [17]. The cycle count C for the task can be found from its probability distribution. After the cycle counts are found, an earliest deadline first based scheduling algorithm is used to dispatch the task which has the earliest deadline and a positive budget (the budget is set to C for every period and, as the task is executed, its budget is decreased by the amount of cycles it used). If a task uses its budget without finishing its job, the scheduler can either notify it to abort or let it run in best-effort mode. In best-effort mode, a task can either be set to use surplus cycles from other tasks or to block until the next period.

The approach to scheduling that GRACE-OS takes is to start executing a task from a low speed and accelerate
as execution progresses. The approach was taken because most multimedia tasks use less than their allocated cycles, so it is possible to avoid the highest speeds. The scheduler uses a speed schedule, which is a list of scaling points, to scale the speed of the processor as follows: if a task has used n cycles such that n ≥ x, where x is the cycle count for speed y, the speed is increased to y. The speed schedule is based on the demand distribution and is calculated such that it minimizes the energy consumption while meeting the statistical performance requirements.

Measurements in [17] were performed using a laptop which ran Linux. The applications were codecs for video, audio and speech. The Linux kernel was modified to support the experiments. The new system calls incurred negligible overhead (from 0.0004% to 0.5%). The overhead of constructing the demand distribution depended on the size of the profiling window (i.e. how many jobs of a task are kept tracked) and the size of the histogram groups. The overhead varied from 0.1% to 100%, which means that the demand distribution should be estimated rarely. The cost of the scheduling was at most 0.4%. Energy savings ranged from 10% to 72% with a single high-demand application and from 7% to 64% with multiple concurrent applications. GRACE-OS meets almost all deadlines in a lightly loaded system and meets deadlines within the statistical requirements in a heavily loaded system.

6. Compiler-Assisted Approaches

This section describes some methods of DVS scheduling with the help of a compiler and is based on the work in [12] and [16].

The idea in compiler-assisted scheduling is that the compiler analyzes the program and provides information about it; for example, the compiler can insert information into the program which is used by the run-time scheduler.

In [12], checkpoints were inserted at loop boundaries and procedure call sites. Checkpoints were used to calculate the actual execution time of a program section such that the checkpoint at the beginning of the section recorded the current time and the checkpoint at the end of the section computed the actual execution time. The execution time was compared to the WCET and, based on the result, the processor's speed could be adjusted if necessary. The following voltage adjustment schemes were used: no power management (NPM), where all tasks execute at maximum speed; static power management (SPM), where the minimum performance is precalculated based on WCETs and deadlines; dynamic power management-proportional (DPM-P), where the utilization of the processor is recalculated after every section, which allows further sections to slow down; dynamic power management-greedy (DPM-G), where all unused time is given to the next section; and dynamic power management-statistical (DPM-S), where the speed of the processor is calculated based on the unused time so far and the unused time in the future, which is based on the average execution time of a section.

The measurement results in [12] were reported as a function of the slack (unused time). NPM performed worst, which is quite obvious. DPM-G had very stable energy consumption on which the available slack had very little influence. This was because when there was much slack, the first few sections were executed very slowly, which consumed most of the slack. When further sections started to execute, there was very little slack available, which increased processor utilization. Most of the sections were executed with almost no slack, causing the consumed energy to stay at the same level. The energy consumption of the other three schemes, SPM, DPM-P and DPM-S, decreased as slack was increased. The energy consumption of the DPM-S scheme was always smaller than that of the other schemes.

The approach in [16] was to divide scheduling into design and run-time phases. The reasons for this were that the scheme better optimizes the embedded software design, gives the system more run-time flexibility and reduces run-time computation complexity.

In [16], the atomic unit of the design-time scheduler was called a thread node and the atomic unit of the run-time scheduler was called a thread frame. A thread frame consists of thread nodes. The design-time scheduler, which works on thread node granularity, explores different scheduling combinations and gives several solutions which the run-time scheduler uses. Naturally, the more solutions the design-time scheduler provides, the better results the run-time scheduler can achieve, but with increased overhead. A genetic algorithm was used in [16] to find solutions at design-time. The reasons for selecting a genetic algorithm were its speed and near-optimal solutions. For larger problems, algorithms that search for the optimal solution are not applicable because of the long computation time they require.

The design-time scheduler provides a set of solutions from which the run-time scheduler chooses one. The chosen one is the one which optimizes the system's energy consumption when all ready-to-run thread frames are considered.

The two-phase approach was tested with randomly generated examples and also with an actual ADSL modem. In the randomly generated experiments, there were two parallel processors: one worked three times faster, used three times the voltage of the other and consumed nine times more energy. The results were that the two-processor solution, compared to the one-processor one, consumed up to 72% less energy. In the ADSL modem experiment, energy savings ranged from 20% to 40%.
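The checkpoint bookkeeping of [12] can be sketched in a few lines. A toy model in the DPM-P spirit (the class and method names and the speed formula are illustrative, not taken from the paper):

```python
class SectionCheckpoints:
    """Toy DPM-P-style bookkeeping: after each program section, subtract
    the section's WCET from the remaining worst-case work, account the
    actual time used, and pick the lowest speed (as a fraction of full
    speed) that still meets the deadline. Illustrative formula only."""

    def __init__(self, section_wcets, deadline):
        self.remaining_wcet = sum(section_wcets)  # worst case at full speed
        self.deadline = deadline                  # total time budget
        self.elapsed = 0.0

    def end_section(self, wcet, actual_time):
        self.remaining_wcet -= wcet
        self.elapsed += actual_time
        slack = self.deadline - self.elapsed
        if self.remaining_wcet <= 0:
            return 0.0           # nothing left to run
        if slack <= 0:
            return 1.0           # no slack: full speed
        return min(1.0, self.remaining_wcet / slack)

cp = SectionCheckpoints(section_wcets=[2.0, 2.0, 2.0], deadline=9.0)
speed_after_first = cp.end_section(wcet=2.0, actual_time=1.0)
```

Sections that finish early lower the speed for the remaining sections, which is exactly the slack-passing behavior the schemes above differ on.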
7. Procrastination Scheduling

This section discusses procrastination scheduling and is based on the work in [6] and [7].

Procrastination scheduling tries to delay the execution of a task to maximize the duration of idle intervals. The goal is to minimize the energy consumption. Procrastination scheduling also considers leakage current, which has become a major concern as technology has scaled [7]. As [7] demonstrated, there exists a certain speed (the critical speed) below which static energy consumption starts to dominate. Thus, in some cases, it is more energy efficient to execute at the critical speed and shut down the system than to execute below the critical speed. Naturally, shutting down and waking up the processor isn't free but requires saving and restoring registers, caches, etc., which incurs additional energy consumption. [7] used a threshold value such that if the length of an idle period is less than the threshold value, the processor is not shut down.

The procrastination algorithm presented in [7] is as follows: first, a maximum procrastination interval, Zi, is precalculated for every task (Zi defines the maximum time that a task i can be delayed while guaranteeing that all deadlines are met). When the processor is in the shutdown state, a controller keeps track of time and wakes up the processor after a time period which is determined by the minimum Zi of the tasks that the controller has received. After the processor has woken up, it schedules the highest priority task with the assigned slowdown factor (the slowdown factor scales the execution speed of the task). Tasks are scheduled with the EDF policy.

Tests in [7] were performed in a simulator. The following approaches were compared: no DVS (no-DVS), where all tasks are executed at maximum performance; traditional DVS (DVS), where tasks are executed with the minimum possible slowdown factor; critical speed DVS (CS-DVS), where no task gets a slowdown below the critical speed; and critical speed DVS with procrastination (CS-DVS-P), which is CS-DVS with procrastination. All algorithms performed almost identically above the critical speed. However, below the critical speed CS-DVS consumed up to 5% less energy compared to DVS, and at a utilization of 10% CS-DVS-P consumed 18% less energy compared to CS-DVS. All algorithms clearly outperformed no-DVS.

The work in [6] proposes a dynamic slack (unused run-time) reclamation algorithm with procrastination scheduling. The idea of the slack reclamation algorithm is as follows: an early completion of a task results in a (dynamic) slack which is stored in a free run time list (FRT-list). When a task arrives in the system, it is assigned a time budget. Each task is allowed to use its own run-time and equal or higher priority run-time from the FRT-list. The algorithm can perform both dynamic slowdown and dynamic procrastination.

An important thing to consider is how the available slack is used: it can be used either for slowdown or for procrastination. The solution to the slack distribution in [6] is as follows: if the processor is in the shutdown state and the available slack would be consumed by executing a task at the critical speed, dynamic procrastination is not performed. Otherwise, the shutdown state is continued. When the processor has woken up, it uses the available slack for dynamic slowdown, the critical speed being the lower bound.

Based on the experiments in [6], dynamic slack reclamation with dynamic procrastination doesn't offer significant energy savings compared to dynamic slack reclamation with static procrastination. This is because the energy consumption due to the leakage current is already mostly avoided by the static procrastination. However, when intervals are not long enough for static procrastination to make the shutdown energy efficient, dynamic procrastination will extend the intervals, resulting in significant energy savings [6].

8. Conclusions

Dynamic voltage scaling is a good solution for saving significant amounts of energy. However, its efficiency as such diminishes as static (leakage) current gets larger because of smaller and smaller scales in circuit manufacturing. Static current is also the cause of the fact that lower voltages and frequencies do not always lead to energy savings: there exists a critical speed below which static current starts to dominate energy consumption. Operating below the critical speed is not energy efficient anymore, and it can actually be more energy efficient to shut down the processor than to continue executing at the critical speed.

Procrastination scheduling tries to address the aforementioned facts. The proposed procrastination scheduling algorithms in [7, 6], discussed in section 7, are good candidates for energy-efficient scheduling because of their simplicity and small overhead and because of their ability to take leakage current into consideration. They do not yet offer large energy savings compared to traditional DVS algorithms, though. It is predicted that a chip's leakage current increases about five times each generation [1], which can make procrastination scheduling an effective solution. One thing to consider in [7, 6] is that both papers used a
list is sorted by the priority of the slacks such that the slack controller which handled all the interrupts and task arrivals
with the highest priority goes to the head of the list (slacks while the processor was in the shutdown state. Neither one
are always taken from the head). A task can reclaim a slack gave any details about the controller.
from the FRT-list if the slack has higher or equal priority to Compiler-assisted approaches are interesting ones be-
cause programs can be analyzed off-line when there is a theless, the energy-aware versions of EDF and RM are not
lot more processing time available as there is at run-time. much more complex than the traditional ones. And, because
They should also decrease run-time scheduling overhead. they don’t miss deadlines and are still able to provide sig-
A major drawback of some approaches, like the one in [12] nificant energy savings, they are good options.
discussed in section 6, is the need to modify existing pro-
grams. Another drawback, although not as severe as the
References
previous one, is the need to analyze programs. The best so-
lution would naturally be the one which would be able to
use existing programs as such, without a need for modifica- [1] S. Borkar. Design challenges of technology scaling. IEEE
tions or analyzing. The approach in [16], discussed in sec- Micro, 19(4):23–29, 1999.
tion 6, needs no program modifications but analysis is still [2] A. Chandrakasan, S. Sheng, and R. Brodersen. Low-power
CMOS digital design, 1992.
required. Because of aforementioned reasons, compiler-
[3] I. M. Duarte D., Vijaykrishnan N. and Y.-F. Tsai. Im-
assisted approaches are not that well suited as a general
pact of technology scaling and packaging on dynamic volt-
solution. However, for some dedicated devices, compiler- age scaling techniques. In 15th Annual IEEE International
assisted approach can be a good choice. ASIC/SOC Conference, September 2002.
From the point of what additional information is needed [4] D. Grunwald, P. Levis, K. I. Farkas, C. B. M. III, and
in advance about applications, stochastic (SRT) scheduling M. Neufeld. Policies for dynamic clock scheduling. In
is a good choice. The algorithm in [17], discussed in sec- OSDI, pages 73–86, 2000.
tion 5, needs no prior knowledge about applications. All [5] Intel. Intel Pentium M Processor Datasheet, April 2004.
required information is acquired at run-time. One possible [6] R. Jejurikar and R. Gupta. Dynamic slack reclamation with
problem in statistical approaches is the ability to respond procrastination scheduling in real-time embedded systems.
to rapid changes in the computation requirements. The ap- In DAC ’05: Proceedings of the 42nd annual conference on
Design automation, pages 111–116, New York, NY, USA,
proach in [17], for example, used a run-time generated prob-
2005. ACM Press.
ability distribution to estimate applications’ cycle demands.
[7] R. Jejurikar, C. Pereira, and R. Gupta. Leakage aware dy-
Even though the distribution was relatively stable in spite
namic voltage scaling for real-time embedded systems. In
of varying computation requirements, there is still the time DAC ’04: Proceedings of the 41st annual conference on
at the first runs of applications that it takes to construct the Design automation, pages 275–280, New York, NY, USA,
distribution and to let it stabilize. Before the distribution 2004. ACM Press.
is stabilized, it is possible that the predicted cycle demands [8] C. L. Liu and J. W. Layland. Scheduling algorithms for mul-
are not that good with respect to energy consumption. Pro- tiprogramming in a hard-real-time environment. J. ACM,
filing could also be performed off-line but its suitability, for 20(1):46–61, 1973.
instance, to multimedia applications is questionable. Con- [9] J. W. S. Liu. Real-Time Systems. Prentice-Hall, 1st edition,
structing the probability distribution can also incur large 2000.
overhead. The overhead can be avoided by updating the [10] J. R. Lorch and A. J. Smith. Improving dynamic voltage
distribution less frequently. But, the more time there is be- scaling algorithms with PACE. In SIGMETRICS ’01: Pro-
tween updates, the more time it takes to get a distribution ceedings of the 2001 ACM SIGMETRICS international con-
that is stable enough at the beginning. Of course, the distri- ference on Measurement and modeling of computer systems,
pages 50–61, New York, NY, USA, 2001. ACM Press.
bution could be updated frequently when an application is
[11] A. Miyoshi, C. Lefurgy, E. V. Hensbergen, R. Rajamony,
started to run but that would increase the overhead.
and R. Rajkumar. Critical power slope: understanding the
Interval scheduling, discussed in section 4, is a rather runtime effects of frequency scaling. In ICS ’02: Proceed-
simple approach which makes it a tempting alternative. ings of the 16th international conference on Supercomput-
One problem, and probably the biggest one, with inter- ing, pages 35–44, New York, NY, USA, 2002. ACM Press.
val scheduling is the choice for the length of the interval. [12] D. Mosse, H. Aydin, B. Childers, and R. Melhem. Compiler-
Too long intervals will cause reduced system responsive- assisted dynamic power-aware scheduling for real-time ap-
ness to rapid changes but incur less overhead. Too short plications. In Workshop on Compilers and Operating Sys-
intervals, instead, will give higher responsiveness but incur tems for Low-Power (COLP’00), October 2000.
more overhead. [13] P. Pillai and K. G. Shin. Real-time dynamic voltage scaling
for low-power embedded operating systems. In SOSP ’01:
The energy efficient versions of EDF and RM, discussed Proceedings of the eighteenth ACM symposium on Operat-
in section 3, save energy compared to situation without ing systems principles, pages 89–102, New York, NY, USA,
energy-awareness and are still able to meet all deadlines. 2001. ACM Press.
The algorithms proposed in [13] incurred small overhead [14] M. Weiser, B. Welch, A. J. Demers, and S. Shenker.
and saved energy. One problem with EDF and RM is the Scheduling for reduced CPU energy. In Operating Systems
need of WCETs which can be hard to determine. Never- Design and Implementation, pages 13–23, 1994.
[15] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti,
S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heck-
mann, F. Mueller, I. Puaut, P. Puschner, J. Staschulat,
and P. Stenström. The determination of worst-case execu-
tion times—overview of the methods and survey of tools.
Accepted for ACM Transactions on Embedded Computing
Systems (TECS), 2007.
[16] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet,
D. Verkest, and R. Lauwereins. Energy-aware runtime
scheduling for embedded-multiprocessor SOCs. IEEE Des.
Test, 18(5):46–58, 2001.
[17] W. Yuan and K. Nahrstedt. Energy-efficient soft real-time
CPU scheduling for mobile multimedia systems. In SOSP
’03: Proceedings of the nineteenth ACM symposium on Op-
erating systems principles, pages 149–163, New York, NY,
USA, 2003. ACM Press.
Instruction-level Energy Consumption Models

Kristian Söderblom
Software Technology Laboratory/TKK
[email protected]

Abstract

A lot of effort is being spent on finding ways to decrease energy consumption and increase battery life in portable devices like mobile phones. To minimize the energy consumption of a device, designers have to optimize at all levels and in all parts of the device, starting at the system level. Energy optimizations in the source code and in an energy-aware compiler can be used to reduce energy consumption in the software part of the device. Energy estimation at the machine instruction level requires an instruction-level energy consumption model. This paper focuses on the well-known instruction-level energy model by Tiwari et al. In addition, the paper briefly describes some other models that attempt in various ways to be more powerful than the Tiwari model. Finally, the paper concludes.

1. Introduction

Recent years have seen a huge increase in the number of available types of hand-held devices such as mobile phones, cameras, PDAs, etc. New features and software in the devices, and decreasing device size, mean that energy consumption and battery life continue to be very important design factors. To reduce time to market and expenses, there has been a move from application-specific logic to programmable devices. Energy consumption in a device depends primarily on the hardware and its design, but also, if the device is programmable, on the software running on the device.

A huge number of advances and improvements have been made in embedded system hardware, especially mobile phone hardware [16]. Batteries keep improving slowly but steadily. However, these advances are not in themselves enough to minimize energy consumption in devices. One reason for this is the digital convergence, which requires there to be more and more software in the devices, for example multimedia codecs.

To improve battery life in devices and cope with thermal issues, we need low-power design and power estimation methods. While hardware design is often top-down, starting at the system level, hardware designers use low-level tools, at the circuit level and gate level, to simulate their designs and estimate power consumption. Software developers need higher-level tools. One reason is that circuit- and gate-level information is not available in processor manuals. Low-level tools are also slow. With high-level energy estimates, for example at the instruction level, software developers can get feedback on how energy efficient their code is. If an energy-aware compiler is used, the compiler itself can make energy optimizations and provide feedback to the user about energy consumption. With a traditional compiler, which optimizes only for speed and/or size, the programmer can use an energy estimate to get feedback about source-level optimizations.

The structure of the paper is as follows. In chapter 2, we present high-level energy estimation and optimization as complementary to more traditional low-level methods. Chapter 3 presents the well-established instruction-level power analysis methodology introduced by Tiwari et al. Chapter 4 presents some other models which build on the same ideas. Chapter 5 concludes the paper.

2. High-level energy estimation and optimization

Before talking about different abstraction levels, we want to mention some basic facts. Average energy equals average power multiplied by time: E = P × T. When the supply voltage in a device is V and the current is I, power equals V × I. Energy can be used to perform work, that is, to handle a user's workload. Power means how much energy is used per time unit. While energy and power are not the same thing, this difference is often not emphasized in power estimation literature.

To be able to optimize the energy consumption of devices, optimizations have to be done at all possible levels and in all components, both hardware and software. Huge gains can be realized by making good design choices at a high level of abstraction. For example, optimizing software for power at the source code level can in some cases lead to
energy savings of 90% [9].

2.1. Top-down design flow and energy estimation

Embedded system design often begins at the highest level, namely the system level. Macii et al. [7] present a possible low-power design flow. The design process is top-down; that is, the designer first partitions the design into components or modules, for example analog and digital parts, or hardware and software components of the system. For hardware parts that are developed in-house, the top-down design continues for the modules down to the lowest levels. To estimate the energy consumption of devices, hardware designers can use tools with a low level of abstraction, since all the details of the design are available to them. For example, SPICE is a tool for simulating a design at the circuit level.

Energy estimates are also needed at higher levels; for example, designers need to make system-level power budgets, determining how much energy the different parts of the system are allowed to use. To deal with the increasing complexity and features in devices, there is a need for design tools working at a high level [16].

In general, one can say that low-level methods require a lot of information and time-consuming simulation, but can be very accurate in estimating energy consumption. At higher levels, fewer details are known, so the estimation can be faster, but it might not capture all the subtle effects in the design. One should note that relatively correct power estimates are often enough when making design tradeoffs. If one wants to make sure a component does not exceed its energy budget, absolutely correct estimates are required.

2.2. High-level abstraction of the design

High-level power estimation methods [4, 7] are not as mature as those at the lower levels of abstraction. However, a lot of research has gone into developing high-level power estimation models and methods. These high levels are, in order of increasing abstraction: the architecture, algorithm and system levels. The architecture level is the lowest of these: it deals with things like voltages, capacitances and technology parameters of hardware components like memories, registers, etc. The algorithm level deals with behavior- and instruction-level issues. At the behavior level, the activity of resources is predicted, either statically or dynamically (using profiling). At the instruction level, hardware details are abstracted behind a well-known interface, the machine instruction set. The highest level is the system level, where supported features, hardware, compilers, et cetera, are selected. This paper focuses on the software, that is, the instruction level.

2.3. Instruction level modeling, estimation and optimization

For software people who are used to programming and compilers, the instruction level is an intuitive level at which to make energy optimizations. Both the programmer and the compiler can have an effect on the energy consumption of the resulting software.

At the instruction level, we are dealing with the machine instructions of a certain processor. The hardware details disappear behind a simple interface, the instruction set. To be able to estimate the energy consumed by software, a model must be made of the energy consumption. When a program is given as input, the model is used to estimate the amount of energy used by the hardware as a result of running this particular program. The model can be used to estimate the energy consumed when running a whole program or just a certain sequence of instructions.

The instruction sequence used as input can be obtained statically or dynamically. Statically, one can for example estimate the energy consumption of individual basic blocks, or, if more advanced static analyses are available, of the whole program. In the dynamic case, one simulates the program for one or many different inputs and can then estimate the power of the resulting instruction sequences.

Application software on embedded devices should be compiled with an energy-aware compiler. These compilers need to consider, in addition to performance and code size, energy or power. Energy-aware compilers need energy estimates of instruction sequences to be able to minimize the energy consumption of compiled programs. An energy-aware compiler can use an instruction-level power model to make power-performance tradeoffs. Good estimation speed allows a compiler to try more alternatives. If an energy-aware compiler is not available, one can use an energy analyzer or simulator to evaluate the energy efficiency of source-level optimizations.

3. Tiwari model

In 1994 Tiwari et al. published a methodology [14, 13] for developing and validating an instruction-level energy consumption model for any processor. One motivation for this is that the instruction set of a programmable device is always known, whereas all hardware details might not be. The Tiwari model relies on physical measurements taken from the hardware, which can be done by third parties. Tiwari calls the approach instruction-level power analysis (ILPA). Being able to estimate the energy consumption of machine instructions, one can proceed to estimate the energy consumption of entire programs. The presented methodology makes few assumptions, which makes it applicable to a wide variety of processors (e.g. CISC, RISC, and DSP).
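The measurement-based idea behind ILPA can be made concrete with a little arithmetic. Under the model's assumption of a constant supply voltage and clock frequency, a measured average current translates directly into energy via E = P × T = (V × I) × (N × τ). The sketch below is purely illustrative; the voltage, current and frequency values are hypothetical, not measurements of any real processor:

```python
# Energy of an instruction sequence from a measured average current,
# assuming constant supply voltage and clock frequency.
# V, I and f are hypothetical illustration values, not measurements
# of any real processor.

V = 3.3        # supply voltage (volts)
I = 0.215      # measured average supply current (amperes)
f = 40e6       # clock frequency (hertz)
tau = 1.0 / f  # length of one clock cycle (seconds)

def energy(n_cycles):
    """E = P * T = (V * I) * (N * tau), result in joules."""
    return (V * I) * (n_cycles * tau)

# A one-cycle instruction versus a four-cycle memory access:
print(energy(1), energy(4))
```

This is the same arithmetic used when base costs are measured: the current drawn while an instruction loops on the processor is recorded, and with the voltage and cycle count known, the energy per instruction follows.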
3.1. Tiwari's expression for energy consumption

In the Tiwari model, each instruction has a base cost, and in addition to this there is an inter-instruction cost, the overhead of executing two specific instructions after one another. To this is added the energy of other effects like branch misprediction, pipeline effects and caches:

Ep = Σi (Bi × Ni) + Σi,j (Oi,j × Ni,j) + Σk Ek

where
Ep = energy consumed by the program
Bi = base cost of instruction i
Ni = number of executions of instruction i
Oi,j = overhead of executing instruction pair (i, j)
Ni,j = number of executions of instruction pair (i, j)
Ek = energy of other effects

(Remember that E = P × T = (V × I) × (N × τ), where N is the number of clock cycles and τ the length of the clock cycle.)

3.2. Base costs

The base cost of an instruction is the energy necessarily used in the processor to execute the instruction. By definition, it does not include any energy consumption caused by executing multiple instructions, e.g. stalls or cache misses. One should note that even in power-optimized processors, base costs for instructions with similar functionality tend to be the same. This leads to the idea of grouping instructions to reduce the number of needed measurements and speed up the power estimation.

Having a fixed average base cost for an instruction, independent of operand values, is justified because the variation due to operand values has been found to be quite small (at most <10%) [13].

Note that memory operands accessing different kinds of memory can lead to huge differences in energy consumption. A possible solution is to have separate base cost tables for different kinds of memory.

3.3. Inter-instruction (overhead) costs

The inter-instruction cost is a cost which involves different instructions. When two different instructions are executed after one another, changes in the circuit state cause some energy consumption overhead. Inter-instruction effects can also be e.g. stalls or cache misses [13]. The inter-instruction effect is usually on the order of 10% of the base cost, depending very much on the processor.

One interesting thing to note about inter-instruction costs is that the more power-optimized the processor is, the more likely it is that the inter-instruction effects are relatively big. On processors that have few power optimizations, instructions that take an equal amount of clock cycles to execute take approximately the same amount of energy. This is because energy is used in parts of the processor to calculate results that are later discarded.

3.4. Energy of other effects

The other effects are, for example, prefetch buffer and write buffer stalls, pipeline stalls, and cache misses.

3.5. Measurements

In the Tiwari model, estimates are based on current measurements. Core voltage and clock frequency are assumed to be constant, so it is enough to measure the current usage. Base costs can be determined by putting the relevant instruction in an infinite loop and measuring the current with an ammeter. Inter-instruction costs can be measured by alternating between two different instructions in the loop; the current will in this case be greater than the average of the base currents of the two instructions.

In theory, for a processor with n instructions we need to measure n base costs and O(n²) inter-instruction costs. In practice, instructions can be grouped, e.g. into arithmetic, bit operation, logical, move, etc. The energy consumption of the instructions in a group is almost the same, because the instructions exercise the same parts of the processor. The way to measure the other effects is to write some code where these occur and subtract the base and inter-instruction costs from the total energy consumed.

3.6. Memory

If an instruction accesses memory, the energy consumption depends on the specific memory (e.g. on-chip or off-chip). Each kind of memory might require its own set of measurements. If one considers the cases where the instructions and the data can both be either in on-chip or off-chip memory, there are four different combinations. This sort of measurement has been done, for example, for the ARM7TDMI at Dortmund [12]. For example, load instructions that access only off-chip memory were found to be about ten times more energy consuming than those that use only on-chip memory.

3.7. Application

The Tiwari model has been found to work well on CISC (e.g. i486) [14], RISC (e.g. ARM7), and DSP processors [5, 1].

Once the model has been established, it can be used to estimate the average energy consumed by a certain sequence
of machine instructions. Note that an instruction-level energy model is not as complex as the one needed in a cycle-accurate energy simulator, where architecture-level information is required, for example the energy consumption in different parts of the processor pipeline.

The Tiwari model can be used to check that the software parts stay within a given energy budget. Also, even though the estimates are of average power, more fine-grained estimates can be made. That is, energy estimates can be made for parts of programs, not only for the whole program.

The Tiwari model can also be used to evaluate compiler optimizations. One possible optimization that comes to mind when familiar with the Tiwari model is rescheduling instructions for low power. It is often possible to change the order of the instructions in a program without changing the end result of the computation. The idea is to reorder the instructions so that the inter-instruction effects are minimized. In [10] a list-scheduling algorithm is presented for this purpose.

Hardware designers can use the model to determine whether some instructions in typical software use unnecessarily much power and should be optimized at the microarchitecture level.

4. Extensions and other models

The Tiwari model has some limitations, most notably that the energy estimate is only an average. It does not tell exactly how much power is needed at every instant, or the worst-case energy consumption.

4.1. Dortmund model

In 2001, a parametric energy consumption model [11], made at the University of Dortmund, was published. The model is used in the research group's energy-aware compiler, encc, and in general in the scratchpad research [15] going on there. In this model, parameters and auxiliary functions model bit toggling on busses, the memory hierarchy, functional unit activity, etc.; the parameters are estimated from current measurements using linear regression.

4.2. A parametric model based on regression analysis

A technique to create an energy consumption model for a RISC processor using measurements and a regression model was presented in [6]. Experiments showed an average energy estimation error of 2.5% for random instruction sequences.

4.3. Instantaneous dynamic power consumption model

Instantaneous power consumption can be estimated by modeling the processor as an LTI (linear time-invariant) system, where the program instructions are the input signal and the power consumption is the response [8].

4.4. Worst case energy consumption

A worst-case energy consumption (WCEC) model is necessarily more complex than an average energy consumption model; for example, it needs to know the worst-case execution time (WCET) to estimate the worst-case leakage energy [2]. This requires extending the instruction-level model with features from the architecture level.

5. Conclusions

Instruction-level power analysis can be used, for example, in energy-aware compilers. Software developers can get feedback from the compiler about how energy efficient their code is and use this knowledge to learn to write more power-efficient programs. Instruction-level power estimates can also be used as a part of even higher-level models. For example, a framework for optimizing at the source code level can employ an instruction-level power model to build a database of estimated energy for probable code sequences [3].

References

[1] Miguel Casas-Sanchez, Jose Rizo-Morente, and Chris J. Bleakley. Power Consumption Characterisation of the Texas Instruments TMS320VC5510 DSP. In 15th International Workshop on Integrated Circuit and System Design, Power and Timing Modeling, Optimization and Simulation (PATMOS'05), pages 561–570, September 2005.

[2] Ramkumar Jayaseelan, Tulika Mitra, and Xianfeng Li. Estimating the Worst-Case Energy Consumption of Embedded Software. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'06), pages 81–90, San Jose, California, USA, April 2006.

[3] I. Kadayif, M. Kandemir, G. Chen, N. Vijaykrishnan, M. J. Irwin, and A. Sivasubramaniam. Compiler-Directed High-Level Energy Estimation and Optimization. ACM Transactions on Embedded Computing Systems (TECS), 4(4):819–850, November 2005.
[4] Paul Landman. High-Level Power Estimation. In Proceedings of the 1996 International Symposium on Low Power Electronics and Design, pages 29–35, Monterey, California, United States, August 1996.

[5] Mike Tien-Chien Lee, Vivek Tiwari, Sharad Malik, and Masahiro Fujita. Power Analysis and Minimization Techniques for Embedded DSP Software. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(1):123–135, March 1997.

[6] Sheayun Lee, Andreas Ermedahl, and Sang Lyul Min. An Accurate Instruction-Level Energy Consumption Model for Embedded RISC Processors. In LCTES '01: Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, 2001.

[7] E. Macii, M. Pedram, and F. Somenzi. High-level power modeling, estimation, and optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(11):1061–1079, November 1998.

[8] Jose Rizo-Morente, Miguel Casas-Sanchez, and C. J. Bleakley. Dynamic Current Modeling at the Instruction Level. In Proceedings of the 2006 International Symposium on Low Power Electronics and Design (ISLPED'06), pages 95–100, Tegernsee, Germany, October 2006.

[9] T. Simunic, L. Benini, and G. De Micheli. Energy-Efficient Design of Battery-Powered Embedded Systems. In Proceedings of the International Symposium on Low-Power Electronics and Design (ISLPED'99), pages 212–217, San Diego, CA, USA, August 1999.

[10] G. Sinevriotis and T. Stouraitis. A novel list-scheduling algorithm for the low-energy program execution. In IEEE International Symposium on Circuits and Systems (ISCAS 2002), pages IV94–100, Phoenix-Scottsdale, AZ, USA, May 2002.

[11] S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. An Accurate and Fine Grain Instruction-Level Energy Model Supporting Software Optimizations. In Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Yverdon, Switzerland, September 2001.

[12] Michael Theokharidis. Energiemessung von ARM7TDMI Prozessor-Instruktionen. Diploma thesis, University of Dortmund, Department of Computer Science 12, November 2000.

[13] V. Tiwari, S. Malik, A. Wolfe, and T. C. Lee. Instruction Level Power Analysis and Optimization of Software. Journal of VLSI Signal Processing, 13(2):1–18, August 1996.

[14] Vivek Tiwari, Sharad Malik, and Andrew Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4):437–445, December 1994.

[15] Lars Wehmeyer, Urs Helmig, and Peter Marwedel. Compiler-optimized Usage of Partitioned Memories. In Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI'04), pages 114–120, Munich, Germany, June 2004.

[16] Yrjö Neuvo. Cellular phones as embedded systems. In 2004 IEEE International Solid-State Circuits Conference, San Francisco, CA, USA, February 2004.
Energy Accounting

Timo Töyry
Software Technology Laboratory/TKK
[email protected]

Abstract

In this paper I briefly present the two methods most commonly used in energy accounting: model based energy accounting and measurement based energy accounting. I also give a short survey of power management in today's computers in the form of ACPI.

1 Introduction

Energy consumption is now one of the biggest challenges for today's mobile devices. As a traditional solution, almost every chip and other device has many different energy saving modes, but today's operating systems do not use them efficiently: a device is put into a lower energy mode after some timeout period and put back into a higher energy mode when some process needs it.

There is a different approach to this issue: what if the user is given the power to choose how long the system should stay running with a certain set of programs? The user then prioritizes the running programs in relation to each other to reflect their importance to him or her, and the operating system takes care that the user's requirements are met, if they are realistic. One method to achieve this is so-called energy accounting.

The term energy accounting could be defined, for example, as follows: a system which allows the operating system to control the amount of energy consumed by a program. This requires that the operating system has knowledge of that program's energy usage. The knowledge of a program's power consumption can be obtained in many different ways. Probably the two most common methods are model based energy accounting, which uses a model of the system to calculate estimates of the energy used, and measurement based energy accounting, which employs real-time measurements of the power usage of the system. In the following chapters I introduce today's standard in computer power management, ACPI, and then these two methods in more detail, together with a prototype example of each.

2 ACPI

The ACPI (Advanced Configuration and Power Interface) specification was originally introduced by Intel, Toshiba and Phoenix in December 1996; Compaq (now Hewlett-Packard) and Microsoft joined the development group later. The specification has been updated a couple of times since its initial introduction, and the current version, 3.0b, was published in October 2006. The ACPI specification defines common interfaces for energy management for both software and hardware. ACPI allows the operating system to control the energy management of the whole system as well as of individual devices. The core of ACPI is the ACPI System Description Tables, which expose the power saving modes of the hardware to the operating system; usually ACPI hardware support means only that the hardware provides these tables to the software layer above. The tables list, for example, the energy modes supported by a device. ACPI is in wide use nowadays; it is actually the most commonly used energy saving interface in all kinds of computers today. Unfortunately, ACPI does not provide any information about energy consumption to the software (the operating system), and for this reason ACPI is not usable for energy accounting [1], [2], [3].

3 Model based energy accounting

Model based energy accounting employs a model of the hardware in use to estimate its energy consumption. These systems do not usually have any feedback from the hardware about the real energy consumption; instead, when the model is developed, the energy consumption of the system in certain states is measured and the model is fitted to the measurement data. There are a couple of different approaches to building an energy model of a system [4], [6].

Event counter based models employ the hardware performance counters available in the system as input to the model. The performance counters are configured to measure events relevant to energy consumption, such as CPU cache misses. This kind of model can be used only for CPUs, since other devices do not usually have any performance counters. The only exception is the main memory, whose usage can be measured, although indirectly, through CPU cache misses and memory write-backs.
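As a sketch of how such a model is typically structured, the estimate is a baseline power term plus a weighted sum of counter deltas read over an accounting interval. The counter names, per-event energy weights and baseline power below are made-up illustrative assumptions, not values from any of the cited systems.

```python
# Minimal sketch of an event counter based energy model:
# estimate = baseline power * time + sum(weight_i * counter_delta_i).
# All counter names and per-event weights (nanojoules) are assumptions.

WEIGHTS_NJ = {
    "instructions_retired": 1.2,  # assumed nJ per retired instruction
    "l2_cache_misses": 30.0,      # assumed nJ per cache miss
    "memory_writebacks": 25.0,    # assumed nJ per memory write-back
}
BASELINE_POWER_MW = 200.0         # assumed baseline power (milliwatts)

def estimate_energy_mj(counter_deltas, interval_s):
    """Estimated energy (millijoules) for one accounting interval."""
    event_nj = sum(WEIGHTS_NJ[name] * delta
                   for name, delta in counter_deltas.items())
    baseline_mj = BASELINE_POWER_MW * interval_s  # mW * s = mJ
    return baseline_mj + event_nj * 1e-6          # nJ -> mJ

# Example: counter deltas read over one 10 ms accounting tick.
deltas = {"instructions_retired": 4_000_000,
          "l2_cache_misses": 20_000,
          "memory_writebacks": 5_000}
energy_mj = estimate_energy_mj(deltas, 0.010)
```

The appeal of this shape for in-kernel accounting is visible here: one multiply-accumulate per counter per tick.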
The accuracy of the model depends on the amount of input data available from the performance counters. The model itself is usually very simple and does not require much computation, which makes its results suitable as input for an energy accounting algorithm in a running system. The simple model is also the weak point of this approach, because the simplifications can cause inaccuracies in the estimates [4], [6].

State based models represent the system as a state machine which moves between different energy states. The states are changed according to the actions of the running program. When the model is built, its energy states are calibrated to match the real energy consumption of each state: the energy consumption of the components related to the inspected energy state is measured and the model is adjusted when required to match the measured values. These models usually give more accurate estimates of the system's energy consumption than event counter models. The good point of this approach is that it is an all-software solution, but the model cannot represent any variation in energy consumption within a state, so the accuracy of the model is essentially defined by the number of states used. In energy accounting, the cost of using a device is usually charged to the process which causes that device to change to a higher energy state [4], [6].

A third approach to obtaining estimates of the energy consumption of a system is statistical modeling. The energy estimates are calculated statistically from measurement data collected while test programs are run on the system. The measurement data usually contains the process id and the program counter value of the currently running process together with the measured total energy consumption of the system. The sample rate of the measurement data is quite low compared to the clock rate of the CPU, but if data are collected for long enough there will be a statistically significant number of samples for every instruction of the inspected program. Estimates obtained by this method are more detailed and accurate than estimates from both event counter based and state based models [4], [6].

3.1 ECOSystem

ECOSystem (Energy-Centric Operating System) is an example of the use of a state based model in energy accounting. ECOSystem was developed by Zeng et al. [6].

3.1.1 Currentcy model

The currentcy model is the energy model used in ECOSystem. Currentcy is a coined term, probably invented by the authors, combining the meanings of current and currency. They state that currentcy is the key feature of their model, because it forms a common unit for energy allocation and accounting across the different hardware devices and the software in which different processes compete for the limited hardware resources [6].

One unit of currentcy is defined as the right to consume a certain amount of energy in a certain period of time. In the prototype system, one unit of currentcy corresponds to 0.01 mJ of battery energy [6].

In the currentcy model, power costs consist of two parts: the first is the so-called base cost and the second is the cost from the managed devices. The base cost is defined to include the lowest power states of the managed devices and the default power states of the unmanaged devices. The second part of the cost comes from use of the managed devices (in the prototype the CPU, hard drive and WLAN) that takes them into a higher energy state. Each managed device may have its own charging policy for the energy states above the lowest one. Base costs are not charged to the processes' currentcy containers, but they are taken into account in the total energy consumption when targeting a certain battery life [6].

There are two main aspects to currentcy allocation. The first and most important is the target battery life selected by the user: it specifies the amount of currentcy available in each epoch, where an epoch is the time interval in which the allocated currentcy should be consumed. The second is the allocation of currentcy to the competing processes within an epoch, which is done according to user-given priorities. If a process consumes all of its currentcy, it is halted until the next epoch and the next currentcy allocation, even if it is otherwise ready to run. Processes can also accumulate some unused currentcy for later epochs in order to pay for a more expensive task. However, currentcy accumulation is quite strictly limited, to avoid situations where many processes have a lot of currentcy to spend in a single epoch: if such wealthy processes used all their currentcy at once, there would be a heavy peak in the battery discharge rate, an unwanted effect that can reduce battery life [6], [5].

Processes are required to pay for the usage of the managed devices when they are going to use them; for example, a process pays for execution time on the CPU before it is executed. A process can be executed as long as it has currentcy left to pay for its execution [6].

3.1.2 The Prototype

The prototype is a modified version of Red Hat Linux 2.4.0-test9. The authors added custom resource containers to the kernel that support currentcy allocation and accounting, and thus energy accounting. In the prototype, currentcy management was implemented for three different devices: the CPU, the hard drive and the wireless network card [6].
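The epoch and pay-before-use rules of Section 3.1.1 can be sketched as a small simulation. This is a hedged illustration of the general scheme only; the epoch grants, savings cap, task costs and process names are invented numbers, and the code is not ECOSystem's implementation.

```python
# Sketch of epoch-based currentcy allocation (after the scheme described
# above, not ECOSystem's actual code). All numbers are invented.

SAVINGS_CAP = 500  # strict limit on currentcy carried over between epochs

class Process:
    def __init__(self, name, epoch_grant):
        self.name = name
        self.epoch_grant = epoch_grant  # priority-derived currentcy per epoch
        self.balance = 0

    def new_epoch(self):
        """Grant this epoch's currentcy; cap accumulated savings first."""
        self.balance = min(self.balance, SAVINGS_CAP) + self.epoch_grant

    def pay(self, cost):
        """Pay-before-use: charge a device request up front."""
        if self.balance >= cost:
            self.balance -= cost
            return True   # request may proceed
        return False      # process is halted until the next epoch

editor = Process("editor", epoch_grant=700)    # higher user priority
indexer = Process("indexer", epoch_grant=300)  # lower user priority

for p in (editor, indexer):
    p.new_epoch()
ran_editor = editor.pay(600)    # affordable: runs, 100 currentcy left
ran_indexer = indexer.pay(600)  # too expensive: halted this epoch

for p in (editor, indexer):     # next epoch: unused currentcy carries over
    p.new_epoch()
```

The savings cap in `new_epoch` is what prevents the discharge-rate peaks mentioned above: a hoarding process can never enter an epoch with more than its cap plus one fresh grant.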
They used an IBM ThinkPad T20 laptop as the hardware platform for their system, and the currentcy energy model was tuned to match real measured energy consumption values of this platform. The CPU of the laptop was a 650 MHz Intel Pentium III processor. The power model used for the CPU was very simple and coarse, since it assumed constant energy consumption whenever the CPU is active; more accurate results could be obtained by using an event counter model to track the internal energy states of the CPU. The hard drive was an IBM Travelstar 12GN connected to the system through a standard ATA interface. The Travelstar had all the standard ATA power states and some additional internal power states managed by an undocumented power management algorithm, which might reduce the overall energy savings from disk management, since the operating system is not aware of the disk's real energy state. The wireless network card was an Orinoco Silver PC card, which had three different power states: sleep, receive and transmit, where sleep is the lowest energy state and transmit the highest. The card also has two modes for communicating with the base station: an active mode, in which the card has a continuous connection to the base station and all incoming data is transferred immediately, and a periodic polling mode, in which the card is in sleep mode most of the time and periodically wakes up to request new incoming data from the base station. The base station buffers the traffic directed to the card while it is asleep [6].

4 Measurement based energy accounting

Measurement based energy accounting requires special hardware support which takes care of the measurements. Unlike model based energy accounting systems, measurement based systems have real-time (or at least almost real-time) feedback on the consumed energy. Since the hardware provides information about the energy consumption, an operating system that makes use of this information effectively avoids the need for an energy model of the system. Measurement based energy accounting could therefore provide better performance and lower energy consumption, because it has less computational overhead than model based energy accounting [4].

4.1 PLEB2

PLEB2 is a prototype system built by Snowdon et al. The prototype is a single-board computer based on the Intel XScale PXA255 processor. The PXA255 has an ARMv5TE-compatible core, which operates at a 400 MHz clock speed, and peripheral devices such as memory, DMA, interrupt and LCD display controllers; in addition to the CPU core, the system contains some on-chip SRAM memory and some flash memory [4].

The system has three switching power supplies which provide power from a lithium-ion battery independently to the CPU core, the memory and the IO. Each power supply line has a current sensor for energy measurement [4].

The system has some on-board peripherals, namely infrared, USB and serial ports, which get their power from the IO supply. There is also an 8-bit Atmel AVR microcontroller on board, which acts as a supervisor for the rest of the system. The system has some extension possibilities for additional peripheral devices, since the unused connectors of the XScale are wired to extension sockets on the board [4].

The PLEB2 platform, with its ARM core and peripherals, models a typical embedded system. The authors modified a couple of operating systems (Linux 2.4.19, Linux 2.6.8 and L4Ka::Pistachio) to run on the PLEB2 system [4].

The power consumed in each area (CPU, memory and IO) is calculated from the measured current output of each power supply and its known, well-regulated output voltage, which is considered to be constant. The on-board AVR microcontroller, which has a built-in analog-to-digital converter, is responsible for periodically gathering the measurement data from the current sensors. The sampling rate of the measurements is limited by the speed of the AD converter, which is able to read a sensor at a rate of up to 15 kHz; since there are three sensors, the maximum sampling rate per sensor is 5 kHz when all sensors are sampled at equal intervals. When an AD conversion is ready, the sample is transferred to the XScale, which either stores it with the corresponding process id and program counter for later evaluation or uses it in real-time energy accounting. Transferring the measurement results from the AVR to the XScale is quite slow, because the data goes over a slow I2C serial bus, and it causes a total of five interrupts on the XScale. The first interrupt marks that the microcontroller is starting to read a sensor, so that the XScale can save the correct process id and program counter value for the measurement data. The second interrupt marks that the microcontroller is done with the AD conversion and that the XScale should start transferring the data from the microcontroller. The remaining three interrupts are caused by the data transfer over the I2C bus [4].

5 Conclusions

Finally, some comparison between model based and measurement based energy accounting systems. Both approaches have their strong points. For a model based system it is its all-software nature: there is no need for special hardware to take care of energy consumption measurements, which is probably one of the weak points of measurement based systems. The strongest point of a measurement based system is the measured real-time information about the real energy consumption. Because the energy consumption is measured, there is no need for a system model, which may require a significant amount of computation; conversely, this is one of the drawbacks of model based systems, since they can never be sure about the real energy consumption. Another strong area of measurement based systems compared to model based systems, which usually use some kind of state model of the system, is that a measurement based system can capture variations in energy usage within the states of the model and is therefore more accurate [6], [4].
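The measurement-side bookkeeping that enables this can be sketched as follows: each sampled current becomes an energy quantum E = V * I * dt, charged to the process that was running when the sample was taken. This is an illustrative sketch only; the supply voltages, process ids and currents are invented, the code is not PLEB2's, and only the 5 kHz per-sensor rate comes from Section 4.1.

```python
# Sketch of measurement based accounting: each sampled supply-line
# current becomes an energy quantum E = V * I * dt, charged to the
# process running at sampling time. Voltages and samples are invented;
# the 5 kHz per-sensor rate matches the PLEB2 description above.

SUPPLY_VOLTAGE = {"cpu": 1.1, "memory": 2.5, "io": 3.3}  # volts (assumed)
SAMPLE_PERIOD_S = 1.0 / 5000.0                           # 5 kHz per sensor

def account(samples):
    """samples: iterable of (pid, supply_line, current_in_amps) tuples.
    Returns joules charged to each pid."""
    charged = {}
    for pid, line, amps in samples:
        energy_j = SUPPLY_VOLTAGE[line] * amps * SAMPLE_PERIOD_S
        charged[pid] = charged.get(pid, 0.0) + energy_j
    return charged

# Samples attributed to two processes (pids 1 and 2).
trace = [(1, "cpu", 0.50), (1, "memory", 0.10),
         (2, "cpu", 0.20), (2, "io", 0.05)]
ledger = account(trace)
```

Note that no device model appears anywhere: within-state variation in the current samples flows directly into the per-process ledger.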
Model based energy accounting is currently the more interesting technology, because it can be implemented in an existing system without any need for additional hardware; it only requires some measurements of the system's energy consumption in the calibration phase of the model. Model based energy accounting can be used successfully to lengthen the battery life of a laptop computer. Zeng et al. demonstrated this with their ECOSystem prototype: they were able to achieve a battery life of up to 25 hours, though this was possible only at the expense of interactivity (the response times, for example when surfing the web, became very long) [6].

Measurement based energy accounting is more of a future technology, mainly due to the lack of hardware support in current systems. However, Snowdon et al. demonstrated with their PLEB2 platform that it is possible to build a measurement based energy accounting system with real-time measurements from multiple sensors [4].

References

[1] S. Balakrishnan and J. Ramanan. Power-aware operating systems using ACPI, CS 736 project. Technical report, University of Wisconsin, Computer Science Department, 2001.
[2] Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba. Advanced Configuration and Power Interface Specification 3.0b, October 2006.
[3] Intel. ACPI overview, 2000.
[4] D. C. Snowdon, S. M. Petters, and G. Heiser. Power measurement as the basis for power management. In Proceedings of the 2005 Workshop on Operating System Platforms for Embedded Real-Time Applications, July 2005.
[5] H. Zeng, C. Ellis, A. Lebeck, and A. Vahdat. Currentcy: Unifying policies for resource management. In Proceedings of the USENIX 2003 Annual Technical Conference, June 2003.
[6] H. Zeng, X. Fan, C. Ellis, A. Lebeck, and A. Vahdat. ECOSystem: Managing energy as a first class operating system resource. In ASPLOS'02, October 2002.
