
UNIT III Embedded Programming

• Components for embedded programs
• Models of programs
• Assembly, linking and loading
• Compilation techniques
• Program-level performance analysis
• Software performance optimization
• Program-level energy and power analysis and optimization
• Analysis and optimization of program size
• Program validation and testing
Components for embedded programs
In this section, we study in detail the process of programming embedded processors.
The creation of embedded programs is at the heart of embedded system design.

Embedded code must not only provide rich functionality, it must also often run at a
required rate to meet system deadlines, fit into the allowed amount of memory, and
meet power consumption requirements. Designing code that simultaneously meets
multiple design constraints is a considerable challenge, but luckily there are
techniques and tools that we can use to help us through the design process.

We consider code for three structures or components that are commonly used in embedded software:
1. The state machine,
2. The circular buffer,
3. The queue.

• State machines are well suited to reactive systems such as user interfaces.
• Circular buffers and queues are useful in digital signal processing.
Software State Machine
When inputs appear intermittently rather than as periodic samples, it is often convenient to think of the system as reacting to those inputs. The reaction of most systems can be characterized in terms of the input received and the current state of the system. This leads naturally to a finite-state machine.

• A state machine keeps its internal state in a variable and changes state based on inputs.
• Uses:
– control-dominated code;
– reactive systems.
State machine example
(Seat belt controller)

The controller’s job is to turn on a buzzer if a person sits in a seat and does not
fasten the seat belt within a fixed amount of time.

This system has three inputs and one output.

The inputs are a sensor for the seat to know when a person has sat down, a seat
belt sensor that tells when the belt is fastened, and a timer that goes off when
the required time interval has elapsed.
The output is the buzzer.

[State diagram: states idle, seated, belted, buzzer.
idle: seat/timer on → seated; no seat/- stays in idle.
seated: belt/- → belted; timer/buzzer on → buzzer; no seat/- → idle.
belted: no belt/timer on → seated; no seat/- → idle.
buzzer: belt/buzzer off → belted; no seat/buzzer off → idle.]
C implementation
#define IDLE 0
#define SEATED 1
#define BELTED 2
#define BUZZER 3

switch (state) { /* check the current state */
case IDLE:
    if (seat) { state = SEATED; timer_on = TRUE; }
    break;
case SEATED:
    if (belt) state = BELTED;        /* belt fastened in time */
    else if (timer) state = BUZZER;  /* time ran out: sound buzzer */
    break;
case BELTED:
    if (!seat) state = IDLE;         /* person left the seat */
    else if (!belt) state = SEATED;  /* belt unfastened */
    break;
case BUZZER:
    if (belt) state = BELTED;        /* belt fastened: silence buzzer */
    else if (!seat) state = IDLE;    /* person left the seat */
    break;
}
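In a running system this switch typically sits inside a periodic task or polling loop that samples the inputs and drives the buzzer. A minimal sketch of such a wrapper, using the state encodings defined above and hypothetical HAL hooks (read_seat, read_belt, timer_expired, timer_start, and buzzer_set are illustrative names, not part of the original example):

#include <stdbool.h>

/* hypothetical HAL hooks: assumptions for illustration only */
extern bool read_seat(void), read_belt(void), timer_expired(void);
extern void timer_start(void), buzzer_set(bool on);

void seatbelt_task(void) {            /* called periodically */
    static int state = IDLE;          /* state persists across calls */
    bool seat = read_seat(), belt = read_belt(), timer = timer_expired();
    switch (state) {
    case IDLE:
        if (seat) { state = SEATED; timer_start(); }
        break;
    case SEATED:
        if (belt) state = BELTED;
        else if (timer) { state = BUZZER; buzzer_set(true); }
        break;
    case BELTED:
        if (!seat) state = IDLE;
        else if (!belt) { state = SEATED; timer_start(); }
        break;
    case BUZZER:
        if (belt) { state = BELTED; buzzer_set(false); }
        else if (!seat) { state = IDLE; buzzer_set(false); }
        break;
    }
}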


Circular buffer
The circular buffer is a data structure that lets us handle streaming data in an efficient way.

• Commonly used in signal processing:
– new data constantly arrives;
– each datum (individual data value, such as one sample) has a limited lifetime.
• Use a circular buffer to hold the data stream.
– Example: FIR filter. For each sample, the filter must emit one output that depends on the values of the last n inputs.
Circular buffer

[Figure: a data stream x1, x2, x3, ... arriving at times t1, t2, t3, ..., and a four-entry circular buffer. The buffer first fills with x1 through x4; the later samples x5, x6, x7 overwrite the oldest entries in turn.]

Circular buffers

• Indexes locate the currently used data and the current input position:

[Figure: two snapshots of a four-entry buffer. At time t1 it holds d1 through d4, with the input index at d1 and the use index at d4. At time t1+1 the new sample d5 has replaced d1 and both indexes have advanced by one, wrapping around the end of the buffer.]
Circular buffer implementation: FIR filter

int circ_buffer[N], circ_buffer_head = 0;
int c[N]; /* coefficients */
int ibuf, ic, f;

for (f = 0, ibuf = circ_buffer_head, ic = 0;
     ic < N;
     ibuf = (ibuf == N-1 ? 0 : ibuf + 1), ic++)
    f = f + c[ic] * circ_buffer[ibuf];
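The loop above only computes the filter output; a complete implementation must also insert each new sample before computing. A minimal sketch of that step under the same declarations, assuming the convention that circ_buffer_head indexes the most recent sample (the function name circ_buffer_add is an illustrative assumption):

/* overwrite the oldest entry with the new sample and advance the head */
void circ_buffer_add(int newval) {
    circ_buffer_head = (circ_buffer_head == N-1 ? 0 : circ_buffer_head + 1);
    circ_buffer[circ_buffer_head] = newval;
}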
Queues
• Queues are also used in signal processing and event
processing.
• Queues are used whenever data may arrive and
depart at somewhat unpredictable times or when
variable amounts of data may arrive.
• A queue is often referred to as an elastic buffer,
which holds data that arrives irregularly.
• One way to build a queue is with a linked list.
– This approach allows the queue to grow to an arbitrary
size.
• Another way to design the queue is to use an array
to hold all the data.
Buffer-based queues (to manage interrupt-driven data)

#define Q_SIZE 32
#define Q_MAX (Q_SIZE-1)
int q[Q_SIZE], head, tail;

void initialize_queue() {
    head = tail = 0;
}

void enqueue(int val) {
    if (((tail+1) % Q_SIZE) == head) error(); /* queue full */
    q[tail] = val;
    if (tail == Q_MAX) tail = 0; else tail++;
}

int dequeue() {
    int returnval;
    if (head == tail) error(); /* queue empty */
    returnval = q[head];
    if (head == Q_MAX) head = 0; else head++;
    return returnval;
}
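A typical use, as the heading suggests, is to let an interrupt handler produce data that background code consumes later. A minimal sketch under that assumption (uart_read_byte and process are illustrative names; in real code head and tail should be declared volatile because they are shared with the ISR):

void uart_rx_isr(void) {
    enqueue(uart_read_byte());   /* producer: runs at interrupt time */
}

void main_loop(void) {
    for (;;)
        while (head != tail)     /* queue not empty */
            process(dequeue());  /* consumer: runs in the background */
}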
Models of programs
In this section, we develop models for programs that are more general than source code (assembly language, C, and so on).

• Source code is not a good representation for programs:
– clumsy;
– leaves much information implicit.
• Compilers derive intermediate representations to manipulate and optimize the program.
• Our fundamental model for programs is the control/data flow graph (CDFG).
Data flow graph

• DFG: data flow graph.
• Does not represent control.
• Models a basic block: straight-line code with a single entry and a single exit.
• Describes the minimal ordering requirements on operations.
Single assignment form

Original basic block:
x = a + b;
y = c - d;
z = x * y;
y = b + d;

Single-assignment form:
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;

In single-assignment form each variable is assigned only once, so the second write to y is renamed y1.
Data flow graph

x = a + b;
y = c - d;
z = x * y;
y1 = b + d;

[Figure: DFG for the single-assignment code. Inputs a and b feed a + node producing x; c and d feed a - node producing y; x and y feed a * node producing z; b and d feed a second + node producing y1.]
DFGs and partial orders

• The DFG defines only a partial order on the operations: a+b and c-d can be evaluated in either order, as can b+d and x*y.
• Any two operations not connected by a path in the DFG may be performed in any order.
Control-data flow graph
• CDFG: represents control and data.
• Uses data flow graphs as components.
• Two types of nodes:
– decision;
– data flow.
Data flow node

Encapsulates a data flow graph:
x = a + b;
y = c + d;
Write operations in basic block form for simplicity.
Control

[Figure: two equivalent forms of a decision node: a two-way branch labeled cond with T and F exits, and a multiway branch labeled value with exits v1, v2, v3, v4.]
CDFG example

if (cond1) bb1();
else bb2();
bb3();

switch (test1) {
case c1: bb4(); break;
case c2: bb5(); break;
case c3: bb6(); break;
}

[Figure: CDFG with a decision node for cond1 whose T edge leads to bb1() and F edge to bb2(); both rejoin at bb3(), which feeds a multiway decision node for test1 with edges c1, c2, c3 leading to bb4(), bb5(), bb6().]
for loop

for (i=0; i<N; i++)
    loop_body();

Equivalent while form:

i=0;
while (i<N) {
    loop_body(); i++;
}

[Figure: CDFG for the loop. i=0 feeds a decision node i<N; the T edge executes loop_body() and i=i+1 and returns to the test, the F edge exits the loop.]
Assembly, linking, loading
• Assembly and linking are the last steps in the compilation process.
• They turn a list of instructions into an image of the program's bits in memory.
• Loading actually puts the program in memory so that it can be executed.

[Figure: the compilation flow. The HLL program goes through the compiler to produce assembly; the assembler turns assembly into object code; the linker combines object code into an executable binary; the loader places the executable into memory.]
Assembly , linking, loading
• As the figure shows, most compilers do not directly generate machine
code, but instead create the instruction-level program in the form of
human-readable assembly language.
• The assembler’s job is to translate symbolic assembly language
statements into bit-level representations of instructions known as
object code.
• A linker allows a program to be stitched together out of several
smaller pieces. The linker operates on the object files created by the
assembler and modifies the assembled code to make the necessary
links between files.
• The linker produces an executable binary file.
• That file may not necessarily be located in the CPU’s memory, however,
unless the linker happens to create the executable directly in RAM.
• The program that brings the program into memory for execution is
called a loader.
Assemblers

• Major tasks:
– generate binary for symbolic instructions;
– translate labels into addresses;
– handle pseudo-ops (data, etc.).
• Generally one-to-one translation.
• Assembly labels:
ORG 100
label1 ADR r4,c
Pseudo-operations
• Pseudo-ops do not generate instructions:
– ORG sets the program location.
– EQU generates a symbol table entry without advancing the program location counter (PLC).
– Data statements define data blocks.
Linking

• Combines several object modules into a single executable module.
• Jobs:
– put modules in order;
– resolve labels across modules.

Dynamic linking

• Some operating systems link modules dynamically at run time:
– shares one copy of a library among all executing programs;
– allows programs to be updated with new versions of libraries.
COMPILATION TECHNIQUES
It is useful to understand how a high-level language program is translated into instructions.

Understanding how the compiler works can help you know when you cannot rely on the compiler.

Because many applications are also performance sensitive, understanding how code is generated can help you meet your performance goals, either by writing high-level code that gets compiled into the instructions you want or by recognizing when you must write your own assembly code.

Compilation combines translation and optimization.
Compilation
• Compiler determines quality of code:
– use of CPU resources;
– memory access scheduling;
– code size.
Basic compilation phases

[Figure: compilation flow. HLL → parsing and symbol table generation → machine-independent optimizations → machine-dependent optimizations → instruction-level optimization and code generation → assembly.]

The high-level language program is parsed to break it into statements and expressions. In addition, a symbol table is generated, which includes all the named objects in the program.
Simplifying arithmetic expressions is one example of a machine-independent optimization.
Statement translation and optimization
• Source code is translated into intermediate
form such as CDFG.
• CDFG is transformed/optimized.
• CDFG is translated into instructions with
optimization decisions.
• Instructions are further optimized.
Compiling an arithmetic expression

expression: a*b + 5*(c-d)

[Figure: DFG for the expression. a and b feed a * node producing temporary W; c and d feed a - node producing X; X and the constant 5 feed a * node producing Y; W and Y feed a + node producing Z. W, X, Y, Z are temporary variables.]

a b c d ADR r4,a
MOV r1,[r4]
1 * 2 - ADR r4,b
5 MOV r2,[r4]
ADD r3,r1,r2
ADR r4,c
3 * MOV r1,[r4]
ADR r4,d
MOV r5,[r4]
SUB r6,r4,r5
4 +
MUL r7,r6,#5
ADD r8,r7,r3

DFG code
Similarly for control code generation

if (a+b > 0)
    x = 5;
else
    x = 7;

[CDFG: decision node a+b>0; the T edge leads to the block x=5, the F edge to the block x=7.]
Control code generation, cont'd.

The CDFG blocks are numbered, (1) the a+b>0 test, (2) x=5, (3) x=7, and code is generated block by block:

ADR r5,a         ; get address of a
LDR r1,[r5]      ; load a
ADR r5,b         ; get address of b
LDR r2,[r5]      ; load b
ADDS r3,r1,r2    ; compute a+b and set condition flags
BLE label3       ; if a+b <= 0, take the false branch
MOV r3,#5        ; true case: x = 5
ADR r5,x
STR r3,[r5]      ; store x
B stmtend        ; skip over the false case
label3 MOV r3,#7 ; false case: x = 7
ADR r5,x
STR r3,[r5]      ; store x
stmtend ...
Procedure linkage
Another major code generation problem is the creation of procedures.
• Need code to:
– call and return;
– pass parameters and results.
• Parameters and returns are passed on stack.
– Procedures with few parameters may use
registers.
Procedure stacks

proc1(int a) {
    proc2(5);
}

[Figure: stack growing downward. proc1's frame sits above proc2's frame. The frame pointer (FP) marks the end of the last frame; the stack pointer (SP) marks the end of the current frame. The argument 5 passed to proc2 is accessed relative to the SP.]

When a new procedure is called, the SP and FP are modified to push another frame onto the stack.
ARM procedure linkage

• APCS (ARM Procedure Call Standard):
– r0-r3 pass parameters into the procedure; extra parameters are put on the stack frame.
– r0 holds the return value.
– r4-r7 hold register variables.
– r11 is the frame pointer; r13 is the stack pointer.
– r10 holds the limiting address on stack size, used to check for stack overflows.
Data structures
The compiler must also translate references to data structures into
references to raw memories. In general, this requires address
computations.
• Different types of data structures use different
data layouts.
• Some offsets into data structure can be
computed at compile time, others must be
computed at run time.
• The address of an array element must in general be computed at run time, since the array index may change.
• Let us first consider one-dimensional arrays:
One-dimensional arrays

• The C array name points to the 0th element, and a[i] is equivalent to *(a + i):

[Figure: array a laid out in memory as a[0], a[1], a[2], ...; a[1] is accessed as *(a + 1).]
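The compiler turns the index into address arithmetic by scaling with the element size. A small sketch of the equivalence (the explicit cast form is for illustration only, not how you would normally write it):

void index_demo(void) {
    int a[16] = {0};
    int i = 3;
    int v1 = a[i];                                 /* normal indexing */
    int v2 = *(int *)((char *)a + i*sizeof(int));  /* base + i*size */
    /* v1 and v2 read the same element */
}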
Two-dimensional arrays

• Row-major layout (the C convention): an N×M array is stored one row after another.

[Figure: a[0][0], a[0][1], ..., then a[1][0], a[1][1], ...; element a[i][j] is stored at offset i*M + j from the start of the array.]
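The same arithmetic can be written out by hand; a short sketch (M and the helper name get are arbitrary choices for illustration):

#define M 4
/* a_flat points at the first element of an array with M columns */
int get(const int a_flat[], int i, int j) {
    return a_flat[i*M + j];   /* row i starts at offset i*M */
}
/* with int a[3][4], get(&a[0][0], 1, 2) returns a[1][2] */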
Structures

• Fields within structures are accessed at static offsets from the structure's base address:

struct mystruct {
    int field1;    /* 4 bytes, at offset 0 */
    char field2;   /* at offset 4 */
};
struct mystruct a, *aptr = &a;

field2 is accessed at the base address in aptr plus the fixed offset 4 of field2, i.e., *((char *)aptr + 4).
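The standard way to inspect such offsets from C itself is offsetof from <stddef.h>; a short check using the structure above (the exact offsets are implementation-defined, though 0 and 4 are typical on 32-bit targets):

#include <stddef.h>
#include <stdio.h>

struct mystruct { int field1; char field2; };

int main(void) {
    printf("field1 at %zu, field2 at %zu\n",
           offsetof(struct mystruct, field1),
           offsetof(struct mystruct, field2));
    return 0;
}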
Using your compiler

• Understand the various optimization levels (-O1, -O2, etc.).
• Look at mixed compiler/assembler output.
• Modifying compiler output requires care:
– correctness;
– loss of hand-tweaked code.
Interpreters and JIT (just-in-time) compilers
Programs are not always compiled and then separately executed. In some cases, it may make sense to translate the program into instructions during execution. Two well-known techniques for on-the-fly translation are interpretation and just-in-time (JIT) compilation.

• Interpreter: translates and executes program statements on the fly, one statement at a time.
• The interpreter sits between the program and the machine.
• The interpreter may or may not generate an explicit piece of code to represent the statement.
• Because the interpreter translates only a very small piece of the program at any given time, only a small amount of memory is needed to hold intermediate representations of the program.
Interpreters and JIT (just-in-time) compilers
• JIT compiler: compiles small sections of code into instructions during program execution.
– Eliminates some translation overhead.
– Often requires more memory.
– Best suited for Java environments.
• A JIT compiler is somewhere between an interpreter and a stand-alone compiler. It produces executable code segments for pieces of the program, but compiles a section of the program (such as a function) only when it knows it will be executed.
• Unlike an interpreter, it saves the compiled version of the code so that the code does not have to be retranslated the next time it is executed.
• The JIT compiler usually generates machine code directly rather than building intermediate program representation data structures such as the CDFG.
Program design and analysis

• Program-level performance analysis.
• Optimizing for:
– Execution time.
– Energy/power.
– Program size.
• Program validation and testing.
Program-level performance analysis
• Need to understand performance in detail:
– Real-time behavior, not just typical.
– On complex platforms.
• Program performance ≠ CPU performance:
– The pipeline and cache are only windows into the program.
– We must analyze the entire program.
Complexities of analyzing program
performance
• The execution time of a program often varies
with the input data values because those values
select different execution paths in the program.
- For example, loops
• Cache effects.
– The cache’s behavior depends in part on the data
values input to the program.
• Instruction-level performance variations:
– Pipeline interlocks.
– Fetch times.
How to measure program performance
• Simulate execution of the CPU (Simulator).
– Makes CPU state visible.
– Be careful for some microprocessor performance simulators
are not 100% accurate, and simulation of I/O-intensive code
may be difficult.
– Also measures execution time of program
• Measure on real CPU using timer.
– A timer connected to the microprocessor bus can be used
to measure performance of executing sections of code.
– Requires modifying the program to control the timer.
• Measure on real CPU using logic analyzer.
– By measuring the start and stop times of a code segment
– Requires events visible on the pins.
Program performance metrics
• Average-case execution time.
– Typically used in application programming.
• Worst-case execution time.
– A component in deadline satisfaction.
• Best-case execution time.
– Task-level interactions can cause best-case program
behavior to result in worst-case system behavior.
Elements of program performance

• Basic program execution time formula:
– execution time = program path + instruction timing
• Solving these problems independently helps simplify analysis.
– Easier to separate on simpler CPUs.
• Accurate performance analysis requires:
– Assembly/binary code.
– Execution platform.
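As a worked illustration of the formula (the cycle counts here are invented for illustration, not taken from a real CPU): if the program path executes a loop body 100 times, the body's instructions cost 10 cycles per iteration and the test/increment overhead costs 3 cycles per iteration, plus 2 cycles of loop setup, then the total execution time is 100 × (10 + 3) + 2 = 1302 cycles. The path analysis supplies the iteration count; the instruction timing supplies the per-iteration cycle costs.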
Data-dependent paths in an if statement

if (a || b) { /* T1 */
    if (c) /* T2 */
        x = r*s+t; /* A1 */
    else y = r+s; /* A2 */
    z = r+s+u; /* A3 */
}
else {
    if (c) /* T3 */
        y = r-t; /* A4 */
}

a b c | path
0 0 0 | T1=F, T3=F: no assignments
0 0 1 | T1=F, T3=T: A4
0 1 0 | T1=T, T2=F: A2, A3
0 1 1 | T1=T, T2=T: A1, A3
1 0 0 | T1=T, T2=F: A2, A3
1 0 1 | T1=T, T2=T: A1, A3
1 1 0 | T1=T, T2=F: A2, A3
1 1 1 | T1=T, T2=T: A1, A3
Paths in a loop

for (i=0, f=0; i<N; i++)
    f = f + c[i] * x[i];

[CDFG: i=0; f=0 feeds the test i<N. The loop edge executes f = f + c[i]*x[i] and i = i+1, then returns to the test; the exit edge leaves the loop. The body executes N times.]
Instruction timing
Once we know the execution path of the program, we have to measure the execution time of
the instructions executed along that path.

However , even ignoring cache effects, this technique is simplistic for the reasons summarized
below.
• Not all instructions take the same amount of time.
– Multi-cycle instructions exist even on RISC CPUs with fixed-length instructions.
– Fetch times vary.
• Execution times of instructions are not independent.
(many CPUs use register bypassing to speed up instruction
sequences when the result of one instruction is used in the next instruction.)
– Pipeline interlocks.
– Cache effects.
• Execution times may vary with operand value.
– This is clearly true of floating-point instructions in which a different
number of iterations may be required to calculate the result
– Some multi-cycle integer operations.
Measurement-driven performance analysis

• Not so easy as it sounds:
– Must actually have access to the CPU.
– Must know data inputs that give worst/best case performance.
– Must make state visible.
• Still an important method for performance analysis.
Feeding the program
• Need to know the desired input values.
• May need to write software scaffolding to
generate the input values.
• Software scaffolding may also need to
examine outputs to generate feedback-driven
inputs.
Trace-driven measurement
• Trace-driven:
– Instrument the program.
– Save information about the path.
• Requires modifying the program.
• Trace files are large.
• Widely used for cache analysis.
Physical measurement
• In-circuit emulator allows tracing.
– Affects execution timing.
• Logic analyzer can measure behavior at pins.
– Address bus can be analyzed to look for events.
– Code can be modified to make events visible.
• Particularly important for real-world input streams.
CPU simulation
• Some simulators are less accurate.
• Cycle-accurate simulator provides accurate
clock-cycle timing.
– Simulator models CPU internals.
– Simulator writer must know how CPU works.
SimpleScalar FIR filter simulation

int x[N] = {8, 17, … };
int c[N] = {1, 2, … };

main() {
    int i, k, f;
    for (k=0; k<COUNT; k++)
        for (i=0; i<N; i++)
            f += c[i]*x[i];
}

N       | total sim cycles | sim cycles per filter execution
100     | 25,854           | 259
1,000   | 155,759          | 156
10,000  | 1,451,840        | 145
Performance optimization motivation
• Embedded systems must often meet
deadlines.
– Faster may not be fast enough.
• Need to be able to analyze execution time.
– Worst-case, not typical.
• Need techniques for reliably improving
execution time.
Programs and performance analysis
• Best results come from analyzing optimized
instructions, not high-level language code:
– non-obvious translations of HLL statements into
instructions;
– code may move;
– cache effects are hard to predict.
Software performance optimization
Loop optimizations

• Loops are important targets for optimization because programs with loops tend to spend a lot of time executing those loops.
• There are three important techniques in optimizing loops:
– code motion,
– induction variable elimination, and
– strength reduction (e.g., x*2 → x<<1).
Code motion
Code motion lets us move unnecessary code out of a loop.

for (i=0; i<N*M; i++)
    z[i] = a[i] + b[i];

Here the loop test recomputes N*M on every iteration. We can avoid N*M - 1 unnecessary executions of that computation by moving it before the loop, as shown in the sketch below.
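A sketch of the transformed loop (the temporary name nm is an arbitrary choice, not from the original slide):

int i, nm = N*M;   /* code motion: N*M is loop-invariant, compute once */
for (i = 0; i < nm; i++)
    z[i] = a[i] + b[i];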
Induction variable elimination
An induction variable is a variable whose value is derived from the loop iteration variable's value. The compiler often introduces induction variables to help it implement the loop. A nested loop is a good example of the use of induction variables.

• Consider the loop:
for (i=0; i<N; i++)
    for (j=0; j<M; j++)
        z[i][j] = b[i][j];
• Rather than recompute i*M+j for each array in each iteration, we can share one induction variable between the arrays and increment it at the end of the loop body.
• The compiler uses induction variables to help it address the arrays.
Let us rewrite the loop in C using induction variables and pointers.
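One possible rewrite along those lines (a sketch under the assumption that z and b are declared int[N][M]; the declarations and exact form are assumptions):

int i, j, zbinduct = 0;
int *zptr = &z[0][0], *bptr = &b[0][0];
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++)
        *(zptr + zbinduct + j) = *(bptr + zbinduct + j);
    zbinduct += M;   /* advance the shared induction variable once per row */
}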

In the above code, zptr and bptr are pointers to the heads of the z and b arrays
and zbinduct is the shared induction variable.
Strength reduction
• Strength reduction helps us reduce the cost of a
loop iteration.
• Consider the following assignment:
y = x * 2;
– In integer arithmetic, we can use a left shift rather
than a multiplication by 2 (as long as we properly
keep track of overflows).
– If the shift is faster than the multiply, we probably
want to perform the substitution.
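A minimal sketch of the substitution in C (valid when x is non-negative and x*2 does not overflow):

y = x * 2;    /* original: multiply */
y = x << 1;   /* strength-reduced: shift left by one */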
Performance optimization hints
• Use registers efficiently.
• Use page-mode memory accesses.
• Analyze cache behavior:
– instruction conflicts can be handled by rewriting code or rescheduling;
– conflicting scalar data can easily be moved;
– conflicting array data can be moved or padded.
PROGRAM-LEVEL ENERGY AND POWER ANALYSIS AND OPTIMIZATION
Energy/power optimization
• Energy: ability to do work.
– Most important in battery-powered systems.
• Power: energy per unit time.
– Important even in wall-plug systems---power
becomes heat.
Opportunities for saving power
■ We may be able to replace the algorithms with
others that do things in clever ways that consume
less power.
■ Memory accesses are a major component of power
consumption in many applications. By optimizing
memory accesses we may be able to significantly
reduce power.
■ We may be able to turn off parts of the system—
such as subsystems of the CPU, chips in the system,
and so on—when we do not need them in order to
save power.
Measuring energy consumption

• Execute a small loop and measure the current:

[Figure: the code under test runs repeatedly in a loop while an ammeter measures the current flowing into the CPU.]

By measuring the current flowing into the CPU while the loop runs, we measure the power consumption of the complete loop, including both the body and the other loop code. By separately measuring the power consumption of a loop with no body, we can subtract out the loop overhead and obtain the power consumption of the body alone.
Sources of energy consumption
• Relative energy per operation (Catthoor et al.), normalized so that an add costs 1:
– memory transfer: 33
– external I/O: 10
– SRAM write: 9
– SRAM read: 4.4
– multiply: 3.6
– add: 1
Cache behavior is important
• Energy consumption has a sweet spot as cache
size changes:
– cache too small: program thrashes, burning
energy on external memory accesses;
– cache too large: cache itself burns too much
power.
Optimizing for energy
• Use registers efficiently.
• Identify and eliminate cache conflicts.
• Moderate loop unrolling eliminates some loop
overhead instructions.
• Eliminate pipeline stalls.
• Inlining procedures may help: reduces linkage,
but may increase cache thrashing.
Efficient loops
• General rules:
– Don’t use function calls.
– Keep loop body small to enable local repeat (only
forward branches).
– Use unsigned integer for loop counter.
– Use <= to test loop counter.
– Make use of compiler---global optimization,
software pipelining.
Program validation and testing
• But does it work?
• Concentrate here on functional verification.
• Major testing strategies:
– Black box doesn’t look at the source code.
– Clear box (white box) does look at the source
code.
Clear-box testing
• Examine the source code to determine whether it
works:
– Can you actually exercise a path?
– Do you get the value you expect along a path?
• Testing procedure:
– Controllability: Provide program with inputs.
– Execute.
– Observability: examine outputs.
How much testing is enough?
• Exhaustive testing is impractical.
• One important measure of test quality---bugs
escaping into field.
• Good organizations can test software to give very low
field bug report rates.
• Error injection measures test quality:
– Add known bugs.
– Run your tests.
– Determine % injected bugs that are caught.
