CONTENTS
2.1 Input/Output on Computers
    2.1.1 Accessing the I/O Registers
    2.1.2 Synchronization in I/O
    2.1.3 Direct Memory Access (DMA)
2.2 Input/Output Operations and the Operating System
    2.2.1 User and Kernel Modes
    2.2.2 Input/Output Abstraction in Linux
2.3 Acquiring Images from a Camera Device
    2.3.1 Synchronous Read from a Camera Device
    2.3.2 Virtual Memory
    2.3.3 Handling Data Streaming from the Camera Device
2.4 Edge Detection
    2.4.1 Optimizing the Code
2.5 Finding the Center Coordinates of a Circular Shape
2.6 Summary
The presented case study consists of a Linux application that acquires a se-
quence of images (frames) from a video camera device. The data acquisition
program will then process the acquired images in order to detect the
coordinates of the center of a circular shape in each of them.
This chapter is divided into four main sections. In the first section general
concepts in computer input/output (I/O) are presented. The second section
will discuss how I/O is managed by operating systems, in particular Linux,
while in the third one the implementation of the frame acquisition is pre-
sented. The fourth section will concentrate on the analysis of the acquired
frames to retrieve the desired information; after presenting two widespread
algorithms for image analysis, the main concepts about software complexity
will be presented, and it will be shown how the execution time for those al-
gorithms can be reduced, sometimes drastically, using a few optimization and
parallelization techniques.
Embedded systems carrying out online analysis of acquired images are be-
coming widespread in industrial control and surveillance. In order to acquire
the sequence of the frames, the video capture application programming inter-
face for Linux (V4L2) will be used. This interface supports most commercial
USB webcams, which are now ubiquitous in laptops and other PCs. There-
fore this sample application can be easily reproduced by the reader, using,
for example, a laptop with an integrated webcam.
FIGURE 2.1
Bus architecture with a separate I/O bus.
values onto the I/O bus locations (i.e., at the addresses corresponding to the
device registers) via specific I/O Read and Write instructions.
In memory-mapped I/O, devices are seen by the processor as a set of reg-
isters, but no specific bus for I/O is defined. Rather, the same bus used to
exchange data between the processor and the memory is used to access I/O
devices. Clearly, the address range used for addressing device registers must
be disjoint from the set of addresses for the memory locations. Figure 2.1 and
Figure 2.2 show the bus organization for computers using a dedicated I/O
bus and memory-mapped I/O, respectively. Memory-mapped architectures
are more common nowadays, but connecting all the external I/O devices di-
rectly to the memory bus represents a somewhat simplified solution with sev-
eral potential drawbacks in reliability and performance. In fact, since speed
in memory access represents one of the major bottlenecks in computer per-
formance, the memory bus is intended to operate at a very high speed, and
therefore it has very strict constraints on the electrical characteristics of the
bus lines, such as their capacitance, and on their physical length.
be directly connected to the memory bus would increase the likelihood that
possible malfunctions of the connected devices would seriously affect the func-
tion of the whole system and, even if that were not the case, there would be
the concrete risk of lowering the data throughput over the memory bus.
In practice, one or more separate buses are present in the computer for I/O,
even with memory-mapped architectures. This is achieved by letting a bridge
FIGURE 2.2
Bus architecture for Memory Mapped I/O.
component connect the memory bus with the I/O bus. The bridge presents
itself to the processor as a device, defining a set of registers for programming
the way the I/O bus is mapped onto the memory bus. Basically, a bridge
can be programmed to define one or more address mapping windows. Every
address mapping window is characterized by the following parameters:
bus) that are mapped onto the corresponding address range in the secondary
PCI bus (for which the bridge is the master, i.e., leads bus operations). Follow-
ing the same approach, new I/O buses, such as the Small Computer System
Interface (SCSI) bus for high-speed disk I/O, can be integrated into the com-
puter board by means of bridges connecting the I/O bus to the memory bus
or, more commonly, to the PCI bus. Figure 2.3 shows an example of bus con-
figuration defining a memory to PCI bridge, a PCI to PCI bridge, and a PCI
to SCSI bridge.
One of the first actions performed when a computer boots is the configuration
of the bridges in the system. Firstly, the bridges directly connected to the
memory bus are configured, so that the devices over the connected buses can
be accessed, including the registers of the bridges connecting these to new I/O
buses. Then the bridges over these buses are configured, and so on. When all
the bridges have been properly configured, the registers of all the devices in
the system are directly accessible by the processor at given addresses over the
memory bus. Properly setting all the bridges in the system may be tricky, and
a wrong setting may make the system totally unusable. Consider, for example,
what could happen if an address map window for a bridge on the memory bus
were programmed with an overlap with the address range used by the RAM
memory: the processor would then be unable to access portions of memory
and therefore would no longer be able to execute programs.
Bridge setting, as well as other very low-level configurations, is normally
performed before the operating system starts, and is carried out by the Basic
Input/Output System (BIOS), code which is normally stored in ROM and
executed as soon as the computer is powered on. So, when the operating system
starts, all the device registers are available at proper memory addresses. This
is, however, not the end of the story: in fact, even if device registers are seen
by the processor as if they were memory locations, there is a fundamental
difference between devices and RAM blocks. While RAM memory chips are
expected to respond in a time frame on the order of nanoseconds, the response
time of devices largely varies and in general can be much longer. It is therefore
necessary to synchronize the processor and the I/O devices.
FIGURE 2.3
Bus architecture with two PCI buses and one SCSI bus.
device. This comes, however, at a cost: no useful operation can be carried out
by the processor while synchronizing with devices via polling. If we assume that
100 ns are required on average for memory access, and assuming that access
to device registers takes the same time as a memory access (a somewhat
simplified scenario since we ignore here the effects of the memory cache),
acquiring a data stream from the serial port would require more than 8000
read operations of the status register for every incoming byte of the stream
– that is, wasting 99.99% of the processor power in useless accesses to the
status register. This situation becomes even worse for slower devices; imagine
the percentage of processor power for doing anything useful if polling were
current processor status. Assuming that memory locations used to store the
program and the associated data are not overwritten during the execution of
the interrupt service routine, it is only necessary to preserve the content of
the processor registers. Normally, the first actions of the routine are to save
in the stack the content of the registers that are going to be used, and such
registers will be restored just before its termination. Not all the registers can
be saved in this way; in particular, the PC and the SR are changed just before
starting the execution of the interrupt service routine. The PC will be set to
the address of the first instruction of the routine, and the SR will be updated
to reflect the fact that the process is starting to service an interrupt of a given
priority. So it is necessary that these two registers be saved by the processor
itself and restored when the interrupt service routine has finished (a specific
instruction to return from an ISR is defined in most computer architectures). In
most architectures the SR and PC registers are saved on the stack, but oth-
ers, such as the ARM architecture, define specific registers to hold the saved
values.
A specific interrupt service routine has to be associated with every possi-
ble source of interrupt, so that the processor can take the appropriate actions
when an I/O device generates an interrupt request. Typically, computer ar-
chitectures define a vector of addresses in memory, called a Vector Table,
containing the start addresses of the interrupt service routines for all the I/O
devices able to generate interrupt requests. The offset of a given ISR within
the vector table is called the Interrupt Vector Number. So, if the interrupt vec-
tor number were communicated by the device issuing the interrupt request,
the right service routine could then be called by the processor. This is ex-
actly what happens; when the processor starts serving a given interrupt, it
performs a cycle on the bus called the Interrupt Acknowledge Cycle (IACK)
where the processor communicates the priority of the interrupt being served,
and the device which issued the interrupt request at the specified priority
returns the interrupt vector number. In case two different devices issued an
interrupt request at the same time with the same priority, the device closest
to the processor in the bus will be served. This is achieved in many buses by
defining a bus line in a Daisy Chain configuration, that is, a line that is
propagated from every device to the next one along the bus only if the device
did not answer the IACK cycle. Therefore, a device will answer an IACK
cycle only if both conditions are met:
• It has issued an interrupt request at the priority specified in the IACK
cycle;
• It has received the daisy chain signal, i.e., no device closer to the processor
has answered the cycle.
Note that in this case it will not propagate the daisy chain signal to the next
device.
The offset returned by the device in an IACK cycle depends on the cur-
rent organization of the vector table and therefore must be a programmable
parameter in the device. Typically, all the devices which are able to issue an
FIGURE 2.4
The Interrupt Sequence.
interrupt request have two registers for the definition of the interrupt prior-
ity and the interrupt vector number, respectively. The sequence of actions is
shown in Figure 2.4, highlighting the main steps of the sequence:
1. The device issues an interrupt request;
2. The processor saves the context, i.e., puts the current values of the
PC and of the SR on the stack;
3. The processor issues an interrupt acknowledge cycle (IACK) on the
bus;
4. The device responds by putting the interrupt vector number (IVN)
over the data lines of the bus;
5. The processor uses the IVN as an offset in the vector table and
loads the interrupt service routine address in the PC.
I/O devices are configured, the code of the interrupt service routine
has to be loaded in memory, and its start address written in the
vector table at, say, offset N ;
3. The value N has to be communicated to the device, usually written
in the interrupt vector number register;
4. When an I/O operation is requested by the program, the device
is started, usually by writing appropriate values in one or more
command registers. At this point the processor can continue with
the program execution, while the device operates. As soon as the
In this case it is necessary to handle the fact that data reception is asyn-
chronous. A commonly used technique is to let the program continue after
issuing an I/O request until the data received by the device is required. At
this point the program has to suspend its execution, waiting for data unless
they are already available, that is, waiting until the corresponding interrupt
service routine has been executed. For this purpose the interprocess commu-
nication mechanisms described in Chapter 5 will be used.
For a 1 GHz processor this means that 10% of the processor time is dedicated
to data transfer, a percentage that is clearly no longer acceptable.
Very often data exchanged with I/O devices are transferred from or to
memory. For example, when a disk block is read it is first transferred to mem-
ory so that it is later available to the processor. If the processor itself were in
charge of transferring the block, say, after receiving an interrupt request from
the disk device to signal the block availability, the processor would repeat-
edly read data items from the device’s data register into an internal processor
register and write them back into memory. The net effect is that a block of data
has been transferred from the disk into memory, but it has been obtained
at the expense of a number of processor cycles that could have been used to
do other jobs if the device were allowed to write the disk block into memory
by itself. This is exactly the basic concept of Direct Memory Access (DMA),
which is letting the devices read and write memory by themselves so that the
processor will handle I/O data directly in memory. In order to put this simple
concept in practice it is, however, necessary to consider a set of facts. First
of all, it is necessary that the processor can “program” the device so that it
will perform the correct actions, that is, reading/writing a number N of data
items in memory, starting from a given memory address A. For this purpose,
every device able to perform DMA provides at least the following registers:
• A Memory Address Register (MAR), holding the memory address of the
next data item to be transferred;
• A Word Count (WC) register, holding the number of data items still to
be transferred.
So, in order to program a block read or write operation, it is necessary that the
processor, after allocating a block in memory and, in case of a write operation,
filling it with the data to be output to the device, writes the start address
and the number of data items in the MAR and WC registers, respectively.
Afterwards the device will be started by writing an appropriate value in (one
of) the command register(s). When the device has been started, it will operate
in parallel with the processor, which can proceed in the execution of the
program. However, as soon as the device is ready to transfer a data item,
it will require the memory bus used by the processor to exchange data with
memory, and therefore some sort of bus arbitration is needed since it is not
possible that two devices read or write the memory at the same time on
the same bus (note however that nowadays memories often provide multiport
access, that is, allow simultaneous access to different memory addresses). At
any time one, and only one, device (including the processor) connected to the
bus is the master, i.e., can initiate a read or write operation. All the other
connected devices at that time are slaves and can only answer to a read/write
bus cycle when they are addressed. The memory will always be a slave on the
bus, as well as the DMA-enabled devices when they are not performing DMA.
At the time such a device needs to exchange data with the memory, it will
ask the current master (normally the processor, but it may be another device
performing DMA) for the ownership of the bus. For this purpose, the protocol
of every bus able to support ownership transfer defines a cycle for transferring
bus mastership. In this cycle, the potential master raises a request line
and the current master, in response, relinquishes the mastership, signaling this
over another bus line, and possibly waiting for the termination of a read/write
operation in progress. When a device has taken the bus ownership, it can then
perform the transfer of the data item and will remain the current master until
the processor or another device asks to become the new master. It is worth
noting that the bus ownership transfers are handled by the bus controller
components and are carried out entirely in hardware. They are, therefore,
totally transparent to the programs being executed by the processor, except
for a possible (normally very small) delay in their execution.
Every time a data item has been transferred, the MAR is incremented and
the WC is decremented. When the content of the WC becomes zero, all the
data have been transferred, and it is necessary to inform the processor of
this fact by issuing an interrupt request. The associated Interrupt Service
Routine will handle the block transfer termination by notifying the system of
the availability of new data. This is normally achieved using the interprocess
communication mechanisms described in Chapter 5.
system, which may lead to the crash of the whole system. (At least in mono-
lithic operating systems such as Linux and Windows; this may be not true
for other systems, such as microkernel-based ones.) User programs will never
interact directly with the driver as the device is accessible only via the Ap-
plication Programming Interface (API) provided by the operating system. In
the following we shall refer to the Linux operating system and shall see how
a uniform interface can be adapted to the variety of available devices. Other
operating systems adopt a similar architecture for I/O, typically differing
only in the names and arguments of the I/O system routines, but not in their
functionality.
routine terminates, it will pick the saved return address from the stack and
put it into the Program Counter, so that the execution of the calling program
is resumed. We have already seen, however, how the interrupt mechanism can
be used to “invoke” an interrupt service routine. In this case the sequence
is different, and is triggered not by the calling program but by an external
hardware device. It is exactly when the processor starts executing an Inter-
rupt Service routine that the current execution mode is switched to kernel
mode. When the interrupt service routine returns and the interrupted pro-
gram resumes its execution, the execution mode is switched back to user
mode, unless execution switches to another interrupt service routine. It is
worth noting that the mode switch is not controlled by software: it is the
processor itself that switches to kernel mode when servicing an interrupt.
This mechanism makes sense because interrupt service routines interact
with devices and are part of the device driver, that is, of a software compo-
nent that is integrated in the operating system. However, it may happen that
user programs have to do I/O operations, and therefore they need to execute
some code in kernel mode. We have claimed that all the code handling I/O
is part of the operating system and therefore the user program will call some
system routine for doing I/O. However, how do we switch to kernel mode in
this case, where the trigger does not come from a hardware device? The so-
lution is given by Software Interrupts. Software interrupts are not triggered
by an external hardware signal, but by the execution of a specific machine
instruction. The interrupt mechanism is much the same: the processor saves
the current context, picks the address of the associated interrupt service rou-
tine from the vector table and switches to kernel mode, but in this case the
Interrupt Vector number is not obtained by a bus IACK cycle; rather, it is
given as an argument to the machine instruction for the generation of the
software interrupt.
The net effect of software interrupts is very similar to that of a function
call, but the underlying mechanism is completely different. This is the typical
way the operating system is invoked by user programs when requesting system
services, and it represents an effective barrier protecting the integrity of the
system. In fact, in order to let any code be executed via software interrupts,
it is necessary to write in the vector table the initial address of such code but,
not surprisingly, the vector table is not accessible in user mode, as it belongs to
the set of data structures whose integrity is essential for the correct operation
of the computer. The vector table is typically initialized during the system
boot (executed in kernel mode) when the operating system initializes all its
data structures.
To summarize the above concepts, let’s consider the execution story of one
of the most used C library functions: printf(), which takes as parameter the
(possibly formatted) string to be printed on the screen. Its execution consists
of the following steps:
1. The program calls printf(), which is part of the C run-time
library. Arguments are passed on the stack and the start address of
the printf routine is put in the program counter;
2. The printf code will carry out the required formatting of the
passed string and the other optional arguments, and then calls the
operating system specific system service for writing the formatted
string on the screen;
3. The system routine executes initially in user mode, does some
preparatory work, and then needs to switch to kernel mode. To do
this, it will issue a software interrupt, where the passed interrupt
vector number specifies the offset in the Vector Table of the corre-
sponding ISR routine to be executed in kernel mode;
4. The ISR is eventually activated by the processor in response to the
software interrupt. This routine is provided by the operating system
and it is now executing in kernel mode;
5. After some work to prepare the required data structures, the ISR
routine will interact with the output device. To do this, it will call
specific routines of the device driver;
6. The activated driver code will write appropriate values in the device
registers to start transferring the string to the video device. In the
meantime the calling process is put in wait state (see Chapter 3 for
more information on processes and process states);
7. A sequence of interrupts will likely be generated by the device to
handle the transfer of the bytes of the string to be printed on the
screen;
8. When the whole string has been printed on the screen, the calling
process will be resumed by the operating system and printf will
return.
Software interrupts provide the required barrier between user and kernel mode,
which is of paramount importance in general purpose operating systems. This
comes, however, at a cost: the activation of a kernel routine involves a sequence
of actions, such as saving the context, which is not necessary in a direct call.
Many embedded systems, on the other hand, are not intended for general usage. Rather,
they are intended to run a single program for control and supervision or, in
more complex systems involving multitasking, a well defined set of programs
developed ad hoc. For this reason several real-time operating systems do not
support different execution levels (even if the underlying hardware could), and
all the software is executed in kernel mode, with full access to the whole set of
system resources. In this case, a direct call is used to activate system routines.
Of course, the failure of a program will likely bring the whole system down,
but in this case it is assumed that the programs being executed have already
been tested and can therefore be trusted.
may seem at first glance a bit surprising since the similarity between files
and devices is not so evident, but the following considerations hold:
• In order to be used, a file must be open. The open() system routine will
create a set of data structures that are required to handle further operations
on that file. A file identifier is returned to be used in the following operations
for that file in order to identify the associated data structures. In general,
every I/O device requires some sort of initialization before being used.
Initialization will consist of a set of operations performed on the device
and in the preparation of a set of support data structures to be used when
operating on that device. So an open() system routine makes sense also for
I/O devices. The returned identifier (actually an integer number in Linux)
is called a Device Descriptor and uniquely identifies the device instance in
the following operations. When a file is no longer used, it is closed and the
associated data structures are deallocated. Similarly, when an I/O device is
no longer used, it will be closed, performing cleanup operations and freeing
the associated resources.
• A file can be read or written. In the read operation, data stored in the
file are copied in the computer memory, and the converse holds for write
operations. Regardless of the actual nature of an I/O device, there are two
main categories of interaction with the computer: read and write. In read
operations, data from the device are copied into the computer memory to be
used by the program. In write operations, data in memory will be trans-
ferred to the device. Both read() and write() system routines will require
the target file or device to be uniquely identified. This will be achieved by
passing the identifier returned by the open() routine.
However, due to the variety of hardware devices that can be connected to
a computer, it is not always possible to provide a logical mapping of the
device’s functions exclusively into read-and-write operations. Consider, as an
example, a network card: actions such as receiving data and sending data
over the network can be mapped into read-and-write operations, respectively,
but others, like the configuration of the network address, require a different
interface. In Linux this is achieved by providing an additional routine for I/O
management: ioctl(). In addition to the device descriptor, ioctl() defines
two more arguments: the first one is an integer number and is normally used
The devil, however, hides in the details, and in fact all the complexity in the
device/computer interaction has been simply moved to ioctl(). Depending
on the device’s nature, the set of operations and of the associated data struc-
tures may range from a few simple configurations to a fairly complex
set of operations and data structures, described by hundreds of user manual
pages. This is exactly the case of the standard driver for the camera devices
that will be used in the subsequent sections of this chapter for the presented
case study.
The abstraction carried out by the operating system in the application
programming interface for device I/O is also maintained in the interaction
between the operating system and the device-specific driver. We have already
seen that, in order to integrate a device in the system, it is necessary to pro-
vide device-specific code, assembled into the device driver and then integrated
into the operating system. Basically, a device driver provides the implementa-
tion of the open, close, read, write, and ioctl operations. So, when a program
opens a device by invoking the open() system routine, the operating system
will first carry out some generic operations common to all devices, such as
the preparation of its own data structures for handling the device, and will
then call the device driver’s open() routine to carry out the required device
specific actions. The actions carried out by the operating system may involve
the management of the calling process. For example, in a read operation, the
operating system, after calling the device-specific read routine, may suspend
the current process (see Chapter 3 for a description of the process states) in
the case the required data are not currently available. When the data to be
read becomes available, the system will be notified of it, say, with an interrupt
from the device, and the operating system will wake the process that issued
the read() operation, which can now terminate the read() system call.
include at least:
• Device capability configuration parameters, such as the ability of support-
ing data streaming and the supported pixel formats;
• Image format definition, such as the width and height of the frame, the
number of bytes per line, and the pixel format.
Due to the large number of different camera devices available on the market,
having a specific driver for every device, with its own configuration parame-
ters and ioctl() protocol (i.e., the defined operations and the associated data
structures), would complicate the life of the programmers quite a lot. Con-
sider, for example, what would happen if, in an embedded system for on-line
quality control based on image analysis, the camera model were changed, say,
because a new, better device became available. This would imply rewriting all the code
which interacts with the device. For this reason, a unified interface to camera
devices has been developed in the Linux community. This interface, called
V4L2 (Video for Linux Two), defines a set of ioctl operations and associated
data structures that are general enough to be adapted for all the available
camera devices of common usage. If the driver of a given camera device ad-
heres to the V4L2 standards, the usability of such device is greatly improved
and it can be quickly integrated into existing systems. V4L2 also improves the
interchangeability of camera devices in applications. To this purpose, an im-
portant feature of V4L2 is the availability of query operations for discovering
the supported functionality of the device. A well-written program, first query-
ing the device capabilities and then selecting the appropriate configuration,
can then be reused for a different camera device with no change in the code.
As V4L2 in principle covers the functionality of all the devices available on
the market, the standard is rather complicated because it has to foresee even
the most exotic functionality. Here we shall not provide a complete description
of the V4L2 interface, which can be found in [77], but will illustrate its usage by
means of two examples. In the first example, a camera device is inquired in
order to find out the supported formats and to check whether the YUYV
format is supported. If this format is supported, camera image acquisition is
started using the read() system routine. YUYV is a format to encode pixel
information expressed by the following information:
• Luminance (Y)
• Blue chrominance (Cb)
• Red chrominance (Cr)
Color information is not needed in our application, which aims at
retrieving information from the shape of the objects in the image, so we shall
consider only the component Y.
The YUYV format represents a compressed version of the Y, Cb, and Cr
components. In fact, while the luminance is encoded for every pixel in the image, the
chrominance values are encoded for every two pixels. This choice stems from the fact
that the human eye is more sensitive to variations of light intensity than
of the color components. So, in the YUYV format, pixels are encoded
from the topmost image line and from the left to the right, and four bytes are
used to encode two pixels with the following pattern: Yi , Cbi , Yi+1 , Cri . To get
the grey scale representation of the acquired image, our program will therefore
take the first byte of every pair.
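The extraction of the grey-scale image can be sketched with a small function. This is an illustration of the byte layout just described, not part of the original listing: the function name yuyvToGray and its arguments are our own choice.

#include <stddef.h>

/* Convert a YUYV-encoded buffer into a grey-scale (luminance only) buffer.
   Every 4 bytes of the input (Yi, Cbi, Yi+1, Cri) encode two pixels: the
   luminance of each pixel sits in the even-indexed bytes, so we copy the
   first byte of every 16-bit pair.  nPixels is assumed to be even, as in
   the 640 x 480 frames used here. */
void yuyvToGray(const unsigned char *yuyv, unsigned char *gray, size_t nPixels)
{
    size_t i;
    for (i = 0; i < nPixels; i++)
        gray[i] = yuyv[2 * i];
}
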
#define MAX_FORMAT 100
#define FALSE 0
#define TRUE 1
#define CHECK_IOCTL_STATUS(message) \
    if (status == -1)               \
    {                               \
        perror(message);            \
        exit(EXIT_FAILURE);         \
    }
int yuyvFound;
for (;;)
{
    /* first argument of select(): highest descriptor to monitor, plus one */
    status = select(fd + 1, &fds, NULL, NULL, &tv);
    if (status == -1)
    {
        perror("Error in Select");
        exit(EXIT_FAILURE);
    }
    status = read(fd, buf, imageSize);
    if (status == -1)
    {
        perror("Error reading buffer");
        exit(EXIT_FAILURE);
    }
The first action (step 1) in the program is opening the device. The system routine
open() looks exactly like an open call for a file. As for files, the first argument
is a path name, but in this case the name specifies the device instance. In
Linux the names of the devices are all contained in the /dev directory. The
files contained in this directory do not correspond to real files (a Webcam is
obviously different from a file), rather, they represent a rule for associating a
unique name with each device in the system. In this way it is also possible to
discover the available devices using the ls command to list the files contained
in a directory. By convention, camera devices have the name /dev/video<n>,
so the command ls /dev/video* will show how many camera devices are
available in the system. The second argument given to the system routine open()
specifies the protection associated with that device. In this case the constant
O_RDWR specifies that the device can be read and written. The returned value
is an integer that uniquely identifies within the system the Device De-
scriptor, that is, the set of data structures held by Linux to manage this device.
This number is then passed to the following ioctl() calls to specify the target
device. Step 2 consists of checking whether the camera device supports the read/write
operation. The attentive reader may find this a bit strange (how could
the image frames be acquired otherwise?), but we shall see in the second ex-
ample that an alternative way, called streaming, is normally (and indeed most
often) provided. This query operation is carried out by the following line:
status = ioctl(fd, VIDIOC_QUERYCAP, &cap);
In the above line the ioctl operation code is given by the constant
VIDIOC_QUERYCAP (defined, as all the other constants used in the manage-
ment of the video device, in linux/videodev2.h), and the associated data
structure for the pointer argument is of type v4l2_capability. This struc-
ture, documented in the V4L2 API specification, defines, among others, a
capabilities field containing a bit mask specifying the supported capabilities of
that device.
Line
if(cap.capabilities & V4L2_CAP_READWRITE)
will let the program know whether read/write ability is supported by the
device.
In step 3 the device is queried about the supported pixel formats. To do
this, ioctl() is repeatedly called, specifying the VIDIOC_ENUM_FMT operation and
passing the pointer to a v4l2_fmtdesc structure whose fields of interest are:
• index: to be set before calling ioctl() in order to specify the index of the
queried format. When no more formats are available, that is, when the
index is greater than or equal to the number of supported formats, ioctl() will
return an error.
• type: specifies the type of the buffer for which the supported format is
being queried. Here, we are interested in the returned image frame, so
this field is set to V4L2_BUF_TYPE_VIDEO_CAPTURE.
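The enumeration loop of step 3 can be condensed into a helper function. This is a hedged sketch, not the book's listing: the function name findPixelFormat is ours, and error handling is reduced to the bare minimum.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Return 1 if the camera device referred to by fd supports the pixel
   format identified by fourcc (e.g., V4L2_PIX_FMT_YUYV), 0 otherwise.
   The loop stops when VIDIOC_ENUM_FMT fails, i.e., when the index
   exceeds the number of supported formats. */
int findPixelFormat(int fd, unsigned int fourcc)
{
    struct v4l2_fmtdesc fmt;
    int idx;
    for (idx = 0; ; idx++)
    {
        memset(&fmt, 0, sizeof(fmt));
        fmt.index = idx;                          /* format to query */
        fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;   /* image frames */
        if (ioctl(fd, VIDIOC_ENUM_FMT, &fmt) == -1)
            return 0;                             /* no more formats */
        if (fmt.pixelformat == fourcc)
            return 1;
    }
}

Pixel formats are identified by four-character codes; V4L2_PIX_FMT_YUYV packs the four characters 'Y', 'U', 'Y', 'V' into a 32-bit word.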
If the pixel format YUYV is found (this is the format normally supported by
all Webcams), the program proceeds to define an appropriate image format.
There are many parameters for specifying such information, all defined in the
structure v4l2_format passed to ioctl() to get (operation VIDIOC_G_FMT) or to
set (operation VIDIOC_S_FMT) the format. The program will first read (step 4)
the currently defined image format (normally most default values are already
appropriate) and then change (step 5) the formats of interest, namely, the image
width, the image height, and the pixel format. Here, we are going to define a
640 x 480 image using the YUYV pixel format by writing the appropriate
values in fields fmt.pix.width, fmt.pix.height and fmt.pix.pixelformat
of the format structure. Observe that, after setting the new image format,
the program checks the returned values for image width and height. In fact,
it may happen that the device does not support exactly the requested image
width and height, and in this case the format structure returned by ioctl
contains the appropriate values, that is, the supported width and height that
are closest to the desired ones. The field fmt.pix.sizeimage will contain the total
length in bytes of the image frame, which in our case will be given by 2 times
width times height (recall that in the YUYV format four bytes are used to encode
two pixels).
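Steps 4 and 5 can be sketched as follows. The helper names yuyvImageSize and setImageFormat are ours, for illustration; the driver may adjust the requested width and height to the closest supported values, and the granted frame size is returned in fmt.pix.sizeimage.

#include <string.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Expected size in bytes of a YUYV frame: four bytes encode two pixels,
   i.e., two bytes per pixel. */
size_t yuyvImageSize(size_t width, size_t height)
{
    return 2 * width * height;
}

/* Read the current format (step 4), change only width, height, and pixel
   format, and write it back (step 5).  Returns the frame size granted by
   the driver, or 0 on error. */
size_t setImageFormat(int fd, unsigned int width, unsigned int height)
{
    struct v4l2_format fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    if (ioctl(fd, VIDIOC_G_FMT, &fmt) == -1)   /* read current defaults */
        return 0;
    fmt.fmt.pix.width = width;                 /* request our values */
    fmt.fmt.pix.height = height;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
    if (ioctl(fd, VIDIOC_S_FMT, &fmt) == -1)
        return 0;
    return fmt.fmt.pix.sizeimage;              /* as granted by the driver */
}

For a 640 x 480 YUYV frame, yuyvImageSize() yields 614400 bytes, the value the program expects to find in fmt.pix.sizeimage.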
At this point the camera device is configured, and the program can start
acquiring image frames. In this example a frame is acquired via a read() call
whose arguments are:
• the device descriptor returned by open();
• a pointer to the buffer where the acquired frame will be stored;
• the number of bytes to be read, that is, the size of the image frame.
Function read() returns the number of bytes actually read, which is not
necessarily equal to the number of bytes passed as argument. In fact, it may
happen that, at the time the function is called, not all the required bytes are
available, and the program has to manage this properly. So, it is necessary to
make sure that, when read() is called, a frame is available for readout. The
usual technique in Linux to synchronize read operations on devices is the usage
of the select() function, which allows a program to monitor multiple device
descriptors, waiting until one or more devices become “ready” for some class
of I/O operation (e.g., input data available). A device is considered ready if
it is possible to perform the corresponding I/O operation (e.g., read) without
blocking. Observe that the usage of select() is very useful when a program has to
deal with several devices. In fact, since read() is blocking, that is, it suspends
the execution of the calling program until some data are available, a program
reading from multiple devices may suspend in a read() operation regardless of the
fact that some other device may have data ready to be read. The arguments
passed to select() are:
• the highest device descriptor to be monitored, plus one;
• three masks of device descriptors, for the read, write, and exception
classes of operations, respectively;
• a timeout specification.
The device masks are of type fd_set, and there is no need to know
its definition, since the macros FD_ZERO and FD_SET allow resetting the mask
and adding a device descriptor to it, respectively. When select() does not have
to monitor a device class, the corresponding mask is NULL, as in the above
example for the write and exception masks. The timeout is specified using the
structure timeval, which defines two fields, tv_sec and tv_usec, specifying
the number of seconds and microseconds, respectively.
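The synchronization pattern just described can be condensed into a small helper (ours, for illustration): it returns a positive value when the descriptor is ready for reading, 0 on timeout, and -1 on error.

#include <sys/select.h>
#include <sys/time.h>

/* Wait until fd becomes ready for reading, or until timeoutSec seconds
   have elapsed.  Returns the value returned by select(): > 0 if ready,
   0 on timeout, -1 on error. */
int waitForFrame(int fd, int timeoutSec)
{
    fd_set fds;
    struct timeval tv;
    FD_ZERO(&fds);        /* reset the read mask */
    FD_SET(fd, &fds);     /* monitor our device only */
    tv.tv_sec = timeoutSec;
    tv.tv_usec = 0;
    /* first argument: highest descriptor to be monitored, plus one;
       write and exception masks are not used here, hence NULL */
    return select(fd + 1, &fds, NULL, NULL, &tv);
}

The same helper works for any readable descriptor, for example a pipe, which makes it easy to try out without a camera device.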
The above example will work fine, provided the camera device supports
the direct read() operation, as long as it is possible to guarantee that the
read() routine is called as often as the frame rate. This is, however, not
always the case, because the process running the program may be preempted
by the operating system in order to assign the processor to other processes.
Even if we can guarantee that, on average, the read rate is high enough, it is in
general necessary to handle the occasional cases in which the reading process
is late and the frame may be lost. Several chapters of this book will discuss
this fact, and we shall see several techniques to ensure real-time behavior, that
is, making sure that a given action will be executed within a given amount
of time. If this were the case, and we could ensure that the read() operation
for the current frame will be always executed before a new frame is acquired,
there would be no risk of losing frames. Otherwise it is necessary to handle
occasional delays in frame readout. The common technique for this is double
buffering, that is, using two buffers for the acquired frames. As soon as the
driver is able to read a frame, normally in response to an interrupt indicating
that the DMA transfer for that frame has terminated, the frame is written
into one of two alternate memory buffers. The process acquiring such frames can then
copy from one buffer while the driver is filling in the other one. In this case,
if T is the frame acquisition period, a process is allowed to read a frame with
a delay of up to T. Beyond this time, the process may be reading a buffer that
at the same time is being written by the driver, producing inconsistent data
or losing entire frames. The double buffering technique can be extended to
more than two buffers.
FIGURE 2.5
The Virtual Memory address translation.
In the common case of 32-bit architectures, where 32 bits are used to represent
virtual addresses, the top 32 − K bits of a virtual address are used as the index
in the page table. This corresponds to providing a logical organization of the
virtual address range in a set of memory pages, each 2^K bytes long. So the most
significant 32 − K bits will provide the memory page number, and the least
significant K bits will specify the offset within the memory page. Under this
perspective, the page table provides a page number translation mechanism,
from the logical page number into the physical page number. In fact, the
physical memory too can be considered as divided into pages of the same size, and
the offset within the translated physical page will be the same as in
the original logical page.
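The split of a virtual address into page number and offset can be made concrete with a couple of helpers. This sketch (names ours) assumes K = 12, that is, pages of 4096 bytes, a common choice.

/* Assume K = 12 offset bits, i.e., pages of 2^12 = 4096 bytes.  The most
   significant 32 - K bits give the page number, the least significant K
   bits the offset within the page. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

unsigned long pageNumber(unsigned long vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

unsigned long pageOffset(unsigned long vaddr)
{
    return vaddr & (PAGE_SIZE - 1);
}

/* Rebuilding the physical address after the page table has translated the
   page number: the offset is carried over unchanged. */
unsigned long physAddress(unsigned long physPage, unsigned long offset)
{
    return (physPage << PAGE_SHIFT) | offset;
}

For example, virtual address 0x12345678 falls in page 0x12345 at offset 0x678; if the page table maps that page to physical page 0x99, the resulting physical address is 0x99678.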
Even if virtual memory may seem at first glance a method merely in-
vented to complicate the engineer's life, the following example should convince
the skeptics of its convenience. Consider two processes running the same pro-
gram: this is perfectly normal in everyday life, and no one is in fact surprised
by the fact that two Web browsers or editor programs can be run by differ-
ent processes in Linux (or tasks in Windows). Recalling that a program is
composed of a sequence of machine instructions handling data in processor
registers and in memory, if no virtual memory were supported, the two in-
stances of the same program run by two different processes would interfere
with each other since they would access the same memory locations (they
are running the same program). This situation is elegantly solved, using the
virtual memory mechanism, by providing two different mappings to the two
processes so that the same virtual address page is mapped onto two different
physical pages for the two processes, as shown in Figure 2.6. Recalling that
the address translation is driven by the content of the page table, this means
that the operating system, whenever it assigns the processor to one process,
will also set the corresponding page table entries accordingly. The page table
FIGURE 2.6
The usage of virtual address translation to avoid memory conflicts.
contents become therefore part of the set of information, called Process Con-
text, which needs to be restored by the operating system in a context switch,
that is whenever a process regains the usage of the processor. Chapter 3 will
describe process management in more detail; here it suffices to know that
virtual address translation is part of the process context.
Virtual memory support complicates quite a bit the implementation of
an operating system, but it greatly simplifies the programmer's life: the
programmer need not be concerned about possible interferences with other programs. At
this point, however, the reader may be falsely convinced that in an operat-
ing system not supporting virtual memory it is not possible to run the same
program in two different processes, or that, in any case, there is always the
risk of memory interferences among programs executed by different processes.
Luckily, this is not the case, but memory consistency can be obtained only by
imposing a set of rules on programs, such as the usage of the stack for keeping
local variables. Programs compiled by a C compiler normally use
the stack to contain local variables (i.e., variables which are declared inside a
program block without the static qualifier) and the arguments passed in rou-
tine calls. Only static variables (i.e., local variables declared with the static
qualifier or variables declared outside program blocks) are allocated outside
FIGURE 2.7
Sharing data via static variables on systems which do not support Virtual
Addresses.
the stack. A separate stack is then associated with each process, thus allow-
ing memory insulation even on systems not supporting virtual memory. When
writing code for systems without virtual memory, it is therefore important to
pay attention to the usage of static variables, since these are shared among
different processes, as shown in Figure 2.7. This is not necessarily a negative
fact, since a proper usage of static data structures may represent an effective
way of achieving interprocess communication. Interprocess communication,
that is, exchanging data among different processes, can be achieved also with
virtual memory, but in this case it is necessary that the operating system be
involved, so that it can set up the content of the page table in order to allow
the sharing of one or more physical memory pages among different processes,
as shown in Figure 2.8.
FIGURE 2.8
Using the Page Table translation to map possibly different virtual addresses
onto the same physical memory page.
the network. The same holds for a video device: a read operation will get the
acquired image frame, not read from any “address.” However, when handling
memory buffers in double buffering, it is necessary to find some way to map the
regions of memory used by the driver into memory buffers for the program.
mmap() can be used for this purpose, and the preparation of the shared buffers
is carried out in two steps:
1. The driver allocates the buffers in its (physical) memory space, and
returns (in a data structure passed to ioctl()) the unique address
(in the driver context) of such buffers. The returned addresses may
be the same physical address of the buffers, but in any case they
are seen outside the driver as addresses referred to the conceptual
file model.
2. The user program calls mmap() to map such buffers in its virtual
memory onto the driver buffers, passing as arguments the file ad-
dresses returned in the previous ioctl() call. After the mmap() call,
the memory buffers are shared between the driver, using physical
addresses, and the program, using virtual addresses.
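Step 2 can be sketched as follows; the wrapper name mapCameraBuffer is ours, and length and offset are the values the driver returned in the v4l2_buffer structure (fields length and m.offset) of the preceding VIDIOC_QUERYBUF call.

#include <sys/mman.h>
#include <sys/types.h>

/* Map one driver buffer into the virtual address space of the program.
   fd is the device descriptor; length and offset come from the
   v4l2_buffer structure filled by the VIDIOC_QUERYBUF ioctl.  Returns
   the virtual address of the mapping, or MAP_FAILED on error. */
void *mapCameraBuffer(int fd, size_t length, off_t offset)
{
    return mmap(NULL,                    /* let the kernel pick the address */
                length,
                PROT_READ | PROT_WRITE,  /* the buffer is read and written */
                MAP_SHARED,              /* share the pages with the driver */
                fd, offset);
}

A program must always check the returned value against MAP_FAILED; for example, passing an invalid descriptor makes the mapping fail.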
The code of the program using multiple buffering for handling image frame
streaming from the camera device is listed below.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <linux/videodev2.h>
#include <asm/unistd.h>
#include <poll.h>
#define MAX_FORMAT 100
#define FALSE 0
#define TRUE 1
#define CHECK_IOCTL_STATUS(message) \
    if (status == -1)               \
    {                               \
        perror(message);            \
        exit(EXIT_FAILURE);         \
    }
typedef struct {
    void *start;
    size_t length;
} bufferDsc;
int idx;
fd_set fds;        /* Select descriptors */
struct timeval tv; /* Timeout specification structure */
Steps 1–6 are the same as in the previous program, except for step 2, where
the streaming capability of the device is now checked. In Step 7, the driver is
asked to allocate four image buffers. The actual number of allocated buffers
is returned in the count field of the v4l2_requestbuffers structure passed
to ioctl(). At least two buffers must have been allocated by the driver to
allow double buffering. In Step 8 the descriptors of the buffers are allocated
via the calloc() system routine (every descriptor contains the dimension and
a pointer to the associated buffer). The actual buffers, which have been allo-
cated by the driver, are queried in order to get their address in the driver's
space. Such an address, returned in field m.offset of the v4l2_buffer struc-
ture passed to ioctl(), cannot be used directly in the program since it refers
to a different address space. The actual address in the user address space is
returned by the following mmap() call. When the program arrives at Step 9,
the buffers have been allocated by the driver and also mapped to the pro-
gram address space. They are now enqueued by the driver, which maintains a
linked queue of available buffers. Initially, all the buffers are available: every
time the driver has acquired a frame, the first available buffer in the queue
is filled. Streaming, that is, frame acquisition, is started at Step 10, and then
at Step 11 the program waits for the availability of a filled buffer, using the
select() system call. Whenever select() returns, at least one buffer con-
tains an acquired frame. It is dequeued in Step 12, and then enqueued in Step
13, after it has been used in image processing. The reason for dequeuing and
then enqueuing the buffer again is to make sure that the buffer will not be
used by the driver during image processing.
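The dequeue/process/enqueue cycle of Steps 11–13 can be sketched as follows. This is a hedged reconstruction, not the book's listing: the function name acquireOneFrame is ours, processImage() is only referenced via a callback, and the CHECK_IOCTL_STATUS error handling is replaced by plain returns.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Dequeue one filled buffer (Step 12), let the caller process it, and
   enqueue it again (Step 13) so that the driver can reuse it.  While the
   buffer is dequeued, the driver is guaranteed not to touch it.  Returns
   the index of the processed buffer, or -1 on error. */
int acquireOneFrame(int fd, void (*process)(int index))
{
    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_MMAP;           /* buffers are mmap-ed */
    if (ioctl(fd, VIDIOC_DQBUF, &buf) == -1) /* Step 12: dequeue */
        return -1;
    process(buf.index);                      /* safe: buffer is ours now */
    if (ioctl(fd, VIDIOC_QBUF, &buf) == -1)  /* Step 13: enqueue again */
        return -1;
    return (int)buf.index;
}
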
Finally, image processing will be carried out by routine processImage(),
which will first build a byte buffer containing only the luminance, that is,
taking the first byte of every 16-bit word of the passed buffer, coded using the
YUYV format.
step allows reducing the size of the problem, since for the following analysis
it suffices to take into account the pixels representing the edges in the image.
Edge detection is carried out by computing the approximation of the gradients
in the X (Lx) and Y (Ly) directions for every pixel of the image, selecting then
only those pixels for which the gradient magnitude, computed as
|∇L| = sqrt(Lx^2 + Ly^2), is above a given threshold. In fact, informally stated, an edge
corresponds to a region where the brightness of the image changes sharply,
the gradient magnitude being an indication of the “sharpness” of the change.
Observe that in edge detection we are only interested in the luminance, so in
the YUYV pixel format, only the first byte of every two will be considered. The
gradient is computed using a convolution matrix filter. Image filters based on
convolution matrixes are very common in image elaboration and, depending on
the matrix used for the computation, often called kernel, can perform several
types of image processing. Such a matrix is normally a 3 x 3 or 5 x 5 square
matrix, and the computation is carried out by considering, for each image pixel
P(x, y), the pixels surrounding the considered one and multiplying them by
the corresponding coefficients of the kernel matrix K. Here we shall use a
3 x 3 kernel matrix, and therefore the computation of the filtered pixel value
Pf(x, y) is
Pf(x, y) = Σ(i=0..2) Σ(j=0..2) K(i, j) P(x + i − 1, y + j − 1)        (2.1)
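Equation (2.1) translates directly into code. The following sketch (function name ours) computes the filtered value of a single pixel of a cols-wide grey-scale image, assuming the pixel lies at least one position away from the image border so that the 3 x 3 window never falls outside the image.

/* Apply a 3x3 convolution kernel K to pixel (x, y) of a grey-scale image
   stored row first in a buffer of cols columns, following Eq. (2.1):
   the kernel index i maps to the x offset and j to the y offset, exactly
   as in the Sobel code below. */
int convolve3x3(const unsigned char *image, int cols,
                int x, int y, const int K[3][3])
{
    int i, j, sum = 0;
    for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
            sum += K[i][j] * image[(x + i - 1) + (y + j - 1) * cols];
    return sum;
}
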
Here, we use the Sobel Filter for edge detection, which defines the following
two kernel matrixes:

    -1  0  1
    -2  0  2        (2.2)
    -1  0  1

for the gradient along the X direction, and

     1  2  1
     0  0  0        (2.3)
    -1 -2 -1

for the gradient along the Y direction.
#define THRESHOLD 100
/* Sobel matrixes */
static int GX[3][3];
static int GY[3][3];
/* Initialization of the Sobel matrixes, to be called before
   Sobel filter computation */
static void initG()
{
    /* 3x3 GX Sobel mask. */
    GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
    GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
    GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;
    /* 3x3 GY Sobel mask. */
    GY[0][0] = 1; GY[0][1] = 2; GY[0][2] = 1;
    GY[1][0] = 0; GY[1][1] = 0; GY[1][2] = 0;
    GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;
}
/* Convolution starts here */
else
{
    /* X Gradient */
    for (i = -1; i <= 1; i++)
    {
        for (j = -1; j <= 1; j++)
        {
            sumX = sumX + (int)((*(image + x + i +
                   (y + j) * cols)) * GX[i + 1][j + 1]);
        }
    }
    /* Y Gradient */
    for (i = -1; i <= 1; i++)
    {
        for (j = -1; j <= 1; j++)
        {
            sumY = sumY + (int)((*(image + x + i +
                   (y + j) * cols)) * GY[i + 1][j + 1]);
        }
    }
    /* Gradient magnitude approximation to avoid square root operations */
    sum = abs(sumX) + abs(sumY);
}
is called the big-O notation and provides an indication of the complexity of
computer algorithms. More formally, given two functions f(x) and g(x), if a
value M and a value x0 exist for which the following condition holds:

|f(x)| ≤ M |g(x)| for every x > x0

then f(x) is said to be O(g(x)). Consider, for example,
two algorithms for a problem of dimension N, the first one requiring f(N) op-
erations, and the second one requiring exactly 100 f(N). Of course, we would
never choose the second one; however, they are equivalent in the big-O nota-
tion, being both O(f(N)).
Therefore, in order to assess the complexity of a given algorithm and to op-
timize it, other techniques must be considered, in addition to the choice of the
appropriate algorithm. This is the case of our application: given the algorithm,
we want to make its computation as fast as possible.
First of all, we need to perform a measurement of the time the algorithm
takes. A crude but effective method is to use the system routines for getting
the current time, and measure the difference between the times read before
and after the computation of the algorithm. The following code snippet makes
a rough estimation of the time procedure makeBorder() takes in a Linux system.
#define ITERATIONS 1000
struct timeval beforeTime, afterTime;
long executionTime;
....
gettimeofday(&beforeTime, NULL);
for (i = 0; i < ITERATIONS; i++)
    makeBorder(image, border, cols, rows);
gettimeofday(&afterTime, NULL);
/* Execution time is expressed in microseconds */
executionTime = (afterTime.tv_sec - beforeTime.tv_sec) * 1000000
    + afterTime.tv_usec - beforeTime.tv_usec;
executionTime /= ITERATIONS;
...
The POSIX routine gettimeofday() reads the current time from the system
clock and stores it in a timeval structure whose fields define the number of
seconds (tv_sec) and microseconds (tv_usec) from the Epoch, that is, a
reference time which, for POSIX, is assumed to be 00:00:00 UTC, January 1,
1970.
The execution time measured in this way can be affected by several factors,
among which is the current load of the computer. In fact, the process
running the program may be interrupted during execution by other processes
in the system. Even after setting the priority of the current process to the
highest one, the CPU will be interrupted many times for performing I/O and
for operating system operations. Nevertheless, if the computer is not loaded
and the process running the program has a high priority, the measurement is
accurate enough.
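The subtraction of the two timeval values can be isolated in a helper (ours, for illustration), which also makes the microsecond arithmetic in the snippet above easy to check:

#include <sys/time.h>

/* Difference, in microseconds, between two timeval values taken before
   and after the measured computation. */
long elapsedMicroseconds(struct timeval before, struct timeval after)
{
    return (after.tv_sec - before.tv_sec) * 1000000L
         + (after.tv_usec - before.tv_usec);
}

Note that the microsecond term may be negative (when the later reading has a smaller tv_usec field); the formula still yields the correct total, e.g., from 1 s 500000 µs to 2 s 250000 µs the elapsed time is 750000 µs.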
We are now ready to start the optimization of our edge detection algo-
rithm. The first action is the simplest one: let the compiler do it. Modern
compilers perform very sophisticated optimization of the machine code that
is produced when parsing the source code. It is easy to get an idea of the
degree of optimization by comparing the execution time when compiling the
program without optimization (compiler flag -O0) and with the highest degree
of optimization (compiler flag -O3), which turns out to be 5–10 times shorter
for the edge detection routine. The optimization performed by the compiler
addresses the following aspects:
• Code motion: the computation of expressions that do not depend on the loop
variable, and which therefore would produce the same result at every loop
iteration, is moved outside the loop.
Observe that code reduction does not mean reduction in the size of the
produced program; rather, it reduces the number of instructions actually
executed during the program run. For example, whenever the number N of
loop iterations can be deduced at compile time (i.e., does not depend on
run-time information) and N is not too high, compilers often replace the
conditional jump instruction by concatenating N segments of machine in-
structions, each corresponding to the loop body. The resulting executable
program is longer, but the number of instructions actually performed is
lower, since the conditional jump instructions and the corresponding con-
dition evaluation are avoided. For the same reason, compilers can also per-
form inline expansion when a routine is called in the program. Inserting the
code of the routine in place makes the size of the executable program bigger,
but avoids the overhead due to the routine invocation and the passing of
the arguments.
a = 15 * i;
.....
}
a = 0;
The compiler then recognizes induction variables and replaces more com-
plex operations with additions. This optimization is particularly useful for
loop variables used as indexes in arrays; in fact, many computer ar-
chitectures define memory access operations (arrays are stored in memory
and are therefore accessed via memory access machine instructions such as
LOAD or STORE) which increment the passed memory index by a given
amount in the same memory access operation.
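The transformation can be illustrated by hand. The two functions below (our own example, not from the book) compute the same values; the second replaces the multiplication 15 * i with a running addition on an induction variable, mimicking what the compiler does automatically:

/* Straightforward version: one multiplication per iteration. */
void fillMultiply(int *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = 15 * i;
}

/* Strength-reduced version: the multiplication is replaced by an
   addition on the induction variable v, which advances by 15 at
   every iteration. */
void fillAdd(int *a, int n)
{
    int i, v = 0;
    for (i = 0; i < n; i++)
    {
        a[i] = v;
        v += 15;
    }
}
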
via pointers. We have already seen that a compiler can often maintain in a
register a copy of a variable stored in memory, so that the register copy can
be used instead. However, it is not possible to store in a register a memory
location accessed via a pointer and reuse it afterwards in place of the memory
location, because it is not possible to make sure that the memory content has
not been modified in the meantime. In fact, while in many cases the compiler
can analyze in advance how variables are used in the program, in general it
cannot do the same for memory locations accessed via pointers, because the
pointer values, that is, the memory addresses, are normally computed at run
time, and cannot therefore be foreseen during program compilation.
As we shall see shortly, there is still room for optimization in the edge de-
tection routine, but it is necessary to introduce first some concepts of memory
caching.
In order to speed up memory accesses, computers use memory caches. A mem-
ory cache is basically a fast memory, much faster than the RAM memory
used by the processor, which holds data recently accessed by the com-
puter. The memory cache does not correspond to any fixed address in the ad-
dressing space of the processor, and therefore contains only copies of memory
locations stored in the RAM. The caching mechanism is based on a common
fact in programs: locality in memory access. Informally stated, memory ac-
cess locality expresses the fact that if a processor makes a memory access,
say, at address K, the next access in memory is likely to occur at an address
that is close to K. To convince ourselves of this fact, consider the two main
categories of memory access in a program execution: fetching program
instructions and accessing program data. Fetching program instructions (re-
call that a processor has to read each instruction from memory in order to
execute it) is clearly sequential in most cases. The only exception is jump
instructions, which, however, represent a small fraction of the program
instructions. Data is mostly accessed in memory when the program accesses
array elements, and arrays are normally (albeit not always) accessed in loops
using some sort of sequential indexing.
Cache memory is organized in blocks (also called cache lines), which can be
up to a few hundred bytes large. When the processor tries to access a memory
location for reading or writing a data item at a given address, the cache
controller will first check whether a cache block containing that location is currently
present in the cache. If it is found, fast read/write access
is performed on the cached copy of the data item. Otherwise, a free block in
the cache is found (possibly writing back to memory an existing cache block if the
cache is full), and a block of data located around that memory address is first
copied from memory to the cache. The two cases are called Cache Hit and
Cache Miss, respectively. Clearly, a cache miss incurs a penalty in execution
time (the copy of a block from memory to cache), but, due to memory access
locality, it is likely that further memory accesses will hit the cache, with a
significant reduction in data access time.
The gain in performance due to the cache memory depends on the program
itself: the more local the memory access, the faster the program execution.
Consider the following code snippet, which computes the sum of the elements
of a MxN matrix.
double a[M][N];
double sum = 0;
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        sum += a[i][j];
In C, matrixes are stored in row-first order, that is, rows are stored sequen-
tially. In this case a[i][j] will be adjacent in memory to a[i][j+1], and the
program will access the matrix memory sequentially. The following code is also
correct, differing from the previous one only in the exchange of the two for
statements.
double a[M][N];
double sum = 0;
for (j = 0; j < N; j++)
    for (i = 0; i < M; i++)
        sum += a[i][j];
However, in this case memory access is not sequential, since matrix elements
a[i][j] and a[i+1][j] are stored in memory locations that are N elements
apart. In this case, the number of cache misses will be much higher than
in the former case, especially for large matrixes, affecting the execution time
of that code.
Coming back to routine makeBorder(), we observe that it is accessing
memory in the right order. In fact, what the routine basically does is to con-
sider a 3 x 3 matrix sweeping along the 480 rows of the image. The order
of access is therefore row first, corresponding to the order in which bytes
are stored in the image buffer. So, if bytes are being considered in a “cache
friendly” order, what can we do to improve performance? Recall that the
compiler is very clever in optimizing access to information stored in program
variables, but is mostly blind as regards the management of information stored
in memory (i.e., in arrays and matrixes). This fact suggests to us a possible
strategy: move the current 3 x 3 portion of the image being considered in the
Sobel filter into 9 variables. Filling this set of 9 variables the first time a line
is considered will require reading 9 values from memory, but at the follow-
ing iterations, that is, moving the 3 x 3 matrix one position left, only three
new values will be read from memory, the others already being stored in pro-
gram variables. Moreover, the nine multiplications and summations required
to compute the value of the current output filter can be directly expressed in
the code, without defining the 3 x 3 matrixes GX and GY used in the program
listed above. The new implementation of makeBorder() is listed below, using
the new variables c11, c12, . . . , c33 to store the current portion of the image
being considered for every image pixel.
void makeBorder(char *image, char *border, int cols, int rows)
{
    int x, y, sumX, sumY, sum;
    /* Variables to hold the 3x3 portion of the image used in the
       computation of the Sobel filter output */
    int c11, c12, c13, c21, c22, c23, c31, c32, c33;
    for (y = 1; y < rows - 1; y++)
    {
        /* Fill the nine variables with the 3x3 window centered in (1, y) */
        c11 = *(image + (y - 1) * cols);
        c12 = *(image + 1 + (y - 1) * cols);
        c13 = *(image + 2 + (y - 1) * cols);
        c21 = *(image + y * cols);
        c22 = *(image + 1 + y * cols);
        c23 = *(image + 2 + y * cols);
        c31 = *(image + (y + 1) * cols);
        c32 = *(image + 1 + (y + 1) * cols);
        c33 = *(image + 2 + (y + 1) * cols);
        for (x = 1; x < cols - 1; x++)
        {
            /* The Sobel convolutions are expressed directly, without the
               GX and GY matrices of the previous version */
            sumX = -c11 + c13 - 2 * c21 + 2 * c23 - c31 + c33;
            sumY =  c11 + 2 * c12 + c13 - c31 - 2 * c32 - c33;
            sum = abs(sumX) + abs(sumY);
            if (sum > 255) sum = 255;
            if (sum < THRESHOLD) sum = 0;
            /* Report the new pixel in the output image */
            *(border + x + y * cols) = 255 - (unsigned char)(sum);
            /* Move the window one position right: only three
               new values are read from memory */
            if (x < cols - 2)
            {
                c11 = c12; c12 = c13;
                c13 = *(image + x + 2 + (y - 1) * cols);
                c21 = c22; c22 = c23;
                c23 = *(image + x + 2 + y * cols);
                c31 = c32; c32 = c33;
                c33 = *(image + x + 2 + (y + 1) * cols);
            }
        }
    }
}
The resulting code is certainly less readable than the previous version but,
when compiled, it produces code that is around three times faster, because the
compiler now has more opportunities for optimizing the management of information,
memory access being limited to the essential cases.
In general, code optimization is not a trivial task and requires ingenuity and
a good knowledge of the optimization strategies carried out by the compiler.
Very often, in fact, the programmer experiences the frustration of getting no
advantage after working hard at optimizing his/her code, simply because the
foreseen optimization had already been carried out by the compiler. Since
optimized source code is often much less readable than nonoptimized code,
implementing a given algorithm while also taking care of possible code
optimization may be an error-prone task. For this reason, implementation
should be done in two steps:
• implement the algorithm in the most natural way, checking its correctness;
• optimize the code, and then check the correctness of the new code and the
amount of gained performance.

2.5 Finding the Center Coordinates of a Circular Shape

FIGURE 2.9
r and θ representation of a line.

A line in the image can be represented by the pair (r, θ), where r is the
distance of the line from the origin and θ is the angle of its normal, as
shown in Figure 2.9. The points (x, y) of the line satisfy the relation
r = x cos θ + y sin θ, which can be rewritten as
y = −(cos θ/sin θ) x + (r/sin θ) (2.5)
Imagine an image containing one line. After edge detection, the pixels
associated with the detected edges may belong to the line, or to some other
element of the scene represented by the image. Every such pixel at coordinates
(x0 , y0 ) is assumed by the algorithm to belong to a potential line, and the
(infinite) set of lines passing through (x0 , y0 ) is considered. For all such
lines, the associated parameters r and θ obey the following relation:
r = x0 cos θ + y0 sin θ (2.6)
FIGURE 2.10
(r, θ) relationship for points (x0 , y0 ) and (x1 , y1 ).
that is, a sinusoidal law in the plane (r, θ). Suppose now that the considered
pixel effectively belongs to the line, and consider another pixel at position
(x1 , y1 ) belonging to the same line. Again, for the set of lines passing through
(x1 , y1 ), their r and θ will obey the law
r = x1 cos θ + y1 sin θ (2.7)
Plotting (2.6) and (2.7) in the (r, θ) plane (Figure 2.10), we observe that the two
graphs intersect in (r0 , θ0 ), where r0 and θ0 are the parameters of the line
passing through (x0 , y0 ) and (x1 , y1 ). Considering every pixel on that line, all
the corresponding curves in the plane (r, θ) will intersect in (r0 , θ0 ). This suggests a
voting procedure for detecting the lines in an image. We must consider, in fact,
that in an image spurious pixels are present, in addition to those representing
the line. Moreover, the (x, y) position of the line pixels may not lie exactly at
the expected coordinates for that line. So, a matrix corresponding to the (r, θ)
plane, initially set to 0, is maintained in memory. For every edge pixel, the
matrix elements corresponding to all the pairs (r, θ) defined by the associated
sinusoidal relation are incremented by one. When all the edge pixels have been
considered, supposing a single line is represented in the image, the matrix
element at coordinates (r0 , θ0 ) will hold the highest value, and therefore it
suffices to choose the matrix element with the highest value, whose coordinates
will identify the recognized line in the image.
A similar procedure can be used to detect the center of a circular shape
in the image. Assume initially that the radius R of such a circle is known.
In this case, a matrix with the same dimensions as the image is maintained,
initially set to 0. For every edge pixel (x0 , y0 ) in the image, the circle of radius
R centered in (x0 , y0 ) is considered, and the corresponding elements in the
matrix incremented by 1. All such circles intersect in the center of the circle
in the image, as shown in Figure 2.11. Again, a voting procedure will allow
discovery of the center of the circle in the edge image, even in the presence of
spurious pixels and despite the approximate position of the pixels representing
the circle edges.
If the radius R is not known in advance, it is necessary to repeat the above
procedure for different values of R and choose the radius value that yields
the maximum count value for the candidate center.

FIGURE 2.11
Circles drawn around points over the circumference intersect in the circle
center.

FIGURE 2.12
A sample image with a circular shape.

Intuitively, this holds
because only when the considered radius is the right one will all the circles
built around the border pixels of the original circle intersect in a single point.
Observe that even if the effective radius of the circular object to be detected
is known in advance, the radius of its shape in the image may depend on
several factors, such as its distance from the camera, or even on the
illumination of the scene, which may yield slightly different edges in the
image. In practice it is therefore always necessary to consider a range of
possible radius values.
The overall detection procedure is summarized in Figures 2.12, 2.13, 2.14,
and 2.15. The original image and the detected edges are shown in Figures 2.12
and 2.13, respectively. Figure 2.14 is a representation of the support matrix
used in the detection procedure. It can be seen that most of the circles in the
image intersect in a single point (the others are circles drawn around the other
edges of the image); the detected center is then reported in the original image
in Figure 2.15.
The code of routine findCenter() is listed below. Its input arguments are
the radius of the circle, the buffer containing the edges of the original image
(created by routine makeBorder()), and the number of rows and columns. The
routine returns the position of the detected center and a quality indicator,
expressed as the normalized maximum value in the matrix used for center
detection. The buffer for such a matrix is passed in the last argument.
FIGURE 2.13
The image of Figure 2.12 after edge detection.
FIGURE 2.14
The content of the voting matrix generated from the edge pixels of Figure 2.13.
FIGURE 2.15
The detected center in the original image.
/* Black threshold:
   a pixel value less than the threshold is considered black. */
#define BLACK_LIMIT 10
void findCenter(int radius, unsigned char *buf, int rows, int cols,
    int *retX, int *retY, int *retMax, unsigned char *map)
{
    int x, y, l, m, currCol, currRow, maxCount = 0;
    int maxI = 0, maxJ = 0;
    /* Square roots needed for the computation are maintained in array sqr */
    static int sqr[2 * MAX_RADIUS];
    /* Hit counter, used to normalize the returned quality indicator */
    double totCounts = 0;
    /* The matrix is initially set to 0 */
    memset(map, 0, rows * cols);
    /* Compute the square root values for the current radius. They cannot be
       cached across calls, since every call may use a different radius */
    for (l = -radius; l <= radius; l++)
        /* integer approximation of sqrt(radius^2 - l^2) */
        sqr[l + radius] = sqrt(radius * radius - l * l) + 0.5;
    for (currRow = 0; currRow < rows; currRow++)
    {
        for (currCol = 0; currCol < cols; currCol++)
        {
            /* Consider only pixels corresponding to borders of the image.
               Such pixels are set by makeBorder() as dark ones */
            if (buf[currRow * cols + currCol] <= BLACK_LIMIT)
            {
                x = currCol;
                y = currRow;
                /* Increment the value of the pixels in the map buffer which
                   correspond to a circle of the given radius centered in
                   (currCol, currRow) */
                for (l = x - radius; l <= x + radius; l++)
                {
                    if (l < 0 || l >= cols)
                        continue; /* Out of image X range */
                    m = sqr[l - x + radius];
                    if (y - m < 0 || y + m >= rows)
                        continue; /* Out of image Y range */
                    map[(y - m) * cols + l]++;
                    map[(y + m) * cols + l]++;
                    totCounts += 2; /* Two more pixels incremented */
                    /* Update current maximum */
                    if (maxCount < map[(y + m) * cols + l])
                    {
                        maxCount = map[(y + m) * cols + l];
                        maxI = y + m;
                        maxJ = l;
                    }
                    if (maxCount < map[(y - m) * cols + l])
                    {
                        maxCount = map[(y - m) * cols + l];
                        maxI = y - m;
                        maxJ = l;
                    }
                }
            }
        }
    }
    /* Return the (x, y) position in the map which yields the largest value */
    *retX = maxJ;
    *retY = maxI;
    /* The returned quality indicator is expressed as the maximum pixel
       value in the map matrix */
    *retMax = maxCount;
}
As stated before, due to small variations of the actual radius of the circular
shape in the image, routine findCenter() will be iterated for a set of radius
values, ranging between a given minimum and maximum value.
When considering the possible optimization of the detection procedure,
we observe that every time routine findCenter() is called, it is necessary to
compute the square root values that are required to select the map elements
which lie on a circumference centered on the current point. Since the routine is
called for a fixed range of radius values, we may think of removing the square
root calculation at the beginning of the routine and passing an array of
precomputed values, prepared in an initialization phase for all the
considered radius values. This change would, however, bring very little
improvement in speed: in fact, only a few tens of square root computations (i.e.,
the pixel dimension of the radius) are carried out every time findCenter() is
called, a very small number of operations compared with the total number of
operations actually performed. A much larger improvement can be obtained by
observing that it is possible to execute findCenter() for the different radius
values in parallel instead of in sequence. The following code uses POSIX
threads, described in detail in Chapter 7, to launch a set of threads, each
computing the center coordinates for a given value of the radius. Every thread
can be considered an independent flow of execution for the passed routine. In a
multicore processor, threads can run on different cores, thus providing a drastic
reduction of the execution time because the code is executed effectively in parallel.
A new thread is created by the POSIX routine pthread_create(), which takes as
arguments the routine to be executed and the (single) parameter to be passed.
As findCenter() accepts multiple input and output parameters, it cannot be
passed directly as an argument to pthread_create(). The normal practice is to
allocate a data structure containing the routine-specific parameters and to
pass its pointer to pthread_create() through a support routine (doCenter()
in the code below).
After launching the threads, it is necessary to wait for their termina-
tion before selecting the best result. This is achieved using the POSIX routine
pthread_join(), which suspends the execution of the calling program un-
til the specified thread terminates, called in a loop for every created thread.
When the loop exits, all the centers have been computed, and the best can-
didate can be chosen using the returned arguments stored in the support
argument structures.
#include <pthread.h>
/* Definition of a structure to contain the arguments to be
   exchanged with findCenter() */
struct arguments {
    unsigned char *edges; /* Edge image */
    int rows, cols;       /* Rows and columns of the image */
    int r;                /* Current radius */
    int retX, retY;       /* Returned center position */
    int retMax;           /* Returned quality factor */
    unsigned char *map;   /* Support matrix for the voting procedure */
};
/* Support routine passed to pthread_create(): it unmarshals the
   argument structure and calls findCenter() */
static void *doCenter(void *arg)
{
    struct arguments *args = (struct arguments *)arg;
    findCenter(args->r, args->edges, args->rows, args->cols,
        &args->retX, &args->retY, &args->retMax, args->map);
    return NULL;
}
/* ... threads[], args[], minR, maxR, edges, rows and cols are
   declared in the enclosing code, not shown here ... */
    /* Launch one thread for every considered radius value */
    for (i = 0; i <= maxR - minR; i++)
    {
        args[i].edges = edges;
        args[i].r = minR + i;
        args[i].rows = rows;
        args[i].cols = cols;
        args[i].map = (unsigned char *) malloc(rows * cols);
        pthread_create(&threads[i], NULL, doCenter, &args[i]);
    }
2.6 Summary
In this chapter a case study has been used to introduce several important facts
about embedded systems. In the first part, the I/O architecture of computers
has been presented, introducing basic techniques such as polling, interrupts,
and Direct Memory Access.
The interface to I/O operations provided by operating systems, in par-
ticular Linux, has then been presented. The operating system shields all the