Embedded System - UG - Eng - 3rd Yr
EMBEDDED SYSTEMS
Author
Santanu Chattopadhyay
Professor,
Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur
Reviewer
Srinivasulu Tadisetty
Professor,
Kakatiya University, Warangal,
Vidyaranyapuri, Hanamkonda,
Warangal, Telangana
BOOK AUTHOR DETAIL
Santanu Chattopadhyay, Professor, Electronics and Electrical Communication Engineering, Indian Institute of
Technology, Kharagpur.
Email ID: [email protected]
1. Dr. Sunil Luthra, Director, Training and Learning Bureau, All India Council for Technical Education
(AICTE), New Delhi, India.
Email ID: [email protected]
2. Sanjoy Das, Assistant Director, Training and Learning Bureau, All India Council for Technical Education
(AICTE), New Delhi, India.
Email ID: [email protected]
3. Reena Sharma, Hindi Officer, Training and Learning Bureau, All India Council for Technical Education
(AICTE), New Delhi, India.
Email ID: [email protected]
4. Avdesh Kumar, JHT, Training and Learning Bureau, All India Council for Technical Education (AICTE),
New Delhi, India.
Email ID: [email protected]
December 2024
© All India Council for Technical Education (AICTE)
ISBN : 978-93-6027-184-8
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means,
without permission in writing from the All India Council for Technical Education (AICTE).
Further information about All India Council for Technical Education (AICTE) courses may be obtained from the
Council Office at Nelson Mandela Marg, Vasant Kunj, New Delhi-110070.
Printed and published by All India Council for Technical Education (AICTE), New Delhi.
Disclaimer: The website links provided by the author in this book are placed for informational, educational &
reference purpose only. The Publisher do not endorse these website links or the views of the speaker / content of
the said weblinks. In case of any dispute, all legal matters to be settled under Delhi Jurisdiction, only.
ACKNOWLEDGEMENT
The author is grateful to the authorities of AICTE, particularly Prof. T.G. Sitharam, Chairman; Dr.
Abhay Jere, Vice-Chairman; Prof. Rajive Kumar, Member Secretary; Dr. Sunil Luthra, Director;
and Reena Sharma, Hindi Officer, Training and Learning Bureau, for their planning to publish the
book on Embedded Systems and other subjects. I sincerely acknowledge the valuable contributions
of the book's reviewer, Dr. Srinivasulu Tadisetty, Professor, Kakatiya University, Warangal, for
going through the entire manuscript line by line and providing valuable suggestions for content
improvement. My sincere thanks to my wife Santana for being a constant source of inspiration
for this book, as well as several other books in the past. I would also like to mention our son
Sayantan, who always pushes me towards newer challenges to put my knowledge into black and
white, leading to this book as well.
This book is an outcome of various suggestions of AICTE members, experts and authors who
shared their opinions and thoughts on further developing engineering education in our country.
Acknowledgements are due to the contributors and workers in this field whose published books,
review articles, papers, photographs, footnotes, references and other valuable information
enriched me at the time of writing the book.
Santanu Chattopadhyay
PREFACE
Embedded Systems are computing systems present in electronic/electrical devices and appliances,
controlling their overall operation and often going unnoticed by the user. Unlike general computing
systems, an embedded computing system has a limited set of activities based upon the intended
system functionality. The design is highly constrained in terms of size, speed and power
consumption, along with a stringent time-to-market. The designer must have knowledge of
contemporary developments in hardware, software, networking, sensors and communication, as a
decision is needed regarding the suitable target architecture for the system.
This book attempts to provide a good understanding of various angles involved in the process.
Features of embedded systems, the design metrics, and the design flow for a complete system have
been discussed. Embedded processors differ from general ones in terms of features like I/O
handling, timers/counters, data converters etc. Microcontrollers are typically used in embedded
systems. One of the most popular microcontroller families, ARM, has been covered. Digital Signal
Processors (DSPs) have been discussed as the natural choice for signal processing applications. To
address performance criticality, hardware platforms like FPGA and ASIC have been
enumerated. Embedded hardware often contains nonconventional interfaces, such as the Serial
Peripheral Interface (SPI), Inter-Integrated Circuit (I2C), Infrared communication (IrDA),
Controller Area Network (CAN), Bluetooth etc. Traditional interfaces like RS-232 and the
Universal Serial Bus (USB) are also used.
Relatively large embedded applications like network switches, aviation equipment, missiles etc.
contain a number of interacting processes that work in real time. In order to ensure the satisfaction
of their requirements, a Real-Time Operating System (RTOS) is needed. Designing a process is
also challenging, as the code will have both computation and hardware interactions. This comes
under the broad paradigm of embedded programming.
For an embedded application, all its modules may not be equally critical in meeting the area,
performance and power constraints. To keep the overall system cost low, the target platform may
contain both hardware and software modules interacting with each other. The Hardware-Software
Codesign and Cosimulation paradigm has come up to address this partitioning.
The book is an outcome of the author’s experience of handling the theory and laboratory courses
on subjects like Microprocessors, Microcontrollers, and Embedded Systems for more than two
decades. It is expected that the book will be highly useful to the students and the subject teachers.
It may also be useful to anybody starting a career as an embedded system designer. Any
constructive suggestion to improve the quality of the book in its future editions is most welcome.
Santanu Chattopadhyay
OUTCOME BASED EDUCATION
For the implementation of outcome-based education, the first requirement is to develop an
outcome-based curriculum and incorporate outcome-based assessment into the education system.
Through outcome-based assessments, evaluators will be able to evaluate whether the students
have achieved the outlined standard, specific and measurable outcomes. With the proper
incorporation of outcome-based education, there will be a definite commitment to achieve a
minimum standard for all learners without giving up at any level. At the end of a programme
run with the aid of outcome-based education, a student will be able to arrive at the following
outcomes:
PO2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
PO5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
PO7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
PO8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
PO11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.
COURSE OUTCOMES
      PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO-1   3   3   3   3   2   2   1   1   1   1    1    2
CO-2   3   3   3   1   2   1   1   1   1   1    1    2
CO-3   3   3   3   3   3   1   1   1   1   1    1    2
CO-4   3   3   3   2   3   1   1   1   1   1    1    2
GUIDELINES FOR TEACHERS
To implement Outcome Based Education (OBE), the knowledge level and skill set of the students
should be enhanced. Teachers should take a major responsibility for the proper implementation
of OBE. Some of the responsibilities (not limited to) of the teachers in the OBE system may be as
follows:
• Within reasonable constraints, they should manage time to the best advantage of all students.
• They should assess the students only against defined criteria, without discriminating against
them on any other grounds.
• They should try to develop the learning abilities of the students to a certain level before they
leave the institute.
• They should try to ensure that all the students are equipped with quality knowledge as
well as competence after they finish their education.
• They should always encourage the students to develop their ultimate performance capabilities.
• They should facilitate and encourage group work and teamwork to consolidate newer approaches.
• They should follow Bloom's taxonomy in every part of the assessment.
Bloom’s Taxonomy
Level           Students’ ability to     Sample verbs          Assessment
Creating        create                   Design or Create      Mini project
Evaluating      justify                  Argue or Defend       Assignment
Understanding   explain the ideas        Explain or Classify   Presentation/Seminar
Remembering     recall (or remember)     Define or Recall      Quiz
GUIDELINES FOR STUDENTS
Students should take equal responsibility for implementing the OBE. Some of the
responsibilities (not limited to) for the students in OBE system are as follows:
• Students should be well aware of each UO before the start of a unit in each and every course.
• Students should be well aware of each CO before the start of the course.
• Students should be well aware of each PO before the start of the programme.
• Students should think critically and reasonably with proper reflection and action.
• Students’ learning should be connected and integrated with practical and real-life
consequences.
• Students should be well aware of their competency at every level of OBE.
LIST OF ABBREVIATIONS
DMA    Direct Memory Access
DMA    Deadline Monotonic Algorithm
DP     Instruction Dispatch
DPC    Deferred Procedure Call
DRAM   Dynamic Random Access Memory
DSP    Digital Signal Processor
DSR    Data Set Ready
DTE    Data Terminal Equipment
DTR    Data Terminal Ready
ECC    Error Correcting Code
ECU    Electronic Control Unit
EDF    Earliest Deadline First
EOC    End-of-Conversion
FeRAM  Ferroelectric Random Access Memory
FIQ    Fast Interrupt Request
FPGA   Field Programmable Gate Array
PCIe   Peripheral Component Interconnect Express
PCP    Priority Ceiling Protocol
PG     Program Address Generate
PIP    Priority Inheritance Protocol
PLA    Programmable Logic Array
PLD    Programmable Logic Device
PPM    Pulse Position Modulation
PR     Program Fetch Packet Receive
PS     Program Address Send
PSO    Particle Swarm Optimization
PW     Program Access Ready Wait
RISC   Reduced Instruction Set Computer
RMA    Rate Monotonic Analysis
RMS    Rate Monotonic Scheduling
ROM    Read Only Memory
RTL    Register Transfer Level
RTOS   Real-Time Operating System
RTS    Request To Send
RZ     Return-to-Zero
SAR    Successive Approximation Register
SCL    Serial Clock
SCLK   Serial Clock
SDA    Serial Data
SIMD   Single Instruction Multiple Data
SOC    System-On-a-Chip
SPI    Serial Peripheral Interface
SPSR   Saved Program Status Register
SRAM   Static Random Access Memory
SS     Slave Select
SSD    Solid State Drive
TWI    Two-Wire Interface
UI     User Interface
UML    Unified Modeling Language
USB    Universal Serial Bus
UX     User Experience
VLIW   Very Long Instruction Word
LIST OF FIGURES
Unit 1
Fig 1.1 : Conceptual view of Embedded System 4
Fig 1.2 : Embedded system design methodology 10
Fig 1.3 : Smart card physical structure 11
Fig 1.4 : Authentication protocol sequence 12
Fig 1.5 : Block diagram of digital camera chip 14
Fig 1.6 : Digital camera operation 15
Unit 2
Fig 2.1 : Generic embedded processor 25
Fig 2.2 : ARM7 processor block diagram 28
Fig 2.3 : ARM7 functional diagram 30
Fig 2.4 : CPSR register structure 34
Fig 2.5 : ARM registers in different modes 35
Fig 2.6 : Little-endian vs. big-endian representation 39
Fig 2.7 : Stack implementation in ARM 40
Fig 2.8 : Format of branch instruction 43
Fig 2.9 : Execution of swap instruction 44
Fig 2.10 : THUMB instruction processing 44
Fig 2.11 : Typical multiply-accumulate unit in DSP 47
Fig 2.12 : Typical DSP architecture 48
Fig 2.13 : C6000 DSP architecture 52
Fig 2.14 : Generic FPGA architecture 54
Fig 2.15 : SRAM based switching 55
Fig 2.16 : Antifuse structure 56
Fig 2.17 : Floating-gate transistor 56
Fig 2.18 : Crosspoint logic block 56
Fig 2.19 : Xilinx XC4000 Configurable Logic Block (CLB) 57
Fig 2.20 : Actel ACT1 logic block 58
Fig 2.21 : FPGA design flow 59
Fig 2.22 : Memory modules in embedded systems 62
Fig 2.23 : SRAM cell 63
Fig 2.24 : DRAM cell 63
Unit 3
Fig 3.1 : The SPI interface 74
Fig 3.2 : Connecting multiple slaves in SPI 75
Fig 3.3 : Data transfer through SPI interface 76
Fig 3.4 : I2C wires 77
Fig 3.5 : Timing diagram for I2C communication 78
Fig 3.6 : 9-pin connector for RS-232 80
Fig 3.7 : Tiered-star USB tree structure 82
Fig 3.8 : USB cable with connectors 85
Fig 3.9 : Type-A and Type-B connectors 85
Fig 3.10 : USB Type-C (a) Receptacle (b) Plug 86
Fig 3.11 : IrDA transmission 87
Fig 3.12 : IrDA encodings (a) Data bits “01010011” (b) RZ coding (c) 4PPM coding 87
Fig 3.13 : CAN bus structure 89
Fig 3.14 : Digital-to-analog converter 91
Fig 3.15 : Weighted resistors DAC 91
Fig 3.16 : R-2R ladder network DAC 92
Fig 3.17 : Analog-to-digital converter 93
Fig 3.18 : Stages in an ADC 94
Fig 3.19 : Delta-sigma ADC 95
Fig 3.20 : Flash ADC 96
Fig 3.21 : Successive approximation ADC 97
Fig 3.22 : Integrating ADC (a) Circuit (b) Output voltage 98
Fig 3.23 : PCI express bus connection scheme 99
Fig 3.24 : PCI express link structure 100
Fig 3.25 : Typical AMBA based microcontroller platform 100
Unit 4
Fig 4.1 : Result utility in firm real-time systems 113
Fig 4.2 : Result utility in soft real-time systems 114
Fig 4.3 : Periodic task behaviour 115
Fig 4.4 : Sporadic task behaviour 116
Fig 4.5 : Classification of scheduling algorithms 117
Fig 4.6 : RMS schedule of tasks in Example 4.1 120
Fig 4.7 : RMS schedule of tasks in Table 4.2 127
Fig 4.8 : EDF schedule of tasks in Table 4.2 127
Fig 4.9 : Domino effect with EDF scheduling 129
Unit 5
Fig 5.1 : Internal block diagram of the 8051 149
Fig 5.2 : Interfacing 8 DIP switches and 8 LEDs 157
Fig 5.3 : Four multiplexed 7-segment displays 158
Fig 5.4 : Interfacing ADC0808/0809 with the 8051 159
Fig 5.5 : Interfacing DAC0808 with 8051 162
Fig 5.6 : Block diagram of LPC214x 164
Unit 6
Fig 6.1 : Typical target architecture 177
Fig 6.2 : Cosimulation in Codesign process 179
Fig 6.3 : Cosimulation environment 182
Fig 6.4 : Abstract-level cosimulation 182
Fig 6.5 : Detailed-level cosimulation 183
Fig 6.6 : (a) Example task graph (b) A target architecture 191
Fig 6.7 : Example graph 194
Fig 6.8 : Role of functional partitioning 207
Fig 6.9 : Example call graph 207
Fig 6.10 : Call graph after granularity transformations 210
Fig 6.11 : Call graph after pre-clustering 210
Fig 6.12 : Call graph after 2-way partitioning 211
CONTENTS
Foreword iv
Acknowledgement v
Preface vi
Outcome Based Education viii
Course Outcomes x
Guidelines for Teachers xi
Guidelines for Students xii
List of Abbreviations xiii
List of Figures xv
Unit 2: Embedded Processors 22-70
Unit Specifics 22
Rationale 22
Pre-Requisites 23
Unit Outcomes 23
2.1 Introduction 24
2.2 Choice of Microcontroller 26
2.3 ARM Microcontroller 27
2.3.1 Structure of ARM7 Microcontroller 28
2.3.2 ARM7 Instruction Set Architecture 32
2.3.3 Exceptions in ARM 46
2.4 Digital Signal Processors 46
2.4.1 Typical DSP Architecture 48
2.4.2 Example DSP – C6000 Family 51
2.5 Field Programmable Gate Array – FPGA 53
2.5.1 Programming the FPGA 55
2.5.2 FPGA Logic Blocks 56
2.5.3 FPGA Design Process 58
2.6 Application Specific Integrated Circuit – ASIC 60
2.6.1 ASIC Design Flow 60
2.7 Embedded Memory 61
2.8 Choosing a Platform 64
Unit Summary 65
Exercises 66
Know More 69
References and Suggested Readings 70
3.1 Introduction 73
3.2 Serial Peripheral Interface 74
3.3 Inter-Integrated Circuit 76
3.4 RS-232 79
3.4.1 Handshaking 79
3.5 Universal Serial Bus – USB 80
3.5.1 USB Versions 81
3.5.2 USB Connection and Operation 81
3.5.3 USB Device Classes 83
3.5.4 USB Physical Interface 83
3.6 Infrared Communication (IrDA) 86
3.7 Controller Area Network (CAN) 88
3.8 Bluetooth 90
3.9 Digital-to-Analog Converter 90
3.9.1 DAC Performance Parameters 92
3.10 Analog-to-Digital Converter (ADC) 93
3.10.1 Delta-Sigma (ΔΣ) ADC 94
3.10.2 Flash ADC 95
3.10.3 Successive Approximation ADC 96
3.10.4 Integrating ADC 97
3.11 Subsystem Interfacing 98
3.11.1 PCI Express Bus 99
3.11.2 Advanced Microcontroller Bus Architecture (AMBA) 100
3.12 User Interface Design 102
Unit Summary 103
Exercises 104
Know More 107
References and Suggested Readings 107
Pre-Requisites 110
Unit Outcomes 110
4.1 Introduction 111
4.2 Types of Real-Time Tasks 112
4.2.1 Hard Real-Time Tasks 112
4.2.2 Firm Real-Time Tasks 113
4.2.3 Soft Real-Time Tasks 113
4.3 Task Periodicity 114
4.3.1 Periodic Tasks 114
4.3.2 Sporadic Tasks 115
4.3.3 Aperiodic Tasks 116
4.4 Task Scheduling 116
4.5 Rate Monotonic Scheduling 119
4.5.1 Schedulability in RMS 121
4.5.2 Advantages and Disadvantages of RMS 123
4.5.3 Aperiodic Server 124
4.5.4 Handling Limited Priority Levels 125
4.6 Earliest Deadline First Scheduling 126
4.6.1 Advantages and Disadvantages of EDF Policy 128
4.7 Resource Sharing 129
4.7.1 Priority Inheritance Protocol 130
4.8 Other RTOS Features 133
4.9 Some Commercial RTOS 135
4.9.1 Real-Time Extensions 135
4.9.2 Windows CE 136
4.9.3 LynxOS 136
4.9.4 VxWorks 137
4.9.5 Jbed 137
4.9.6 pSOS 137
Unit Summary 137
Exercises 138
Know More 141
References and Suggested Readings 142
Unit 5: Embedded Programming 143-173
Unit Specifics 143
Rationale 143
Pre-Requisites 144
Unit Outcomes 144
5.1 Introduction 145
5.2 Embedded Programming Languages 146
5.3 Choice of Language 147
5.4 Embedded C 148
5.5 Embedded C for 8051 149
5.5.1 Embedded C Data Types and Programming 151
5.5.2 Interfacing using Embedded C 156
5.6 Embedded C for ARM 162
5.6.1 LPC214x – An ARM7 Implementation 163
5.6.2 I/O Port Programming for LPC2148 164
Unit Summary 167
Exercises 168
Know More 172
References and Suggested Readings 172
6.3.2 Kernighan-Lin Heuristic Based Partitioning 190
6.3.3 Genetic Algorithm Based Partitioning 199
6.3.4 Particle Swarm Optimization Based Partitioning 200
6.3.5 Power-Aware Partitioning on Reconfigurable Hardware 202
6.3.6 Functional Partitioning 205
Unit Summary 211
Exercises 212
Know More 215
References and Suggested Readings 216
Embedded System Basics | 1
1 Embedded System
Basics
UNIT SPECIFICS
Through this unit we have discussed the following aspects:
• Defining an embedded system and its difference from other general electronic systems;
• Features of an embedded system;
• The metrics that an embedded system designer attempts to optimize;
• The step-by-step design process of an embedded system;
• Role of various synthesis and simulation tools that aid in the design process;
• Predesigned libraries to reduce the time to market for embedded systems;
• Three typical examples of embedded systems – smart card, digital camera and automobile –
have been elaborated.
The overall discussion in the unit gives an overview of embedded systems, starting with their
definition and illustrating their features. It has been assumed that the reader has a good
understanding of system design, including hardware and software design processes. A basic
understanding of tools and libraries will be helpful. A good number of examples have been
included to explain the concepts, wherever needed.
A large number of multiple choice questions have been provided with their answers. Apart
from that, subjective questions of short and long answer type have been included. A list of
references and suggested readings has been given, so that one can go through them for more
details. The section “Know More” has been carefully designed so that the supplementary
information provided in this part becomes beneficial for the users of the book. This section mainly
highlights the initial activity, the history of the development of embedded systems, recent
developments and application areas.
RATIONALE
This unit on embedded system basics helps the reader to understand the definition of an embedded
system, its features and design process. Starting with the conceptual view of an embedded system,
it goes into the details of important features that distinguish an embedded system from general
electronic systems. The designers target a specific set of goals to optimize while designing such
systems. The unit details all those aspects. The design process for any system goes through a set of
interdependent hierarchical steps. The unit explains the process of refining a system specification,
passing through several stages into a set of processes implemented either in hardware or in
software. The synthesis process passes the system specification (or its refinement) through several
tools. Some of these tools help in the synthesis process, whereas some others are used for
verification and testing. Predesigned libraries play a major role in reducing the turn-around time.
The unit takes the reader through an overview of all these steps to give a complete picture of the
embedded system design paradigm. To explain the functionality of embedded systems, three
examples have been taken up. The dedicated nature of these examples further acquaints the reader
with embedded systems. Thus, this unit gives a bird’s-eye view of embedded systems and
their design process.
PRE-REQUISITES
System Design
UNIT OUTCOMES
List of outcomes of this unit is as follows:
U1-O1: Define an embedded system
U1-O2: Differentiate embedded systems from general electronic systems
U1-O3: List features of embedded systems
U1-O4: Enumerate the metrics of embedded system design
U1-O5: Illustrate the design flow of embedded systems
U1-O6: List examples of embedded systems
      CO-1 CO-2 CO-3 CO-4
U1-O2   3    2    -    -
U1-O3   3    1    -    -
U1-O4   3    -    -    -
U1-O5   3    2    -    -
U1-O6   3    2    -    -
1.1 Introduction
Information Technology has made information processing the heart of modern electronic
systems. Over the decades, there have been significant changes in the devices employed for such
processing. Traditional systems possess separate computational and physical systems (such as,
mechanical, chemical etc.). For example, an algorithm running on a computer may decide upon the
concentration of various chemicals, temperature requirements etc. to synthesize an end product in
a chemical process. However, the actual chemical process is carried out separately. The two
systems here are distinct from each other and are easily differentiable by an outside observer. On
the other hand, a new class of instruments/equipment has emerged in which the computational
and physical systems have been integrated into a single one. Looking from outside, as a user of the
system, it is not possible to demarcate between its computational part and physical part. A typical
example of such a system is an automobile, which has a number of computational processors
integrated with its mechanical components. New electronic gadgets are being developed on all
fronts of life that put computation inside the gadget itself. The effort of embedding computational
elements inside the bigger devices, which is often unnoticed by the system user, has given rise to
the new class of devices, called embedded systems. Conceptual representation of an embedded
system has been shown in Fig 1.1.
[Fig 1.1: Conceptual view of an embedded system – the computational part connects to the physical system through input and output ports, exchanging signals with sensors and actuators]
The design goals of embedded systems differ significantly from those of general computational
systems. Embedded systems have very strict performance requirements, while meeting many
other design constraints. Hence, an embedded system may be defined as a computing device
contained within a larger electronic device, performing a single task, or a specific set of tasks
repeatedly, often going unnoticed by the user. Though this definition is not very precise, giving
a complete definition of an embedded system is quite difficult. In a wider sense, embedded
systems can be thought of as all computing systems excepting desktops and other higher-
configuration computers. Gadgets designed around embedded processors are available in plenty
around us. It can be predicted that each and every electrical device will eventually have some
computational component in it, if not already integrated. The following are a few example
domains for embedded systems.
• Automobiles (components like anti-lock brake, fuel-injection system etc.)
• Home appliances (like refrigerator, washing machine, microwave oven etc.)
• Consumer electronics (with devices like cell-phone, pager, video camera etc.)
• Office automation devices (like fax, printer, EPBX etc.)
• Guided missiles
• Aviation equipment
The list is ever increasing as newer domains are ventured into for electronic automation. The
systems are highly heterogeneous in nature. Thus, bringing them under the common umbrella of
embedded systems is a daunting task. The operating principles of the devices vary significantly.
However, the devices coming under embedded systems share certain common features, as
depicted in the next section.
5. Tight Constraints: The constraints put on an embedded system design are often much more
stringent than those on general-purpose systems. Typically, it should be a low-cost solution to
meet the computational needs, so that the overall system is cheap. Often, the non-engineering
constraints are given more importance in choosing the technology alternative to be adopted for
the system design. For example, to keep the system size small, the power budget must be low
for an autonomous system, so that a small battery may suffice for the requirement. Unlike
high-performance general-purpose systems, a separate cooling arrangement may not be
possible. Finally, to reduce the design time, predesigned hardware and software modules may
be utilized, which may not be optimized for the present application.
6. Real-Time: A real-time system responds to a request within a finite and fixed time. As
embedded systems are interactive in nature, most of them face the requirement of responding
in a time-bound manner. Failure to meet the deadline may or may not mean failure of the
system as a whole. Some systems, called hard real-time systems, must respond within the time
limit after an event has occurred. For example, a fire extinguisher must be actuated within a
fixed time after a fire detector detects a fire. On the other hand, in a live video streaming
application, if a few frames arrive late, the video may become jittery, but this is not considered
a system failure. In contrast, for a general-purpose system, it is only expected that the system
works fast enough after getting the user inputs. Systems with a relaxed timing requirement are
termed soft real-time ones.
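The distinction above can be sketched in code: a periodic task compares its completion time against its deadline, and whether a miss is fatal (hard real-time) or merely logged (soft real-time) depends on the system class. The function name and the 500-tick deadline below are illustrative assumptions, not taken from any particular RTOS.

```c
#include <stdint.h>

/* Hypothetical deadline budget for one job of a periodic task. */
#define DEADLINE_TICKS 500u

typedef enum { RESULT_OK, RESULT_DEADLINE_MISS } task_result_t;

/* Compare elapsed timer ticks against the deadline. A hard real-time
   supervisor would treat RESULT_DEADLINE_MISS as a system failure;
   a soft real-time one would only log it and continue. */
task_result_t check_deadline(uint32_t start_tick, uint32_t end_tick)
{
    uint32_t elapsed = end_tick - start_tick;  /* unsigned subtraction is
                                                  safe across timer wrap */
    return (elapsed <= DEADLINE_TICKS) ? RESULT_OK : RESULT_DEADLINE_MISS;
}
```

Note that the unsigned subtraction gives the correct elapsed count even when the free-running tick counter wraps around between the start and end of the job.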
7. Hybrid Structure: Most embedded systems contain both analog and digital components.
This is primarily because of the fact that in the environment, most of the physical
parameters are analog in nature, while the processing is on digital data in a computing
platform. Analog-to-Digital and Digital-to-Analog Converters (ADCs and DACs) are used
for interfacing. General computing systems take inputs in digital format and produce
outputs in digital format, as well.
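As a small illustration of the digital side of this interfacing, embedded code routinely converts raw ADC counts into engineering units. The sketch below assumes a hypothetical 10-bit converter and a reference voltage given in millivolts; the function name is an illustrative assumption.

```c
#include <stdint.h>

/* Scale a raw 10-bit ADC reading (0..1023) to millivolts.
   Integer arithmetic is used deliberately: many small
   microcontrollers have no floating-point hardware. */
uint32_t adc_to_millivolts(uint16_t raw, uint32_t vref_mv)
{
    return ((uint32_t)raw * vref_mv) / 1023u;  /* 1023 = 10-bit full scale */
}
```

A DAC driver would apply the inverse mapping, turning a computed value back into a count that produces the desired analog output level.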
8. Reactive: A reactive system interacts with the environment. Such a system possesses
internal states and transits between them on the occurrence of events in the environment.
Embedded systems are reactive in nature as they respond automatically to changes in the
environmental parameters. On the other hand, a proactive system is not interactive in nature.
Once initiated, a proactive system continues along a pre-decided execution sequence to
produce outputs. General-purpose systems without interrupt facilities are proactive in nature.
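The state-transition behaviour of a reactive system can be sketched as an event-driven routine: each environment event may move the system to a new internal state. The thermostat-style states, function name and temperature thresholds below are assumptions for illustration only.

```c
/* A minimal reactive controller: it idles until the temperature rises
   above HIGH_TEMP_C, cools until it drops below LOW_TEMP_C, and changes
   state only when an environment event (a new sample) arrives. */
typedef enum { STATE_IDLE, STATE_COOLING } state_t;

#define HIGH_TEMP_C 30
#define LOW_TEMP_C  25

void on_temperature_event(state_t *state, int temp_c)
{
    switch (*state) {
    case STATE_IDLE:
        if (temp_c > HIGH_TEMP_C) *state = STATE_COOLING;
        break;
    case STATE_COOLING:
        if (temp_c < LOW_TEMP_C) *state = STATE_IDLE;
        break;
    }
}
```

The gap between the two thresholds provides hysteresis, so the controller does not chatter between states when the temperature hovers near a single limit.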
5. Design Flexibility: The requirement and specification of a system may change over time,
after the system has been put into operation. The design should be flexible enough to
accommodate such change requests. A software implementation of the system is generally
much more flexible than a hardware-based one. Among hardware implementations too,
FPGA-based realizations can be modified with much less effort than an ASIC realization.
6. Design Turnaround Time: This is defined as the time taken from starting the design
specification to taking the product to market. Many consumer electronics products (such as
mobile phones, cameras, smart watches) have a very high rate of obsolescence. Hence, it is
imperative to keep the turnaround time small. To meet a stringent turnaround time, the
designer may have to go for off-the-shelf, predesigned components, foregoing the possibility
of designing optimized subcomponents targeted towards the application at hand. Design reuse
becomes the key to achieving this turnaround time reduction.
7. Maintainability: It corresponds to the ease of maintaining and monitoring the system after
it has been put into operation. Proper design documentation is necessary so that the design
can be easily understood by those intending to augment the system functionality, fix bugs
etc., even if they were not part of the initial design team.
captured completely in the specification. This is easy for a formal specification, compared
to an informal one. For example, if the specification is written in some programming
language like C/C++, it may be executed to confirm the input-output behavior of the
system to be designed. The system synthesis tools transform a formal specification into a
set of interactive sequential processes. The individual processes can then be realized by the
later synthesis steps. Individual processes may be implemented in software on a general-
purpose processor or in hardware, such as, FPGA and ASIC. The decision to put a process
in hardware or software is often determined by the availability of pre-designed modules,
which may significantly reduce the time-to-market for the product. These modules form
the system-level library consisting of complete solutions to some previous problems.
Verification of the system specification is carried out by tools commonly called model
simulators/checkers. Such tools model the entire specification using some mathematical
logic. The desired behavior of the system is also captured in a few logic formulas. The tools
verify whether the formulas are valid on the model or not. In case some formula turns out
to be false on the model, the tool generates a counter-example for the situation. This
counter-example can help the designer refine the specification, rectifying the error.
2. Behavioral Specification: The system synthesis tool generates behavioral specification for
the constituent processes. Each process is marked to be implemented either in software or
in hardware. While the software parts are to be implemented in general-purpose processors,
the hardware parts are to be realized by dedicated hardware modules. Verification at
behavioral specification level is carried out by hardware-software co-simulation. It may
be noted that a standalone hardware simulator can work only on the hardware parts, while
a software simulator can work only on the software processes. Thus, to check the correctness
of the combined system, a co-simulation of hardware and software parts is essential.
3. Register Transfer (RT) Specification: The behavioral specification is refined into register
transfer level specification by the behavioral synthesis tools. For the software processes to
be executed by general-purpose processors, the code is compiled/assembled into machine
language instructions. Since these instructions operate on machine registers only, the
machine language program itself is the register transfer specification for software
processes. On the other hand, for the processes to be realized in hardware, synthesis tools,
commonly known as high-level synthesis tools, are used which convert the behavioral
specification into a netlist of library components. The library typically contains
components such as registers, counters, ALUs etc. While the RT specifications of the
10 | Embedded Systems
software processes are directly executable, those of the hardware parts are
simulated by hardware simulators, normally used with hardware description languages like
VHDL and Verilog.
[Figure: Embedded system design flow — the system specification, verified by model simulators/checkers, is refined by the system synthesis tool using a HW/SW/OS system library; the behavioral stage is verified by hardware-software cosimulators, the register-transfer stage produced by the RT synthesis tool is verified by RT simulators, and the logic synthesis tool produces the final implementation, checked by gate-level simulators.]
[Figure: A smart card, showing the contact points and the embedded microchip.]
(ROM), static random access memory (RAM) and an electrically erasable programmable read-
only memory (EEPROM). As silicon may break easily, the chip must be within a few millimetres
in size.
A smart card works in conjunction with an external Card Acceptor Device (CAD) for power
and input/output information. To enhance the security of operation, the following standards are
adopted.
1. Data exchange rate between the chip and CAD is limited to 9600 bps while the data
exchange is controlled by the microchip processor. Also, the data transfer is half-duplex,
reducing the possibility of massive data attacks.
2. A strict authentication protocol has to be followed between the card and the CAD, as shown
in Fig 1.4. First, the card generates a random number rs and sends it to the CAD. The CAD,
on receiving the number, encrypts it with a symmetric key Ksc and transmits it back to the
card. The card also encrypts rs and compares the result with the encrypted value just
received from the CAD. The whole process is then repeated with the roles of the card and
the CAD reversed.
[Fig 1.4: The card–CAD mutual authentication protocol, starting with the card sending a random number rs to the CAD.]
3. After this initial authentication, message transfer between the card and the CAD can
proceed. However, each message transmitted is verified by Message Authentication Code
(MAC).
4. Standard encryption algorithms, such as DES, 3DES and RSA, can be used. Though these
algorithms are breakable, given the limited computing power of smart card processors,
their use is an acceptable trade-off.
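The challenge-response handshake of step 2 can be sketched in C. Note that the cipher below is a toy XOR "encryption" standing in for DES/3DES, and the function names are invented for illustration, not part of any smart card API:

```c
#include <stdint.h>

/* Toy symmetric "encryption" (XOR with a shared key) standing in
   for DES/3DES in the real protocol -- illustration only. */
static uint32_t toy_encrypt(uint32_t value, uint32_t key) {
    return value ^ key;
}

/* One direction of the challenge-response handshake: the card sends
   a random number rs; the CAD replies with E(rs) under its key; the
   card accepts the CAD only if its own encryption of rs matches. */
static int card_authenticates_cad(uint32_t rs, uint32_t card_key,
                                  uint32_t cad_key) {
    uint32_t reply_from_cad = toy_encrypt(rs, cad_key);
    return toy_encrypt(rs, card_key) == reply_from_cad;
}
```

In the full protocol the same exchange is then repeated with the roles reversed, so each side proves knowledge of the shared key Ksc without ever transmitting it.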
From the designer’s perspective, there are two tasks to be accomplished by the digital camera.
1. Processing of images and storing in memory. The process is initiated with the pressing of
the mechanical shutter. When the shutter is pressed, the image gets captured. The captured
image is then converted to digital form by the charge-coupled device (CCD) and ADC.
The image data is next compressed and archived in the memory.
2. Uploading of images to computer. For this, the camera is connected to a computer (via
interfaces). The images are transmitted from camera to computer under software control.
The flowchart corresponding to the first task has been shown in Fig 1.6. The charge-
coupled device is an image sensor containing a two-dimensional array of light-sensitive silicon
solid-state devices, also called cells. To capture an image, the electronic circuitry associated with
CCD, discharges the cells, activates the shutter and then reads the digital charge value of each cell.
Due to manufacturing errors, the value captured by the cells may be slightly above or below the
actual light intensity. The zero-bias adjustment rectifies such errors. The next step, Discrete Cosine
Transform (DCT), transforms each original 8×8 pixel block into the cosine frequency domain. In the
transformed representation, the upper-left corner values carry most of the image information,
Embedded System Basics | 15
whereas the lower-right corner values represent finer details. Thus, the lower-right corner values
can be stored with lower precision while retaining reasonable image quality. This is commonly
done by reducing the bit precision of the encoded data. The result often passes through Huffman
encoding (which is lossless in nature) to assign variable-length codes to the pixel values. The
resulting bit-stream corresponding to the 8×8 block is stored in memory and further such blocks
are processed.
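The quantization step — trading precision in the high-frequency coefficients for compression — can be sketched in C. The step-size rule used here (1 + row + column) is an invented simplification for illustration, not an actual JPEG quantization table:

```c
/* Quantize an 8x8 block of DCT coefficients by dividing each
   coefficient by a step size that grows toward the lower-right
   corner, so the high-frequency detail is stored with less
   precision than the upper-left (low-frequency) values. */
void quantize(int dct[8][8], int out[8][8]) {
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            out[r][c] = dct[r][c] / (1 + r + c);   /* toy step size */
}
```

With a real table, the divisors are tuned perceptually, but the effect is the same: many lower-right coefficients quantize to zero, which is what makes the subsequent Huffman encoding effective.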
Fig 1.6: Digital camera operation — CCD input, zero-bias adjustment, DCT, quantization and storing in memory, repeated until all blocks are processed.
1.5.3 Automobiles
Over time, cars have become safer, easier to control and more comfortable for the driver as
well as the passengers. Embedded systems have played a crucial role in the evolution of
automobiles. Electronic Control Units (ECUs) have become integral to supporting most of
the car features. These ECUs can be various microprocessors, microcontrollers, digital signal
processors, FPGAs etc. A high-end vehicle may have more than 100 such ECUs communicating
by means of around 5000 signals. A typical vehicle contains 25–35 such ECUs. Embedded
systems are being used in different units of an automobile, some of which are noted next.
UNIT SUMMARY
Embedded systems are computing systems often hidden within a large mechanical, electrical or
electronic system. However, their design goals differ significantly from a general-purpose
computing system as these systems are highly constrained in terms of size, allowable power
consumption, stringent time-to-market etc. The systems are optimized towards a single or a small
set of predefined functionalities only. Most embedded systems are reactive in nature and
interact a lot with the environment, often through quite nonconventional interfaces. The systems
must respond to events in a time-bound manner; missing deadlines may mean system failure.
Like any general system, the design process of an embedded system also starts with its
specification. The specification gets refined through a number of intermediary stages to result in
the final implementation. Keeping the cost and performance metrics in view, part of the
embedded system is realized in software. However, the speed-critical design modules may be
required to be implemented on some hardware platform, such as an FPGA or ASIC. A number of
synthesis and simulation tools take part in this transformation of specification into final
implementation. Many predesigned library modules are utilized by the synthesis tools to reduce the
turnaround time for the cycle from specification to final implementation. The example embedded
systems presented in this unit have brought out the salient features of such systems.
EXERCISES
Multiple Choice Questions
MQ1. Number of functions carried out by an embedded system is
(A) One (B) Two
(C) Fixed (D) No limit
MQ2. Interfaces to an embedded system are
(A) Conventional (B) Non-conventional
(C) Both conventional and non-conventional (D) None of the other options
MQ3. A dependable system should be
(A) Reliable (B) Maintainable (C) Safe (D) All of the other options
MQ4. Constraint(s) on an embedded system is/are
(A) Size (B) Cost (C) Power (D) All of the other options
MQ5. A hybrid system can have modules that are
(A) Analog (B) Digital (C) Reliable (D) Analog or Digital
MQ6. Reactive system
(A) Interacts with environment (B) Performs chemical reactions
(C) Gives measured reaction (D) None of the other options
MQ7. System cost is
(A) Recurring (B) NRE (C) Unit cost (D) None of the other options
MQ8. Size of embedded software is determined by the size of
(A) Code (B) Processor (C) Inputs (D) Outputs
MQ9. Most flexible design of embedded system is via
(A) Software (B) FPGA (C) ASIC (D) None of the other options
MQ10. Design turnaround time is defined as the time needed
(A) From specification to market (B) From design to market
(C) From specification to design (D) None of the other options
MQ11. Verification is carried out on
(A) Design (B) Unit (C) Both design and unit (D) None of the other options
KNOW MORE
Looking back into the history of embedded systems, the first such system was developed in 1960
for use in the Apollo Guidance System. It was developed by Charles Stark Draper at MIT. In 1965,
Autonetics developed the D-17B, an embedded computer used in the Minuteman missile guidance system.
The first embedded system in a vehicle appeared in the Volkswagen 1600, in the year 1968. The
vehicle used a microprocessor to control its electronic fuel injection system. Embedded systems
(particularly, in small and medium sized applications) became popular with the progress in
microcontrollers. In 1971, Texas Instruments developed the first microcontroller, the TMS1000, which
was immediately followed by the Intel 4004. Relatively larger embedded systems are based upon
processors hosting embedded operating systems. The first embedded OS, VxWorks, was released
by Wind River in 1987. VxWorks was eventually used as the operating system for the Mars
Pathfinder Space Mission in 1997. Versions of embedded Linux systems started appearing by the late
1990s. Mobile phones constitute another major category of embedded systems. The first all-in-one
embedded mobile device is the original iPhone introduced in 2007. The first Android phone was
introduced by T-Mobile (T-Mobile G1) in the year 2008. The embedded market is projected to grow
at a very rapid rate and is predicted to be more than 40 billion USD, by the year 2030.
Embedded Processors
UNIT SPECIFICS
Through this unit we have discussed the following aspects:
• Generic structure and features of an embedded processor;
• Guideline for choosing an embedded processor from different processing alternatives and
available microcontrollers;
• ARM microcontroller basics for embedded applications;
• Features of Digital Signal Processors for embedded processing;
• Usage of Field Programmable Gate Array as embedded hardware platform;
• Application Specific Integrated Circuit chips for high-speed customized embedded system
realization;
• Memory alternatives available to an embedded system designer.
The overall discussion in the unit is to give an overview of embedded processor and memory
alternatives available to a system designer. It is assumed that the reader has a basic understanding
of CPU architecture, memories and digital design.
A large number of multiple choice questions have been provided with their answers. Apart
from that, subjective questions of short and long answer type have been included. A list of
references and suggested readings have been given, one can go through them for more details. The
section “Know More” has been carefully designed, supplementary information provided in this
part will be beneficial for the users of the book. It highlights the history of the development of
embedded processors along with recent developments in this domain.
RATIONALE
This unit on embedded processors helps the readers to understand the generic and
customizable platforms that are available to carry out the computations in an embedded system.
Embedded Processors | 23
Starting with the generic architecture of an embedded processor and its distinguishing features
(compared to general-purpose processors), the unit takes up discussions on microcontrollers for
small embedded system design. One of the most widely used microcontrollers, the ARM, has been
explained. Though ARM processors can be used for signal processing applications, Digital Signal
Processors turn out to be an efficient, cost-effective alternative. They possess some typical
architectural features, markedly different from other processor classes, that make them highly
suitable for signal processing. Beyond a certain speed of operation, software-based solutions to an
embedded application problem may not work, necessitating exploration of hardware alternatives.
Among the hardware platforms, FPGAs are often considered capable of providing a good trade-
off between the Non-Recurring Engineering (NRE) cost of the system and the required performance
goal. However, due to the programmability of logic blocks and interconnects, it is difficult to achieve
a very high speed with reasonably low power consumption in FPGAs. For such applications, fully
customized Application Specific Integrated Circuits (ASICs) are to be designed to realize the
embedded systems. Different technological alternatives are also available to be used for embedded
memories. This unit takes the reader through all these processor and memory alternatives to equip
them with the capability to choose the appropriate platform for the application in hand. Once a
decision has been made to follow a particular platform, experts from that domain may be consulted
to come up with the best possible system implementation.
PRE-REQUISITES
Digital Design, Computer Architecture
UNIT OUTCOMES
List of outcomes of this unit is as follows:
U2-O1: Enumerate generic features of an embedded processor
U2-O2: Choose the embedded platform for realizing an embedded application
U2-O3: State the instruction set architecture of ARM processor
U2-O4: List the architectural features of Digital Signal Processors
U2-O5: Enumerate the design alternatives available with FPGAs
U2-O6: List the basic steps in ASIC design process
U2-O7: Enumerate the memory storage alternatives for embedded systems
[Table: Mapping of unit outcomes U2-O1 to U2-O7 against the expected course outcomes.]
2.1 Introduction
The processor is the heart of any electronic system. Peripheral devices attached to a system
either collect data from the environment or modify some parameters of the environment. However,
the action to be undertaken, corresponding to some event, is decided by the processor. Naturally,
the processor needs to be fast enough to respond to the occurrences of events. The speed of
operation of an embedded processor is decided by the speed requirement of the application. The
requirement may be a relaxed one, in which the processor gets enough time to process events. For
such systems, a readily available general-purpose processor may be the ideal choice. However, for
more stringent, time-critical embedded applications, dedicated processors may be necessary. For a
highly time-critical application, it may be required that the customized processor is implemented
directly as semiconductor chips (like FPGA, ASIC).
Compared to general processors, embedded processors require less power, as these are targeted to
only a small number of functions. Another significant departure from general processors, while
designing an embedded processor, is to have the peripherals on the main processor chip itself. This
eliminates the requirements of off-chip communication significantly, thus reducing the delay,
power consumption and heat generation in the system. An ordinary processor will need to have
separate memory and other peripheral chips (timers, I/O controllers etc.) to build the full system.
[Fig 2.1: Typical architecture of a generic embedded processor — CPU with on-chip ROM, RAM, counter/timer, serial port (USART), digital I/O, ADC and parallel bus interface.]
This increases the size of the printed circuit board for the system. Moreover, off-chip
communication also increases both delay and power consumption. Fig 2.1 shows the typical
architecture of a generic embedded processor. As can be noted from the figure, apart from the CPU,
the processor chip contains memory blocks (ROM and RAM) to hold the program code and data.
There are digital I/O pins to drive digital inputs and outputs. Serial interface, in the form of USART
is also in-built. To interact with the environment which is mostly analog in nature, Analog-to-
Digital Converters (ADCs) are implemented on-chip. Some special interfacing standards, such as,
Serial Peripheral Interface (SPI), Inter Integrated Circuit (IIC/I2C) are often made available on-
chip to cater to a large number of special devices having these interfaces. Parallel bus interface is
also available which may be utilized to connect to additional memory and peripherals, over and
above the available on-chip resources. On-chip timers are often included to help in real-time
processing.
Typical examples of such embedded processors are the microcontrollers. These are single-
chip computers with a relatively simple CPU equipped with timers and serial/parallel, digital/analog
input/output lines. On-chip program memory (ROM) of sufficient size is kept that can store the
code for small-to-medium complexity embedded systems. To help in storing temporary program
variables, a small read-write memory block is kept on-chip. An external bus interface may extend the
memory capacity beyond the on-chip limits. Some example microcontrollers are Intel 8051, ARM,
68HC11, ATmega, PIC, MSP-430 etc. Though targeted towards small applications, there are many
high-speed microcontrollers with dedicated features like security, wireless communication
and signal processing modules. It may be noted that for signal processing applications, there is a
dedicated class of embedded processors, called Digital Signal Processors (DSPs). At architectural
level, such processors include hardware for easy implementation of filtering algorithms, parallel
computation to reduce overall execution time, and so on. C6000 series of DSP from Texas
Instruments is an example of the same.
In the following, we shall look into an overview of each of the embedded processor
categories – microcontrollers, DSPs, FPGAs and ASICs. As there are a large number of
microcontrollers in this category, we shall focus on the ARM processors – one of the most
widely used processor series in today’s embedded system designs.
[Fig 2.2: Internal organization of the ARM7 processor — the address register and address incrementer feeding the address bus A[31:0]; a register bank of 31 × 32-bit registers and 6 status registers; Booth’s multiplier; barrel shifter; 32-bit ALU; and write data register, all coordinated over the A, B and ALU buses by the instruction decoder and control logic, which also drives the external control/status signals (MCLK, WAIT, EXEC, DATA32, BIGEND, PROG32, IRQ, FIQ, RESET, ABORT, RW, BW, OPC, TRANS, MREQ, SEQ, LOCK, CPI, CPA, CPB, M[4:0]).]
2. Instruction Decoder and Control Logic: This module controls the overall operation of the
ARM7 processor. Several internal and external control and status signals are generated
from this module. The external control signals are useful for interfacing the ARM7
processor with memory and peripherals. The signals have been discussed later.
3. Address Register: The address register holds the address of the next memory location to be
accessed by the processor. It may be accessing an instruction or data. The address bus A[31:0]
is the output of this register. The content of the register is made available on the address
bus as long as the ALE signal is held low by the external memory/device. The role of the
ALE signal is a marked difference in ARM, compared to the Intel family of processors. In
the Intel family, ALE is an output signal used by the external interface to separate out the
address bus from the data bus, whereas in ARM, the external device controls the
availability of the address on the bus by instructing the processor accordingly.
4. Address Incrementer: This module is responsible for incrementing the content of the Address
Register by a proper amount, so that it points to the next instruction/data, as per
requirement.
5. Register Bank: ARM contains a large number of registers; some of them are general-
purpose, while others are status registers. There are 31 32-bit registers accessible in
different modes of operation, and six status registers. The details of these
registers have been elaborated later.
6. Booth’s Multiplier: ARM7 uses a multiplication module developed around Booth’s
algorithm. It has been elaborated while discussing the multiplication instructions.
7. Barrel Shifter: The module can be used to shift the operand reaching the ALU for
arithmetic/logic operations. The operand can be shifted by a few bits by this module.
8. ALU: ARM7 uses a 32-bit Arithmetic Logic Unit (ALU) to carry out the arithmetic/logic
operations corresponding to different instructions.
9. Write Data Register: The register is used to hold the data to be written to the memory.
This is a 32-bit register, feeding the signal lines DOUT[31:0]. The other two signals, DBE
and ENOUT, have been illustrated later.
The control and status signals of ARM processor can be grouped into some categories. Fig 2.3
shows the functional block diagram of the ARM processor in which the signals have been grouped
as per their functionalities.
Processor mode signals M[4:0]
These are the status lines identifying the current mode of operation of the processor. The internal
status bits are inverted to output the M[4:0] lines (the signals are active low). The processor modes
have been illustrated later.
Clock signals MCLK, WAIT
MCLK is the master clock for the ARM processor. It has two phases – phase 1 (clock signal is low)
and phase 2 (clock signal is high). Either phase of the clock can be stretched for slower devices by
adjusting the ALE input. If the WAIT signal (active low) is low, the ARM processor is made to wait
for an integer number of MCLK cycles. Since these two signals are internally ANDed, WAIT must
change only when MCLK is low.
Memory interface signals A[31:0], DATA[31:0], DOUT[31:0], ENOUT, MREQ, SEQ, RW, BW, LOCK
[Fig 2.3: Functional block diagram of the ARM7 processor, with the signals grouped by function — clocks (MCLK, WAIT), configuration (PROG32, DATA32, BIGEND, EXEC), interrupts (IRQ, FIQ, RESET), bus controls (ALE, DBE), power (VDD, VSS), processor mode (M[4:0]), memory interface (A[31:0], DATA[31:0], DOUT[31:0], ENOUT, MREQ, SEQ, RW, BW, LOCK), memory management interface (TRANS, ABORT) and coprocessor interface (OPC, CPI, CPA, CPB).]
These signals are used for interfacing memory devices with the ARM processor. The lines A[31:0]
constitute the 32-bit address bus. The address becomes valid from the phase 2 of the previous clock
cycle, while the address is used in phase 1 of the reference clock cycle. DATA[31:0] is the input
32-bit data bus used for read operation. DOUT[31:0] is the output data bus for a memory write
The RISC features, supported by ARM, enhance performance via balanced pipelining. To enhance
the performance further, the following CISC features have been included in the instruction set.
1. Each instruction has the capability to utilize both shifter and ALU. This makes the
instructions more powerful as two operations could be clubbed in one instruction.
2. Auto-increment and auto-decrement addressing modes have been included in ARM. This
allows the index registers to be incremented/decremented automatically after accessing a
location. It is particularly useful in array references. The index can be updated while a load
or store operation is in progress.
3. Multi-byte load/store operations are permitted. Up to 16 registers may be updated by one
instruction. That is, 16 registers may be loaded from memory or stored into memory
via a single instruction. Though it violates the RISC principle of executing one instruction
per cycle, the availability of this feature is very helpful for efficient implementation
of procedure invocation, bulk data transfer and creating compact code.
4. Conditional execution is possible for each instruction. The machine code of any instruction
has the first four bits identifying the desirable values of some of the status flags. The
instruction will be executed only if the specified values match with the corresponding bits
of the status register. If the code does not match, the operation intended by the instruction
is not carried out and can be treated as equivalent to a ‘No operation’. Moreover, the
feature has the potential to eliminate small branches in the program code, thereby avoiding
pipeline stalls.
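How the 4-bit condition field gates execution can be sketched in C. The encodings used below (EQ = 0000, NE = 0001, GE = 1010, LT = 1011, AL = 1110) are the standard ARM condition codes, but this sketch handles only a few of the sixteen conditions:

```c
#include <stdbool.h>

/* Condition flags from the CPSR (bits 31..28). */
typedef struct { bool N, Z, C, V; } Flags;

/* Evaluate some of ARM's 4-bit condition codes against the flags;
   an instruction whose condition is false behaves as a no-op. */
bool condition_passes(unsigned cond, Flags f) {
    switch (cond) {
        case 0x0: return f.Z;          /* EQ: equal                 */
        case 0x1: return !f.Z;         /* NE: not equal             */
        case 0xA: return f.N == f.V;   /* GE: signed greater/equal  */
        case 0xB: return f.N != f.V;   /* LT: signed less than      */
        case 0xE: return true;         /* AL: always (the default)  */
        default:  return false;        /* other codes omitted here  */
    }
}
```

This is why, for example, a short if-then-else can often be compiled into a straight-line sequence of conditionally executed instructions with no branch at all.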
ARM7 Registers
There are sixteen general-purpose registers R0-R15 that can be used during the user-mode
operation in ARM processor. Out of these 16 registers, the register R15 acts as the program counter.
Unlike other processors, the program counter register R15 can also be manipulated as a general-
purpose one. The registers R13 and R14 also have some special roles, apart from being used as
general-purpose registers. The register R13 is conventionally used as the stack pointer. In fact, in
ARM, there is no dedicated stack defined in the architecture. However, programmers may
implement a stack using the available instructions; it is generally the R13 register that acts as the
stack pointer. In the absence of an explicit stack, implementing subprogram calls and returns becomes an
issue. For this, the register R14 is utilized to keep a copy of the return address during a procedure
call. For return, the programmer simply needs to copy R14 into R15. The register R14 is aptly
named as link register. However, this simple arrangement fails to implement nested procedure
calls. To support nested calls, the programmer needs to implement a stack separately using R13 as
stack pointer.
Fig 2.4: Structure of the CPSR register
Bit 31: N | Bit 30: Z | Bit 29: C | Bit 28: V | Bits 27–8: Unused | Bit 7: I | Bit 6: F | Bit 5: T | Bits 4–0: Mode
The processor stores its current status in another 32-bit register, called current program status
register (CPSR). It has four one-bit status flags, spanning over bits 31 to 28, namely negative (N),
zero (Z), carry (C) and overflow (V). These four bits identify the status of last instruction executed
by the processor. The execution of the current instruction may be conditionally dependent on these
four bits. The structure of the CPSR register has been shown in Fig 2.4. The bits 27 to 8 are left
unused. Bits 7 and 6 are for interrupt enable, bit 7 enables the IRQ interrupt whereas bit 8 enables
the FIQ interrupt. Bit 5, called T-bit is used to switch between the instruction sets used by ARM –
ARM-to-THUMB or THUMB-to-ARM. Bits 4 to 0 are dedicated to select one of the six execution
modes, noted next.
1. User mode. This is the mode in which the application code can be executed. The CPSR
register cannot be updated in this mode. To change mode, an exception has to be
generated.
2. Fast Interrupt Processing (FIQ) mode. The mode is entered on getting the FIQ interrupt.
It is a high speed interrupt handler, typically used for a single critical interrupt to the
system.
3. Normal Interrupt Processing (IRQ) mode. This is the low speed interrupt processing
mode. The mode is entered on getting interrupt on the IRQ line.
4. Supervisor mode. The mode is entered on executing a software interrupt instruction. On
reset, the processor enters into this mode. The mode is typically used to implement
operating system services.
5. Undefined Instruction mode. This mode is entered if the opcode of a fetched instruction
does not match any of the ARM or coprocessor instructions.
6. Abort mode. This mode is entered on occurrence of a memory fault. Typical example is
an instruction or data fetch from invalid memory region.
The registers in different modes have been shown in Fig 2.5. The registers R0 through R7
are accessible in all the modes of operation. The register CPSR is also common across the modes.
Apart from that, each mode has its own R13 and R14 registers, acting as stack pointer and link
register. Among the remaining registers, R8 to R12 and R15 are common to all modes, except
FIQ. The FIQ mode has its own set of R8 to R14 registers. It may be noted that the availability of
these additional registers in FIQ helps in faster context switching during interrupts. Except for the
user mode, the other modes contain a saved program status register (SPSR) that holds the value of
the CPSR register before reaching this mode.
Instruction Sets
There are two instruction sets supported by the ARM processor.
• ARM instruction set. This is the standard 32-bit instruction set used by the ARM processor.
The instructions may belong to one of the following categories – data processing, data
transfer, block transfer, branching, multiplication, conditional, software interrupts. The
individual categories have been elaborated next.
• THUMB instruction set. This is a compressed 16-bit instruction set. The THUMB
instructions are converted to equivalent ARM instructions before execution. Though the
performance suffers due to this, the resulting code density is better than that of most CISC
processors. The decompression is carried out dynamically in the pipeline.
Data Types
The ARM instructions support six different data types, as follows.
• 8-bit signed and unsigned
• 16-bit signed and unsigned
• 32-bit signed and unsigned
For these data types, the ARM architecture supports both little-endian and big-endian formats.
However, most of the ARM implementations realize the little-endian format.
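The difference between the two byte orders can be illustrated with a small C helper; `byte_at` is a hypothetical function written for this example, not an ARM facility:

```c
#include <stdint.h>

/* Which byte of a 32-bit word lands at byte offset k (0 = lowest
   address)?  A little-endian store places the least significant
   byte at the lowest address; a big-endian store places the most
   significant byte there. */
uint8_t byte_at(uint32_t word, int k, int big_endian) {
    int shift = big_endian ? 8 * (3 - k) : 8 * k;  /* pick the byte */
    return (uint8_t)(word >> shift);
}
```

For the word 0x11121314, a big-endian store puts 0x11 at the lowest address, while a little-endian store puts 0x14 there.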
where i is an 8-bit number between 0 and 255 while r is a 4-bit number between 0 and 15. ROR is
the rotate right operation. Some example values of n are 255 (with i = 255, r = 0), 256 (with i = 1,
r = 12) etc.
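Assuming the standard ARM rule that the immediate n is the 8-bit value i rotated right by 2r bit positions, the decoding can be sketched as:

```c
#include <stdint.h>

/* Decode an ARM data-processing immediate: the 8-bit value i
   rotated right by 2*r bit positions (r is a 4-bit field). */
uint32_t arm_immediate(uint8_t i, uint8_t r) {
    uint32_t amount = (2u * (r & 0xFu)) & 31u;  /* rotation: 0..30 */
    uint32_t v = i;
    if (amount == 0)
        return v;
    return (v >> amount) | (v << (32u - amount));  /* rotate right */
}
```

Checking the examples in the text: i = 255, r = 0 gives 255, and i = 1, r = 12 rotates 1 right by 24 positions, which wraps around to bit 8, giving 256.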
While executing a data processing instruction, the condition flags in the CPSR register may or may
not get updated depending upon the instruction variant used. For example, for the addition
operation, there are two instructions ADD and ADDS. While the ADD instruction does not affect
the status bits of CPSR, ADDS affects them. The feature adds flexibility for programmers, as the
status flags need not be checked immediately after the instruction that set them; some other
intervening operations may be carried out via instructions that do not affect the status flags. Some
example instructions of this category are noted next.
ADD R1, R2, R3 ; R1 ← R2 + R3
ADD R1, R2, R3, LSL #3 ; R1 ← R2 + (R3 × 8)
ADDS R1, R2, R3 ; R1 ← R2 + R3 and set status flags
For example, if R0 = 0x11121314, the bytes 0x11, 0x12, 0x13 and 0x14 occupy bit positions 31–24, 23–16, 15–8 and 7–0 of the register, respectively.
Stack operation
Though in conventional processors stack is implemented by the processor designers themselves, in
ARM, the stack implementation is left to the user programs. The stack is primarily used to realize
subprogram calls and returns, along with parameter passing. Leaving the stack implementation to
the user program gives the users the following options.
• Full stack vs. Empty stack. In a stack implementation, the programmer may decide to make
the stack pointer to point to the latest filled entry of the stack (called a full stack
implementation) or to the immediate empty slot where the next item may be stored (called
an empty stack implementation).
• Ascending stack vs. Descending stack. In a descending stack implementation, after an item
has been put into the stack, the stack pointer is decremented (stack descends towards the
lower addresses). In ascending stack implementation, after an item has been put into the
stack, the stack pointer is incremented so that the stack grows to the higher addresses.
ARM has different sets of instructions to realize the stacks based on above features. These
instructions are derived from LDM/STM by post-fixing the stack type, as follows.
• LDMFD/STMFD – to implement a full descending stack
• LDMFA/STMFA – to implement a full ascending stack
• LDMED/STMED – to implement an empty descending stack
• LDMEA/STMEA – to implement an empty ascending stack
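The four conventions can be contrasted by simulating a push of one word in C; the word-indexed memory model and function names below are invented for illustration:

```c
/* Simulate a push of one 32-bit word under each ARM stack
   convention.  mem[] models word-addressed memory and *sp is a
   word index into it. */
void push_full_descending(unsigned mem[], int *sp, unsigned v) {
    *sp -= 1;        /* move down first: SP ends on the new top item */
    mem[*sp] = v;
}
void push_empty_descending(unsigned mem[], int *sp, unsigned v) {
    mem[*sp] = v;    /* SP already points at an empty slot */
    *sp -= 1;
}
void push_full_ascending(unsigned mem[], int *sp, unsigned v) {
    *sp += 1;        /* stack grows towards higher addresses */
    mem[*sp] = v;
}
void push_empty_ascending(unsigned mem[], int *sp, unsigned v) {
    mem[*sp] = v;
    *sp += 1;
}
```

The pop operations are simply the mirror images; LDMFD/STMFD and the other variants bake the chosen convention into the load/store-multiple instruction itself.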
Fig 2.7 shows some example stack implementations using the LDM/STM instructions. It may
be noted that the lowest numbered register gets stored at the lowest memory address, irrespective
of the relative ordering of registers in the instructions.
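The full descending (FD) discipline can be sketched in a few lines of Python. This is a model, not real ARM state: the dictionary stands in for memory, and the helper names are invented for illustration. It also shows why the lowest-numbered register ends up at the lowest address regardless of list order.

```python
# Sketch: simulating ARM's full descending (FD) stack discipline in Python.
# The memory model and helper names here are illustrative, not real ARM state.

WORD = 4

def stmfd(memory, sp, regs):
    """STMFD SP!, {regs}: push registers on a full descending stack.
    The lowest-numbered register ends up at the lowest address."""
    for reg in sorted(regs, reverse=True):   # highest-numbered register stored first
        sp -= WORD                           # descending: SP moves down first
        memory[sp] = regs[reg]               # full: SP points at the stored word
    return sp

def ldmfd(memory, sp, names):
    """LDMFD SP!, {names}: pop words back; lowest register gets lowest address."""
    regs = {}
    for reg in sorted(names):                # lowest-numbered register loaded first
        regs[reg] = memory[sp]
        sp += WORD
    return sp, regs

memory, sp = {}, 0x1000
sp = stmfd(memory, sp, {"R1": 1, "R2": 2, "R3": 3, "R4": 4})
sp, regs = ldmfd(memory, sp, ["R4", "R1", "R3", "R2"])
print(regs)   # the order of registers in the list does not matter: R1..R4 get 1..4
```

Because the hardware orders the transfers by register number, listing the registers in a different order in STMFD and LDMFD still restores each register's original value.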
Block data movement
The LDM/STM instructions can be used to load/store multiple register contents using a single
instruction. This facility can be utilized to transfer data from a source memory block to a destination
memory block. Unlike the Intel architecture that allows direct transfers between the memory
blocks, in ARM it has to be done in two phases. A set of registers needs to be identified that can be
utilized as intermediaries in this transfer process. If k registers participate in the transfer,
first, 4k bytes of data is transferred from the source memory block to these k registers (using LDM
instruction), which are then transferred to the destination block (using STM instruction). The
process is repeated till all data in the source block are transferred to the destination. To enhance the
transfer process, ARM has provided the following variants of the LDM/STM instructions to
facilitate transfer even in the presence of overlapping between source and destination blocks.
• STMIA/LDMIA – Increment after transfer
• STMIB/LDMIB – Increment before transfer
• STMDA/LDMDA – Decrement after transfer
• STMDB/LDMDB – Decrement before transfer
The following code fragment transfers data from a source block pointed to by the register R12 to a
destination block pointed to by the register R13. The end of the source block is pointed to by the
register R14. The registers R0 to R11 are used as the intermediary ones. Thus, in one LDM/STM
instruction, 48 bytes of data can be transferred.
Loop: LDMIA R12!, {R0-R11}   ; load 48 bytes from the source block
      STMIA R13!, {R0-R11}   ; store them to the destination block
      CMP   R12, R14         ; end of source block reached?
      BNE   Loop
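The two-phase copy performed by the loop above can be modelled as follows. This is a sketch; the memory layout and function name are invented for illustration, and, like the assembly loop, it assumes the block size is a multiple of 48 bytes.

```python
# Sketch: the two-phase LDM/STM block copy above, modelled in Python.
# Twelve 4-byte "registers" move 48 bytes per loop iteration.

def block_copy(mem, src, dst, src_end):
    """Copy words from mem[src:src_end] to dst using 12 intermediary registers."""
    while src != src_end:
        regs = mem[src:src + 48]        # LDMIA R12!, {R0-R11}: load 48 bytes
        src += 48
        mem[dst:dst + 48] = regs        # STMIA R13!, {R0-R11}: store 48 bytes
        dst += 48
    return mem

mem = bytearray(range(48)) + bytearray(48)   # source block followed by destination
block_copy(mem, 0, 48, 48)
print(mem[48:96] == mem[0:48])               # True: block copied
```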
from the user mode. In the execution of SWI instruction, the CPU branches to a vectored location
identified in the exception handler table (discussed later). The exception handler routine can
analyse the value of n and determine the action to be performed. It is important to note that for all
values of n, the CPU branches to the same exception handler. This is a marked difference from the
Intel family of processors in which each software interrupt instruction is handled by a different
handler routine. Also, the typical value of n for Intel processors is an 8-bit quantity. Thus, in ARM,
the number of such services is much larger (2²⁴, compared to 2⁸ for Intel). These services are
typically utilized by the operating system designers to provide OS services.
[Fig 2.8 layout: bits 31–28 Condition | bits 27–25 = 101 | bit 24 = L | bits 23–0 = Offset]
L: Link bit, 0 for branch, 1 for branch with link
Fig 2.8: Format of branch instruction
The branching policy is to have a relative branch with respect to the current program counter
(PC) value. In this relative branching, the target address is specified as offset from the current value
of PC. During execution, the offset is added to the PC so that the PC points to the target
instruction. The offset is a signed 26-bit number, restricting the target to within ±32MB
from the current location. Conditional branches are formed by specifying the condition codes with
the instruction. The format of branch instruction has been shown in Fig 2.8. For B instruction, the
L-bit is 0, for BL it is 1. It may be noted that though the actual offset for branch is 26-bit, in the
instruction only 24 bits are stored. Since the instructions are always word aligned (32-bit wide),
the least significant two bits of the address of any instruction are also zeros. Thus, in the difference
computed, these two bits are also zero and hence are not stored in the instruction. During execution,
the 24-bit offset specified in the instruction is left shifted by two bits to get 26-bit offset with two
least significant bits 0. This shifted offset is added to the PC to get the target address.
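The offset handling can be sketched as follows, assuming the classic ARM behaviour in which the PC reads two instructions (8 bytes) ahead of the executing branch; the helper names are invented for illustration.

```python
# Sketch: forming and applying the 24-bit branch offset, assuming the classic
# ARM pipeline where PC reads 8 bytes ahead of the branch instruction.

def encode_offset(branch_addr, target_addr):
    """Word-aligned byte offset, shifted right by 2 to fit in 24 bits."""
    offset = target_addr - (branch_addr + 8)     # PC is branch address + 8
    return (offset >> 2) & 0xFFFFFF              # only 24 bits are stored

def decode_target(branch_addr, offset24):
    """Sign-extend the 24-bit field, shift left by 2, add to PC."""
    if offset24 & 0x800000:                      # negative offset
        offset24 -= 1 << 24
    return branch_addr + 8 + (offset24 << 2)

enc = encode_offset(0x8000, 0x8100)
print(hex(enc), hex(decode_target(0x8000, enc)))
```

Round-tripping an address through encode and decode recovers it exactly, both for forward and backward branches.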
Another class of branch instructions is called ‘branch with exchange’, denoted as ‘BX’ or
‘BLX’. These are similar to the instructions ‘B’ and ‘BL’ respectively, except that the BX/BLX
instructions can also change the instruction set from ARM to THUMB or vice versa.
[Fig 2.9 diagram: memory word at the address in Rn copied to a temporary location (step 1), Rm written to that memory location (step 2), temporary value written to Rd (step 3)]
In execution of the swap (SWP) instruction, first the content of the memory location pointed to by Rn is copied
to a temporary location. Next, the content of register Rm is copied into the memory location pointed
to by Rn. Finally, the content of the temporary location is copied into the register Rd. Thus,
if the registers Rm and Rd are the same, the instruction effectively exchanges the content of this register
with the memory location pointed to by Rn. Fig 2.9 shows the sequence of operations in the swap
instruction.
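The three steps can be sketched in Python. This is a model, not real hardware: the dictionaries stand in for memory and the register file, and the function name is invented for illustration.

```python
# Sketch: the three-step semantics of the swap instruction described above,
# with a dict standing in for memory. Names are illustrative.

def swp(memory, regs, rd, rm, rn):
    temp = memory[regs[rn]]       # step 1: memory word to a temporary location
    memory[regs[rn]] = regs[rm]   # step 2: Rm written to memory
    regs[rd] = temp               # step 3: temporary value written to Rd
    # With rd == rm this atomically exchanges a register with memory,
    # which is the basis of simple locks and semaphores.

memory = {0x100: 0}               # 0 = lock free
regs = {"R0": 1, "R1": 0x100}     # R0 = 1 means "take the lock"
swp(memory, regs, "R0", "R0", "R1")
print(regs["R0"], memory[0x100])  # 0 1: old lock value read, lock now taken
```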
The THUMB instruction set contains compressed 16-bit instructions. Assuming that the memory
word size is 32-bit wide, one memory word access gives two instructions to the CPU. However,
the THUMB instructions are not executed directly. Rather, such an instruction is first converted to
its equivalent 32-bit ARM instruction which is then executed by the CPU. The situation has been
explained in Fig 2.10. It may be noted that the switching between the ARM and the THUMB
instruction sets can be done by executing a BX/BLX instruction. The major differences in THUMB
instructions, compared to ARM are as follows.
1. Excepting branch, all THUMB instructions are executed unconditionally.
2. The instructions have unrestricted access to the registers R0-R7 and R13-R15. Only a few
instructions can access the registers R8-R12.
3. The instructions are more like those of conventional processors. For example, the PUSH and POP
instructions are back for stack operations. A descending stack is implemented with the
stack pointer hardwired to R13.
4. No MSR/MRS instructions are supported.
5. The number of SWI calls is restricted to 256, down from 2²⁴.
6. On reset or on exception, the processor enters the ARM instruction mode.
The THUMB instruction set provides several advantages over the ARM instruction set,
particularly for resource constrained (in terms of area, memory requirement, power requirement
etc.) embedded system designs. The THUMB code requires on an average 30% less space,
compared to ARM code. However, THUMB instructions require decompression before execution.
If the memory is organized as 32-bit words, the ARM code is 40% faster than THUMB. On the
other hand, if the memory is organized as 16-bit words, THUMB code is around 45% faster than
the ARM code. The savings come from the reduction in the number of memory accesses needed
for THUMB code. The maximum benefit of THUMB comes from the power-savings angle.
THUMB code uses up to 30% less power than ARM code, making THUMB the choice for low-power,
simple embedded system designs. To strike a balance between
performance and power consumption, in a typical embedded system, the speed-critical operations
should be coded in ARM instruction set with the program being hosted in 32-bit on-chip memory.
On the other hand, for non-critical routines, THUMB instruction set should be used with 16-bit off-
chip memory.
It may be noted that the vector address 0x00000014 has not been used; it is skipped to ensure
backward compatibility. Also, the FIQ interrupt has been assigned the highest address, so that the
corresponding service routine can start from that address itself. This eliminates the need for a
further jump instruction from the vector address to the interrupt service routine, saving interrupt
response time.
class of processors, called Digital Signal Processors (DSPs), has been designed to cater to signal
processing computations efficiently. In general, DSPs provide low-cost, high-performance and low-
latency solutions. The power requirement is also low, reducing the battery size needed and making them
more suitable for mobile applications.
[Fig 2.11 diagram: input samples x[0]…x[n] and coefficients a[0]…a[n] feed a multiplier, whose products are accumulated by an adder to produce outputs y[0]…y[n]]
Fig 2.11: Typical multiply-accumulate unit in DSP
DSPs are microprocessors optimized to carry out the signal processing tasks. A typical such
signal processing task is the Finite Impulse Response (FIR) filtering operation. The FIR filter has
input signal x, output signal y and a finite set of filter coefficients a0, a1, a2, …. The nth output y[n]
of the filter is given by,
y[n] = a0 x[n] + a1 x[n − 1] + ⋯ + an x[0]
The coefficients a0, a1, … an form the kernel of the filter. The filtering operation is computationally
intensive with a number of multiplication and addition operations proportional to the size of the
kernel. Time required in the filtering operation may become crucial, particularly for real-time
applications. To carry out this type of repetitive multiplication and addition operations, the DSPs
often contain a multiply-accumulate (MAC) unit in their architecture. A typical MAC structure has
been shown in Fig 2.11.
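The repeated multiply-accumulate at the heart of the FIR computation can be sketched as follows; the function and variable names are invented for illustration.

```python
# Sketch: the FIR equation y[n] = a0*x[n] + a1*x[n-1] + ... + an*x[0] as the
# repeated multiply-accumulate (MAC) operation a DSP performs in hardware.

def fir_output(a, x, n):
    """Compute y[n] for kernel a and input samples x."""
    acc = 0                       # the accumulator of the MAC unit
    for i in range(n + 1):
        acc += a[i] * x[n - i]    # one multiply-accumulate step per tap
    return acc

a = [1, 2, 3]                     # filter kernel a0, a1, a2
x = [4, 5, 6]                     # input samples x[0], x[1], x[2]
print(fir_output(a, x, 2))        # 1*6 + 2*5 + 3*4 = 28
```

A MAC unit performs each `acc += a[i] * x[n - i]` step in a single cycle, which is why the kernel size directly bounds the per-sample filtering time.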
[Fig 2.12 diagram: program memory and data memory connected through a program sequencer with an instruction cache; a datapath with multiplier, ALU and shifter; an I/O controller with high-speed I/O]
The internal modules of a typical DSP architecture have been shown in Fig 2.12. One
important feature of this architecture is the presence of Address Generation Units (AGUs). These
units can work in parallel, independent of the main datapath operations. Once configured, the
AGUs can operate in the background producing addresses to prefetch instructions and data
operands. The datapath of the processor contains a large number of registers along with one or
more instances of functional modules, such as, multiplier, ALU and shifter. Some important
features of the DSP architecture are noted in the following.
1. Bit-reversed addressing: Many of the DSP algorithms work with a buffer full of data. With
the arrival of the next data, the buffer is adjusted to accommodate the most recent one while
discarding the oldest one. If organized as a linear buffer, this essentially requires shifting all
the older data items to create space for the newest one. The operation is time consuming. An
alternative to this is to keep data in a circular buffer. In a circular buffer, instead of moving the
data items through the buffer locations, a pointer is moved through the data. This also needs
the pointer to jump to the other end of the buffer, once it has reached one end of the buffer. If
implemented in software, this requires checking the pointer value at every update of it. Bit-
reversed addressing provides hardware support so that the pointer value need not be checked
in software. In normal pointer arithmetic, the carry propagates to the left causing the value to
grow till the highest number is reached. In bit-reversed arithmetic, the carry propagates to the
right. Carries from the rightmost bit are ignored.
An example of bit-reversed addressing is as follows. Assume that the buffer size is 8 while
an index register has been set to 4. Further, assume that the buffer starts at memory address
0x100. The following table shows the addresses computed in successive increment operations
in steps of the values of index register. The three least significant bits of result at each step
have been noted. At each increment step, a unique address has been generated, reaching the
same address after eight increments.
Address (Hex)   3 LSBs            Comment
0x100           000
0x104           000 + 100 = 100
0x102           100 + 100 = 010   Carry propagates to the right
0x106           010 + 100 = 110
0x101           110 + 100 = 001   Carry falls off the right side
0x105           001 + 100 = 101
0x103           101 + 100 = 011
0x107           011 + 100 = 111
0x100           111 + 100 = 000   Carry falls off the right side
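The reverse-carry increment can be sketched as follows (the helper names are illustrative): reversing the bits, adding normally, and reversing back is equivalent to propagating the carry to the right.

```python
# Sketch: reverse-carry ("bit-reversed") increment for the 8-entry buffer
# example above: 3 address bits, step 4 (binary 100), base address 0x100.

def bit_reversed_add(value, step, bits):
    """Add with the carry propagating right; carries off bit 0 are dropped."""
    rev = lambda v: int(format(v, f"0{bits}b")[::-1], 2)  # reverse the bits
    return rev((rev(value) + rev(step)) % (1 << bits))    # ordinary add, reversed

offset, addresses = 0, []
for _ in range(8):
    addresses.append(0x100 + offset)
    offset = bit_reversed_add(offset, 0b100, 3)
print([hex(a) for a in addresses])   # visits each of 0x100-0x107 exactly once
```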
2. VLIW architecture: VLIW stands for Very Long Instruction Word. In a VLIW architecture, a
number of instructions are executed in parallel by multiple execution units present in the
processor. The instruction word is much larger, so that multiple instructions can be issued
simultaneously. Such DSPs are called multi-issue DSPs. Multiple, non-conflicting instructions
are executed simultaneously. The scheduling of instructions is determined at compile-time,
rather than runtime. This makes the execution time predictable. However, this requires high
memory bandwidth, as the instructions executing simultaneously will need a number of operands
from memory. Power consumption also goes up. Due to the scheduling complexity, assembly
language programming for DSP is generally not recommended, rather the programmers are
encouraged to write their programs in high-level language and use optimizing compiler specific
to the DSP chip to get efficient implementation.
3. SIMD architecture: Single Instruction Multiple Data provides data-level parallelism. A single
instruction is applied on a number of data sets to produce output in parallel, enhancing
performance. To ensure the availability of a good number of such parallel operations, loop-
unrolling is often employed. For example, consider the for-loop,
for j = 1 to 5 do a[j] = b[j] + c[j]
The loop when unrolled, gives rise to the following set of five statements that can be executed
in parallel.
a[1] = b[1] + c[1]
a[2] = b[2] + c[2]
a[3] = b[3] + c[3]
a[4] = b[4] + c[4]
a[5] = b[5] + c[5]
4. Fixed-point vs. Floating-point DSP: The fixed-point DSPs usually represent real numbers in
16 bits. They are relatively cheap. On the other hand, floating-point DSPs are costly and use a
minimum 32-bit representation for real numbers. The precision of floating-point numbers
is much higher than the fixed-point numbers. This provides better signal-to-noise ratio in the
case of floating-point numbers.
5. Saturation arithmetic: This is a special technique to handle overflow and underflow
situations in DSP. In standard binary arithmetic, in case of an overflow or underflow, the wrap-around
value is returned. For example, if the numbers 0111₂ and 1001₂ are added, the result is
the 5-bit number 10000₂. If the register is only 4 bits wide, the content becomes 0000₂. In
saturation arithmetic, a value is returned which is as close as possible to the actual result. For
the previous case, it returns the value 1111₂. The strategy is particularly useful
for audio and video applications, since humans cannot distinguish between the true value
and the largest value that can be represented in such signals. It also prevents generation of
exceptions in many real-time situations; for example, returning a non-zero value prevents a
possible division-by-zero fault.
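A minimal sketch contrasting wrap-around with saturating addition for the 4-bit example above; the function names are invented for illustration.

```python
# Sketch: 4-bit unsigned saturating addition versus wrap-around, matching the
# 0111 + 1001 example above.

def add_wrap(a, b, bits=4):
    return (a + b) & ((1 << bits) - 1)        # wrap-around: keep low bits only

def add_saturate(a, b, bits=4):
    return min(a + b, (1 << bits) - 1)        # clamp at the largest value

a, b = 0b0111, 0b1001                          # 7 + 9 = 16, which needs 5 bits
print(bin(add_wrap(a, b)), bin(add_saturate(a, b)))   # 0b0 0b1111
```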
6. Large accumulator register: DSPs generally utilize an accumulator register that is 2-3 times
larger than the word size. This helps in taking care of quantization noise accumulation. In
repetitive calculations, errors get accumulated in successive operations. Thus, larger
accumulator can help in avoiding the round-off noise.
7. Zero overhead loops: Any loop execution consists of the following components –
initialization, termination check, loop body and loop index update. Out of these, the loop body
is the only part that serves the computational requirement of the problem. The other parts are
aptly called overheads. In a general processor, instructions are to be used to code these
overhead parts also, over and above the loop body. In DSPs, often hardware support is provided
to implement the overhead parts. When supported by hardware, no instruction is spent for
coding the overhead parts – hardware takes care of all such operations. Thus, zero overhead
loops help in faster execution of application code.
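The overhead components are easy to see in an ordinary software loop; in a DSP hardware loop, only the body line would cost instructions, with the remaining parts handled by dedicated loop hardware.

```python
# Sketch: the four components of a loop. In a general-purpose processor each
# one costs instructions; a zero-overhead loop executes only the body.

total = 0
i = 0                 # initialization      (overhead)
while i < 8:          # termination check   (overhead)
    total += i        # loop body           (useful computation)
    i += 1            # loop index update   (overhead)
print(total)          # 28
```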
in C/C++ with optimizing compilers provided by Texas Instruments. Fig 2.13 shows the C6000
DSP core structure. It has the following components.
[Fig 2.13 diagram: program memory feeding the instruction fetch unit over a 256-bit bus; two 64-bit buses connect the datapath to data memory]
1. Eight parallel functional units. Due to the presence of eight such units, up to eight instructions
without data dependency can be executed in parallel. The data path elements are noted next.
• .D (.D1 and .D2) – these modules handle data load and store operations.
• .S (.S1 and .S2) – these handle shift, branch and compare operations.
• .M (.M1 and .M2) – these perform multiplication.
• .L (.L1 and .L2) – these modules handle logic and arithmetic operations.
2. Thirty-two 32-bit registers on each side, A and B. The A side has registers A0-A31 and the B side has B0-B31.
3. Program and data memory to hold instructions and data.
4. 256-bit wide internal program bus, capable of loading eight 32-bit instructions simultaneously.
5. Two 64-bit internal data buses allowing .D1 and .D2 to fetch data from data memory.
The standard Fetch-Decode-Execute cycle has been divided further into a number of sub-stages.
This creates better balance among the pipeline stages, thus enhancing the performance of the
DSP. The sub-stages are as follows.
• Programmable Logic Array (PLA). It contains two programmable logic planes – an AND-plane
and an OR-plane. The structure is suitable for two-level combinational logic realization with
the AND-plane generating the product terms and the OR-plane performing the OR operation.
• Programmable Array Logic (PAL). Similar to PLA, however, only the AND-plane is
programmable, OR-plane connections are fixed.
• Programmable Logic Device (PLD). Also referred to as CPLD (Complex PLD), it contains
arrays of simple programmable devices like PLA/PAL etc. in a single chip, along with some
memory elements (flip-flops).
• Field Programmable Gate Array (FPGA). These are also high capacity programmable logic
devices. Compared to CPLDs, the number of logic resources in an FPGA chip is much higher.
Individual logic blocks in an FPGA may have narrower inputs while the blocks in a CPLD have
wider inputs. The number of sequential elements (that is, flip-flops) is also greater in FPGAs.
[Figure: generic FPGA structure with logic blocks, switch blocks and I/O blocks arranged in rows and columns]
columns to establish such connections. Along the periphery of the chip, I/O blocks are put, capable
of accepting inputs from the environment or producing output to the environment.
Among the one-time programming technologies, the most common one is based on the antifuse.
Antifuse devices are originally open circuits, offering very high resistance. However, on applying
a voltage of 11-20V across the terminals, the resistance becomes very low, establishing electrical
contact. These devices are very small and thus require much less space compared to five-transistor
SRAM bits. However, the technology has its disadvantages – the large size of the programming transistors
and the one-time programmability feature. Fig 2.16 shows the structure of an antifuse. FPGAs from Actel,
Quicklogic, Crosspoint etc. use antifuse as the programming technology.
[Fig 2.16 diagram: poly-Si and n+ diffusion layers separated by a dielectric, surrounded by oxide]
Fig 2.16: Antifuse structure
Some FPGA devices (from Altera, Plus Logic, AMD) use floating-gate transistor based
programming technology. The device contains a floating gate and a control gate in the transistor.
By applying a high voltage between the two gates, the transistor can be disabled. The process
induces additional charge into the floating gate, raising the threshold voltage. On the other hand,
UV light or an electrical signal can be used to remove the extra charge from the floating gate,
reducing the threshold voltage of the transistor. Thus, floating-gate technology provides re-
programmability like SRAM with the additional advantage that no external memory is needed to
store the configuration program of the FPGA chip. Fig 2.17 shows the floating-gate concept.
1. Fine Grain Logic Block: Fine grain logic blocks are very simple in structure, consuming very
small area. A number of such logic blocks need to be interconnected to realize the desired logic
function. A typical example of this type of logic block is from Crosspoint. It consists of a
transistor pair, controlled by an input variable. The structure of the logic block has been shown
in Fig 2.18.
2. Coarse Grain Logic Block: A coarse grain logic block can hold a good amount of logic
functionality within a single block. FPGAs from Xilinx and Actel are a few examples of
this category. It may be noted that the complexity of such blocks varies significantly across
manufacturers.
[Fig 2.19 diagram: XC4000 CLB with two 4-input lookup tables (inputs F1-F4 and G1-G4), a third lookup table, selectors driven by control inputs C1-C4, and two D flip-flops producing outputs Q1 and Q2]
Fig 2.19 shows the logic block of Xilinx FPGA XC4000. As it can be observed, the
structure is quite complex, named as Configurable Logic Block (CLB). A CLB is based upon
Look-Up Tables (LUTs). An LUT is a small SRAM-based memory storing one output bit per input
combination. A CLB contains two 4-input LUTs and one 3-input LUT. A k-input LUT can realize
any combinational logic function of up to k variables. The total number of such functions
is 2^(2^k). The overall CLB can
realize two 4-input Boolean functions or one 5-input Boolean function. Apart from the
combinational part, a CLB also contains two flip-flops. The structure makes it very convenient
to realize sequential logic functions. The advanced versions of Xilinx FPGAs can realize more
complex functions. For example, a Virtex-6 CLB can be configured to function as a 6-input
LUT or two 5-input LUTs. It also contains 156-1064 dual-port block RAMs (depending upon
family), each capable of storing 36Kbits of information.
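A k-input LUT can be modelled as a 2^k-entry, one-bit-wide table. This is a sketch with invented helper names: programming the LUT means filling the table with the function's truth table, and evaluation is a single memory read.

```python
# Sketch: a k-input LUT as a 2**k entry, one-bit-wide memory. Programming the
# LUT fills the truth table; evaluation is a single indexed read.

def program_lut(func, k):
    """Build the truth table of a k-input Boolean function."""
    return [func(*((i >> b) & 1 for b in range(k))) for i in range(2 ** k)]

def lut_read(table, *inputs):
    addr = sum(bit << b for b, bit in enumerate(inputs))
    return table[addr]

# An example 4-input function: majority-of-first-three OR the fourth input.
table = program_lut(lambda a, b, c, d: int(a + b + c >= 2 or d == 1), 4)
print(len(table), lut_read(table, 1, 1, 0, 0), lut_read(table, 0, 0, 0, 0))
```

Since any assignment of the 2^k table bits is a valid program, the count of realizable functions is exactly 2^(2^k), matching the expression above.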
Another example of coarse grain logic block is Actel’s ACT1, shown in Fig 2.20. It is
based on multiplexers. Three 2:1 multiplexers are connected in a tree-like fashion with four data
inputs and four control signals. Several logic functions of up to eight variables can be realized
by restricting some of the inputs to fixed logic values and/or shorting some of the input lines. This
logic block, though quite simple, has low overhead as well. There are definite trade-offs between the
size of the logic block and the system performance, as noted next.
[Fig 2.20 diagram: three 2:1 multiplexers in a tree, with data inputs including w and x, select inputs s1-s4, and output f]
Fig 2.20: Actel ACT1 logic block
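One plausible wiring of the mux tree follows, with the final select formed as the OR of two control inputs; the exact ACT1 select wiring is an assumption here, so treat this as illustrative rather than a definitive model of the device.

```python
# Sketch: an ACT1-style multiplexer tree. Three 2:1 muxes; which Boolean
# functions emerge depends on how the inputs are tied to constants or signals.

def mux(s, a, b):
    return b if s else a         # 2:1 multiplexer: s selects between a and b

def act1(w, x, y, z, s1, s2, s3, s4):
    # Two input muxes feed an output mux whose select is (s3 OR s4).
    return mux(s3 or s4, mux(s1, w, x), mux(s2, y, z))

# Tying inputs to constants realizes ordinary gates, e.g. a 2-input AND:
AND = lambda a, b: act1(0, 0, 0, 1, 0, b, a, 0)
print(AND(1, 1), AND(1, 0))      # 1 0
```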
• A large logic block can realize a more complex function within a single block. Thus, the total
number of logic blocks needed may be less. However, a large block requires more area.
• Large blocks may result in reduced delay. This happens as a smaller number of switch boxes
is needed to route a net carrying a signal value. However, with larger blocks, the average
fanout increases, and the number of switches may also increase as each logic block has more pins.
The length of the wires increases with the size of the block.
simulation and realize the logic in terms of resources available in the target FPGA chip. The design
process consists of the following steps.
1. Design Entry. This is the first step in the overall process. The design can be entered at schematic-
level (consisting of predesigned library modules and interconnections) or at a behavioural level
(via languages like VHDL, Verilog). Design synthesis is needed to convert a behavioural
description into a netlist of library components. Since a schematic entry is already in the form
of a netlist, such translation is not needed for them.
2. Behavioural Simulation. A simulation is necessary to ensure correct functionality of the entered
design, both for schematic and behavioural level.
[Figure: FPGA design flow – design entry (schematic or behavioural) → behavioural simulation → design synthesis → functional simulation → design implementation → timing analysis → back annotation → timing simulation]
3. Design Implementation. This step implements the generated netlist for the target device. A
functional simulation is performed to ensure functional correctness of the implementation.
Detailed place-and-route is performed to implement the design onto target device. Timing
analysis is carried out to identify the delays of different nets in the implementation. Delays are
back-annotated onto the design and the design is passed through timing simulation. This step
can identify the timing violations (set-up and hold time violations) in the implementation.
4. Device Programming. This stage generates the final configuration program for the FPGA
device. The configuration program may be downloaded onto a memory that is read at the time
of power up (for SRAM based FPGAs). For antifuse based FPGAs, the configuration program
is used as the burning pattern of fuses to realize the system functionality. In-circuit verification
modules are often provided by the FPGA vendors to debug the downloaded configuration
program.
2. Design Entry. This step creates a high-level description of the system, typically in some
hardware description language, such as, VHDL and Verilog. The system functionality is
verified via functional simulation.
3. Synthesis. The high-level design description is next synthesized into a netlist of library
components. The resulting description is often said to be at the Register Transfer Level (RTL).
4. Partitioning. The RTL description, particularly a complex one, is next partitioned into a
number of smaller components. This aids block-by-block optimization of the design
and also promotes design reuse.
5. Design for Test (DFT) Insertion. This step introduces additional hardware into the circuit to
enhance its testability. The components inserted typically include scan-chains, Built-In-Self-
Test (BIST) module etc.
6. Floorplanning. This step plans the layout of the chip on the actual silicon floor. Provisions are
made to create space for functional blocks, I/O pads, power distribution network etc. Signal
integrity and thermal management issues are taken into consideration.
7. Placement. This step finalizes the places for the circuit components onto the silicon floor.
8. Clock Tree Synthesis. It creates and places the network to distribute clock signals to the required
pins of different modules in the chip. All sequential elements should get the clock signal with a
minimum jitter.
9. Routing. It plans the connectivity between the modules and decides how the signal lines should be
taken through the silicon floor, satisfying timing requirements, reducing signal interference
and decreasing the necessary signal power.
10. Final Verification. This step checks the satisfaction of design rules regarding the minimum
feature size (of transistors), spacing, metal density etc. Timing verification is performed to
guarantee the compliance with signal propagation delays.
11. GDS II. This is the final layout of the chip, represented in the Graphic Data System II
(GDS II) format. At the foundry, this data is used by the semiconductor manufacturers
to produce the final ASIC. After the chip has been fabricated, each unit goes through unit-level
testing to check its operation. The good chips are kept while the defective ones are rejected.
by the program are also stored in the memory. The program itself is kept in memory in the form of
instructions. Embedded memory may be of different types. A classification of memory modules
used in embedded systems has been shown in Fig 2.22. The two major memory groups are (i)
primary memory and (ii) secondary memory. The primary memory consists of Read-Write memory
and Read-Only memory. The commonly used Read-Write memories can be of two types – Static
and Dynamic. The Static memories include SRAM and NVRAM. The dynamic memories can be
SDRAM and DDR. The Read-Only memories may be either programmable or fused (one-time
usage). The programmable category includes EEPROM, EPROM and Flash (NAND, NOR, 3D).
The fused ones are PROM and Masked ROM. The secondary memory is created using SSD, Hard
Disk, Magnetic Tape and CCD. In the following a brief overview of the major memory alternatives
has been noted.
[Fig 2.22: Classification of memory used in embedded systems]
• Primary memory
  – Read-Write memory
    ▪ Static: SRAM, NVRAM
    ▪ Dynamic: SDRAM, DDR 1/2/3/4/ECC
  – Read-Only memory
    ▪ Programmable (erasable): EEPROM, EPROM, Flash (NAND, NOR, 3D)
    ▪ Fused: PROM, Masked ROM
• Secondary memory: SSD, Hard Disk, Magnetic Tape, CCD
1. Static RAM. Static RAM or SRAM is a volatile memory that uses a latching mechanism to
store information bits. Individual cell structure consists of six transistors forming two back-to-
back CMOS inverters. It is expensive compared to other memory alternatives like DRAM.
However, it is quite fast and typically used in Cache and internal registers of processors. The
structure of a typical SRAM cell has been shown in Fig 2.23.
[Fig 2.23 diagram: SRAM cell with word line WL, complementary bit lines BL and BL', and cross-coupled inverters between VDD and GND]
Fig 2.23: SRAM cell
2. Non-Volatile RAM. Non-Volatile RAM or NVRAM retains its data even when the power is
switched off. The initial NVRAMs used the floating-gate MOSFET transistors to realize
EPROM and EEPROMS. However, these devices require the write operation to be done in
blocks only, precluding the true random-access feature. An alternative to this is Ferroelectric
RAM (FeRAM) that uses a ferroelectric material to store information via polarization of its
atoms. Two other emerging NVRAM technologies are Magneto-resistive RAM (MRAM) and
Phase Change Memory (PCM). The fast reading and writing capabilities of NVRAM make it
suitable for cache memory and high-performance computing systems.
3. Dynamic RAM. Dynamic Random Access Memory or DRAM stores each information bit in a
memory cell consisting of a tiny capacitor and a transistor. The structure has been shown in
Fig 2.24. The charge stored in the capacitor gradually leaks away, requiring periodic rewriting
of data bits. The process is known as memory refresh. Due to its small size, DRAM has very
high density with much reduced cost per bit, compared to SRAM. DRAM is typically used to
realize main memory in computing systems. The most common variants of DRAM are as follows.
[Fig 2.24: DRAM cell – word line WL, bit line BL, one transistor and a capacitor to GND]
• SDR, DDR, DDR2, DDR3 and DDR4. Single Data Rate (SDR) is the oldest DRAM type.
Double Data Rate (DDR) DRAMs support high-speed transfer. DDR2 is faster than DDR
while operating at a lower voltage. DDR3 and DDR4 are faster than their previous versions and
also consume less power than their predecessors.
• SDRAM. Synchronous DRAM (SDRAM) works with a clock signal to synchronize with
other system components. It provides faster data transfer compared to the asynchronous
DRAM.
• ECC DRAM. The Error Correcting Code DRAM (ECC DRAM) checks for errors in data
transfer process, as well. Thus, it is useful for applications where the correctness of stored
and transmitted data is a prime concern.
4. Flash. Flash memory is non-volatile in nature and is primarily used as secondary storage and
ROM. Its solid-state technology makes Solid State Drive (SSD) a faster alternative to Hard
Disk Drive (HDD). The storage capacity can be in the range of a few Gigabytes (GB) to several
Terabytes (TB). Power consumption is also low. The flash memories work on the principle of
floating-gate transistors. There are two types of flash memories – NAND and NOR. The
NAND flash has high memory density and high capacity. It is typically used in memory cards,
USB drives and SSD. On the other hand, NOR flash memory cells are attached in a parallel
manner, giving faster reading speed compared to NAND. 3D flash memory offers density
higher than planar NAND flash and is used in high-capacity SSDs. However, the storage
capacity and life-span of flash memory are often less than those of hard disks.
UNIT SUMMARY
Processors are needed in embedded systems to carry out the computations. A large number of
alternatives is available to be chosen as an embedded processor, ranging from the most generic
ones to completely customized solutions. For small-to-medium sized embedded systems,
microcontrollers often provide an ideal solution due to their single-chip architecture having processor,
memory and other peripheral interfaces on the same chip. Among the available microcontrollers,
ARM has been used extensively in low-power, high performance embedded applications. For
signal processing tasks, Digital Signal Processors provide low-cost solutions. Special architectural
features of DSPs can be exploited to enhance signal processing. Beyond software, an embedded
system may be realized in hardware. FPGA and ASIC form two alternatives here. FPGAs are semi-
custom in nature and can be programmed to realize a desired computation. The feature of
programmability also aids in design debugging. However, for highly time-critical applications,
FPGAs may not be able to meet the performance goals. For such cases, an ASIC may be designed
that is fully customized towards the application. ASIC design may target one or more of the features
– area, delay and power to be optimized in the design process. Apart from computation, storage of
information is also an important issue. Several technologies including SRAM, DRAM, NVRAM,
SDRAM, Flash, magnetic/optical disk, SSD etc. are available for use as embedded memory. A
detailed understanding of all these alternatives for computation and storage available to the
designer, paves the way for efficient embedded system design.
EXERCISES
Multiple Choice Questions
MQ1. ALE signal in ARM is
(A) Input to the processor (B) Output from the processor
(C) Bidirectional (D) Configurable
MQ2. Suppose the contents of registers R1, R2, R3 and R4 in an ARM processor are 1, 2, 3 and 4
respectively. Following two instructions are executed in sequence.
STMFD SP!, {R1, R3, R4, R2}
LDMFD SP!, {R4, R1, R3, R2}
The contents of the registers R1, R2, R3 and R4 will be
(A) 1, 3, 4, 2 (B) 3, 1, 2, 4
(C) 1, 2, 3, 4 (D) 4, 3, 2, 1
MQ3. In ARM, conditional execution is supported in the instruction set
(A) ARM (B) THUMB
(C) Both ARM and THUMB (D) A programmable manner
MQ4. The register used as program counter in ARM is
(A) R12 (B) R13 (C) R14 (D) R15
MQ5. An ARM processor writes the 32-bit number 22292F3FH into memory and then reads a
byte from the same address. The value read in big-endian convention will be
(A) 22H (B) 29H (C) 2FH (D) 3FH
MQ6. Number of software interrupts in ARM processor is
(A) 1 (B) 24 (C) 2^24 (D) 2^20
KNOW MORE
Looking back into the history, TMS1000 was the first microcontroller reported in the year 1971,
developed by Texas Instruments. This was followed by Intel 4004. ARM version1 was introduced
in the year 1985. The first digital signal processor is TMS5100, introduced by Texas Instruments
in the year 1978. The first commercially viable FPGA is XC2064 from Xilinx, introduced in the
year 1985. It had 64 CLBs, each with two three-input lookup tables. Each of these alternatives,
along with ASIC, has advanced significantly over the years. ARM has come up with different
versions, of which ARM7, ARM9 and ARM11 are the notable ones. Beyond ARM11, the ARM
architectures have developed into three distinct categories – Cortex-A, Cortex-R and Cortex-M.
Cortex-A series is for application processors targeted to be typically used in mobile computing,
smart phones, energy-efficient high-end servers etc. Cortex-R series is for real-time processors.
These can be used for real-time applications, such as, hard-disk controller, automotive power train,
base-band control in wireless communication etc. Cortex-M series is for microcontrollers, targeted
towards deeply embedded applications, such as, sensors, MEMs, IoT. The processors in this series
include M0, M0+, M1, M3, M4, M7.
3
Interfacing
UNIT SPECIFICS
Through this unit we have discussed the following aspects:
• Features of Serial Peripheral Interface (SPI) protocol;
• Operation of Inter Integrated Circuit (IIC) interface;
• Overview of RS-232 family protocol;
• Interfacing devices through Universal Serial Bus (USB);
• Operation of IrDA interface;
• Overview of Controller Area Network (CAN);
• Features of Bluetooth wireless interface;
• Operation of Digital-to-Analog and Analog-to-Digital converters;
• Interfacing of subsystems through PCIe and AMBA bus;
• Overview of user interface design.
The overall discussion in the unit is to give an overview of interfacing standards commonly
followed in embedded system design. It is assumed that the reader has a basic understanding of
device interfacings and digital design.
A large number of multiple choice questions has been provided with their answers. Apart
from that, subjective questions of short and long answer type have been included. A list of
references and suggested readings has been given, so that one can go through them for more
details. The section “Know More” has been carefully designed so that the supplementary
information provided in this part becomes beneficial for the users of the book. This section mainly
highlights the advanced processors supporting many of the interfacing standards discussed in the
unit. The designer may choose these processors for the ease of system implementation.
RATIONALE
This unit on interfacing makes the readers familiar with the typical interfacing standards followed
while connecting peripheral devices to an embedded processor. A processor used for embedded
system is expected to possess the traditional interfaces such as RS-232 for serial communication,
Universal Serial Bus (USB) for incorporating plug-and-play features. However, apart from them,
many other interfaces have been designed typically for embedded applications. Two such very
simple interfacing standards are Serial Peripheral Interface (SPI) and Inter Integrated Circuit
(IIC). Both the standards require minimum hardware and simple protocols. Another very important
standard is Controller Area Network (CAN) that can operate in highly noisy environments like
automobiles. It uses a non-destructive arbitration mechanism based on assigned message
priorities. Two important converters, digital-to-analog and analog-to-digital, are used in the
realization of embedded applications interacting with the environment. Depending upon the cost
and performance metrics, the designer may choose one or the other type of converter available in
the market. It is often
necessary to connect a number of peripheral systems to the processor, over a bus. For this purpose,
the desktop computers may use PCIe (Peripheral Component Interconnect Express) bus, apart
from USB. An embedded system designer has similar choice. For microcontroller based System-
on-Chip type embedded system design, Advanced Microcontroller Bus Architecture (AMBA) has
come up as the standard. AMBA possesses a hierarchy of buses with varying performance
parameters. For human interaction with embedded applications, user interface design is of utmost
importance. Most such devices contain an LCD touchscreen panel that may be either resistive or
capacitive in nature. The goal is to make the user experience for the embedded application
satisfactory. This unit takes the reader through all these interfacing standards to equip them with
the capability to choose the appropriate one to connect devices to the system.
PRE-REQUISITES
Digital Design, Computer Architecture
UNIT OUTCOMES
List of outcomes of this unit is as follows:
U3-O1: Interface devices using SPI, I2C protocols
U3-O2: Enumerate features of RS-232 family
U3-O3: Interface devices using USB protocol
U3-O4: List the features of CAN protocol
U3-O2 1 3 - -
U3-O3 1 3 - -
U3-O4 2 2 - -
U3-O5 1 3 - -
U3-O6 1 3 - -
U3-O7 1 3 - -
3.1 Introduction
Embedded systems interact with the environment through various types of sensors and actuators.
These devices differ significantly from those commonly found in general computing platforms. A
desktop computer typically contains devices like keyboard, display, mouse, printer etc. The devices
are mostly connected to the processor via interfaces like USB, serial/parallel communication ports
and Bluetooth. To the contrary, an embedded system may have many non-conventional devices
connected to it. For example, a climate-control system may have a few temperature and humidity
sensors. While designing a sensor, the primary concern is to sense the physical parameter and
convert it into an electrical signal. Transportation of this electrical equivalent to a processor
(located at a close or distant place) may not be given adequate importance. As a result, it is
imperative to follow some very simple interfacing strategy. At the same time, the communication
must be reliable.
The requirement has given rise to a number of simple to moderate complexity interfacing standards
for embedded hardware. The traditional communication standards like RS232, USB, Bluetooth are
definitely used in embedded system design. However, a number of other standards have been
introduced from time-to-time by the industry, to be used in embedded applications. Some such
interfaces include Serial Peripheral Interface (SPI), Inter-Integrated Circuits (IIC,I2C), IrDA,
Controller Area Network (CAN) etc. SPI and I2C are synchronous communication protocols with
the hardware requirement of only four and two signal wires, respectively. The CAN protocol can
be used in highly noisy environment of automobiles. Analog-to-digital and digital-to-analog
converters form a part of many embedded systems. Many of the embedded appliances interact with
human beings. For this, special attention must be given to the user interface design to ensure good
user experience. In the following, the device interfacing standards targeted to embedded processors
and their associated issues have been elaborated.
[Figure: SPI interface – Processor/Master pins MOSI, MISO, SCLK and I/O connected to
Peripheral/Slave pins SI, SO, CLK and CS]
slave. The chip select signal CS acts to select the slave device to which master wants to
communicate. The CS signal is particularly useful in multi-slave environment. For example, Fig
3.2 shows the situation in which a master device selectively communicates with one of its three
slave devices. The slave device to interact with the master is selected via the Slave Select (SS)
signal from the master.
[Fig 3.2: Connecting Multiple Slaves in SPI – the master’s MOSI, MISO and SCLK lines are
shared by Slave 1, Slave 2 and Slave 3; individual select lines SS1, SS2 and SS3 drive the CS
input of each slave]
Data transfer in SPI is effectively an exchange of data between the master and the slave
via MOSI and MISO signal lines. The master and the slave contain one 8-bit register each. These
data registers act as serial shift registers. The contents of these registers are exchanged serially to
facilitate data transfer. The data transfer is initiated when the master/processor writes a byte to its
SPI data register. Bits are transmitted serially over the MOSI line to the SPI data register of the
slave, one bit per clock cycle. Simultaneously, the bits from the SPI data register of slave device
are transmitted serially over the MISO line to reach the SPI data register of the master. The process
has been explained in Fig 3.3. It may be noted that the data exchange can be restructured to perform
only read or only write by the master.
• Read Operation: Master writes a dummy byte to its data register. This in turn, initiates
transfer of the content of data register of slave to the data register of the master. The master
can then transfer the content of its SPI data register to one of its internal registers.
• Write Operation: The master writes the intended byte to its SPI data register. As a result, the
content of this register gets transferred to the SPI data register of the slave. Simultaneously,
the previous content of the SPI data register of slave reaches the SPI data register of the
master. As master intended to do only a write operation, the content available now in its SPI
data register is simply ignored by the master.
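The shift-register exchange described above can be sketched as a small simulation. This is an illustrative model of the protocol only, not driver code for any particular microcontroller; the function name is our own.

```python
def spi_exchange(master_reg: int, slave_reg: int):
    """Simulate one 8-bit SPI transfer. On each clock pulse both data
    registers shift left by one bit: the master's MSB travels to the
    slave over MOSI while the slave's MSB travels back over MISO."""
    for _ in range(8):
        mosi = (master_reg >> 7) & 1           # bit leaving the master
        miso = (slave_reg >> 7) & 1            # bit leaving the slave
        master_reg = ((master_reg << 1) & 0xFF) | miso
        slave_reg = ((slave_reg << 1) & 0xFF) | mosi
    return master_reg, slave_reg

# After 8 clock pulses the two data registers are exchanged, which is
# why a read is performed by writing a dummy byte.
m, s = spi_exchange(0xA5, 0x3C)
```

A pure write simply ignores the byte that arrives back in the master’s register, and a pure read writes a dummy byte, exactly as described above.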
[Fig 3.3: Data transfer through SPI Interface – the 8-bit SPI data registers of the
Processor/Master and the Peripheral/Slave form a circular shift register through the MOSI and
MISO lines]
It may be noted that though 8-bit data transfer is most common in SPI, other word sizes
are also used in some applications. For example, touch-screen controllers and audio codecs use 16-
bit words, many DACs and ADCs follow 12-bit word structure. Among the popular
microcontrollers, AVR ATmega series processors provide SPI communication. Some common
example peripherals having SPI interface are as follows.
• Analog-to-Digital Converters (ADCs): LTC2452 – an ultra-tiny, differential, 16-bit delta-
sigma ADC.
• Touch Screen: SX8652 – low power, high reliability controller for resistive touch screen with
supply voltage range of 1.65V to 3.7V.
• EEPROM: 25XXX series with densities varying from 128 bits to 512 kilobits, low-power
EEPROM with built-in write protection.
• Real-time Clock: DS3234 – low-cost, highly accurate real-time clock.
• Memory Card: MMC and SD cards.
wires are used to form the I2C bus, devices are connected to the bus in a multidrop fashion. The
bus is bidirectional with a speed of 100-400 kbps. The two wires are called,
• SDA – Serial Data
• SCL – Serial Clock
The bus has an open-drain configuration, pulled to VDD when idle. Fig 3.4 shows one system in
which several master and slave devices have been interfaced.
[Fig 3.4: I2C bus – SDA and SCL lines shared by all devices and pulled up to VDD]
A master device can start the communication by first pulling SDA to low, followed by SCL
to low. These transitions in SDA and SCL lines from high to low indicate a START condition for
communication. With SCL low, SDA transits to the first valid data bit. The rising edge of SCL
samples the data bit at destination. Data bit in SDA must remain valid as long as SCL remains high.
SDA transits to the next data bit after SCL becomes low. The STOP condition is indicated by SCL
becoming high, followed by SDA becoming high. After transmitting 8 bits, the master releases
SDA and gives an additional pulse on SCL. The slave acknowledges the receipt of the byte by
pulling SDA low. In this way, multiple bytes can be transmitted with acknowledgement for each
byte transfer. If the receiver is unable to accept more bytes, it can hold SCL low to pause the
transmission (clock stretching). Each device has a unique 7-bit address. Thus, upto 128 devices can
be connected to an I2C bus. The first byte transmitted by the master contains this 7-bit slave address
along with a direction bit. If the direction bit is 0, it indicates a write operation by the master – the
slave receives subsequent bytes from the master over the bus. On the other hand, the direction bit
being 1 indicates a read operation, in which the slave starts sending subsequent bytes onto the bus,
to be read by the master. Fig 3.5 shows a timing diagram for the I2C operation.
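The address byte that starts every transaction can be constructed as below. This is a minimal sketch of the framing rule only; the helper name and the example slave address are our own.

```python
WRITE, READ = 0, 1   # direction bit values

def i2c_address_byte(addr7: int, direction: int) -> int:
    """Build the first byte of an I2C transaction: the 7-bit slave
    address in bits 7..1 followed by the direction bit in bit 0."""
    if not 0 <= addr7 < 128:
        raise ValueError("I2C addresses are 7 bits wide")
    return (addr7 << 1) | direction

# Addressing a hypothetical slave at 0x50, first for a write, then a read.
write_byte = i2c_address_byte(0x50, WRITE)   # 0xA0
read_byte = i2c_address_byte(0x50, READ)     # 0xA1
```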
78 | Embedded Systems
[Fig 3.5: I2C timing diagram – data bits D7–D0 sampled on successive SCL pulses, followed
by the ACK bit]
DSPs, ADCs, DACs etc. I2C provides better support for multi-master environment with flow
control.
3.4 RS-232
RS-232 is one of the oldest serial communication standards, dating back to the 1960s. Devices can
communicate serially over a cable length of upto 25m at data rates upto 38.4 kbps. The original
specification document has undergone several revisions over the years. EIA RS-232 was
introduced in 1960. Its versions A, B and C came up in the years 1963, 1965 and 1969,
respectively. It was later renamed as a TIA standard. Accordingly, TIA-232 D, E and F have
come up in 1986, 1991 and 1997 (with revision in 2012), respectively.
In RS-232, data bits are transmitted at voltage levels with respect to local grounds only.
This gives rise to an unbalanced interface. The communication is asynchronous. Unlike SPI, there
is no synchronizing clock running between the communicating devices. The protocol is quite
simple, justifying its demand for use over the decades, even though several other serial
communication protocols have come up. In order to send a 0-bit, the driver output is a signal in the
range +5V to +15V. On the other hand, for transmitting a 1-bit, the corresponding voltage range is
-5V to -15V. At the receiver, a signal in the range +3V to +15V is taken as 0, while a signal in the
range of -3V to -15V is considered as 1. A logic 1 is called a mark while a logic 0 is called a space.
The protocol is meant for connecting a Data Terminal Equipment (DTE) to a Data Communication
Equipment (DCE). While a computer may be an example DTE, a modem can be considered as a
DCE.
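The receiver thresholds above can be captured in a small helper. This is an illustrative mapping only; the function name is ours, and voltages between -3V and +3V fall in the undefined transition region.

```python
def rs232_rx_level(volts: float):
    """Interpret a line voltage at an RS-232 receiver: +3V to +15V is
    logic 0 (space), -3V to -15V is logic 1 (mark); the band in
    between is undefined."""
    if 3.0 <= volts <= 15.0:
        return 0          # space
    if -15.0 <= volts <= -3.0:
        return 1          # mark
    return None           # undefined / transition region
```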
3.4.1 Handshaking
Due to the absence of synchronizing clock, flow control of information becomes a concern for RS-
232 communication. Additional flow control mechanisms need to be utilized to prevent data
overloading at the receiver. Handshaking methods are used to solve the overloading problem. RS-
232 supports three different handshaking strategies, as noted next.
1. Software handshaking: Software implementation of handshaking protocol saves the hardware
area and reduces the number of additional signal lines. Data transmission over telephone lines
is a typical example of software handshaking. The protocol used is known as Xon/Xoff. In this
protocol, control characters are embedded into the data stream and sent along the data lines.
The Xon command starts the transmission while Xoff stops it. To embed them into the ASCII
character stream, the character with code 17 has been taken for Xon and 19 for Xoff. As soon as the receiver
is ready to receive further data, it sends a Xon character to start transmission. The receiver can
send a Xoff character to stop transmission.
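A receiver-side sketch of Xon/Xoff processing is shown below, assuming the control characters arrive in-band with the data; the function name is our own.

```python
XON, XOFF = 0x11, 0x13   # ASCII codes 17 (DC1) and 19 (DC3)

def filter_flow_control(stream: bytes):
    """Strip in-band flow-control characters from a byte stream,
    tracking whether the sender is currently permitted to transmit."""
    data = bytearray()
    allowed = True
    for byte in stream:
        if byte == XOFF:
            allowed = False   # receiver asked the sender to stop
        elif byte == XON:
            allowed = True    # receiver is ready again
        else:
            data.append(byte)
    return bytes(data), allowed
```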
2. Hardware handshaking: The strategy is superior to software based handshaking. The
connection pattern has been shown in Fig 3.6. As can be noted in the figure, the Request To
Send (RTS) of Data Terminal Equipment (DTE) needs to be connected to the Clear To Send
(CTS) pin of Data Communication Equipment (DCE). The line Data Terminal Ready (DTR)
of DTE is connected to the Data Set Ready (DSR) of DCE and vice versa. For initiating the
transmission, DTE sets RTS ON. Once DCE becomes ready for communication, it puts CTS
ON. The DTE responds by making DTR ON and the line remains ON as long as data is being
transmitted.
3. Combined hardware software handshaking: In the combined policy, both the hardware
connectivity and software control via Xon/Xoff character transmission are implemented for
flow control.
can also provide power to low power consuming devices, eliminating the necessity of additional
external power sources for them. Many of the USB devices can be used without installing the
manufacturer specific device driver software.
For example, a webcam may have a microphone integrated with it along with the camera, thus
containing two logical sub-devices – video device function and audio device function.
[Figure: USB tiered-star topology – Root Hub connecting devices and further hubs, which in turn
connect peripherals such as printers]
• USB 1.1: 0.0 – 0.3V for low, 2.8 – 3.6V for high.
• USB 2.0: ±400 mV.
• USB 3.0: ±500 mV.
The interface has four wires (upto USB 2.0) or nine wires (USB 3.0 and beyond). Table 3.2 shows
the wires along with colour coding. Pins 1 to 4 are common in all USB versions, 5 to 9 are specific
for USB 3.0 onwards.
USB has two connection types.
• Upstream connection. This is the connection from device back to host/hub.
• Downstream connection. This connection is from host out to the device.
[Fig 3.8: USB cable with connectors – upstream connection at the Host/Hub end, downstream
connection at the Device end]
The connection pattern has been shown in Fig 3.8. Assuming USB 1.1 or USB 2.0 communication,
USB cable possesses an upstream connection at host end and a downstream connection at device
end. For upstream connection, Type-A connectors are needed while the downstream connection
needs Type-B connectors. The cable possesses Type-A plug at host end (upstream) and Type-B
plug at device end (downstream). To connect to the plugs, the host/hub contains a Type-A
receptacle while the device contains a Type-B receptacle. As discussed later, the size of Type-A
and Type-B connectors ensure that the USB cable cannot be connected wrongly, even by a layman.
[Fig 3.9: Type-A and Type-B connectors – Standard-A, Standard-B, Mini-A, Mini-B, Micro-A and
Micro-B variants]
The USB connectors are designed to be robust with electrical contacts protected by adjacent plastic
tongue. The entire connection assembly is further protected by an enclosing metal sheath to ensure
[Figure: USB Type-C receptacle pin assignment – symmetric pin rows carrying GND, VBUS,
TX/RX differential pairs, D+/D-, CC and SBU pins]
signals can penetrate glass, but not other opaque mediums. This restricts the communication to a
room only.
[Fig 3.11: IrDA transmission – transmitter beam spread of 15-30° about the line-of-sight,
receiver viewing angle of 15°, separation of about 1m]
A directed infrared beam is used for communication from the transmitter to the receiver. This is a
line-of-sight communication with transmitter beaming out its transmission at 15-30o either side of
the line-of-sight. The viewing angle of the receiver is 15o either side of line-of-sight. The distance
between transmitter and receiver is at least 1m, typically going upto 2m. The transmission policy
has been shown in Fig 3.11. With the maximum surrounding illumination of 10 klux (daylight), the
Bit Error Ratio (BER) is 10^-9. Devices start communicating at 9600 bps which may be negotiated,
without user intervention, for higher and lower data rates. IrDA is suitable for point-to-point and
even point-to-multipoint communication and is popular with portable devices like notebook,
handheld computer, digital camera.
[Fig 3.12: (a) bit pattern 01010011, (b) its RZ encoding, (c) its 4PPM encoding]
IrDA uses two different bit encoding schemes depending upon the rate of transmission. The
schemes are as follows.
• Return-to-Zero (RZ). The scheme is used for low data rate upto 1.152 Mbps. In this, a
transmission frame is divided into subintervals with one bit transmitted per subinterval. A logic
0 bit is represented by a pulse of duration 3/16 the width of a subinterval, while a logic 1 is
represented by the absence of the pulse. Fig 3.12(b) shows the transmission of bit pattern
“01010011” using RZ encoding.
• Pulse Position Modulation (PPM). The strategy is used at higher data rate of 4 Mbps, also
denoted as 4PPM. Here, a data symbol duration (Dt) is defined that is further divided into four
equal-length time slices, called chips or cells (Ct). A data symbol contains a pulse at exactly
one of its four cells and represents two bits of information. Corresponding to the information
bits “00”, pulse is put at cell 1. This makes the data symbol “1000”. Similarly for information
bits “01”, “10” and “11”, the data symbols are “0100”, “0010” and “0001”, respectively. Fig
3.12(c) shows the transmission of bit pattern “01010011” using 4PPM encoding.
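The 4PPM mapping above is simple to express in code. The sketch below works on bit strings for readability; it illustrates the symbol table only, not a complete IrDA framer, and the names are ours.

```python
# Each 2-bit group selects which of the four cells carries the pulse.
PPM_SYMBOL = {"00": "1000", "01": "0100", "10": "0010", "11": "0001"}

def encode_4ppm(bits: str) -> str:
    """Encode an even-length bit string into 4PPM chip cells."""
    if len(bits) % 2:
        raise ValueError("4PPM encodes two bits per data symbol")
    return "".join(PPM_SYMBOL[bits[i:i + 2]] for i in range(0, len(bits), 2))

# The example pattern of Fig 3.12:
chips = encode_4ppm("01010011")   # symbols "0100" "0100" "1000" "0001"
```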
in its identifier field. In case multiple devices transmit simultaneously, the one whose identifier
carries a dominant bit at the first position where the contending identifiers differ wins, thus
asserting its higher priority. The node with the lower-priority message senses the same, backs off
and waits for the bus to become idle. A ‘0’ is taken as the
dominant bit over ‘1’ as recessive. The physical bus is thus an open-collector, wired-AND
connection. If one node transmits a dominant bit (0) and another a recessive bit (1), the device
transmitting the dominant bit wins and its transmission continues, while the other device backs off.
Thus, higher priority messages are never delayed. The allocation of message IDs is very important
– each ID must be unique and has to be assigned by considering its priority.
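The wired-AND arbitration can be mimicked with a few lines of code. This is a sketch of the principle only, using identifiers given as equal-length bit strings; the function name is ours.

```python
def can_arbitrate(identifiers):
    """Resolve CAN arbitration among simultaneously transmitting nodes.
    At each bit position the bus carries the AND of all transmitted
    bits ('0' is dominant); a node reading back a dominant bit while
    sending a recessive one backs off."""
    contenders = list(identifiers)
    for pos in range(len(contenders[0])):
        bus = min(node[pos] for node in contenders)   # '0' beats '1'
        contenders = [n for n in contenders if n[pos] == bus]
    return contenders[0]   # unique IDs guarantee a single winner

# The numerically smaller identifier has the higher priority.
winner = can_arbitrate(["10100110000", "10100100111"])
```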
The CAN bus structure has been shown in Fig 3.13. Two lines, CANH and CANL constitute the
differential pair terminated by 120Ω resistance to avoid reflection. Each node connected to the
CAN bus has the following components.
[Fig 3.13: CAN bus structure – CANH and CANL differential lines terminated by 120Ω resistors
at both ends, with nodes tapping the bus]
• Host Processor. This is responsible for determining the type of messages received and their
meanings. It also performs the task of transmitting messages from the node.
• CAN Controller. It is responsible for the operation of transmitting and receiving messages.
Received bits are read from the bus till the complete message is available. It then interrupts the
host processor asking it to collect the message. For sending a message, the host processor stores
it in the CAN controller which is then transmitted serially through the bus.
• Transceiver. It is responsible for adapting the bus signal levels to that expected by the CAN
controller. It contains protective circuitry for the CAN controller. On the transmitter side, it
converts the signal level of the CAN controller to that of CAN bus.
CAN bus can transmit at a rate of 1 Mbps over a network length of 40m having a maximum
of 30 nodes. Upto 500m distance can be covered at a lower data rate of 125 kbps.
3.8 Bluetooth
This is a standard for small, cheap, wireless communication, typically used between devices like
computers, printers and mobile phones. It was originally introduced by Ericsson and quickly
adopted by a large number of companies. Bluetooth transmits data through low-power radio waves
at a frequency of 2.45 GHz (typically 2.40 to 2.4835 GHz). This frequency band is internationally
reserved for use of industrial, scientific and medical devices. The power of Bluetooth signal is weak
(about 1mW). This limits the range of Bluetooth transmission to 10m. It does not require line-of-
sight communication. The walls cannot stop a Bluetooth signal, thus devices distributed across
rooms can communicate using Bluetooth. A device can connect upto eight other devices over
Bluetooth, at a time. Communications do not interfere with each other due to the adoption of spread
spectrum frequency hopping – a device uses 79 individual randomly chosen frequencies within a
designated range, changing from one frequency to another on a regular basis. Transmitters change
frequencies 1600 times per second.
As soon as two Bluetooth devices come within range, an electronic conversation takes place
deciding whether there is data to be shared or one device needs to control the other. No user
intervention is necessary in the process. After conversation, the devices form a Personal Area
Network (PAN), also called a piconet. A piconet is an ad hoc network with one master device and
upto seven slave devices. Another 255 devices may remain as inactive or parked. The initial
communication between two Bluetooth devices is called pairing or bonding. It may be protected
by a 6-digit numeric passkey to be entered by at least one of the device users.
used in DAC design – Weighted Resistors method and R-2R Ladder Network method. An overview
of the two architectures has been presented in the following.
1. Weighted Resistors DAC. Fig 3.15 shows the schematic of a 4-bit Weighted Resistor DAC. The
digital input bits b3b2b1b0 control the switches in the connection pattern of weighted resistors.
Contribution of a bit gets consideration only if the corresponding bit value is 1. All such
contributions get summed up via a summing op-amp module. It may be noted that the resistors
are weighted based on the weights of the bits in the digital input. The output Vo of the op-amp
is given by,
Vo = (Ro / R) (b3 + b2/2 + b1/4 + b0/8) Vref
[Fig 3.15: Weighted Resistors DAC – weighted resistors 8R, 4R, 2R and R switched by bits
b0–b3 from -Vref into a summing op-amp with feedback resistor Ro]
A major drawback with the approach is its scalability. With an increasing number of bits in
the digital input, the required range of resistor values becomes large and the accuracy also suffers.
2. R-2R Ladder Network DAC. Fig 3.16 shows the schematic of a 4-bit DAC using R-2R Ladder
circuit. Using circuit analysis, it can be shown that the output voltage Vo for the summing
amplifier is given by,
Vo = (1/2) (b3 + b2/2 + b1/4 + b0/8) Vref
[Fig 3.16: R-2R Ladder Network DAC – a ladder of R and 2R resistors, switched by bits b0–b3
from -Vref, feeding a summing op-amp with feedback resistor 3R]
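Both transfer functions can be checked numerically. The sketch below simply evaluates the two expressions for a 4-bit input; the function names are ours.

```python
def weighted_resistor_dac(b3, b2, b1, b0, vref, ro_over_r=1.0):
    """Vo of the weighted-resistor DAC:
    Vo = (Ro/R) * (b3 + b2/2 + b1/4 + b0/8) * Vref."""
    return ro_over_r * (b3 + b2 / 2 + b1 / 4 + b0 / 8) * vref

def r2r_dac(b3, b2, b1, b0, vref):
    """Vo of the R-2R ladder DAC:
    Vo = (1/2) * (b3 + b2/2 + b1/4 + b0/8) * Vref."""
    return 0.5 * (b3 + b2 / 2 + b1 / 4 + b0 / 8) * vref

# For input 1111 the R-2R DAC reaches 15/16 of Vref.
full_scale = r2r_dac(1, 1, 1, 1, 8.0)   # 7.5 V
```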
3. Linearity. This measures the deviation from the ideal straight-line behaviour of the DAC output.
The error should be no more than ±1/2 LSB. When all input bits are zero, the output voltage gives
the offset error for the DAC.
4. Monotonicity. A DAC is said to be monotonic if the output voltage increases every time with
the increase in input code.
5. Settling Time. This is the time needed by the DAC output to switch and settle within ±1/2 LSB
when the input changes from all 0s to all 1s. It should be less than half of the data arrival period.
Analog-to-Digital Converter (ADC)
1. Sample. The analog input signal is sampled at a certain frequency to obtain the instantaneous
amplitude values to be converted to digital form. As the analog signal is a continually time
varying one, the ADC can convert only the values at the sampling instances into digital equivalents.
As per the Nyquist criterion, if the highest frequency of the analog signal is f, sampling should be
carried out at a frequency more than 2f.
2. Hold. This stage holds the analog signal sample at steady value so that the subsequent ADC
stages can work on it and that the value does not change till the next sampling interval.
3. Quantize. The quantization process approximates the sampled analog signal amplitude to one
of the few selected fixed values. Naturally, the process introduces an error, called quantization
error/noise.
4. Encoder. This is the final stage of an ADC, converting the quantized analog signal into binary
digits (bits).
[Figure: ADC stages – Analog Signal → Sample → Hold → Quantize → Encoder → Digital Signal]
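The quantize step can be illustrated for a uniform converter. This is a sketch under the assumption of an input range [0, Vref) divided into 2^n equal steps; the function name is ours.

```python
def quantize(sample: float, vref: float, n_bits: int):
    """Approximate a sampled amplitude in [0, vref) by one of 2**n_bits
    uniformly spaced levels; return the code and the quantization error."""
    levels = 2 ** n_bits
    step = vref / levels
    code = min(int(sample / step), levels - 1)   # clamp at full scale
    return code, sample - code * step

# An 8-bit converter with Vref = 5 V has a step of about 19.5 mV.
code, error = quantize(1.27, 5.0, 8)
```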
The following are the most commonly used ADCs – Delta-Sigma ADC, Flash ADC,
Successive Approximation ADC, and Integrating ADC. A comparison between their features has
been shown in Table 3.3.
the desired data rate fD at the output of the ADC. Individual samples are accumulated over time and
averaged with other input-signal samples through a digital decimation filter. Fig 3.19 shows the
block diagram of the internals of Delta-Sigma ADC. The modulator samples input signal at a very
high rate into a one-bit stream. The digital decimation filter takes these sampled data and converts
into a high-resolution, slower digital code.
[Fig 3.19: Delta-Sigma ADC – ΔΣ modulator sampling the analog signal into a one-bit stream,
followed by the digital decimation filter]
[Fig 3.20: Flash ADC – a resistor ladder between Vref and ground feeds one input of each
comparator, Vin feeds the other; the comparator outputs are encoded into the binary output]
[Fig 3.21: Successive Approximation ADC – Sample & Hold feeding a comparator whose output
drives the SAR; SAR bits DN-1…D0 drive a DAC supplied with Vref]
To start the conversion process, the MSB of SAR is set to 1, all other bits 0. Accordingly,
DAC outputs an analog voltage Vref /2. If this voltage exceeds VIN, the SAR bit is reset, otherwise
left as 1. The next bit of the SAR is set then. The process continues in this way, determining the
value of each bit, one bit at a time. The final content of the SAR is output as the digital count
and the end-of-conversion (EOC) signal is activated to inform the same.
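The bit-by-bit search described above is easy to simulate. This sketch assumes an ideal internal DAC whose output is Vref·code/2^N; the function name is ours.

```python
def sar_adc(vin: float, vref: float, n_bits: int) -> int:
    """Successive approximation: trial-set each SAR bit from MSB to
    LSB and keep it only if the DAC output does not exceed Vin."""
    sar = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = sar | (1 << bit)
        if trial * vref / (1 << n_bits) <= vin:   # DAC output vs Vin
            sar = trial                            # keep this bit
        # otherwise the bit is reset (left at 0)
    return sar

# Converting Vref/2 sets only the MSB of an 8-bit SAR.
count = sar_adc(2.5, 5.0, 8)   # 128
```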
[Fig 3.22: Integrating ADC (a) Circuit – VIN integrated through R and C around an op-amp and
then discharged against Vref (b) Output voltage – ramp up during tu, ramp down during td]
[Fig 3.23: PCI Express bus connection scheme – CPU connected through a Host Bridge to memory
and end points, with a switch fanning out to further end points]
An electrical interface of PCI Express is measured in terms of the number of lanes contained in it.
A lane corresponds to a single send/receive line of data with traffic in both directions, a full-duplex
communication. A PCIe link between two devices can have 1 to 16 such lanes in it. In case of a
multi-lane link, packet data is striped across the lanes. Lane count is negotiated during device
initialization. The standard defines link widths of x1, x2, x4, x8 and x16 lanes (link xk has k lanes
in it). A lane consists of two differential signal pairs – one pair for receiving data and the other for
transmitting. A PCI Express link structure has been shown in Fig 3.24. The latest version is PCIe
7.0, having a per lane transfer rate of 128 Gigatransfers per second (GT/s).
[Fig 3.24: PCI Express link structure – PCIe Device A and PCIe Device B connected by lanes,
each lane consisting of two differential signal pairs carried over wire links]
PCIe bus slots are typically backward compatible with other types of PCIe slots. This
makes it possible for the PCIe links with fewer lanes to use the same interface as PCIe links used
for more lanes. The computer users may install a better device into an expansion slot and the
corresponding device driver to get improved services on graphics, networking, storage etc.
[Fig 3.25: A typical AMBA based microcontroller platform – processor, on-chip memory, external
memory interface and DMA devices on the AHB/ASB; low-power devices on the APB, connected
through a bridge]
AMBA ASB is an alternative to AHB for systems in which high performance features may not be essential.
APB is used for connecting low-power peripherals. Fig 3.25 shows a typical AMBA based
microcontroller platform in which three types of buses have been used to connect to the system
modules. Table 3.4 shows a comparison between the features of the three standards.
AMBA AHB: This constitutes a new generation high-performance, high clock-speed bus for
synthesizable designs. The feature set includes burst transfers, split transactions, single-cycle and
single-clock operations, and wider data bus (64/128) configuration. Typically high bandwidth
peripherals (such as, internal memory, external memory interface) reside on AHB. The components
in AHB are as follows.
1. AHB Master. It is responsible for initiating read/write operations. Only one bus master may be
activated at a time.
2. AHB Slave. A slave responds to a read/write request within an address range.
3. AHB Arbiter. It selects the bus master to be active at a time instant. The arbitration protocol is
fixed; however, different arbitration algorithms can be used.
4. AHB Decoder. It decodes the address bus for each transfer and provides select signal to the
slave responsible for the transfer.
AMBA ASB: This was the first-generation AMBA system bus, supporting features like burst
transfers, pipelined transfers and multiple bus masters. Some common devices connected through
ASB are Direct Memory Access (DMA), Digital Signal Processor (DSP) etc. The AMBA ASB
system contains the following components.
1. ASB Master. The master is capable of initiating a read/write operation by providing address and
control information. Only one bus master is active at a time.
2. ASB Slave. A slave device responds to the read/write operation requests from the master.
3. ASB Decoder. Performs address decoding, similar to the decoder in AHB.
4. ASB Arbiter. Ensures that there is only a single bus master at a time. It is otherwise similar to
the AHB arbiter.
AMBA APB: In the AMBA hierarchy of buses, APB is used for minimal power consumption and
reduced interface complexity. To AHB/ASB, APB appears as a local secondary bus encapsulated
as a single AHB/ASB slave device. An APB bridge acts as the slave module that takes care of
handshaking and control signal retiming. The interfaced devices are of low bandwidth and do not
require high performance. The APB is much cheaper to implement than the AHB or ASB. The bus
is typically useful for interfacing simple register-mapped devices, for very low-power situations
where the clock cannot be globally routed, and for grouping narrow-bus peripherals to avoid
loading the system bus.
such as, patient data, vital signs and alerts. On the other hand, for industrial applications, the
critical information includes operational data and warnings.
2. Ideation. This involves generating alternate UI/UX designs. The following points need to be
considered in more detail.
• Navigation and organization of information.
• Colour schemes and contrast.
• Selection of font style and size.
• Iconography and graphics.
• Type of display.
3. Prototype. This is about creating a preliminary version of the interface. Prototyping tools can
be used for the purpose. This step emphasizes the following aspects.
• Designing the navigation flow and interaction by the user.
• Layout and composition of screen.
• Iconography and typography.
• Consistent design of interface elements.
4. Test and evaluate. This is the final stage in UI/UX design, dealing with testing and evaluation
of the interface. Feedback from the users needs to be collected, and the suggestions are to be
incorporated into the final design of the UI.
UNIT SUMMARY
Interfacing plays an important role in the design and implementation of embedded systems. This is
primarily due to the fact that such systems need to interact with the environment via different
types of devices, including sensors and actuators. These associated devices are primarily optimized
towards their operational requirements, rather than incorporating standards used for processor
interaction. As a result, unlike conventional computing systems, an embedded processor may need
to support a large number of other interfacing techniques. In this unit, several such interfacing
standards have been introduced. Among the simplest ones, standards like SPI and I2C are very
popular for connecting simple devices. For more complex interfaces, RS-232 and USB can be used.
The standard USB has evolved a lot to become the de facto choice for many system designers. To
operate in noisy environments, like automobiles, the CAN protocol has been introduced. The
interfaces IrDA and Bluetooth can be used for short-range point-to-point connections. For
interacting with the analog environment, analog-to-digital and digital-to-analog converters are
used. There are several categories of these converters varying in their performance and cost
parameters. For system design, the modules may be connected through a bus. The standards PCIe
and AMBA are used extensively for this purpose. Another very crucial issue to be looked after by
the embedded system designers is the user interface design. The interaction through the user
interface needs to provide good user experience.
EXERCISES
Multiple Choice Questions
MQ1. RS-232 communication is
(A) Balanced (B) Unbalanced
(C) Balanced or unbalanced (D) None of the other options
MQ2. With respect to RS-232, a computer is a
(A) DCE (B) DTE (C) DCE or DTE (D) None of the other options
MQ3. SPI protocol is
(A) Synchronous (B) Asynchronous
(C) Both synchronous and asynchronous (D) None of the other options
MQ4. Number of wires in SPI is
(A) 2 (B) 4 (C) 8 (D) 16
MQ5. Bit width of SPI data register is
(A) 8 (B) 16 (C) 32 (D) None of the other options
MQ6. In SPI, the master writes a dummy byte for the intended operation
(A) Read (B) Write (C) Both Read and Write (D) Compare
MQ7. An SPI operation can be
(A) Read (B) Write (C) Exchange (D) None of the options mentioned
MQ8. Number of wires in I2C is
(A) 2 (B) 4 (C) 8 (D) None of the other options
Interfacing | 105
(A) Start (B) Stop (C) Tristate (D) None of the other options
MQ12. The communication modes in SPI and I2C are
(A) Both half-duplex (B) Both full-duplex
(C) Half-duplex and full-duplex (D) Full-duplex and half-duplex
MQ13. USB4 supports speeds up to
(A) 10 Gbps (B) 20 Gbps (C) 30 Gbps (D) 40 Gbps
MQ14. Connector type(s) supported by USB4 is/are
(A) A (B) B (C) C (D) All of the other options
MQ15. In a USB cable, the upstream connection can be found in
(A) Host (B) Device
(C) Both host and device (D) None of the other options
MQ16. In a USB cable, the downstream connection can be found in
(A) Host (B) Device
(C) Both host and device (D) None of the other options
MQ17. In RZ coding, presence of a pulse indicates bit
(A) 0 (B) 1 (C) Both 0 and 1 (D) Absence
MQ18. Number of pulse positions in a PPM data symbol is
(A) 2 (B) 4 (C) 8 (D) 16
MQ19. The dominant bit in CAN communication is
(A) 0 (B) 1 (C) Either 0 or 1 (D) None
MQ20. Frequency of Bluetooth communication is
(A) 2.45 GHz (B) 5 GHz (C) 3 GHz (D) None of the other options
MQ21. Size of passkey in Bluetooth is
(A) 4 (B) 2 (C) 6 (D) 3
KNOW MORE
Traditional microprocessors and microcontrollers do not support most of the interfacing standards
discussed in this unit. However, in recent times, many processors (mainly microcontrollers)
have integrated those interfaces into their architecture itself. One of the oldest microcontrollers, the
8051, supports a serial communication protocol similar to RS-232. However, the signal levels are
different. The 8051 uses 0V and 5V as the voltage levels for the digital bits. A chip named MAX232
needs to be interfaced with the 8051 to achieve the desired RS-232 operation. AVR ATmega
microcontrollers possess SPI and I2C interfaces. The chip HC-05 can be interfaced with older
processors like the 8051 to realize Bluetooth communication in them. Many microcontrollers, like
the ESP32, support Wi-Fi communication. A wireless network processor such as the CC3100 can
be used to incorporate wireless communication in traditional processors.
[4] “RS232 Serial Communication Protocol: Basics, Working & Specifications”, https://
circuitdigest.com/article/rs232-serial-communication-protocol-basics-specifications,
downloaded on 3rd October, 2024.
[5] “Introduction to Controller Area Network (CAN)”, Texas Instruments.
[6] https://www.bluetooth.com, downloaded on 3rd October, 2024.
[7] “Data Converters”, Texas Instruments, https://www.ti.com/data-converetrs/overview.html,
downloaded on 3rd October, 2024.
[8] “PCIe Slots Explained: Types, Speeds and Uses in Modern Computers”, https://www.hp.com,
downloaded on 3rd October, 2024.
[9] “AMBA”, https://developer.arm.com/Architectrures/AMBA, downloaded on 3rd October,
2024.
4
Real-Time System Design | 109
Real-Time System Design
UNIT SPECIFICS
Through this unit we have discussed the following aspects:
• Classification of real-time tasks based on criticality and periodicity;
• Classification of scheduling algorithms;
• Task scheduling under Rate Monotonic Scheduling policy;
• RMS schedulability tests for periodic tasks;
• Task scheduling under Earliest Deadline First policy;
• Priority Inheritance Protocols including HLP and PCP;
• Other features of RTOS excepting scheduling;
• Illustration of features of commercial RTOS.
The discussion in this unit gives an overview of system design commonly followed in real-time
embedded systems. It is assumed that the reader has a basic understanding of generic operating
systems and system software.
A large number of multiple choice questions have been provided with their answers. Apart
from that, subjective questions of short and long answer type have been included. A list of
references and suggested readings has been given so that one can go through them for more
details. The section “Know More” has been carefully designed so that the supplementary
information provided in this part becomes beneficial for the users of the book. This section mainly
highlights the usage of RTOS in different real-time applications.
RATIONALE
This unit on real-time system design familiarizes the readers with the intricacies of large embedded
system design having real-time interactions with the environment. The designed system, though
running on a target processor only, is complex enough to have several interacting tasks. The tasks
share the system resources and need to be scheduled at proper instants of time so that the
application deadlines are met. The majority of these tasks are periodic in nature, with their instances
appearing at regular time intervals. Each instance has its associated deadline and must be
completed within that time. Apart from those periodic tasks, the application may contain other
tasks which are aperiodic in nature with the possibility of occurring at some arbitrary instants. All
tasks may not be equally critical to impact the valid system operation. Thus, scheduling of real-
time tasks becomes a very important issue. The schedulers may be clock-driven or event-driven,
with the second being more powerful, in general. Scheduling policies are guided by task priorities,
which may be static or dynamic in nature. Rate Monotonic Scheduling (RMS) is a prominent
example of a static-priority-guided event-driven scheduling policy, while Earliest Deadline First (EDF) is
based on dynamic priority. In real-time systems with shared resources, a high-priority task may
sometimes need to wait for a low-priority task to release the required resources held by the low-
priority task. This leads to priority inversion. Special resource allocation policies are used to tackle
this problem. Several real-time operating systems are available for use in embedded system design.
This unit enumerates all these aspects of real-time embedded system design and presents an
overview of different algorithms and policies available. The designer may choose the most
appropriate alternative for the desired system.
PRE-REQUISITES
Operating System, System Software
UNIT OUTCOMES
List of outcomes of this unit is as follows:
U4-O1: Distinguish between hard-, firm- and soft real-time tasks
U4-O2: Classify tasks based on periodicity
U4-O3: Perform Rate Monotonic Scheduling for a task set
U4-O4: Check if a set of tasks is schedulable under RMS
U4-O5: Perform EDF scheduling of task set
U4-O6: Compare the performances of RMS and EDF
U4-O7: Enumerate priority inheritance schemes
U4-O8: State the features of commercial real-time operating systems
U4-O2 2 2 3 -
U4-O3 1 2 3 -
U4-O4 1 1 3 -
U4-O5 1 2 3 -
U4-O6 1 1 3 -
U4-O7 1 2 3 -
U4-O8 1 1 3
4.1 Introduction
In the previous few units, we have primarily discussed the hardware platforms available for
embedded system design. The alternatives in terms of processors and peripheral device
interfacings have been enumerated. As noted in Unit-1, the complexity of embedded applications
varies widely – from small, hand-held devices to highly computationally intensive ones like
guided missiles, avionics and telecom switches. For smaller applications, the software part happens
to be small as well – no special mechanism may be needed to coordinate between the hardware
components realizing the application. However, for the relatively complex embedded systems,
a hardware-only solution may be difficult and costly. A good amount of system functionality
needs to be implemented in software. In such systems, a number of coordinating software tasks
interact with the hardware to realize the desired system functionality. The coordination essentially
leads to the scheduling of those interacting software tasks at proper time instants. Most of the
embedded applications require time-bound system responses. Scheduling is dependent upon the
properties of the tasks (like criticality, periodicity, and deadline). Ensuring the overall
coordination between such processes may not be very simple. A special piece of system software,
called operating system (OS) is needed for the purpose. Operating system takes care of managing
the underlying hardware and provides a platform to the embedded system designer. Applications
are mostly real-time in nature, having continuous interaction with the environment, so it is very
important that the operating system supports the handling of real-time tasks. This is a significant
departure from the conventional operating systems that primarily attempt to maximize the system
throughput and resource utilization. To understand the intricacies of real-time embedded system
design, it is essential to first look into the features of real-time tasks and the scheduling policies
commonly followed.
As far as scheduling of hard real-time tasks is concerned, we have to honour the deadline.
Unlike in ordinary operating systems in which we try to complete tasks as early as possible to
maintain a high throughput, there is no gain in finishing a hard real-time task early. As long as
the task completes within the specified time limit, the system runs fine.
Fig 4.1: Result utility in firm real-time systems
In firm real-time tasks, results made available after the deadline are of no value and are
thus discarded. Fig. 4.1 shows the situation in which after the occurrence of the event, the response
has a utility of 100% till the deadline. Beyond the deadline, the utility is zero and the computed result
is discarded.
To illustrate soft real-time tasks, consider the railway-reservation system. In this system,
it is not mandatory that the booking request process should complete within a fixed deadline. It is
only expected that the average time for processing be small. Even if the processing of a particular
ticket request takes some additional time, the system remains acceptable. The booking requester does
not reject the ticket altogether. Web browsing is another soft real-time task. Here also, after typing
the URL, it is quite common to wait for some time for the page to get loaded. However, even if it
takes slightly more time, the system does not fail, nor does the loaded page get discarded. We do
not consider it as an exception if the process takes slightly longer time.
Fig 4.2: Result utility in soft real-time systems
periodicity (pi). Individual occurrences of a periodic task are called its instances. The time instant
at which the first instance of a task arrives, is called its phase. Relative to the instant of occurrence
of a task instance, the time by which it must be completed, is called its deadline. The worst case
time requirement to carry out a task is called its execution time. It may be noted that the execution
time must be less than the deadline of the task. Periodicity defines the time interval at which the
successive instances of a task repeat. A periodic task is thus represented by the four-tuple < φi,
di, ei, pi >. Fig 4.3 shows the behaviour of periodic tasks. Most of the real-time tasks in an embedded
system are periodic in nature. The four parameters of such tasks can be computed beforehand.
While designing an embedded system, scheduling of such task instances needs to be carried out
carefully, so that proper resources are available for their timely completion (before the deadline).
In many cases, the deadline of a task instance is taken to be the same as the periodicity of the task.
This is acceptable as no further instance of the task can arrive in between.
Fig 4.3: Periodic task behaviour
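The four-tuple ⟨φi, di, ei, pi⟩ can be modelled directly in software. The following is a minimal illustrative sketch (the class and field names, and the example values, are assumptions made for illustration, not part of the text):

```python
from dataclasses import dataclass

@dataclass
class PeriodicTask:
    """A periodic real-time task <phase, deadline, exec_time, period>."""
    phase: float      # phi: arrival time of the first instance
    deadline: float   # d: relative deadline of each instance
    exec_time: float  # e: worst-case execution time
    period: float     # p: separation between successive instances

    def arrival(self, k):
        """Arrival time of the k-th instance (k = 0, 1, 2, ...)."""
        return self.phase + k * self.period

    def abs_deadline(self, k):
        """Absolute deadline of the k-th instance."""
        return self.arrival(k) + self.deadline

# Example: phase 2, deadline equal to the period (10), execution time 3
t = PeriodicTask(phase=2, deadline=10, exec_time=3, period=10)
print(t.arrival(2), t.abs_deadline(2))   # instance 2 arrives at 22, due by 32
```

Note how taking the deadline equal to the period, as the text mentions, makes each instance due exactly when the next one arrives.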
its arrival, the precedence constraints between the tasks are met, and the resource constraints are
honoured. A valid schedule may or may not be acceptable by the system designer as it may not
honour the deadlines of all task instances. A valid schedule meeting all deadlines is said to be
feasible.
The quality of a schedule is identified by the resulting processor utilization. For a task,
its processor utilization is defined as the fraction of processor time used by the task. If the
periodicity of a task is pi and its execution time is ei, its processor utilization ui is given by the
ratio ei/pi. The overall processor utilization resulting from a set of n tasks is defined as

U = ∑ (ei/pi), the sum being taken over the tasks i = 1 to n
Ideally, the processor utilization should be 100%. However, it may not be possible to
have a feasible schedule with 100% processor utilization meeting deadlines of all periodic tasks.
The scheduling algorithms attempt to generate feasible schedules with high processor utilization.
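The overall utilization defined above is a one-line computation. A minimal illustrative sketch (the function name and the list-of-pairs task format are assumptions):

```python
def utilization(tasks):
    """Overall processor utilization U = sum of e_i / p_i
    for tasks given as (execution time, period) pairs."""
    return sum(e / p for e, p in tasks)

tasks = [(1, 3), (2, 5), (3, 15)]   # the task set of Example 4.1 below
print(utilization(tasks))           # 1/3 + 2/5 + 3/15 = 14/15, about 0.933
```

A value above 1.0 immediately rules out any feasible schedule on a single processor.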
Operating system invokes the scheduling algorithm to decide upon the ready task to be
executed next by the processor. The invocation instants of the scheduling algorithm are called
scheduling points. Based upon the scheduling points and the task selection policy, the scheduling
algorithms can be classified into the categories shown in Fig 4.5. The algorithms can be broadly
classified into the following three categories – Clock driven scheduling, Event driven scheduling
and Hybrid scheduling.
Fig 4.5: Classification of scheduling algorithms
1. Clock driven scheduling. These schedulers work in synchronism with a clock/timer that
generates periodic interrupts. The interrupts act as the scheduling points at which the decision
is taken about the next task to be executed till the next scheduling point. The schedulers maintain
a precomputed table of tasks to be executed at each scheduling point. Due to the usage of such a
precomputed table, these schedulers are also called static schedulers. The clock driven scheduler
design is quite simple; however, the major drawback of this type of scheduler is its inability
to handle aperiodic and sporadic tasks, as their exact arrival times cannot be predicted, which
implies that their scheduling points cannot be determined. There are two popular clock driven
schedulers – Table driven and Cyclic. A Table driven scheduler maintains a table of tasks to be
scheduled at each clock instant. The scheduler is invoked at each clock interrupt; its interrupt
service routine consults the table
to schedule the task earmarked in the table. A Cyclic scheduler works in terms of major and
minor cycles (also called frames). The major cycle is determined by taking the least common
multiple (LCM) of periodicity of all tasks and is divided into a number of fixed sized frames. In
a frame, only one task gets scheduled. The frame boundaries constitute the scheduling points,
thus reducing the number of interrupts to the processor.
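The major cycle computation for a cyclic scheduler can be sketched as follows. This is an illustrative sketch only: the frame-size check here enforces just the condition that a single frame can hold the longest task (as the text schedules one task per frame); the fuller frame-size constraints found in the real-time literature are omitted, and all names and example values are assumptions.

```python
from math import gcd
from functools import reduce

def major_cycle(periods):
    """Major cycle = LCM of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def candidate_frames(periods, exec_times):
    """Frame sizes that divide the major cycle evenly and are large
    enough to hold the longest task in a single frame."""
    m = major_cycle(periods)
    return [f for f in range(max(exec_times), m + 1) if m % f == 0]

periods = [4, 5, 10]        # illustrative task periods
exec_times = [1, 1, 2]      # corresponding execution times
print(major_cycle(periods))                   # LCM(4, 5, 10) = 20
print(candidate_frames(periods, exec_times))  # divisors of 20 that are >= 2
```

As the text notes, with many tasks the LCM grows quickly, which is exactly what makes choosing a workable frame size difficult.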
2. Event driven scheduling. A fundamental problem with any clock driven scheduling policy is its
inability to handle a large number of tasks. The LCM of the periods becomes a large number,
making it difficult to determine a proper frame size for the task set. In each frame, some
processor time is wasted to run the scheduler, which is treated as scheduling overhead. Also,
the sporadic and aperiodic tasks cannot be handled efficiently. These problems can be alleviated
to a large extent by the use of event driven schedulers. In these schedulers, the arrival and
completion of tasks are considered as events and are taken as scheduling points. Priorities are
assigned to tasks and/or their instances. The priority may be static or dynamic in nature. As
with any other priority based scheduling policies, arrival of a higher priority task will preempt
a running lower priority task. Three important scheduling policies have been reported in this
category – Foreground Background scheduling, Rate Monotonic scheduling (RMS) and
Earliest Deadline First (EDF) scheduling. Out of these, the foreground background scheduling
policy is very simple. The periodic real-time tasks of the application are assigned higher priority
than the aperiodic tasks. Periodic tasks run in the foreground, while others including non-real-
time tasks may run in the background. The background tasks are taken up for execution only if
there is no pending foreground task. At any scheduling point, the highest priority pending task
is taken up for execution. The other two schedulers RMS and EDF are more complex and have
been discussed separately in the following.
Example 4.1: Table 4.1 lists a set of three tasks P1, P2 and P3 with periodicity values 3, 5 and 15
time units, respectively. The corresponding execution times are 1, 2 and 3 time units. The tasks
are to be scheduled using the RMS policy. Looking into the periodicity values, the tasks are
assigned priorities P1 > P2 > P3.
Table 4.1: Example tasks
Task Execution time Period
P1 1 3
P2 2 5
P3 3 15
Fig 4.6 shows a schedule for this set of tasks. As the LCM of the periodicities of the three
tasks is 15, from the 15th time instant the schedule repeats itself. To start with, it has been assumed
that at time instant 0, one instance each of the three tasks has arrived in the system. Later
instances of P1 come at time instants 3, 6, 9, 12 and so on. Such occurrences of all the three tasks
have been shown in the figure. Among three ready tasks at time 0, P1 has the highest priority,
hence it gets scheduled. Once the P1 instance is over at time instant 1, the waiting instance of P2
gets scheduled. It executes for full 2 time units. At time instant 3, the next instance of P1 has
arrived. Even though P3 is waiting at this point, the newly arrived instance of P1 gets the
chance to execute, by virtue of its higher priority. At time instant 4, there is no waiting instance
of P1 and P2. Thus, P3 gets scheduled for a single time unit. At time instant 5, the newly arrived
instance of P2 preempts the running lower priority task P3. The process continues. At time instant
14, no task is ready for execution and the processor remains idle. Thus, the set of tasks can be
scheduled following the RMS principle.
P1 P2 P1 P3 P2 P1 P2 P3 P1 P2 P1 P3 Idle P1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
P1 P1 P1 P1 P1 P1
P2 P2 P2 P2
P3 P3
Fig 4.6: RMS schedule of tasks in Example 4.1
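The schedule of Example 4.1 can be reproduced with a small time-stepped simulation. This is an illustrative sketch (unit-sized time slots and zero phasing assumed; the function and task names are not from the text):

```python
# Minimal time-stepped RMS simulation: shorter period => higher priority.

def rms_schedule(tasks, horizon):
    """tasks: list of (name, exec_time, period), all phases zero.
    Returns the task name (or 'Idle') run in each unit time slot."""
    tasks = sorted(tasks, key=lambda t: t[2])       # RMS priority order
    remaining = {name: 0 for name, _, _ in tasks}   # pending work per task
    timeline = []
    for t in range(horizon):
        for name, e, p in tasks:                    # new instance arrivals
            if t % p == 0:
                remaining[name] += e
        for name, _, _ in tasks:                    # highest-priority ready task
            if remaining[name] > 0:
                remaining[name] -= 1
                timeline.append(name)
                break
        else:
            timeline.append("Idle")                 # no pending work: slot 14
    return timeline

tasks = [("P1", 1, 3), ("P2", 2, 5), ("P3", 3, 15)]
print(rms_schedule(tasks, 15))
```

Running it reproduces the behaviour described in the text: P3 first executes at time 4, is preempted by P2 at time 5, and the processor idles at time 14.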
Applying L’Hospital’s rule, it can be shown that the limiting value of the expression is ln 2 ≈
0.693. This implies that, for a large number of tasks, the guaranteed achievable utilization is
about 69.3%. Any set of tasks with utilization exceeding this value fails the sufficiency test.
However, it is not uncommon to have a set of tasks that fails the sufficiency test but is still
schedulable using the RMS technique. If the sufficiency condition is satisfied, the task set is
definitely schedulable under the RMS policy.
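The sufficiency test itself is easy to mechanize. An illustrative sketch (function name and task format are assumptions); note that the task set of Example 4.1 fails the test yet is schedulable, exactly as the text observes:

```python
def rms_sufficient(tasks):
    """Liu-Layland sufficiency test: U <= n(2^(1/n) - 1).
    Passing guarantees RMS schedulability; failing is inconclusive.
    tasks: list of (execution time, period) pairs."""
    n = len(tasks)
    u = sum(e / p for e, p in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return u <= bound, u, bound

ok, u, bound = rms_sufficient([(1, 3), (2, 5), (3, 15)])   # Example 4.1
print(ok, round(u, 3), round(bound, 3))   # fails: U ~ 0.933 > bound ~ 0.780
```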
3. Another sufficiency test: A more elaborate sufficiency check can be made using Lehoczky
test. The test can be stated as follows.
A set of periodic real-time tasks are schedulable using RMS technique under any task
phasing if all the tasks meet their respective first deadlines under zero phasing.
When a task arrives to the system, it needs to wait as long as there are higher priority tasks
running/waiting. Thus, a ready task waits for the longest period of time if all other higher
priority tasks are also ready at that point. It should also be noted that multiple instances of
higher priority tasks may arrive during the periodicity of the lower priority task. The worst
situation occurs when all these tasks are in-phase with each other, rather than being out of
phase. Thus, if a set of tasks meet their deadlines with zero phasing, the tasks will meet all
deadlines with non-zero phasings also.
Consider a set of tasks P1, P2 … Pn with their corresponding execution times e1, e2 … en,
and periodicity values p1, p2, … pn. Without any loss of generality, it can be assumed that
the tasks are ordered in decreasing order of priority, P1 having the highest priority and Pn
the lowest priority. The task Pi will meet its first deadline, provided,
ei + ∑ ⌈pi/pk⌉ × ek ≤ pi, the sum being taken over the higher priority tasks k = 1 to i−1
Within the period pi of task Pi, a higher priority task Pk can appear ⌈pi/pk⌉ times. All these
instances of Pk must be completed before the task instance of Pi gets scheduled. Checking
in this way, if all tasks meet their first deadlines, the task set is definitely schedulable under
RMS policy. For the task set mentioned in Example 4.1, let us check the zero-phase
schedulability of all tasks.
• For task P1, e1 = 1 ≤ p1 = 3.
• For task P2, e2 + ⌈p2/p1⌉ × e1 = 2 + ⌈5/3⌉ × 1 = 4 ≤ p2 = 5.
• For task P3, e3 + ⌈p3/p1⌉ × e1 + ⌈p3/p2⌉ × e2 = 3 + 5 × 1 + 3 × 2 = 14 ≤ p3 = 15.
As all tasks pass the test, the task set is schedulable using RMS technique.
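The zero-phasing first-deadline check can be sketched as follows (an illustrative sketch; the function name is an assumption, and tasks are listed in decreasing order of priority, i.e. increasing period):

```python
from math import ceil

def first_deadlines_met(tasks):
    """Zero-phasing check: tasks as (exec, period) pairs sorted by
    decreasing priority. Task i meets its first deadline if
    e_i + sum over higher-priority k of ceil(p_i / p_k) * e_k <= p_i."""
    results = []
    for i, (e_i, p_i) in enumerate(tasks):
        demand = e_i + sum(ceil(p_i / p_k) * e_k
                           for e_k, p_k in tasks[:i])   # higher-priority load
        results.append(demand <= p_i)
    return results

print(first_deadlines_met([(1, 3), (2, 5), (3, 15)]))   # [True, True, True]
```

The task set of Example 4.1 passes for every task, confirming the hand computation above.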
It may be noted that even if a set of tasks does not satisfy the Lehoczky test, the tasks
may still be schedulable using RMS policy. The test checks for the schedulability of the
tasks under zero phasings. This is a bit stringent for a system, as the individual tasks may
have non-zero phasings, thus allowing more time to schedule the instances satisfying the
deadlines.
4. Critical tasks with long periods pose another challenge. The tasks being critical, need to be
scheduled early, so that their deadlines are not missed. However, due to large periodicity, the
priority of such a task may be low enough to force it to wait till many other relatively
noncritical tasks finish their execution. For such a situation, the DMA (Deadline Monotonic Assignment) policy may be followed.
Another solution may be to virtually divide the critical task into k subtasks. Periodicity of
each subtask is reduced by a factor k. This reduction in periodicity value should be sufficient
to ensure that when the critical task arrives, it can preempt any of the running tasks. The
virtual division does not affect the monolithic nature of the critical task. The virtually divided
critical task is executed as one single task only. It may be noted that each virtual task now
has execution time ei/k and periodicity pi/k. Thus, if the k virtual subtasks are counted as
separate tasks, the apparent CPU utilization becomes k × (ei/k)/(pi/k) = k(ei/pi), though the
actual utilization is only ei/pi. This may wrongly indicate the task set to be not schedulable
under RMS.
5. Limited priority levels may only be available in the underlying hardware and operating
system. Thus, for a task set with a large number of tasks, several tasks may need to be
grouped into the same physical priority level.
periodic tasks to wait for long time, possibly missing their deadlines. Sporadic servers solve this
problem by replenishing the budget only after a fixed time interval from the completion of
previous aperiodic task. A timer is started on utilization of budget by the server. Budget is
replenished only after the expiry of the timer. This guarantees a minimum separation between the
scheduling of two aperiodic task instances.
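The budget-and-timer behaviour of a sporadic server described above can be modelled as a toy sketch. All names, the tick/serve interface and the numeric values are illustrative assumptions, not from the text:

```python
# Toy sporadic-server model: the server spends its budget on aperiodic
# work; a timer starts when the budget is used up, and the budget is
# restored only after the timer expires, enforcing a minimum separation
# between aperiodic executions.

class SporadicServer:
    def __init__(self, budget, replenish_delay):
        self.capacity = budget
        self.budget = budget
        self.replenish_delay = replenish_delay
        self.replenish_at = None   # time at which the budget is restored

    def tick(self, now):
        """Restore the budget once the replenishment timer has expired."""
        if self.replenish_at is not None and now >= self.replenish_at:
            self.budget = self.capacity
            self.replenish_at = None

    def serve(self, now, work):
        """Run aperiodic work against the budget; start the timer when
        the budget is exhausted. Returns the units actually served."""
        self.tick(now)
        served = min(work, self.budget)
        self.budget -= served
        if self.budget == 0 and self.replenish_at is None:
            self.replenish_at = now + self.replenish_delay
        return served

s = SporadicServer(budget=2, replenish_delay=5)
print(s.serve(0, 3))   # serves 2 units; budget exhausted, timer set for t=5
print(s.serve(3, 1))   # serves 0: replenishment not due until t=5
print(s.serve(6, 1))   # serves 1: budget was restored at t=5
```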
pmin × r = 6.58
pmin × r² = 8.66
pmin × r³ = 11.40
pmin × r⁴ = 15.0
The highest priority level will get tasks with periods 5 and 6 time units. The next level will
hold the tasks with periods 7 and 8, the third level will have tasks with periods 9, 10, 11, finally
the fourth level will have tasks with periods 12, 13, 14, 15.
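The geometric grouping of periods into a limited number of priority levels can be sketched as follows. This is an illustrative sketch assuming, as the computed boundary values suggest, r = (pmax/pmin)^(1/levels) with pmin = 5 and pmax = 15; the function name is an assumption:

```python
def group_periods(periods, levels):
    """Group task periods into a limited number of priority levels using
    geometrically spaced boundaries pmin * r^k, r = (pmax/pmin)^(1/levels)."""
    p_min, p_max = min(periods), max(periods)
    r = (p_max / p_min) ** (1 / levels)
    groups = [[] for _ in range(levels)]
    for p in periods:
        k = 0
        # find the band (pmin*r^k, pmin*r^(k+1)] containing period p
        while p > p_min * r ** (k + 1) + 1e-9:
            k += 1
        groups[min(k, levels - 1)].append(p)
    return groups

print(group_periods(list(range(5, 16)), 4))
# → [[5, 6], [7, 8], [9, 10, 11], [12, 13, 14, 15]]
```

The output matches the grouping stated above: periods 5–6 at the highest level, then 7–8, then 9–11, and finally 12–15.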
Task Execution time Period
P1 1 4
P2 2 6
P3 3 8
Fig 4.7: A schedule of the above task set in which deadlines are missed
Fig 4.8: A schedule of the above task set meeting all deadlines
In the simplest possible implementation of EDF, all ready tasks can be put into a queue. When a
new instance of a task arrives, it is attached at the end of the queue. The individual entries in the
queue note the absolute deadlines of the corresponding instances. To select the next task to be
executed, the list is scanned to identify the task having the earliest deadline. For a queue with n
tasks, the operation of joining the queue can be carried out in O(1) time while determining the
next task to schedule needs O(n) time. Instead of a normal queue, a priority queue data structure
can be used. To join a priority queue, the arriving task instance takes O(log n) time. However, the
selection of the task to be executed next can be carried out in O(1) time.
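The priority-queue alternative described above can be sketched with Python's heapq module. One nuance worth hedging: with a binary heap, merely peeking at the earliest-deadline task is O(1), while actually removing it for execution costs O(log n). Task names and deadlines below are illustrative:

```python
import heapq

# EDF ready list as a min-heap keyed on absolute deadline:
# joining is O(log n); the earliest-deadline entry sits at the root.

ready = []

def arrive(abs_deadline, name):
    """A task instance joins the ready list: O(log n)."""
    heapq.heappush(ready, (abs_deadline, name))

def next_task():
    """Remove and return the earliest-deadline task, or None if idle."""
    return heapq.heappop(ready)[1] if ready else None

arrive(10, "P1")
arrive(7, "P2")
arrive(12, "P3")
print(next_task())   # P2: earliest absolute deadline (7)
print(next_task())   # P1
```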
in domino effect, all tasks in the application miss their deadlines, one after the other like a set
of dominos – falling of one domino initiates the falling of other dominos in sequence. In Table
4.3, four such tasks P1, P2, P3 and P4 have been taken with different execution times and
periods. Utilization corresponding to these four tasks is 2/5 + 2/6 + 2/7 + 2/8 = 1.27. Fig 4.9
shows the situation when the tasks are scheduled following EDF policy. As shown in the
figure, all the tasks miss their deadlines. Here, the task P1 misses its deadline at time instant
15. Successively, the instances of tasks P4, P2 and P3 also miss their deadlines. It may be noted
that the domino effect does not occur in RMS, since the highest priority task will never miss its
deadline; the next lower priority task may miss only some of its deadlines. In the worst case, some
low priority tasks may miss all their deadlines.
Table 4.3: Example tasks
Task Execution time Period
P1 2 5
P2 2 6
P3 2 7
P4 2 8
Fig 4.9: Domino effect – the tasks of Table 4.3 scheduled under EDF, with all tasks missing deadlines
Thus, the tasks need to coordinate among themselves. The system resources are to be shared
between the set of tasks. The resources in an embedded system can be broadly divided into two categories
– pre-emptible resources and non-pre-emptible resources. A pre-emptible resource can be taken
back from the task currently using it. The typical example is the processor. On arrival of a higher
priority task, the processor can be taken from any currently executing lower priority task and
allotted to the higher priority one. The low priority task can be resumed later, when the resource
becomes available again and is allotted to it. On the contrary, non-pre-emptible resources
cannot be taken prematurely from any task to which the resource has been allotted currently. A
typical example is the printer device. The printer may be allotted to a task currently carrying out
a printing operation. If it is taken back from the task in between and allotted to another task, the
print output may become meaningless as the outputs of two tasks get mixed up. In general, this
leads to an inconsistent state of the resource. As a result, if a low priority task has been allotted a
non-pre-emptible resource, a higher priority task will need to wait for the lower priority task to
finish its execution and release the resource. In this situation, a higher priority task has to wait
while the relatively lower priority task continues executing. The situation is known as priority
inversion. To make the definition more precise, it is also known as simple priority inversion. The
situation becomes more complex with the arrival of some intermediate priority tasks. These
intermediary priority tasks may not be willing to use the shared resource currently under use by
the low priority task. Since the intermediary tasks possess priorities higher than the lowest priority
task, they will pre-empt the CPU from the lowest priority task holding the resource. The lowest
priority task cannot complete; however, it continues to hold the resource. A continuous flow of
intermediary priority tasks will lead to an indefinite wait for the highest priority task. The situation
is known as unbounded priority inversion. In the case of simple priority inversion, the delay faced
by the highest priority task is at most equal to the duration for which the resource is held by the
lowest priority task. If the designer takes special care to ensure that the tasks hold resources only
for the duration they are actually needed, the waiting time of the highest priority task can be
minimized to a large extent. The possibility of deadline miss may be low. With the presence of
intermediary priority tasks, unbounded priority inversion may get initiated, creating higher chance
for a deadline miss for the highest priority task.
A standard remedy is the priority inheritance protocol: whenever a higher priority task gets blocked
on a resource, the task currently holding the resource temporarily inherits the priority of the blocked
task. The lowest priority task thus inherits the priority of the highest priority task, while using the
resource. With this priority inheritance mechanism, the priority of the lowest priority task gets
sufficiently increased so that the intermediary tasks cannot pre-empt the lowest priority task now.
The problem of unbounded priority inversion is thus resolved. As soon as the shared resource is
released by the lowest priority task, its priority is reverted back to its original value. The highest
priority task now pre-empts the lowest priority running task and gets scheduled along with the
resource.
There are two serious issues with the priority inheritance protocol, as noted in the following.
1. Deadlock. The priority inheritance policy may lead to potential deadlock situation in which
none of the tasks can progress. Consider the two tasks T1 and T2 using the non-pre-emptible
shared resources R1 and R2. It is further assumed that T1 has priority higher than T2, however,
when an instance of T1 arrives, an instance of T2 is already running and has acquired the
resource R2. Since T1 is of higher priority, it pre-empts T2 and starts executing. T1 acquires R1
and then attempts to acquire R2 held by T2. T1 sees a priority inversion and gets blocked; T2
acquires the priority of T1 via the priority inheritance protocol. T2 now tries to grab R1,
however, gets blocked due to non-availability of R1. Both the tasks T1 and T2 are now blocked,
leading to a deadlock situation.
2. Chain blocking. The situation occurs when a task attempts to acquire multiple shared
resources. In the worst case, a high priority task may need n such resources, all of which are
held by different lower priority tasks. As a result, in an effort to acquire these n resources, the
high priority task will see n priority inversions, creating n blockings for it. The situation may
occur in conjunction with a single lower priority task as well. Let the high priority task T1 need
shared resources R1, R2 … Rn, all of them currently held by the running low priority task T2.
When T1 arrives, it pre-empts T2 and tries to acquire R1. A priority inversion occurs, T2
assumes the higher priority of T1 and resumes execution. Task T1 sees its first blocking. As
132 | Embedded Systems
soon as T2 releases R1, its priority gets restored to the original value. The task T1 resumes,
acquires R1 and then attempts to acquire R2 (held by T2). The task T1 gets blocked again and
priority inheritance occurs. This way, the task T1 may see inversions for each of the n resources
and face n blockings in a chained manner.
In order to take care of the problems mentioned here, the following two protocols have
been proposed – Highest Locker Protocol (HLP) and Priority Ceiling Protocol (PCP).
1. Highest Locker Protocol (HLP). In this protocol, a ceiling priority is assigned to every non-
pre-emptible, shared resource (also called a critical resource). The ceiling priority of a resource
is set to be equal to the maximum priority of a task amongst the tasks that may request for the
resource. Whenever a task acquires a critical resource, its priority is set to be equal to the
ceiling priority of the resource. For a task holding multiple critical resources, the priority is set
to be equal to the highest ceiling priority of all the held resources.
It can be shown analytically that HLP can be used to resolve the problems of unbounded
priority inversion, deadlock and chain blocking. On the flip side, it introduces an inheritance
related inversion. As per the protocol, whenever a low priority task acquires a critical resource,
its priority is increased to the ceiling priority of the resource, irrespective of whether any higher
priority task has arrived and made to wait for the resource. There may be a number of tasks
with intermediary priorities that arrive now which do not need this critical resource for their
execution. All these tasks are now made to wait for the low priority task to complete its usage
of the critical resource and revert to its original low priority. Only after that the intermediate
priority tasks can be taken up for execution.
2. Priority Ceiling Protocol (PCP). This protocol has the potential to solve the problems of
unbounded priority inversion, chain blocking and deadlock. It can also reduce the probability
of occurrence of inheritance related inversions. Here, when a task requests for a critical
resource, it may or may not be allotted to the task, even if the resource is free. The allocation
is guided by a resource-grant rule. Similar to HLP, each critical resource Ri has an associated
ceiling priority CRi. Apart from that, a current system ceiling (CSC) is maintained keeping track
of the maximum ceiling value of all critical resources active at this point of time. The CSC is
initialized to a very low value – lower than the priority of the lowest priority task for the
application. Resource grant and release are handled by the corresponding rules noted next.
• Resource grant rule. The rule is applied to decide upon the allocation when a task Ti
requests for a critical resource. It consists of two clauses – Request clause and Inheritance
clause.
o Request clause. The resource is allocated to the task Ti if it is already holding another
resource with ceiling priority equal to the current CSC or if the task priority is higher
than the current CSC value. The CSC value is updated to reflect the maximum ceiling
value of all active critical resources.
o Inheritance clause. If a task asks for a critical resource but cannot be allocated failing
the request clause, the task gets blocked and the task holding the resource inherits the
priority of this blocked task, provided the holding task has lower priority than the
blocked task.
• Resource release rule. On releasing a critical resource by a task, the current CSC value is
updated to reflect the maximum ceiling priority of the active critical resources. The task
priority is set to the maximum of its original priority and the highest priority of all tasks
waiting for any resource still held by this task.
Compared to HLP, in PCP the priority of a task is not upgraded just because it has grabbed a
critical resource – only the CSC value gets updated. A low priority task holding a resource
inherits the priority of a higher priority task only if the higher priority task asks for the resource.
This reduces the possibility of inheritance related inversions significantly.
within its deadline, the timer is reset as the last operation of the task. On the other hand, if the
task does not complete within its deadline, a timeout occurs and a timer interrupt is sent to the
processor to initiate appropriate action. Thus, the timer is used to watch for deadline misses of
critical tasks.
2. Task Priorities. In general-purpose operating systems, depending upon the resource usage pattern
of a task, its priority may increase or decrease over its lifetime. In contrast, in an RTOS, the priority
of a task instance does not change over its lifetime for reasons other than priority inheritance. It may
be noted that general-purpose operating systems modify task priorities to improve system
throughput, whereas in real-time systems the goal is to meet deadlines, not to optimize throughput.
3. Context Switching Time. Whenever the operating system replaces the currently executing task
with a new one, enough information needs to be saved for resuming the exiting
task later. Also, the environment (in terms of CPU register contents) for the new task has to be
set up. This process is known as context switching. As this is an overhead for the processor, the
context switching time must be reduced for critical tasks in a real-time system. In traditional
operating systems, kernel (holding major OS routines and data structures) is designed to be non-
pre-emptive so that the system calls can be executed atomically without any pre-emption. This
has the potential to cause deadline miss for the critical tasks. In real-time operating systems, the
kernel is designed to be pre-emptive in nature, so that context switching can be much faster, of
the order of a few microseconds only.
4. Interrupt Latency. Latency for an interrupt is defined to be the time needed to invoke the
corresponding interrupt service routine (ISR) after the occurrence of the interrupt. For real-time
systems, the upper bound of the latency value is expected to be small, of the order of
microseconds. A policy, known as deferred procedure call (DPC) is often used to reduce the
interrupt processing time. Bulk of the ISR activity is transferred to this deferred procedure. The
ISR performs the most important part of the service, rest are taken care of by the DPC. For
example, in a real-time temperature monitoring system, the ISR may just read the temperature
sensor data. The DPC may perform other activities to generate the statistics from the
temperature data. The DPC executes at a much lower priority compared to the ISR. Thus, the
latency in responding to interrupts and events reduces significantly.
5. Memory Management. Efficient utilization of the memory resources is an important
responsibility of the operating system. General purpose operating systems support the virtual
memory and memory protection features. Virtual memory uses a part of the hard disk to hold the
task image, with only a few of its pages loaded into the main memory accessed by the processor. In
case the processor generates an address beyond the pages available in the main memory, a page
fault occurs. The task is put into a blocked state till the referred page is loaded into the main
memory. Thus, while virtual memory allows the task size to be practically unlimited, it makes the
execution time unpredictable. Hence, in real-time systems virtual memory is either fully avoided or
is used only for non-real-time tasks. Memory locking is a policy that prevents a page from being
swapped out of the memory. This makes the execution time of tasks more predictable and thus
useful for real-time tasks.
real-time tasks. The band supports 16 priority levels; however, each task is restricted to a range of
±2 levels with reference to its initial priority. Through DPCs, Windows NT provides fast response
to tasks, though it is not a hard real-time system. Priority inheritance is not supported in Windows NT,
which may therefore lead to a deadlock situation.
4.9.2 Windows CE
Windows CE is made for 32-bit mobile devices with limited memory capacity. The CPU time is
divided into slices. The time slices are assigned to individual tasks for execution. There are 256
priority levels available in this RTOS. The tasks are run in kernel mode – eliminating the need of
switching for the system calls. This enhances the performance of the system. To protect the device
dependent routines, an equipment adaptation layer has been defined that isolates the routines from
general tasks. Trusted modules and tasks can be defined allowing access to the system APIs
(Application Programming Interfaces). The non-pre-emptible portions of the kernel routines are
broken down into small sections, called Kcalls. This reduces the durations of non-pre-emptible
codes. Virtual memory is used with the provision of memory locks for the kernel data during the
execution of non-pre-emptive kernel codes.
4.9.3 LynxOS
LynxOS is a multithreaded operating system targeting complex real-time applications that require
fast, deterministic response. It can cater to a wide range of platforms – from very small embedded
products to very large ones. Services like TCP/IP, I/O and file handling, sockets etc. are supported.
To serve interrupts, kernel threads (tasks) are created that can be assigned priorities and scheduled
as other threads. Response time is always predictable. Memory protection is supported, and
virtual memory is also available. Scheduling policies such as Prioritized First-In-First-Out,
Real-Time System Design | 137
Dynamic DMS, Time-slicing are supported. There are 512 priority levels available with the
possibility of remote operation.
4.9.4 VxWorks
VxWorks is a widely adopted RTOS supporting a visual development environment. It is available
on most of the popular processor platforms. The OS contains more than 1800 APIs. There are 256
priority levels supported by the microkernel, along with multitasking and dynamic context switching.
Pre-emptive and round robin scheduling policies are available coupled with priority inheritance. It
offers facilities like network support, file system and I/O management.
4.9.5 Jbed
The RTOS Jbed supports applications and device drivers in Java. The byte-code, instead of being
interpreted, is translated into machine code before class loading. Features like real-time memory
allocation, exception handling etc. are supported. Specific class libraries have been introduced for
the hard real-time tasks. The number of priority levels supported is ten. The EDF policy is available
for task scheduling.
4.9.6 pSOS
This is an object-oriented operating system with the facilities for tasks, memory regions, message
queue and synchronization primitives. The scheduling policies supported include pre-emptive,
priority-driven and EDF. For priority inversion, both priority inheritance and priority ceiling
protocols are available.
UNIT SUMMARY
Many of the embedded applications are real-time in nature, requiring the intended tasks to be
completed in a time-bound manner within their deadlines. Moreover, a relatively complex
embedded system may be designed as multiple processes, each accomplishing a specific task.
The tasks need to be initiated at proper times and the resources need to be made available to them.
For this, the systems often use a Real-Time Operating System (RTOS) to handle the task scheduling
and priority concerns. All tasks may not be equally sensitive to the deadlines. Accordingly, tasks
can be grouped into hard-, firm- and soft real-time categories. Depending upon the frequency of
occurrence of a task, it may be a periodic task, aperiodic task or sporadic task. Scheduling
algorithms have been developed to cater primarily to the periodic tasks. This unit has discussed
the popular event-driven scheduling algorithms – RMS and EDF – along with their
shortcomings and the avenues to overcome them. Priorities of tasks are handled using several
protocols that ensure the avoidance of deadlock and the maximum availability of shared resources.
Finally, the unit has looked into the most important features of many of the popular real-time
operating systems.
EXERCISES
Multiple Choice Questions
MQ1. Which of the following is not a class of real-time tasks?
(A) Hard (B) Firm
(C) Soft (D) Permanent
MQ2. Deadline miss for a hard real-time task in a system means failure of the
(A) Task (B) System (C) RTOS (D) None of the other options
MQ3. After the deadline, utility of the output of a firm real-time task is
(A) 100% (B) 50% (C) 10% (D) 0%
MQ4. Which of the following is not a feature of periodic tasks?
(A) Deadline (B) Periodicity (C) Phase (D) None of the other options
MQ5. A minimum separation between two instances is ensured for task type
(A) Aperiodic (B) Sporadic
(C) Both aperiodic and sporadic (D) None of the other options
MQ6. All deadlines are met by a schedule that is
(A) Valid (B) Feasible (C) Conservative (D) None of the other options
MQ7. The RMS policy assigns priorities to the tasks in a manner that is
(A) Static (B) Dynamic (C) Static or dynamic (D) None of the other options
MQ8. An aperiodic server task is
(A) Periodic (B) Aperiodic (C) Sporadic (D) None of the other options
MQ9. The EDF policy assigns task priorities in a manner that is
(A) Static (B) Dynamic
(C) Either static or dynamic (D) None of the other options
SQ9. Enumerate the priority inheritance protocol and the inheritance related inversion.
SQ10. State the working principle of HLP.
SQ11. How does PCP work?
SQ12. Why is virtual memory not recommended for hard real-time tasks?
LQ3. For the task set shown in LQ2, draw the Gantt chart for EDF scheduling.
LQ4. Show the results of RMS schedulability tests on the task set of LQ2.
LQ5. How are critical tasks with large periodicity handled in RMS? Why does it not need any
special treatment in EDF?
LQ6. How are aperiodic and sporadic tasks handled in RMS?
LQ7. Assume there are 100 periodic tasks with the ith task having priority i. If there are 10 priority
levels available with the RTOS, show the grouping of tasks for uniform, arithmetic,
geometric and logarithmic distributions.
LQ8. Why are watchdog timers useful for real-time systems?
LQ9. Explain the difference between simple and unbounded priority inversions with suitable
examples.
LQ10. Compare the commercial RTOSs in terms of features.
KNOW MORE
Process control systems in industrial applications are among the major users of real-time operating
systems. Tasks in such applications help in maintaining quality, improving performance and
collecting relevant data. This data may be returned to the console for monitoring and
troubleshooting. For example, in the oil and gas sector, adoption of these techniques may facilitate
less downtime and fewer losses. Machine vision is another area that makes use of real-time systems.
The tasks may help the systems process data in near real-time. Robot Operating Systems are
valuable parts of robotics technology targeting real-time computing and processing. Manufacturers
can use real-time tasks to maximize productivity and improve product quality. It can also ensure
enhanced safety on the factory floor. In the domain of healthcare, real-time systems can play a very
important role in ensuring that the data from patient monitoring systems reach the doctors in real
time, keeping the patient safe and healthy.
5 Embedded Programming
UNIT SPECIFICS
Through this unit we have discussed the following aspects:
• Popular programming languages for embedded software development;
• Features of embedded programming languages;
• Criteria to choose language for embedded system design;
• Additional data types available in embedded C, over the conventional C language;
• Programming with embedded C for 8051;
• Programming with embedded C for ARM7 processor;
• Interfacing I/O devices with microcontrollers using embedded C.
The discussion in this unit gives an overview of embedded software development targeting
microcontroller based system design. It is assumed that the reader has a basic understanding of
generic microcontrollers, programming and the C/C++ language.
A large number of multiple choice questions have been provided with their answers. Apart
from that, subjective questions of short and long answer type have been included. A list of
references and suggested readings has been given so that one can go through them for more
details. The section “Know More” has been carefully designed so that the supplementary
information provided in this part becomes beneficial for the users of the book. This section mainly
highlights the latest developments in embedded programming techniques.
RATIONALE
This unit on embedded programming introduces the readers to the techniques followed and
languages used to develop embedded software. A program for an embedded application may have
a complex algorithm to be implemented and also have direct interactions with the system hardware
resources. The resources may be CPU registers, memory locations, I/O ports and so on. An
embedded system programmer must have the facility to perform read/write operations onto them.
This can naturally be done by writing programs in the assembly language of the processor.
However, the absence of structured programming primitives like if-then-else, switch-case and loop
in the assembly language may make the programming untidy, particularly for algorithms with
complex computational requirements. Thus, embedded programming languages should have the
features of high-level languages commonly used in software system development and also low-level
features like direct hardware resource access. Keeping these two objectives in view, many of the
conventional structured programming languages have been augmented to evolve their embedded
variants. Amongst them, the languages C/C++ have been the front-runners and their embedded
variants have been used extensively by the embedded system design community. The major
augmentations in these languages have come in terms of new data types like
signed/unsigned character/integer, bit, and special function register usage. With these, an
embedded program can directly control the I/O ports, timers, counters and communication
facilities (like SPI, I2C, UART and Bluetooth) available in the platform designated for software
execution. It is to be noted that even if two hardware platforms use the same processor, the other
system resources may differ significantly. This implies that an embedded program developed for
one of the platforms may not be directly portable onto the other, even if we consider the programs
at their source code level. Each platform will have its own development environment that can
translate the high-level program into machine code suitable for that platform. This unit enumerates
all these aspects of embedded programming and presents a large number of example programs
and device interfaces using the embedded C language. The designer of an embedded application
may choose the most appropriate language alternative for the desired system and develop the
software code accordingly. However, unlike normal software developers, an embedded software
developer needs to consult the design environment and specific hardware resources available with
the target platform in order to have an efficient system realization.
PRE-REQUISITES
Programming, C/C++ language, Microcontrollers
UNIT OUTCOMES
List of outcomes of this unit is as follows:
U5-O1: List commonly used embedded programming languages
U5-O2: Choose suitable language for programming in developing embedded application
Embedded Programming | 145
U5-O2 1 3 3 -
U5-O3 - 3 2 -
U5-O4 1 3 2 -
U5-O5 2 3 3 -
U5-O6 1 3 2 -
U5-O7 1 3 2 -
5.1 Introduction
Embedded programming refers to the task of developing the software to be executed by a target
processor, in order to realize an embedded application. The programming techniques should have
capabilities to access hardware resources like CPU registers, memory locations and other device
internals (like sensors and actuators) interfaced to the system. Development of such software is
constrained by allowed execution time and memory requirements. Most of the electronic devices
now contain one or more embedded processor(s) along with software. The software can be very
simple, such as, controlling a few lights via processor controlled switches. It can also be at the
complexity levels of routers, missiles and other process control systems. Embedded software
interacts with the underlying hardware and is definitely different from the firmware in general
computing systems. Unlike firmware, embedded software is generally written in high-level
languages like C, C++, Java, Python and so on. It is more sophisticated in the sense that an
embedded software performs interactions with devices, along with high-level data processing
tasks.
Myriads of programming languages have been developed over the years to design
software systems. Each such language is generic in nature, in the sense that a corresponding
compiler generates the code for a specific processor. As a programmer, the user need not be bothered
about the internal architecture of the CPU executing the program. The optimizing compiler takes
care of the task of efficient execution of programs by the target processor. Many-a-times, the user
may not be aware of the computation, communication and interfacing features of the underlying
processor. Embedded system programs are a marked deviation from this high-level view of the
target processor. The programs would rather manipulate the hardware resources (like CPU
registers, memory locations, I/O ports) directly, exercising explicit control over them. This is commonly done by
assembly language programming. However, an assembly language program may be tedious and
complicated to be developed manually. Unavailability of high-level language constructs (like if-
then-else, switch-case and loop) in an assembly language is a serious constraint in structured
program development. Due to their complex nature, embedded programs need to be developed in
languages similar to the conventional high-level ones. Additionally, there should be facilities to
interact with the hardware. This makes the language more specific towards the target platform. It
may be noted that, even if two or more platforms use the same processor, the hardware modules
interfaced with them may be different. The software for different boards (even with same
processor) may need to utilize different sets of registers built-in with the processor, to access
interfaced devices. For example, two boards may both utilize the ARM7 processor but one of
them may contain only SPI while the other one may have SPI, I2C and Bluetooth interfaces. Thus,
the program developed for one board is not compatible with the other, even if both of them utilize
the same high-level language (such as, C).
2. C++: This is a superset of the C language with object-oriented features. Usage of libraries can
save time in writing code. The language is of course difficult to learn, and only a subset of
features can be used by embedded developers.
3. Python: Developed in the 1980s, the language is highly suitable for machine learning, artificial
intelligence and data analytics. The language is easy to learn, read and write. However, the
language is non-deterministic and hence not suitable for real-time applications. Thus, the
language can be used for non-real-time embedded system development.
4. MicroPython: This is a version of Python optimized for microcontrollers. Like Python, it is open-
source and easy to use. However, the code is generally not as fast as C/C++ and may
end up using more memory.
5. Java: This is one of the efficient, general-purpose languages, widely used for internet based
applications. Java is the most suitable language for embedded systems developed on top of the
Android operating system. The code is portable across devices and operating systems. It is
quite reliable as well. However, it cannot be used for real-time systems. Particularly, systems
with graphical user interfaces (GUIs) may have performance issues.
6. JavaScript: It is a text-based programming language used by some of the embedded system
developers. It may be useful for systems with good amount of networking and graphics.
However, it may show poor runtime performance.
7. Ada: The language was developed in the 1970s as a U.S. Department of Defense project for usage
in its own embedded system designs. The language is efficient and reliable; however, it is
difficult to run and is also not widely used.
8. Assembly: This is the language closest to the processor, directly interacting with the underlying
computer hardware. It has the potential to result into memory efficient, fast code. However,
the language is highly processor specific, difficult to read and maintain.
is also available for them. This makes these two languages highly suitable for embedded
system development platforms.
• Real-time requirements. Many embedded systems need to respond to the events in its
environment in a time-bound manner. Choice of programming language for such systems will
be guided by factors like, how well it can handle real-time operations, interrupts, scheduling
of tasks and event-driven programming. Languages like C and Ada have such real-time
capabilities and control over hardware resources.
• Memory efficiency. Embedded systems generally possess limited RAM and flash storage.
Thus, judicious memory allocation is an important aspect. Languages like C and C++ allow
manual memory management for optimization by the embedded system developers.
• Development tools. Availability of development tools, libraries and other support often decide
the language selection. Compilers, debuggers and integrated development environments
(IDEs) are some such important tools. Well-established languages like C and C++ have tool-
chains, large number of libraries and large open-source community supporting system
developments.
• Available expertise. The expertise and language familiarity of the development team influence
the choice significantly. For relatively less complex projects, or when the development team is
more familiar with high-level languages, Python or JavaScript may be a good option targeting
rapid development. On the other hand, for complex, performance-critical applications, C and
C++ may provide better control and optimization.
• C vs. C++. The choice between C and C++ is more of a personal choice of the designer. This
is because the language C++ is a superset of C with additional features and introduction of
class concept. C is a more traditional language and it is likely that many designers are more
comfortable with it. For many designers, C++ coding can make the code cumbersome to
understand and modify. However, there is no solid logic in favour of using C instead of C++
for embedded system design, except the availability of a larger number of development platforms
for C. Since our objective is to understand the basic features of embedded programming
languages, we shall look into the embedded C language, for the sake of simplicity.
5.4 Embedded C
The programming language C was introduced by Dennis Ritchie at Bell Labs in the early 1970s. This is one of
the most structured programming languages. A program in C language is a collection of one or
more functions. Every C program should have a unique function, called main(), which may in
turn call other functions defined in the program. The language is sometimes referred to as middle-
level programming language, as it enables the programmers to write at a very high-level of
abstraction and also at a level close to the assembly language programming, accessing the bare
hardware.
The variants of C used for embedded software development form the class of embedded C
languages. This is not a single, unique language as the hardware resources (like I/O ports, timers,
memory location addresses) vary across the microprocessors/microcontrollers. As a result,
program developed targeting one processor may not be fully compatible with another, even at the
source code level. For example, the names of the I/O ports and their sizes are not uniform across
all the microcontrollers. Each processor vendor provides a processor-specific C compiler for the
purpose. In our discussion, we shall be looking into two such variants of embedded C – one for
8051 and the other for ARM processor.
[Fig 5.1: Internal block diagram of the 8051 microcontroller – CPU, on-chip memory, I/O ports P0–P3, serial port (TXD/RXD), counter inputs, external interrupts, control logic and oscillator]
To write embedded C programs for the 8051, one must first
understand the basic architecture of the 8051 microcontroller. Fig 5.1 shows the internal block diagram
of the processor. The major components in the microcontroller are as follows.
1. CPU. As in any other processor, CPU constitutes the main part of it, having components like
Arithmetic Logic Unit (ALU), Instruction Decoder and Control and a number of registers, such
as, Program Counter (PC) and Instruction Register (IR). The supported arithmetic and logic
operations are on 8-bit data. A number of general-purpose registers are part of the CPU, but are
mapped as RAM locations.
2. Memory. The 8051 chip contains 4kB program memory realized using EPROM/flash
technology. Apart from that, the chip contains 128 byte data memory realized using RAM with
addresses ranging over 00H to 7FH. The RAM area has the following parts in it.
• Register Banks (00H to 1FH). These are the lowest 32 bytes of RAM divided into four
register banks with 8 registers in each bank. The registers are named as R0 through R7. Only
one bank is active at a time.
• Bit Addressable RAM (20H to 2FH). These 16 memory locations can be referred to at the
level of individual bits. The bits are numbered from 00H to 7FH.
• General Purpose RAM (30H to 7FH). These 80 bytes can be used for general purpose storage
bytes.
• Special Function Registers (SFRs). Addresses in the range 80H to 0FFH are reserved for
some special-purpose registers. Such registers and their addresses are noted in Table 5.1.
Table 5.1: Some special function registers of the 8051
Category           Register   Address   Bit-addressable
Interrupt related  IE         A8H       Yes
                   IP         B8H       Yes
Timer related      TL0        8AH       No
                   TH0        8CH       No
                   TL1        8BH       No
                   TH1        8DH       No
                   TCON       88H       Yes
                   TMOD       89H       No
3. Digital I/O Ports. There are four 8-bit I/O ports – P0, P1, P2 and P3. The port bits can be
programmed individually to act as either input or output. The port bits are also used for some
other functions (details could be found in the 8051 manual).
4. Timers. There are two timers – Timer-0 and Timer-1, each configurable as 8-, 13- or 16-bit
modules. The timers can also act as counters for the external events occurring on T0 and T1
pins.
1. Unsigned char: This is an 8-bit data type supporting values in the range 0 to 255. It is one of
the most used data types for embedded C programming on 8051. The program noted in Example
5.1 uses a variable of this type.
Example 5.1: The following C language program outputs the values 00-7FH sequentially
to the port P1 of 8051.
#include <reg51.h>
void main (void)
{
unsigned char x;
for (x = 0; x < 0x80; x++)
P1 = x;
}
2. Signed char: This is also an 8-bit data type that can be used to store signed characters with
values in the range –128 to +127. Signed char is the default character type in embedded C, so a
variable intended to hold values in the range 0 to 255 must be explicitly declared unsigned
char in the program. The program shown in Example 5.2 uses both signed and unsigned char data
types.
Example 5.2: The following C language program adds 10 signed characters stored in an array
a. The result is put in the variable sum.
#include <reg51.h>
void main (void)
{
signed char a[] = { -2, 5, 10, 20, -3, 17, -1, 45, 25, -6 };
signed char sum = 0;
unsigned char j;
for (j = 0; j < 10; j++)
sum += a[ j ];
}
3. Unsigned int: This is a 16-bit data type that can represent numbers in the range 0 to 65535
(FFFFH). It may be noted that in 8051, 16-bit numbers are used for some specific purposes
only, such as, memory address, counter values etc. All arithmetic and logic operations are
performed on 8-bit data. The memory locations are also 8-bit wide, hence, an unsigned integer
variable will occupy two bytes. Accessing these two-byte memory locations is also more
complex than a byte access. As a result, the machine code generated by the compiler may be
much larger than the same program using unsigned character variables only. The program noted
in Example 5.3 shows some uses of unsigned int data type variables.
Example 5.3: The following C language program uses an unsigned integer variable to
introduce some delay in the execution so that the bits of the port P2 are flipped continually.
#include <reg51.h>
void main (void)
{
unsigned int m;
for (; ;) {
P2 = 0x55;
for (m = 0; m < 65530; m++);
P2 = 0xAA;
for (m = 0; m < 65530; m++);
}
}
4. Signed int: This is the 16-bit data type variant allowing representation of signed integers.
Variables of this type can hold values in the range –32768 to +32767. This is the default integer
data type used by the embedded C compilers.
5. Single bit: Bit operands are special for embedded C programming. In this direction, there are
two bit data types supported by the compilers of 8051. One of them, bit can be used to refer to
the bits in the bit-addressable RAM locations 20H-2FH. The other type sbit can be used to refer
to the individual bits of special function registers. The program noted in Example 5.4 shows the
usage of bit and sbit data types.
Example 5.4: The following C language program flips the bit-4 of port P1 continually.
#include <reg51.h>
sbit BIT4 = P1^4; // Bit-4 of Port P1
void main (void)
{
unsigned int m;
for (; ;) {
BIT4 = 0;
for (m = 0; m < 65530; m++);
BIT4 = 1;
for (m = 0; m < 65530; m++);
}
}
6. Special function register: This is a byte type data, written as sfr, corresponding to the special
function registers of the 8051. Compared to sbit, sfr data type corresponds to the byte-size
special function registers. The program noted in Example 5.5 shows the usage of sfr data type.
Example 5.5: The following C language program flips the bits of the port P2 continually
using sfr data type.
sfr P2 = 0xA0; // Declare P2 via its SFR address (reg51.h not needed here)
void main (void)
{
unsigned int m;
for (; ;) {
P2 = 0x55;
for (m = 0; m < 65530; m++);
P2 = 0xAA;
for (m = 0; m < 65530; m++);
}
}
Next, we shall look into two embedded C programs for the 8051 that use a mix of the data
types discussed so far. The programs read some port bits, perform an operation on the input
data and output the result on some port. The first program, noted in Example 5.6, works with ports
at the 8-bit unsigned char level. The second program, noted in Example 5.7, works at the bit level.
Example 5.6: The following program reads the ports P0 and P1 continually, computes AND of
the values read and outputs to the port P2.
#include <reg51.h>
unsigned char data1;
unsigned char data2;
unsigned char result;
void main (void)
{
while (1) {
data1 = P1;
data2 = P0;
result = data1 & data2;
P2 = result;
}
}
Example 5.7: The following program reads the port bits P0.0 and P1.4 continually, computes
AND of the values read and outputs to the port P2.1.
#include <reg51.h>
sbit data1 = P0^0;
sbit data2 = P1^4;
sbit result = P2^1;
void main (void)
{
while (1) {
result = data1 & data2;
}
}
The following program in Example 5.8 illustrates how the internal registers
corresponding to the timer operation in 8051, can be utilized in a high-level embedded C program.
It uses Timer-0 for producing delays.
Example 5.8: Consider the problem of generating a square wave with 50% duty cycle and
ON/OFF periods of 10 ms. The time delay of 10 ms will be generated using Timer 0 in mode
1. The square wave is generated at bit 5 of port P2.
#include <reg51.h>
void Delay_Rtn (void);
sbit p2_5 = P2^5;
void main (void)
{
while (1) {
p2_5 = ~p2_5; // Toggle P2.5
Delay_Rtn(); // Call delay
}
}
void Delay_Rtn (void)
{ // Refer to 8051 manual for the settings needed for timer operation
TMOD = 0x01; // Timer 0, Mode 1
TL0 = 0xFF; TH0 = 0xDB; // Load TL0 and TH0
TR0 = 1; // Start timer
while (TF0 == 0); // Wait for Timer 0 to overflow (TF0 set)
TR0 = 0; TF0 = 0; // Stop timer and clear TF0
}
[Fig 5.2: Interfacing eight DIP switches (Port 1, P1.0–P1.7) and eight LEDs (Port 0, P0.0–P0.7) with the 8051, both with 5V pull-ups]
1. Interfacing eight DIP switches and LEDs. Fig 5.2 shows the hardware connection for interfacing
of eight switches and LEDs to the ports of 8051. The switches are connected to Port 1, while
the LEDs are connected to Port 0. The C language program for the task has been noted next.
The program reads the settings of the switches through Port 1. If a switch is pressed, the
corresponding port bit will be read as a 0. The pattern read is output to Port 0. Thus, the
corresponding LED will turn ON.
#include <reg51.h>
void main (void)
{
while (1) P0 = P1;
}
2. Interfacing four multiplexed 7-segment LED displays. This example illustrates the hardware
connection and the software program needed to glow the specific segments in a multiplexed
display to output the desired pattern for display. Fig 5.3 shows the hardware connection. To
display a particular digit on a 7-segment module, first the module has to be selected. This has
been accomplished via the four least significant bits of port P2. Then, the specific bit pattern to
glow the required segments (out of {a, b, c, d, e, f, g, dp}) are sent via the bits of port P1, to all
the display modules. However, the segments get activated only for the module selected via P2.
This is repeated across the display modules at a sufficiently high speed to produce a flicker-free
display for the human eyes.
[Fig 5.3: Interfacing four multiplexed 7-segment display modules with the 8051 – segment data driven from Port 1 through 330Ω resistors, module select through P2.0–P2.3 driving BC557 transistors with 1 kΩ base resistors]
Let us assume that the pattern to be displayed is “1947”, stored in array STRING. The array
DGT_PATT stores the bit pattern to be output to a module to glow the required segments for a
digit. For details about the determination of the contents of DGT_PATT, the datasheet for the
display modules may be consulted. The corresponding C language program can be as follows.
# include <reg51.h>
void delay_rtn( unsigned int del_val );
unsigned char DGT_PATT[10] = { 0xC0, 0xF9, 0xA4, 0xB0, 0x99,
                               0x92, 0x82, 0xF8, 0x80, 0x90 };
unsigned char STRING[] = "1947";
void main()
{ while (1) {
    P1 = DGT_PATT[ STRING[0] - '0' ]; P2 = 0xF7; delay_rtn( 20 );
    P1 = DGT_PATT[ STRING[1] - '0' ]; P2 = 0xFB; delay_rtn( 20 );
    P1 = DGT_PATT[ STRING[2] - '0' ]; P2 = 0xFD; delay_rtn( 20 );
    P1 = DGT_PATT[ STRING[3] - '0' ]; P2 = 0xFE; delay_rtn( 20 );
  }
}
void delay_rtn( unsigned int del_val )
{ unsigned int x, y;
  for (x = 0; x < 1000; x++)
    for (y = 0; y < del_val; y++);
}
3. Interfacing ADC0808/0809. Next we look into the usage of embedded C programming for
interfacing an analog-to-digital converter with 8051 ports. Fig 5.4 shows the connection for
interfacing ADC0808/0809 to the 8051. Out of eight available analog channels, in this interface
a single channel (Channel 1) has been used. As this is an 8-bit ADC, the port P1 of the 8051 has
been used to read-in the converted digital value. Port P2 has been used to connect to other signal
lines of the ADC. Bits 0, 1 and 2 of port P2 have been used to provide the proper select values
needed for the input channel 1. The Vref(+) has been set to 2.56V and Vref(-) to 0 to ensure
[Fig 5.4: Interfacing ADC0808/0809 with the 8051 – the ADC clock is derived from the 8051 crystal through a chain of four divide-by-two D flip-flops]
10mV step size. Bit P2.3 controls ALE for the ADC. Bits P2.4 and P2.5 provide the output
enable (OE) and start conversion (SC) for the ADC. The end-of-conversion (EOC) pin of the
ADC has been connected to the port bit P2.6. The clock input for the ADC has been drawn from
the crystal connected to the 8051. Since the crystal frequency is high, it is passed through four
D flip-flops (with Q' connected to D) to reduce it. Each such flip-flop divides the frequency by
two. The VCC is connected to 5V, GND to 0 and the chip select (CS) is also held at zero to select the
ADC chip. The C language program for the ADC interface can be as follows.
#include <reg51.h>
sbit SEL_C = P2^0;
sbit SEL_B = P2^1;
sbit SEL_A = P2^2;
sbit ALE = P2^3;
sbit OE = P2^4;
sbit SC = P2^5;
sbit EOC = P2^6;
sfr DIG_VAL = P1;
void delay_rtn( unsigned int del_val );
void main()
{
DIG_VAL = 0xFF; // Configure P1 as input
EOC = 1; // Configure EOC line as input
ALE = 0; // Clear ALE
SC = 0; // Clear SC
OE = 0; // Clear OE
while (1) {
SEL_C = 0; SEL_B = 0; SEL_A = 1; // Choose channel 1
delay_rtn(1);
ALE = 1;
delay_rtn(1);
SC = 1;
delay_rtn(1);
ALE = 0;
SC = 0; // Start conversion
while (EOC == 1); // Wait for conversion to complete
while (EOC == 0);
OE = 1; // Enable output to be read
delay_rtn(1);
ACC = DIG_VAL; // Read the converted value into the accumulator (ACC)
OE = 0;
}
}
void delay_rtn( unsigned int del_val )
{
unsigned int x, y;
for (x = 0; x < 1000; x++)
for (y = 0; y < del_val; y++);
}
4. Interfacing DAC0808. Fig 5.5 shows an interface of DAC0808 with the 8051. Port P1 of the
8051 has been used to provide the digital input to be converted to analog current/voltage. VEE
has been given -15V while VCC has got +5V. VREF+ has been connected to 10V, VREF– to
zero, and the reference resistance is 5kΩ. This sets the reference current Iref to
10V/5kΩ = 2mA. An operational amplifier (such as the 741) can be used with a feedback
resistance of 5kΩ to convert the output current to an analog voltage.
The interface shown in Fig 5.5 can be used to generate different types of analog waveforms. The
following program shows how to generate a triangular wave through the DAC interface. For this,
the values 0 to 255 are output to port P1, followed by decrementing back down to 0. The digital
values, when converted by the DAC, produce an analog triangular waveform as output. After each
digital value is output, a small delay is introduced to give the DAC time to convert the pattern.
It may be noted that increasing the delay value will make the waveform more staircase-like
in shape. The waveform can be viewed on an oscilloscope to see the shape and adjust the delay
value accordingly.
[Fig 5.5: Interfacing DAC0808 with the 8051 – digital inputs A1–A8 driven from P1.7–P1.0, VCC = 5V, VEE = -15V, VREF+ = 10V through 5 kΩ, and an op-amp current-to-voltage converter with 5 kΩ feedback producing the analog output]
#include <reg51.h>
void delay_rtn( unsigned int del_val );
void main (void)
{
  unsigned char i;
  while (1) {
    for (i = 0; i < 255; i++) { P1 = i; delay_rtn(2); }
    for (i = 255; i > 0; i--) { P1 = i; delay_rtn(2); }
  }
}
void delay_rtn( unsigned int del_val )
{ unsigned int x, y;
for (x = 0; x < 1000; x++)
for (y = 0; y < del_val; y++);
}
The embedded C discussed so far has been tailored to the 8051 microcontroller. In particular,
the language has been augmented to include the hardware resources of the 8051, such as
ports, timers and configuration registers. These augmentations do not have any
meaning when a different processor is targeted. To illustrate the issue further, in the following, we
shall look into another embedded C variant that is used to describe the software to be executed by
ARM7 processor based systems. It may be noted, there are several manufacturers for the
microcontrollers based on ARM7 architecture. The hardware resources across them are again not
uniform. Thus, the embedded C language for each of them is different from the others. The
programmer needs to refer to the manual of the specific manufacturer to know the exact set of
available resources and their access mechanism. In the following we shall look into one such series
of microcontrollers developed around the ARM7 architecture.
[Block diagram of an LPC214x-series ARM7 microcontroller: the ARM7 core on its local bus with internal SRAM and flash controllers, the AMBA AHB with a vectored interrupt controller, an AHB-to-APB bridge to the peripherals, fast general-purpose I/O on ports P0 and P1, USB, PLLs and the system clock]
Most of the bits of P0 are available for I/O; however, these pins also have several functions
multiplexed onto them. As a result, the actual number of I/O pins available depends upon the
application at hand, which may use other modules of the chip as well. For example, pin P0.21 also acts as PWM5 (Pulse
Width Modulation), AD1.6 (ADC), CAP1.3 (Capture input for Timer1, Channel 3). For Port 0, 28
out of 32 pins can be configured as bidirectional. P0.24, P0.26 and P0.27 are unavailable. P0.30
can be used as output pin only.
Table 5.3 shows the important registers and their settings for GPIO operations. The registers need
to be programmed properly to get the I/O operations done. The registers control the functionality
of the pins, I/O directions, setting and resetting of bits. Individual bits can be programmed
separately. In particular, for an I/O operation, the direction of the port bit must first be set. For
example, the setting "IO0DIR = 0x04" will configure pin 2 of Port 0 (P0.2) as output and the
other bits as inputs. For an input operation, the IOxPIN register can be read. For an output
operation, to write a 1, the corresponding bit of the IOxSET register should be set to 1; to output
a 0, the corresponding bit of IOxCLR should be set to 1. After a port pin has been written with
some value, the value can be read back from the IOxPIN
register bit.
Example 5.9: Consider the problem of setting P0.14 pin to high first and then low. The C code
fragment for it is as follows.
IO0DIR |= (1 << 14); // Configure P0.14 as an output pin
IO0SET = (1 << 14); // Drive P0.14 high
IO0CLR = (1 << 14); // Drive P0.14 low
Next, we shall look into two more examples of GPIO operations. The first example assumes two
LEDs connected to Port 0 pins 0 and 1. When the program is executed, the LEDs will blink. In the
second example, a switch and an LED are connected to port pins 0 and 1 respectively. When the
switch is closed, the LED will glow.
Example 5.10: Two LEDs are connected to port pins P0.0 and P0.1. The LEDs should blink
continually. The complete C program for the purpose is shown next.
# include <lpc214x.h>
void Delay_rtn(void);
int main(void)
{
IO0DIR = 0x3; // Configure P0.0 and P0.1 as output
while (1)
{
IO0SET = 0x3; // Turn LEDs on
Delay_rtn();
IO0CLR = 0x03; // Turn LEDs off
Delay_rtn();
}
}
void Delay_rtn(void)
{
int a, b;
b = 0;
for (a = 0; a < 2000000; a++) b++; // Do something, else compiler may
// remove the loop
}
Example 5.11: A switch is connected to P0.0, an LED to P0.1. The LED should glow whenever
the switch is closed. A C program for the purpose is shown next (assuming the switch pulls
P0.0 low when closed).
# include <lpc214x.h>
int main(void)
{
IO0DIR = 0x2; // P0.1 as output, P0.0 as input
while (1)
{
if (IO0PIN & 0x1) // Switch open: P0.0 reads 1
IO0CLR = 0x2; // Turn LED off
else // Switch closed: P0.0 reads 0
IO0SET = 0x2; // Turn LED on
}
}
UNIT SUMMARY
A large number of programming languages have been developed over the years for software
systems. Most of these languages do not directly interact with the underlying hardware, in the sense
that the resources like registers, I/O ports and memory locations are not directly mentioned in the
high-level language programs. The compilers map the storage and I/O operations of the program
to the system resources, aiming at an efficient realization. Embedded programs, on the other hand,
interact directly with such system resources. Many of the conventional structured programming
languages have got augmented to support embedded programming as well. The most popular
amongst them are embedded C/C++. The embedded variants have new data types supporting
character and integer variables of both unsigned and signed types. Bit-level data types are also
supported. These languages allow access to the system registers of the underlying microcontrollers.
As a variant is targeted to one platform with specific hardware resources, each platform needs a
separate variant of the same embedded language. This unit has presented a large number of
embedded programs targeting the 8051 microcontroller, and a good number of device interfacing
examples have been included.
platform, such as, a particular ARM7 based implementation. The set of registers that the
programmer can access in the high-level language program becomes different. For the same I/O
operations, the programming technique goes through significant modifications. Thus, in order to
design program for any target platform, the user needs to use the development environment
provided by the platform vendor and look into the manual to identify the resources and their access
mechanism.
EXERCISES
Multiple Choice Questions
MQ1. An embedded software is targeted to
(A) Single processor (B) Dual processor
(C) Multiple processors (D) None of the other options
MQ2. Embedded software and firmware are
(A) Equivalent (B) Structurally same
(C) Nonequivalent (D) None of the other options
MQ3. For generic programming languages, optimization is done by
(A) Compiler (B) Operating System
(C) Linker (D) None of the other options
MQ4. The construct “if-then-else” is generally available in
(A) High-level languages (B) Assembly languages
(C) Both high-level and assembly languages (D) None of the other options
MQ5. Object-oriented features are available in
(A) C (B) C++
(C) Both C and C++ (D) None of the other options
MQ6. Python is not suitable for real-time applications due to
(A) Complexity (B) Non-determinism
(C) Simplicity (D) None of the other options
KNOW MORE
The languages C/C++ have been the most popular ones for embedded programming over the years.
Embedded C has been used in industrial automotive applications (highway speed checkers, signal
control, vehicle tracking software etc.), robotics (such as, line-following robots, obstacle avoidance
robots, robotic arms etc.), personal medical devices (such as, glucometers, pulse oximeters) and
many other microcontroller-based applications. However, in 2010 a modern multi-paradigm
programming language, called Rust, has been created by Mozilla. This is a fast and robust system
language that can be used in many applications including embedded devices and web development.
It is a statically typed programming language that combines the performance of C/C++ with
memory safety and expressive syntax. It ensures memory safety without a garbage collector:
ownership and borrowing are checked at compile time, ruling out data races and whole classes
of memory errors. Thus, many bugs are caught at compile time rather than during debugging.
The syntax of Rust is inspired by C/C++, enabling developers to learn it quickly. US
government agencies, including the NSA and the White House Office of the National Cyber
Director, have recommended memory-safe languages such as Rust. The language provides
fine-grained control over system resources. Rust has been used extensively in embedded automotive
applications as it can be compiled into low-level code without additional runtime overhead. The
built-in parallelism features of Rust can prevent many common problems faced by concurrent
program developers. However, there is a need for creation of toolchain, standardization and
development of available expertise, before Rust can possibly replace the languages like C/C++ in
embedded programming.
Hardware-Software Codesign
UNIT SPECIFICS
Through this unit we have discussed the following aspects:
• Design and usage of cosimulation for hardware-software codesign;
• Issues with hardware-software codesign;
• Integer Linear Programming formulation of the hardware-software partitioning problem;
• Extension of Kernighan-Lin graph bi-partitioning for hardware-software partitioning;
• Usage of metasearch techniques like GA and PSO for the partitioning problem;
• Power-aware hardware-software codesign;
• Functional partitioning technique to transform task-graph and partition;
This unit includes a thorough discussion on the hardware-software codesign process.
Techniques and algorithms designed for the problem have been elaborated. It is assumed that the
reader has a basic understanding of computer hardware and software technologies.
A large number of multiple choice questions have been provided with their answers. Apart
from that, subjective questions of short and long answer type have been included. A list of
references and suggested readings have been given, so that, one can go through them for more
details. The section “Know More” has been carefully designed so that the supplementary
information provided in this part becomes beneficial for the users of the book.
RATIONALE
This unit on hardware-software codesign presents the strategies used for realizing embedded
systems onto some target architecture. The target architecture may have some general purpose
processors for software realization of tasks and ASIC/FPGA for hardware realization.
Computationally intensive tasks may need hardware realization to meet the system deadlines.
Other tasks may be mapped to either the software or the hardware component, based on issues like
power consumption, system performance optimization etc. It becomes very crucial to take judicious
decision about selection of tasks to be realized on target architecture component. The problem is
computationally hard, with the brute force methods requiring exponential time. As a result, several
heuristic and metasearch techniques have been developed to address the partitioning problem.
Exact methods, such as the integer linear programming formulation, can be used to get optimal
solutions for small embedded designs. These techniques do not scale with increasing task-graph
complexity; however, they can act as a yardstick to judge other methods. Several heuristics have also
evolved for the partitioning problem. An important one in this category is the Kernighan-Lin bi-
partitioning technique, originally proposed for VLSI physical design. An extended version of the
same can be used for the hardware-software partitioning problem as well. Due to the complex
nature of the problem, researchers have explored metasearch techniques, such as, Genetic
Algorithm and Particle Swarm Optimization for solving the partitioning problem. Apart from
hardware area and delay, power consumption of the system is another important criterion to be
optimized, particularly for reconfigurable platforms like FPGA. Algorithms have been developed
targeting system architecture with general processor and FPGA. Transformations in the
specification task graph may aid in the partitioning process. The strategy, known as functional
partitioning, has been explored. This unit enumerates all these aspects of hardware-software
codesign and presents an overview of different algorithms and policies available. The designer
may choose the most appropriate alternative for the desired system and target architecture for
system realization.
PRE-REQUISITES
Computer Hardware and Software
UNIT OUTCOMES
List of outcomes of this unit is as follows:
U6-O1: State the design of cosimulators
U6-O2: State the role of cosimulators in hardware-software codesign
U6-O3: Formulate hardware-software partitioning as an Integer Linear Programming problem
U6-O4: Extend Kernighan-Lin bi-partitioning for hardware-software partitioning
U6-O5: Formulate Genetic Algorithm based framework for hardware-software partitioning
U6-O6: Perform hardware-software partitioning using Particle Swarm Optimization
Expected mapping with course outcomes (1: weak, 2: medium, 3: strong correlation)
          CO-1   CO-2   CO-3   CO-4
U6-O1      1      3      -      3
U6-O2      1      3      -      3
U6-O3      -      3      -      3
U6-O4      -      2      -      3
U6-O5      -      2      -      3
U6-O6      -      2      -      3
U6-O7      1      2      -      3
U6-O8      1      3      -      3
6.1 Introduction
An embedded application typically consists of a number of tasks with deadlines, both at task-level
and application-level. To meet the deadlines, it may be essential to realize at least a subset of tasks
using hardware, while other tasks may not be that much critical and software realization of them
may suffice. Realizing all tasks in software may not satisfy the performance requirements of the
application. On the other hand, a complete hardware realization of the system may turn out to be
too costly in terms of area and power consumption. An ideal solution to the problem is to have a
mixed hardware-software realization for the tasks of the application. In such a system, while some
tasks are realized in hardware, others are realized via software. This is called hardware-software
codesign of the application. As an embedded system designer, one must be able to judge
judiciously which tasks should go to hardware and which to software, so that the design
constraints (as discussed in Unit 1) are met and costs are optimized.
With so many options available for both the software and hardware platforms, the task of
codesign becomes almost intractable, unless the target architecture is explicitly specified. The
target architecture may consist of one or more of the following components.
• One or more general-purpose processors along with individual program- and data-memory.
Fig 6.1: Typical target architecture – an ARM processor with its program and data memory, and an ASIC, connected over a PCIe bus
The codesign problem is to decide the implementation platform (hardware or software) for
each individual task of the application. If there is a data-dependency between two tasks of the
application, communication takes place over the bus, incurring delay and power. While checking
for performance of the resultant system, this communication delay needs to be factored in. To
ensure the correctness of the combined system with both hardware and software components, it is
essential to perform a simulation of the system. As this simulation has two different types of
platforms to be simulated, it is known as cosimulation. Thus, the codesign problem consists of two
important subproblems – cosimulation and hardware-software partitioning. In the following, we
shall discuss these two topics in detail.
6.2 Cosimulation
Simulation happens to be one of the most powerful tools for checking system correctness. A
complex embedded system consists of both software and hardware modules that exchange
information between them to carry out the designated system functionality. On the other
hand, traditional simulation techniques are applicable to either the hardware systems or
the software systems. Fundamentally, the software systems are inherently sequential in
nature while the hardware is inherently parallel. This characteristic difference precludes
the usage of software simulator to simulate hardware modules and vice versa. Simulators used
for simulating the software modules developed using parallel programming languages also cannot
capture the delays involved in carrying a signal value from one place to another in the hardware.
Hence, designing a single simulator to cater to both the hardware and software modules of an
embedded system is difficult, if not impossible. At the same time, a joint simulation of the
hardware and the software modules is essential to prove the correctness of the designed system
that spans over both hardware and software platforms. The major difficulty in the process is the
non-existence of the actual hardware in the initial phase of the design process. Thus, joint
simulation cannot be taken up till a basic hardware module is ready to interact with the software
simulator. Previously, software developers used to develop their code with perceived assumptions
about behaviour of the hardware. There can be gross mismatch between this presumption and the
actual hardware behaviour, leading to quite painful integration steps at some later stage. Minor
miscommunication may lead to major design flaws. As the software modification is relatively
simple, the behavioural mismatch is patched in the software at the cost of performance, even
leading to expensive hardware/software redesign, in the worst case. The problem can be reduced
significantly through the usage of proper cosimulation at an early stage. Fig 6.2 shows the
integration of cosimulation with the hardware-software codesign process. Cosimulation is
primarily used for the purposes of verification and estimation at different stages of the codesign
process, as noted next.
• After the system specification has gone through refinement step, the resulting system
description needs to be verified to ensure that the desirable features are properly captured in the
resulting document.
• At hardware-software partitioning stage, the embedded system designer needs to take prudent
decision about the modules to be put into the two partitions. Cosimulation is necessary to ensure
that the resulting system will meet the performance requirements.
• After the system has been implemented, verification of the whole system is carried out via
cosimulation.
In a broader sense, cosimulation refers to simultaneous simulation of two or more parts of
a system at different abstraction levels. For example, in the hardware domain, if the circuit contains
both analog and digital components, the full circuit simulation needs both the analog and the
digital simulators to work in conjunction. A major problem with cosimulation is to create a bridge
between the component simulators and to ensure synchronization between them. A further
difficulty arises from the speed mismatch between the real components, which may not be
captured in the individual component-level simulators. The component-level simulators need to
interact with each other and exchange signal values. As a result, a number of issues arise that
must be resolved in any successful cosimulation implementation. The major issues involved are
as follows.
Fig 6.2: Integration of cosimulation with the hardware-software codesign process (system specification → refinement → system description → modules for SW realization and HW realization → SW compilation / HW synthesis, with verification through cosimulation at each stage)
1. Interfacing of simulators. The software part is typically simulated by executing C/C++ code,
while the hardware part is simulated via procedures written in VHDL/Verilog. Languages like
VHDL and Verilog can link to C code via the foreign language interface (VHDL) or the
programming language interface, PLI (Verilog).
2. Synchronization of simulators. As the hardware and software simulators together simulate
the behaviour of one full system, it is often necessary to synchronize the modules. Such
communication and synchronization can take place through shared memory. Typically, a
read-modify-write cycle is followed by the two simulators while accessing any shared memory
location. Synchronization is ensured by locking the memory locations so that the write access
to any shared location is performed exclusively by one of the simulators. This creates a tight
coupling between the hardware and software modules. On the other hand, following the design
principle of loosely-coupled systems, IPC channels (pipes, message queues, etc.) can be
utilized to synchronize the two modules. Whenever synchronization is required, one module
sends a message and waits for a response from the other module before proceeding.
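This loosely-coupled, message-and-wait style of synchronization can be sketched as follows; threads and queues stand in for the two simulator processes and their IPC channels, and all names are illustrative:

```python
# Sketch of message-and-wait synchronization between a software simulator
# and a hardware simulator over two IPC channels (illustrative names only).
import queue
import threading

def run_sw_simulator(to_hw, from_hw, log):
    for step in range(3):
        log.append(("sw", step))       # simulate one software step
        to_hw.put(("sync", step))      # send a message to the hardware side...
        from_hw.get()                  # ...and wait for its response

def run_hw_simulator(to_sw, from_sw, log):
    for step in range(3):
        from_sw.get()                  # wait until the software side is done
        log.append(("hw", step))       # simulate one hardware step
        to_sw.put(("ack", step))       # reply so the peer can proceed

def cosimulate():
    sw_to_hw, hw_to_sw, log = queue.Queue(), queue.Queue(), []
    sw = threading.Thread(target=run_sw_simulator, args=(sw_to_hw, hw_to_sw, log))
    hw = threading.Thread(target=run_hw_simulator, args=(hw_to_sw, sw_to_hw, log))
    sw.start(); hw.start(); sw.join(); hw.join()
    return log
```

Because each side blocks until the other's message arrives, the two "simulators" advance in strict alternation, mimicking the rendezvous enforced by locked shared memory or blocking IPC.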
3. Scheduling of simulators. This refers to the time instants at which the software and hardware
simulators are activated to simulate the corresponding modules. For this, a global timing is
introduced, while each individual simulator operates with respect to its local timing. Every
event generated by the environment or by any of the simulation modules is assigned a global
time stamp. Whenever the local time of a simulator matches the global time stamp of an event
for the corresponding module, the module is simulated.
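A minimal sketch of such global-time-stamp-based dispatching follows; the event tuples and simulator names are illustrative, and a real cosimulation kernel would advance each simulator's local clock rather than merely record a trace:

```python
# Illustrative global-time-stamp dispatcher: all events are ordered by their
# global time stamp and each one activates the simulator it belongs to.
import heapq

def run_cosimulation(events):
    """events: iterable of (global_time, simulator_name, payload) tuples."""
    pending = list(events)
    heapq.heapify(pending)            # order events by global time stamp
    trace = []
    while pending:
        t, sim, payload = heapq.heappop(pending)
        # A real kernel would advance `sim`'s local clock to t and let it
        # simulate `payload`; here we only record the activation order.
        trace.append((t, sim, payload))
    return trace
```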
4. Models of computation. All simulators work with a model of the underlying module. The
details captured by the model play a vital role in controlling the accuracy and execution time
of the simulator.
The overall cosimulation environment is shown in Fig 6.3. Depending upon the stage at which it
is carried out, there can be two types of cosimulation – abstract-level and detailed-level.
Abstract-level cosimulation is useful when the target processor/platform has not yet been
selected. Due to the absence of detailed information about the processor and hardware,
abstract-level simulation is less accurate than detailed-level cosimulation.
Fig 6.3: Cosimulation environment (the system model communicating through an interface model)
1. Abstract-level Cosimulation. Programming languages like C/C++ and hardware description
languages like VHDL support socket calls that can be inserted into the descriptions for the
purpose of communication. The interprocess communication (IPC) primitives of the
underlying operating system act as the communication channel. No concrete software and
hardware module implementations are available at this stage. The IPC routine calls in the
hardware component are grouped into an IPC handler process. In languages like VHDL, such
a handler process can be realized in terms of foreign routines. Fig 6.4 illustrates abstract-level
cosimulation.
2. Detailed-level Cosimulation. This is used after the designer has finalized the target architectures
for both the hardware and the software modules and the interface protocol has been decided upon.
For the hardware side, the interface module is synthesized by automatic generation of the
decoder/signal registers and the channel unit model available in the library. The channel unit
consists of an abstract model of a physical channel device (for example, a DMA controller).
Signal registers are additional registers introduced in the hardware unit to interface with the
communication channel. The core hardware corresponding to the embedded hardware module
may have ports of width different from the width of the interface channel. Thus, the data
transferred through the channel need to be reformatted to match the width of the core
hardware ports. For example, the core hardware may have four 32-bit ports while the channel
may be 16 bits wide. Hence, eight 16-bit signal registers may be introduced in the hardware
part, with two such registers feeding one port of the core hardware. Fig 6.5 shows the
detailed cosimulation scheme.
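The width-matching role of the signal registers in this example can be illustrated in a few lines of code (a sketch of the data reformatting only; the actual signal registers are hardware):

```python
# Sketch of the width-matching role of the signal registers in the example:
# a 32-bit core port served by two 16-bit registers over a 16-bit channel.
def channel_words(value32):
    """Split a 32-bit value into two 16-bit channel transfers (high, low)."""
    return [(value32 >> 16) & 0xFFFF, value32 & 0xFFFF]

def assemble_port(signal_registers):
    """Two 16-bit signal registers together feed one 32-bit core port."""
    high, low = signal_registers
    return (high << 16) | low
```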
Fig 6.4: Abstract-level cosimulation (a C/C++ program and the hardware description communicating via socket calls, IPC routines and an IPC handler process over a socket IPC channel)
Fig 6.5: Detailed-level cosimulation (the HW core interfaced through the decoder/signal registers and the channel unit)
5. Starting at the specification level, functional partitioning methods restructure the application
task graph and then apply partitioning heuristics on it to obtain good quality mapping solutions.
1. Target Architecture. Any codesign problem works for a target architecture on which the
application will run. The target architecture consists of both processor(s) to run the software
part and ASIC/FPGA to run the hardware part. More formally, if the embedded system
implementation platform consists of an ASIC h and a set of processors P = {p1, p2, …, pn} along
with external memory and buses for communication, the set of target architecture components
for the system can be represented as,
TA = {h} ∪ P.
For the sake of uniformity, the ASIC can be considered as the first component in the set TA
and denoted as ta_0, while the processors may be marked as ta_1 through ta_n.
2. System. It represents the application to be realized. An application is represented by a task
graph with nodes corresponding to the individual tasks. Each task is of a certain entity type. For
example, there may be two filtering operations in the application, each of them representing a
task while both of them are of the same entity type filter. Edges between the nodes of the task
graph represent the interconnection requirements between the corresponding pair of tasks.
Formally, a system is defined by the tuple S = <ET, V, E, I>, where,
ET = {et_1, …, et_nET} is the set of entity types,
V = {v_1, …, v_nV} is the set of nodes which are instances of entity types,
E ⊂ V × V is the set of edges corresponding to the interconnections,
I: V → ET, with I(v_j) = et_l, defining the type of node v_j to be et_l.
3. Cost Metrics. For entity type et_l, the associated cost metrics are as follows.
• c^a(et_l) – Hardware area.
• c^th(et_l) – Hardware execution time.
• c^dm(et_l) – Software data memory requirement.
• c^pm(et_l) – Software program memory requirement.
• c^ts(et_l) – Software execution time.
The interface cost for an edge e = (v_1, v_2) consists of the following.
• ci^a(e) – Additional hardware area.
• ci^t(e) – Communication time.
4. Design Quality. It is identified by the following design metrics.
• 𝐶𝐶 𝑎𝑎 (𝑆𝑆) – Hardware area for the system.
Hardware-Software Codesign | 187
• C_k^x is the design quality metric C^x(S) of S on target architecture component ta_k, x ∈ {a, pm, dm, t}.
• MAX_k^x is the design constraint MAX^x(S) on ta_k of S, x ∈ {a, pm, dm, t}.
• T_j^S is the execution start time of node v_j.
• b_{j1,j2,k} = 1 if v_j1 and v_j2 are mapped to different components, or v_j1 finishes before
v_j2 starts on ta_k; 0 otherwise.
Next, we shall look into the formulation of constraints to be satisfied by the solver to obtain the
map function of hardware-software partitioning.
1. General Constraints. Each node v_j of the task graph must be mapped to exactly one target
architecture component ta_k.
∀j ∈ J: x_{j,0} + Σ_{k∈K} y_{j,k} = 1.
For the node v_j, either x_{j,0} or one of the y_{j,k}'s can be 1, identifying the target to which v_j has
been mapped.
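A quick executable check of this constraint can be written as follows; the x and y arrays mirror the 0/1 variables of the formulation, and the sample values in the assertions are made up:

```python
# Executable check of the general constraint: for each task j, exactly one of
# x[j] (the variable x_{j,0}) and the y[j][k] variables may be 1.
def satisfies_general_constraint(x, y):
    """x: list of 0/1 values x_{j,0}; y: per-task lists of 0/1 values y_{j,k}."""
    return all(xj + sum(yj) == 1 for xj, yj in zip(x, y))
```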
2. Resource Constraints. For the software side, the program and data memory constitute the
resources. In hardware implementation, area is the resource. Considering data memory
requirement, for each processor, the total data memory used should not exceed the maximum
available data memory in it. Similar constraint holds for the program memory requirement.
For the hardware side, area requirement is given by the area used by mapped tasks in unshared
mode and in shared mode along with the area needed to realize the interface. The total
hardware area requirement should not exceed the maximum available area. The corresponding
conditions can be captured by the following expressions.
∀k ∈ K−{0}: C_k^dm = Σ_{l∈L} sh_{l,k} × c_{l,k}^dm ≤ MAX_k^dm.
∀k ∈ K−{0}: C_k^pm = Σ_{l∈L} sh_{l,k} × c_{l,k}^pm ≤ MAX_k^pm.
C_0^a = Σ_{j∈J} x_{j,0} × c_{j,0}^a + Σ_{l∈L} sh_{l,0} × c_{l,0}^a + CI_0^a ≤ MAX_0^a.
3. Timing Constraints. To formulate the timing constraints, first a schedule has to be computed for
the task graph nodes. The schedule is obtained once by following the As Soon As Possible
(ASAP) policy and then with the As Late As Possible (ALAP) policy. The start time allotted to a
node v_j, T_j^S, must be between the ASAP and ALAP times for the node. The execution time of
the node, T_j^D, must be equal to either its software execution time or its hardware execution
time, depending upon the target architecture component to which the node has been mapped. The
finish time for the node, T_j^E, is equal to the sum of T_j^S and T_j^D. The constraints are
formulated as follows.
∀j ∈ J: T_j^D = x_{j,0} × c_{j,0}^th + y_{j,0} × c_{j,0}^th + Σ_{k∈K−{0}} y_{j,k} × c_{j,k}^ts.
Here, T^I_{j1,j2} is the time needed to pass information through the interface from node v_j1 to node
v_j2. An interface corresponding to an edge e = (v_j1, v_j2) is necessary only if the nodes are
mapped onto different architecture components. The requirement of the interface along with
the cost computations can be expressed by the following set of equations and inequalities.
∀e = (v_j1, v_j2) ∈ E:
i_{j1,j2} ≥ x_{j1,0} + y_{j1,0} + Σ_{k∈K−{0}} y_{j2,k} − 1.
i_{j1,j2} ≥ x_{j2,0} + y_{j2,0} + Σ_{k∈K−{0}} y_{j1,k} − 1.
i_{j1,j2} ≥ Σ_{k1∈K−{0}} Σ_{k2∈K−{0}, k2≠k1} y_{j1,k1} + y_{j2,k2} − 1.
T^I_{j1,j2} = i_{j1,j2} × ci^t_{j1,j2}.
4. Node Sharing. Regarding the sharing of computational resources, the tasks mapped onto one
processor (that is, target architecture components ta_1 through ta_n) are considered to be sharing
it. On the other hand, for ta_0, the ASIC part, an entity et_l is shared only if at least two nodes
v_j1 and v_j2 that are instances of et_l are mapped in shared mode onto it.
∀l ∈ L: ∀j1, j2 ∈ J: I(v_j1) = I(v_j2) = et_l: sh_{l,0} ≥ y_{j1,0} + y_{j2,0} − 1.
∀k ∈ K−{0}: ∀l ∈ L: ∀j ∈ J: I(v_j) = et_l: sh_{l,k} ≥ y_{j,k}.
The sh_{l,k} values so computed are used to satisfy the inequalities noted earlier under Resource
Constraints.
5. Shared Node Scheduling. Sharing of nodes on the same resource also leads to the
requirement of scheduling the nodes for execution. This is expressed by a set of constraints
involving the 0/1 variables b_{j1,j2,k} and b_{j2,j1,k} for a pair of nodes v_j1 and v_j2 mapped onto the
same architecture component ta_k. The first five inequalities ensure that the two variables
cannot be assigned the value 1 simultaneously, as y_{j1,k} and y_{j2,k} are both already 1 (due to
sharing on ta_k).
190 | Embedded Systems
design work well for small problem sizes with a small number of modules. Thus, before running the
place-and-route tools, it becomes imperative to group the highly communicating modules into the
same partition so that they are placed close to each other on the silicon floor. The algorithm has
been extended for use in many other domains. It has extremely low runtime (of the order of a few
seconds for a moderately sized graph) and produces excellent results. While applying it to other
domains, many augmentations have been made to the basic algorithm. Some such augmentations
include supporting imbalanced partitioning with an unequal number of nodes in the two
partitions, using efficient data structures to make the implementation more efficient, etc.
Fig 6.6: (a) Example task graph (b) A target architecture
Kernighan-Lin Heuristic
In the following, we shall look into the basic Kernighan-Lin bi-partitioning heuristic for graphs
and hypergraphs. A bi-partitioning corresponds to a cut of the vertices of the graph. The cost of
the cut is equal to the sum of the edge-costs of all edges crossing the cut. The algorithm starts
with a balanced partitioning with each partition containing half of the vertices of the graph. This
initial bi-partition defines a cut-cost. The algorithm attempts to improve upon this cost by
swapping vertices across the partitions in an iterative fashion.
At each iteration, initially all vertices are marked as unlocked – they can participate in swap
operations. Next, all possible pair-wise swapping of unlocked vertices between the two partitions
are evaluated in terms of their resulting cut-cost values. Among all such possible swaps, the one
resulting in the greatest cost decrease or least cost increase (in case no swap can produce cost
decrease) is taken. The corresponding vertices are exchanged across the partitions and are marked
as locked. The process continues as long as unlocked pairs of vertices remain in the partitions.
Once all the vertices are labelled as locked, one iteration ends. Now, all vertices are relabelled
as unlocked and the whole process is repeated in the next iteration. The algorithm stops if the
most recent iteration has not improved the cut-cost compared to the previous iteration. The
complete algorithm is noted next.
Algorithm Kernighan-Lin
Input: Vertex set V = {v_1, v_2, …, v_m} partitioned into two equal-sized partitions p1 and p2, such
that, p1 ∪ p2 = V and p1 ∩ p2 = ∅.
Output: A refined partition with reduced cut-cost.
Begin
  do
    current-partition = best-partition = (p1, p2);
    Unlock all vertices;
    while unlocked-vertices-exist(current-partition) do
      swap = select-next-move(current-partition);
      current-partition = move-and-lock-vertices(current-partition, swap);
      best-partition = get-better-partition(best-partition, current-partition);
    end while;
    if not(cut-cost(best-partition) < cut-cost(p1, p2)) then return (p1, p2);
    else // Start next iteration
      (p1, p2) = best-partition;
      Unlock all vertices;
    end if;
  end do;
End.
Procedure select-next-move(P)
begin
  Let P have partitions r1 and r2;
  for each unlocked pair (v_i ∈ r1, v_j ∈ r2) do append(costlog, cut-cost(swap(P, v_i, v_j)));
  return the (v_i, v_j) swap in costlog with minimum cost;
end;
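The pseudocode above can be rendered in Python roughly as follows; the graph is represented as a dict mapping vertex pairs to edge costs, ties between equally good swaps are broken arbitrarily, and this is a compact sketch rather than an optimized implementation:

```python
# Compact rendition of the Kernighan-Lin bi-partitioning heuristic.
def cut_cost(edges, p1):
    part = set(p1)
    return sum(c for (u, v), c in edges.items() if (u in part) != (v in part))

def kernighan_lin(edges, p1, p2):
    best = (cut_cost(edges, p1), sorted(p1), sorted(p2))
    while True:                                   # one pass per iteration
        cur1, cur2 = list(p1), list(p2)
        unlocked1, unlocked2 = set(cur1), set(cur2)
        pass_best = best
        while unlocked1 and unlocked2:
            # evaluate every pair-wise swap of unlocked vertices
            candidates = []
            for a in unlocked1:
                for b in unlocked2:
                    trial = [b if x == a else x for x in cur1]
                    candidates.append((cut_cost(edges, trial), a, b))
            _, a, b = min(candidates)             # greatest decrease / least increase
            cur1 = [b if x == a else x for x in cur1]
            cur2 = [a if x == b else x for x in cur2]
            unlocked1.discard(a)
            unlocked2.discard(b)                  # lock the swapped pair
            cand = (cut_cost(edges, cur1), sorted(cur1), sorted(cur2))
            pass_best = min(pass_best, cand)
        if pass_best[0] < best[0]:                # improved: start next iteration
            best = pass_best
            p1, p2 = best[1], best[2]
        else:
            return best
```

On a toy graph with edges (1,2), (3,4) and (1,3), each of cost 1, and initial partition {1,3} / {2,4} (cut-cost 2), the heuristic finds a partition of cut-cost 1.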
To illustrate the operation of the algorithm, consider the graph shown in Fig 6.7. It has to be bi-
partitioned with minimum cut-cost. The graph has eight vertices and ten edges with cost of each
edge equal to 1. Let us assume that the algorithm starts with the initial partition p1 = {1, 2, 3, 4}
and p2 = {5, 6, 7, 8}. The initial cut-cost is equal to 6. The first iteration of the partitioning process
has been shown in Table 6.2. At the end of this iteration, the best-partition is {{1, 2, 5, 6}, {3, 4,
7, 8}} with cut-cost 2. Now, all nodes will get unlocked and the next iteration of the partitioning
process will start. It may be shown that the second iteration cannot improve the cut-cost, as it has
already reached the minimum value for this example. Thus, the algorithm terminates at the end of
the second iteration.
Fig 6.7: Example graph
Table 6.2: First iteration of the Kernighan-Lin partitioning process

Swaps:    (1,5) (1,6) (1,7) (1,8) (2,5) (2,6) (2,7) (2,8) (3,5) (3,6) (3,7) (3,8) (4,5) (4,6) (4,7) (4,8)
cut-cost:   8     5     5     6     5     6     6     5     5     6     6     5     6     5     5     8

Swaps:    (1,6) (1,7) (1,8) (2,6) (2,7) (2,8) (4,6) (4,7) (4,8)
cut-cost:   6     8     7     5     7     4     2     4     5

cut-cost:   5     6     6     5

Swaps:    (2,8)
cut-cost:   6
Here, e_k.tt is the time to transfer the bits corresponding to edge e_k over the bus. If the bus
has a width of W bits and one W-bit transfer over the bus takes B time, e_k.tt can be expressed
as,
e_k.tt = B × ⌈e_k.bits / W⌉
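The expression can be captured by a small helper, assuming, as the ceiling implies, that a partial final transfer still takes a full B time units:

```python
# Transfer time of an edge over a W-bit bus taking B time per transfer.
import math

def transfer_time(bits, w, b):
    """e_k.tt = B x ceil(e_k.bits / W)."""
    return b * math.ceil(bits / w)
```

With B = 2 and W = 8, an edge carrying 16 bits takes 2 × ⌈16/8⌉ = 4 time units.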
For the task graph shown in Fig 6.6(a), the execution times of individual tasks can be
formulated as follows.
v5.ex = v5.int
v4.ex = v4.int + e6.freq × (e6.tt + v5.ex)
      = v4.int + e6.freq × (e6.tt + v5.int)
v3.ex = v3.int + e4.freq × (e4.tt + v4.ex)
      = v3.int + e4.freq × (e4.tt + v4.int + e6.freq × (e6.tt + v5.int))
v2.ex = v2.int + e5.freq × (e5.tt + v5.ex)
      = v2.int + e5.freq × (e5.tt + v5.int)
v1.ex = v1.int + e1.freq × (e1.tt + v2.ex) + e2.freq × (e2.tt + v3.ex) + e3.freq × (e3.tt + v4.ex)
      = v1.int + e1.freq × (e1.tt + v2.int + e5.freq × (e5.tt + v5.int))
        + e2.freq × (e2.tt + v3.int + e4.freq × (e4.tt + v4.int + e6.freq × (e6.tt + v5.int)))
        + e3.freq × (e3.tt + v4.int + e6.freq × (e6.tt + v5.int))
As v1 is the root node of the application task graph, minimizing v1.ex optimizes the
overall execution time requirement. In real hardware, computation and communication
proceed in parallel; however, this formulation has ignored that phenomenon. The loading
of the bus may also affect the data transfer time through it. This factor has also been
ignored, leading to some inaccuracies in the overall model.
2. Single Object Move. The basic Kernighan-Lin heuristic works with two balanced
partitions, each having the same number of vertices. This requirement is justified in
VLSI physical design, as the two partitions represent two segments of the chip area,
which need to be roughly balanced. However, in hardware-software partitioning, this
requirement is not particularly meaningful. For the software modules, the memory
required to store the instructions matters, whereas for the hardware part, the number
of gates is important. Balancing the partitions in terms of the number of modules
mapped is of little significance. Hence, instead of swapping vertices across the
partitions, it is better to explore the design space by first assuming that all vertices
belong to one of the partitions – each iteration then moves vertices from this partition
to the other. Thus, instead of a "swap", a "move" operation is followed.
3. Change Equations. When a task vertex is moved from its currently assigned partition to
the other one, it may result in the change in execution times of many of the tasks. The
changes come in the internal execution time of the vertex and the transfer times for all
edges associated with the vertex. This change in transfer time will affect the execution
times of many other task vertices. This requires recomputation of many of the equations
corresponding to v_i.ex. However, not all equations are affected by such a move. For
example, in the given task graph of Fig 6.6(a), if the vertex v1 changes its partition, the
quantities affected are v1.int, e1.tt, e2.tt and e3.tt. If these quantities change by the
amounts Dv1.int, De1.tt, De2.tt and De3.tt, the change in the execution time of the
vertex can be expressed by the following equation for Dv1.ex.
Dv1.ex = Dv1.int + De1.tt × e1.freq + De2.tt × e2.freq + De3.tt × e3.freq
Similar change equations can also be derived for the change in the quantity Dv1.ex for
the movement of the other vertices as well. Table 6.3 notes all such change equations. By
computing these equations at any iteration, the algorithm can identify the vertex to be
moved to minimize the execution time v1.ex. The vertex may then be moved and locked
for the remaining part of the iteration.

Table 6.3: Change equations for vertex moves
v1: Dv1.int + De1.tt × e1.freq + De2.tt × e2.freq + De3.tt × e3.freq
v2: e1.freq × (De1.tt + Dv2.int + e5.freq × De5.tt)
v3: e2.freq × (De2.tt + Dv3.int + e4.freq × De4.tt)
v4: e2.freq × e4.freq × (De4.tt + Dv4.int + e6.freq × De6.tt)
    + e3.freq × (De3.tt + Dv4.int + e6.freq × De6.tt)
v5: e1.freq × e5.freq × (De5.tt + Dv5.int)
    + e2.freq × e4.freq × e6.freq × (De6.tt + Dv5.int)
    + e3.freq × e6.freq × (De6.tt + Dv5.int)
Table 6.4: Edge transfer times (B = 2, W = 8)
e1: 2 × (16 ÷ 8) = 4
e2: 2 × (32 ÷ 8) = 8
e3: 2 × (8 ÷ 8) = 2
e4: 2 × (16 ÷ 8) = 4
e5: 2 × (16 ÷ 8) = 4
e6: 2 × (32 ÷ 8) = 8
Next we shall consider the partitioning of the application task graph of Fig 6.6(a) onto the
platform in Fig 6.6(b). Assuming that all the tasks are initially mapped onto software, the
execution time of the application is given by the execution time of vertex v1.
The first iteration of the partitioning process has been shown in Table 6.5. Initially, all
the tasks are assumed to be mapped to software. Thus, the overall execution time is 24820 μs. At
the end of the first iteration, all tasks have moved to hardware and all the task vertices are locked.
The execution time becomes 1605 μs which is the case when all tasks are realized in hardware.
Now, all the vertices will be unlocked and the next iteration of the partitioning process will start.
It can be verified that this iteration does not improve the execution time further. Hence, the
algorithm terminates with the solution that all tasks be mapped to hardware. It may be noted that
if some other metric (such as the area occupied, or a weighted sum of area and execution time)
were minimized instead of the execution time, a different partitioning result might be obtained.
fitness of the chromosome. The lower the numeric value of the fitness measure, the more fit the
chromosome.
To evolve the next generation from the current population, first the chromosomes are
sorted in the ascending order of their fitness. In order to ensure that good solutions are not lost in
the evolution process, a small percentage of better fit chromosomes are directly copied to the next
generation. The remaining population is created through the genetic operators – crossover and
mutation. Crossover operator selects two random chromosomes from the current population (with
a bias towards selecting better fit chromosomes). Parts of these parent chromosomes are exchanged
to create the offspring. The crossover may be single-point or multi-point. For example, consider
two chromosomes, “1010001010” and “1111001101” for an application with ten tasks. If the
single-point crossover is followed with the crossover-point after four bits from the left, the
offspring created are “1010001101” and “1111001010”. The mutation operator randomly changes
some bits of a parent chromosome to create an offspring. For example, in the chromosome
“1100101011” if mutation is applied at bits 1, 3 and 5 from the left side, the resulting offspring will
be “0110001011”. The rate of mutation may control the rate of convergence of the search
algorithm.
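The two operators from the worked examples above can be written directly on bit-string chromosomes; positions are counted 1-based from the left, as in the text:

```python
# Single-point crossover and mutation on bit-string chromosomes.
def single_point_crossover(parent1, parent2, point):
    """Exchange the tails of two chromosomes after `point` bits."""
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(parent, positions):
    """Flip the bits of `parent` at the given 1-based positions."""
    bits = list(parent)
    for p in positions:
        bits[p - 1] = '1' if bits[p - 1] == '0' else '0'
    return ''.join(bits)
```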
A major concern with the genetic algorithm is its slow rate of convergence. To converge
to a solution, a large number of iterations have to be carried out. Since the optimal fitness value
for a problem is not known, it is difficult to decide when to terminate the genetic algorithm.
Typically, the termination criterion is set to be “no improvement over the last fixed number of
generations” or the completion of a maximum number of generations. Once the termination
criterion is met, the evolution process stops. The best solution at that point is declared as the
output of the optimization process. Increasing the mutation rate may ensure quicker convergence.
However, it often leads to convergence to some local optimum rather than the global optimum.
It is assumed that the birds are searching randomly for food items in an area. It is also assumed that
there is only one piece of food and none of the birds know where exactly the food is located.
However, each of them can estimate how far it is from the food, based on its own intelligence and
through learning via inter-communications with fellow birds. Naturally, among all the birds, one
will have the best estimation and assumes the leader role. The other birds decide on their flight
based upon the information from the leader and also their own intelligence captured in their
history (that is, the different estimations of the distance from the food that they have made in the
past). All birds effectively
attempt to decide their movements at each time instant to ultimately reach the location of the food.
The PSO formulation applies the concept of bird flocking to optimization problems in which
proposing a large number of solutions and estimating their costs is easy, while obtaining the
optimum solution is difficult. Similar to GA, PSO also works with a population, each element of
which is called a particle. A particle constitutes a solution to the partitioning problem. If the target
architecture consists of a hardware and a software platform, a binary stream similar to the GA
chromosome structure can be used to represent a particle. The particle may be thought of as
containing two parts – the bit stream (named the position vector) and the fitness value. The
updating of a particle
is guided by two best values. The first one is the best solution that the particle has achieved so far
during the evolution process. The corresponding position is called its pbest. The second parameter
is the global best position that the optimizer has obtained so far, known as current global best or
gbest. Once these two positions have been identified, a particle updates its velocity and position
using the following two equations.
v_{k+1}^i = w × v_k^i + c1 × r1 × (pbest^i − x_k^i) + c2 × r2 × (gbest_k − x_k^i)
x_{k+1}^i = x_k^i + v_{k+1}^i
Here,
v_{k+1}^i : Velocity of particle i at the (k+1)-th iteration
v_k^i : Velocity of particle i at the k-th iteration
x_{k+1}^i : Position of particle i at the (k+1)-th iteration
x_k^i : Position of particle i at the k-th iteration
r1, r2 : Two random numbers between 0 and 1
w : Inertia factor
c1 : Self-confidence (cognitive) factor
c2 : Swarm-confidence (social) factor
Velocity computation has three parts. The first term captures the inertia of the particle to
maintain its current velocity. The second term captures the influence of the particle's memory in
deciding the velocity. The third term incorporates the influence of swarm intelligence (the
position of the current best particle) into the velocity determination process. The velocity of a
particle is clamped at a value V_max specified by the user. It has been found that in many
optimization problems, PSO performs better than GA due to its fewer tunable parameters and the
linear complexity of its main loop.
For the hardware-software partitioning problem, every task in a particle may be initialized
with a velocity in the range –1 to +1. A negative velocity may indicate the task is moving towards
0 (the software platform) while a positive velocity takes the task towards 1 (the hardware
platform). If the position goes outside the range 0 to 1, it may be clamped to the nearest
boundary value. The particles may be allowed
to evolve over the generations and the process may be terminated if the improvement in global best
solution becomes less than a small predefined number.
increases the power consumption of the system. This may create occasional power limit violations.
Thus, the tasks to be moved to FPGA are to be chosen carefully. Once a task is mapped onto
hardware, if its predecessor and/or successor task(s) are mapped onto software, the bus
communication time and the associated bus power consumption also need to be accounted for.
These extra amounts may lead to deadline violations and/or power limit violations.
The partitioning process followed here uses iterative improvement to produce valid
solutions. It assumes that initially all tasks are mapped onto software. This realization is expected
to satisfy the power constraint, as hardware execution of tasks needs more power. Mapping tasks
to hardware may violate the power constraint and lead to power spikes during the execution
lifetime of the application. The algorithm discussed here prioritizes among the software tasks
that are candidates for moving to FPGA, based on a mobility index detailed next.
For the set of tasks mapped onto software, the processor can execute only one at a time. A
software task can start executing only after all its predecessor tasks (in software and in FPGA) are
over. Thus, for a task i, its earliest possible start time, E_i, is given by,
E_i = MAX_{k ∈ pred(i)} δ(k)
where pred(i) is the set of predecessors of task i and δ(k) is the finish time of task k. Similarly,
the latest possible start time, L_i, is defined as,
L_i = MIN_{k ∈ succ(i)} (σ(k) − t_i)
where succ(i) is the set of successor tasks of task i and σ(k) is the start time of task k. The quantity
t_i is equal to either the software execution time (ts_i) or the hardware execution time (th_i) of task
i, depending upon whether the task has been mapped onto software or hardware, respectively.
Mobility, denoted as μ(i) for task i, is a binary quantity defined as follows.
μ(i) = 1 if L_i > E_i, 0 otherwise
The tasks with mobility value 1 are better suited for moving to hardware because of two reasons.
First, there is time to take care of communication over the bus. Second, the movement will reduce
the execution time of the task providing better parallelism opportunities.
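The quantities E_i, L_i and μ(i) can be computed directly from an existing schedule; here the `finish` and `start` dictionaries stand in for δ(k) and σ(k), and the data shapes are illustrative:

```python
# Earliest start, latest start and mobility from a given schedule.
def earliest_start(preds, finish):
    """E_i = max over predecessors k of delta(k); 0 if there is no predecessor."""
    return max((finish[k] for k in preds), default=0)

def latest_start(succs, start, t_i):
    """L_i = min over successors k of sigma(k), minus the task's own time t_i."""
    return min(start[k] for k in succs) - t_i

def mobility(e_i, l_i):
    """mu(i) = 1 if L_i > E_i, 0 otherwise."""
    return 1 if l_i > e_i else 0
```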
Next we shall look into the computation of the time needed to communicate N_samples
bits over a bus of width N_bus bits. The number of communication cycles needed per transfer is
CC. Then, the total number of cycles for the communication is CC × N_samples / N_bus. If the
arbitration decision takes an additional AC cycles and the frequency of the channel is F, the
communication time is given by,
t_comm = (CC × N_samples / N_bus + AC) / F
The communication between tasks i and j is needed if task i has been mapped to software
and task j to hardware. This communication must be scheduled during the execution window of
task j, that is, between the earliest possible start time E_j and the latest possible finish time F_j.
Thus, if σ(i) and σ(j) are the start times of the two tasks with execution times t_i and t_j, and
σ(comm) is the start time of the communication, the following inequalities must hold.
σ(i) + t_i ≤ σ(comm) < F_j
E_j < σ(j) + t_j ≤ F_j
The bus transfer activity also incurs power consumption. Assuming that the operating voltage of the bus is V, its capacitance is C_bus, the number of words transmitted per second is m, and the number of bits per word is n, the power consumption is given by the following equation.

    P_bus = (1/2) × C_bus × V² × m × n
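A one-line helper evaluating the bus power equation; the example values (capacitance, voltage, word rate, word width) are illustrative, not from the text:

```python
def bus_power(C_bus, V, m, n):
    """P_bus = 0.5 * C_bus * V^2 * m * n (watts, for SI-unit inputs)."""
    return 0.5 * C_bus * V**2 * m * n

# Example: 20 pF bus, 3.3 V, 1e6 words/s, 32 bits/word
print(bus_power(20e-12, 3.3, 1e6, 32))  # about 3.48e-3 W
```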
Algorithm Power-Aware-Partitioning
Begin
  Map all tasks to software; // Schedule assumed to be power-valid
  While schedule is not time-valid do
  Begin
    Call Task-Selection-Routine() to select task i to go to hardware;
    Determine start time σ(i) using mobility μ(i);
    Update bus activity;
    Reschedule task graph;
    Update hardware requirement Ah;
    If (execution time decreases) AND (schedule is power-valid) AND
       (Ah < Ah_total) then
      Note new schedule; Continue next iteration of the loop;
    Else if (Ah ≥ Ah_total) then
      Invalidate selection of task i for hardware mapping for all future iterations;
      Continue the loop;
    Else if (schedule is not power-valid) then
      Invalidate selection of task i for hardware mapping for the next iteration;
      Continue the loop;
    Else if (no time improvement) then
      Invalidate selection of task i for hardware mapping for the next iteration;
      Continue the loop;
  End While;
End.
Procedure Task-Selection-Routine()
Begin
  Let N_S constitute the set of software tasks in decreasing order of execution time ts_i;
  Compute mobility of all tasks in N_S;
  If (μ(i) = 0) for all i ∈ N_S then
    Select task i with the maximum execution time ts_i;
  Else
    Select task i with the maximum execution time ts_i and non-zero mobility;
End;
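A minimal Python sketch of the selection step, assuming tasks are identified by name and that the software execution times ts and mobilities mu are available as dictionaries (both names are illustrative):

```python
def select_task(software_tasks, ts, mu):
    """Sketch of Task-Selection-Routine: among the software tasks, prefer
    the one with maximum software execution time ts_i and non-zero
    mobility; if every task has mobility 0, fall back to maximum ts_i.
    (No pre-sorting needed: max() finds the longest task directly.)"""
    mobile = [i for i in software_tasks if mu[i] != 0]
    candidates = mobile if mobile else list(software_tasks)
    return max(candidates, key=lambda i: ts[i])

ts = {'a': 12, 'b': 30, 'c': 7}   # software execution times (made up)
mu = {'a': 1, 'b': 0, 'c': 1}     # mobilities (made up)
print(select_task(ts.keys(), ts, mu))  # 'a': longest task with mobility 1
```

Note that 'b' has the largest execution time but zero mobility, so it is passed over unless no mobile task exists.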
highly suitable for implementation. For example, functionality of an application may be specified
in terms of some user-defined procedures. These modules represent user-level views of the system
functionality. However, there may be commonalities between a number of procedures, a
procedure may be too large or too small, etc. Taking these individual procedures as tasks in the
task-graph to be partitioned may lead to poor system implementations.
Functional partitioning is the process of dividing a given functional specification into a
set of sub-specifications. These individual sub-specifications correspond to the tasks in the task-
graph. Individual tasks in the task-graph will get mapped onto hardware and software components
of the target architecture. Thus, functional partitioning modifies a large functional specification
into a number of smaller specifications. Functional partitioning provides a number of advantages,
as noted next.
1. In ASIC/FPGA based hardware realizations, optimization tools (such as, logic synthesis tools)
are used to synthesize the logic structure. These optimization problems being quite complex
in nature, the tools can produce reasonably good circuits if the input problem size is not too
large. The heuristic tools work well for small-to-medium sized inputs. Also, the runtimes of
these tools increase at a rapid rate with increase in the input size.
2. For smaller designs synthesized in the hardware, the clock frequency can be made significantly
high. This may offset the delay of data transfer over the bus for communicating with tasks
mapped to software.
3. The FPGAs possess limited resources in terms of logic capacity and input-output pins. If the
mapped task is large, it may necessitate multiple FPGA chips with inter-chip communication
overheads.
4. Partitioning a large procedure may create scope for further parallelization.
Input to the functional partitioning process is a single behavioural process specification X, consisting of a set of user-defined procedures F = {f_1, f_2, …, f_n}. Amongst them, one is the main procedure, which calls some other procedures; however, this procedure cannot be called by others. All procedures excepting main may call other procedures. It may be noted that a variable access can also be treated as a procedure – procedure read (to get the value of the variable) and procedure write (to assign a value to the variable). The functional partitioning of F creates a set of parts P = {p_1, p_2, …, p_m}, such that each procedure is put into exactly one of these parts. That is, p_1 ∪ p_2 ∪ … ∪ p_m = F and ∀i, j ∈ {1, 2, …, m}, i ≠ j: p_i ∩ p_j = ∅. The hardware-software partitioning techniques can work on these parts of P and put each of them either in hardware or in software. In that way, each part is viewed as a task in the task-graph to be partitioned by the codesign process. The situation has been explained in Fig 6.8.
Fig 6.8: Role of functional partitioning (functional partitioning produces the parts P = {p_1, p_2, …, p_m}, which codesign then maps to hardware or software)
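The condition that P is a valid partition of F – the parts cover F and are pairwise disjoint – can be verified mechanically. A small sketch, with made-up procedure names:

```python
def is_valid_partition(F, parts):
    """Check that parts p_1..p_m cover F exactly once: their union is F
    and they are pairwise disjoint (equivalently, part sizes sum to |union|)."""
    union = set().union(*parts) if parts else set()
    disjoint = sum(len(p) for p in parts) == len(union)
    return union == set(F) and disjoint

F = {'f_1', 'f_2', 'f_3', 'f_4'}
parts = [{'f_1', 'f_3'}, {'f_2'}, {'f_4'}]
print(is_valid_partition(F, parts))  # True
```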
By analysing the behavioural process specification X, its call graph can be constructed, enumerating the procedures defined in the specification and their interrelations in terms of caller-callee relationships. Any standard program analysis tool can be utilized to obtain the call graph for X. For the sake of discussion in this section, let us assume that the specification contains nine procedures apart from the main procedure and that the calling pattern of the procedures is as shown in the call graph in Fig 6.9.
Fig 6.9: Call graph of the example specification, with Main and procedures f_1 to f_9
(iii) Procedure exlining. This is the reverse of inlining. A related group of statements within a procedure is taken out to constitute a new procedure. The original procedure is made to contain a call to this new procedure, replacing the code fragment. The granularity becomes finer. The following are the types of exlining followed in functional partitioning.
• Redundancy exlining. A redundancy corresponds to the occurrence of two or more near
identical code segments within a single procedure. The program is encoded as a string
of characters. Approximate string matching algorithms are used to identify the matching
code segments.
• Distinct-computation exlining. The technique attempts to determine tightly-coupled
portions within a procedure. Each statement is considered as an atomic unit. The
statements affect the values of variables. A dependency analysis is performed between
the values computed by the statements. This reveals relationship between the
statements. It has the potential to identify a group of highly related statements.
Statements within a group are highly related, whereas those between the groups are
rather uncorrelated.
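As a toy illustration of redundancy detection via approximate string matching, Python's standard difflib can score the similarity of two code segments encoded as character strings; the segments below are invented for the example:

```python
from difflib import SequenceMatcher

# Two near-identical code segments, encoded as strings; a similarity
# ratio close to 1.0 marks them as candidates for exlining into one
# shared procedure.
seg1 = "x = a + b; y = x * 2; send(y);"
seg2 = "x = a + c; y = x * 2; send(y);"
ratio = SequenceMatcher(None, seg1, seg2).ratio()
print(ratio > 0.9)  # True: the segments differ in a single character
```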
To illustrate the granularity transformations, let us consider the situation that almost every system contains an initialization routine which is called on resetting the system. For the example call graph shown in Fig 6.9, the routine f_1 may be an initialization routine. As it is called only once, it may be inlined with the procedure Main. Procedure f_3 may first compute some data and then packetize and send it through a transmission channel to a distant place – a typical situation of an embedded application employed in remote processing. Now, f_3 may be exlined into two parts – f_3 and f_3a, where f_3 is responsible for doing the computation and f_3a takes care of the communication task. The procedure f_3 calls f_5, f_8 and f_9, while f_3a calls f_6 and f_7. The situation has been shown in Fig 6.10.
Fig 6.10: Call graph after inlining f_1 into Main and exlining f_3 into f_3 and f_3a
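The two transformations can be mimicked on a dictionary-based call graph. The edge set below is an assumption reconstructed from the discussion around Figs 6.9 and 6.10, not a definitive reading of the figures:

```python
# Call graph as caller -> list of callees (edges assumed for illustration)
cg = {'Main': ['f_1', 'f_2', 'f_3', 'f_4'],
      'f_3': ['f_5', 'f_6', 'f_7', 'f_8', 'f_9']}

# Inlining: f_1 is absorbed into Main and disappears as a separate node
cg['Main'].remove('f_1')

# Exlining: the communication part of f_3 becomes a new procedure f_3a
cg['Main'].append('f_3a')
cg['f_3'] = ['f_5', 'f_8', 'f_9']   # computation part keeps these callees
cg['f_3a'] = ['f_6', 'f_7']         # communication part gets the rest
```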
2. Pre-Clustering. This phase attempts to group the procedures so that the job of the next phase becomes simpler. It identifies procedures such that putting them on different architecture components would definitely degrade the solution quality. At the same time, these procedures cannot be inlined into a single one, perhaps due to the size of the resulting procedure. It may be noted that pre-clustering does not do any partitioning of the tasks; it just groups the procedures into clusters. All procedures belonging to a cluster should be put into the same architecture component by the actual partitioning tool at the next phase. Grouping of procedures may be guided by a closeness measure. The closeness is defined as a weighted sum of metrics, such as the amount of parameter passing, the number of calls, etc. For example, in the call graph of Fig 6.10, procedures f_5 and f_8 may be very close and thus put into the same cluster. This has been shown in Fig 6.11.
Fig 6.11: Call graph after pre-clustering, with f_5 and f_8 placed in the same cluster
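A weighted-sum closeness measure of the kind described above can be sketched as follows; the choice of metrics, the weights and all numbers are illustrative:

```python
def closeness(i, j, params_passed, call_count, w_params=1.0, w_calls=2.0):
    """Weighted sum of closeness metrics between procedures i and j:
    parameter-passing traffic and number of calls between the pair."""
    return (w_params * params_passed.get((i, j), 0)
            + w_calls * call_count.get((i, j), 0))

params_passed = {('f_5', 'f_8'): 6, ('f_5', 'f_9'): 1}  # words exchanged
call_count = {('f_5', 'f_8'): 10, ('f_5', 'f_9'): 2}    # calls between pair
# f_5 and f_8 score much higher than f_5 and f_9, so the pre-clustering
# step would place f_5 and f_8 in the same cluster
print(closeness('f_5', 'f_8', params_passed, call_count))  # 26.0
print(closeness('f_5', 'f_9', params_passed, call_count))  # 5.0
```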
3. N-Way Assignment. This is the stage performing the actual hardware-software partitioning.
Any of the partitioning techniques discussed in Section 6.3 can be employed to perform this
stage. For example, assuming that the target architecture contains only two components – one
UNIT SUMMARY
An embedded system design may be highly constrained by its performance requirements, size and cost. From the performance perspective, the system should be implemented in hardware, while from the cost point of view a software realization may be preferable. All functions in an embedded system are not equally critical. Thus, partitioning the tasks to target hardware or software realizations becomes an optimization challenge. This leads to the field of Hardware-Software Codesign. Ensuring the functional correctness of such a system, distributed over the hardware and the software platforms, necessitates a new class of combined simulation, called cosimulation. Several software development techniques have been reported for the design of cosimulators. For the codesign problem, several optimization approaches are available. A fully mathematical approach based on integer programming can produce optimal partitioning results; however, it is computationally infeasible for moderate to large task graphs. An iterative improvement based approach, known as Kernighan-Lin bi-partitioning, has been used for the optimization. Metaheuristic search techniques like Genetic Algorithm and Particle Swarm Optimization have been used to achieve good partitioning solutions within reasonable CPU time. For reconfigurable platforms like FPGAs, power-aware partitioning has been carried out. To take care of large specifications, functional partitioning of the procedures has been developed to aid the hardware-software partitioning process.
EXERCISES
Multiple Choice Questions
MQ1. Which of the following is not a responsibility of cosimulation?
(A) System specification (B) Cost estimation
(C) Verification (D) None of the other options
MQ2. Number of simulators used in homogeneous cosimulation is
(A) 0 (B) 1 (C) 2 (D) Any number
MQ3. Number of simulators in heterogeneous cosimulation is
(A) 0 (B) 1 (C) 2 (D) 2 or more
MQ4. After the target architecture has been fixed, the cosimulation is at
(A) Abstract level (B) Detailed level
(C) Medium level (D) None of the other options
MQ5. Solution produced via integer programming of a problem is
(A) Exact (B) Constant
(C) Sub-optimal (D) None of the other options
MQ6. Partitions in basic Kernighan-Lin technique are
(A) Balanced (B) Empty
(C) Unbalanced (D) Full
MQ7. In Kernighan-Lin algorithm, when all vertices are locked, it marks the end of
(A) Algorithm (B) Iteration
(C) Selection (D) None of the other options
MQ8. Partitions in Extended Kernighan-Lin technique are
(A) Balanced (B) Empty
(C) Unbalanced (D) Full
MQ9. Change equations help in reducing
(A) Computation (B) System cost
(C) Area (D) None of the other options
KNOW MORE
SystemC is a language for system-level design that can be used for specification, modelling and verification of systems spanning hardware and software. It is based upon standard C++, extended using class libraries. The language is suitable for modelling system partitioning and for evaluating and verifying the assignment of blocks to either hardware or software implementations. Leading EDA and IP vendors use this language for architectural exploration and as a platform for hardware-software codesign. It supports several abstraction levels – the functional level (in which algorithms are defined and the system is partitioned into communicating tasks), the architecture level (in which tasks are assigned to execution units, the communication mechanism is defined, and implementation metrics, such as performance, are modelled), and the implementation level (in which the precise method of implementation of the execution units is fixed).
Course outcomes (COs) for this course can be mapped with the programme outcomes (POs) after the completion of the course, and a correlation can be made for the attainment of POs to analyse the gap. After proper analysis of the gap in the attainment of POs, necessary measures can be taken to overcome the gaps.
PO-1 PO-2 PO-3 PO-4 PO-5 PO-6 PO-7 PO-8 PO-9 PO-10 PO-11 PO-12
CO-1
CO-2
CO-3
CO-4
The data filled in the above table can be used for gap analysis.
Experiments for the Embedded Systems laboratory can be carried out using embedded system trainer boards developed around the 8051 and/or ARM processors. The trainer board is to be interfaced with a host PC on which the software part can be developed, simulated and compiled for the target system. Once satisfied with the simulation, the machine code can be downloaded onto the trainer board via the USB/JTAG port. The trainer boards, in general, contain basic input-output facilities in terms of DIP switches, onboard LEDs, an ADC (with a variable resistance to feed an analog voltage) and a DAC. Additional circuitry can be developed on breadboards and connected to the system to prototype the complete system. The trainer boards often support interfaces like SPI, I2C, CAN, USB, Bluetooth, etc. The boards also contain a rudimentary monitor routine to support a simple RTOS (such as FreeRTOS). The following is a tentative list of experiments that can be carried out on such trainer boards.
Index
Floorplanning, 61 GT/s, 99
Gen 1, 81 Heterogeneous, 4
SD card, 76 SMULL, 41
SDA, 77 SoC, 27, 98
SDR, 63 Soft real-time, 6, 113
SDRAM, 64
Secondary memory, 62 Software handshaking, 79
Sensors, 73 Software interrupt, 41
Serial clock, 74, 77 Software pipelining, 53
Serial data, 77 Solid state drive, 64
Serial peripheral interface, 25, 74 Space, 79
Settling time, 93 Special function register, 150, 154
SFR, 150 Specification, 60
Sfr, 154 SPI, 25, 74
Shadow register, 48 Sporadic , 114
Shared node scheduling, 189 Sporadic server, 124
Signal-to-noise ratio, 51 Sporadic task, 115
Signed char, 152 Spread spectrum frequency hopping, 90
Signed int, 153 SPSR, 35
SIMD, 48 SRAM, 55, 62
SIMD architecture, 50 SSD, 62, 63
Simple priority inversion, 130 Stack operation, 39
Simulation, 177 Stack pointer, 33
Single data rate, 63 Standard , 86
Single functioned, 5 Standard branch, 43
Single instruction multiple data, 48, 50 START condition, 77
Single object move, 196 Static, 62
Single register transfer, 37 Static RAM, 55, 62
Single-step, 26 Static-priority-based, 119
Slave, 74 Steer through wire, 16
Slave select, 75 STM, 38, 40
Smart card, 11 STOP condition, 77
SMLAL, 41 Store multiple, 38