Es - Mod 3
ARM stands for Advanced RISC Machines (originally Acorn RISC Machine). The architecture
originated at Acorn Computers Ltd of Cambridge, England, and the ARM7 was introduced in the
early 1990s. ARM7 supersedes the ARM6 processor and uses a 3-stage pipeline architecture,
while ARM 9 uses a 5-stage pipeline architecture. ARM is an industry-standard architecture
that has been licensed to many semiconductor manufacturers all over the world. ARM is the
market leader in low-power and cost-sensitive embedded applications, offering high
performance for very low power consumption and price. The architecture is based on Reduced
Instruction Set Computer (RISC) principles. ARM also develops software tools, boards, debug
tools, application software, peripherals etc. Several extensions are available for different
applications, such as the Thumb instruction set and the Java machine.
3.1.1 ARM NOMENCLATURE
• ARM7TDMI: the suffix TDMI stands for
• • T: Thumb instruction set
• • D: Debug interface
• • M: Multiplier (hardware)
• • I: In-circuit emulator
The ARM7TDMI is the most used ARM version.
MODULE III ECT342 EMBEDDED SYSTEMS
2. CPU Core
Consists of the ARM processor core and some tightly coupled function blocks, such as cache
and memory management blocks. E.g.: ARM710T, ARM720T, ARM740T, ARM920T,
ARM922T, ARM940T, ARM946E-S, and ARM966E-S
3. ARM Microcontroller
Consists of an ARM CPU core and additional I/O peripherals. E.g.: LPC2378
3.1.2 ARCHITECTURAL INHERITANCE
ARM FAMILY
3.3.1.1 ARM7TDMI FEATURES
Memory Access
Fetch
The instruction is fetched from memory and placed in the instruction pipeline.
Decode
The instruction is decoded and the data path control signals are prepared for the next cycle.
Execute
The instruction owns the data path: the register bank is read, an operand is shifted, and the
ALU result is generated and written back into the destination register.
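The overlap of the three stages can be sketched as a small timeline model. This is only an illustration: it assumes an ideal pipeline with no stalls or branches, and the instruction names and the function `pipeline_timeline` are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute"]

def pipeline_timeline(instructions):
    """Return, for each clock cycle, which instruction occupies each stage."""
    n_cycles = len(instructions) + len(STAGES) - 1
    timeline = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - depth  # instruction idx enters Fetch in cycle idx
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        timeline.append(row)
    return timeline

program = ["ADD", "SUB", "MOV", "CMP"]  # hypothetical instruction stream
for cycle, row in enumerate(pipeline_timeline(program), start=1):
    print(cycle, row)
```

Once the pipeline is full, one instruction completes in every cycle, so N instructions need N + 2 cycles in this idealized model.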
• GPR-General purpose registers
Each holds either data or an address. They are identified with the letter "r" prefixed to the
register number.
• SPR-Special purpose registers
r13-stack pointer (SP), which stores the head of the stack in the current processor mode.
r14-link register (LR), where the core puts the return address whenever it calls a subroutine.
Out of the 37 registers, 20 are hidden from a program at different times. They are
available only when the processor is in a particular mode. Banked registers of a particular
mode are denoted by an underscore and the mode abbreviation appended to the register name,
e.g. abort mode has banked registers r13_abt, r14_abt and spsr_abt.
ARM saves the previous CPSR into the SPSR of the new mode during a processor mode change.
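The 37/20 register count can be checked with a short model of the banking scheme. The sketch assumes the usual mode abbreviations (fiq, irq, svc, abt, und); only the abort-mode names appear in the text above, and `physical_name` is a hypothetical helper.

```python
# Registers visible in user mode: r0-r15 plus the CPSR (17 in total).
USER_VISIBLE = [f"r{i}" for i in range(16)] + ["cpsr"]

# Registers that each privileged mode replaces with its own banked copy.
BANKED = {
    "fiq": [f"r{i}" for i in range(8, 15)] + ["spsr"],  # r8-r14 and SPSR
    "irq": ["r13", "r14", "spsr"],
    "svc": ["r13", "r14", "spsr"],
    "abt": ["r13", "r14", "spsr"],
    "und": ["r13", "r14", "spsr"],
}

def physical_name(reg, mode):
    """Map a register name used by a program to the physical register for a mode."""
    if mode in BANKED and reg in BANKED[mode]:
        return f"{reg}_{mode}"
    return reg

banked_names = [f"{reg}_{mode}" for mode, regs in BANKED.items() for reg in regs]
all_registers = USER_VISIBLE + banked_names
print(len(banked_names), len(all_registers))
```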
– I = 1: IRQ disabled
• T bit :
OR
Link Register – r 14
SPSR
3.3.2.1 ARM 9 FEATURES
5 STAGE PIPELINE
Fetch
The instruction is fetched from memory and placed in the instruction pipeline
Decode
The instruction is decoded and register operands read from the register file
Execute
Operand is shifted and ALU result generated. If instruction is a load or store, the memory
address is computed in ALU.
Buffer/data
Data memory is accessed if required. Otherwise the ALU result is simply buffered for one
clock cycle to give the same pipeline flow for all instructions
Write Back
The results generated by the instruction are written back to register file, including any data
loaded from memory
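The role of the Buffer/data stage can be illustrated by walking one instruction through the five stages. This is a simplified functional sketch, not a timing model: the tuple instruction format, the register dictionary and the function `run_five_stage` are assumptions made for the example.

```python
def run_five_stage(instr, regs, memory):
    """Walk one instruction (op, rd, rn, offset) through the five stages."""
    op, rd, rn, offset = instr
    log = ["Fetch"]                       # instruction fetched from memory
    base = regs[rn]                       # Decode: register operand is read
    log.append("Decode")
    alu_result = base + offset            # Execute: ALU result or load address
    log.append("Execute")
    if op == "LDR":
        value = memory[alu_result]        # Buffer/data: data memory is accessed
        log.append("Buffer/data: memory read")
    else:
        value = alu_result                # Buffer/data: result held for one cycle
        log.append("Buffer/data: ALU result buffered")
    regs[rd] = value                      # Write back to the register file
    log.append("Write back")
    return log

regs = {"r0": 0, "r1": 100, "r2": 0}
memory = {108: 42}
print(run_five_stage(("LDR", "r0", "r1", 8), regs, memory), regs["r0"])
print(run_five_stage(("ADD", "r2", "r1", 4), regs, memory), regs["r2"])
```

Both instructions pass through the same five steps, which is the uniform pipeline flow described above.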
• Little Endian
In little endian ordering, bytes of increasing significance are stored at increasing addresses
in memory.
• Big Endian
In big endian ordering, bytes of decreasing significance are stored at increasing addresses
in memory.
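As a quick illustration of the two orderings, the sketch below lays out the 32-bit word 0x12345678 byte by byte. The word value and the base address 0x1000 are arbitrary choices for the example.

```python
word = 0x12345678

# Least significant byte first (little endian) versus most significant byte first.
little = word.to_bytes(4, byteorder="little")
big = word.to_bytes(4, byteorder="big")

def dump(label, data, base=0x1000):
    """Print each byte next to the (hypothetical) address it would occupy."""
    for offset, byte in enumerate(data):
        print(f"{label:6s} addr 0x{base + offset:04X}: 0x{byte:02X}")

dump("little", little)
dump("big", big)
```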
C compiler
The C compiler is supported by the appropriate library of standard functions. It uses the
ARM Procedure Call Standard for all externally available functions. The compiler can also
produce Thumb code.
Assembler
The ARM assembler is a full macro assembler which produces ARM object format output that
can be linked with output from the C compiler. An assembler is a program that takes basic
computer instructions and converts them into a pattern of bits that the computer's processor
can use to perform its basic operations. Assembly source language is near machine level,
with most assembly instructions translating into single ARM (or Thumb) instructions.
Linker
Takes one or more object files and combines them into an executable program. It resolves
symbolic references between the object files and extracts object modules from libraries as
needed by the program. It can assemble the various components of the program in a number
of different ways, depending on whether the code is to run in RAM or ROM. Normally the
linker includes debug tables in the output file. If the object files were compiled with full
debug information, this will include full symbolic debug tables. The linker can also produce
object library modules that are not executable but are ready for efficient linking with
object files in the future.
ARMulator
The ARMulator is a suite of programs that models the behaviour of various ARM processor
cores in software on a host system. At its simplest, the ARMulator allows an ARM program
developed using the C compiler or assembler to be tested and debugged on a host machine
with no ARM processor connected. It allows the number of clock cycles the program takes to
execute to be measured exactly, so the performance of the target system can be evaluated.
At its most complex, the ARMulator can be used as the centre of a complete, timing-accurate,
C model of the target system, with full details of the cache and memory management functions
added, running an operating system. In between these two extremes the ARMulator comes with
a set of model prototyping modules including a rapid prototype memory model and coprocessor
interfacing support.
2. ARITHMETIC INSTRUCTIONS
3. LOGICAL INSTRUCTIONS
4. COMPARISON INSTRUCTIONS
5. MULTIPLY INSTRUCTIONS
6. BRANCH INSTRUCTIONS
• LDRB R1,VALUE
• STRB R1,RESULT
2. WAP TO ADD TWO NUMBERS
• LDR R1,VALUE1 ; R1 = first number
• LDR R2,VALUE2 ; R2 = second number
• ADD R1,R1,R2 ; R1 = R1 + R2
• STR R1,RESULT ; store the sum
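3. WAP TO FIND LARGER OF TWO NUMBERS
• LDR R1,VALUE1 ; R1 = first number
• LDR R2,VALUE2 ; R2 = second number
• CMP R1,R2 ; set flags on R1 - R2
• BHI DONE ; R1 is higher (unsigned), keep it
• MOV R1,R2 ; otherwise R2 is the larger value
• DONE: STR R1,RESULT ; store the larger number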
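The flag behaviour behind CMP and BHI can be modeled in a few lines. This is a sketch of the instruction semantics only (BHI is taken when C is set and Z is clear, i.e. an unsigned "higher" comparison); the function name `larger_unsigned` is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def larger_unsigned(value1, value2):
    """Mirror the listing above using 32-bit unsigned register values."""
    r1 = value1 & MASK32                # LDR R1,VALUE1
    r2 = value2 & MASK32                # LDR R2,VALUE2
    diff = (r1 - r2) & MASK32           # CMP R1,R2 computes R1 - R2, flags only
    c_flag = r1 >= r2                   # C is set when the subtraction has no borrow
    z_flag = diff == 0                  # Z is set when the operands are equal
    if not (c_flag and not z_flag):     # BHI DONE is not taken
        r1 = r2                         # MOV R1,R2
    return r1                           # DONE: STR R1,RESULT

print(larger_unsigned(5, 9), larger_unsigned(9, 5))
```

Because BHI is an unsigned condition, a bit pattern such as 0xFFFFFFFF is treated as the largest value, not as -1.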
WAP TO FIND ONE'S COMPLEMENT OF A NUMBER
• LDR R1,VALUE ; R1 = number
• MVN R1,R1 ; R1 = NOT R1
• STR R1,RESULT ; store the one's complement
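The effect of MVN on a 32-bit register can be reproduced as follows; this is a minimal sketch in which the mask stands in for the fixed register width and `mvn` is a hypothetical helper name.

```python
MASK32 = 0xFFFFFFFF

def mvn(value):
    """Bitwise NOT restricted to 32 bits, as MVN R1,R1 does on a register."""
    return ~value & MASK32

print(hex(mvn(0x0000FFFF)))
```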
4. WAP TO ADD FOUR NUMBERS
• MOV R0,R1 ; R0 = R1
• ADD R0,R0,R2 ; R0 = R0 + R2
• ADD R0,R0,R3 ; R0 = R0 + R3
• ADD R0,R0,R4 ; R0 = R0 + R4
5. WAP TO SWAP THE CONTENTS OF REGISTERS R0 AND R1
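No listing for the swap program survives at this point in the notes. As a hedged sketch, two common approaches are modeled below with the corresponding ARM mnemonics in the comments: one uses R2 as a scratch register, the other uses three EOR instructions. Both the choice of R2 and the function names are assumptions, not the original solution.

```python
def swap_with_scratch(regs):
    """Swap R0 and R1 through a scratch register (R2 is an assumed free register)."""
    regs["R2"] = regs["R0"]      # MOV R2,R0
    regs["R0"] = regs["R1"]      # MOV R0,R1
    regs["R1"] = regs["R2"]      # MOV R1,R2

def swap_with_eor(regs):
    """Swap R0 and R1 without a scratch register using exclusive OR."""
    regs["R0"] ^= regs["R1"]     # EOR R0,R0,R1
    regs["R1"] ^= regs["R0"]     # EOR R1,R1,R0
    regs["R0"] ^= regs["R1"]     # EOR R0,R0,R1

regs = {"R0": 0x11, "R1": 0x22, "R2": 0}
swap_with_eor(regs)
print(hex(regs["R0"]), hex(regs["R1"]))
```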
WAP TO EVALUATE 4R1^2 + 3R1
• MUL R0,R1,R1 ; R0 = R1^2
• MOV R2,#04
• MUL R0,R2,R0 ; R0 = 4R1^2
• MOV R2,#03
• MUL R2,R1,R2 ; R2 = 3R1
• ADD R0,R0,R2 ; R0 = 4R1^2 + 3R1
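The multiply listing above evaluates 4R1^2 + 3R1. A line-by-line model in Python can be used to check the register values; it assumes 32-bit results that wrap on overflow, and the function name `poly` is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def poly(r1):
    """Follow the listing instruction by instruction and return the final R0."""
    r0 = (r1 * r1) & MASK32      # MUL R0,R1,R1   R0 = R1^2
    r2 = 4                       # MOV R2,#04
    r0 = (r2 * r0) & MASK32      # MUL R0,R2,R0   R0 = 4R1^2
    r2 = 3                       # MOV R2,#03
    r2 = (r1 * r2) & MASK32      # MUL R2,R1,R2   R2 = 3R1
    r0 = (r0 + r2) & MASK32      # ADD R0,R0,R2   R0 = 4R1^2 + 3R1
    return r0

print(poly(5))
```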
• The ARM architecture supports a general mechanism for extending the instruction set
through the addition of coprocessors
• The most common use of a coprocessor is the system coprocessor used to control
on-chip functions such as the cache and memory management unit on the
ARM720
• A floating-point ARM coprocessor has also been developed, and application-specific
coprocessors are a possibility
Most important features of the coprocessor are
• cpi (from ARM to all coprocessors)
This signal, which stands for 'Coprocessor Instruction', indicates that the ARM has
identified a coprocessor instruction and wishes to execute it
• cpa (from the coprocessors to ARM)
This is the 'Coprocessor Absent' signal which tells the ARM that there is no coprocessor
present that is able to execute the current instruction
• cpb (from the coprocessors to ARM)
This is the 'Coprocessor Busy' signal which tells the ARM that the coprocessor cannot
begin executing the instruction yet.
The timing is such that both the ARM and the coprocessor must generate their
respective signals autonomously.
The coprocessor cannot wait until it sees cpi before generating cpa and cpb.
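The outcome of the handshake can be summarized as a small decision function. This is a logical sketch only: the signals are treated as plain booleans (true meaning asserted), and the undefined instruction trap taken when no coprocessor accepts the instruction is standard ARM behaviour that is not spelled out in the text above.

```python
def coprocessor_handshake(cpi, cpa, cpb):
    """Return what the ARM does for one combination of handshake signals."""
    if not cpi:
        return "no coprocessor instruction"   # ARM is not requesting anything
    if cpa:
        return "undefined instruction trap"   # no coprocessor can execute it
    if cpb:
        return "ARM waits"                    # coprocessor present but busy
    return "execute"                          # coprocessor present and ready

for signals in [(True, True, True), (True, False, True), (True, False, False)]:
    print(signals, "->", coprocessor_handshake(*signals))
```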
Coprocessor registers
• ARM coprocessors have their own private register sets and their state is controlled
by instructions that mirror the instructions that control ARM registers.
• The ARM has sole responsibility for control flow, so the coprocessor instructions are
concerned with data processing and data transfer
Coprocessor data operations
• Coprocessor data operations are completely internal to the coprocessor and cause a
state change in the coprocessor registers
• An example would be floating-point addition, where two registers in the floating-
point coprocessor are added together and the result placed into a third register
• Coprocessor data transfer instructions load or store the values in coprocessor registers
from or to memory. Since coprocessors may support their own data types, the number
of words transferred for each register is coprocessor dependent.
• The ARM generates the memory address, but the coprocessor controls the number of
words transferred.
• A coprocessor may perform some type conversion as part of the transfer (for instance
the floating-point coprocessor converts all loaded values into its 80-bit internal
representation).
• Each coprocessor can have up to 16 private registers of any reasonable size; they are not
limited to 32 bits.
The ARM system control coprocessor
• The system control coprocessor registers are all 32 bits long, and access is restricted
to MRC and MCR instructions, which must be executed in supervisor mode.
• ARM CPUs which are used in embedded systems with fixed or controlled application
programs do not require a full memory management unit with address translation
capabilities.
3.3.4.1 DATAPATH SECTION
Clocking scheme
Data movement is controlled by passing the data alternately through latches which are
open during phase 1 and latches which are open during phase 2
The non-overlapping property of the phase 1 and phase 2 clocks ensures that there are
no race conditions in the circuit
Datapath timing
• The register read buses are dynamic and are precharged during phase 2
• When phase 1 goes high, the selected registers discharge the read buses which
become valid early in phase 1
• One operand is passed through the barrel shifter, which also uses dynamic techniques,
and the shifter output becomes valid a little later in phase 1.
• The ALU has input latches which are open during phase 1, allowing the operands to
begin combining in the ALU as soon as they are valid, but they close at the end of
phase 1 so that the phase 2 precharge does not get through to the ALU.
• The ALU then continues to process the operands through phase 2, producing a valid
output towards the end of the phase which is latched in the destination register at the
end of phase 2.
DATAPATH CYCLE TIME
• The minimum datapath cycle time is therefore the sum of:
• • the register read time
• • the shifter delay
• • the ALU delay
• • the register write set-up time
• • the phase 2 to phase 1 non-overlap time
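The sum listed above can be written compactly as a formula. The symbol names are chosen here for illustration and do not come from the source.

```latex
t_{\text{cycle,min}} = t_{\text{reg read}} + t_{\text{shift}} + t_{\text{ALU}}
                     + t_{\text{reg set-up}} + t_{\text{non-overlap}}
```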
(i) ALU delay
• Of these, the ALU delay dominates
The ALU delay is highly variable, depending on the operation it is performing
Logical operations are relatively fast, since they involve no carry propagation
Arithmetic operations (addition, subtraction and comparisons) involve longer logic
paths as the carry can propagate across the word width
TYPES OF ADDER USED
• It is possible to create a logical circuit using multiple full adders to add N-bit
numbers.
• The carry-select adder generally consists of two ripple carry adders and a multiplexer
• Adding two n-bit numbers with a carry-select adder is done with two adders
(therefore two ripple carry adders) in order to perform the calculation twice, one time
with the assumption of the carry-in being zero and the other assuming it will be one
• After the two results are calculated, the correct sum, as well as the correct carry-out,
is then selected with the multiplexer once the correct carry-in is known.
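The two schemes can be compared with a bit-level sketch: a ripple-carry adder built from full adders, and a carry-select adder that runs two ripple-carry adders and picks a result once the carry-in is known. Bit lists are least significant bit first; the helper names are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin):
    """Chain full adders so that the carry ripples from bit 0 upwards."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

def carry_select_add(a_bits, b_bits, cin):
    """Compute both possible results, then select with the real carry-in."""
    sum0, cout0 = ripple_carry_add(a_bits, b_bits, 0)   # assumes carry-in = 0
    sum1, cout1 = ripple_carry_add(a_bits, b_bits, 1)   # assumes carry-in = 1
    return (sum1, cout1) if cin else (sum0, cout0)       # the multiplexer

def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

total, carry = carry_select_add(to_bits(11, 4), to_bits(6, 4), 1)
print(from_bits(total), carry)
```

In hardware the two ripple-carry adders work in parallel, so the selected block does not have to wait for its carry-in before starting.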
The ARM2 4-bit carry look-ahead scheme
• The carry-lookahead adder calculates one or more carry bits before the sum, which
reduces the wait time to calculate the result of the larger value bits.
• In order to allow a higher clock rate, ARM2 used a 4-bit carry look-ahead scheme to
reduce the worst-case carry path length.
• The logic produces carry generate (G) and propagate (P) signals which control the 4-
bit carry-out.
• The carry propagate path length is reduced to eight gate delays, again using merged
AND-OR-INVERT gates and alternating AND/OR logic
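A 4-bit look-ahead block of this kind can be sketched as follows. The code shows the generate/propagate idea at the logic-equation level only; it does not model the merged AND-OR-INVERT gates, and the names `cla4` and `add32` are hypothetical.

```python
def cla4(a, b, cin):
    """4-bit carry look-ahead block. Returns (sum, carry_out, G, P)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # bit propagate
    # Carry into each bit, written directly in terms of g, p and cin.
    c = [
        cin,
        g[0] | (p[0] & cin),
        g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin),
        g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin),
    ]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[0] & p[1] & p[2] & p[3]
    cout = G | (P & cin)                                  # 4-bit carry-out
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, cout, G, P

def add32(a, b, cin=0):
    """32-bit add built from eight 4-bit look-ahead blocks."""
    result, carry = 0, cin
    for block in range(8):
        shift = 4 * block
        s, carry, _, _ = cla4((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= s << shift
    return result, carry

print(add32(0xFFFFFFFF, 1))
```

The carry only has to cross one gate level per 4-bit block (cout = G OR (P AND cin)) instead of rippling through every bit.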
• The adder logic was further improved on the ARM9TDMI, where a 'carry arbitration'
adder is used
• This adder computes all intermediate carry values using a 'parallel-prefix' tree, which
is a very fast parallel logic structure.
• The carry arbitration scheme recodes the conventional propagate-generate information
in terms of two new variables, u and v
• The input operands are each selectively inverted, then added and combined in the
logic unit, and finally the required result is selected and issued on the ALU result
bus.
• The C and V flags are generated in the adder (they have no meaning for logical
operations), the N flag is copied from bit 31 of the result and the Z flag is evaluated
from the whole result bus
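The flag generation described above can be modeled for a 32-bit add, with subtraction obtained by inverting one operand and setting the carry-in. This is a behavioural sketch; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def add_with_flags(a, b, carry_in=0):
    """Return (result, N, Z, C, V) for a 32-bit addition."""
    a &= MASK32
    b &= MASK32
    full = a + b + carry_in
    result = full & MASK32
    n = (result >> 31) & 1                          # N: copy of bit 31
    z = int(result == 0)                            # Z: whole result is zero
    c = int(full > MASK32)                          # C: carry out of bit 31
    v = (((a ^ result) & (b ^ result)) >> 31) & 1   # V: signed overflow
    return result, n, z, c, v

def sub_with_flags(a, b):
    """a - b as a + NOT(b) + 1, using the selective inversion of an operand."""
    return add_with_flags(a, ~b & MASK32, 1)

print(add_with_flags(0x7FFFFFFF, 1))
print(sub_with_flags(5, 5))
```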
(ii) The barrel shifter
• In order to minimize the delay through the shifter, a cross-bar switch matrix is used
to steer each input to the appropriate output
• Each input is connected to each output through a switch.
• If pre-charged dynamic logic is used, as it is on the ARM datapaths, each switch can
be implemented as a single NMOS transistor.
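The cross-bar idea can be sketched as a matrix of switches in which a shift closes one diagonal. The model below uses an 8-bit width to keep the matrix small (the real datapath is 32 bits wide) and ignores the precharge and inversion details of the dynamic logic; the function names are hypothetical.

```python
WIDTH = 8  # illustrative width only

def crossbar(shift, left=True, width=WIDTH):
    """matrix[out][inp] is True when the switch from input inp to output out is closed."""
    matrix = [[False] * width for _ in range(width)]
    for inp in range(width):
        out = inp + shift if left else inp - shift
        if 0 <= out < width:
            matrix[out][inp] = True   # one diagonal of switches is enabled
    return matrix

def apply_crossbar(matrix, value):
    """Steer each input bit to the output selected by the closed switches."""
    result = 0
    for out, row in enumerate(matrix):
        for inp, closed in enumerate(row):
            if closed and (value >> inp) & 1:
                result |= 1 << out
    return result

print(bin(apply_crossbar(crossbar(3, left=True), 0b00010110)))
```

Each input reaches its output through exactly one switch, which is why the delay through the shifter is small and independent of the shift amount.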
(iii) Multiplier design
• Two styles of multiplier have been used:
Older ARM cores include low-cost multiplication hardware that supports only the 32-
bit result multiply and multiply-accumulate instructions.
Recent ARM cores have high-performance multiplication hardware and support the
64-bit result multiply and multiply-accumulate instructions
As the multiplier is shifted right eight bits per cycle in the 'Rs' register, the partial sum
and carry are rotated right eight bits per cycle. The array is cycled up to four times, using
early termination to complete the instruction in fewer cycles where the multiplier has
sufficient zeros in the top bits, and the partial sum and carry are combined 32 bits at a
time and written back into the register bank. The high-speed multiplier requires
considerably more dedicated hardware than the low-cost solution employed on other ARM
cores. There are 160 bits of shift register and 128 bits of carry-save adder logic. The
incremental area cost is around 10% of the simpler processor cores, though a rather smaller
proportion of the higher-performance cores such as ARM8 and StrongARM. Its benefits are
that it speeds up multiplication by a factor of around 3 and it supports the added
functionality of the 64-bit result forms of the multiply instruction.
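The eight-bits-per-cycle scheme with early termination can be illustrated behaviourally. The sketch handles unsigned operands only and uses ordinary integer arithmetic in place of the carry-save adder array, so it shows the cycle count and the result, not the hardware structure; the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def multiply_8_per_cycle(rm, rs):
    """Return (64-bit product, cycles used), consuming 8 multiplier bits per cycle."""
    rm &= MASK32
    remaining = rs & MASK32              # multiplier held in 'Rs'
    product = 0
    cycles = 0
    while cycles < 4:                    # the array is cycled up to four times
        product += (rm * (remaining & 0xFF)) << (8 * cycles)
        remaining >>= 8                  # multiplier shifted right eight bits
        cycles += 1
        if remaining == 0:               # early termination: top bits are all zero
            break
    return product, cycles

print(multiply_8_per_cycle(1000, 3))
print(multiply_8_per_cycle(0xFFFFFFFF, 0xFFFFFFFF))
```

A small multiplier finishes in one cycle, while a multiplier with bits set in its top byte needs all four.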
(iv) The register bank
The last major block on the ARM datapath is the register bank. This is where all the
user-visible state is stored in 31 general-purpose 32-bit registers, amounting to around 1
Kbits of data altogether. Since the basic 1-bit register cell is repeated so many times in
the design, it is worth putting considerable effort into minimizing its size.
ARM6 register cell circuit
The ARM datapath is laid out to a constant pitch per bit. The pitch will be a compromise
between the optimum for the complex functions (such as the ALU), which are best suited to a
wide pitch, and the simple functions (such as the barrel shifter), which are most efficient
when laid out on a narrow pitch. Each function is then laid out to this pitch, remembering
that there may also be buses passing over a function (for example the B bus passes through
the ALU but is not used by it); space must be allowed for these. It is a good idea to produce
a floor-plan for the datapath noting the 'passenger' buses through each block. The order of
the function blocks is chosen to minimize the number of additional buses passing over the
more complex functions.
2. CONTROL SECTION
ARM control logic structure
The control logic on the simpler ARM cores has three structural components:
• An instruction decoder PLA (programmable logic array). This unit uses some of the
instruction bits and an internal cycle counter to define the class of operation to be
performed on the datapath in the next cycle.
• Distributed secondary control associated with each of the major datapath
function blocks. This logic uses the class information from the main decoder PLA to
select other instruction bits and/or processor state information to control the datapath.
• Decentralized control units for specific instructions that take a variable number of
cycles to complete (load and store multiple, multiply and coprocessor operations).
Here the main decoder PLA locks into a fixed state until the remote control unit
indicates completion.