ARM Processor Architecture
Some slides are adapted from NCTU IP Core Design
Some slides are adapted from NTU Digital SIP Design Project
SOC Consortium Course Material
Outline
ARM Core Family
ARM Processor Core
Introduction to Several ARM processors
Memory Hierarchy
Software Development
Summary
Secure Cores SecurCore SC100 SecurCore SC110 SecurCore SC200 SecurCore SC210
T: Thumb
D: On-chip debug support
M: Enhanced multiplier
I: Embedded ICE hardware
T2: Thumb-2
S: Synthesizable code
E: Enhanced DSP instruction set
J: Java support, Jazelle
Z: Should be TrustZone?
F: Floating point unit
H: Handshake, clockless design for synchronous or asynchronous design
ARM processor core + cache + MMU ARM CPU cores ARM6 ARM7
3-stage pipeline
Keeps its instructions and data in the same memory system
Thumb 16-bit compressed instruction set
On-chip Debug support, enabling the processor to halt in response to a debug request
Enhanced Multiplier, 64-bit result
Embedded ICE hardware, giving on-chip breakpoint and watchpoint support
SecurCore Family
Smart card and secure IC development
ARM Cortex-M Series, deeply embedded processors optimized for cost sensitive applications.
Supports the Thumb-2 instruction set only
Version 2
Sold in volume in the Acorn Archimedes and A3000 products
26-bit addressing, including 32-bit result multiply and coprocessor
Version 2a
Coprocessor 15 as the system control coprocessor to manage cache
Add the atomic load store (SWP) instruction
Version 3M
Introduce the signed and unsigned multiply and multiply-accumulate instructions that generate the full 64-bit result
Version 4T
16-bit Thumb compressed form of the instruction set is introduced
Version 5T
Introduced recently, a superset of version 4T adding the BLX, CLZ and BKPT (breakpoint) instructions
Version 5TE
Add the signal processing instruction set extension
Register Bank
2 read ports, 1 write port, access any register
1 additional read port, 1 additional write port for r15 (PC)
incrementer
Barrel Shifter
Shift or rotate the operand by any number of bits
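The shift and rotate operations of the barrel shifter can be sketched in software as below. This is a minimal illustration, not the hardware design: the function names are hypothetical, operands are treated as 32-bit values, and shift amounts are assumed to be in the range 0 to 31 (the special cases for larger amounts are not modeled).

```python
MASK32 = 0xFFFFFFFF  # operands are modeled as 32-bit values


def lsl(value, amount):
    """Logical shift left; bits shifted past bit 31 are discarded."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right; zeros are shifted in from the top."""
    return (value & MASK32) >> amount


def asr(value, amount):
    """Arithmetic shift right; the sign bit (bit 31) is replicated."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret as a negative Python integer
    return (value >> amount) & MASK32


def ror(value, amount):
    """Rotate right; bits leaving bit 0 re-enter at bit 31."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


if __name__ == "__main__":
    print(hex(ror(0x12345678, 8)))
    print(hex(asr(0x80000000, 4)))
```

Because the shifter sits in front of one ALU input, a shifted operand can be combined with an ALU operation in the same instruction.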
barrel shifter
ALU
data in register
Fetch
The instruction is fetched from memory and placed in the instruction pipeline
Decode
The instruction is decoded and the datapath control signals prepared for the next cycle
Execute
The register bank is read, an operand shifted, the ALU result generated and written back into the destination register
Multi-Cycle Instruction
Memory access (fetch, data transfer) in every cycle
Datapath used in every cycle (execute, address calculation, data transfer)
Decode logic generates the control signals for datapath use in the next cycle (decode, address calculation)
[Figure: datapath activity during the cycles of a data transfer instruction; labels: data out, data in, instruction pipe, ALU functions A / A+B / A-B, byte?]
The first cycle computes a memory address in a similar way to a data processing instruction
A load instruction follows a similar pattern, except that the data from memory only gets as far as the data in register on the 2nd cycle, and a 3rd cycle is needed to transfer the data from there to the destination register
Branch Instructions
[Figure: datapath activity during the first two cycles of a branch instruction; labels: address register, increment, mult, shifter, ALU functions A+B and A, data out, data in, instruction pipe]
The third cycle, which is required to complete the pipeline refilling, is also used to make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch
Breaking the pipeline
Note that the core is executing in the ARM state
Separate instruction and data memories => 5-stage pipeline
Used in ARM9TDMI
[Figure: ARM9TDMI 5-stage pipeline organization: I-cache fetch, instruction decode and register read, shift/ALU execute with forwarding paths, D-cache access, register write-back]
Fetch
The instruction is fetched from memory and placed in the instruction pipeline
Decode
The instruction is decoded and register operands are read from the register file. There are 3 operand read ports in the register file, so most ARM instructions can source all their operands in one cycle
Execute
An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU
Buffer/Data
Data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle
Write back
The results generated by the instruction are written back to the register file, including any data loaded from memory
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:
Structural Hazards
They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data Hazards
They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards
They arise from the pipelining of branches and other instructions that change the PC
Structural Hazards
When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.
Example
A machine has a single memory shared by data and instructions. As a result, when an instruction contains a data-memory reference (load), it will conflict with the instruction fetch for a later instruction (Instr 3, in clock cycle 4):
Clock cycle number  1      2      3      4      5      6      7      8
load                IF     ID     EX     MEM    WB
Instr 1                    IF     ID     EX     MEM    WB
Instr 2                           IF     ID     EX     MEM    WB
Instr 3                                  IF     ID     EX     MEM    WB
Solution (1/2)
To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.
Clock cycle number  1      2      3      4      5      6      7      8      9
load                IF     ID     EX     MEM    WB
Instr 1                    IF     ID     EX     MEM    WB
Instr 2                           IF     ID     EX     MEM    WB
Instr 3                                  stall  IF     ID     EX     MEM    WB
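The stall can be reproduced with a small scheduling sketch. It assumes an idealized 5-stage pipeline (IF, ID, EX, MEM, WB) with one shared memory port, in which a fetch is delayed whenever an earlier load or store occupies the port in that cycle. The function name and the instruction labels are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def schedule(instructions):
    """Return {name: {stage: cycle}} for a pipeline with one memory port.

    instructions is a list of (name, accesses_data_memory) pairs.
    """
    timeline = {}
    mem_busy = set()  # cycles in which a load/store uses the memory port
    next_fetch = 1
    for name, accesses_data_memory in instructions:
        fetch = next_fetch
        while fetch in mem_busy:  # structural hazard: stall the fetch
            fetch += 1
        stages = {stage: fetch + i for i, stage in enumerate(STAGES)}
        if accesses_data_memory:
            mem_busy.add(stages["MEM"])
        timeline[name] = stages
        next_fetch = fetch + 1
    return timeline


if __name__ == "__main__":
    program = [("load", True), ("Instr 1", False),
               ("Instr 2", False), ("Instr 3", False)]
    for name, stages in schedule(program).items():
        print(name, stages)
```

With this program, Instr 3 is fetched in cycle 5 instead of cycle 4, one cycle later than in an ideal pipeline.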
Solution (2/2)
Another solution is to use separate instruction and data memories. ARM cores with a Harvard architecture (such as the ARM9TDMI) do this, so they do not suffer from this hazard
Data Hazards
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.
Clock cycle number  1      2      3      4      5      6      7      8      9
ADD R1,R2,R3        IF     ID     EX     MEM    WB
SUB R4,R5,R1               IF     IDsub  EX     MEM    WB
AND R6,R1,R7                      IF     IDand  EX     MEM    WB
OR R8,R1,R9                              IF     IDor   EX     MEM    WB
XOR R10,R1,R11                                  IF     IDxor  EX     MEM    WB
Forwarding
The data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.
Clock cycle number  1      2      3      4      5      6      7
ADD R1,R2,R3        IF     ID     EX     MEM    WB
SUB R4,R5,R1               IF     IDsub  EX     MEM    WB
AND R6,R1,R7                      IF     IDand  EX     MEM    WB
Forwarding Architecture
[Figure: 5-stage pipeline organization with forwarding paths from the execute, D-cache and write-back stages back to the ALU inputs]
Forward Data
Clock cycle number  1      2      3      4      5      6      7
ADD R1,R2,R3        IF     ID     EXadd  MEMadd WB
SUB R4,R5,R1               IF     ID     EXsub  MEM    WB
AND R6,R1,R7                      IF     ID     EXand  MEM    WB
The first forwarding path passes the value of R1 from EXadd to EXsub. The second passes the value of R1 from MEMadd to EXand. This code can now be executed without stalls. Forwarding can be generalized to include passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.
Without Forward
Clock cycle number  1      2      3      4      5      6      7      8      9
ADD R1,R2,R3        IF     ID     EX     MEM    WB
SUB R4,R5,R1               IF                   IDsub  EX     MEM    WB
AND R6,R1,R7                                    IF     IDand  EX     MEM    WB
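The cycle counts with and without forwarding (7 and 9 for the ADD, SUB, AND sequence) can be checked with a small model. It is a sketch under stated assumptions: only ALU instructions, a 5-stage pipeline, in-order issue, and a register file that is written in the first half of WB and read in the second half of ID, so a consumer may decode in the same cycle as the producer's write-back. The function name is hypothetical.

```python
def total_cycles(program, forwarding):
    """Cycles needed to complete a list of (dest, src1, src2) ALU instructions."""
    wb_cycle = {}     # register -> cycle in which its newest value is written back
    prev_decode = 1   # the first instruction is fetched in cycle 1, decoded in cycle 2
    last_wb = 0
    for dest, *sources in program:
        decode = prev_decode + 1  # in-order: at best one cycle after the previous decode
        if not forwarding:
            # Without forwarding, wait in ID until every source has been written back.
            for src in sources:
                decode = max(decode, wb_cycle.get(src, 0))
        write_back = decode + 3   # ID -> EX -> MEM -> WB
        wb_cycle[dest] = write_back
        prev_decode = decode
        last_wb = write_back
    return last_wb


if __name__ == "__main__":
    program = [("R1", "R2", "R3"),   # ADD R1,R2,R3
               ("R4", "R5", "R1"),   # SUB R4,R5,R1
               ("R6", "R1", "R7")]   # AND R6,R1,R7
    print("with forwarding:", total_cycles(program, forwarding=True))
    print("without forwarding:", total_cycles(program, forwarding=False))
```

The model does not cover loads, where one stall cycle remains even with forwarding.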
Data Forwarding
Data dependency arises when an instruction needs to use the result of one of its predecessors before the result has returned to the register file => pipeline hazards
Forwarding paths allow results to be passed between stages as soon as they are available
5-stage pipeline requires each of the three source operands to be forwarded from any of the intermediate result registers
Still one load stall
LDR rN, []
ADD r2,r1,rN ;use rN immediately
One stall
Compiler rescheduling
The load instruction has a delay or latency that cannot be eliminated by forwarding alone.
[Table: pipeline timing for a load followed by dependent instructions, showing the stall cycle before EXsub]
LDR Interlock
In this example, it takes 7 clock cycles to execute 6 instructions, a CPI of about 1.2
An LDR instruction immediately followed by a data operation using the same register causes an interlock
Optimal Pipelining
In this example, it takes 6 clock cycles to execute 6 instructions, a CPI of 1
The LDR instruction does not cause the pipeline to interlock
In this example, it takes 8 clock cycles to execute 5 instructions, a CPI of 1.6
During the LDM there are parallel memory and writeback cycles
In this example, it takes 9 clock cycles to execute 5 instructions, a CPI of 1.8
The SUB incurs a further cycle of interlock because it uses the highest specified register in the LDM instruction
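The CPI figures quoted in these examples are simply clock cycles divided by instructions. A short check, with hypothetical labels for the four examples:

```python
def cpi(cycles, instructions):
    """Cycles per instruction for a measured instruction sequence."""
    return cycles / instructions


# (clock cycles, instructions) taken from the four examples in the slides
EXAMPLES = {
    "LDR interlock": (7, 6),
    "optimal pipelining": (6, 6),
    "LDM": (8, 5),
    "LDM followed by interlock": (9, 5),
}

if __name__ == "__main__":
    for name, (cycles, instructions) in EXAMPLES.items():
        print(f"{name}: CPI = {cpi(cycles, instructions):.2f}")
```

The first example gives 7/6, about 1.17, which the slide rounds to 1.2.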
Pipeline parallelism
ALU/MAC, LSU
LS instruction won't stall the pipeline
Out-of-order completion
Comparison
Feature | ARM9E | ARM10E | Intel XScale | ARM11
Architecture | ARMv5TE(J) | ARMv5TE(J) | ARMv5TE | ARMv6
Pipeline Length | 5 | 6 | 7 | 8
Java Decode | (ARM926EJ) | (ARM1026EJ) | No | Yes
V6 SIMD Instructions | No | No | No | Yes
MIA Instructions | No | No | Yes | Available as coprocessor
Branch Prediction | No | Static | Dynamic | Dynamic
Independent Load-Store Unit | No | Yes | Yes | Yes
Instruction Issue | Scalar, in-order | Scalar, in-order | Scalar, in-order | Scalar, in-order
Concurrency | None | ALU/MAC, LSU | ALU, MAC, LSU | ALU/MAC, LSU
Out-of-order completion | No | Yes | Yes | Yes
Target Implementation | Synthesizable | Synthesizable | Custom chip | Synthesizable and Hard macro
Embedded ICE
[Figure: EmbeddedICE organization: scan chain 0 and scan chain 1 around the processor core, other signals, and a bus splitter for Din[31:0] and Dout[31:0]]
[Figure: ARM7TDMI core interface signals]
memory interface: Din[31:0], Dout[31:0], D[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock
bus control: enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk
debug: dbgrq, breakpt, dbgack, exec, extern1, extern0, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx
other signal groups: MMU interface, state, TAP information, coprocessor interface, power, JTAG controls
Memory interface
32-bit address A[31:0], bidirectional data bus D[31:0], separate data out Dout[31:0], data in Din[31:0]
\mreq indicates that the processor requires a memory access
seq indicates that the memory address will be sequential to that used in the previous cycle
\mreq  seq  Cycle  Use
0      0    N      Non-sequential memory access
0      1    S      Sequential memory access
1      0    I      Internal cycle: bus and memory inactive
1      1    C      Coprocessor register transfer: memory inactive
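The cycle-type encoding can be written as a lookup table, for example in a bus monitor or a simple memory model. This is an illustrative sketch; the names are hypothetical, and the first value is the level of the active-low \mreq signal.

```python
# (level of \mreq, level of seq) -> (cycle type, use)
CYCLE_TYPES = {
    (0, 0): ("N", "Non-sequential memory access"),
    (0, 1): ("S", "Sequential memory access"),
    (1, 0): ("I", "Internal cycle: bus and memory inactive"),
    (1, 1): ("C", "Coprocessor register transfer: memory inactive"),
}


def classify_cycle(nmreq, seq):
    """Return (cycle type, description) for the two control signal levels."""
    return CYCLE_TYPES[(nmreq, seq)]


def needs_memory(nmreq, seq):
    """Only N and S cycles transfer data to or from memory."""
    return classify_cycle(nmreq, seq)[0] in ("N", "S")


if __name__ == "__main__":
    for levels in sorted(CYCLE_TYPES):
        print(levels, classify_cycle(*levels))
```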
MMU interface
\trans (translation control), 0: user mode, 1: privileged mode
\mode[4:0], bottom 5 bits of the CPSR (inverted)
Abort, disallow access
State
T bit, whether the processor is currently executing ARM or Thumb instructions
Configuration
Bigend, big-endian or little-endian
Initialization
\reset, starts the processor from a known state, executing from address 0x00000000
ARM7TDMI characteristics
Process | 0.35 um | Transistors | 74,209 | MIPS | 60
Metal layers | 3 | Core area | 2.1 mm² | Power | 87 mW
Vdd | 3.3 V | Clock | 0 to 66 MHz | MIPS/W | 690
Memory Access
The ARM7 is a Von Neumann, load/store architecture, i.e.,
Only one 32-bit data bus for both instructions and data.
Only the load/store instructions (and SWP) access memory.
Memory is addressed as a 32-bit address space
Data types can be 8-bit bytes, 16-bit half-words or 32-bit words, and memory may be seen as a byte line folded into 4-byte words
Words must be aligned to 4-byte boundaries, and half-words to 2-byte boundaries.
Always ensure that the memory controller supports all three access sizes
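The alignment rule can be expressed as a one-line check. The sketch below uses hypothetical names and only restates the rule from the slide: words on 4-byte boundaries, half-words on 2-byte boundaries, bytes anywhere.

```python
ACCESS_BYTES = {"byte": 1, "half-word": 2, "word": 4}


def is_aligned(address, access):
    """True if the address is a multiple of the access size in bytes."""
    return address % ACCESS_BYTES[access] == 0


def word_and_lane(address):
    """View memory as a byte line folded into 4-byte words:
    return (word number, byte position within that word)."""
    return address // 4, address % 4


if __name__ == "__main__":
    for address in (0x1000, 0x1002, 0x1003):
        print(hex(address),
              {access: is_aligned(address, access) for access in ACCESS_BYTES},
              word_and_lane(address))
```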
Non-sequential (N cycle)
(nMREQ, SEQ) = (0, 0)
The ARM core requests a transfer to or from an address which is unrelated to the address used in the preceding cycle.
Internal (I cycle)
(nMREQ, SEQ) = (1, 0)
The ARM core does not require a transfer, as it is performing an internal function, and no useful prefetching can be performed at the same time
ARM710T
8K unified write-through cache
Full memory management unit supporting virtual memory
Write buffer
ARM720T
As ARM 710T but with WinCE support
ARM 740T
8K unified write-through cache
Memory protection unit
Write buffer
ARM8
Higher performance than ARM7
By increasing the clock rate
By reducing the CPI
Higher memory bandwidth, 64-bit wide memory (ARM8)
Separate memories for instruction and data accesses (ARM9TDMI, ARM10TDMI)
Core Organization
The prefetch unit is responsible for fetching instructions from memory and buffering them (exploiting the double-bandwidth memory)
It is also responsible for branch prediction, using static prediction based on the branch direction (backward: predicted taken; forward: predicted not taken)
[Figure: ARM8 core organization: prefetch unit (addresses, PC, instructions), double-bandwidth memory, integer unit (read data, write data), coprocessor(s) (CPinst., CPdata)]
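The static prediction rule (backward: predicted taken; forward: predicted not taken) can be sketched as follows. The function names and the loop example are hypothetical; the point is that a loop-closing backward branch is predicted correctly on every iteration except the last.

```python
def predict_taken(branch_address, target_address):
    """Static prediction: backward branches taken, forward branches not taken."""
    return target_address < branch_address


def accuracy(branches):
    """Fraction of correct predictions for (branch, target, actually_taken) records."""
    correct = sum(predict_taken(branch, target) == taken
                  for branch, target, taken in branches)
    return correct / len(branches)


if __name__ == "__main__":
    # A loop-closing branch at 0x100 back to 0x80: taken 9 times, then falls through.
    loop = [(0x100, 0x80, True)] * 9 + [(0x100, 0x80, False)]
    print("loop branch accuracy:", accuracy(loop))
```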
Pipeline Organization
5-stage, prefetch unit occupies the 1st stage, integer unit occupies the remainder
(1) Instruction prefetch
(2) Instruction decode and register read
(3) Execute (shift and ALU)
(4) Data memory access
(5) Write back results
[Figure: ARM8 integer unit organization (register read, multiplier, execute, memory, coprocessor data)]
ARM8 Macrocell
ARM810
8 Kbyte unified instruction and data cache
Copy-back
Double-bandwidth
MMU
Coprocessor
Write buffer
[Figure: ARM810 organization: prefetch unit, CP15, JTAG, cache accessed by virtual address, MMU producing the physical address, write buffer, address buffer, data in / data out / address pins]
ARM9TDMI
Harvard architecture
Increases available memory bandwidth
Instruction memory interface Data memory interface
ARM9TDMI Organization
[Figure: ARM9TDMI pipeline organization: I-cache fetch, register read, shift/ALU execute with forwarding paths, D-cache load/store, register write-back]
ARM7TDMI:
Fetch: instruction fetch
Decode: Thumb decompress, ARM decode, reg read
Execute: shift/ALU, reg write
ARM9TDMI:
Fetch: instruction fetch
Decode: reg read, decode
Execute: shift/ALU
Memory: data memory access
Write: reg write
There is not sufficient slack time to translate Thumb instructions into ARM instructions and then decode them; instead, the hardware decodes both ARM and Thumb instructions directly
On-chip debugger
Additional features compared to ARM7TDMI
Hardware single stepping
Breakpoints can be set on exceptions
ARM9TDMI characteristics
Process | 0.25 um | Transistors | 110,000 | MIPS | 220
Metal layers | 3 | Core area | 2.1 mm² | Power | 150 mW
Vdd | 2.5 V | Clock | 0 to 200 MHz | MIPS/W | 1500
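The MIPS/W entries in the characteristics tables follow from the MIPS and power figures. A quick check, using the ARM7TDMI values (60 MIPS, 87 mW) and the ARM9TDMI values (220 MIPS, 150 mW) from the slides; the function name is hypothetical:

```python
def mips_per_watt(mips, power_mw):
    """Power efficiency in MIPS/W from MIPS and power in milliwatts."""
    return mips / (power_mw / 1000.0)


if __name__ == "__main__":
    print("ARM7TDMI:", round(mips_per_watt(60, 87)), "MIPS/W")
    print("ARM9TDMI:", round(mips_per_watt(220, 150)), "MIPS/W")
```

The ARM9TDMI result is about 1467 MIPS/W, so the 1500 in the table appears to be a rounded figure.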
2 × 16K caches (instruction and data)
Full memory management unit supporting virtual addressing and memory protection
Write buffer
[Figure: cached ARM9TDMI organization: instruction cache and instruction MMU (virtual IA to physical IA), data cache and data MMU (virtual DA to physical DA), CP15, EmbeddedICE & JTAG, write buffer, AMBA interface]
Architecture v5TE
ARM946E-S
ARM926EJ-S
ARMv5TEJ architecture
32-bit ARM instruction and 16-bit Thumb instruction set
DSP instruction extensions and single cycle MAC
ARM Jazelle technology
MMU which supports operating systems including Symbian OS, Windows CE, Linux
Flexible instruction and data cache sizes
Instruction and data TCM interfaces with wait state support
EmbeddedICE-RT logic for real-time debug
Industry standard AMBA bus AHB interfaces
ETM interface for real-time trace capability with ETM9
Optional MOVE Coprocessor delivers video encoding performance
ARM10TDMI (1/2)
Current high-end ARM processor core
Performance on the same IC process: ARM10TDMI is about 2x ARM9TDMI, which is about 2x ARM7TDMI
Pipeline stages: Fetch, Issue, Decode, Execute, Memory, Write
ARM10TDMI (2/2)
Reduce CPI
Branch prediction
Non-blocking load and store execution
64-bit data memory, transfer 2 registers in each cycle
ARM1020T Overview
Architecture v5T
ARM1020E will be v5TE
CPI ~ 1.3
6-stage pipeline
Static branch prediction
32KB instruction and 32KB data caches
hit under miss support
64 bits per cycle LDM/STM operations
Embedded ICE Logic RT-II
Support for new VFPv1 architecture
ARM10200 test chip
ARM1020T
VFP10
SDRAM memory interface
PLL
ARM1176JZ(F)-S
Powerful ARMv6 instruction set architecture
Thumb, Jazelle, DSP extensions
SIMD (Single Instruction Multiple Data) media processing extensions deliver up to 2x performance for video processing
ARM1176JZ(F)-S
Vectored interrupt interface and low-interrupt-latency mode speed up interrupt response and improve real-time performance
Optional Vector Floating Point coprocessor (ARM1136JF-S)
Powerful acceleration for embedded 3D-graphics
ARM11 MPCore
Highly configurable
Flexibility of total available performance from implementations using between 1 and 4 processors.
Sizing of both data and instruction cache between 16K and 64K bytes across each processor.
Either dual or single 64-bit AMBA 3 AXI system bus connection, allowing rapid and flexible SoC design
Optional integrated vector floating point (VFP) unit
Sizing on the number of hardware interrupts up to a total of 255 independent sources
ARM Cortex-A8
Used for applications including mobile phones, set-top boxes, gaming consoles, and automotive navigation/entertainment systems
High performance with low power consumption
ARM Cortex-A8
Architecture features
Thumb-2 instruction
Adds 130 additional instructions to Thumb
High density, high performance
ARM Cortex-A8
Superscalar pipeline
Dual issue, in-order, statically scheduled ARM integer pipeline
ARM Cortex-A8
NEON media engine (1/2)
ARM Cortex-A8
NEON media engine (2/2)
ARM Cortex-A8
Process | Frequency (MHz) | Area with cache (mm²) | Area without cache (mm²) | Power with cache (mW/MHz)
65nm (LP) | 650+ | <4 | <3 | <0.59
65nm (GP) | 1100+ | <4 | <3 | <0.45
ARM Cortex-A9
Memory Hierarchy
[Figure: memory hierarchy from registers (small, fast, expensive) through main memory to hard disk (large capacity, cheap)]
Caches (1/2)
A cache memory is a small, very fast memory that retains copies of recently used memory values. It is usually implemented on the same chip as the processor.
Caches work because programs normally display the property of locality, which means that at any particular time they tend to execute the same instructions many times on the same areas of data.
An access to an item which is in the cache is called a hit, and an access to an item which is not in the cache is a miss.
Caches (2/2)
A processor can have one of the following two organizations:
A unified cache
This is a single cache for both instructions and data
[Figure: a unified instruction and data cache between the processor and memory]
[Figure: separate instruction and data caches between the processor and memory]
[Figure: direct-mapped cache organization (tag RAM, data RAM, tag compare producing hit, mux selecting data)]
The index address bits are used to access the cache entry
The top address bits are then compared with the stored tag
If they are equal, the item is in the cache
The lowest address bits can be used to access the desired item within the line.
Example
The 8 Kbytes of data are held in 16-byte lines. There are therefore 512 lines
A 32-bit address:
4 bits to address bytes within the line
9 bits to select the line
19-bit tag
[Figure: address fields (tag 19, index 9, line 4) applied to a 512-line tag RAM and data RAM, with compare (hit) and mux (data)]
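The address split for this example (19-bit tag, 9-bit index, 4-bit offset within the line) and the hit test can be sketched as below. The class and function names are hypothetical, and the model tracks tags only, not the cached data.

```python
LINE_BYTES = 16     # bytes per line -> 4 offset bits
NUM_LINES = 512     # 8 Kbytes / 16 bytes -> 9 index bits
OFFSET_BITS = 4
INDEX_BITS = 9


def split_address(address):
    """Split a 32-bit address into (tag, index, offset)."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    """Tag store of a direct-mapped cache: one possible line per index."""

    def __init__(self):
        self.tags = [None] * NUM_LINES

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, index, _ = split_address(address & 0xFFFFFFFF)
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag  # replace whatever line was stored here
        return hit


if __name__ == "__main__":
    cache = DirectMappedCache()
    for address in (0x1000, 0x1004, 0x1000 + 8192, 0x1000):
        print(hex(address), "hit" if cache.access(address) else "miss")
```

Addresses that are 8 Kbytes apart share an index, so in the usage example the third access evicts the line loaded by the first.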
A 2-way set-associative cache
This form of cache is effectively two direct-mapped caches operating in parallel.
[Figure: 2-way set-associative cache organization: two tag RAM / data RAM pairs, each with its own compare and mux, producing hit and data]
Example
The 8 Kbytes of data are held in 16-byte lines. There are therefore 256 lines in each half of the cache
A 32-bit address:
4 bits to address bytes within the line
8 bits to select the line
20-bit tag
[Figure: address fields (tag 20, index 8, line 4) applied to two 256-line tag RAM / data RAM halves, each with compare and mux, producing hit and data]
[Figure: fully associative cache organization (tag CAM, data RAM)]
A CAM (Content Addressed Memory) cell is a RAM cell with an inbuilt comparator, so a CAM-based tag store can perform a parallel search to locate an address in any location
The address bits are compared with the stored tags
If they are equal, the item is in the cache
The lowest address bits can be used to access the desired item within the line.
Example
The 8 Kbytes of data are held in 16-byte lines. There are therefore 512 lines
A 32-bit address:
4 bits to address bytes within the line
28-bit tag
[Figure: address fields (tag 28, line 4) applied to a 512-line tag CAM and data RAM]
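The tag, index and offset widths in the three examples all follow from the cache size, line size and associativity. A sketch with a hypothetical helper, assuming every size is a power of two and a fully associative cache is treated as one set holding all 512 lines:

```python
def field_widths(cache_bytes, line_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a cache organization."""
    lines = cache_bytes // line_bytes
    sets = lines // ways                      # lines selectable by the index
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


if __name__ == "__main__":
    print("direct-mapped:    ", field_widths(8 * 1024, 16, ways=1))
    print("2-way set-assoc.: ", field_widths(8 * 1024, 16, ways=2))
    print("fully associative:", field_widths(8 * 1024, 16, ways=512))
```

Raising the associativity moves bits from the index into the tag: 19/9/4, then 20/8/4, then 28/0/4.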
Write Strategies
Write-through
All write operations are passed to main memory
Copy-back (write-back)
The cache is not kept coherent with main memory
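The difference in memory traffic between the two write strategies can be illustrated with a deliberately simple count. The assumptions are loud ones: the cache is large enough that no line is evicted during the sequence, a copy-back cache writes each modified (dirty) line to memory once when it is finally flushed, and a write-through cache sends every write to memory. The function name is hypothetical.

```python
def memory_writes(write_addresses, policy, line_bytes=16):
    """Number of main-memory write transactions for a sequence of CPU writes.

    Write-through counts individual writes; copy-back counts line write-backs.
    """
    if policy == "write-through":
        return len(write_addresses)          # every write goes to main memory
    if policy == "copy-back":
        dirty_lines = {address // line_bytes for address in write_addresses}
        return len(dirty_lines)              # each dirty line is written back once
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    writes = [0x00, 0x04, 0x08, 0x10, 0x00]  # five writes touching two lines
    print("write-through:", memory_writes(writes, "write-through"))
    print("copy-back:    ", memory_writes(writes, "copy-back"))
```

The two counts are in different units (word writes versus line write-backs), so the sketch shows the trend rather than exact bus traffic.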
Software Development
ARM Tools
[Figure: ARM tool flow; labels: C source, C libraries, asm source, assembler, system model, ARMsd, ARMulator, development board]
ARM software development: ADS
ARM system development: ICE and trace
ARM-based SoC development: modeling, tools, design flow
ARM Development Suite (ADS), ARM Software Development Toolkit (SDT) (1/3)
Develop and debug C/C++ or assembly language programs
armcc: ARM C compiler
armcpp: ARM C++ compiler
tcc: Thumb C compiler
tcpp: Thumb C++ compiler
armasm: ARM and Thumb assembler
armlink: ARM linker
armsd: ARM and Thumb symbolic debugger
ARM Development Suite (ADS), ARM Software Development Toolkit (SDT) (2/3)
.aof: ARM object format file
.aif: ARM image format file
The .aif file can be built to include the debug tables
ARM symbolic debugger, ARMsd
ARMsd can load, run and debug programs either on hardware such as the ARM development board or using software emulation of the ARM
AXD (ARM eXtended Debugger)
ARM debugger for Windows and Unix with a graphical user interface
Debug C, C++, and assembly language source
CodeWarrior IDE
Project management tool for Windows
ARM Development Suite (ADS), ARM Software Development Toolkit (SDT) (3/3)
Utilities
armprof: ARM profiler
Flash downloader: download binary images to Flash memory on a development board
Supporting software
ARMulator: ARM core simulator
Provides instruction-accurate simulation of ARM processors and enables ARM and Thumb executable programs to be run on non-native hardware
Integrated with the ARM debugger
Angel
Runs on target development hardware and enables you to develop and debug applications on ARM-based hardware
ARM C Compiler
Compiler is compliant with the ANSI standard for C
Supported by the appropriate library of functions
Uses the ARM Procedure Call Standard (APCS) for all external functions
For procedure entry and exit
Linker
Takes one or more object files and combines them
Resolves symbolic references between the object files and extracts the object modules from libraries
Normally the linker includes debug tables in the output file
Timing accuracy model is used for cache, memory management unit analysis, and so on
Summary (1/2)
ARM7TDMI
Von Neumann architecture
3-stage pipeline
CPI ~ 1.9
ARM9TDMI, ARM9E-S
Harvard architecture
5-stage pipeline
CPI ~ 1.5
ARM10TDMI
Harvard architecture
6-stage pipeline
CPI ~ 1.3
Summary (2/2)
Cache
Direct-mapped cache
Set-associative cache
Fully associative cache
Software Development
CodeWarrior
AXD
References
[1] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
[2] http://video.ee.ntu.edu.tw/~dip/slide.html
[3] S. Furber, ARM System-on-Chip Architecture, Addison Wesley Longman, ISBN 0-201-67519-6.
[4] www.arm.com