Table of Contents
• Processor Architecture
• How the CPU Works
• Pipelining
2
Architecture
In the context of computers, architecture has many meanings
Instruction Set Architecture (ISA)
The parts of a processor design that one needs to understand in order to write assembly/machine code
e.g. instruction set specification, registers
Example ISAs:
Intel: x86, IA32, x86-64, Itanium
ARM
Microarchitecture (or Computer Organization)
Implementation of the architecture
e.g. cache size, core frequency
Platform Architecture (or System Design)
Memory and I/O buses
Memory controllers
Direct memory access
3
Components of the Computer
Since 1946, all computers have had 4 components:
[Figure: the 4 components of a computer]
4
What is the CPU?
CPU is the brain of the computer system
The computer works by executing a program containing many instructions
Program
Sequence of instructions that perform a task
Think of executing a program like playing music
Instruction
A binary code representing the simplest operation performed by the processor
Think of an individual instruction as a note coming from a musical instrument
Code forms:
Machine code: byte-level program that a processor executes
Assembly code: text representation of machine code
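As a hedged illustration of the relationship between the two forms, the bytes below are the IA-32 machine code for the assembly instruction mov eax, 1; the array name is only for illustration.

/* The same instruction in machine code and assembly (IA-32):      */
/*   "mov eax, 1"  ->  B8 01 00 00 00                              */
/* (opcode B8 = MOV EAX, imm32, followed by the 32-bit immediate 1 */
/*  in little-endian byte order)                                   */
unsigned char machine_code[] = { 0xB8, 0x01, 0x00, 0x00, 0x00 };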
5
Memory
k × m array of stored bits (k is usually 2^n)
Address: unique (n-bit) identifier of a location
Contents: m-bit value stored at that location
Basic operations (see the sketch below):
LOAD: read a value from a memory location
STORE: write a value to a memory location
Basic memory types:
RAM: read-and-write
ROM: read-only
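A minimal C sketch of this model, assuming for illustration n = 16 address bits (k = 2^16 locations) and m = 8 bits per location; mem, load and store are hypothetical names, not part of any real memory API.

#include <stdint.h>

#define N_BITS 16                    /* address width n          */
#define K (1u << N_BITS)             /* k = 2^n locations        */

static uint8_t mem[K];               /* m = 8 bits per location  */

/* LOAD: read the m-bit value stored at an n-bit address */
uint8_t load(uint16_t address) { return mem[address]; }

/* STORE: write an m-bit value to an n-bit address */
void store(uint16_t address, uint8_t value) { mem[address] = value; }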
6
Buses
Most processors use the three-bus architecture.
Address bus
Carries the address of memory or I/O device for a data transfer. Determines the addressing range.
Unidirectional: always acts as an output of the CPU
Data bus
Carries data to be transferred between processor and memory or I/O.
Bidirectional: set as input when reading data, and output when writing data
Control bus
Carries status and control signals required for various operations
An assortment of signals, anything not address or data
e.g. R/W*, IO/M*, Interrupt, DMA
7
Address Bus
The address bus contains the address of the memory location or I/O device selected for a data transfer.
Address bus width determines the addressing range.
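For example, a 16-bit address bus can select 2^16 = 65,536 locations, while a 24-bit address bus can select 2^24 = 16,777,216 locations.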
8
Data Bus
Width of the data bus determines the amount of data transferable in one step
Most microcontrollers have 8-bit data buses
Can transfer 1 byte at any one time
A 32-bit word requires 4 transfers (as sketched below)
ARM has a 32-bit data bus
Can transfer 4 bytes at once
Some chips have an external bus with a selectable width of 8, 16 or 32 bits
Selecting a smaller data bus width lowers performance but enables interfacing to lower-cost memory devices
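A minimal sketch of moving a 32-bit word over an 8-bit data bus, assuming a hypothetical write_byte_to_bus() helper that performs a single 8-bit bus transfer; little-endian byte order is chosen purely for illustration.

#include <stdint.h>

void write_byte_to_bus(uint32_t address, uint8_t byte);   /* hypothetical: one 8-bit transfer */

/* Writing a 32-bit word over an 8-bit data bus takes four transfers. */
void write_word_over_8bit_bus(uint32_t address, uint32_t word)
{
    for (int i = 0; i < 4; i++)
        write_byte_to_bus(address + i, (uint8_t)(word >> (8 * i)));   /* little-endian order */
}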
9
Address and Data Buses
[Figure: memory map of an address space starting at $000000, giving 2^24 addressable locations]
10
Inside the CPU: Memory-Facing Registers
Registers store binary data. The following registers interface with memory: the Program Counter (PC), Instruction Register (IR), Address Register, and Data Register.
11
Inside the CPU: Internal Components
These components do not have direct access to memory: the instruction decoder, Control Unit (CU), ALU, and Accumulator (ACC).
12
ALU size
The ALU operates on several bits simultaneously
“The size of the processor”
Usually (but not always) determines data bus size
Typical sizes:
4 bits (remote controllers etc)
8 bits (microcontrollers: 68HC05, 8051, PIC)
16 bits (low-end microprocessors: Intel 8086)
32 bits (most popular size today: ARM, MIPS)
64 bits (servers: IBM POWER, Intel Xeon)
13
Table of Contents
• Processor Architecture
• How the CPU Works
• Pipelining
14
How does the CPU Work?
15
Inside the CPU: An Example Program

C version:
void main(void)
{
    int a = 1;
    int b = 2;
    int c;
    c = a + b;
}

Assembly version:
Address   Assembly
-------   -----------
0x1000    LOAD  0x2000
0x1002    ADD   0x2002
0x1004    STORE 0x2004

Explanation:
LOAD 0x2000    Load the value of a into the Data Register.
ADD 0x2002     Add the previously loaded value of a and the newly loaded value of b, and save the sum in ACC.
STORE 0x2004   Save the added result to the address of c.
16
Executing LOAD instruction
1. The address of the instruction the CPU wants to execute, 0x1000, is in the PC.
2. 0x1000 is put in the Address Register.
3. When 0x1000 enters the Address Register, it automatically accesses location 0x1000 of the memory.
4. The instruction there is read from memory.
5. The instruction is stored in the IR (LOAD 0x2000).
6. The instruction goes into the decoder. At the same time the PC is incremented.
7. The decoder interprets the instruction. The CU understands that it must fetch the value at address 0x2000.
8. The CU generates control signals to read the value at 0x2000 from memory.
9. The value 1 is entered into the Data Register by the control signals generated by the CU.
10. The value in the Data Register is available to any circuit that needs it.
11. Since this value may be operated on by the ALU, it is temporarily stored in the ACC.
17
Executing ADD instruction
1. As with LOAD, the address of the instruction to execute is 0x1002, to which the PC has already been incremented.
2. 0x1002 is put in the Address Register.
3. Address 0x1002 is accessed.
4. The value at 0x1002 is made available.
5. This value is stored in the IR.
6. The value in the IR is sent to the decoder. At the same time the PC is incremented.
7. The decoder interprets the value in the IR. The CU understands that it must add the value at address 0x2002.
8. Based on the decoder's interpretation, the CU generates control signals to read the value at 0x2002. The ALU is given a control signal to add.
9. The data from 0x2002 is loaded and saved in the Data Register.
10. The ALU adds the data in the Data Register to the current value in the ACC.
11. The sum replaces the old value in the ACC.
18
Executing STORE instruction
1. The address of the instruction to be executed, 0x1004, is the PC value.
2. The value in the PC is transferred to the Address Register.
3. Location 0x1004 is accessed.
4. The value from 0x1004 is made available to the CPU.
5. This value is saved in the IR.
6. The value in the IR is made available to the decoder, and the PC is incremented.
7. The decoder interprets the value in the IR.
8. Based on the decoder's interpretation, the CU generates control signals to store the value in the ACC at 0x2004.
9. The output of the ALU is stored in the ACC.
10. Finally, the value in the ACC is stored in location 0x2004.
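The three walkthroughs above can be condensed into a minimal C sketch of the fetch-decode-execute loop for this accumulator machine. The opcode values, the encoding of an instruction as an opcode/operand pair, and the variable names are simplified assumptions for illustration, not the encoding of any real processor.

#include <stdint.h>
#include <stdio.h>

enum { LOAD, ADD, STORE };                         /* assumed opcodes                */
typedef struct { int op; uint16_t operand; } Instr;

int main(void)
{
    static uint8_t mem[0x3000];                    /* data memory                    */
    Instr program[] = {                            /* the example program            */
        { LOAD,  0x2000 },                         /* ACC <- mem[0x2000]       (a)   */
        { ADD,   0x2002 },                         /* ACC <- ACC + mem[0x2002] (b)   */
        { STORE, 0x2004 },                         /* mem[0x2004] <- ACC       (c)   */
    };
    mem[0x2000] = 1;                               /* a = 1                          */
    mem[0x2002] = 2;                               /* b = 2                          */

    uint16_t pc = 0;                               /* program counter (index)        */
    uint8_t acc = 0;                               /* accumulator                    */
    while (pc < 3) {
        Instr ir = program[pc++];                  /* fetch into IR, increment PC    */
        switch (ir.op) {                           /* decode, then execute           */
        case LOAD:  acc = mem[ir.operand];  break;
        case ADD:   acc += mem[ir.operand]; break;
        case STORE: mem[ir.operand] = acc;  break;
        }
    }
    printf("c = %d\n", mem[0x2004]);               /* prints: c = 3                  */
    return 0;
}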
19
Table of Contents
• Processor Architecture
• How the CPU Works
• Pipelining
20
Idea of the Pipeline 1/3
21
Idea of the Pipeline 2/3
With pipelining, different stages (fetch, decode, execute) of successive instructions can be processed at the same time.
Pipelining is used even in the smallest $2 microcontrollers.
For our short program, while LOAD 0x2000 is actually being executed, ADD 0x2002 is being decoded and STORE 0x2004 is being fetched from memory.
22
Idea of the Pipeline 3/3
The 3-stage pipeline of the famous ARM7.
Each cell of the diagram is 1 clock cycle; the pipeline fills from the first cycle to the third cycle.
In the third cycle the first opcode is executed, the second opcode is decoded, and the third opcode is fetched, all at once.
Executing fetch-decode-execute without pipelining would take 3 × 3 = 9 cycles to execute 3 opcodes.
With a pipeline, it takes only 3 + (3 − 1) = 5 cycles to execute 3 opcodes.
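A sketch of that 3-stage timing for the example program (one column per clock cycle; this reconstructs the idea of the ARM7 diagram rather than its exact figure):

Cycle:     1      2      3      4      5
Fetch:     LOAD   ADD    STORE
Decode:           LOAD   ADD    STORE
Execute:                 LOAD   ADD    STORE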
23
A 5-stage Pipeline
The instruction execution steps can be refined to increase the number of pipeline stages
[Figure: non-pipelined vs. pipelined execution of the refined instruction steps]
24
Pipeline Performance
Latency
Defined as the time (or #cycles) from entering the pipeline until an instruction completes
Pipelining doesn’t help latency of single task
Throughput
Defined as the number of instructions executed per time period
Potential speedup = Number of pipeline stages
Trivia
The longest pipeline on a commercial machine is 31 stages on the Intel Pentium 4.
25
Speedup
A k-stage pipeline processes n tasks in k + (n − 1) clock cycles:
k cycles for the first task and
n − 1 cycles for the remaining n − 1 tasks
Total time to process n tasks with k stages (clock period τ):
Pipelined processor: [k + (n − 1)]τ
Non-pipelined processor: nkτ
Speedup (S_k → k as n → ∞):
S_k = T_non-pipelined / T_pipelined = T_1 / T_k = nkτ / [k + (n − 1)]τ = nk / [k + (n − 1)]
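For the 3-stage pipeline and the 3-instruction program above (k = 3, n = 3), S_3 = (3 × 3) / (3 + 3 − 1) = 9/5 = 1.8; as n grows, the speedup approaches k = 3.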
26
Clocking
[Figure: adjacent pipeline stages S_i and S_i+1 separated by a latch; stage delay τ_m, latch delay d]
Latch delay: d
Clock cycle of the pipeline: τ
τ = max(τ_m) + d
Pipeline frequency: f = 1/τ
∴ Pipeline rate limited by slowest pipeline stage.
Also, each added stage adds another latch delay d
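For example (illustrative numbers only): with stage delays of 3 ns, 4 ns and 5 ns and a latch delay of 1 ns, τ = max(τ_m) + d = 5 ns + 1 ns = 6 ns, so f = 1/τ ≈ 167 MHz.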
27
Limits to Pipelining
Hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards
Two instructions attempting to use the same resource at the same time
Data hazards
An instruction attempting to use data before it is available in the register file, e.g. an ADD that needs the result of the immediately preceding LOAD before that result has been written back
Control hazards
Caused by branch instructions, which invalidate instructions already in the pipeline, requiring it to be flushed and refilled
The simplest solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" into the pipeline
More stall cycles = lower performance
More complex solutions include branch prediction and data forwarding, which are outside the scope of this course
28
CPU Performance
Processor performance is a function of:
IC: instruction count
CPI: cycles per instruction
Clock cycle time
CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
         = IC × CPI × Clock cycle time
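Worked example (illustrative numbers only): a program with IC = 10^9 instructions and CPI = 1.5 on a 1 GHz clock (cycle time 1 ns) takes CPU time = 10^9 × 1.5 × 1 ns = 1.5 s.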
29