
ISA

By
AJAL.A.J - AP/ECE
Instruction Set Architecture
• Instruction set architecture is the structure of a
computer that a machine language programmer must
understand to write a correct (timing independent)
program for that machine.

• The instruction set architecture is also the machine description that a hardware designer must understand to design a correct implementation of the computer.
• In a VLIW-style ISA, a fixed number of operations are formatted as one big instruction (called a bundle):

  [ op | op | op | bundling info ]
Instruction Set Architecture

Computer Architecture =
Instruction Set Architecture
+ Machine Organization

• “... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior …”
Evolution of Instruction Sets

Single Accumulator (EDSAC 1950)
        ↓
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
        ↓
Separation of Programming Model from Implementation
        ↓
High-level Language Based (B5000 1963)   /   Concept of a Family (IBM 360 1964)
        ↓
General Purpose Register Machines
        ↓
Complex Instruction Sets (Vax, Intel 432, 1977-80)   /   Load/Store Architecture (CDC 6600, Cray 1, 1963-76)
        ↓
RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC, ... 1987)
        ↓
VLIW / “EPIC”? (IA-64, ... 1999)
Instruction Set Architecture
• Computer Architecture = Hardware + ISA
– Interface between all the software that runs on the
machine and the hardware that executes it
Instruction Set Architecture (ISA)
• An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. An ISA includes a specification of the set of opcodes (machine language) and the native commands implemented by a particular processor.
Microarchitecture
• Instruction set architecture is distinguished from
the microarchitecture, which is the set of processor
design techniques used to implement the
instruction set. Computers with different micro
architectures can share a common instruction set.
• For example, the Intel Pentium and
the AMD Athlon implement nearly identical
versions of the x86 instruction set, but have
radically different internal designs.
NUAL vs. UAL
• Unit Assumed Latency (UAL)
– Semantics of the program are that each
instruction is completed before the next one is
issued
– This is the conventional sequential model

• Non-Unit Assumed Latency (NUAL):


– At least one operation has a non-unit assumed latency L, which is greater than 1
– The semantics of the program are correctly
understood if exactly the next L-1 instructions are
understood to have issued before this operation
completes
Summary of RISC Design
 All instructions are typically of one size
 Few instruction formats
 All operations on data are register to register
 Operands are read from registers
 Result is stored in a register

 General purpose integer and floating point registers


 Typically, 32 integer and 32 floating-point registers

 Memory access only via load and store instructions


 Load and store: bytes, half words, words, and double words

 Few simple addressing modes


Instruction Set Architecture ICS 233 – Computer Architecture and Assembly Language – KFUPM

© Muhamed Mudawar slide 22


Instruction Set Architectures

 Reduced Instruction Set Computers (RISCs)


 Simple instructions
 Flexibility
 Higher throughput
 Faster execution

 Complex Instruction Set Computers (CISCs)


 Hardware support for high-level language
 Compact program
MIPS: A RISC example

 Smaller and simpler instruction set


 111 instructions
 One cycle execution time
 Pipelining
 32 registers
 32 bits for each register
MIPS Instruction Set
 25 branch/jump instructions
 21 arithmetic instructions
 15 load instructions
 12 comparison instructions
 10 store instructions
 8 logic instructions
 8 bit manipulation instructions
 8 move instructions
 4 miscellaneous instructions
Overview of the MIPS Processor

[Block diagram: memory of up to 2^32 bytes = 2^30 words (4 bytes per word); the EIU (Execution & Integer Unit, main processor) with 32 general-purpose registers $0-$31, an arithmetic & logic unit, and an integer multiplier/divider with Hi/Lo registers; the FPU (Floating Point Unit, Coprocessor 1) with 32 floating-point registers F0-F31 and a floating-point arithmetic unit; the TMU (Trap & Memory Unit, Coprocessor 0) with the BadVaddr, Status, Cause, and EPC registers.]

MIPS R2000 Organization

[Block diagram: the CPU with registers $0-$31, an arithmetic unit, and a multiply/divide unit with Lo/Hi registers; Coprocessor 1 (FPU) with registers $0-$31 and its own arithmetic unit; Coprocessor 0 (traps and memory) with registers BadVAddr, Cause, Status, and EPC; all connected to memory.]
Definitions

Performance is typically in units-per-second
• bigger is better

If we are primarily concerned with response time:
• performance = 1 / execution_time

“X is n times faster than Y” means:

ExecutionTime_Y / ExecutionTime_X = Performance_X / Performance_Y = n

ECE 361 3-28


Organizational Trade-offs

[Layered view: Application → Programming Language → Compiler → ISA → Datapath & Control → Function Units → Transistors, Wires, Pins. The ISA level determines the instruction mix, the datapath/control level determines CPI, and the physical level determines cycle time.]

CPI is a useful design measure relating the Instruction Set Architecture with the implementation of that architecture and the program measured.
Principal Design Metrics: CPI and Cycle Time

Performance = 1 / ExecutionTime

Performance = 1 / (InstructionCount × CPI × CycleTime)

Performance = 1 / (Instructions × (Cycles/Instruction) × (Seconds/Cycle))
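These relations can be checked numerically. A quick sketch (the instruction count, CPI, and clock rate below are made-up illustration values, not measurements from any real machine):

```python
# "Iron law" of processor performance relating the three design metrics.
def execution_time(instruction_count, cpi, cycle_time_s):
    # ExecutionTime = Instructions x Cycles/Instruction x Seconds/Cycle
    return instruction_count * cpi * cycle_time_s

# Hypothetical program: 1 million instructions, CPI of 1.5, 1 GHz clock (1 ns cycle)
t = execution_time(1_000_000, 1.5, 1e-9)   # 1.5 ms
performance = 1 / t
```

Improving any one of the three factors (fewer instructions, lower CPI, or a faster clock) improves performance, which is why the following slides treat them as the principal design metrics.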



Amdahl's “Law”: Make the Common Case Fast

Speedup due to enhancement E:

Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1 - F) + F/S) × ExTime(without E)

Speedup(with E) = ExTime(without E) / (((1 - F) + F/S) × ExTime(without E)) = 1 / ((1 - F) + F/S)

Performance improvement is limited by how much the improved feature is used: invest resources where time is spent.
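Amdahl's relation is easy to encode and experiment with; the fractions and factors used below are illustrative:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 50% of the task by 2x yields only ~1.33x overall,
# and even s -> infinity caps the overall speedup at 1/(1 - f) = 2x.
```

This is the quantitative form of "make the common case fast": the un-enhanced fraction (1 - f) dominates as s grows.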



Classification of Instruction Set Architectures

Instruction Set Design

[The instruction set is the interface layer between software (above it) and hardware (below it).]

Multiple implementations: 8086 → Pentium 4

ISAs evolve: MIPS-I, MIPS-II, MIPS-III, MIPS-IV, MIPS-V, MDMX, MIPS-32, MIPS-64
The steps for executing an instruction:

1. Fetch the instruction
2. Decode the instruction
3. Locate the operand
4. Fetch the operand (if necessary)
5. Execute the operation in processor registers
6. Store the results
7. Go back to step 1



Typical Processor Execution Cycle

Instruction Fetch: Obtain the instruction from program storage
Instruction Decode: Determine the required actions and instruction size
Operand Fetch: Locate and obtain operand data
Execute: Compute the result value or status
Result Store: Deposit results in a register or storage for later use
Next Instruction: Determine the successor instruction



Instruction and Data Memory: Unified or Separate

Programmer's view: a program is a sequence of encoded instructions, e.g. ADD 01010, SUBTRACT 01110, AND 10011, OR 10001, COMPARE 11010, ...
Computer's view: CPU, memory, and I/O.

Princeton (Von Neumann) Architecture
--- Data and instructions mixed in the same unified memory
--- Program as data
--- Storage utilization
--- Single memory interface

Harvard Architecture
--- Data & instructions in separate memories
--- Has advantages in certain high-performance implementations
--- Can optimize each memory


Basic Addressing Classes

Declining cost of registers


Instruction types:
- Data-transfer instructions
- Arithmetic instructions
- Logical and bit-manipulation instructions
- Shift instructions
Stack Architectures



 Stack architecture
 high frequency of memory accesses has made it unattractive
 useful for rapid interpretation of high-level language programs

Infix expression: (A + B) × C + (D × E)
Postfix expression: A B + C × D E × +
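The postfix form is exactly what a stack machine evaluates: operands are pushed, and each operator pops its two arguments and pushes the result. A minimal Python sketch (the variable values in the example are made up):

```python
def eval_postfix(tokens, env):
    """Evaluate a postfix token list with a stack, as a stack machine would."""
    stack = []
    ops = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}
    for t in tokens:
        if t in ops:
            b = stack.pop()       # top of stack is the second operand
            a = stack.pop()
            stack.append(ops[t](a, b))
        else:
            stack.append(env[t])  # push the operand's value
    return stack.pop()

# AB+C*DE*+ with A=1, B=2, C=3, D=4, E=5 evaluates (1+2)*3 + 4*5
result = eval_postfix(list("AB+C*DE*+"), {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5})
```

Note that no explicit addresses are needed; operand order is encoded entirely by stack position, which is why postfix suits rapid interpretation.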



Accumulator Architectures



Register-Set Architectures



Register-to-Register: Load-Store Architectures



Register-to-Memory Architectures



Memory-to-Memory Architectures



Addressing Modes



Instruction Set Design Metrics

Static metrics
• How many bytes does the program occupy in memory?

Dynamic metrics
• How many instructions are executed?
• How many bytes does the processor fetch to execute the program?
• How many clocks are required per instruction?
• How "lean" a clock is practical?

ExecutionTime = 1 / Performance = Instructions × (Cycles/Instruction) × (Seconds/Cycle)

(The three factors are Instruction Count, CPI, and Cycle Time.)



Types of ISA and examples:

1. RISC -> Playstation


2. CISC -> Intel x86
3. MISC -> INMOS Transputer
4. ZISC -> ZISC36
5. SIMD -> many GPUs
6. EPIC -> IA-64 Itanium
7. VLIW -> C6000 (Texas Instruments)
Problems of the Past
• In the past, it was believed that hardware
design was easier than compiler design
– Most programs were written in assembly
language
• Hardware concerns of the past:
– Limited and slower memory
– Few registers
The Solution
• Have instructions do more work, thereby
minimizing the number of instructions called
in a program
• Allow for variations of each instruction
– Usually variations in memory access
• Minimize the number of memory accesses
The Search for RISC
• Compilers became more prevalent
• The majority of CISC instructions were rarely
used
• Some complex instructions were slower than
a group of simple instructions performing an
equivalent task
– Too many instructions for designers to optimize
each one
RISC Architecture

• Small, highly optimized set of instructions


• Uses a load-store architecture
• Short execution time
• Pipelining
• Many registers
Pipelining
• Break instructions into steps
• Work on instructions like in an assembly line
• Allows for more instructions to be executed
in less time
• An n-stage pipeline is n times faster than a non-pipelined processor (in theory)
RISC Pipeline Stages
• Fetch instruction
• Decode instruction
• Execute instruction
• Access operand
• Write result

– Note: Slight variations depending on processor


Without Pipelining
• Normally, you would perform the fetch, decode,
execute, operate, and write steps of an instruction
and then move on to the next instruction

[Timing chart: with five steps per instruction, Instr 1 occupies clock cycles 1-5 and Instr 2 occupies cycles 6-10.]
With Pipelining
• The processor is able to perform each stage
simultaneously.
• If the processor is decoding an instruction, it may
also fetch another instruction at the same time.
[Timing chart: the stages overlap, so Instrs 1-5 complete in clock cycles 5 through 9 instead of taking 25 cycles.]
Pipeline (cont.)
• Length of pipeline depends on the longest
step
• Thus in RISC, all instructions were made to
be the same length
• Each stage takes 1 clock cycle
• In theory, an instruction should be finished
each clock cycle
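The "one instruction finished each clock cycle" claim follows from a simple cycle count. A sketch assuming one cycle per stage and no stalls:

```python
def pipelined_cycles(n_instructions, n_stages):
    # Fill the pipeline (n_stages cycles for the first instruction),
    # then one instruction completes per cycle.
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, n_stages):
    # Without pipelining, each instruction passes through every stage
    # before the next one starts.
    return n_instructions * n_stages

# 5 instructions on a 5-stage pipeline: 9 cycles instead of 25
```

For large instruction counts the ratio approaches the stage count, which is the "n times faster, in theory" limit quoted above; hazards and stalls keep real pipelines below it.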
Pipeline Problem

• Problem: An instruction may need to wait for


the result of another instruction
Pipeline Solution

• Solution: The compiler may recognize which instructions are dependent on or independent of the current instruction, and rearrange them to run an independent one first
How to make pipelines faster
• Superpipelining
– Divide the stages of pipelining into more stages
• Ex: Split “fetch instruction” stage into two
stages
Superscalar pipelining

 Run multiple pipelines in parallel

Super-duper pipelining

 Combine superpipelining with superscalar execution
Dynamic pipeline

• Dynamic pipeline: Uses buffers to hold


instruction bits in case a dependent
instruction stalls
Why CISC Persists ?
• Most Intel and AMD chips are CISC x86
• Most PC applications are written for x86
• Intel spent more money improving the
performance of their chips
• Modern Intel and AMD chips incorporate
elements of pipelining
– During decoding, x86 instructions are split into
smaller pieces
VLSI ARCHITECTURES

Superscalar and VLIW Architectures
Outline
• Types of architectures
• Superscalar
• Differences between CISC, RISC and VLIW
• VLIW ( very long instruction word )
Very Long Instruction Word

VLIW goals:
 Be flexible
 Match the underlying technology well

o Very long instruction word (VLIW) refers to a processor architecture designed to take advantage of instruction-level parallelism

VLIW philosophy:
– “dumb” hardware
– “intelligent” compiler
VLIW - History
• Floating Point Systems Array Processor
– very successful in 70’s
– all latencies fixed; fast memory
• Multiflow
– Josh Fisher (now at HP)
– 1980’s Mini-Supercomputer
• Cydrome
– Bob Rau (now at HP)
– 1980’s Mini-Supercomputer
• Tera
– Burton Smith
– 1990’s Supercomputer
– Multithreading
• Intel IA-64 (Intel & HP)
VLIW Processors

Goals of the hardware design:
• reduce hardware complexity
• shorten the cycle time for better performance
• reduce power requirements

How do VLIW designs reduce hardware complexity?
1. Less multiple-issue hardware
   - no dependence checking for instructions within a bundle
   - can be fewer paths between instruction issue slots & FUs
2. Simpler instruction dispatch
   - no out-of-order execution, no instruction grouping
3. Ideally, no structural hazard checking logic

• This reduction in hardware complexity improves cycle time & power consumption
VLIW Processors

More compiler support to increase ILP: the compiler detects hazards and hides latencies
• structural hazards
  • no two operations to the same functional unit
  • no two operations to the same memory bank
• hiding latencies
  • data prefetching
  • hoisting loads above stores
• data hazards
  • no data hazards among instructions in a bundle
• control hazards
  • predicated execution
VLIW: Definition
• Multiple independent Functional Units
• Instruction consists of multiple independent instructions
• Each of them is aligned to a functional unit
• Latencies are fixed
– Architecturally visible
• Compiler packs independent operations into a VLIW instruction and also schedules all
hardware resources
• Entire VLIW issues as a single unit
• Result: ILP with simple hardware
– compact, fast hardware control
– fast clock
– At least, this is the goal
Introduction
o Instruction of a VLIW processor consists of multiple independent
operations grouped together.
o There are multiple independent Functional Units in VLIW processor
architecture.
o Each operation in the instruction is aligned to a functional unit.
o All functional units share the use of a common large register file.
o This type of processor architecture is intended to allow higher
performance without the inherent complexity of some other
approaches.
VLIW History
The term was coined by J.A. Fisher (Yale) in 1983:
ELI-512 (prototype), Trace (commercial)
Origin lies in horizontal microcode optimization
Another pioneering work by B. Ramakrishna Rau in 1982:
Polycyclic (prototype), Cydra-5 (commercial)
Recent developments: TriMedia (Philips),
TMS320C6X (Texas Instruments)
"Bob" Rau

• Bantwal Ramakrishna "Bob" Rau (1951


– December 10, 2002) was a computer
engineer and HP Fellow. Rau was a founder
and chief architect of Cydrome, where he
helped develop the Very long instruction
word technology that is now standard in
modern computer processors. Rau was the
recipient of the 2002 Eckert–Mauchly
Award.
1984: Co-founded Cydrome Inc. and was the chief
architect of the Cydra 5 mini-supercomputer.

1989: Joined Hewlett Packard and started HP Lab's research


program in VLIW and instruction-level parallel processing.
Director of the Compiler and Architecture Research (CAR)
program, which during the 1990s, developed advanced
compiler technology for Hewlett Packard and Intel computers.

At HP, also worked on PICO (Program In, Chip Out) project


to take an embedded application and to automatically design
highly customized computing hardware that is specific to that
application, as well as any compiler that might be needed.

2002: passed away after losing a long battle with cancer


The VLIW Architecture
• A typical VLIW (very long instruction word) machine
has instruction words hundreds of bits in length.
• Multiple functional units are used concurrently in a
VLIW processor.
• All functional units share the use of a
common large register file.
Parallel Operating Environment (POE)
• Compiler creates complete plan of run-time execution
– At what time and using what resource
– POE communicated to hardware via the ISA
– Processor obediently follows POE
– No dynamic scheduling, out of order execution
• These second guess the compiler’s plan
• Compiler allowed to play the statistics
– Many types of info only available at run-time
• branch directions, pointer values
– Traditionally compilers behave conservatively → handle the worst-case possibility
– Allow the compiler to gamble when it believes the odds are in its favor
• Profiling
• Expose micro-architecture to the compiler
– memory system, branch execution
VLIW Processors

Compiler support to increase ILP


• compiler creates each VLIW word
• need for good code scheduling is greater than with in-order issue superscalars
• an instruction doesn't issue if one of its operations can't (like a series string of bulbs: one stalls all)
• techniques for increasing ILP:
  1. loop unrolling
  2. software pipelining (schedules instructions from different iterations together)
  3. aggressive inlining (function becomes part of the caller code)
  4. trace scheduling (schedule beyond basic block boundaries)
Different Approaches
Other approaches to improving performance in processor architectures :
o Pipelining
Breaking up instructions into sub-steps so that instructions can be
executed partially at the same time

o Superscalar architectures
Dispatching individual instructions to be executed completely
independently in different parts of the processor

o Out-of-order execution
Executing instructions in an order different from the program
Parallel processing
Processing instructions in parallel requires three major tasks:

1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;
2. assigning instructions to the functional units on the hardware;
3. determining when instructions are initiated and placed together into a single word.
ILP
Consider the following program:
op1: e = a + b
op2: f = c + d
op3: m = e * f

o Operation 3 depends on the results of operations 1 and 2, so it


cannot be calculated until both of them are completed
o However, operations 1 and 2 do not depend on any other
operation, so they can be calculated simultaneously
o If we assume that each operation can be completed in one unit of
time then these three instructions can be completed in a total of
two units of time giving an ILP of 3/2.
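This ILP figure can be computed mechanically: schedule each operation at the earliest time step where all of its inputs are ready, then divide the operation count by the number of steps. A sketch using the example's dependences:

```python
def ilp(deps):
    """deps maps each op to the ops it depends on; unit latency assumed.
    Returns (number_of_ops, number_of_time_steps)."""
    level = {}
    def depth(op):
        # Earliest step an op can run: one past its latest-finishing input.
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps[op]), default=0)
        return level[op]
    steps = max(depth(op) for op in deps)
    return len(deps), steps

# op1 and op2 are independent; op3 needs both -> 3 ops in 2 steps, ILP = 3/2
n, t = ilp({'op1': [], 'op2': [], 'op3': ['op1', 'op2']})
```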
Two approaches to ILP

o Hardware approach:
Works upon dynamic parallelism where
scheduling of instructions is at run time

o Software approach:
Works on static parallelism where
scheduling of instructions is by compiler
VLIW COMPILER

o Compiler is responsible for static scheduling of instructions in VLIW


processor.

o Compiler finds out which operations can be


executed in parallel in the program.

o It groups together these operations in single instruction which is the


very large instruction word.

o The compiler ensures that an operation is not issued before its operands are ready.
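That grouping step can be sketched as a greedy scheduler that walks operations in program order and starts a new instruction word whenever an operation depends on something in the current word. This ignores per-unit issue restrictions and latencies, so it is only an illustration of the idea, not a real VLIW compiler pass:

```python
def pack_vliw(ops, deps, width=3):
    """Greedily pack ops (given in program order) into VLIW words.
    deps[op] lists ops whose results op needs; an op cannot share a
    word with one of its producers."""
    words, current = [], []
    for op in ops:
        if len(current) == width or any(d in current for d in deps[op]):
            words.append(current)   # close the current word, start a new one
            current = []
        current.append(op)
    if current:
        words.append(current)
    return words

# op1 and op2 are independent and share a word; op3 must start a new one
words = pack_vliw(['op1', 'op2', 'op3'],
                  {'op1': [], 'op2': [], 'op3': ['op1', 'op2']})
```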
VLIW Example

[Block diagram: an instruction fetch & issue unit feeds several functional units (FUs) and two memory ports, all connected to a multi-ported register file.]
Working

o Long instruction words are fetched from the memory


o A common multi-ported register file for fetching the operands and
storing the results.
o Parallel random access to the register file is possible through the
read/write cross bar.
o Execution in the functional units is carried out concurrently with the
load/store operation of data between RAM and the register file.
o One or multiple register files for FX and FP data.
o Rely on compiler to find parallelism and schedule dependency free
program code.
Major categories

VLIW – Very Long Instruction Word


EPIC – Explicitly Parallel Instruction Computing
IA-64 EPIC

Explicitly Parallel Instruction Computing, a VLIW-style architecture

2001: 800 MHz Itanium, an IA-64 implementation

Bundle of instructions:
• 128-bit bundles
• 3 41-bit instructions per bundle
• 2 bundles can be issued at once
• if one issues, another is fetched, so there is less delay in bundle issue
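The bundle arithmetic works out exactly once the 5-bit template field is included; in IA-64 the template encodes the slot types and stop positions alongside the three 41-bit instruction slots:

```python
# IA-64 bundle bit budget: three 41-bit instruction slots plus a
# 5-bit template field fill the 128-bit bundle exactly.
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
TEMPLATE_BITS = 5
bundle_bits = SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS
```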
Data path: A Simple VLIW Architecture

[Block diagram: several FUs sharing one central register file.]

Scalability? Access time, area, and power consumption sharply increase with the number of register ports.

Data path: Clustered VLIW Architecture (distributed register file)

[Block diagram: FUs partitioned into clusters, each cluster with its own register file, joined by an interconnection network.]
Coarse-grain FUs with a VLIW core

[Block diagram: a VLIW core (instruction register, program counter, micro-code logic, and register pairs Reg1/Reg2) connected through a multiplexer network to coarse-grain FUs such as MULT, RAM, and ALU; embedded (co)processors serve as FUs in the VLIW architecture.]

Application-Specific FUs

[An FU is characterized by its functionality, number of inputs, number of outputs, latency, initiation interval, and I/O time shape.]
Superscalar Processors
• Superscalar
– Operations are sequential
– Hardware figures out resource assignment, time of execution

• Superscalar processors are designed to exploit more


instruction-level parallelism in user programs.
• Only independent instructions can be executed in parallel
without causing a wait state.
• The amount of instruction-level parallelism varies widely
depending on the type of code being executed.
Pipelining in Superscalar Processors
• In order to fully utilise a superscalar processor of
degree m, m instructions must be executable in
parallel. This situation may not be true in all clock
cycles. In that case, some of the pipelines may be
stalling in a wait state.
• In a superscalar processor, the simple operation
latency should require only one cycle, as in the base
scalar processor.
Superscalar Execution
Superscalar Implementation
• Simultaneously fetch multiple instructions
• Logic to determine true dependencies involving
register values
• Mechanisms to communicate these values
• Mechanisms to initiate multiple instructions in
parallel
• Resources for parallel execution of multiple
instructions
• Mechanisms for committing process state in
correct order
Difference Between VLIW
&
Superscalar Architecture
Why Superscalar Processors are
commercially more popular as
compared to VLIW processor ?

 Binary code compatibility among scalar & superscalar processors of the same family
 Same compiler works for all processors (scalars and superscalars) of the same family
 Assembly programming of VLIWs is tedious
 Code density in VLIWs is very poor
  - instruction encoding schemes
Superscalars vs. VLIW

VLIW requires a more complex compiler

Superscalars can more efficiently execute pipeline-independent code
• consequence: no need to recompile if the implementation changes
Comparison: CISC, RISC, VLIW

VLSI DESIGN GROUP – METS SCHOOL OF ENGINEERING , MALA


Advantages of VLIW

Compiler prepares fixed packets of multiple operations that give the full "plan of execution"
– dependencies are determined by compiler and used to
schedule according to function unit latencies
– function units are assigned by compiler and correspond to
the position within the instruction packet ("slotting")
– compiler produces fully-scheduled, hazard-free code =>
hardware doesn't have to "rediscover" dependencies or
schedule
Disadvantages of VLIW

Compatibility across implementations is a major problem
– VLIW code won't run properly with different number
of function units or different latencies
– unscheduled events (e.g., cache miss) stall entire
processor
Code density is another problem
– low slot utilization (mostly nops)
– reduce nops by compression ("flexible VLIW",
"variable-length VLIW")
Thanks

