
COMPUTER SYSTEM ARCHITECTURE - CS 405

Module - 2 Part - 1

Processors
• Advanced Processor Technology
– Design Space of Processors
– Instruction-Set Architectures
– CISC Scalar Processors
– RISC Scalar Processors
• Superscalar and Vector Processors
– Superscalar Processors
– VLIW Architecture
– Vector and Symbolic Processors
Advanced Processor Technology
• Major microprocessor families include:
– CISC Computers
– RISC Computers
– Superscalar Processors
– VLIW Processors
– Superpipelined Processors
– Vector Supercomputers
– Symbolic Processors
• Scalar and vector processors are used for numerical computations.
• Symbolic processors have been developed for AI applications.
Design space of processors
• The two broad categories of processors are CISC and RISC.
❑ CISC (Complex Instruction Set Computing)
• Conventional processors like the Intel i486, M68040, VAX 8600, and IBM 390 fall into this family.
• Typical clock rate ~ 33 – 50 MHz, with microprogrammed control.
• Typical CPI ~ 1 – 20.
• CISC processors are at the upper part of the design space.
❑ RISC (Reduced Instruction Set Computing)
• Today's RISC processors include the Intel i860, SPARC, MIPS R3000, IBM RS/6000, Alpha, ARM, etc.
• Faster clock rate ~ 20 – 120 MHz.
• Hardwired control.
• Typical CPI ~ 1 – 2.
❑ Superscalar processors
• Special subclass of RISC processors.
• Allow multiple instructions to be executed simultaneously during each cycle.
• Effective CPI lower than scalar RISC.
• Clock rate same as scalar RISC.
Design space of processors….
❑ Very long instruction word (VLIW) architecture
• Uses even more functional units than a superscalar processor.
• CPI can be further lowered.
• Due to very long instructions (256 to 1024 bits per instruction), the clock rate is slow.
• VLIW processors have been mostly implemented with microprogrammed control.
• E.g., the Intel i860 RISC processor offers a VLIW-like dual-instruction mode.

❑ Superpipelined processors
• Use multiphase clocks.
• Increased clock rate, ranging from 100 to 500 MHz.
• But the CPI rate is rather high.

❑ Vector supercomputers
• Use multiple functional units for concurrent scalar and vector operations.
• The effective CPI of a processor used in a supercomputer should be very low.
• The cost increases appreciably if a processor design is restricted to the lower-right corner of the design space.
Instruction Pipeline
• The execution cycle of a typical instruction involves four
phases:
fetch, decode, execute & write-back
• Often executed by an instruction pipeline.
• An instruction processor can be modelled by a pipeline
structure.
• The pipeline, like an industrial assembly line, receives
successive instructions from its input end and executes them
in a streamlined, overlapped fashion as they flow through.
• A pipeline cycle is intuitively defined as the time required for
each phase to complete its operation, assuming equal delay
in all phases (pipeline stages).
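The overlapped, assembly-line flow described above can be sketched with a small timing model (an illustrative sketch only; the stage names and the equal-delay assumption follow the text above):

```python
# A minimal timing sketch of the four-phase instruction pipeline,
# assuming unit delay per stage and one instruction entering per cycle.
STAGES = ["fetch", "decode", "execute", "write-back"]

def timetable(n_instructions):
    """Map cycle number -> list of (instruction, stage) pairs active that cycle."""
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s, []).append((i, stage))
    return table

tt = timetable(5)
# A k-stage pipeline finishes n instructions in k + n - 1 cycles.
print(max(tt) + 1)     # 8  (= 4 + 5 - 1)
print(len(tt[3]))      # 4  (all four stages busy in cycle 3)
```

Once the pipeline is full, one instruction completes every cycle, which is why the steady-state throughput approaches one result per cycle.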
Definitions (instruction pipeline)
❖ Basic definitions associated with instruction pipeline operations:
• Instruction pipeline cycle— the clock period of the instruction pipeline.
• Instruction issue latency— the time (in cycles) required between the issuing of
two adjacent instructions.
• Instruction issue rate— the number of instructions issued per cycle, also called
the degree of a superscalar processor.
• Simple operation latency — Simple operations make up the vast majority of instructions executed by the machine, such as integer adds, loads, stores, branches, and moves. In contrast, complex operations are those requiring an order-of-magnitude longer latency, such as divides and cache misses. These latencies are measured in numbers of cycles.
• Resource conflicts— This refers to the situation where two or more instructions
demand use of the same functional unit at the same time.
Instruction Pipeline

❑ A base scalar processor is defined as a machine with one instruction issued per cycle, a
one-cycle latency for a simple operation, and a one-cycle latency between instruction
issues. The instruction pipeline can be fully utilized if successive instructions can enter it
continuously at the rate of one per cycle. The effective CPI rating is 1 for the ideal pipeline.
❑ If the instruction issue latency is two cycles per instruction, the pipeline can be underutilized. The effective CPI rating is 2.

❑ Another underpipelined situation is one in which the pipeline cycle time is doubled by combining pipeline stages. In this case, the fetch and decode phases are combined into one pipeline stage, and execute and write-back are combined into another stage. This also results in poor pipeline utilization: only one instruction issues per doubled cycle, so the effective CPI is 1 at half the clock rate, i.e. 2 when measured in base cycles.
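The three situations above reduce to simple arithmetic; the sketch below (our own illustration, measuring effective CPI in base clock cycles) makes the comparison explicit:

```python
# Effective CPI in base clock cycles: one instruction issues every
# `issue_latency` pipeline cycles, and each pipeline cycle spans
# `cycle_stretch` base cycles (2 when two stages are merged into one).
def effective_cpi(issue_latency, cycle_stretch=1):
    return issue_latency * cycle_stretch

print(effective_cpi(1))     # ideal base scalar pipeline: CPI = 1
print(effective_cpi(2))     # issue every two cycles:     CPI = 2
print(effective_cpi(1, 2))  # doubled pipeline cycle:     CPI = 2 in base cycles
```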
Data path architecture and control unit of a scalar processor
Data path architecture and control unit of a scalar processor…

• Shown here are the data path architecture and control unit of a typical, simple scalar processor without an instruction pipeline.
• Main memory, I/O controllers, etc. are connected to the external bus.
• The control unit generates the control signals required for the fetch, decode, ALU operation, memory access, and write-result phases of instruction execution.
• The control unit itself may use microcoded logic (CISC) or hardwired logic (RISC).
Processors & Coprocessors
• The central processor of a computer is called the CPU; it may contain:
– a scalar processor
– multiple functional units
– a floating-point accelerator
• The floating-point unit can be a coprocessor:
– attached to the CPU
– executes instructions dispatched by the CPU
– cannot be used alone and cannot handle I/O operations
Architectural Models of a basic Scalar processor…
Instruction Set Architectures
• The instruction set, also called the instruction set architecture (ISA), is the part of a computer that pertains to programming, which is basically its machine language.
• The instruction set defines the primitive commands or machine instructions to the processor.
• Two approaches to ISA: CISC and RISC.
⮚ Characteristics of an instruction set:
• Instruction formats
• Data formats/types
• Addressing modes
• General-purpose registers
• Opcode specifications
• Flow control mechanisms
• Memory architecture
• Interrupt and exception handling
• External I/O
⮚ Examples of instructions:
▪ ADD - add two numbers together.
▪ COMPARE - compare numbers.
▪ IN - input information from a device, e.g., keyboard.
▪ JUMP - jump to a designated RAM address.
▪ LOAD - load information from RAM to the CPU.
▪ OUT - output information to a device, e.g., monitor.
▪ STORE - store information to RAM.
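As a toy illustration of such primitive commands, the sketch below interprets a few of the example instructions. This is not any real ISA; the register names, memory model, and instruction encoding are invented for the example:

```python
# Toy sketch of an instruction-set interpreter (hypothetical ISA).
RAM = {0: 7, 1: 5}        # small data memory
regs = {}                  # register file

def execute(program):
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":           # LOAD r, addr : RAM -> register
            r, addr = args
            regs[r] = RAM[addr]
        elif op == "ADD":          # ADD rd, ra, rb : rd = ra + rb
            rd, ra, rb = args
            regs[rd] = regs[ra] + regs[rb]
        elif op == "STORE":        # STORE r, addr : register -> RAM
            r, addr = args
            RAM[addr] = regs[r]
        elif op == "JUMP":         # JUMP target : transfer control
            pc = args[0]
            continue
        pc += 1

execute([("LOAD", "r1", 0), ("LOAD", "r2", 1),
         ("ADD", "r3", "r1", "r2"), ("STORE", "r3", 2)])
print(RAM[2])   # 12
```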
Complex Instruction Set Computing (CISC):
• HLL statements directly implemented in hardware; more and more functions added into the hardware.
• Instruction set very large and complex.
• Characterized by microprogrammed control with control ROM.
• Typically contains 120 - 350 instructions.
• Variable instruction format (16 - 64 bits).
• A few (8 - 24) general-purpose registers.
• Clock rate 33 - 50 MHz, CPI 2 - 15.
• Lot of memory-based instructions.
• Unified cache design.
• More than a dozen addressing modes.
• Goal: improve execution efficiency.

Reduced Instruction Set Computing (RISC):
• Only 25% of a large instruction set is used frequently, 95% of the time; all rare instructions are moved to software.
• Reduced instruction set.
• Characterized by hardwired control without control ROM.
• Typically contains fewer than 100 instructions.
• Fixed instruction format (32 bits).
• Lot of general-purpose registers (32 - 192).
• Clock rate 50 - 150 MHz, CPI < 1.5.
• Mostly register-based instructions.
• Split data and instruction cache design.
• Only 3 - 5 addressing modes.
• Memory access only by load/store instructions.
CISC vs RISC Architectures
CISC Scalar Processor

❑ A scalar processor executes on scalar data.
⮚ Simple models work with integer instructions using fixed-point operands.
⮚ Complex models work with both integer and floating-point operations.
• Both an integer unit and a floating-point unit may be present in the same CPU.
• Ideally, its performance should be that of an instruction pipeline with one instruction fed per clock cycle.
• In practice, it works in an underpipelined situation due to data dependences, resource conflicts, branch penalties, etc.

❑ Design philosophy - CISC
1. Implement useful instructions in hardware, resulting in shorter program length and lower software overhead.
2. However, this is achieved at the expense of a lower clock rate and higher CPI.
A balance between the two is required!

CISC - Example 1

• Typical CISC architecture with microprogrammed control.
• Instruction set contains 300 instructions with 20 different addressing modes.
• CPU consists of two functional units for the execution of floating-point and integer instructions.
• A unified cache holds both instructions and data.
• 16 GPRs in the instruction unit; the instruction pipeline has six stages.
CISC - Example 2

• Processor implements over 100 instructions using 16 GPRs.
• Separate 4 KB caches for data and instructions, with MMUs present in separate memory units.
• Instruction set supports 18 addressing modes.
• Integer unit has a six-stage instruction pipeline and decodes all instructions.
• Floating-point unit consists of a three-stage pipeline.
General characteristics

• Large number of instructions.
• More options in the addressing modes.
• Lower clock rate.
• High CPI.
• Widely used in the personal computer (PC) industry.
RISC Scalar Processor

• Generic RISC processors are called scalar RISC because they are designed to issue one instruction per cycle.
• RISC processors push some of the less frequently used operations into software.
• RISC processors depend heavily on a good compiler, because complex HLL statements must be converted into primitive low-level instructions, which are few in number.
• RISC processors have a higher clock rate and a lower CPI.
RISC Scalar Processor

Advantages:
• Speed: Since a simplified instruction set allows for a pipelined, superscalar design, RISC processors often achieve 2 to 4 times the performance of CISC processors using comparable semiconductor technology and the same clock rates.
• Simpler hardware: Because the instruction set of a RISC processor is so simple, it uses up much less chip space; extra functions, such as memory management units or floating-point arithmetic units, can also be placed on the same chip. Smaller chips allow a semiconductor manufacturer to place more parts on a single silicon wafer, which can lower the per-chip cost dramatically.
•Shorter Design Cycle: Since RISC processors are simpler than corresponding
CISC processors, they can be designed more quickly, and can take advantage of
other technological developments sooner than corresponding CISC designs,
leading to greater leaps in performance between generations.
General characteristics

• All use 32-bit instructions.
• Instruction set consists of fewer than 100 instructions.
• High clock rate.
• Low CPI.
RISC - Example 1

• SPARC stands for Scalable Processor ARChitecture.
• The SPARC specification allows implementations to scale from processors required in embedded systems to processors used for servers.
• Exceptionally high execution rates (MIPS) and short time-to-market development schedules.
• Scalability is due to the use of a number of register windows.
• The floating-point unit (FPU) is implemented on a separate chip.
RISC - Example 1 (Window Registers)

❖ SPARC runs each procedure with a set of thirty-two 32-bit registers.
• Eight of these registers are global registers shared by all procedures.
• The remaining twenty-four registers are window registers associated with only one procedure.
• The concept of overlapped register windows is the most important feature introduced.
• Each register window is divided into three sections: Ins, Locals, and Outs.
• Locals are addressable only by the owning procedure; Ins and Outs are shared among procedures.
• In registers: arguments are passed to a procedure here.
• Local registers: store any local data.
• Out registers: when calling a procedure, the caller places its arguments in these registers.
RISC - Example 1 (Window Registers)
RISC- Register Window
• At any time, an instruction can access the 8 global registers and a 24-register window.
• A register window comprises a 16-register set, divided into 8 in and 8 local registers, together with the 8 in registers of an adjacent register set, addressable from the current window as its out registers.
• When a procedure is called, the register window shifts by sixteen registers, hiding the old in registers and old local registers and making the old out registers the new in registers.
• The current (active) window into the r registers is given by the current window pointer (CWP) register, which is decremented on each procedure call.
• Window Invalid Mask (WIM): set to 1 for the oldest window. If that window is accessed, a trap occurs, its contents are saved onto the stack, the WIM is rotated by 1 bit, and the next-lowest window becomes the oldest.
• Trap base register: pointer to the trap handler.
• A special register is used to create a 64-bit product in multiple-step multiply instructions.
• Overlapping windows save time in inter-procedure communication and give faster context switching.
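The overlapping-window mechanism can be sketched as follows. This is a simplified model (eight windows, no window-overflow trap or WIM handling, globals omitted), not the actual SPARC implementation:

```python
# Simplified sketch of SPARC-style overlapping register windows.
NWINDOWS = 8
regs = [0] * (16 * NWINDOWS)       # 8 ins + 8 locals per physical window set

def window(w):
    """Registers visible in window w: its outs physically alias the ins of
    the window that a call from w will activate (CWP decrements on call)."""
    w %= NWINDOWS
    ins     = [w * 16 + i for i in range(8)]
    locals_ = [w * 16 + 8 + i for i in range(8)]
    outs    = [((w - 1) % NWINDOWS) * 16 + i for i in range(8)]
    return ins, locals_, outs

cwp = NWINDOWS - 1                  # current window pointer

# Caller passes an argument in its first out register ...
_, _, outs = window(cwp)
regs[outs[0]] = 42

cwp -= 1                            # procedure call: CWP is decremented

# ... and the callee sees it, with no copying, in its first in register.
ins, _, _ = window(cwp)
print(regs[ins[0]])                 # 42
```

Because the caller's outs and the callee's ins are the same physical registers, argument passing needs no memory traffic, which is the point of the overlap.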
RISC - The Floating-Point Unit (FPU)
• The FPU has 32 32-bit (single-precision) floating-point registers, 32 64-bit (double-precision) floating-point registers, and 16 128-bit (quad-precision) floating-point registers.
• Floating-point load/store instructions are used to move data between the FPU and memory.
• The memory address is calculated by the IU.
• Floating-point operate (FPop) instructions perform the floating-point arithmetic operations and comparisons.
RISC – Example 2

• 64-bit RISC processor on a single chip.
• It executes 82 instructions, all of them in a single clock cycle.
• There are nine functional units connected by multiple data paths.
• There are two floating-point units, namely a multiplier unit and an adder unit, both of which can execute concurrently.
• Dual-operation add-and-multiply and subtract-and-multiply instructions, e.g., C = A * S2 + S1.
• A merge register is used by vector integer instructions.
• Graphics unit performs integer operations on 8-, 16-, and 32-bit pixel data types for 3-D drawing.
RISC – Example 2
“Scalar” vs “Superscalar” Processors
• Scalar processors:-
– Execute one instruction per cycle
– One instruction is issued per cycle
– Pipeline throughput: one result per cycle
• Superscalar processors:-
– Multiple instruction pipelines are used
– Multiple instructions are issued per cycle
– Multiple results are generated per cycle
Superscalar Processors
• Designed to exploit instruction-level parallelism in user programs.
• Depends on optimizing compiler
• Amount of parallelism depends on the type of code being executed
• On average, about 2 instructions can be executed in parallel at the instruction level.
• There is little benefit in a processor that can be fed with many more instructions per cycle than this available parallelism.
• Thus, the instruction-issue degree in superscalar processors has been limited to 2 – 5 in practice.
Pipelining in Superscalar Processors
• A superscalar processor of degree m can issue m instructions per cycle
• To fully utilize, at every cycle, there must be m instructions for execution
• Dependence on optimizing compilers is very high
• The figure depicts three instruction pipelines, i.e. issue degree m = 3.
Pipelining in Superscalar Processor

• Superscalar processors were originally developed as an alternative to vector processors.
• They exploit a higher degree of instruction-level parallelism.
• A superscalar processor of degree m can issue m instructions per cycle. The base scalar processor, implemented in either RISC or CISC, has m = 1. Thus m instructions must be executable in parallel.
• In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor. Due to the desire for a higher degree of instruction-level parallelism in programs, the superscalar processor depends more on an optimizing compiler to exploit parallelism.
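In-order multiple issue can be sketched as follows: up to m instructions issue per cycle, but an instruction that reads a register written by an earlier instruction in the same group must wait for the next cycle. The three-number register triples are hypothetical, and only read-after-write dependences are modeled (real issue logic also checks structural and write hazards):

```python
# Minimal sketch of in-order issue grouping in a degree-m superscalar.
# Each instruction is a (dest, src1, src2) register-number triple.
def issue_schedule(instrs, m=2):
    cycles = []
    group, written = [], set()
    for ins in instrs:
        dest, *srcs = ins
        # Close the group if it is full or a source was written in it (RAW).
        if len(group) == m or any(s in written for s in srcs):
            cycles.append(group)
            group, written = [], set()
        group.append(ins)
        written.add(dest)
    if group:
        cycles.append(group)
    return cycles

# r3 = r1 + r2 ; r4 = r3 + r1 (depends on r3) ; r5 = r1 + r2
sched = issue_schedule([(3, 1, 2), (4, 3, 1), (5, 1, 2)], m=2)
print(len(sched))   # 2 issue cycles: {i0} then {i1, i2}
```

The dependence between the first two instructions forces a group break, so even a degree-2 machine needs two cycles here; this is why the compiler's instruction ordering matters so much.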
Superscalar Architecture with IU and FPU
Superscalar architecture for a RISC processor
• Multiple instruction pipelines are used, instruction cache
supplies multiple instructions per fetch.
• However, the actual number of instructions issued to various
functional units may vary in each cycle.
• The number is constrained by data dependences and resource
conflicts among instructions that are simultaneously decoded.
• Multiple functional units are built into the integer unit and
into the floating-point unit.
• Multiple data buses exist among the functional units. In
theory, all functional units can be simultaneously used if
conflicts and dependences do not exist among them during a
given cycle.
Superscalar – Example 1
• A superscalar architecture by IBM.
• Three functional units, namely a branch processor, a fixed-point processor, and a floating-point processor, all of which can operate in parallel.
• The branch processor can facilitate execution of up to five instructions per cycle: 1 branch, 1 fixed-point, 1 condition-register, and 1 floating-point multiply-add instruction (counted as two operations).
• A number of buses of varying widths are provided to support high instruction and data bandwidths.
Representative Super-scalar Processor

• A number of commercially available processors have been implemented with the superscalar architecture.
• The maximum number of instructions issued per cycle ranges from two to five in these superscalar processors.
• Typically, the register files in the IU and FPU each have 32 registers. Most superscalar processors implement both the IU and the FPU on the same chip.
• Besides the register files, reservation stations and reorder buffers can be used to establish instruction-level parallelism. Their purpose is to support instruction lookahead and internal data forwarding, which are needed to schedule multiple instructions simultaneously.
Representative Super-scalar Processors
Very Long Instruction Word (VLIW) Architectures
• Typical VLIW architectures have instruction word lengths of hundreds of bits.
• Built upon two concepts, namely:
1. Superscalar processing
• Multiple functional units work concurrently.
• A common large register file is shared.
2. Horizontal microcoding
• Different fields of the long instruction word carry opcodes to be dispatched to multiple functional units.
• Programs written in conventional short opcodes are converted into VLIW format by compilers through code compaction.
Very Long Instruction Word Format
Example: Very Long Instruction Word Instruction
Very Long Instruction Word (VLIW) Architectures
• Very long instruction word (VLIW) describes a computer processing architecture in which a language compiler or pre-processor breaks program instructions down into basic operations that can be performed by the processor in parallel (that is, at the same time).
• These operations are put into a very long instruction word which the processor can then take apart without further analysis, handing each operation to an appropriate functional unit.
• Multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common large register file. The operations to be simultaneously executed by the functional units are synchronized in a VLIW instruction of, say, 256 or 1024 bits per instruction word, an early example being the Multiflow computer models.
• Different fields of the long instruction word carry the opcodes to be dispatched to different functional units. Programs written in conventional short instruction words (say, 32 bits) must be compacted together to form the VLIW instructions. This code compaction must be done by a compiler which can predict branch outcomes using elaborate heuristics or run-time statistics.
Very Long Instruction Word (VLIW) Architectures
• VLIW is sometimes viewed as the next step beyond the reduced instruction set computing (RISC) architecture, which also works with a limited set of relatively basic instructions and can usually execute more than one instruction at a time (a characteristic referred to as superscalar).
• The main advantage of VLIW processors is that complexity is moved
from the hardware to the software, which means that the hardware can
be smaller, cheaper, and require less power to operate.
• The challenge is to design a compiler or pre-processor that is intelligent
enough to decide how to build the very long instruction words. If dynamic
pre-processing is done as the program is run, performance may be a
concern.
Typical VLIW Architecture

• Multiple functional units are used concurrently.
• All functional units use the same register file.
• A typical instruction format is shown in the figure.
Pipelining in VLIW Architecture

• Each instruction in a VLIW architecture specifies multiple operations.
• The execute stage performs multiple operations.
• Instruction parallelism and data movement in a VLIW architecture are specified at compile time.
• The CPI of a VLIW architecture is lower than that of a superscalar processor.
Pipelining in VLIW Processors
• Each instruction specifies multiple operations.
• VLIW machines behave much like superscalar machines with three differences:
• First, the decoding of VLIW instructions is easier than that of superscalar
instructions.
• Second, the code density of the superscalar machine is better when the available
instruction level parallelism is less than that exploitable by the VLIW machine. This
is because the fixed VLIW format includes bits for non-executable operations,
while the superscalar processor issues only executable instructions.
• Third, a superscalar machine can be object-code-compatible with a large family of
non-parallel machines. On the contrary, a VLIW machine exploiting different
amounts of parallelism would require different instruction sets.
• Instruction parallelism and data movement in a VLIW architecture are completely specified at compile time. Run-time resource scheduling and synchronization are in theory completely eliminated. One can view a VLIW processor as an extreme example of a superscalar processor in which all independent or unrelated operations are already synchronously compacted together in advance.
• The CPI of a VLIW processor can be even lower than that of a superscalar
processor.
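Compile-time code compaction can be sketched as follows. The three functional-unit slots and the greedy in-order packing are our own simplification (a real compactor must also respect data dependences and branch behavior); note how empty slots become NOPs, which illustrates the code-density point above:

```python
# Hedged sketch of code compaction for a VLIW machine with one slot per
# functional unit (here: integer ALU, FP ALU, load/store). Ignores data
# dependences between operations for brevity.
UNITS = ["int", "fp", "mem"]

def compact(ops):
    """ops: (unit, opcode) pairs in program order; greedy packing that
    preserves program order and allows one op per unit per long word."""
    words, word = [], {}
    for unit, op in ops:
        if unit in word:               # slot already taken -> emit the word
            words.append([word.get(u, "NOP") for u in UNITS])
            word = {}
        word[unit] = op
    if word:
        words.append([word.get(u, "NOP") for u in UNITS])
    return words

words = compact([("int", "add"), ("mem", "load"),
                 ("int", "sub"), ("fp", "fmul")])
print(words)   # [['add', 'NOP', 'load'], ['sub', 'fmul', 'NOP']]
```

Four short operations compact into two long words, but two of the six slots are NOPs, showing why VLIW code density suffers when parallelism is scarce.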
VLIW Opportunities
• “Random” parallelism among scalar operations is exploited in VLIW,
instead of regular parallelism in a vector or SIMD machine.
• The efficiency of the machine is entirely dictated by the success, or
“goodness”, of the compiler in planning the operations to be placed in the
same instruction words.
• Different implementations of the same VLIW architecture may not be
binary-compatible with each other, resulting in different latencies.
Superscalar vs VLIW
Vector Processors
• Vector processors have high-level operations that work
on linear arrays of numbers: "vectors"
Properties of Vector Processors

• Each result is independent of the previous result
=> long pipeline; compiler ensures no dependencies
=> high clock rate
• Vector instructions access memory with a known pattern
=> highly interleaved memory
=> memory latency amortized over ~64 elements
=> no (data) caches required! (do use an instruction cache)
• Reduces branches and branch problems in pipelines
• A single vector instruction implies lots of work (~ a whole loop)
=> fewer instruction fetches
Vector Processor
• Vector operations are SIMD (single instruction multiple data)
operations
• Each element is computed by a virtual processor (VP)
• Number of VPs given by vector length
– vector control register
• Vector register file
– Each register is an array of elements
– Size of each register determines maximum vector length
– Vector length register determines vector length for a
particular operation
• Multiple parallel execution units = “lanes” , “pipelines” or “pipes”
Vector Processors
• A vector processor is a coprocessor designed to perform vector computations.
• Vector computations involve instructions with large arrays of operands
– the same operation is performed over an array of operands
• A vector processor may be designed with either of two architectures:
❑ Register to register VP
• Register based instructions
• Shorter Instructions
• Vector register files

❑ Memory to memory VP
• Memory based instructions
• Longer instructions
• Instructions include memory address
Vector Instructions

• Register-based vector instructions appear in most register-to-register vector processors, such as the Cray supercomputers.
• The vector length must be equal in both operands of a binary vector instruction.
• A reduction is an operation on one or two vector operands whose result is a scalar, such as the dot product of two vectors or the maximum of all components in a vector.
• These vector operations are performed by dedicated pipeline units, including functional pipelines and memory-access pipelines.
• Long vectors exceeding the register length n must be segmented to fit in the vector registers, n elements at a time.
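The segmentation of long vectors (often called strip-mining) can be sketched as follows; MAXVL and the loop structure are illustrative, not any particular machine's instruction set:

```python
# Strip-mining sketch: a long vector add is processed in segments that fit
# the hardware vector registers; the vector-length setting handles the
# final short segment.
MAXVL = 64    # hardware vector register length (illustrative)

def vector_add(a, b):
    c = []
    i = 0
    while i < len(a):
        vl = min(MAXVL, len(a) - i)   # set the vector length for this segment
        # one register-to-register vector instruction covers vl elements
        c.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return c

a = list(range(150)); b = list(range(150))
c = vector_add(a, b)
print(len(c), c[149])   # 150 298
```

A 150-element vector thus needs only three vector instructions (64 + 64 + 22 elements) instead of 150 scalar ones, which is the "fewer instruction fetches" advantage noted earlier.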
Vector Instructions
• Register-based instructions:
– Vi : vector register of length n
– Si : scalar register
– M(1 : n) : memory array of length n
• Memory-based instructions
Vector Pipelines

• Scalar pipeline:
– each execute stage operates upon a scalar operand.
• Vector pipeline:
– each execute stage operates upon a vector operand.
Symbolic Processors
❑ Symbolic processors
- Also called Prolog processors, Lisp processors, or symbolic manipulators.
- Deal with logic programs, symbolic lists, objects, scripts, production systems, semantic networks, frames, and artificial neural networks.
❑ Application areas:
- Pattern recognition, expert systems, artificial intelligence, cognitive science, machine learning, text retrieval, theorem proving, knowledge engineering, etc.
❑ Symbolic processors differ from numeric processors in terms of:-
– Data and knowledge representations
– Primitive operations
– Algorithmic behavior
– Memory
– I/O communication
– Special architectural features
Characteristics of Symbolic Processors
Symbolic Processors
• Lisp program is a set of functions in which data are passed from
function to function.
• The concurrent execution of these functions brings parallelism.
• The applicative and recursive nature of Lisp requires an environment
that efficiently supports stack computations and function calling.
• Use of linked lists as basic data structure – implements an automatic
garbage collection mechanism.
• Primitive operations demand a special instruction set with compare, matching, logic, and symbolic manipulation operations.
• Floating-point operations are not used.
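As a flavor of such symbolic primitives, the sketch below matches nested list structures against a pattern containing variables. The '?'-prefix variable convention (common in Lisp/Prolog teaching material) and the matcher itself are invented for illustration:

```python
# Toy symbolic pattern matcher over nested lists (no floating point needed).
# Symbols starting with '?' are pattern variables.
def match(pattern, datum, bindings=None):
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:                    # variable already bound:
            return bindings if bindings[pattern] == datum else None
        bindings[pattern] = datum                  # bind the variable
        return bindings
    if isinstance(pattern, list) and isinstance(datum, list):
        if len(pattern) != len(datum):
            return None
        for p, d in zip(pattern, datum):           # match element-wise
            bindings = match(p, d, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == datum else None  # constants must be equal

print(match(["parent", "?x", "tom"], ["parent", "bob", "tom"]))
# {'?x': 'bob'}
```

The work here is all comparison, list traversal, and binding, which is exactly the mix of operations the special symbolic instruction sets are designed to accelerate.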
Symbolic Processor - Example
• Symbolic Lisp processor.
• Layered architecture: simplified instruction set, stack-oriented machine.
• Multiple processing units are provided, which can work in parallel.
• Operands are fetched from the stack, stack buffer, or scratch-pad memory.
• The top of the stack is duplicated in scratch-pad memory.
• The processor executes most Lisp instructions in a single machine cycle.