
Computer Organization and Architecture

(CS304PC)

D.Koteshwar Rao
Assistant Professor,
Department of ECE
Unit-5

Reduced Instruction Set Computer
CISC Characteristics
 The essential goal of a CISC architecture is to attempt to
provide a single machine instruction for each statement that
is written in a high-level language.
 Another characteristic of CISC architecture is the
incorporation of variable-length instruction formats.
 The instructions in a typical CISC processor provide direct
manipulation of operands residing in memory.
CISC Characteristics
 The major characteristics of CISC architecture are:
1. A large number of instructions, typically from 100 to 250
2. Some instructions that perform specialized tasks and are
used infrequently
3. A large variety of addressing modes, typically from 5 to 20 different modes
4. Variable-length instruction formats
5. Instructions that manipulate operands in memory
RISC Characteristics
 The concept of RISC architecture involves an attempt to
reduce execution time by simplifying the instruction set of
the computer.
 The major characteristics of a RISC processor are:
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
RISC Characteristics
 The small set of instructions of a typical RISC processor
consists mostly of register-to-register operations, with only
simple load and store operations for memory access. Thus
each operand is brought into a processor register with a load
instruction.
 All computations are done among the data stored in
processor registers. Results are transferred to memory by
means of store instructions.
 By using a relatively simple instruction format, the
instruction length can be fixed and aligned on word
boundaries.
RISC Characteristics
 An important aspect of RISC instruction format is that it is
easy to decode. Thus the operation code and register fields of
the instruction code can be accessed simultaneously by the
control.
 A characteristic of RISC processors is their ability to execute
one instruction per clock cycle.
 This is done by overlapping the fetch, decode, and execute
phases of two or three instructions by using a procedure
referred to as pipelining.
 A load or store instruction may require two clock cycles
because access to memory takes more time than register
operations. Efficient pipelining, as well as a few other characteristics, is sometimes attributed to RISC.
RISC Characteristics
 Other characteristics attributed to RISC architecture, although they may exist in non-RISC architectures as well, are:
1. A relatively large number of registers in the processor
unit
2. Use of overlapped register windows to speed-up
procedure call and return
3. Efficient instruction pipeline
4. Compiler support for efficient translation of high-level
language programs into machine language programs
Unit-5

Pipeline and Vector Processing
Parallel Processing
 Parallel processing is a technique used to provide
simultaneous data-processing tasks to increase the
computational speed of a computer system.
 Instead of processing each instruction sequentially as in a
conventional computer, a parallel processing system is able
to perform concurrent data processing to achieve faster
execution time.
 The purpose of parallel processing is to speed up the
computer processing capability and increase its throughput,
that is, the amount of processing that can be accomplished
during a given interval of time.
Parallel Processing
 The amount of hardware increases with parallel processing
and with it, the cost of the system increases.
 Parallel processing can be viewed from various levels of
complexity.
 At the lowest level, parallel and serial operations can be
distinguished by the type of registers used. Shift registers
operate in serial fashion one bit at a time, while registers
with parallel load operate with all the bits of the word
simultaneously.
 Parallel processing at a higher level of complexity can be
achieved by having a multiplicity of functional units that
perform identical or different operations simultaneously.
Parallel Processing
 Parallel processing is established by distributing the data
among the multiple functional units.
 Figure shows one possible way of separating the execution
unit into eight functional units operating in parallel. The
operands in the registers are applied to one of the units
depending on the operation specified by the instruction
associated with the operands.
 There are a variety of ways that parallel processing can be
classified. The normal operation of a computer is to fetch
instructions from memory and execute them in the
processor.
Parallel Processing
Parallel Processing
 The sequence of instructions read from memory constitutes
an instruction stream. The operations performed on the data
in the processor constitutes a data stream.
 Parallel processing may occur in the instruction stream, in
the data stream or in both. Flynn's classification divides
computers into four major groups as follows:
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)
Parallel Processing
 SISD represents the organization of a single computer
containing a control unit, a processor unit, and a memory
unit. Instructions are executed sequentially and the system
may or may not have internal parallel processing
capabilities. Parallel processing in this case may be
achieved by means of multiple functional units or by
pipeline processing.
 SIMD represents an organization that includes many
processing units under the supervision of a common control
unit. All processors receive the same instruction from the
control unit but operate on different items of data.
Parallel Processing
 MISD structure is only of theoretical interest since no
practical system has been constructed using this
organization.
 MIMD organization refers to a computer system capable of
processing several programs at the same time.
 In this chapter we consider parallel processing under the
following main topics:
 1. Pipeline processing
 2. Vector processing
 3. Array processors
Pipelining
 Pipelining is a technique of decomposing a sequential
process into sub operations, with each sub process being
executed in a special dedicated segment that operates
concurrently with all other segments.
 The result obtained from the computation in each segment
is transferred to the next segment in the pipeline. The final
result is obtained after the data have passed through all
segments.
 The name "pipeline" implies a flow of information
analogous to an industrial assembly line.
 The registers provide isolation between each segment so
that each can operate on distinct data simultaneously.
Pipelining
 Each segment consists of an input register followed by a
combinational circuit. The register holds the data and the
combinational circuit performs the sub operation in the
particular segment.
 The output of the combinational circuit in a given segment
is applied to the input register of the next segment.
 A clock is applied to all registers after enough time has
elapsed to perform all segment activity. In this way the
information flows through the pipeline one step at a time.
Pipelining
 Example: Ai*Bi+Ci for i = 1, 2, 3, . . . , 7
 Each sub operation is to be implemented in a segment
within a pipeline. Each segment has one or two registers
and a combinational circuit as shown in Figure.
 R1 through R5 are registers that receive new data with every
clock pulse. The multiplier and adder are combinational
circuits. The sub operations performed in each segment of
the pipeline are as follows:
Pipelining
Pipelining
 The five registers are loaded with new data for every clock
pulse. The effect of each clock is shown in Table
Pipelining
 The first clock pulse transfers A1 and B1 into R1 and R2.
The second clock pulse transfers the product of R1 and R2
into R3 and C1 into R4. The same clock pulse transfers A2
and B2 into R1 and R2.
 The third clock pulse operates on all three segments
simultaneously. It places A3 and B3 into R1 and R2,
transfers the product of R1 and R2 into R3, transfers C2 into
R4, and places the sum of R3 and R4 into R5.
 It takes three clock pulses to fill up the pipe and retrieve the
first output from R5.
Pipelining
 From there on, each clock produces a new output and
moves the data one step down the pipeline. This happens as
long as new input data flow into the system.
 When no more input data are available, the clock must
continue until the last output emerges out of the pipeline
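 A minimal Python sketch of this three-segment pipeline (not from the original slides; the A, B, and C values are invented):

A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [1, 1, 1, 1, 1, 1, 1]

R1 = R2 = R3 = R4 = R5 = None
for clock in range(len(A) + 2):              # 3 clocks to fill, then one result per clock
    # All registers are clocked simultaneously, so next values are computed first.
    nR5 = R3 + R4 if R3 is not None else None              # segment 3: R5 <- R3 + R4
    nR3 = R1 * R2 if R1 is not None else None              # segment 2: R3 <- R1 * R2
    nR4 = C[clock - 1] if 1 <= clock <= len(C) else None   # segment 2: R4 <- Ci
    nR1 = A[clock] if clock < len(A) else None             # segment 1: R1 <- Ai
    nR2 = B[clock] if clock < len(B) else None             # segment 1: R2 <- Bi
    R1, R2, R3, R4, R5 = nR1, nR2, nR3, nR4, nR5
    if R5 is not None:
        print("clock", clock + 1, ":", R5)   # first output appears at clock 3

 As in the table, the first Ai*Bi + Ci result emerges at the third clock pulse and one new result follows on every pulse thereafter.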
Pipelining
General Considerations
 The general structure of a four-segment pipeline is shown in Fig.

 The operands pass through all four segments in a fixed sequence.
 Each segment consists of a combinational circuit Si that performs a sub operation over the data stream flowing through the pipe.
Pipelining
General Considerations
 The segments are separated by registers Ri that hold the
intermediate results between the stages.
 Information flows between adjacent stages under the control
of a common clock applied to all the registers
simultaneously.
 Task is the total operation performed going through all the
segments in the pipeline.
 The behavior of a pipeline can be illustrated with a space-
time diagram. This is a diagram that shows the segment
utilization as a function of time.
Pipelining
General Considerations
 The space-time diagram of a four-segment pipeline is
demonstrated in Figure.

 The horizontal axis displays the time in clock cycles and the
vertical axis gives the segment number.
Pipelining
General Considerations
 The diagram shows six tasks T1 through T6 executed in four segments. Initially, task T1 is handled by segment 1.
 After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
 Continuing in this manner, the first task T1 is completed after the fourth clock cycle. From then on, the pipe completes a task every clock cycle. Once the pipeline is full, it takes only one clock period to obtain an output.
Pipelining
General Considerations
 Consider the case where a k-segment pipeline with a clock
cycle time tp is used to execute n tasks. The first task T1
requires a time equal to ktp to complete its operation since
there are k segments in the pipe.
 The remaining n - 1 tasks emerge from the pipe at the rate
of one task per clock cycle and they will be completed after
a time equal to (n - 1)tp.
 Therefore, to complete n tasks using a k-segment pipeline
requires k + (n - 1) clock cycles.
Pipelining
General Considerations
 Consider a non pipeline unit that performs the same
operation and takes a time equal to tn to complete each task.
The total time required for n tasks is ntn.
 The speedup of a pipeline processing over an equivalent non pipeline processing is defined by the ratio
S = ntn / (k + n - 1)tp
Pipelining
General Considerations
 As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the value of n. Under this condition, the speedup becomes
S = tn / tp
 If we assume that the time it takes to process a task is the same in the pipeline and non pipeline circuits, we will have tn = ktp. Including this assumption, the speedup reduces to
S = ktp / tp = k
where k is the number of segments in the pipeline.
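 A quick numeric check of these formulas in Python (the segment count, clock time, and task count below are invented for illustration):

k, tp, n = 4, 20, 100                     # 4 segments, 20 ns clock, 100 tasks
tn = k * tp                               # equivalent nonpipelined task time, tn = ktp
pipeline_time = (k + n - 1) * tp          # 103 * 20 = 2060 ns
nonpipeline_time = n * tn                 # 100 * 80 = 8000 ns
print(nonpipeline_time / pipeline_time)   # speedup = 3.88..., approaching k = 4 as n grows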


Pipelining
General Considerations
 There are two areas of computer design where the pipeline
organization is applicable.
 An arithmetic pipeline divides an arithmetic operation into
sub operations for execution in the pipeline segments.
 An instruction pipeline operates on a stream of instructions
by overlapping the fetch, decode, and execute phases of the
instruction cycle.
Arithmetic Pipeline
 Pipeline arithmetic units are usually found in very high
speed computers.
 They are used to implement floating-point operations,
multiplication of fixed-point numbers, and similar
computations encountered in scientific problems.
 Floating-point operations are easily decomposed into sub
operations.
 An example of a pipeline unit for floating-point addition and subtraction is shown. The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers:
X = A × 2^a
Y = B × 2^b
Arithmetic Pipeline
 A and B are two fractions that represent the mantissas and a
and b are the exponents.
 The floating-point addition and subtraction can be
performed in four segments, as shown in Figure.
 The registers labeled R are placed between the segments to
store intermediate results. The sub operations that are
performed in the four segments are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
Arithmetic Pipeline
Arithmetic Pipeline
 The exponents are compared by subtracting them to
determine their difference. The larger exponent is chosen as
the exponent of the result. The exponent difference
determines how many times the mantissa associated with
the smaller exponent must be shifted to the right.
 This produces an alignment of the two mantissas. It should
be noted that the shift must be designed as a combinational
circuit to reduce the shift time.
 The two mantissas are added or subtracted in segment 3.
The result is normalized in segment 4.
Arithmetic Pipeline
 When an overflow occurs, the mantissa of the sum or
difference is shifted right and the exponent incremented by
one.
 If an underflow occurs, the number of leading zeros in the
mantissa determines the number of left shifts in the
mantissa and the number that must be subtracted from the
exponent.
Arithmetic Pipeline
Example:
 Consider the two normalized floating-point numbers:
X = 0.9504 X 103
Y = 0.8200 X 102
 The two exponents are subtracted in the first segment to obtain 3 - 2 = 1.
 The larger exponent 3 is chosen as the exponent of the
result. The next segment shifts the mantissa of Y to the
right to obtain
X = 0.9504 × 10^3
Y = 0.0820 × 10^3
Arithmetic Pipeline
Example:
 This aligns the two mantissas under the same exponent. The addition of the two mantissas in segment 3 produces the sum
Z = 1.0324 × 10^3
 The sum is adjusted by normalizing the result so that it has a fraction with a nonzero first digit. This is done by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum
Z = 0.10324 × 10^4
 The comparator, shifter, adder-subtractor, incrementer, and
decrementer in the floating-point pipeline are implemented
with combinational circuits.
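 A minimal Python sketch of the four suboperations on the decimal example above (a behavioral illustration, not a hardware model; expect small float-rounding noise):

def fp_add(ma, ea, mb, eb):
    # Segment 1: compare exponents; Segment 2: align the smaller mantissa.
    if ea >= eb:
        mb, e = mb / 10 ** (ea - eb), ea
    else:
        ma, e = ma / 10 ** (eb - ea), eb
    m = ma + mb                          # Segment 3: add the mantissas
    while abs(m) >= 1.0:                 # Segment 4: normalize on overflow
        m, e = m / 10, e + 1
    return m, e

print(fp_add(0.9504, 3, 0.8200, 2))      # -> (0.10324, 4), i.e. Z = 0.10324 × 10^4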
Instruction Pipeline
 Pipeline processing can occur not only in the data stream
but in the instruction stream as well.
 An instruction pipeline reads consecutive instructions from
memory while previous instructions are being executed in
other segments.
 This causes the instruction fetch and execute phases to
overlap and perform simultaneous operations.
Instruction Pipeline
 Consider a computer with an instruction fetch unit and an
instruction execution unit designed to provide a two-
segment pipeline.
 The instruction fetch segment can be implemented by
means of a first-in, first-out (FIFO) buffer. This is a type of
unit that forms a queue rather than a stack.
 Whenever the execution unit is not using memory, the
control increments the program counter and uses its address
value to read consecutive instructions from memory.
 The instructions are inserted into the FIFO buffer so that
they can be executed on a first-in, first-out basis.
Instruction Pipeline
 Thus an instruction stream can be placed in a queue, waiting
for decoding and processing by the execution segment.
 The instruction stream queuing mechanism provides an
efficient way for reducing the average access time to
memory for reading instructions.
 Whenever there is space in the FIFO buffer, the control unit
initiates the next instruction fetch phase. The buffer acts as
a queue from which control then extracts the instructions for
the execution unit.
Instruction Pipeline
 In the most general case, the computer needs to process
each instruction with the following sequence of steps.
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
Instruction Pipeline
Example: Four-Segment Instruction Pipeline
 Figure shows the instruction cycle with a four-segment
pipeline.
 While an instruction is being executed in segment 4, the next
instruction in sequence is busy fetching an operand from
memory in segment 3.
 The effective address may be calculated in a separate
arithmetic circuit for the third instruction and whenever the
memory is available, the fourth and all subsequent
instructions can be fetched and placed in an instruction FIFO.
 Thus up to four sub operations in the instruction cycle can
overlap and up to four different instructions can be in
progress of being processed at the same time.
Instruction Pipeline
Instruction Pipeline
 Figure shows the operation of the instruction pipeline. The time
in the horizontal axis is divided into steps of equal duration.
 The four segments are represented in the diagram with an
abbreviated symbol.
1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates
the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.
 It is assumed that the processor has separate instruction and data
memories so that the operation in FI and FO can proceed at the
same time.
Instruction Pipeline

Fig: Timing of Instruction pipeline


Instruction Pipeline
 In the absence of a branch instruction, each segment
operates on different instructions.
 Thus, in step 4, instruction 1 is being executed in segment
EX; the operand for instruction 2 is being fetched in
segment FO; instruction 3 is being decoded in segment DA;
and instruction 4 is being fetched from memory in segment
FI.
 Assume now that instruction 3 is a branch instruction. As
soon as this instruction is decoded in segment DA in step 4,
the transfer from FI to DA of the other instructions is halted
until the branch instruction is executed in step 6.
Instruction Pipeline
 If the branch is taken, a new instruction is fetched in step 7.
 If the branch is not taken, the instruction fetched previously
in step 4 can be used. The pipeline then continues until a
new branch instruction is encountered.
 Another delay may occur in the pipeline if the EX segment
needs to store the result of the operation in the data memory
while the FO segment needs to fetch an operand.
 In that case, segment FO must wait until segment EX has
finished its operation.
RISC Pipeline
 The data transfer instructions in RISC are limited to load
and store instructions. These instructions use register
indirect addressing. They usually need three or four stages
in the pipeline.
 To prevent conflicts between a memory access to fetch an
instruction and to load or store an operand, most RISC
machines use two separate buses with two memories: one
for storing the instructions and the other for storing the data.
 One of the major advantages of RISC is its ability to
execute instructions at the rate of one per clock cycle.
 It is not possible to expect that every instruction be fetched
from memory and executed in one clock cycle.
RISC Pipeline
 The advantage of RISC over CISC is that RISC can
achieve pipeline segments, requiring just one clock cycle,
while CISC uses many segments in its pipeline, with the
longest segment requiring two or more clock cycles.
 Another characteristic of RISC is the support given by the
compiler that translates the high-level language program
into machine language program.
RISC Pipeline
Example: Three-Segment Instruction Pipeline
 The control section fetches the instruction from program
memory into an instruction register. The instruction is
decoded at the same time that the registers needed for the
execution of the instruction are selected.
 The processor unit consists of a number of registers and an
arithmetic logic unit (ALU) that performs the necessary
arithmetic, logic, and shift operations.
 A data memory is used to load or store the data from a
selected register in the register file.
RISC Pipeline
Example: Three-Segment Instruction Pipeline
 The instruction cycle can be divided into three sub
operations and implemented in three segments:
I: Instruction fetch
A: ALU operation
E: Execute instruction
 The I segment fetches the instruction from program
memory. The instruction is decoded and an ALU operation
is performed in the A segment. The ALU is used for three
different functions, depending on the decoded instruction.
RISC Pipeline
Example: Three-Segment Instruction Pipeline
 It performs an operation for a data manipulation instruction,
it evaluates the effective address for a load or store
instruction, or it calculates the branch address for a program
control instruction.
 The E segment directs the output of the ALU to one of three
destinations, depending on the decoded instruction.
 It transfers the result of the ALU operation into a
destination register in the register file, it transfers the
effective address to a data memory for loading or storing, or
it transfers the branch address to the program counter.
RISC Pipeline
Delayed Load
 Consider now the operation of the following four instructions:
1. LOAD: R1 ← M[address 1]
2. LOAD: R2 ← M[address 2]
3. ADD: R3 ← R1 + R2
4. STORE: M[address 3] ← R3
 If the three-segment pipeline proceeds without interruptions, there will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment. This can be seen from the timing of the pipeline shown in Fig. (a).
RISC Pipeline
RISC Pipeline
Delayed Load
 The E segment in clock cycle 4 is in a process of placing the
memory data into R2. The A segment in clock cycle 4 is using
the data from R2, but the value in R2 will not be the correct
value since it has not yet been transferred from memory.
 It is up to the compiler to make sure that the instruction
following the load instruction uses the data fetched from
memory.
 If the compiler cannot find a useful instruction to put after the
load, it inserts a no-op (no-operation) instruction. This is a type
of instruction that is fetched from memory but has no operation,
thus wasting a clock cycle. This concept of delaying the use of
the data loaded from memory is referred to as delayed load.
RISC Pipeline
Delayed Load
 Fig. (b) shows the same program with a no-op instruction
inserted after the load to R2 instruction.
 The data is loaded into R2 in clock cycle 4. The add
instruction uses the value of R2 in step 5.
 Thus the no-op instruction is used to advance one clock
cycle in order to compensate for the data conflict in the
pipeline.
 The advantage of the delayed load approach is that the data
dependency is taken care of by the compiler rather than the
hardware.
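 A sketch of such a compiler pass in Python (the encoding of instructions as (op, dest, sources) tuples is an assumption made for illustration):

def insert_delay_slots(program):
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        nxt = program[i + 1] if i + 1 < len(program) else None
        if op == "LOAD" and nxt is not None and dest in nxt[2]:
            out.append(("NOP", None, ()))    # no useful instruction found: insert a no-op
    return out

prog = [("LOAD", "R1", ()),                  # R1 <- M[address 1]
        ("LOAD", "R2", ()),                  # R2 <- M[address 2]
        ("ADD", "R3", ("R1", "R2")),         # conflict: uses R2 right after its load
        ("STORE", None, ("R3",))]
print(insert_delay_slots(prog))              # a NOP appears after the second load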
RISC Pipeline
Delayed Branch
 A branch instruction delays the pipeline operation until the
instruction at the branch address is fetched.
 The method used in most RISC processors is to rely on the
compiler to redefine the branches so that they take effect at
the proper time in the pipeline. This method is referred to as
delayed branch.
 The compiler for a processor that uses delayed branches is
designed to analyze the instructions before and after the
branch and rearrange the program sequence by inserting
useful instructions in the delay steps.
RISC Pipeline
Delayed Branch
 An example of delayed branch is shown in Fig. The
program for this example consists of five instructions:
Load from memory to R1
Increment R2
Add R3 to R4
Subtract R5 from R6
Branch to address X
RISC Pipeline
Delayed Branch
 In Fig. (a) the compiler inserts two no-op instructions after
the branch. The branch address X is transferred to PC in
clock cycle 7.
 The fetching of the instruction at X is delayed by two clock
cycles by the no-op instructions.
 The instruction at X starts the fetch phase at clock cycle 8
after the program counter PC has been updated.
RISC Pipeline
RISC Pipeline
Delayed Branch
 The program in Fig. (b) is rearranged by placing the add
and subtract instructions after the branch instruction instead
of before as in the original program.
 Inspection of the pipeline timing shows that PC is updated
to the value of X in clock cycle 5, but the add and subtract
instructions are fetched from memory and executed in the
proper sequence.
 In other words, if the load instruction is at address 101 and
X is equal to 350, the branch instruction is fetched from
address 103. The add instruction is fetched from address
104 and executed in clock cycle 6.
RISC Pipeline
Delayed Branch
 The subtract instruction is fetched from address 105 and
executed in clock cycle 7.
 Since the value of X is transferred to PC with clock cycle 5
in the E segment, the instruction fetched from memory at
clock cycle 6 is from address 350, which is the instruction at
the branch address.
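 The rearrangement can be shown compactly in Python (mnemonics are illustrative):

with_noops = ["Load R1", "Increment R2", "Add R3 to R4",
              "Subtract R5 from R6", "Branch to X", "NOP", "NOP"]
rearranged = ["Load R1", "Increment R2", "Branch to X",
              "Add R3 to R4", "Subtract R5 from R6"]
# The add and subtract occupy the two delay slots after the branch, so they
# execute while PC is updated to X and no clock cycles are wasted on no-ops.
print(len(with_noops) - len(rearranged), "cycles saved")   # 2 cycles saved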
Vector Processing
 Computers with vector processing capabilities are in
demand in specialized applications. The following are
representative application areas where vector processing is
of the utmost importance.
Long-range weather forecasting
Petroleum explorations
Seismic data analysis
Medical diagnosis
Aerodynamics and space flight simulations
Artificial intelligence and expert systems
Mapping the human genome
Image processing
Vector Processing
 Without sophisticated computers, many of the required
computations cannot be completed within a reasonable
amount of time.
 To achieve the required level of high performance it is
necessary to utilize the fastest and most reliable hardware
and apply innovative procedures from vector and parallel
processing techniques.
Vector Processing
Vector Operations
 Many scientific problems require arithmetic operations on
large arrays of numbers. These numbers are usually
formulated as vectors and matrices of floating-point
numbers.
 A vector is an ordered set of a one-dimensional array of
data items. A vector V of length n is represented as a row
vector by
V = [V1 V2 V3 · · · Vn]
 It may be represented as a column vector if the data items are listed in a column.
Vector Processing
Vector Operations
 A conventional sequential computer is capable of processing
operands one at a time.
 Consequently, operations on vectors must be broken down into
single computations with subscripted variables.
 The element Vi of vector V is written as V(I) and the index I
refers to a memory address or register where the number is
stored.
 To examine the difference between a conventional scalar
processor and a vector processor, consider the following
Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Vector Processing
Vector Operations
 This is a program for adding two vectors A and B of length
100 to produce a vector C. This is implemented in machine
language by the following sequence of operations.
Initialize I = 0
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I ≤ 100 go to 20
Continue
Vector Processing
Vector Operations
 This constitutes a program loop that reads a pair of
operands from arrays A and B and performs a floating-point
addition. The loop control variable is then updated and the
steps repeat 100 times.
 A computer capable of vector processing eliminates the
overhead associated with the time it takes to fetch and
execute the instructions in the program loop.
 It allows operations to be specified with a single vector
instruction of the form
C(1 : 100) = A(1 : 100) + B(1 : 100)
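 The contrast can be sketched in Python (the list comprehension stands in for the single vector instruction; the array contents are invented):

A = list(range(100))
B = list(range(100))

# Scalar processor: one addition per trip around the loop, plus loop overhead.
C = [0] * 100
for i in range(100):
    C[i] = A[i] + B[i]

# Vector processor: the whole operation specified at once, C(1:100) = A(1:100) + B(1:100).
C = [a + b for a, b in zip(A, B)]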
Vector Processing
Vector Operations
 The vector instruction includes the initial address of the
operands, the length of the vectors and the operation to be
performed, all in one composite instruction.
 A possible instruction format for a vector instruction is
shown in Fig.
Vector Processing
Vector Operations
 This is essentially a three-address instruction with three
fields specifying the base address of the operands and an
additional field that gives the length of the data items in the
vectors.
 This assumes that the vector operands reside in memory. It
is also possible to design the processor with a large number
of registers and store all operands in registers prior to the
addition operation.
 In that case the base address and length in the vector
instruction specify a group of CPU registers.
Vector Processing
Matrix Multiplication
 Matrix multiplication is one of the most computationally intensive operations performed in computers with vector processors.
 The multiplication of two n x n matrices consists of n^2 inner products or n^3 multiply-add operations.
 An n x m matrix of numbers has n rows and m columns and
may be considered as constituting a set of n row vectors or a
set of m column vectors.
Vector Processing
Matrix Multiplication
 Consider, for example, the multiplication of two 3 x 3 matrices A and B.
 The product matrix C is a 3 x 3 matrix whose elements are related to the elements of A and B by the inner product
cij = Σ (k = 1 to 3) aik bkj
 For example, the number in the first row and first column of matrix C is calculated by letting i = 1, j = 1, to obtain
c11 = a11 b11 + a12 b21 + a13 b31
Vector Processing
Matrix Multiplication
 This requires three multiplications and (after initializing C11
to 0) three additions.
 The total number of multiplications or additions required to
compute the matrix product is 9 x 3 = 27.
 If we consider the linked multiply-add operation c + a x b as a cumulative operation, the product of two n x n matrices requires n^3 multiply-add operations.
 The computation consists of n^2 inner products, with each inner product requiring n multiply-add operations, assuming that c is initialized to zero before computing each element in the product matrix.
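 A Python sketch that counts the multiply-add operations for the 3 x 3 case (the matrix values are invented):

n = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0] * n for _ in range(n)]
ops = 0
for i in range(n):
    for j in range(n):
        for k in range(n):               # one inner product step: c <- c + a * b
            C[i][j] += A[i][k] * B[k][j]
            ops += 1
print(ops)                               # 27 = n^3 multiply-add operations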
Vector Processing
Matrix Multiplication
 In general, the inner product consists of the sum of k product terms of the form
C = A1B1 + A2B2 + A3B3 + · · · + AkBk
 In a typical application k may be equal to 100 or even 1000.
 The inner product calculation on a pipeline vector processor is shown in Figure.
 The output of the adder is 0 for the first eight cycles until both pipes are full. Ai and Bi pairs are brought in and multiplied at a rate of one pair per cycle.
 After the first four cycles, the products begin to be added to the output of the adder. During the next four cycles 0 is added to the products entering the adder pipeline.
 At the end of the eighth cycle, the first four products A1B1 through A4B4 are in the four adder segments, and the next four products, A5B5 through A8B8, are in the multiplier segments.
 At the beginning of the ninth cycle, the output of the adder is A1B1 and the output of the multiplier is A5B5. Thus the ninth cycle starts the addition A1B1 + A5B5 in the adder pipeline. The tenth cycle starts the addition A2B2 + A6B6, and so on. This pattern breaks down the summation into four sections as follows:
C = A1B1 + A5B5 + A9B9 + A13B13 + · · ·
  + A2B2 + A6B6 + A10B10 + A14B14 + · · ·
  + A3B3 + A7B7 + A11B11 + A15B15 + · · ·
  + A4B4 + A8B8 + A12B12 + A16B16 + · · ·

Figure: Pipeline for calculating an inner product
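 The four interleaved partial sums can be mimicked in Python (a behavioral sketch with invented operands, not a cycle-accurate pipeline model):

k = 16
A = list(range(1, k + 1))
B = list(range(1, k + 1))
sections = [0, 0, 0, 0]
for i in range(k):
    sections[i % 4] += A[i] * B[i]       # section j accumulates products j, j+4, j+8, ...
C = sum(sections)                        # a final pass adds the four partial sums
print(C == sum(a * b for a, b in zip(A, B)))   # True: same inner product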


Memory Interleaving
 Arithmetic pipelines usually require two or more operands to enter the pipeline at the same time. Instead of using two memory buses for simultaneous access, the memory can be partitioned into a number of modules connected to a common memory address bus and data bus.
 A memory module is a memory array together with its own address and data registers. Figure shows a memory unit with four modules.
 Each memory array has its own address register AR and data register DR. The address registers receive information from a common address bus and the data registers communicate with a bidirectional data bus. Each module can operate independent of the state of the other modules.
 The advantage of a modular memory is that it allows the use of a technique called interleaving.
 In an interleaved memory, different sets of addresses are assigned to different memory modules.

Fig: Multiple module memory organization
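 A sketch of the interleaved address assignment in Python (four modules assumed, as in the figure):

def module_of(address, modules=4):
    return address % modules             # low-order address bits select the module

print([module_of(a) for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
# Consecutive addresses fall in different modules, so successive pipeline
# operands can be fetched from separate modules at the same time.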


Array Processors
 An array processor is a processor that performs computations on large arrays of data. There are two types of array processors.
 An attached array processor is an auxiliary processor attached
to a general-purpose computer. It is intended to improve the
performance of the host computer in specific numerical
computation tasks.
 An SIMD array processor is a processor that has a single-
instruction multiple-data organization. It manipulates vector
instructions by means of multiple functional units responding
to a common instruction.
 Although both types of array processors manipulate vectors,
their internal organization is different.
Array Processors
Attached Array Processor
 An attached array processor is designed as a peripheral for a
conventional host computer and its purpose is to enhance
the performance of the computer by providing vector
processing for complex scientific applications.
 It achieves high performance by means of parallel
processing with multiple functional units. It includes an
arithmetic unit containing one or more pipelined floating
point adders and multipliers.
 The array processor can be programmed by the user to
accommodate a variety of complex arithmetic problems.
Array Processors
Attached Array Processor
 Figure shows the interconnection of an attached array
processor to a host computer.
 The host computer is a general-purpose commercial
computer and the attached processor is a back-end machine
driven by the host computer.
Array Processors
Attached Array Processor
 The array processor is connected through an input-output
controller to the computer and the computer treats it like an
external interface.
 The data for the attached processor are transferred from
main memory to a local memory through a high-speed bus.
 The general-purpose computer without the attached
processor serves the users that need conventional data
processing.
 The system with the attached processor satisfies the needs
for complex arithmetic applications.
Array Processors
SIMD Array Processor
 SIMD array processor is a computer with multiple
processing units operating in parallel.
 The processing units are synchronized to perform the same
operation under the control of a common control unit, thus
providing a single instruction stream, multiple data stream
(SIMD) organization.
 A general block diagram of an array processor is shown in Figure. It contains a set of identical processing elements (PEs), each having a local memory M. Each processor element includes an ALU, a floating-point arithmetic unit and working registers.
Array Processors
SIMD Array Processor
 The master control unit controls the operations in the
processor elements. The main memory is used for storage of
the program.
Array Processors
SIMD Array Processor
 The function of the master control unit is to decode the
instructions and determine how the instruction is to be
executed.
 Scalar and program control instructions are directly
executed within the master control unit.
 Vector instructions are broadcast to all PEs simultaneously.
Each PE uses operands stored in its local memory.
 Vector operands are distributed to the local memories prior
to the parallel execution of the instruction.
Array Processors
SIMD Array Processor
 For example, consider the vector addition C = A + B. The master control unit first stores the ith components ai and bi of A and B in local memory Mi for i = 1, 2, 3, . . . , n.
 It then broadcasts the floating-point add instruction ci = ai +
bi to all PEs, causing the addition to take place
simultaneously.
 The components of ci are stored in fixed locations in each
local memory. This produces the desired vector sum in one
add cycle.
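 A behavioral Python sketch of this broadcast add (the PEs are modeled as a list of local memories; the operand values are invented):

local = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4], [10, 20, 30, 40])]

def broadcast_add(pes):                  # one instruction, executed by all PEs in lockstep
    for mem in pes:
        mem["c"] = mem["a"] + mem["b"]   # each PE uses only its own local memory

broadcast_add(local)
print([mem["c"] for mem in local])       # [11, 22, 33, 44] in one add cycle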
Array Processors
SIMD Array Processor
 Masking schemes are used to control the status of each PE
during the execution of vector instructions.
 Each PE has a flag that is set when the PE is active and reset
when the PE is inactive.
 This ensures that only those PEs that need to participate are
active during the execution of the instruction.
Unit-5

Multiprocessors
Characteristics of Multiprocessors
 A multiprocessor system is an interconnection of two or
more CPUs with memory and input-output equipment.
 The term "processor" in multiprocessor can mean either a central processing unit (CPU) or an input-output processor (IOP).
 However, a system with a single CPU and one or more IOPs is usually not included in the definition of a multiprocessor system unless the IOP has computational facilities comparable to a CPU.
 As it is most commonly defined, a multiprocessor system implies the existence of multiple CPUs, although usually there will be one or more IOPs as well.
Characteristics of Multiprocessors
 A multiprocessor system is controlled by one operating
system that provides interaction between processors and all
the components of the system cooperate in the solution of a
problem.
 Multiprocessing improves the reliability of the system so that
a failure or error in one part has a limited effect on the rest of
the system.
 If a fault causes one processor to fail, a second processor can
be assigned to perform the functions of the disabled
processor. The system as a whole can continue to function
correctly with perhaps some loss in efficiency.
 The benefit derived from a multiprocessor organization is an
improved system performance.
Characteristics of Multiprocessors
 The system derives its high performance from the fact that
computations can proceed in parallel in one of two ways.
1. Multiple independent jobs can be made to operate in parallel.
2. A single job can be partitioned into multiple parallel tasks.
 An example is a computer system where one processor performs
the computations for an industrial process control while others
monitor and control the various parameters, such as temperature
and flow rate.
 Another example is a computer where one processor performs
high speed floating-point mathematical computations and
another takes care of routine data-processing tasks.
Characteristics of Multiprocessors
 An overall function can be partitioned into a number of
tasks that each processor can handle individually.
 Multiprocessing can improve performance by decomposing
a program into parallel executable tasks.
 Multiprocessors are classified by the way their memory is
organized. A multiprocessor system with common shared
memory is classified as a shared memory or tightly coupled
multiprocessor.
 An alternative model of multiprocessor is the distributed-memory or loosely coupled system. Each processor element in a loosely coupled system has its own private local memory.
Interconnection Structures
 The components that form a multiprocessor system are CPUs, IOPs connected to input-output devices, and a memory unit that may be partitioned into a number of separate modules.
 The interconnection between the components can have
different physical configurations, depending on the number
of transfer paths that are available between the processors
and memory in a shared memory system or among the
processing elements in a loosely coupled system.
Interconnection Structures
 Some physical forms available for establishing an
interconnection network are:
1. Time-shared common bus
2. Multiport memory
3. Crossbar switch
4. Multistage switching network
5. Hypercube system
Interconnection Structures
Time-Shared Common Bus
 A common-bus multiprocessor system consists of a number
of processors connected through a common path to a
memory unit. A time-shared common bus for five
processors is shown in Figure.
 Only one processor can communicate with the memory or
another processor at any given time.
Interconnection Structures
Time-Shared Common Bus
 Transfer operations are conducted by the processor that is in
control of the bus at the time.
 Any other processor wishing to initiate a transfer must first
determine the availability status of the bus, and only after
the bus becomes available can the processor address the
destination unit to initiate the transfer.
 A command is issued to inform the destination unit what operation is to be performed. The receiving unit recognizes its address in the bus and responds to the control signals from the sender, after which the transfer is initiated.
Interconnection Structures
Time-Shared Common Bus
 The system may exhibit transfer conflicts since one
common bus is shared by all processors.
 These conflicts must be resolved by incorporating a bus
controller that establishes priorities among the requesting
units.
 A single common-bus system is restricted to one transfer at
a time.
 A more economical implementation of a dual bus structure
is depicted in Figure.
Interconnection Structures
Time-Shared Common Bus
 Each local bus may be connected to a CPU, an IOP or any
combination of processors.

Fig: System bus structure for multiprocessors


Interconnection Structures
Time-Shared Common Bus
 A system bus controller links each local bus to a common
system bus.
 The I/O devices connected to the local IOP, as well as the local memory, are available to the local processor. The memory connected to the common system bus is shared by all processors.
 If an IOP is connected directly to the system bus, the I/O devices attached to it may be made available to all processors.
 Only one processor can communicate with the shared memory
and other common resources through the system bus at any
given time. The other processors are kept busy communicating
with their local memory and I/O devices.
Interconnection Structures
Multiport Memory
 A multiport memory system employs separate buses
between each memory module and each CPU. This is
shown in figure for four CPUs and four memory modules
(MMs).
 Each processor bus is connected to each memory module. A
processor bus consists of the address, data, and control lines
required to communicate with memory.
 The memory module is said to have four ports and each port
accommodates one of the buses. The module must have
internal control logic to determine which port will have
access to memory at any given time.
Interconnection Structures
Multiport Memory
Interconnection Structures
Multiport Memory
 Memory access conflicts are resolved by assigning fixed
priorities to each memory port.
 The advantage of the multiport memory organization is the high transfer rate that can be achieved because of the multiple paths between processors and memory.
 The disadvantage is that it requires expensive memory
control logic and a large number of cables and connectors.
Interconnection Structures
Crossbar Switch
 Crossbar switch organization consists of a number of cross
points that are placed at intersections between processor
buses and memory module paths.
 Figure shows a crossbar switch interconnection between
four CPUs and four memory modules. The small square in
each cross point is a switch that determines the path from a
processor to a memory module.
 Each switch point has control logic to set up the transfer
path between a processor and memory.
Interconnection Structures
Crossbar Switch
Interconnection Structures
Crossbar Switch
 Figure shows the functional design of a crossbar switch
connected to one memory module.
 The circuit consists of multiplexers that select the data,
address and control from one CPU for communication with
the memory module.
Interconnection Structures
Crossbar Switch
Interconnection Structures
Multistage Switching Network
 The basic component of a multistage network is a two-input, two-output interchange switch. As shown in Fig., the 2 x 2 switch has two inputs, labeled A and B, and two outputs, labeled 0 and 1.
 There are control signals associated with the switch that establish
the interconnection between the input and output terminals.
 The switch has the capability of connecting input A to either of
the outputs. Terminal B of the switch behaves in a similar fashion.
 The switch also has the capability to arbitrate between conflicting
requests. If inputs A and B both request the same output terminal
only one of them will be connected; the other will be blocked.
Interconnection Structures
Multistage Switching Network
 The 2 x 2 crossbar switch used in the multistage network has two inputs (A and B) and two outputs (0 and 1). Control inputs CA and CB establish the connection between the input and output terminals.
 An input is connected to output 0 if its control input is 0, and to output 1 if its control input is 1.
 The switch can arbitrate between conflicting requests: if A and B both request the same output terminal, only one of them is connected and the other is blocked.
 Using 2 x 2 switches we can construct a multistage network that controls the communication between a number of sources and destinations. A binary tree of crossbar switches accomplishes the connections needed to connect an input to one of the 8 possible destinations.
Multistage Switching Network
 In the diagram, PA and PB are two processors connected through switches to eight memory modules numbered in binary from 000 (0) to 111 (7). There are three levels from a source to a destination, and one bit of the destination number selects the switch output at each level: the first bit determines the output in the first level, the second bit in the second level, and the third bit in the third level.
 Example: if the source is PB and the destination is memory module 011 (as in the figure), a path is formed from PB to output 0 in the first level, output 1 in the second level, and output 1 in the third level.
 In a tightly coupled system the processor usually acts as the source and a memory module acts as the destination. In a loosely coupled system the processing units act as both sources and destinations.
 Many patterns can be made using 2 x 2 switches, such as the Omega network and the Butterfly network.
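 A minimal Python sketch of this bit-steered routing (purely illustrative):

def route(destination_bits):             # e.g. "011" selects memory module 3
    return ["level %d -> output %s" % (level + 1, bit)
            for level, bit in enumerate(destination_bits)]

print(route("011"))                      # output 0 at level 1, output 1 at levels 2 and 3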
Advantage of Multistage Switching Network
 A multistage switching network uses smaller 2 x 2 switches to reduce complexity, and routing algorithms can be used to set the switches. Its complexity and cost are less than those of a crossbar interconnection network.
Interconnection Structures
Hypercube Interconnection
 The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube.
 Each processor forms a node of the cube. Each processor has direct communication paths to n other neighbor processors.
 These paths correspond to the edges of the cube. There are 2^n distinct n-bit binary addresses that can be assigned to the processors.
 Each processor address differs from that of each of its n neighbors by exactly one bit position.
Interconnection Structures
Hypercube Interconnection
Interconnection Structures
Hypercube Interconnection
 Figure shows the hypercube structure for n = 1, 2, and 3.
 A one-cube structure has n = 1 and 2^n = 2. It contains two processors interconnected by a single path.
 A two-cube structure has n = 2 and 2^n = 4. It contains four nodes interconnected as a square.
 A three-cube structure has eight nodes interconnected as a cube. An n-cube structure has 2^n nodes with a processor residing in each node.
Example:
 In a three-cube structure, node 000 may communicate with node 011 (from 000 to 010 to 011, or from 000 to 001 to 011). It must cross at least three links to communicate from node 000 to node 111.
 A routing procedure can be designed by computing the exclusive-OR of the source node address with the destination node address. The resulting binary value will have 1 bits corresponding to the axes on which the two nodes differ. The message is then transmitted along any one of these axes.
 For example, a message at node 010 going to node 001 produces an exclusive-OR of the two addresses equal to 011 in a three-cube structure. The message can be transmitted along the second axis to node 000 and then through the third axis to node 001.
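 A Python sketch of this exclusive-OR routing procedure (crossing the highest differing axis first is an arbitrary choice made to match the example):

def route(src, dst, n=3):
    path = [src]
    while src != dst:
        diff = src ^ dst                     # 1 bits mark the axes still to be crossed
        axis = 1 << (diff.bit_length() - 1)  # cross the highest differing axis
        src ^= axis
        path.append(src)
    return ["{:0{}b}".format(node, n) for node in path]

print(route(0b010, 0b001))                   # ['010', '000', '001'], as in the example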
Interprocessor Arbitration
Serial Arbitration Procedure
 Arbitration procedures service all processor requests on the
basis of established priorities.
 A hardware bus priority resolving technique can be
established by means of a serial or parallel connection of the
units requesting control of the system bus.
 The serial priority resolving technique is obtained from a
daisy-chain connection of bus arbitration circuits. The
processors connected to the system bus are assigned priority
according to their position along the priority control line.
Interprocessor Arbitration
Serial Arbitration Procedure
 The device closest to the priority line is assigned the highest
priority.
 When multiple devices concurrently request the use of the
bus, the device with the highest priority is granted access to
it.
 Figure shows the daisy-chain connection of four arbiters. It
is assumed that each processor has its own bus arbiter logic
with priority-in and priority-out lines.
 The priority out (PO) of each arbiter is connected to the
priority in (PI) of the next-lower-priority arbiter. The PI of
the highest-priority unit is maintained at a logic 1 value.
Interprocessor Arbitration
Serial Arbitration Procedure
 The highest-priority unit in the system will always receive
access to the system bus when it requests it.
 The PO output for a particular arbiter is equal to 1 if its PI
input is equal to 1 and the processor associated with the
arbiter logic is not requesting control of the bus.

Fig: Serial (daisy-chain) arbitration


Interprocessor Arbitration
Serial Arbitration Procedure
 This is the way that priority is passed to the next unit in the
chain.
 If the processor requests control of the bus and the
corresponding arbiter finds its PI input equal to 1, it sets its
PO output to 0. Lower-priority arbiters receive a 0 in PI and
generate a 0 in PO. Thus the processor whose arbiter has a
PI = 1 and PO = 0 is the one that is given control of the
system bus.
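 A sketch of the daisy chain as boolean logic in Python (arbiter 1 has the highest priority; the request pattern is invented):

def grants(requests):                    # requests[i]: does arbiter i want the bus?
    pi, out = 1, []                      # PI of the highest-priority arbiter is tied to 1
    for req in requests:
        out.append(pi == 1 and req)      # granted when PI = 1 and requesting
        if req:
            pi = 0                       # PO = 0: priority stops propagating downstream
    return out

print(grants([False, True, False, True]))   # [False, True, False, False]: arbiter 2 wins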
Interprocessor Arbitration
Parallel Arbitration Logic
 The parallel bus arbitration technique uses an external
priority encoder and a decoder as shown in Figure.
 Each bus arbiter in the parallel scheme has a bus request
output line and a bus acknowledge input line.
 Each arbiter enables the request line when its processor is
requesting access to the system bus.
 The processor takes control of the bus if its acknowledge
input line is enabled.
 The bus busy line provides an orderly transfer of control, as
in the daisy-chaining case.
Interprocessor Arbitration
Parallel Arbitration Logic
Interprocessor Arbitration
Parallel Arbitration Logic
 Figure shows the request lines from four arbiters going into
a 4 x 2 priority encoder.
 The output of the encoder generates a 2-bit code which
represents the highest-priority unit among those requesting
the bus.
 The 2-bit code from the encoder output drives a 2 x 4
decoder which enables the proper acknowledge line to grant
bus access to the highest-priority unit.
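 A behavioral Python sketch of the encoder/decoder pair (request line 0 is taken as the highest priority; the request pattern is invented):

def arbitrate(requests):                 # 4 request lines into a 4 x 2 priority encoder
    for i, req in enumerate(requests):
        if req:
            ack = [j == i for j in range(4)]   # 2 x 4 decoder: one-hot acknowledge
            return i, ack
    return None, [False] * 4

print(arbitrate([False, True, True, False]))   # (1, [False, True, False, False])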
Interprocessor Communication and
Synchronization
Interprocessor Communication
 The various processors in a multiprocessor system must be
provided with a facility for communicating with each other.
 A communication path can be established through common
input-output channels.
 In a shared memory multiprocessor system, the most
common procedure is to set aside a portion of memory that
is accessible to all processors.
 The primary use of the common memory is to act as a
message center similar to a mailbox, where each processor
can leave messages for other processors and pick up
messages intended for it.
Interprocessor Communication and
Synchronization
Interprocessor Communication
 In addition to shared memory, a multiprocessor system may
have other shared resources.
 For example, a magnetic disk storage unit connected to an IOP may be available to all CPUs. This provides a facility for sharing of system programs stored in the disk.
 A communication path between two CPUs can be established through a link between two IOPs associated with two different CPUs.
 This type of link allows each CPU to treat the other as an
I/O device so that messages can be transferred through the
I/O path.
Interprocessor Communication and
Synchronization
Interprocessor Communication
 To prevent conflicting use of shared resources by several
processors there must be a provision for assigning resources
to processors. This task is given to the operating system.
 There are three organizations that have been used in the
design of operating system for multiprocessors: master-
slave configuration, separate operating system, and
distributed operating system.
 In a master-slave mode, one processor, designated the
master, always executes the operating system functions. The
remaining processors, denoted as slaves, do not perform
operating system functions.
Interprocessor Communication and
Synchronization
Interprocessor Communication
 If a slave processor needs an operating system service, it
must request it by interrupting the master and waiting until
the current program can be interrupted.
 In the separate operating system organization, each
processor can execute the operating system routines it
needs. This organization is more suitable for loosely
coupled systems where every processor may have its own
copy of the entire operating system.
Interprocessor Communication and
Synchronization
Interprocessor Communication
 In the distributed operating system organization, the
operating system routines are distributed among the
available processors.
 However, each particular operating system function is
assigned to only one processor at a time.
 This type of organization is also referred to as a floating
operating system since the routines float from one processor
to another and the execution of the routines may be
assigned to different processors at different times.
Interprocessor Communication and
Synchronization
Interprocessor Communication
 In a loosely coupled multiprocessor system the memory is
distributed among the processors and there is no shared
memory for passing information.
 The communication between processors is by means of
message passing through I/O channels. The communication
is initiated by one processor calling a procedure that resides
in the memory of the processor with which it wishes to
communicate.
 When the sending processor and receiving processor name
each other as a source and destination, a channel of
communication is established.
Interprocessor Communication and
Synchronization
Interprocessor Synchronization
 The instruction set of a multiprocessor contains basic
instructions that are used to implement communication and
synchronization between cooperating processes.
 Synchronization refers to the special case where the data
used to communicate between processors is control
information.
 Synchronization is needed to enforce the correct sequence
of processes and to ensure mutually exclusive access to
shared writable data.
 Multiprocessor systems usually include various mechanisms
to deal with the synchronization of resources.
Interprocessor Communication and
Synchronization
Interprocessor Synchronization
 Low-level primitives are implemented directly by the
hardware. These primitives are the basic mechanisms that
enforce mutual exclusion for more complex mechanisms
implemented in software.
 One of the most popular methods is through the use of a
binary semaphore.
 A binary variable called a semaphore is often used to
indicate whether or not a processor is executing a critical
section.
Interprocessor Communication and
Synchronization
Interprocessor Synchronization
 A semaphore is a software controlled flag that is stored in a
memory location that all processors can access.
 When the semaphore is equal to 1, it means that a processor
is executing a critical program, so that the shared memory is
not available to other processors.
 When the semaphore is equal to 0, the shared memory is
available to any requesting processor.
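 A Python sketch of the busy-wait use of such a semaphore (threading.Lock stands in for the hardware test-and-set flag; the shared-table update is an invented example):

import threading

semaphore = threading.Lock()             # released = 0 (memory free), held = 1 (in use)

def update_shared(table, key, value):
    while not semaphore.acquire(blocking=False):
        pass                             # spin until the semaphore reads 0
    try:
        table[key] = value               # critical section: exclusive access to shared memory
    finally:
        semaphore.release()              # set the semaphore back to 0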
Cache Coherence
 The primary advantage of cache is its ability to reduce the
average access time in uniprocessors.
 When the processor finds a word in cache during a read
operation, the main memory is not involved in the transfer.
 If the operation is to write, there are two commonly used
procedures to update memory.
 In the write-through policy, both cache and main memory
are updated with every write operation.
 In the write-back policy, only the cache is updated and the
location is marked so that it can be copied later into main
memory.
Cache Coherence
 In a shared memory multiprocessor system, all the processors
share a common memory. In addition, each processor may
have a local memory, part or all of which may be a cache.
 The compelling reason for having separate caches for each
processor is to reduce the average access time in each
processor.
 The same information may reside in a number of copies in
some caches and main memory.
 To ensure the ability of the system to execute memory
operations correctly, the multiple copies must be kept
identical. This requirement imposes a cache coherence
problem.
Cache Coherence
Conditions for Incoherence
 Cache coherence problems exist in multiprocessors with
private caches because of the need to share writable data.
Read-only data can safely be replicated without cache
coherence enforcement mechanisms.
 To illustrate the problem, consider the three-processor
configuration with private caches shown in Fig.
 Sometime during the operation an element X from main
memory is loaded into the three processors, P1, P2, and P3.
As a consequence, it is also copied into the private caches of
the three processors.
Cache Coherence
Conditions for Incoherence
 Assume that X contains the value of 52. The load on X to
the three processors results in consistent copies in the
caches and main memory.
Cache Coherence
Conditions for Incoherence
 If one of the processors performs a store to X, the copies of
X in the caches become inconsistent.
 A load by the other processors will not return the latest
value. Depending on the memory update policy used in the
cache, the main memory may also be inconsistent with
respect to the cache. This is shown in Fig.
 A store to X (of the value of 120) into the cache of processor P1 updates memory to the new value in a write-through policy.
Cache Coherence
Conditions for Incoherence
 A write-through policy maintains consistency between
memory and the originating cache, but the other two caches
are inconsistent since they still hold the old value.
Cache Coherence
Conditions for Incoherence
 In a write-back policy, main memory is not updated at the
time of the store. The copies in the other two caches and
main memory are inconsistent. Memory is updated
eventually when the modified data in the cache are copied
back into memory.
Cache Coherence
Solutions to the Cache Coherence Problem
 A simple scheme is to disallow private caches for each
processor and have a shared cache memory associated with
main memory.
 Every data access is made to the shared cache. This method
violates the principle of closeness of CPU to cache and
increases the average memory access time. In effect, this
scheme solves the problem by avoiding it.
Cache Coherence
Solutions to the Cache Coherence Problem
 For performance considerations it is desirable to attach a
private cache to each processor. One scheme that has been
used allows only non shared and read-only data to be stored
in caches. Such items are called cachable.
 Shared writable data are non cachable. The compiler must
tag data as either cachable or non cachable, and the system
hardware makes sure that only cachable data are stored in
caches. The non cachable data remain in main memory.
 This method restricts the type of data stored in caches and introduces an extra software overhead that may degrade performance.
Cache Coherence
Solutions to the Cache Coherence Problem
 The cache coherence problem can be solved by means of a
combination of software and hardware or by means of
hardware-only schemes.
 The two methods mentioned previously use software based
procedures that require the ability to tag information in
order to disable caching of shared writable data.
 Hardware-only solutions are handled by the hardware
automatically and have the advantage of higher speed and
program transparency.
Cache Coherence
Solutions to the Cache Coherence Problem
 In the hardware solution, the cache controller is specially designed to allow it to monitor all bus requests from CPUs and IOPs. All caches attached to the bus constantly monitor the network for possible write operations.
 Depending on the method used, they must then either
update or invalidate their own cache copies when a match is
detected.
