CH 2 Vector Processing
Vector processing concepts, pipelined vector processors, Cray-1 type vector processor,
architecture of Cray-1, characteristics of Cray-1, instruction formats of the Cray-1,
characteristics of vector processing, array processors, introduction to associative memory
processors, interleaved memory organization.
The difference between parallel processing and vector processing is that parallel
processing involves multiple processors working on separate tasks simultaneously. In
contrast, vector processing involves a single processor performing the same operation
on multiple data elements simultaneously.
vii. Legacy: The CRAY-1 paved the way for future supercomputers and established
Seymour Cray as a pioneer in high-performance computing. Its success led to the
development of subsequent Cray supercomputers.
❖ Architecture of Cray-1
- The mass storage subsystem provides secondary storage and consists of one to eight
Cray Research DCU-2 Disk Controllers, each with one to four DD-19 Disk Storage
Units. Each DD-19 has a capacity of 2.424 x 10^9 bits so that a maximum mass
storage configuration could hold 9.7 x 10^9 8-bit characters.
- I/O channels can be connected to independent processors referred to as front-end
computers or I/O stations, or can be connected to peripheral equipment according to
the requirements of the individual installation.
- At least one front-end system is considered standard to collect data and present it to
the CRAY-1 for processing and to receive output from the CRAY-1 for distribution
to slower devices.
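The maximum mass storage figure quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the constant and function names are made up for illustration.

```python
# Figures taken from the text above.
BITS_PER_DD19 = 2.424e9        # capacity of one DD-19 Disk Storage Unit, in bits
MAX_CONTROLLERS = 8            # DCU-2 Disk Controllers per system
MAX_UNITS_PER_CONTROLLER = 4   # DD-19 units per controller


def max_storage_characters(bits_per_char=8):
    """Return the maximum mass storage capacity in characters."""
    total_units = MAX_CONTROLLERS * MAX_UNITS_PER_CONTROLLER
    total_bits = total_units * BITS_PER_DD19
    return total_bits / bits_per_char


print(f"{max_storage_characters():.3e} characters")  # prints 9.696e+09 characters
```

The result, about 9.7 x 10^9 characters, matches the figure stated for a full configuration of 32 disk units.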
• Computation section
- The computation section contains instruction buffers, registers and functional units
which operate together to execute a program of instructions stored in memory.
- Arithmetic operations are either integer or floating point. Integer arithmetic is
performed in two's complement mode. Floating point quantities have signed-
magnitude representation.
- The CRAY-1 executes 128 operation codes as either 16-bit (one parcel) or 32-bit
(two-parcel) instructions. Operation codes provide for both scalar and vector
processing.
- Floating point instructions provide for addition, subtraction, multiplication, and
reciprocal approximation. The reciprocal approximation instruction allows for the
computation of a floating divide operation using a multiple instruction sequence.
- Integer or fixed point operations are provided as follows: integer addition, integer
subtraction, and integer multiplication. An integer multiply operation produces a
24-bit result; additions and subtractions produce either 24-bit or 64-bit results. No
integer divide instruction is provided and the operation is accomplished through a
software algorithm using floating point hardware.
- The instruction set includes Boolean operations for OR, AND, and exclusive OR
and for a mask-controlled merge operation. Shift operations allow the manipulation
of either 64-bit or 128-bit operands to produce 64-bit results. With the exception of
24-bit integer arithmetic, all operations are implemented in vector as well as scalar
instructions. The integer product is a scalar instruction designed for index
calculation. Full indexing capability allows the programmer to index throughout
memory in either scalar or vector modes. The index may be positive or negative in
either mode. This allows matrix operations in vector mode to be performed on rows
or the diagonal as well as conventional column-oriented operations.
- Each functional unit implements an algorithm or a portion of the instruction set.
Units are independent and are fully segmented. This means that a new set of
operands for unrelated computation may enter a functional unit each clock period.
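The idea of building a floating divide from a reciprocal approximation can be sketched as follows. This is a minimal illustration, not the CRAY-1's actual instruction sequence: the 8-bit seed accuracy, the use of Newton-Raphson refinement, the number of refinement steps, and all function names are assumptions made for the example.

```python
import math


def reciprocal_approximation(b, bits=8):
    """Hypothetical stand-in for a hardware reciprocal approximation.

    Returns 1/b with its mantissa rounded to about `bits` bits, which
    mimics a low-precision seed value.
    """
    mantissa, exponent = math.frexp(1.0 / b)   # 1/b = mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def refine(b, x):
    """One Newton-Raphson step for 1/b: x' = x * (2 - b * x)."""
    return x * (2.0 - b * x)


def divide(a, b, steps=3):
    """Compute a / b as a * (1/b) using only multiply and subtract
    after the initial reciprocal approximation."""
    x = reciprocal_approximation(b)
    for _ in range(steps):
        x = refine(b, x)       # each step roughly doubles the correct bits
    return a * x


print(divide(7.0, 3.0))
```

Each refinement step uses only multiplication and subtraction, which is why a machine with no divide unit can still produce a quotient with a short multiple-instruction sequence.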
• Memory section
- The memory for the CRAY-1 normally consists of 16 banks of bipolar 1024-bit
LSI memory. Three memory size options are available: 262,144 words, 524,288
words, or 1,048,576 words. Each word is 72 bits long and consists of 64 data bits
and 8 check bits. The banks are independent of each other.
- Sequentially addressed words reside in sequential banks. The memory cycle time is
four clock periods (50 nsec). The access time, that is, the time required to fetch an
operand from memory to a scalar register, is 11 clock periods (137.5 nsec). There is
no inherent memory degradation for 16-bank memories of less than one million
words.
- The maximum transfer rate for B, T, and V registers is one word per clock period.
For A and S registers, it is one word per two clock periods. Transfers of instructions
to the instruction buffers occur at a rate of 16 parcels (four words) per clock period.
Thus, the high speed of memory supports the requirements of scientific applications
while its low cycle time is well suited to random access applications. The phased
memory banks allow high communication rates through the I/O section and provide
low read/store times for vector registers.
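The effect of placing sequentially addressed words in sequential banks can be illustrated with a small timing model. The sketch below takes the 16 banks and the four-clock bank cycle time from the text; everything else is a simplifying assumption (one reference may start per clock period unless its bank is still busy, and access latency and competing memory traffic are ignored). The function name is hypothetical.

```python
NUM_BANKS = 16    # independent memory banks (from the text)
BANK_CYCLE = 4    # bank cycle time in clock periods (from the text)


def issue_clocks(first_address, increment, length):
    """Clock periods needed to issue `length` memory references.

    Assumes one reference can start per clock period unless the bank it
    addresses is still busy with an earlier reference.
    """
    free_at = [0] * NUM_BANKS               # clock at which each bank is free
    clock = 0
    for i in range(length):
        bank = (first_address + i * increment) % NUM_BANKS
        clock = max(clock, free_at[bank])   # stall until the bank is free
        free_at[bank] = clock + BANK_CYCLE  # bank is busy for one cycle time
        clock += 1                          # next reference starts no earlier
    return clock


for stride in (1, 8, 16):
    print(stride, issue_clocks(0, stride, 64))
```

Under these assumptions a 64-word transfer with an increment of 1 issues in 64 clock periods, while an increment of 16 sends every reference to the same bank and takes 253 clock periods.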
• I/O section
- Input and output communication with the CRAY-1 is over 12 full duplex 16-bit
channels. Associated with each channel are control lines that indicate the presence
of data on the channel (ready), data received (resume), or transfer complete
(disconnect).
- The channels are divided into four channel groups. A channel group consists of
either six input paths or six output paths. The four channel groups are scanned
sequentially for I/O requests at a rate of one channel group per clock period. The
channel group will be reinterrogated four clock periods later whether any I/O request
is pending in the channel or not. If more than one channel of the channel group is
active, the requests are resolved on a priority basis. The request from the lowest
numbered channel is serviced first.
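The scanning and priority rules above can be sketched as a small simulation. The group indices, the channel numbers, and the assumption that one request is serviced per scan are all hypothetical choices made for illustration; the text only states that groups are scanned in turn and that the lowest-numbered channel wins.

```python
NUM_GROUPS = 4   # channel groups, scanned one per clock period (from the text)


def group_scanned_at(clock):
    """Index (0 to 3) of the channel group interrogated in a clock period."""
    return clock % NUM_GROUPS


def select_request(pending_channels):
    """Resolve competing requests in one group: lowest channel number wins."""
    return min(pending_channels) if pending_channels else None


def service_order(pending_by_group, clocks):
    """Simulate `clocks` clock periods, servicing one request per scan.

    `pending_by_group` maps a group index to the channel numbers that
    have a request pending in that group.
    """
    pending = {g: set(chs) for g, chs in pending_by_group.items()}
    order = []
    for clock in range(clocks):
        group = group_scanned_at(clock)
        channel = select_request(pending.get(group, set()))
        if channel is not None:
            pending[group].discard(channel)
            order.append((clock, group, channel))
    return order


print(service_order({0: [5, 3], 2: [10]}, 8))
# prints [(0, 0, 3), (2, 2, 10), (4, 0, 5)]
```

In this run channel 3 is serviced before channel 5 because it has the lower number, and channel 5 has to wait four clock periods until its group is interrogated again.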
• Vector Processing
- All operands processed by the CRAY-1 are held in registers prior to their being
processed by the functional units and are received by registers after processing. In
general, the sequence of operations is to load one or more vector registers from
memory and pass them to functional units. Results from this operation are received
by another vector register and may be processed additionally in another operation
or returned to memory if the results are to be retained.
- The contents of a V register are transferred to or from memory by specifying a first
word address in memory, an increment for the memory address, and a length. The
transfer proceeds beginning with the first element of the V register and
incrementing by one in the V register at a rate of up to one word per clock period
depending on memory conflicts.
- A result may be received by a V register and re-entered as an operand to another
vector computation in the same clock period. This mechanism allows for "chaining"
two or more vector operations together. Chain operation allows the CRAY-1 to
produce more than one result per clock period. Chain operation is detected
automatically by the CRAY-1 and is not explicitly specified by the programmer,
although the programmer may reorder certain code segments in order to enable
chain operation.
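The benefit of chaining can be seen with a simple timing model. The sketch assumes each functional unit needs a fixed startup time before its first result and then delivers one result per clock period; the startup values of 6 and 7 clock periods and the function names are illustrative assumptions, not figures from the text.

```python
def unchained_clocks(n, startup_a, startup_b):
    """Two dependent vector operations run one after the other:
    the second starts only when the first has finished."""
    return (startup_a + n) + (startup_b + n)


def chained_clocks(n, startup_a, startup_b):
    """The second operation consumes each result of the first as soon
    as it appears, so the two element streams overlap."""
    return startup_a + startup_b + n


def results_per_clock(n, clocks, operations=2):
    """Average number of results produced per clock period."""
    return operations * n / clocks


n = 64                                   # vector length
slow = unchained_clocks(n, 6, 7)         # 141 clock periods
fast = chained_clocks(n, 6, 7)           # 77 clock periods
print(results_per_clock(n, slow), results_per_clock(n, fast))
```

With these assumed values the chained pair averages more than one result per clock period, while the unchained pair stays below one. The model also shows why longer vectors help: the fixed startup cost is spread over more elements.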
❖ Characteristics of Cray-1
register. The operation continues until the number of elements processed is equal to
the count specified by the vector length register.
For example: C(1:50) = A(1:50) + B(1:50)
This vector instruction includes the initial addresses of the two source operands,
one destination operand, the length of the vectors and the operation to be performed.
iii. Vector instructions are classified into four basic types:
f1: V → V        f2: V → S
f3: V × V → V    f4: V × S → V
where V indicates a vector operand and S indicates a scalar operand. The operations f1
and f2 are unary operations such as vector square root, vector sine, vector complement,
vector summation and so on. On the other hand, operations f3 and f4 are binary
operations such as vector add, vector multiply, vector-scalar add and so on.
iv. In vector processing, identical processes are repeatedly invoked many times, each
of which can be subdivided into subprocesses.
v. In vector processing, successive operands are fed through the pipeline segments and
require as few buffers and local controls as possible. This parallel vector processing
allows the generation of more than one result per clock period. The parallel vector
operations are automatically initiated either when successive vector instructions use
different functional units and different vector registers, or when successive vector
instructions use the result stream from one vector register as the operand of another
operation using different functional units. This process is known as chaining.
vi. Because of the startup delay in a pipeline, a vector processor performs better with
longer vectors.
vii. Vector processing is usually faster and more efficient than scalar processing
because it reduces the overhead associated with maintenance of the loop control
variables.
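The four instruction types in item iii (vector to vector, vector to scalar, vector with vector, and vector with scalar) can be modelled in software as follows. This is only a sketch using plain Python lists; the function names f1 to f4 mirror the labels in the text, and the sample data is made up.

```python
import math
import operator


def f1(op, v):
    """V -> V: apply a unary operation to every element."""
    return [op(x) for x in v]


def f2(op, v, initial):
    """V -> S: reduce a vector to one scalar, e.g. vector summation."""
    result = initial
    for x in v:
        result = op(result, x)
    return result


def f3(op, a, b):
    """V x V -> V: combine two equal-length vectors element by element."""
    if len(a) != len(b):
        raise ValueError("vector lengths differ")
    return [op(x, y) for x, y in zip(a, b)]


def f4(op, v, s):
    """V x S -> V: combine every element with the same scalar."""
    return [op(x, s) for x in v]


# C(1:50) = A(1:50) + B(1:50), with sample data for A and B
A = list(range(1, 51))
B = [2 * x for x in A]
C = f3(operator.add, A, B)
print(C[0], C[-1], len(C))   # prints 3 150 50
```

A single call such as `f3(operator.add, A, B)` plays the role of one vector instruction: the operation, both sources, the destination, and the length are all specified once, with no per-element loop control visible to the caller.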
❖ Array processors
- An auxiliary processor, such as the attached array processor, is shown below.
- This processor includes a master control unit and main memory. The master control
unit controls the operation of the processing elements. It also decodes each
instruction and determines how the instruction is executed.
- If the instruction is a program control or scalar instruction, it is executed directly in the
master control unit. Main memory is mainly used to store the program, while every
processing element uses operands that are stored in its local memory.
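The division of work between the master control unit and the processing elements can be sketched in code. The instruction format (tuples tagged "scalar" or "vector"), the class names, and the sample program are all hypothetical; the loop over processing elements stands in for what the hardware would do in parallel.

```python
import operator


class ProcessingElement:
    """One processing element (PE) with its own local memory."""

    def __init__(self, local_memory):
        self.local_memory = dict(local_memory)

    def execute(self, op, dst, src_a, src_b):
        # Operands are read from, and the result written to, local memory.
        self.local_memory[dst] = op(self.local_memory[src_a],
                                    self.local_memory[src_b])


class MasterControlUnit:
    """Decodes each instruction and decides where it is executed."""

    OPS = {"add": operator.add, "mul": operator.mul}

    def __init__(self, elements):
        self.elements = elements
        self.scalars = {}

    def run(self, program):
        for instruction in program:
            kind = instruction[0]
            if kind == "scalar":
                # Scalar and control instructions stay in the control unit.
                _, name, value = instruction
                self.scalars[name] = value
            elif kind == "vector":
                # Vector instructions are broadcast to every PE.
                _, op_name, dst, src_a, src_b = instruction
                for pe in self.elements:
                    pe.execute(self.OPS[op_name], dst, src_a, src_b)
            else:
                raise ValueError(f"unknown instruction kind: {kind}")


pes = [ProcessingElement({"a": i, "b": 10 * i}) for i in range(4)]
mcu = MasterControlUnit(pes)
mcu.run([("scalar", "n", 4), ("vector", "add", "c", "a", "b")])
print([pe.local_memory["c"] for pe in pes])   # prints [0, 11, 22, 33]
```

Every processing element executes the same broadcast instruction on different data held in its own local memory, which is the defining feature of this organization.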
• Applications of array processors
i. Array processors are used in medical and astronomy applications.
ii. They are very helpful in speech enhancement.
iii. They are used in sonar and radar systems.
iv. They are applicable in anti-jamming, seismic exploration and wireless
communication.
v. An attached array processor is connected to a general-purpose computer to improve the
computer's performance on numerical computation tasks. It attains
high performance through parallel processing by several functional units.
- In low-order interleaving, the least significant address bits select the memory
bank (module). Consecutive memory addresses therefore lie in different memory
modules, so successive accesses can overlap and words can be delivered faster than
one per memory cycle time.
banks. This kind of memory access can reduce the memory access time by a factor
close to the number of memory banks. With this form of interleaving, memory
location i is found in bank i mod n, where n is the number of banks.
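The address mapping for low-order interleaving can be written out directly. The sketch below shows the general "i mod n" form and the equivalent bit-selection form that applies when the number of banks is a power of two; the function names are made up for illustration.

```python
def locate(address, num_banks):
    """Low-order interleaving: return (bank, word offset inside the bank)."""
    return address % num_banks, address // num_banks


def locate_bits(address, bank_bits):
    """Same mapping when the number of banks is 2**bank_bits:
    the low-order bits select the bank, the remaining bits the offset."""
    bank = address & ((1 << bank_bits) - 1)
    offset = address >> bank_bits
    return bank, offset


print([locate(i, 4)[0] for i in range(8)])   # prints [0, 1, 2, 3, 0, 1, 2, 3]
print(locate(37, 16), locate_bits(37, 4))    # prints (5, 2) (5, 2)
```

Consecutive addresses cycle through the banks in turn, so a run of sequential accesses keeps all banks busy at once instead of waiting on a single bank.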
Prepared by: Priyanka More