
Module V

Input Output & Parallel Processing


Basic Input Output
Accessing I/O Devices, Interrupts, DMA
Input Output Organization.
Bus Structure, Bus Operation, Arbitration, Interface, Interconnection Standards.
Parallel Processing
Hardware Multithreading, Vector (SIMD) Processing, Shared-Memory Multiprocessors, Cache Coherence,
Message-Passing Multicomputers, Parallel Programming for Multiprocessors, Performance Modeling.

Basic Input Output

Devices that are under the direct control of the computer are said to be connected on-line.
These devices are designed to read information into or out of the memory unit upon command
from the CPU and are considered to be part of the total computer system.

Input or output devices attached to the computer are also called peripherals. Among the most
common peripherals are keyboards, display units, and printers.

Peripherals that provide auxiliary storage for the system are magnetic disks and tapes.
Peripherals are electromechanical and electromagnetic devices of some complexity. The I/O subsystem of the computer provides an efficient mode of communication between the central system and the outside environment.
Accessing I/O Devices:

A simple arrangement for connecting I/O devices to a computer is to use a single bus. Each I/O device is assigned a unique set of addresses. When the processor places a particular address on the address lines, the device that recognizes this address responds to the commands issued on the control lines. The processor requests either a read or a write operation, and the requested data are transferred over the data lines.
When I/O devices and the memory share the same address space, the arrangement is called memory-mapped I/O. With memory-mapped I/O, any machine instruction that can access memory can be used to transfer data to or from an I/O device. Most computer systems use memory-mapped I/O.
IO versus Memory Bus
In addition to communicating with I/O, the processor must communicate with the memory unit. Like the
IO bus, the memory bus contains data, address, and read/write control lines. There are three ways that
computer buses can be used to communicate with memory and I/O:
1. Use two separate buses, one for memory and the other for I/O.
2. Use one common bus for both memory and I/O but have separate control lines for each.
3. Use one common bus for memory and I/O with common control lines.
In the first method, the computer has independent sets of data, address, and control buses, one for accessing
memory and the other for I/O. This is done in computers that provide a separate I/O processor (IOP) in
addition to the central processing unit (CPU). The memory communicates with both the CPU and the IOP
through a memory bus. The IOP communicates also with the input and output devices through a separate I/O
bus with its own address, data and control lines. The purpose of the IOP is to provide an independent
pathway for the transfer of information between external devices and internal memory. The I/O
processor is sometimes called a data channel.
Isolated versus Memory-Mapped I/O

In isolated I/O, a common bus (data and address) serves both I/O and memory, but separate read and write control lines are provided for I/O.
When the CPU decodes an instruction that refers to I/O, it places the address on the address lines and asserts the I/O read or I/O write control line, causing a data transfer between the CPU and the I/O device. Because the address spaces of memory and I/O are kept separate, the method is called isolated I/O. The I/O addresses are called ports, and there are distinct read and write instructions for I/O and for memory.
In the isolated I/O configuration, the CPU has distinct input and output instructions, and each of
these instructions is associated with the address of an interface register. When the CPU fetches and
decodes the operation code of an input or output instruction, it places the address associated with the
instruction into the common address lines. At the same time, it enables the I/O read (for input) or I/O write
(for output) control line. This informs the external components that are attached to the common bus that
the address in the address lines is for an interface register and not for a memory word.
When the CPU is fetching an instruction or an operand from memory, it places the memory address on
the address lines and enables the memory read or memory write control line. This informs the external
components that the address is for a memory word and not for an I/O interface. The isolated I/O method
isolates memory and I/O addresses so that memory address values are not affected by interface address
assignment since each has its own address space.

Memory-mapped I/O uses the same address space for both memory and I/O. This is the case in computers that employ only one set of read and write signals and do not distinguish between memory and I/O addresses. This configuration is referred to as memory-mapped I/O.
Because every bus line is common, the same set of instructions works for both memory and I/O. Hence I/O is manipulated exactly like memory, and both share the same address space; as a result, the addressing capability available to memory is reduced, because part of the address space is occupied by I/O.
The computer treats an interface register as being part of the memory system. The assigned addresses for
interface registers cannot be used for memory words, which reduces the memory address range available. In
a memory-mapped I/O organization there are no specific input or output instructions. The CPU can manipulate
I/O data residing in interface registers with the same instructions that are used to manipulate memory words.
Each interface is organized as a set of registers that respond to read and write requests in the normal address space. Computers with memory-mapped I/O can thus use the same memory-type instructions for both input-output transfers and memory transfers.
The advantage is that the load and store instructions used for reading and writing from memory can be
used to input and output data from I/O registers. In a typical computer, there are more memory-reference
instructions than I/O instructions. With memory mapped I/O all instructions that refer to memory are also
available for I/O.
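As a minimal sketch of how memory-mapped I/O looks to a programmer, the following C fragment treats hypothetical device registers as ordinary memory locations; the addresses and the TX_READY bit are made up for illustration and are platform-specific in practice.

#include <stdint.h>

/* Hypothetical device register addresses -- real values are platform-specific. */
#define UART_STATUS ((volatile uint8_t *) 0xFFFF0000u)
#define UART_DATA   ((volatile uint8_t *) 0xFFFF0004u)
#define TX_READY    0x01u   /* status bit: transmitter can accept a byte */

/* With memory-mapped I/O, an ordinary store instruction writes to the device. */
void uart_put_char(char c)
{
    while ((*UART_STATUS & TX_READY) == 0)
        ;                       /* busy-wait until the interface is ready */
    *UART_DATA = (uint8_t) c;   /* a plain memory write reaches the device register */
}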

Input output interface

Peripherals connected to a computer need special communication links for interfacing them with the central
processing unit. The purpose of the communication link is to resolve the differences that exist between the
central computer and each peripheral. The input-output interface provides a method for transferring information between internal storage and external I/O devices.

The major differences are:

1. Peripherals are electromechanical and electromagnetic devices, and their manner of operation is different from that of the CPU and memory, which are electronic devices. Therefore, a conversion of signal values may be required.
2. The data transfer rate of peripherals is usually slower than the transfer rate of the CPU,
and consequently, a synchronization mechanism may be needed.
3. Data codes and formats in peripherals differ from the word format in the CPU and
memory.
4. The operating modes of peripherals are different from each other and each must be
controlled so as not to disturb the operation of other peripherals connected to the
CPU.
To resolve these differences, computer systems include special hardware components
between the CPU and peripherals to supervise and synchronize all input and output
transfers. These components are called interface units because they interface between the
processor bus and the peripheral device.

Asynchronous Data Transfer:

Asynchronous data transfer between two independent units requires that control signals be
transmitted between the communicating units to indicate the time at which data is being
transmitted. One way of achieving this is by means of a strobe pulse supplied by one of the
units to indicate to the other unit when the transfer has to occur.

Handshaking method: Another method commonly used is to accompany each data item
being transferred with a control signal that indicates the presence of data in the bus. The unit
receiving the data item responds with another control signal to acknowledge receipt of the data.
This type of agreement between two independent units is referred to as handshaking.
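A software model of the handshake may help. The sketch below mimics the two interlocked control signals with shared flags; in real hardware these would be dedicated bus lines, and the busy-wait loops stand in for logic transitions.

#include <stdint.h>

/* Toy model of the handshake: "valid" is the source's control signal,
   "ack" is the destination's response. Both would be bus lines in hardware. */
volatile uint8_t bus_data;
volatile int valid = 0;   /* asserted by the sending unit   */
volatile int ack   = 0;   /* asserted by the receiving unit */

void sender(uint8_t item)
{
    bus_data = item;          /* place the data item on the bus       */
    valid = 1;                /* signal that valid data are present   */
    while (!ack) ;            /* wait for the receiver's acknowledge  */
    valid = 0;                /* remove the data-valid indication     */
    while (ack) ;             /* wait for acknowledge to be withdrawn */
}

uint8_t receiver(void)
{
    while (!valid) ;          /* wait for data to be presented   */
    uint8_t item = bus_data;  /* capture the data item           */
    ack = 1;                  /* acknowledge receipt             */
    while (valid) ;           /* wait for sender to drop valid   */
    ack = 0;                  /* complete the interlocked cycle  */
    return item;
}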

Modes of Transfer

Data transfer between the central computer and I/O devices may be handled in a variety of
modes. Some modes use the CPU as an intermediate path; others transfer the data directly to
and from the memory unit. Data transfer to and from peripherals may be handled in one of
three possible modes:
1. Programmed I/O
2. Interrupt-initiated I/O
3. Direct memory access (DMA)

1. Programmed I/O Mode:

In this mode, data transfers result from I/O instructions that are part of the computer program. Each data transfer is initiated by an instruction in the program.
In the programmed I/O method, the CPU stays in a program loop until the I/O unit indicates that it is ready for data transfer. This is a time-consuming process since it keeps the processor busy needlessly. Usually, the transfer is between a CPU register and the peripheral; other instructions are needed to move the data between the CPU and memory.
Transferring data under program control requires constant monitoring of the peripheral by the CPU. Once a data transfer is initiated, the CPU must monitor the interface to see when the next transfer can be made. The instructions of the program keep close tabs on everything that takes place in the interface unit and the I/O device. In this technique, the CPU is responsible for reading data from memory for output and for storing input data into memory.
The transfer of each byte requires three instructions:
1. Read the status register.
2. Check the status of the flag bit and branch to step 1 if not set or to step 3 if set.
3. Read the data register.
Each byte is read into a CPU register and then transferred to memory with a store instruction.
A common I/O programming task is to transfer a block of words from an I/O device and store them in a memory buffer.
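The following C sketch shows this three-step polling sequence for a block transfer under programmed I/O; the register addresses and the DATA_READY flag are assumptions for illustration.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical interface registers; actual addresses are machine-specific. */
#define DEV_STATUS ((volatile uint8_t *) 0xFFFF1000u)
#define DEV_DATA   ((volatile uint8_t *) 0xFFFF1004u)
#define DATA_READY 0x01u

/* Programmed I/O: read a block of bytes into a memory buffer.
   Steps per byte: (1) read status, (2) test flag, (3) read data register. */
void read_block(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        while ((*DEV_STATUS & DATA_READY) == 0)
            ;                   /* steps 1-2: poll until the flag is set */
        buf[i] = *DEV_DATA;     /* step 3: read the byte, then store it  */
    }
}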

Drawback of the Programmed I/O


The main drawback of programmed I/O is that the CPU has to monitor the units at all times while the program is executing. The CPU stays in a program loop until the I/O unit indicates that it is ready for data transfer, which wastes considerable CPU time.
The programmed I/O method is particularly useful in small low-speed computers or in systems that are dedicated to monitoring a device continuously. The difference in information transfer rate between the CPU and the I/O device makes this type of transfer inefficient.
To remove this problem, an interrupt facility and special commands are used.

2. Interrupt-Initiated I/O :

In this method, an interrupt facility and special commands are used. The CPU issues a command to the interface and then executes other programs in the meantime. When the interface determines that the device is ready for data transfer, it generates an interrupt request and sends it to the computer. When the CPU receives such a signal, it temporarily stops the execution of the current program, branches to a service routine to process the I/O transfer, and after completing it returns to the task it was originally performing.
In this type of I/O, the computer does not check the flag; it continues to perform its task. Whenever any device wants attention, it sends an interrupt signal to the CPU. The CPU then stops what it was doing, saves the return address from the PC, and branches to the address of the service routine.
There are two ways of choosing the branch address:
a. Vectored Interrupt
b. Non-vectored Interrupt
In a vectored interrupt, the source that interrupts the CPU provides the branch information. This information is called the interrupt vector.
In a non-vectored interrupt, the branch address is assigned to a fixed location in memory.
Priority Interrupt:
There may be a number of I/O devices attached to the computer, all capable of generating an interrupt.
When interrupts are generated by more than one device, a priority interrupt system is used to determine which device is to be serviced first.
Devices with high-speed transfer are given higher priority, and slow devices are given lower priority.
Establishing the priority can be done in two ways: using software or using hardware.
Polling Procedure :
A polling procedure is used to identify the highest-priority source by software means. In this
method there is one common branch address for all interrupts. The program that takes care
of interrupts begins at the branch address and polls the interrupt sources in sequence. The order
in which they are tested determines the priority of each interrupt. The highest-priority source
is tested first, and if its interrupt signal is on, control branches to a service routine for this
source. Otherwise, the next-lower-priority source is tested.

The disadvantage is that, if there are many interrupt sources, the time required to poll them can exceed the time available to service the I/O devices. In this situation, a hardware priority-interrupt unit can be used to speed up the operation.
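A minimal sketch of the polling procedure in C: the order of testing fixes the priority, and pending() and service() are hypothetical stand-ins for machine-specific status reads and service routines.

/* Software polling: sources are tested in priority order, so source 0
   (tested first) has the highest priority. */
#define NSOURCES 4

extern int  pending(int source);   /* 1 if this source has an interrupt pending */
extern void service(int source);   /* service routine for this source           */

void common_interrupt_handler(void)
{
    for (int src = 0; src < NSOURCES; src++) {   /* test order fixes priority */
        if (pending(src)) {
            service(src);
            return;                /* highest-priority pending source wins */
        }
    }
}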

Hardware: A hardware priority system functions as an overall manager.
It accepts interrupt requests and determines their priorities.
To speed up the operation, each interrupting device has its own interrupt vector.
No polling is required; all decisions are established by the hardware priority-interrupt unit.
It can be implemented by serial or parallel connection of the interrupt lines.

Serial or Daisy Chaining Priority:

The device with the highest priority is placed first in the chain.
A device that wants attention sends an interrupt request to the CPU.
The CPU then sends the INTACK signal, which is applied to the PI (priority in) input of the first device.
If that device had requested attention, it places its VAD (vector address) on the bus and blocks the signal by placing 0 on PO (priority out).
If not, it passes the signal to the next device by placing 1 on PO.
This process continues until the appropriate device is found.
The device whose PI is 1 and PO is 0 is the device that sent the interrupt request.
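The daisy-chain logic can be modeled in software as a walk down the chain, as in this C sketch; the vector addresses are invented for the example.

/* Software model of the daisy chain: the acknowledge propagates from the
   highest-priority device down the chain; the first requesting device
   claims it and supplies its vector address. */
#include <stdio.h>

struct device { int requesting; int vad; };

int daisy_chain_ack(struct device dev[], int n)
{
    for (int i = 0; i < n; i++) {     /* i = 0 is closest to the CPU   */
        if (dev[i].requesting)
            return dev[i].vad;        /* PI=1, PO=0: blocks the chain  */
        /* otherwise the acknowledge is passed on unchanged (PO = 1)   */
    }
    return -1;                        /* no device requested service   */
}

int main(void)
{
    struct device chain[3] = { {0, 0x40}, {1, 0x44}, {1, 0x48} };
    printf("vector address = 0x%x\n", daisy_chain_ack(chain, 3));  /* 0x44 */
    return 0;
}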

Parallel Priority Interrupt:


The parallel priority interrupt method uses a register whose bits are set separately by
the interrupt signal from each device. Priority is established according to the position of
the bits in the register. In addition to the interrupt register, the circuit may include a
mask register whose purpose is to control the status of each interrupt request. The mask
register can be programmed to disable the lower-priority interrupts while a higher-priority
device is being serviced. It can also provide a facility that allows a high-priority device to
interrupt the CPU while a lower-priority device is being serviced. The magnetic disk, being a
high-speed device, is given the highest priority. The printer has the next priority, followed by
a character reader and a keyboard. The mask register has the same number of bits as the
interrupt register. By means of program instructions, it is possible to set or reset any bit in the
mask register. Each interrupt bit and its corresponding mask bit are applied to an AND
gate to produce the four inputs to a priority encoder. In this way an interrupt is recognized
only if its corresponding mask bit is set to 1 by the program. Priority encoder generates two
bits of vector address (four interrupt sources). Another output from the encoder sets an
interrupt status flip-flop IST when an interrupt that is not masked occurs. The interrupt
enable flip-flop IEN can be set or cleared by the program to provide an overall control over
the interrupt system. The outputs of IST ANDed with IEN provide a common interrupt signal
for the CPU. The interrupt acknowledge INTACK signal from the CPU enables the bus buffers in the output register, and a vector address VAD is placed on the data bus.
When IEN is cleared, the interrupt request coming from IST is ignored by the CPU. At
the end of each instruction cycle the CPU checks IEN and the interrupt signal from IST. If
either is equal to 0, control continues with the next instruction. If both IEN and IST are equal
to 1, the CPU goes to an interrupt cycle.
During the interrupt cycle the CPU performs the following sequence of microoperations:

The CPU pushes the return address from PC into the stack. It then acknowledges the interrupt
by enabling the INTACK line. The priority interrupt unit responds by placing a unique
interrupt vector into the CPU data bus. The CPU transfers the vector address into PC and clears
lEN prior to going to the next fetch phase. The instruction read from memory during the next
fetch phase will be the one located at the vector address.
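The mask-and-encode logic of the parallel priority unit can be summarized in a few lines of C; this models the per-bit AND gates and the priority encoder described above for four interrupt sources.

/* Model of the parallel priority unit: the interrupt register is ANDed with
   the mask register, then a priority encoder picks the highest set bit.
   Bit 3 = magnetic disk (highest), bit 0 = keyboard (lowest). */
#include <stdint.h>

uint8_t interrupt_reg;  /* one bit set per requesting device   */
uint8_t mask_reg;       /* programmable enable bit per source  */

/* Returns the encoder output for four sources, or -1 if no unmasked
   request is pending (IST stays 0). */
int priority_encode(void)
{
    uint8_t active = interrupt_reg & mask_reg;   /* per-bit AND gates     */
    for (int bit = 3; bit >= 0; bit--)           /* highest priority first */
        if (active & (1u << bit))
            return bit;                          /* encoder output        */
    return -1;
}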
Initial and Final Operations
Each interrupt service routine must have an initial and final set of operations for controlling
the registers in the hardware interrupt system.
The initial sequence of each interrupt service routine must have instructions to control the
interrupt hardware in the following manner:
1. Clear lower-level mask register bits.
2. Clear interrupt status bit IST.
3. Save contents of processor registers.
4. Set interrupt enable bit IEN.
5. Proceed with service routine.
The final sequence in each interrupt service routine must have instructions
to control the interrupt hardware in the following manner:
1. Clear interrupt enable bit IEN.
2. Restore contents of processor registers.
3. Clear the bit in the interrupt register belonging to the source that has been serviced.
4. Set lower-level priority bits in the mask register.
5. Restore return address into PC and set IEN.
The bit in the interrupt register belonging to the source of the interrupt must be cleared so that it will be available again for the source to interrupt. The lower-priority bits in the mask register (including the bit of the source being serviced) are set so that they can enable their interrupts again. The return to the interrupted program is accomplished by restoring the return address to the PC.
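A skeleton of a service routine that follows the initial and final sequences might look like the C sketch below; every helper routine is a hypothetical stand-in for machine-specific instructions.

extern void clear_mask_bits(unsigned bits);
extern void set_mask_bits(unsigned bits);
extern void clear_ist(void);
extern void enable_ien(void);
extern void clear_ien(void);
extern void save_registers(void);
extern void restore_registers(void);
extern void clear_interrupt_bit(int source);
extern void return_with_ien(void);   /* restore PC from stack and set IEN */

void service_routine(int source, unsigned lower_levels)
{
    /* initial sequence */
    clear_mask_bits(lower_levels);   /* 1. disable lower-priority sources    */
    clear_ist();                     /* 2. clear interrupt status bit IST    */
    save_registers();                /* 3. save processor state              */
    enable_ien();                    /* 4. allow higher-priority interrupts  */

    /* 5. ... body of the service routine ... */

    /* final sequence */
    clear_ien();                     /* 1. block interrupts during cleanup     */
    restore_registers();             /* 2. restore processor state             */
    clear_interrupt_bit(source);     /* 3. make the source available again     */
    set_mask_bits(lower_levels);     /* 4. re-enable lower-priority sources    */
    return_with_ien();               /* 5. restore PC and set IEN              */
}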

Direct Memory Access (DMA):


In Direct Memory Access (DMA), the interface transfers data into and out of the memory unit through the memory bus. The transfer of data between a fast storage device such as a magnetic disk and memory is often limited by the speed of the CPU. Removing the CPU from the path and letting the peripheral device manage the memory buses directly improves the speed of transfer. This transfer technique is called Direct Memory Access (DMA).
During the DMA transfer, the CPU is idle and has no control of the memory buses. A
DMA Controller takes over the buses to manage the transfer directly between the I/O
device and memory.

The CPU may be placed in an idle state in a variety of ways. One common method, used extensively in microprocessors, is to disable the buses through special control signals such as:
Bus Request (BR)
Bus Grant (BG)
These two control signals in the CPU facilitate the DMA transfer. The Bus Request (BR) input is used by the DMA controller to request that the CPU relinquish the buses. When this input is active, the CPU terminates the execution of the current instruction and places the address bus, data bus, and read/write lines into a high-impedance state. High-impedance state means that the output is disconnected. The CPU activates the Bus Grant (BG) output to inform the external DMA controller that it can now take control of the buses to conduct memory transfers without processor intervention.
When the DMA terminates the transfer, it disables the Bus Request (BR) line. The CPU then disables the Bus Grant (BG) line, takes back control of the buses, and returns to normal operation. The transfer can be made in two ways:
i. DMA Burst
ii. Cycle Stealing
i) DMA Burst: In a DMA burst transfer, a block sequence consisting of a number of memory words is transferred in a continuous burst while the DMA controller is master of the memory buses.
ii) Cycle Stealing: Cycle stealing allows the DMA controller to transfer one data word at a time, after which it must return control of the buses to the CPU.
DMA Controller:
The DMA controller needs the usual circuits of an interface to communicate with the CPU and
I/O device. The DMA controller has three registers:
i. Address Register
ii. Word Count Register
iii. Control Register
i. Address Register: Contains an address specifying the desired location in memory; it is initialized with the starting address of the transfer.
ii. Word Count Register: Holds the number of words to be transferred. The register is incremented or decremented by one after each word transfer and internally tested for zero.
iii. Control Register: Specifies the mode of transfer. The unit communicates with the CPU via the data bus and control lines.
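Seen from the CPU, the three DMA registers might be programmed as in the following C sketch; the register layout and control bits are illustrative, not those of any particular controller.

#include <stdint.h>

/* Register-level view of a simple DMA controller (layout is illustrative). */
struct dma_controller {
    volatile uint32_t address;     /* next memory location to access         */
    volatile uint32_t word_count;  /* words remaining; tested for zero       */
    volatile uint32_t control;     /* transfer mode: direction, burst/steal  */
};

/* Assumed control bits; real bit assignments vary by device. */
#define DMA_START     0x1u
#define DMA_WRITE_MEM 0x2u   /* device-to-memory direction */

void dma_setup(struct dma_controller *dma, uint32_t start_addr, uint32_t nwords)
{
    dma->address    = start_addr;                 /* starting address in memory */
    dma->word_count = nwords;                     /* number of words to move    */
    dma->control    = DMA_WRITE_MEM | DMA_START;  /* program the mode and go    */
}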

DMA Transfer:
The registers in the DMA are selected by the CPU through the address bus by enabling the DS
(DMA select) and RS (Register select) inputs. The RD (read) and WR (write) inputs are
bidirectional. When the BG (Bus Grant) input is 0, the CPU can communicate with the DMA
registers through the data bus to read from or write to the DMA registers. When BG =1, the
DMA can communicate directly with the memory by specifying an address in the address bus
and activating the RD or WR control.

The CPU communicates with the DMA through the address and data buses as with any
interface unit. The DMA has its own address, which activates the DS and RS lines. The CPU
initializes the DMA through the data bus. Once the DMA receives the start control command, it can transfer data between the peripheral and the memory. When BG = 0, the RD and WR are
input lines allowing the CPU to communicate with the internal DMA registers. When BG=1,
the RD and WR are output lines from the DMA controller to the random access memory to
specify the read or write operation of data.

Bus
A bus is a communication system that transfers data between components inside a computer,
or between computers.
A bus is a group of conducting wires that carries information; all the peripherals are connected to the microprocessor through the bus.

[Figure: bus organization of the 8085 microprocessor.]

Bus Arbitration:
The device that is allowed to initiate data transfers on the bus at any given time is called the
bus master. Arbitration is the process by which the next device to become the bus master
is selected and bus mastership is transferred to it. The two approaches are centralized and
distributed arbitrations.
i) Centralized approach: A hardware device called the bus controller or bus arbiter allocates the bus. It uses one of the following schemes:
(1) Daisy chaining
(2) Polling
ii) Distributed approach: Each master contains its own arbitration logic, rather than relying on a single central arbiter. Equal responsibility is given to all devices to carry out the arbitration process.

The primary function of the bus is to provide a communication path for the transfer of data.
Bus protocols govern the data, address, and control lines, and a variety of schemes have been devised for the timing of data transfers over a bus.
They are the synchronous and asynchronous schemes.
Synchronous bus
All devices derive timing information from a common clock line. Equally spaced pulses
on this line define equal time intervals. Each of these intervals constitutes a bus cycle
during which one data transfer can take place.

Asynchronous bus

This is a scheme based on the use of a handshake between the master and the slave for
controlling data transfers on the bus. The common clock is replaced by two timing
control lines, master-ready and slave-ready. The first is asserted by the master to indicate
that it is ready for a transaction and the second is a response from the slave. The master places
the address and command information on the bus. It indicates to all devices that it has done
so by activating the master-ready line. This causes all devices on the bus to decode the address.
The selected slave performs the required operation and informs the processor it has done so
by activating the slave-ready line. This handshake controls data transfers during both input and output operations. The master waits for slave-ready to become asserted before it removes its signals from the bus. The handshake signals are fully interlocked: a change of state in one signal is followed by a change in the other signal. Hence this scheme is known as a full handshake.

Interface Circuits
An I/O interface consists of the circuitry required to connect an I/O device to a computer
bus. On one side of the interface, we have bus signals. On the other side, we have a data path
with its associated controls to transfer data between the interface and the I/O device – port.
We have two types:
Serial port and
Parallel port
A parallel port transfers data in the form of a number of bits (8 or 16) simultaneously to
or from the device. A serial port transmits and receives data one bit at a time.
Communication with the bus is the same for both formats. The conversion from the parallel to
the serial format, and vice versa, takes place inside the interface circuit. For a parallel port, the connection between the device and the computer uses a multiple-pin connector and a cable with as many wires. This arrangement is suitable for devices that are physically close to the computer. A serial port is much more convenient and cost-effective where longer cables are needed.

Typically, the functions of an I/O interface are:


• Provides a storage buffer for at least one word of data
• Contains status flags that can be accessed by the processor to determine whether the buffer
is full or empty
• Contains address-decoding circuitry to determine when it is being addressed by the processor
• Generates the appropriate timing signals required by the bus control scheme
• Performs any format conversion that may be necessary to transfer data between the bus and
the I/O device, such as parallel-serial conversion in the case of a serial port

Standard I/O interfaces: Consider a computer system using different interface standards. The buses of the different standards are interconnected by a circuit called a bridge. The three major standard I/O interfaces discussed here are:
– PCI (Peripheral Component Interconnect)
– SCSI (Small Computer System Interface)
– USB (Universal Serial Bus)
PCI (Peripheral Component Interconnect)
The main memory and the PCI bridge are connected to the disk, printer, and Ethernet interfaces through the PCI bus. At any given time, one device is the bus master. It has the right to initiate data transfers by issuing read and write commands. A master is called an initiator in PCI terminology; this is either the processor or a DMA controller. The addressed device that responds to read and write commands is called a target. A complete transfer operation on the bus, involving an address and a burst of data, is called a transaction.

SCSI (Small Computer System Interface) Bus

It is a standard bus defined by the American National Standards Institute (ANSI). A controller
connected to a SCSI bus is an initiator or a target. The processor sends a command to the SCSI
controller, which causes the following sequence of events to take place:
• The SCSI controller contends for control of the bus (initiator).
• When the initiator wins the arbitration process, it selects the target controller and hands over
control of the bus to it.
• The target starts an output operation; in response, the initiator sends a command specifying the required read operation.
• The target sends a message to the initiator indicating that it will temporarily suspend the connection between them. Then it releases the bus.
• The target controller sends a command to the disk drive to move the read head to the first
sector involved in the requested read operation.
• The target transfers the contents of the data buffer to the initiator and then suspends the
connection again.
• The target controller sends a command to the disk drive to perform another seek operation.
• As the initiator controller receives the data, it stores them into the main memory using the
DMA approach.
• The SCSI controller sends an interrupt to the processor to inform it that the requested
operation has been completed.
The bus signals, arbitration, selection, information transfer and reselection are the topics
discussed in addition to the above.
For connecting different devices to a computer different buses are used. Each bus typically has
a different data transfer speed.

1) ISA (Industry Standard Architecture) bus: ISA bus was created by IBM in 1981. ISA bus can
transfer 8 or 16 bits at one time. ISA 8 bit bus can run at 4.77 MHz and 16 bit at 8.33 MHz.

2) IDE (Integrated Drive Electronics) bus: IDE bus is used for connecting disks and CDROMs to
the computer.

3) USB (Universal Serial Bus): It is used for connecting the keyboard, mouse, and other USB devices to the computer. A USB bus has a connector with four wires, two of which supply electrical power to the USB devices. USB 1.0 has a data rate of 1.5 MB/s, and high-speed USB 2.0 has a data rate of 60 MB/s (480 Mbit/s).

4) IEEE 1394 or FireWire: IEEE 1394 is used for high-speed data transfer. It can transfer data at rates of up to 400 Mbit/s. It is a bit-serial bus used for connecting cameras and other multimedia devices.
Hardware multithreading
Operating system (OS) software enables multitasking of different programs in the
same processor by performing context switches among programs. A program,
together with any information that describes its current state of execution, is
observed by the OS as an entity called a process. Each process has at least one thread, which is an independent path of execution within a
program. More precisely, the term thread is used to refer to a thread of control
whose state consists of the contents of the program counter and other
processor registers.
It is possible for multiple threads to execute portions of one program and run in
parallel as if they correspond to separate programs. Two or more threads can be
running on different processors, executing either the same part of a program on
different data, or executing different parts of a program. Threads for different
programs can also execute on different processors. All threads that are part of a
single program run in the same address space and are associated with the
same process. To deal with multiple threads efficiently, a processor is
implemented with several identical sets of registers, including multiple
program counters. Each set of registers can be dedicated to a different thread.
Thus, no time is wasted during a context switch to save and restore register
contents. The processor is said to be using a technique called hardware
multithreading. With multiple sets of registers, context switching is simple and
fast. All that is necessary is to change a hardware pointer in the processor to use
a different set of registers to fetch and execute subsequent instructions. Switching
to a different thread can be completed within one clock cycle. The state of the
previously active thread is preserved in its own set of registers.
Switching to a different thread may be triggered at any time by the occurrence of
a specific event, rather than at the end of a fixed time interval. For example, a cache
miss may occur when a Load or Store instruction is being executed for the active
thread. Instead of stalling while the slower main memory is accessed to service the
cache miss, a processor can quickly switch to a different thread and continue
to fetch and execute other instructions. This is called coarse-grained
multithreading because many instructions may be executed for one thread before
an event such as a cache miss causes a switch to another thread.
An alternative to switching between threads on specific events is to switch after
every instruction is fetched. This is called fine-grained or interleaved
multithreading. The intent is to increase the processor throughput. Each new
instruction is independent of its predecessors from other threads. This should
reduce the occurrence of stalls due to data dependencies.
Thus, throughput may be increased by interleaving instructions from many threads,
but it takes longer for a given thread to complete all of its instructions.

VECTOR PROCESSING (SIMD)

A vector processor is a computer with the ability to process vectors and related data structures, such as matrices and multi-dimensional arrays, much faster than conventional computers. Vector processors may also be pipelined.
To achieve the required level of high performance it is necessary to utilize the
fastest and most reliable hardware and apply innovative procedures from
vector and parallel processing techniques.
Computers with vector processing capabilities are in demand in specialized
applications.
The following are representative application areas where vector processing is of importance:

Long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert systems, and image processing.
A vector is an ordered set, a one-dimensional array of data items. A vector V of length n is represented as a row vector by V = [V1 V2 V3 ... Vn].
It may be represented as a column vector if the data items are listed in a column. A conventional sequential computer is capable of processing operands one at a time. Consequently, operations on vectors must be broken down into single computations with subscripted variables.
The element Vi of vector V is written as V(I), where the index I refers to a memory address or register where the number is stored.
To examine the difference between a conventional scalar processor and a vector
processor, consider the following
Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
This is a program for adding two vectors A and B of length 100 to produce a vector C. It is implemented in machine language by the following sequence of operations:
    Initialize I = 0
20  Read A(I)
    Read B(I)
    Store C(I) = A(I) + B(I)
    Increment I = I + 1
    If I <= 100 go to 20
    Continue
This constitutes a program loop that reads a pair of operands from arrays A and B
and performs a floating-point addition. The loop control variable is then updated
and the steps repeat 100 times. A computer capable of vector processing
eliminates the overhead associated with the time it takes to fetch and execute the
instructions in the program loop. It allows operations to be specified with a
single vector instruction of the form-

C(1 : 100) = A(1 : 100) + B(1 : 100)


The vector instruction includes the initial address of the operands, the length
of the vectors, and the operation to be performed, all in one composite
instruction. The addition is done with a pipelined floating-point adder.
This is essentially a three-address instruction with three fields specifying the base
address of the operands and an additional field that gives the length of the data
items in the vectors. This assumes that the vector operands reside in memory. It is
also possible to design the processor with a large number of registers and store all
operands in registers prior to the addition operation. In that case the base address
and length in the vector instruction specify a group of CPU registers.
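For comparison, the scalar C loop equivalent to the single vector instruction is shown below; a vectorizing compiler or SIMD hardware can execute it as one pipelined vector operation rather than 100 separate fetch/execute cycles.

/* Scalar C equivalent of C(1:100) = A(1:100) + B(1:100). */
#define N 100

void vector_add(const double A[N], const double B[N], double C[N])
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];   /* no loop-carried dependence: fully vectorizable */
}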

Matrix Multiplication
Matrix multiplication is one of the most computationally intensive operations performed in computers with vector processors. The multiplication of two n x n matrices consists of n^2 inner products or n^3 multiply-add operations.
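The n^3 structure is visible in the conventional triple-loop formulation, sketched here in C for a small n.

/* Each of the n^2 entries of the product is an inner product of length n,
   giving n^3 multiply-add operations in total. */
#define N 4   /* small n for illustration */

void matmul(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;               /* one inner product          */
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];   /* n multiply-adds per entry  */
            c[i][j] = sum;
        }
}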
Shared-Memory Multiprocessor
In a shared-memory multiprocessor, all processors have access to the same
memory. Tasks running in different processors can access shared variables in
the memory using the same addresses. The size of the shared memory is likely
to be large. Implementing a large memory in a single module would create a
bottleneck when many processors make requests to access the memory
simultaneously. This problem is alleviated by distributing the memory multiple
modules so that simultaneous requests from different processors are more likely to
access different memory modules, depending on the addresses of those requests.

A shared-memory multiprocessor is an architecture consisting of a modest


number of processors, all of which have direct (hardware) access to all the
main memory in the system. This permits any of the system processors to
access data that any of the other processors has created or will use. The key
to this form of multiprocessor architecture is the interconnection
network that directly connects all the processors to the memories.
This is complicated by the need to retain cache coherence across all caches of all
processors in the system.

A multiprocessor system is an interconnection of two or more CPUs with memory


and input-output equipment. The term "processor" in multiprocessor can mean
either a central processing unit (CPU) or an input-output processor (IOP).
However, a system with a single CPU and one or more IOPs is usually not included
in the definition of a multiprocessor system unless the IOP has computational
facilities comparable to a CPU. As it is most commonly defined, a multiprocessor system implies the existence of multiple CPUs, although usually there will be one or more IOPs as well. Multiprocessors are classified as multiple instruction stream, multiple data stream (MIMD) systems.
There are some similarities between multiprocessor and multicomputer systems
since both support concurrent operations. However, there exists an important
distinction between a system with multiple computers and a system with multiple
processors.
Multiprocessors are classified by the way their memory is organized. A
multiprocessor system with common shared memory is classified as a shared
memory or tightly coupled multiprocessor. This does not prevent each processor
from having its own local memory. In fact, most commercial tightly coupled
multiprocessors provide a cache memory with each CPU. In addition, there is
a global common memory that all CPUs can access. Information can therefore
be shared among the CPUs by placing it in the common global memory.
An alternative model of multiprocessor is the distributed-memory or loosely coupled system. Each processor element in a loosely coupled system has its own
private local memory. The processors are tied together by a switching scheme
designed to route information from one processor to another through a message-
passing scheme. The processors relay program and data to other processors in
packets. A packet consists of an address, the data content, and some error detection
code. The packets are addressed to a specific processor or taken by the first
available processor, depending on the communication system used.
Cache Coherence
For higher performance in a multiprocessor system, each processor will usually
have its own cache. Cache coherence refers to the problem of keeping the
data in these caches consistent. The main problem is dealing with writes by a
processor.
There are two general strategies for dealing with writes to a cache:

Write-through - all data written to the cache is also written to memory at the same
time.

Write-back - when data is written to a cache, a dirty bit is set for the affected block.
The modified block is written to memory only when the block is replaced.
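A toy model in C of the two write policies for a single cache block, with the dirty bit driving the write-back behavior; real caches operate on multi-word blocks, so this is a sketch of the policies only.

#include <stdint.h>

struct cache_block {
    uint32_t data;
    int      dirty;    /* used only by the write-back policy */
};

uint32_t memory_word;  /* the backing memory location */

void write_through(struct cache_block *blk, uint32_t value)
{
    blk->data   = value;
    memory_word = value;   /* memory is updated on every write */
}

void write_back(struct cache_block *blk, uint32_t value)
{
    blk->data  = value;
    blk->dirty = 1;        /* memory updated only when the block is replaced */
}

void evict(struct cache_block *blk)
{
    if (blk->dirty) {
        memory_word = blk->data;   /* copy the modified block back to memory */
        blk->dirty  = 0;
    }
}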

In a shared memory multiprocessor system, all the processors share a


common memory. In addition, each processor may have a local memory, part
or all of which may be a cache. The primary reason for having separate caches for each processor is to reduce the average access time in each processor. The
same information may reside in a number of copies in some caches and main
memory. To ensure the ability of the system to execute memory operations
correctly, the multiple copies must be kept identical. This requirement
imposes a cache coherence problem. Cache coherence problems exist in
multiprocessors with private caches because of the need to share writable
data. Read-only data can safely be replicated without cache coherence
enforcement mechanisms. To illustrate the problem, consider a three-processor configuration.
Sometime during the operation an element X from main memory is loaded into the
three processors, P1, P2, and P3. As a consequence, it is also copied into the private
caches of the three processors. For simplicity, we assume that X contains the value
of 52. The load on X to the three processors results in consistent copies in the
caches and main memory. If one of the processors performs a store to X, the copies
of X in the caches become inconsistent. A load by the other processors will not
return the latest value. Depending on the memory update policy used in the cache,
the main memory may also be inconsistent with respect to the cache. A store to X
(of the value of 120) into the cache of processor P1 updates memory to the new
value in a write-through policy. A write-through policy maintains consistency
between memory and the originating cache, but the other two caches are
inconsistent since they still hold the old value. In a write-back policy, main
memory is not updated at the time of the store. The copies in the other two
caches and main memory are inconsistent. Memory is updated eventually
when the modified data in the cache are copied back into memory. Another
configuration that may cause consistency problems is a direct memory access
(DMA) activity in conjunction with an IOP connected to the system bus. In
the case of input, the DMA may modify locations in main memory that also
reside in cache without updating the cache. During a DMA output, memory
locations may be read before they are updated from the cache when using a
write-back policy. IO-based memory incoherence can be overcome by making
the IOP a participant in the cache coherent solution that is adopted in the
system.

A simple scheme is to disallow private caches for each processor and have a
shared cache memory associated with main memory. Every data access is made
to the shared cache. This method violates the principle of closeness of CPU to
cache and increases the average memory access time. In effect, this scheme
solves the problem by avoiding it. For performance considerations it is desirable
to attach a private cache to each processor. One scheme that has been used
allows only non shared and read-only data to be stored in caches. Such items
are called cachable. Shared writable data are noncachable. The compiler
must tag data as either cachable or non cachable, and the system hardware
makes sure that only cachable data are stored in caches. The noncachable
data remain in main memory. This method restricts the type of data stored in
caches and introduces an extra software overhead that may degrade performance.
A scheme that allows writable data to exist in at least one cache is a method
that employs a centralized global table in its compiler. The status of memory
blocks is stored in the central global table. Each block is identified as read only
(RO) or read and write (RW). All caches can have copies of blocks identified as
RO. Only one cache can have a copy of an RW block. Thus if the data are
updated in the cache with an RW block, the other caches are not affected because
they do not have a copy of this block. The cache coherence problem can be
solved by means of a combination of software and hardware or by means of
hardware-only schemes. The two methods mentioned previously use software
based procedures that require the ability to tag information in order to disable
caching of shared writable data. In the hardware solution, the cache controller
is specially designed to allow it to monitor all bus requests from CPUs and
IOPs. All caches attached to the bus constantly monitor the network for possible
write operations. Depending on the method used, they must then either update
or invalidate their own cache copies when a match is detected. The bus
controller that monitors this action is referred to as a snoopy cache controller. This is basically a hardware unit designed to maintain a bus-watching mechanism over all the caches attached to the bus. Various schemes have been proposed to solve the cache coherence problem by means of a snoopy cache protocol. The simplest method is to adopt a write-through policy and use the following procedure. All the snoopy controllers watch the bus for memory store operations. When a word in a cache is updated by writing into it, the corresponding location in main memory is also updated. The local snoopy controllers in all other caches check their memory to determine if they have a copy of the word that has been overwritten. If a copy exists in a remote cache, that location is marked invalid. Because all caches snoop on all bus writes, whenever a word is written, the net effect is to update it in the original cache and main memory and remove it from all other caches. In this way, inconsistent versions are prevented.
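The write-through-invalidate procedure can be modeled in C as follows; the cache organization (direct-mapped, one word per line) is a simplifying assumption made for the sketch.

#define NPROC  3
#define NLINES 8

struct line { int valid; unsigned addr; unsigned data; };
struct line cache[NPROC][NLINES];
unsigned memory[1 << 16];

/* On every bus write, all other snoopy controllers invalidate a matching copy. */
void bus_write(int writer, unsigned addr, unsigned value)
{
    memory[addr] = value;                        /* write-through to memory     */
    cache[writer][addr % NLINES] =
        (struct line){ 1, addr, value };         /* update the originating cache */
    for (int p = 0; p < NPROC; p++) {            /* every controller snoops      */
        if (p == writer) continue;
        struct line *l = &cache[p][addr % NLINES];
        if (l->valid && l->addr == addr)
            l->valid = 0;                        /* invalidate the stale copy    */
    }
}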

Message-Passing Multicomputers:
A different way of using multiple processors involves implementing each node
in the system as a complete computer with its own memory. Other computers in
the system do not have direct access to this memory. Data that need to be shared
are exchanged by sending messages from one computer to another. Such systems
are called message-passing multicomputers. Parallel programs are written
differently for message-passing multicomputers than for shared-memory
multiprocessors. To share data between nodes, the program running in the
computer that is the source of the data must send a message containing the data to
the destination computer. The program running in the destination computer
receives the message and copies the data into the memory of that node. To
facilitate message passing, a special communications unit at each node is often
responsible for the low-level details of formatting and interpreting messages
that are sent and received, and for copying message data to and from the
memory of the node. The computer in each node issues commands to the
communications unit. The computer then continues performing other
computations while the communications unit handles the details of sending and
receiving messages.
Parallel Programming for Multiprocessors
Programming for a shared-memory multiprocessor is a natural extension of
conventional programming for a single processor. A high-level source program is
written using tasks that are executed by one processor. But it is also possible to
indicate that certain tasks are to be executed simultaneously in different processors.
Sharing of data is achieved by defining global variables that are read and written
by different processors as they perform their assigned tasks. To illustrate parallel
programming, we consider the example of computing the dot product of two
vectors, each containing N numbers. The details of initializing the contents of the
two vectors are omitted to focus on the aspects relevant to parallel programming.
The loop accumulates the sum of N products. Each pass depends on the partial sum
computed in the preceding pass, and the result computed in the final pass is the dot
product. Despite the dependency, it is possible to partition the program into
independent tasks for simultaneous execution by exploiting the associative
property of addition. Each task computes a partial sum, and the final result is
obtained by adding the partial sums.
#include <stdio.h>          /* Routines for input/output. */
#define N 100               /* Number of elements in each vector. */
double a[N], b[N];          /* Vectors for computing the dot product. */
void main (void)
{
    int i;
    double dot_product;
    <Initialize vectors a[], b[] - details omitted.>
    dot_product = 0.0;
    for (i = 0; i < N; i++)
        dot_product = dot_product + a[i] * b[i];
    printf ("The dot product is %g\n", dot_product);
}
Figure 12.7 C program for computing a dot product.

To implement a parallel program for computing the dot product, two steps are needed:
1. Thread Creation
To make multiple processors participate in parallel execution and compute the partial sums, we define the tasks that are assigned to different processors, and then we describe how execution of these tasks is initiated in multiple processors. We
can write a parallel version of the dot product program using parameters for the
number of processors, P, and the number of elements in each vector, N. We assume
for simplicity that N is evenly divisible by P. The overall computation involves a
sum of N products. For P processors, we define P independent tasks, where each
task is the computation of a partial sum of N/P products. When a program is
executed in a single processor, there is one active thread of execution control. This
thread is created implicitly by the operating system (OS) when execution of the
program begins. For a parallel program, we require the independent tasks to
be handled separately by multiple threads of execution control, one for each
processor. These threads must be created explicitly. A typical approach is to use
a routine named create_thread in a library that supports parallel
programming. The library routine accepts an input parameter, which is a
pointer to a subroutine to be executed by the newly created thread. An
operating system service is invoked by the library routine to create a new
thread with a distinct stack, so that it may call other subroutines and have its
own local variables. All global variables are shared among all threads.
It is necessary to distinguish the threads from each other. One approach is to
provide another library routine called get_my_thread_id that returns a
unique integer between 0 and P − 1 for each thread. With that information, a
thread can determine the appropriate subset of the overall computation for which
it is responsible.
2. Thread Synchronization
We ensure that each processor has computed its partial sum before the final result
for the dot product is computed. Synchronization of multiple threads is therefore
required. There are several methods of synchronization, and they are often
implemented in additional library routines for parallel programming. Here, we
consider one method called a barrier.
The purpose of a barrier is to force threads to wait until they have all reached
a specific point in the program where a call is made to the library routine for
the barrier. Each thread that calls the barrier routine enters a busy-wait loop
until the last thread calls the routine and enables all of the threads to continue
their execution. This ensures that the threads have completed their respective
computations preceding the barrier call.
Example Parallel Program
Having described the issues related to thread creation and synchronization, and
typical library routines that are provided for thread management, we can now
present a parallel dot product program as an example. The program below shows a main routine and another routine called ParallelFunction that defines the independent tasks for parallel execution.

#include <stdio.h>          /* Routines for input/output. */
#include "threads.h"        /* Routines for thread creation/synchronization. */
#define N 100               /* Number of elements in each vector. */
#define P 4                 /* Number of processors for parallel execution. */
double a[N], b[N];          /* Vectors for computing the dot product. */
double partial_sums[P];     /* Array of results computed by threads. */
Barrier bar;                /* Shared variable to support barrier synchronization. */
void ParallelFunction (void)
{
    int my_id, i, start, end;
    double s;
    my_id = get_my_thread_id ();    /* Get unique identifier for this thread. */
    start = (N/P) * my_id;          /* Determine start/end using thread identifier. */
    end = (N/P) * (my_id + 1) - 1;  /* N is assumed to be evenly divisible by P. */
    s = 0.0;
    for (i = start; i <= end; i++)
        s = s + a[i] * b[i];
    partial_sums[my_id] = s;        /* Save result in array. */
    barrier (&bar, P);              /* Synchronize with other threads. */
}
void main (void)
{
    int i;
    double dot_product;
    <Initialize vectors a[], b[] - details omitted.>
    init_barrier (&bar);
    for (i = 1; i < P; i++)         /* Create P - 1 additional threads. */
        create_thread (ParallelFunction);
    ParallelFunction();             /* Main thread also joins parallel execution. */
    dot_product = 0.0;              /* After barrier synchronization, compute final result. */
    for (i = 0; i < P; i++)
        dot_product = dot_product + partial_sums[i];
    printf ("The dot product is %g\n", dot_product);
}
Figure: Parallel program in C for computing a dot product.

When the program begins executing, there is only one thread executing the main routine. This thread
initializes the vectors, then it initializes a shared variable needed for barrier
synchronization. To initiate parallel execution, the create_thread routine is called
P − 1 times from the main routine to create additional threads that each execute
ParallelFunction. Then, the thread executing the main routine calls
ParallelFunction directly so that a total of P threads are involved in the overall
computation. The operating system software is responsible for distributing the
threads to different processors for parallel execution. Each thread calls
get_my_thread_id from ParallelFunction to obtain a unique integer identifier in the
range 0 to P − 1. Using this information, the thread calculates the start and end
indices for the loop that generates the partial sum of that thread. After executing
the loop, it writes the result to a separate element of the shared partial_sums array
using its unique identifier as the array index. Then, the thread calls the library
routine for barrier synchronization to wait for other threads to complete their
computation. After the last thread to complete its computation calls the barrier
routine, all threads return to ParallelFunction. There is no further computation to
perform in ParallelFunction, so the P − 1 threads created by the library call in the
main routine terminate. The thread that called ParallelFunction directly from the
main routine returns to compute the final result using the values in the partial_sums
array. A large collection of routines for parallel programming in the C language is defined in the IEEE 1003.1 standard. This collection is also known as the POSIX threads or Pthreads library. It provides a variety of thread
management and synchronization mechanisms. Implementations of this library are
available for widely used operating systems to facilitate programming for
multiprocessors.
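As an illustration, the dot-product program can be rewritten with actual Pthreads calls; pthread_create and the pthread_barrier routines replace the assumed create_thread, get_my_thread_id, and barrier library used above. This is a sketch, not the textbook's code.

#include <stdio.h>
#include <pthread.h>

#define N 100
#define P 4

double a[N], b[N], partial_sums[P];
pthread_barrier_t bar;

void *parallel_function(void *arg)
{
    int my_id = (int)(long) arg;           /* thread identifier 0 .. P-1 */
    int start = (N / P) * my_id;
    int end   = (N / P) * (my_id + 1) - 1;
    double s  = 0.0;
    for (int i = start; i <= end; i++)
        s += a[i] * b[i];
    partial_sums[my_id] = s;
    pthread_barrier_wait(&bar);            /* synchronize with other threads */
    return NULL;
}

int main(void)
{
    pthread_t threads[P];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }   /* sample data */
    pthread_barrier_init(&bar, NULL, P);
    for (long i = 1; i < P; i++)           /* create P - 1 additional threads */
        pthread_create(&threads[i], NULL, parallel_function, (void *) i);
    parallel_function((void *) 0);         /* main thread also joins the work */
    for (long i = 1; i < P; i++)
        pthread_join(threads[i], NULL);
    double dot_product = 0.0;              /* combine the partial sums */
    for (int i = 0; i < P; i++)
        dot_product += partial_sums[i];
    printf("The dot product is %g\n", dot_product);
    return 0;
}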

Performance Modeling
The most important measure of the performance of a computer is how quickly it
can execute programs. When considering one processor, the speed with which
instructions are fetched and executed is affected by the instruction set architecture
and the hardware design. The total number of instructions that are executed is
affected by the compiler as well as the instruction set architecture. The terms in the basic performance model include the number of instructions executed, the average number of cycles per instruction, and the clock frequency. This model enables the prediction of execution time, given sufficiently detailed information. A higher-level model that relies on less detailed information can be used to assess potential improvements in performance. Consider a program whose execution time on some computer is Torig. Our objective is to assess the extent to which the execution time can be reduced when a performance enhancement, such as parallel processing, is introduced. Suppose that a fraction fenh of the original execution time is affected by the enhancement and is reduced by a factor p, while the remaining fraction funenh = 1 - fenh is not enhanced.

The enhanced execution time is then Tnew = Torig x (funenh + fenh/p), and the overall speedup is the ratio of the execution times:

Speedup = Torig / Tnew = 1 / (funenh + fenh/p)

This expression for speedup is known as Amdahl's Law.


Once a breakdown of the original execution time has been determined, it is often
useful to determine an upper bound on the possible speedup. To do so, we let
p→∞ to reflect the ideal, but unrealistic, reduction of the fraction fenh of execution
time to zero. The resulting speedup is 1/funenh, which means that the portion of
execution time that is not enhanced is the limiting factor on performance. A
smaller value of funenh gives a larger bound on the speedup.
For example,
funenh = 0.1 gives an upper bound of 10, but funenh = 0.05 gives a larger bound of 20.
However, the expected speedup using a realistic value of p is normally well below
the upper bound. For example, using p = 16 with funenh = 0.05 gives a speedup of
only 1/(0.05 + 0.95/16) = 9.1, well below the upper bound of 20.
The important conclusion from this discussion is that the unenhanced portion of
the original execution time can significantly limit the achievable speedup, even if
the enhanced portion is improved by an arbitrarily large factor.
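The numbers above can be checked with a few lines of C implementing the speedup formula.

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (f_unenh + f_enh / p), with f_enh = 1 - f_unenh. */
double speedup(double f_unenh, double p)
{
    return 1.0 / (f_unenh + (1.0 - f_unenh) / p);
}

int main(void)
{
    printf("f_unenh = 0.05, p = 16 : %.1f\n", speedup(0.05, 16.0));  /* 9.1  */
    printf("f_unenh = 0.05, p -> inf: %.1f\n", 1.0 / 0.05);          /* 20.0 */
    return 0;
}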

Reference:
Carl Hamacher, Zvonko Vranesic, Safwat Zaky, Computer Organization, 5th edition, McGraw-Hill Higher Education.
