Module V
Devices that are under the direct control of the computer are said to be connected on-line.
These devices are designed to read information into or out of the memory unit upon command
from the CPU and are considered to be part of the total computer system.
Input or output devices attached to the computer are also called peripherals. Among the most
common peripherals are keyboards, display units, and printers.
Peripherals that provide auxiliary storage for the system are magnetic disks and tapes.
Peripherals are electromechanical and electromagnetic devices of some complexity. The I/O
subsystem of the computer provides an efficient mode of communication between the central
system and the outside environment.
Accessing I/O Devices:
A simple arrangement for connecting I/O devices to a computer is to use a single bus, as shown in
the figure above. Each I/O device is assigned a unique set of addresses. When the processor places a
particular address on the address lines, the device that recognizes this address responds to the
commands issued on the control lines. The processor requests either a read or a write operation, and the
data are transferred over the data lines.
When I/O devices and the memory share the same address space, the arrangement is called memory-
mapped I/O. With memory-mapped I/O, any machine instruction that can access memory can be used to
transfer data to or from an I/O device. Most computer systems use memory-mapped I/O.
IO versus Memory Bus
In addition to communicating with I/O, the processor must communicate with the memory unit. Like the
IO bus, the memory bus contains data, address, and read/write control lines. There are three ways that
computer buses can be used to communicate with memory and I/O:
1. Use two separate buses, one for memory and the other for I/O.
2. Use one common bus for both memory and I/O but have separate control lines for each.
3. Use one common bus for memory and I/O with common control lines.
In the first method, the computer has independent sets of data, address, and control buses, one for accessing
memory and the other for I/O. This is done in computers that provide a separate I/O processor (IOP) in
addition to the central processing unit (CPU). The memory communicates with both the CPU and the IOP
through a memory bus. The IOP communicates also with the input and output devices through a separate I/O
bus with its own address, data and control lines. The purpose of the IOP is to provide an independent
pathway for the transfer of information between external devices and internal memory. The I/O
processor is sometimes called a data channel.
Isolated versus Memory-Mapped I/O
In isolated I/O, memory and I/O share a common bus (data and address lines) but have separate read and
write control lines for I/O. When the CPU decodes an instruction whose data is destined for I/O, it places
the address on the address lines and activates the I/O read or I/O write control line, which causes a data
transfer between the CPU and the I/O device. Because the address spaces of memory and I/O are kept
separate (isolated), the method gets its name. The I/O addresses in this scheme are called ports, and
there are different read and write instructions for I/O and for memory.
In the isolated I/O configuration, the CPU has distinct input and output instructions, and each of
these instructions is associated with the address of an interface register. When the CPU fetches and
decodes the operation code of an input or output instruction, it places the address associated with the
instruction into the common address lines. At the same time, it enables the I/O read (for input) or I/O write
(for output) control line. This informs the external components that are attached to the common bus that
the address in the address lines is for an interface register and not for a memory word.
When the CPU is fetching an instruction or an operand from memory, it places the memory address on
the address lines and enables the memory read or memory write control line. This informs the external
components that the address is for a memory word and not for an I/O interface. The isolated I/O method
isolates memory and I/O addresses so that memory address values are not affected by interface address
assignment since each has its own address space.
Memory-mapped I/O uses the same address space for both memory and I/O. This is the case in
computers that employ only one set of read and write signals and do not distinguish between memory
and I/O addresses. This configuration is referred to as memory-mapped I/O.
In this case every bus line is common, so the same set of instructions works for both memory and I/O.
I/O is therefore manipulated exactly like memory, and because both share one address space, the
addressing capability available to memory is reduced: part of it is occupied by the I/O.
The computer treats an interface register as being part of the memory system. The assigned addresses for
interface registers cannot be used for memory words, which reduces the memory address range available. In
a memory-mapped I/O organization there are no specific input or output instructions. The CPU can manipulate
I/O data residing in interface registers with the same instructions that are used to manipulate memory words.
Each interface is organized as a set of registers that respond to read and write requests in the normal address
space. Computers with memory-mapped I/O can use memory-type instructions to access I/O data. It allows
the computer to use the same instructions for either input-output transfers or for memory transfers.
The advantage is that the load and store instructions used for reading and writing from memory can be
used to input and output data from I/O registers. In a typical computer, there are more memory-reference
instructions than I/O instructions. With memory mapped I/O all instructions that refer to memory are also
available for I/O.
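The address-decoding idea behind memory-mapped I/O can be illustrated with a minimal Python sketch (not real hardware): one assumed address in a small address space is decoded to a hypothetical device register instead of RAM, so the same load and store operations serve both memory and I/O.

```python
# Sketch of memory-mapped I/O: the decoder, not the instruction,
# selects between RAM and a device's interface register.
MEM_SIZE = 256
DEVICE_REG_ADDR = 0xF0          # assumed address reserved for the interface register

class AddressSpace:
    def __init__(self):
        self.ram = [0] * MEM_SIZE
        self.device_reg = 0     # interface register of the I/O device

    def load(self, addr):
        # The same read operation works for memory and I/O.
        if addr == DEVICE_REG_ADDR:
            return self.device_reg
        return self.ram[addr]

    def store(self, addr, value):
        if addr == DEVICE_REG_ADDR:
            self.device_reg = value   # reaches the device, not RAM
        else:
            self.ram[addr] = value
```

A single pair of load/store instructions manipulates both RAM locations and the device register, which is exactly why no separate I/O instructions are needed.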
Peripherals connected to a computer need special communication links for interfacing them with the central
processing unit. The purpose of the communication link is to resolve the differences that exist between the
central computer and each peripheral. Input-output interface provides a method for transferring information
between internal storage and external I/O devices.
Asynchronous data transfer between two independent units requires that control signals be
transmitted between the communicating units to indicate the time at which data is being
transmitted. One way of achieving this is by means of a strobe pulse supplied by one of the
units to indicate to the other unit when the transfer has to occur.
Handshaking method: Another method commonly used is to accompany each data item
being transferred with a control signal that indicates the presence of data in the bus. The unit
receiving the data item responds with another control signal to acknowledge receipt of the data.
This type of agreement between two independent units is referred to as handshaking.
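The handshake described above can be sketched in Python with two simple objects; the signal names (data-valid, data-accepted) are illustrative, and real hardware would use actual control lines.

```python
# Sketch of a two-wire handshake: the source asserts data-valid with each
# item, and the destination responds with data-accepted.
class Destination:
    def __init__(self):
        self.received = []
        self.data_accepted = False

    def on_data_valid(self, data):
        self.received.append(data)   # latch the data item from the bus
        self.data_accepted = True    # acknowledge receipt

class Source:
    def transfer(self, dest, items):
        for item in items:
            dest.data_accepted = False
            dest.on_data_valid(item)   # assert data-valid with the item on the bus
            assert dest.data_accepted  # proceed only after acknowledgment
```

Each data item is accompanied by a control exchange, so neither unit needs to know the other's timing in advance.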
Modes of Transfer
Data transfer between the central computer and I/O devices may be handled in a variety of
modes. Some modes use the CPU as an intermediate path; others transfer the data directly to
and from the memory unit. Data transfer to and from peripherals may be handled in one of
three possible modes:
1. Programmed I/O
2. Interrupt-initiated I/O
3. Direct memory access (DMA)
1. Programmed I/O:
In this mode, data transfers take place as the result of I/O instructions that are part of the
computer program. Each data transfer is initiated by an instruction in the program.
In the programmed I/O method, the CPU stays in a program loop until the I/O unit
indicates that it is ready for data transfer. This is a time-consuming process since it keeps
the processor busy needlessly. Usually, the transfer is to and from a CPU register and
peripheral. Other instructions are needed to transfer the data to and from CPU and memory.
Transferring data under program control requires constant monitoring of the peripheral
by the CPU. Once a data transfer is initiated, the CPU must monitor the interface to see
when the next transfer can be made. Normally the transfer is between a CPU register and the
peripheral device. The instructions of the program keep close tabs on everything that takes
place in the interface unit and the I/O device. In this technique the CPU is responsible for
fetching data from memory for output and for storing data in memory for input, as shown in
the flowchart of Programmed I/O:
The transfer of each byte requires three instructions:
1. Read the status register.
2. Check the status of the flag bit and branch to step 1 if not set or to step 3 if set.
3. Read the data register.
Each byte is read into a CPU register and then transferred to memory with a store instruction.
A common I/O programming task is to transfer a block of words from an I/O device and store
them in a memory buffer.
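The three-instruction loop above can be sketched in Python; the device model is hypothetical and simply "becomes ready" after a couple of status reads, standing in for a real peripheral's timing.

```python
# Sketch of the programmed-I/O loop: read status, test the flag bit,
# then read the data register when the flag is set.
class Device:
    def __init__(self, data):
        self.data = list(data)
        self.polls_until_ready = 0

    def read_status(self):                 # step 1: read the status register
        if self.polls_until_ready > 0:
            self.polls_until_ready -= 1
            return 0                       # flag bit not set
        return 1                           # flag bit set: data register is full

    def read_data(self):                   # step 3: read the data register
        self.polls_until_ready = 2         # device needs time before the next byte
        return self.data.pop(0)

def transfer_block(device, count):
    buffer = []                            # memory buffer for the block
    for _ in range(count):
        while device.read_status() == 0:   # step 2: branch back while flag clear
            pass                           # the CPU is busy-waiting here
        buffer.append(device.read_data())  # byte moves via a CPU register to memory
    return buffer
```

The busy-wait loop is what makes programmed I/O time-consuming: the processor does no useful work while polling.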
2. Interrupt-Initiated I/O :
In this method, an interrupt facility and an interrupt command are used to inform the device about
the start and end of a transfer. In the meantime, the CPU executes other programs. When the
interface determines that the device is ready for data transfer, it generates an interrupt request
and sends it to the computer. When the CPU receives such a signal, it temporarily stops the
execution of the current program, branches to a service routine to process the I/O transfer, and
after completing it returns to the task it was originally performing.
In this type of I/O, the computer does not check the flag; it continues to perform its task. Whenever
a device wants attention, it sends an interrupt signal to the CPU. The CPU then deviates
from what it was doing, stores the return address from the PC, and branches to the address of the
service subroutine.
There are two ways of choosing the branch address:
a. Vectored Interrupt
b. Non-vectored Interrupt
In a vectored interrupt, the source that interrupts the CPU provides the branch information.
This information is called the interrupt vector.
In a non-vectored interrupt, the branch address is assigned to a fixed location in memory.
Priority Interrupt:
There are a number of I/O devices attached to a computer, all capable of generating
interrupts.
When interrupts are generated by more than one device, a priority interrupt system is used
to determine which device is to be serviced first.
Devices with high-speed transfer are given higher priority, and slow devices are given
lower priority.
Establishing the priority can be done in two ways:
Using Software and Using Hardware
Polling Procedure:
A polling procedure is used to identify the highest-priority source by software means. In this
method there is one common branch address for all interrupts. The program that takes care
of interrupts begins at the branch address and polls the interrupt sources in sequence. The order
in which they are tested determines the priority of each interrupt. The highest-priority source
is tested first, and if its interrupt signal is on, control branches to a service routine for this
source. Otherwise, the next-lower-priority source is tested.
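The polling order described above can be sketched as a short Python function; the interrupt lines and service routines are illustrative stand-ins for real hardware flags and handlers.

```python
# Sketch of software polling: sources are tested in priority order, and
# the first one whose interrupt signal is on gets serviced.
def poll(interrupt_lines, service_routines):
    # Both lists are ordered highest priority first; the test order
    # itself determines each source's priority.
    for line, routine in zip(interrupt_lines, service_routines):
        if line():               # interrupt signal on?
            return routine()     # branch to this source's service routine
    return None                  # no pending interrupt
```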
The disadvantage is that if there are many interrupts, the time required to poll them can
exceed the time available to service the I/O devices. In this situation a hardware
priority-interrupt unit can be used to speed up the operation.
The CPU pushes the return address from the PC onto the stack. It then acknowledges the interrupt
by enabling the INTACK line. The priority interrupt unit responds by placing a unique
interrupt vector on the CPU data bus. The CPU transfers the vector address into the PC and clears
IEN prior to going to the next fetch phase. The instruction read from memory during the next
fetch phase will be the one located at the vector address.
Initial and Final Operations
Each interrupt service routine must have an initial and final set of operations for controlling
the registers in the hardware interrupt system.
The initial sequence of each interrupt service routine must have instructions to control the
interrupt hardware in the following manner:
1. Clear lower-level mask register bits.
2. Clear interrupt status bit IST.
3. Save contents of processor registers.
4. Set interrupt enable bit IEN.
5. Proceed with service routine.
The final sequence in each interrupt service routine must have instructions
to control the interrupt hardware in the following manner:
1. Clear interrupt enable bit IEN.
2. Restore contents of processor registers.
3. Clear the bit in the interrupt register belonging to the source that has been serviced.
4. Set lower-level priority bits in the mask register.
5. Restore return address into PC and set IEN.
The bit in the interrupt register belonging to the source of the interrupt must be cleared so that
it will be available again for the source to interrupt. The lower-priority bits in the mask register
(including the bit of the source just serviced) are set so that they can enable their interrupts. The
return to the interrupted program is accomplished by restoring the return address to the PC.
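The initial and final sequences can be sketched as two Python functions operating on a hypothetical interrupt-hardware model with four priority levels (0 = highest); the register names follow the text (IEN, IST, mask), but the structure is illustrative.

```python
# Sketch of the initial and final housekeeping sequences of an
# interrupt service routine for a 4-level priority interrupt system.
class InterruptHardware:
    def __init__(self, levels=4):
        self.mask = [1] * levels   # 1 = level enabled
        self.IST = 0               # interrupt status bit
        self.IEN = 0               # interrupt enable bit

def initial_sequence(hw, level, saved_regs, cpu_regs):
    for i in range(level + 1, len(hw.mask)):
        hw.mask[i] = 0             # 1. clear lower-level mask register bits
    hw.IST = 0                     # 2. clear interrupt status bit IST
    saved_regs.update(cpu_regs)    # 3. save contents of processor registers
    hw.IEN = 1                     # 4. set interrupt enable bit IEN
                                   # 5. the service routine now proceeds

def final_sequence(hw, level, interrupt_register, saved_regs, cpu_regs):
    hw.IEN = 0                     # 1. clear interrupt enable bit IEN
    cpu_regs.update(saved_regs)    # 2. restore contents of processor registers
    interrupt_register[level] = 0  # 3. clear the serviced source's bit
    for i in range(level, len(hw.mask)):
        hw.mask[i] = 1             # 4. set lower-level bits in the mask register
    hw.IEN = 1                     # 5. restore return address into PC and set IEN
```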
3. Direct Memory Access (DMA):
The CPU may be placed in an idle state in a variety of ways. One common method extensively
used in microprocessors is to disable the buses through special control signals such as:
Bus Request (BR)
Bus Grant (BG)
These are two control signals in the CPU that facilitate the DMA transfer. The Bus Request
(BR) input is used by the DMA controller to request that the CPU relinquish control of the
buses. When this input is active, the CPU terminates the execution of the current instruction
and places the address bus, data bus, and read/write lines into a high-impedance state. A
high-impedance state means that the output is disconnected. The CPU activates the Bus Grant
(BG) output to inform the external DMA controller that it can now take control of the buses to
conduct memory transfers without processor intervention.
When the DMA terminates the transfer, it disables the Bus Request (BR) line. The CPU then
disables Bus Grant (BG), takes back control of the buses, and returns to its normal operation.
The transfer can be made in several ways:
i. DMA Burst
ii. Cycle Stealing
i) DMA Burst: - In DMA burst transfer, a block sequence consisting of a number of memory
words is transferred in a continuous burst while the DMA controller is master of the memory
buses.
ii) Cycle Stealing: - Cycle stealing allows the DMA controller to transfer one data word at a
time, after which it must return control of the buses to the CPU.
DMA Controller:
The DMA controller needs the usual circuits of an interface to communicate with the CPU and
I/O device. The DMA controller has three registers:
i. Address Register
ii. Word Count Register
iii. Control Register
i. Address Register :- The address register contains an address that specifies the desired location
in memory; it holds the starting address of the transfer.
ii. Word Count Register :- The word count (WC) register holds the number of words to be transferred.
The register is incremented or decremented by one after each word transfer and internally tested for zero.
iii. Control Register :- The control register specifies the mode of transfer. The unit
communicates with the CPU via the data bus and control lines.
DMA Transfer:
The registers in the DMA are selected by the CPU through the address bus by enabling the DS
(DMA select) and RS (Register select) inputs. The RD (read) and WR (write) inputs are
bidirectional. When the BG (Bus Grant) input is 0, the CPU can communicate with the DMA
registers through the data bus to read from or write to the DMA registers. When BG =1, the
DMA can communicate directly with the memory by specifying an address in the address bus
and activating the RD or WR control.
The CPU communicates with the DMA through the address and data buses as with any
interface unit. The DMA has its own address, which activates the DS and RS lines. The CPU
initializes the DMA through the data bus. Once the DMA receives the start control command,
it can transfer between the peripheral and the memory. When BG = 0 the RD and WR are
input lines allowing the CPU to communicate with the internal DMA registers. When BG=1,
the RD and WR are output lines from the DMA controller to the random access memory to
specify the read or write operation of data.
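The three DMA registers and a cycle-stealing block transfer can be sketched in Python; the memory and device models are illustrative, and "stealing a cycle" is modeled as one call that transfers a single word before the buses notionally return to the CPU.

```python
# Sketch of a DMA controller's address, word-count, and control registers
# driving a cycle-stealing block transfer between a device and memory.
class DMAController:
    def __init__(self, start_address, word_count, write_to_memory=True):
        self.address = start_address      # address register: starting address
        self.word_count = word_count      # word count register
        self.write = write_to_memory      # control register: transfer mode

    def steal_cycle(self, memory, device):
        # Transfer one word, then return the buses to the CPU (BG drops to 0).
        if self.word_count == 0:
            return False
        if self.write:
            memory[self.address] = device.pop(0)   # device -> memory
        else:
            device.append(memory[self.address])    # memory -> device
        self.address += 1
        self.word_count -= 1                       # internally tested for zero
        return True

def dma_transfer(dma, memory, device):
    while dma.steal_cycle(memory, device):  # BG = 1 during each stolen cycle
        pass                                # CPU resumes between cycles
```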
Bus
A bus is a communication system that transfers data between components inside a computer,
or between computers.
A bus is a group of conducting wires that carries information; all peripherals are
connected to the microprocessor through the bus.
Bus Arbitration:
The device that is allowed to initiate data transfers on the bus at any given time is called the
bus master. Arbitration is the process by which the next device to become the bus master
is selected and bus mastership is transferred to it. The two approaches are centralized and
distributed arbitrations.
i) In the centralized approach, a hardware device called a bus controller or bus arbiter allocates
the bus. It uses one of the following schemes:
(1) Daisy chaining
(2) Polling
ii) In the distributed approach, each master has its own mediator, as opposed to the single
arbiter of the centralized approach. Equal responsibility is given to all devices to carry out the
arbitration process, without using a central arbiter.
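Daisy chaining, the first centralized scheme, can be sketched as a short function: the bus grant propagates along the chain of devices, and the first device that is requesting the bus absorbs the grant. The model is illustrative.

```python
# Sketch of centralized daisy-chain arbitration: physical position in
# the chain determines priority, with the device nearest the arbiter first.
def daisy_chain_grant(requests):
    # requests is ordered by position along the chain
    for position, requesting in enumerate(requests):
        if requesting:
            return position       # this device absorbs the grant: bus master
    return None                   # grant passes through; no master selected
```

One design consequence, mirrored by this sketch, is that priority is fixed by wiring order and cannot be changed by software.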
The Primary function of the bus is to provide a communication path for the transfer of
data.
Bus protocols are defined over the data, address, and control lines. A variety of
schemes have been devised for the timing of data transfers over a bus.
They are the synchronous and asynchronous schemes.
Synchronous bus
All devices derive timing information from a common clock line. Equally spaced pulses
on this line define equal time intervals. Each of these intervals constitutes a bus cycle
during which one data transfer can take place.
Asynchronous bus
This is a scheme based on the use of a handshake between the master and the slave for
controlling data transfers on the bus. The common clock is replaced by two timing
control lines, master-ready and slave-ready. The first is asserted by the master to indicate
that it is ready for a transaction and the second is a response from the slave. The master places
the address and command information on the bus. It indicates to all devices that it has done
so by activating the master-ready line. This causes all devices on the bus to decode the address.
The selected slave performs the required operation and informs the processor it has done so
by activating the slave-ready line. A typical handshake controls the data transfer during an input
or an output operation. The master waits for slave-ready to become asserted before it removes its
signals from the bus. The handshake signals are fully interlocked: a change of state in one
signal is followed by a change in the other signal. Hence this scheme is known as a full
handshake.
Interface Circuits
An I/O interface consists of the circuitry required to connect an I/O device to a computer
bus. On one side of the interface, we have bus signals. On the other side, we have a data path
with its associated controls to transfer data between the interface and the I/O device; this side
is called a port.
We have two types:
Serial port and
Parallel port
A parallel port transfers data in the form of a number of bits (8 or 16) simultaneously to
or from the device. A serial port transmits and receives data one bit at a time.
Communication with the bus is the same for both formats. The conversion from the parallel to
the serial format, and vice versa, takes place inside the interface circuit. In parallel port, the
connection between the device and the computer uses a multiple-pin connector and a cable
with as many wires. This arrangement is suitable for devices that are physically close to the
computer. In serial port, it is much more convenient and cost-effective where longer cables are
needed.
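The parallel-to-serial conversion that takes place inside a serial-port interface can be sketched in Python: a byte is shifted out one bit at a time and reassembled at the far end. The least-significant-bit-first order is an assumed convention for the sketch.

```python
# Sketch of serializing a parallel byte into bits and back, as a serial
# port's shift registers would do in hardware.
def serialize(byte):
    # transmit the least-significant bit first (assumed convention)
    return [(byte >> i) & 1 for i in range(8)]

def deserialize(bits):
    value = 0
    for i, bit in enumerate(bits):  # shift each received bit into place
        value |= bit << i
    return value
```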
Standard I/O Interfaces
Consider a computer system using different interface standards. The three major standard I/O
interfaces discussed here are interconnected by a circuit called a bridge:
– PCI (Peripheral Component Interconnect)
– SCSI (Small Computer System Interface)
– USB (Universal Serial Bus)
PCI (Peripheral Component Interconnect)
Main memory and the PCI bridge are connected to the disk, printer, and Ethernet interfaces through
the PCI bus. At any given time, one device is the bus master; it has the right to initiate data
transfers by issuing read and write commands. A master is called an initiator in PCI
terminology and is either the processor or a DMA controller. The addressed device that responds
to read and write commands is called a target. A complete transfer operation on the bus,
involving an address and a burst of data, is called a transaction. Device configuration is also
discussed.
SCSI (Small Computer System Interface)
SCSI is a standard bus defined by the American National Standards Institute (ANSI). A controller
connected to a SCSI bus is either an initiator or a target. The processor sends a command to the SCSI
controller, which causes the following sequence of events to take place:
• The SCSI controller contends for control of the bus (initiator).
• When the initiator wins the arbitration process, it selects the target controller and hands over
control of the bus to it.
• The target starts an output operation. The initiator sends a command specifying the required
read operation.
• The target sends a message to the initiator indicating that it will temporarily suspend the
connection between them. Then it releases the bus.
• The target controller sends a command to the disk drive to move the read head to the first
sector involved in the requested read operation.
• The target transfers the contents of the data buffer to the initiator and then suspends the
connection again.
• The target controller sends a command to the disk drive to perform another seek operation.
• As the initiator controller receives the data, it stores them into the main memory using the
DMA approach.
• The SCSI controller sends an interrupt to the processor to inform it that the requested
operation has been completed.
The bus signals, arbitration, selection, information transfer and reselection are the topics
discussed in addition to the above.
For connecting different devices to a computer different buses are used. Each bus typically has
a different data transfer speed.
1) ISA (Industry Standard Architecture) bus: ISA bus was created by IBM in 1981. ISA bus can
transfer 8 or 16 bits at one time. ISA 8 bit bus can run at 4.77 MHz and 16 bit at 8.33 MHz.
2) IDE (Integrated Drive Electronics) bus: IDE bus is used for connecting disks and CDROMs to
the computer.
3) USB (Universal Serial Bus): It is used for connecting the keyboard, mouse, and other USB
devices to the computer. A USB bus has a connector with four wires; two of them supply
electrical power to the USB devices. USB 1.0 has a data rate of 1.5 MB/s, and high-speed
USB 2.0 has a data rate of 60 MB/s (480 Mbit/s).
4) IEEE 1394 or FireWire: IEEE 1394 is used for high-speed data transfer. It can transfer data at a
rate of up to 400 Mbit/s (50 MB/s). It is a bit-serial bus used for connecting cameras and other
multimedia devices.
Hardware multithreading
Operating system (OS) software enables multitasking of different programs in the
same processor by performing context switches among programs. A program,
together with any information that describes its current state of execution, is
observed by the OS as an entity called a process. Each process has a
corresponding thread, which is an independent path of execution within a
program. More precisely, the term thread is used to refer to a thread of control
whose state consists of the contents of the program counter and other
processor registers.
It is possible for multiple threads to execute portions of one program and run in
parallel as if they correspond to separate programs. Two or more threads can be
running on different processors, executing either the same part of a program on
different data, or executing different parts of a program. Threads for different
programs can also execute on different processors. All threads that are part of a
single program run in the same address space and are associated with the
same process. To deal with multiple threads efficiently, a processor is
implemented with several identical sets of registers, including multiple
program counters. Each set of registers can be dedicated to a different thread.
Thus, no time is wasted during a context switch to save and restore register
contents. The processor is said to be using a technique called hardware
multithreading. With multiple sets of registers, context switching is simple and
fast. All that is necessary is to change a hardware pointer in the processor to use
a different set of registers to fetch and execute subsequent instructions. Switching
to a different thread can be completed within one clock cycle. The state of the
previously active thread is preserved in its own set of registers.
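The pointer-switch idea can be sketched in Python: one register set per thread, and a context switch is nothing more than changing which set the processor considers active. The register names are illustrative.

```python
# Sketch of hardware multithreading: several identical register sets,
# with a hardware pointer selecting the active one.
class MultithreadedProcessor:
    def __init__(self, num_threads):
        # each set includes its own program counter
        self.register_sets = [{"PC": 0, "R0": 0} for _ in range(num_threads)]
        self.active = 0            # hardware pointer to the active set

    def switch_to(self, thread_id):
        # no saving or restoring: the old state stays in its own set
        self.active = thread_id

    def regs(self):
        return self.register_sets[self.active]
```

Because nothing is copied on a switch, this is why switching can complete within one clock cycle.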
Switching to a different thread may be triggered at any time by the occurrence of
a specific event, rather than at the end of a fixed time interval. For example, a cache
miss may occur when a Load or Store instruction is being executed for the active
thread. Instead of stalling while the slower main memory is accessed to service the
cache miss, a processor can quickly switch to a different thread and continue
to fetch and execute other instructions. This is called coarse-grained
multithreading because many instructions may be executed for one thread before
an event such as a cache miss causes a switch to another thread.
An alternative to switching between threads on specific events is to switch after
every instruction is fetched. This is called fine-grained or interleaved
multithreading. The intent is to increase the processor throughput. Each new
instruction is independent of its predecessors from other threads. This should
reduce the occurrence of stalls due to data dependencies.
Thus, throughput may be increased by interleaving instructions from many threads,
but it takes longer for a given thread to complete all of its instructions.
VECTOR PROCESSING (SIMD)
A vector processor is a computer with the ability to process vectors, and related data
structures such as matrices and multi-dimensional arrays, much faster than
conventional computers. Vector processors may also be pipelined.
To achieve the required level of high performance it is necessary to utilize the
fastest and most reliable hardware and apply innovative procedures from
vector and parallel processing techniques.
Computers with vector processing capabilities are in demand in specialized
applications.
The following are representative application areas where vector processing is
of importance.
An element of vector V is written as V(I), where the index I refers to a memory
address or register in which the number is stored.
To examine the difference between a conventional scalar processor and a vector
processor, consider the following
Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
This is a program for adding two vectors A and B of length 100 to produce a vector
C . This is implemented in machine language by the following sequence of
operations.
Initialize I = 0
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I <= 100 go to 20
Continue
This constitutes a program loop that reads a pair of operands from arrays A and B
and performs a floating-point addition. The loop control variable is then updated
and the steps repeat 100 times. A computer capable of vector processing
eliminates the overhead associated with the time it takes to fetch and execute the
instructions in the program loop. It allows operations to be specified with a
single vector instruction of the form
C(1:100) = A(1:100) + B(1:100)
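The contrast between the scalar loop and a single vector operation can be sketched in plain Python, with lists standing in for vector registers.

```python
# Sketch: element-by-element scalar addition versus one whole-vector
# operation, as a vector instruction would specify it.
def scalar_add(a, b):
    c = [0] * len(a)
    for i in range(len(a)):     # fetch/execute overhead repeated per element
        c[i] = a[i] + b[i]
    return c

def vector_add(a, b):
    # one "instruction" operating on whole vectors at once
    return [x + y for x, y in zip(a, b)]
```

Both produce the same result; the vector form simply removes the per-iteration loop-control overhead that the text describes.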
Matrix Multiplication
Matrix multiplication is one of the most computation-intensive operations
performed in computers with vector processors. The multiplication of two n x n
matrices consists of n^2 inner products, or n^3 multiply-add operations.
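The operation count can be seen directly in a naive triple-loop sketch: the two outer loops form the n^2 inner products, and the innermost loop performs the n multiply-adds of each one.

```python
# Sketch of n x n matrix multiplication as n^2 inner products,
# each consisting of n multiply-add operations.
def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):          # inner product of row i and column j
                C[i][j] += A[i][k] * B[k][j]
    return C
```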
Shared-Memory Multiprocessor
In a shared-memory multiprocessor, all processors have access to the same
memory. Tasks running in different processors can access shared variables in
the memory using the same addresses. The size of the shared memory is likely
to be large. Implementing a large memory in a single module would create a
bottleneck when many processors make requests to access the memory
simultaneously. This problem is alleviated by distributing the memory multiple
modules so that simultaneous requests from different processors are more likely to
access different memory modules, depending on the addresses of those requests.
Cache Coherence
Write-through - all data written to the cache is also written to memory at the same
time.
Write-back - when data is written to a cache, a dirty bit is set for the affected block.
The modified block is written to memory only when the block is replaced.
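The two write policies can be sketched with a toy cache model; the structure is illustrative (a real cache has fixed-size blocks, tags, and a replacement policy), but it captures when main memory gets updated under each policy.

```python
# Sketch of write-through versus write-back: the policy determines
# whether memory is updated on every write or only on block replacement.
class Cache:
    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.block = {}            # addr -> cached value
        self.dirty = set()

    def write(self, addr, value):
        self.block[addr] = value
        if self.write_through:
            self.memory[addr] = value    # memory updated at the same time
        else:
            self.dirty.add(addr)         # dirty bit set; memory updated later

    def evict(self, addr):
        if addr in self.dirty:           # write-back happens on replacement only
            self.memory[addr] = self.block[addr]
            self.dirty.discard(addr)
        self.block.pop(addr, None)
```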
A simple scheme is to disallow private caches for each processor and have a
shared cache memory associated with main memory. Every data access is made
to the shared cache. This method violates the principle of closeness of CPU to
cache and increases the average memory access time. In effect, this scheme
solves the problem by avoiding it. For performance considerations it is desirable
to attach a private cache to each processor. One scheme that has been used
allows only non shared and read-only data to be stored in caches. Such items
are called cachable. Shared writable data are noncachable. The compiler
must tag data as either cachable or non cachable, and the system hardware
makes sure that only cachable data are stored in caches. The noncachable
data remain in main memory. This method restricts the type of data stored in
caches and introduces an extra software overhead that may degrade performance.
A scheme that allows writable data to exist in at least one cache is a method
that employs a centralized global table in its compiler. The status of memory
blocks is stored in the central global table. Each block is identified as read only
(RO) or read and write (RW). All caches can have copies of blocks identified as
RO. Only one cache can have a copy of an RW block. Thus if the data are
updated in the cache with an RW block, the other caches are not affected because
they do not have a copy of this block. The cache coherence problem can be
solved by means of a combination of software and hardware or by means of
hardware-only schemes. The two methods mentioned previously use software
based procedures that require the ability to tag information in order to disable
caching of shared writable data. In the hardware solution, the cache controller
is specially designed to allow it to monitor all bus requests from CPUs and
IOPs. All caches attached to the bus constantly monitor the network for possible
write operations. Depending on the method used, they must then either update
or invalidate their own cache copies when a match is detected. The bus
controller that monitors this action is referred to as a snoopy cache controller.
This is basically a hardware unit designed to maintain a bus-watching mechanism
over all the caches attached to the bus. Various schemes have been proposed to
solve the cache coherence problem by means of snoopy cache protocol . The
simplest method is to adopt a write through policy and use the following procedure.
All the snoopy controllers watch the bus for memory store operations. When
a word in a cache is updated by writing into it, the corresponding location in
main memory is also updated. The local snoopy controllers in all other caches
check their memory to determine if they have a copy of the word that has been
overwritten. If a copy exists in a remote cache, that location is marked invalid.
Because all caches snoop on all bus writes, whenever a word is written, the net
effect is to update it in the original cache and main memory and remove it from all
other caches. In this way, inconsistent versions are prevented.
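The write-through invalidation procedure described above can be sketched as a small simulation. The names (bus_write, CacheLine) and the direct-mapped, one-word-per-line organization are illustrative assumptions, not part of any real protocol implementation:

```c
#include <stdbool.h>

#define NUM_CACHES  4
#define CACHE_LINES 8
#define MEM_WORDS   64

/* One direct-mapped cache line: the memory address it holds, the data,
   and a valid bit. */
typedef struct { int addr; int data; bool valid; } CacheLine;

static CacheLine caches[NUM_CACHES][CACHE_LINES];
static int memory[MEM_WORDS];

/* A processor writes a word through its cache.  Write-through policy:
   main memory is updated immediately, and every other snoopy controller
   that finds a matching copy invalidates its own line. */
void bus_write(int writer, int addr, int data)
{
    int line = addr % CACHE_LINES;

    /* Update the writer's own copy and main memory. */
    caches[writer][line] = (CacheLine){ addr, data, true };
    memory[addr] = data;

    /* All other snoopy controllers watch the bus for store operations
       and invalidate any copy of the overwritten word. */
    for (int c = 0; c < NUM_CACHES; c++) {
        if (c == writer)
            continue;
        if (caches[c][line].valid && caches[c][line].addr == addr)
            caches[c][line].valid = false;
    }
}
```

After a write, the word is current in the writing processor's cache and in main memory, and stale copies elsewhere are invalid, matching the behavior described in the text.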
Message-Passing Multicomputers:
A different way of using multiple processors involves implementing each node
in the system as a complete computer with its own memory. Other computers in
the system do not have direct access to this memory. Data that need to be shared
are exchanged by sending messages from one computer to another. Such systems
are called message-passing multicomputers. Parallel programs are written
differently for message-passing multicomputers than for shared-memory
multiprocessors. To share data between nodes, the program running in the
computer that is the source of the data must send a message containing the data to
the destination computer. The program running in the destination computer
receives the message and copies the data into the memory of that node. To
facilitate message passing, a special communications unit at each node is often
responsible for the low-level details of formatting and interpreting messages
that are sent and received, and for copying message data to and from the
memory of the node. The computer in each node issues commands to the
communications unit. The computer then continues performing other
computations while the communications unit handles the details of sending and
receiving messages.
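As a rough sketch of the communications unit's role in copying message data to and from node memory, the following models each node's incoming messages with a one-slot mailbox. The Node layout and the send_message/recv_message routines are illustrative assumptions, not a real message-passing API:

```c
#include <string.h>

#define NUM_NODES 2
#define MAX_MSG   64

/* Each node owns private memory plus a one-slot mailbox that the
   communications unit fills with incoming message data. */
typedef struct {
    char mailbox[MAX_MSG];
    int  msg_len;          /* 0 means the mailbox is empty */
} Node;

static Node nodes[NUM_NODES];

/* Source side: the message data are copied into the destination node's
   mailbox.  Returns 0 on success, -1 if the mailbox is still occupied. */
int send_message(int dest, const void *data, int len)
{
    if (nodes[dest].msg_len != 0 || len > MAX_MSG)
        return -1;
    memcpy(nodes[dest].mailbox, data, len);
    nodes[dest].msg_len = len;
    return 0;
}

/* Destination side: the received data are copied into the node's own
   memory, freeing the mailbox.  Returns the length, or -1 if empty. */
int recv_message(int self, void *buf)
{
    int len = nodes[self].msg_len;
    if (len == 0)
        return -1;         /* nothing has arrived yet */
    memcpy(buf, nodes[self].mailbox, len);
    nodes[self].msg_len = 0;
    return len;
}
```

The explicit copy on each side reflects the key property of multicomputers: no node can read another node's memory directly.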
Parallel Programming for Multiprocessors
Programming for a shared-memory multiprocessor is a natural extension of
conventional programming for a single processor. A high-level source program is
written using tasks that are executed by one processor. But it is also possible to
indicate that certain tasks are to be executed simultaneously in different processors.
Sharing of data is achieved by defining global variables that are read and written
by different processors as they perform their assigned tasks. To illustrate parallel
programming, we consider the example of computing the dot product of two
vectors, each containing N numbers. The details of initializing the contents of the
two vectors are omitted to focus on the aspects relevant to parallel programming.
The loop accumulates the sum of N products. Each pass depends on the partial sum
computed in the preceding pass, and the result computed in the final pass is the dot
product. Despite the dependency, it is possible to partition the program into
independent tasks for simultaneous execution by exploiting the associative
property of addition. Each task computes a partial sum, and the final result is
obtained by adding the partial sums.
#include <stdio.h> /* Routines for input/output. */
#define N 100 /* Number of elements in each vector. */
double a[N], b[N]; /* Vectors for computing the dot product. */
int main (void)
{
int i;
double dot_product;
/* Initialize vectors a[], b[] – details omitted. */
dot_product = 0.0;
for (i = 0; i < N; i++)
dot_product = dot_product + a[i] * b[i];
printf ("The dot product is %g\n", dot_product);
}
Figure 12.7 C program for computing a dot product.
Implementing a parallel program for computing the dot product involves two steps:
1. Thread Creation
To have multiple processors compute the partial sums in parallel, we first
define the tasks that are assigned to the different processors, and then
describe how execution of these tasks is initiated in multiple processors. We
can write a parallel version of the dot product program using parameters for the
number of processors, P, and the number of elements in each vector, N. We assume
for simplicity that N is evenly divisible by P. The overall computation involves a
sum of N products. For P processors, we define P independent tasks, where each
task is the computation of a partial sum of N/P products. When a program is
executed in a single processor, there is one active thread of execution control. This
thread is created implicitly by the operating system (OS) when execution of the
program begins. For a parallel program, we require the independent tasks to
be handled separately by multiple threads of execution control, one for each
processor. These threads must be created explicitly. A typical approach is to use
a routine named create_thread in a library that supports parallel
programming. The library routine accepts an input parameter, which is a
pointer to a subroutine to be executed by the newly created thread. An
operating system service is invoked by the library routine to create a new
thread with a distinct stack, so that it may call other subroutines and have its
own local variables. All global variables are shared among all threads.
It is necessary to distinguish the threads from each other. One approach is to
provide another library routine called get_my_thread_id that returns a
unique integer between 0 and P − 1 for each thread. With that information, a
thread can determine the appropriate subset of the overall computation for which
it is responsible.
2. Thread Synchronization
We ensure that each processor has computed its partial sum before the final result
for the dot product is computed. Synchronization of multiple threads is therefore
required. There are several methods of synchronization, and they are often
implemented in additional library routines for parallel programming. Here, we
consider one method called a barrier.
The purpose of a barrier is to force threads to wait until they have all reached
a specific point in the program where a call is made to the library routine for
the barrier. Each thread that calls the barrier routine enters a busy-wait loop
until the last thread calls the routine and enables all of the threads to continue
their execution. This ensures that the threads have completed their respective
computations preceding the barrier call.
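A busy-wait barrier of the kind described can be sketched with C11 atomics. The Barrier type and routine names are illustrative; a library such as POSIX threads provides an equivalent ready-made routine (pthread_barrier_wait):

```c
#include <stdatomic.h>

/* A counting barrier for a fixed number of threads.  Each arriving
   thread decrements the count; the last arrival resets the count and
   advances the release flag, freeing the busy-waiting threads. */
typedef struct {
    atomic_int count;     /* threads still expected at the barrier */
    atomic_int release;   /* generation flag advanced by the last arrival */
    int total;
} Barrier;

void barrier_init(Barrier *b, int total)
{
    atomic_init(&b->count, total);
    atomic_init(&b->release, 0);
    b->total = total;
}

void barrier_wait(Barrier *b)
{
    int gen = atomic_load(&b->release);
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last thread to arrive: reset for reuse, then release everyone. */
        atomic_store(&b->count, b->total);
        atomic_fetch_add(&b->release, 1);
    } else {
        /* Busy-wait loop: spin until the last thread advances the flag. */
        while (atomic_load(&b->release) == gen)
            ;   /* spin */
    }
}
```

Recording the generation before decrementing lets the barrier be reused safely in a loop: a thread leaving one barrier cannot be confused with a thread arriving at the next.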
Example Parallel Program
Having described the issues related to thread creation and synchronization, and
typical library routines that are provided for thread management, we can now
present a parallel dot product program as an example, consisting of a main
routine and another routine that each thread executes to compute a partial sum.
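A minimal version of such a program can be sketched with POSIX threads, with pthread_create in the role of create_thread and a per-thread id argument in the role of get_my_thread_id. The names P, partial_dot, and parallel_dot are illustrative, and N is assumed to be evenly divisible by P:

```c
#include <pthread.h>

#define N 100         /* Number of elements in each vector. */
#define P 4           /* Number of threads; N assumed divisible by P. */

static double a[N], b[N];
static double partial[P];        /* one partial sum per thread */

/* Task executed by one thread: accumulate N/P products for its slice. */
static void *partial_dot(void *arg)
{
    int id = *(int *)arg;        /* plays the role of get_my_thread_id */
    double sum = 0.0;
    for (int i = id * (N / P); i < (id + 1) * (N / P); i++)
        sum += a[i] * b[i];
    partial[id] = sum;
    return NULL;
}

/* Main routine: create P threads, wait for them, add the partial sums. */
double parallel_dot(void)
{
    pthread_t t[P];
    int id[P];
    for (int k = 0; k < P; k++) {
        id[k] = k;
        pthread_create(&t[k], NULL, partial_dot, &id[k]);
    }
    double result = 0.0;
    for (int k = 0; k < P; k++) {
        pthread_join(t[k], NULL);   /* synchronization point */
        result += partial[k];
    }
    return result;
}
```

Here pthread_join in the main routine plays the synchronization role that the barrier plays in the textbook version: no partial sum is added before its thread has finished. Compile with the -pthread option.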
Performance Modeling
The most important measure of the performance of a computer is how quickly it
can execute programs. When considering one processor, the speed with which
instructions are fetched and executed is affected by the instruction set architecture
and the hardware design. The total number of instructions that are executed is
affected by the compiler as well as the instruction set architecture. A basic
model of execution time has terms for the number of instructions executed, the
average number of cycles per instruction, and the clock frequency. This model enables the
prediction of execution time, given sufficiently detailed information. A higher-
level model that relies on less detailed information can be used to assess potential
improvements in performance. Consider a program whose execution time on some
computer is Torig. Our objective is to assess the extent to which the execution time
can be reduced when a performance enhancement, such as parallel processing, is
introduced.
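One common higher-level model of this kind is Amdahl's law. As a sketch, assuming f is the fraction of Torig that benefits from the enhancement and p is the factor by which that fraction is sped up:

```c
/* Execution time after the enhancement: the fraction f of the original
   time t_orig that benefits is divided by the speedup factor p; the
   remaining (1 - f) of the time is unchanged. */
double enhanced_time(double t_orig, double f, double p)
{
    return t_orig * ((1.0 - f) + f / p);
}

/* Overall speedup: original time divided by enhanced time. */
double speedup(double t_orig, double f, double p)
{
    return t_orig / enhanced_time(t_orig, f, p);
}
```

The model shows why the unenhanced fraction dominates: even with p arbitrarily large, the speedup cannot exceed 1 / (1 - f).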
Reference:
Carl Hamacher, Zvonko Vranesic, Safwat Zaky, Computer Organization, Fifth
Edition, McGraw-Hill Higher Education.