Large Computer Systems and Pipelining: Homework
HOMEWORK:
Submitted By:
Reyes, Ryan B.
Submitted To:
Computer Architecture
TF 06:00PM – 07:30PM
Date:
Large-scale computer systems are systems consisting of more than one computer, usually
under the supervision of a master computer, in which smaller computers handle input/output and
routine jobs while the large computer carries out the more complex computations. One example
is a mainframe computer, which runs separate processes (programs) for task management,
program management, job management, serialization, catalogs, and inter-address-space
communication.
In general, parallel processing means that at least two microprocessors handle parts of an
overall task. The concept of speeding up the execution of a program is simple: a computer
scientist divides a complex problem into component parts using software specifically
designed for the task, then assigns each component part to a dedicated processor. Each
processor solves its part of the overall computational problem, and the software reassembles
the results to reach the solution of the original complex problem. In the ideal case, a program
executed across N processors might run up to N times faster than it would on a single processor.
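The divide, solve, and reassemble steps described above can be sketched in Python. This is only a minimal illustration, not a benchmark: the names split_into_parts and parallel_sum are hypothetical, and threads stand in for the dedicated processors.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, n_parts):
    """Divide the problem into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def parallel_sum(data, n_workers=4):
    """Sum data by giving each worker one part, then combining the partial sums."""
    parts = split_into_parts(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, parts))  # each worker solves its part
    return sum(partial_sums)  # reassemble the partial results


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))  # 5050
```

The sketch shows the structure of the decomposition only. Whether a real program approaches an N-fold speedup depends on how evenly the work divides and how much time is spent reassembling the results.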
Single Instruction, Single Data (SISD) computers have one processor that handles one
algorithm using one source of data at a time. The computer tackles and processes each task in
order, and so sometimes people use the word "sequential" to describe SISD computers. They
aren't capable of performing parallel processing on their own. Conventional single-processor von
Neumann computers are classified as SISD systems.
Multiple Instruction, Single Data (MISD) computers have multiple processors. Each
processor uses a different algorithm but uses the same shared input data. MISD computers can
analyze the same set of data using several different operations at the same time. The number of
operations depends upon the number of processors. There aren't many actual examples of MISD
computers, partly because the problems an MISD computer can calculate are uncommon and
specialized.
Single Instruction, Multiple Data (SIMD) computers have several processors that
follow the same set of instructions, but each processor inputs different data into those
instructions. SIMD computers run different data through the same algorithm. This can be useful
for analyzing large chunks of data based on the same criteria. Many complex computational
problems don't fit this model. In a SIMD machine there is only one control unit, and all
processors execute the same instruction in a synchronized fashion.
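The SIMD idea of one instruction stream applied to many data items in lockstep can be modeled in a few lines of Python. This is a software model under assumed names (run_simd, lanes), not real vector hardware: each element of lanes plays the role of one processor's private data.

```python
def run_simd(program, lanes):
    """Apply each instruction of the program to every lane before moving on.

    program: list of one-argument functions (the single instruction stream).
    lanes:   list of data items, one per simulated processor.
    """
    for instruction in program:
        # The single control unit broadcasts one instruction; every lane
        # applies it to its own data item in the same step.
        lanes = [instruction(x) for x in lanes]
    return lanes


if __name__ == "__main__":
    program = [lambda x: x * 2, lambda x: x + 1]
    print(run_simd(program, [1, 2, 3, 4]))  # [3, 5, 7, 9]
```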
Single Program, Multiple Data (SPMD) systems are a subset of Multiple Instruction,
Multiple Data (MIMD) systems. An SPMD computer is structured like an MIMD computer, but it
runs the same set of instructions across all processors.
B. INTERCONNECTING NETWORKS
The processing units can communicate and interact with each other using either shared
memory or message passing methods. The interconnection network for shared memory systems
can be classified as bus-based versus switch-based.
In message passing systems, the interconnection network is divided into static and
dynamic. Static connections have a fixed topology that does not change while programs are
running. They provide a direct inter-processor communication path and are usually used in
distributed-memory multiprocessors. Dynamic connections create links on the fly as the program
executes. They provide a physically separate switching network for inter-processor
communication and are usually used in shared-memory multiprocessors.
A computer network can consist of as few as two connected computers, but most networks
are built for more than two computers.
Based on their layout (not the physical but the logical layout, also referred to as
topology), there are two types of networks: peer-to-peer and client/server. A network is
referred to as peer-to-peer if most computers are similar and run workstation operating systems.
In a peer-to-peer network, each computer holds its files and resources. Other computers
can access these resources but a computer that has a particular resource must be turned on for
other computers to access the resource it has. For example, if a printer is connected to computer
A and computer B wants to print to that printer, computer A must be turned on.
In a client/server environment, each computer can still hold some of its own resources
and files, and other computers can access them as in a peer-to-peer scenario. The distinguishing
feature of a client/server network is that files and resources are centralized: one computer,
the server, holds them, and the other computers, the clients, access them. Since the server is
always on, the client machines can access the files and resources without caring whether any
other particular computer is on.
The client/server type of network also provides many other advantages, such as
centralized backup, intranet capability, and Internet monitoring.
Large computer systems usually use purpose-specific storage based on the requirements
of the system. A large supercomputer, for example, may not need much permanent storage but does
need fast access to data, so it would have large RAM requirements with comparatively little
permanent storage.
Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication network to
connect inter-processor memory.
Processors have their own local memory. Memory addresses in one processor do not map
to another processor, so there is no concept of global address space across all processors.
Because each processor has its own local memory, it operates independently. Changes it
makes to its local memory have no effect on the memory of other processors. Hence, the concept
of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated. Synchronization between
tasks is likewise the programmer's responsibility.
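The explicit communication described above can be sketched with two simulated nodes that exchange a message through a queue. This is a minimal model under stated assumptions: the names producer, consumer, and run_nodes are hypothetical, threads stand in for separate computers, a queue.Queue stands in for the network, and the results list exists only so the sketch can report its answer.

```python
import queue
import threading


def producer(outbox):
    """Node A: computes on its private data, then explicitly sends the result."""
    local_memory = {"values": [1, 2, 3, 4]}  # not addressable by node B
    outbox.put(sum(local_memory["values"]))  # explicit send


def consumer(inbox, results):
    """Node B: blocks until the message arrives, which also synchronizes the tasks."""
    local_memory = {"received": inbox.get()}  # explicit receive
    results.append(local_memory["received"] * 2)


def run_nodes():
    channel = queue.Queue()  # stands in for the communication network
    results = []             # used only to collect the final answer
    node_a = threading.Thread(target=producer, args=(channel,))
    node_b = threading.Thread(target=consumer, args=(channel, results))
    node_b.start()
    node_a.start()
    node_a.join()
    node_b.join()
    return results[0]


if __name__ == "__main__":
    print(run_nodes())  # 20
```

Node B cannot read node A's values directly. It sees only what node A chooses to send, and it waits in get() until the message arrives, which is the kind of data movement and synchronization the programmer has to spell out.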
The network "fabric" used for data transfer varies widely, though it can be as simple
as Ethernet.
The largest and fastest computers in the world today employ both shared and distributed
memory architectures.
The shared memory component is usually a cache-coherent symmetric multiprocessor (SMP)
machine. Processors on a given SMP can address that machine's memory as global.
The distributed memory component is the networking of multiple SMPs. SMPs know
only about their own memory - not the memory on another SMP. Therefore, network
communications are required to move data from one SMP to another.
In serial computation, a program runs on a single computer having a single Central
Processing Unit (CPU). The problem is broken into a discrete series of instructions that are
executed one after another, with only one instruction executing at any moment in time.
Program parallelism is the simultaneous use of multiple compute resources to solve a
computational problem. Using multiple CPUs, the problem is separated into discrete parts
that can be solved concurrently, in less time than with a single compute resource. Each part
is further broken down into a series of instructions, and the instructions from different parts
execute simultaneously on different CPUs.
The computer resources can include a single computer with multiple processors; an
arbitrary number of computers connected by a network; or a combination of both.
Shared memory parallel computers vary widely, but generally have in common the ability
for all processors to access all memory as global address space.
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors.
Global address space provides a user-friendly programming perspective to memory. Data sharing
between tasks is both fast and uniform due to the proximity of memory to CPUs.
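The shared-memory model can be sketched with several threads updating one counter that all of them can address. The function name run_shared_counter is hypothetical. The lock is an assumption added here, because concurrent updates to the same location need some form of synchronization to stay correct.

```python
import threading


def run_shared_counter(n_workers=4, increments=1000):
    """Have n_workers threads each add `increments` to one shared counter."""
    shared = {"counter": 0}   # one memory location visible to every worker
    lock = threading.Lock()   # assumption: serializes updates to the counter

    def worker():
        for _ in range(increments):
            with lock:
                shared["counter"] += 1  # change is visible to all other workers

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]


if __name__ == "__main__":
    print(run_shared_counter())  # 4000
```

No data is sent between workers here. Each worker simply reads and writes the same location, which is what makes the global address space convenient for the programmer.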
E. MULTICOMPUTERS
Multicomputers are computers made up of several computers. The term generally refers
to an architecture in which each processor has its own memory, rather than multiple processors
with a shared memory. A multicomputer uses more than one fully functional computer to
process data. The amount of CPU time used by each computer does not matter, as long as they
process at the same time and their outputs are combined. Multicomputers have no shared
memory, and each “computer” consists of a single processor, cache, private memory, and I/O
devices.
Multicomputers use distributed computing that deals with hardware and software systems
containing more than one processing element or storage element, concurrent processes, or
multiple programs which run under a loosely or tightly controlled regime. Multicomputers are
commonly used when high computing power is required in an environment with restricted
physical space or electrical power.
In distributed computing a program is split up into parts that run simultaneously on
multiple computers communicating over a network. Distributed computing is a form of parallel
computing, but parallel computing is most commonly used to describe program parts running
simultaneously on multiple processors in the same computer. Both types of processing require
dividing a program into parts that can run simultaneously, but distributed programs often must
deal with heterogeneous environments, network links of varying latencies, and unpredictable
failures in the network or the computers.
Common suppliers include Mercury Computer Systems, CSPI, and SKY Computers.
Common uses include 3D medical imaging devices and mobile radar.
2. PIPELINING
The fundamental idea is to split the processing of a computer instruction into a series of
independent steps, with storage at the end of each step. This allows the computer's control
circuitry to issue instructions at the processing rate of the slowest step, which is much faster than
the time needed to perform all steps at once.
The term pipeline refers to the fact that every step is carrying data at the same time (like
water in a pipe), and each step is connected to the next. A task is broken into steps performed
by different units, and multiple inputs stream through the units: the next input enters a unit
as soon as the previous input is done with that unit, even though the previous input is not
necessarily done with the whole task.
A. BASIC CONCEPTS
Pipelining is a key implementation technique used to build fast processors. It allows the
execution of multiple instructions to overlap in time. A pipeline within a processor is similar to a
car assembly line. Each assembly station is called a pipe stage or a pipe segment. The computer
pipeline is divided in stages. Each stage completes a part of an instruction in parallel. The stages
are connected one to the next to form a pipe - instructions enter at one end, progress through the
stages, and exit at the other end.
In a pipeline, data is processed stage by stage, in sequence, with each stage performing
one operation. If there are n stages, then up to n operations are in progress at once.
Pipelining is used to increase the throughput of the processing hardware, because a sequence
of operations can be streamed through the stages at a high rate.
Pipelining is a particularly effective way of organizing parallel activity in a computer
system. The basic idea is very simple. It is frequently encountered in manufacturing plants,
where pipelining is commonly known as an assembly line operation. By laying the production
process out in an assembly line, product at various stages can be worked on simultaneously. This
process is also referred to as pipelining; because, as in a pipeline, new inputs are accepted at one
end before previously accepted inputs appear as outputs at the other end.
Pipelining does not decrease the time for individual instruction execution. Instead, it
increases instruction throughput. The throughput of the instruction pipeline is determined by how
often an instruction exits the pipeline.
Because the pipe stages are hooked together, all the stages must be ready to proceed at
the same time. We call the time required to move an instruction one step further in the pipeline a
machine cycle. The length of the machine cycle is determined by the time required for the
slowest pipe stage.
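The effect of the machine cycle on throughput and latency can be worked out with a small calculation. The sketch below assumes an idealized pipeline with no stalls and no register overhead. The function names and the example stage times are illustrative and do not come from any particular processor.

```python
def machine_cycle(stage_times):
    """The machine cycle is set by the slowest pipe stage."""
    return max(stage_times)


def unpipelined_time(stage_times, n_instructions):
    """Each instruction passes through every stage before the next one starts."""
    return sum(stage_times) * n_instructions


def pipelined_time(stage_times, n_instructions):
    """The first instruction needs k cycles to fill the pipe; after that,
    one instruction exits per machine cycle (ideal case, no stalls)."""
    k = len(stage_times)
    return (k + n_instructions - 1) * machine_cycle(stage_times)


if __name__ == "__main__":
    stages = [2, 3, 2, 1]  # assumed stage times in nanoseconds
    print(unpipelined_time(stages, 100))  # 800
    print(pipelined_time(stages, 100))    # 309
```

With these assumed numbers, a single instruction takes 4 x 3 = 12 ns to pass through the pipeline instead of 8 ns unpipelined, so individual latency gets worse, yet 100 instructions finish in 309 ns instead of 800 ns because one instruction exits every 3 ns.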
Pipelining has several limitations:
Pipeline latency. The fact that the execution time of each instruction does not decrease puts
limitations on pipeline depth. Interrupts are also harder to handle: an interrupt must take
effect between instructions, that is, when one instruction has completed and the next has not
yet begun, but with pipelining the next instruction has usually begun before the current one
has completed.
Imbalance among pipeline stages. Imbalance among the pipe stages reduces performance, since
the clock can run no faster than the time needed for the slowest pipeline stage.
Pipeline overhead. Pipeline overhead arises from the combination of pipeline register delay
(setup time plus propagation delay) and clock skew. Once the clock cycle is as small as the sum
of the clock skew and latch overhead, no further pipelining is useful, since there is no time left in
the cycle for useful work. Not all stages take the same amount of time. This means that the speed
gain of a pipeline will be determined by its slowest stage. This problem is particularly acute in
instruction processing, since different instructions have different operand requirements and
sometimes vastly different processing time. Moreover, synchronization mechanisms are required
to ensure that data is passed from stage to stage only when both stages are ready.
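The way overhead limits useful pipeline depth can be illustrated with a simple model. The sketch assumes that the logic divides into perfectly balanced stages and that every stage pays the same fixed overhead (register delay plus clock skew). Both are simplifying assumptions, and the function names are hypothetical.

```python
def cycle_time(total_logic_delay, n_stages, overhead):
    """Cycle time when the logic is split evenly and each stage adds overhead."""
    return total_logic_delay / n_stages + overhead


def speedup(total_logic_delay, n_stages, overhead):
    """Ideal throughput gain over an unpipelined unit that takes
    total_logic_delay per instruction."""
    return total_logic_delay / cycle_time(total_logic_delay, n_stages, overhead)


if __name__ == "__main__":
    for depth in (2, 5, 10, 100):
        print(depth, round(speedup(10.0, depth, 1.0), 2))
```

With 10 ns of logic and 1 ns of overhead, 10 stages give a speedup of 5 rather than 10, and no depth can push the speedup past 10, because the cycle can never be shorter than the overhead itself.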
B. INSTRUCTION QUEUE
The instruction queue is a first-in, first-out (FIFO) storage area for decoded instructions
and fetched operands.
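A FIFO instruction queue can be sketched with a double-ended queue. The class below is a hypothetical software model, not a description of any particular processor. The fixed capacity is an assumption added to show why the fetch side may have to wait.

```python
from collections import deque


class InstructionQueue:
    """Bounded FIFO buffer between the fetch/decode side and the execute side."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = deque()

    def is_full(self):
        return len(self._entries) >= self.capacity

    def enqueue(self, decoded_instruction, operands):
        """Add an entry at the tail; return False if the queue is full."""
        if self.is_full():
            return False  # the fetch side would have to stall
        self._entries.append((decoded_instruction, operands))
        return True

    def dequeue(self):
        """Remove and return the oldest entry, or None if the queue is empty."""
        return self._entries.popleft() if self._entries else None


if __name__ == "__main__":
    q = InstructionQueue(capacity=2)
    q.enqueue("ADD", (1, 2))
    q.enqueue("SUB", (5, 3))
    print(q.dequeue())  # ('ADD', (1, 2))
```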
The first step in applying pipelining techniques to instruction processing is to divide the
task into steps that may be performed with independent hardware. The most obvious division is
between the FETCH cycle (fetch and interpret instructions) and the EXECUTE cycle (access
operands and perform operation). If these two activities are to run simultaneously, they must use
independent registers and processing circuits, including independent access to memory (separate
MAR and MBR).
It is possible to further divide FETCH into fetching and interpreting, but since
interpreting is very fast this is not generally done. To gain the benefits of pipelining it is
desirable that each stage take a comparable amount of time.
A more practical division would split the EXECUTE cycle into three parts: Fetch
operands, perform operation, and store results. A typical pipeline might then have four stages
through which instructions pass, and each stage could be processing a different instruction at the
same time. The result of each stage is passed on to the next stage.
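The four-stage arrangement described above can be visualized with a small schedule generator. The stage names and the function pipeline_schedule are illustrative choices, and the model assumes an ideal pipeline in which every stage takes one cycle and nothing stalls.

```python
STAGES = ["FETCH", "FETCH_OPERANDS", "EXECUTE", "STORE"]


def pipeline_schedule(instructions, stages=STAGES):
    """Return one dict per clock cycle mapping each stage to the instruction
    occupying it, or None if the stage is idle."""
    n_cycles = len(instructions) + len(stages) - 1
    schedule = []
    for cycle in range(n_cycles):
        row = {}
        for position, stage in enumerate(stages):
            index = cycle - position  # instruction that has reached this stage
            row[stage] = instructions[index] if 0 <= index < len(instructions) else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(["I1", "I2", "I3"])):
        print(cycle, row)
```

In cycle 2 of this example, I3 is being fetched while I2 fetches its operands and I1 executes, so three different instructions are in progress at the same time.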
C. BRANCHING
In order to fetch the "next" instruction, we must know which one is required. If the
present instruction is a conditional branch, the next instruction may not be known until the
current one is processed.
The problem in branching is that the pipeline may be slowed down by a branch
instruction because we do not know which branch to follow. In the absence of any special help in
this area, it would be necessary to delay processing of further instructions until the branch
destination is resolved. Since branches are extremely frequent, this delay would be unacceptable.
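One technique is the delayed branch, in which the instruction immediately following the
branch is always executed while the branch is being resolved. Use of this technique requires
a coding method which is confusing for programmers but not too difficult for compiler code
generators.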
Most other techniques involve some type of speculative execution, in which instructions
are processed which are not known with certainty to be correct. It must be possible to discard or
"back out" from the results of this execution if necessary.
The usual solution is to follow the "obvious" branch, that is, the next sequential
instruction, taking care to perform no irreversible action. Operands may be fetched and
processed, but no results may be stored until the branch is decoded. If the choice was wrong, it
can be abandoned and the alternate branch can be processed.
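The rule of computing ahead but storing nothing until the branch is resolved can be sketched as follows. This is a simplified model under stated assumptions: the function execute_branch and its arguments are hypothetical, memory is a plain dictionary, and each path is a list of (address, compute) pairs.

```python
def execute_branch(memory, branch_taken, fall_through, target):
    """Speculatively process the fall-through path, committing results only
    once the branch outcome is known.

    Returns the number of speculative results that had to be discarded.
    """
    # Operands are fetched and processed, but results stay in a private buffer.
    pending = {addr: compute(memory) for addr, compute in fall_through}
    discarded = 0
    if branch_taken:
        # Wrong guess: abandon the buffered results and process the other path.
        discarded = len(pending)
        pending = {addr: compute(memory) for addr, compute in target}
    memory.update(pending)  # results are stored only after the branch is resolved
    return discarded


if __name__ == "__main__":
    mem = {"x": 1}
    fall = [("y", lambda m: m["x"] + 1)]
    tgt = [("z", lambda m: m["x"] * 10)]
    print(execute_branch(mem, True, fall, tgt), mem)  # 1 {'x': 1, 'z': 10}
```

Because the speculative value for y was never written to memory, abandoning it costs only the wasted work, not a repair of stored results.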
This method works reasonably well if the obvious branch is usually right. When coding
for such pipelined CPUs, care should be taken to code branches (especially error transfers) so
that the "straight through" path is the one usually taken. Of course, unnecessary branching should
be avoided.
Another possibility is to restructure programs so that fewer branches are present, such as
by "unrolling" certain types of loops. This can be done by optimizing compilers or, in some
cases, by the hardware itself.
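Loop unrolling can be illustrated by counting how many times the loop condition is tested. The two functions below are a hand-written sketch with hypothetical names. In practice an optimizing compiler would perform this transformation on machine code rather than on Python source.

```python
def sum_rolled(values):
    """Ordinary loop: one loop-condition branch per element, plus the exit test."""
    total, i, branches = 0, 0, 0
    while True:
        branches += 1
        if i >= len(values):
            break
        total += values[i]
        i += 1
    return total, branches


def sum_unrolled_by_4(values):
    """Unrolled loop: one loop-condition branch per group of four elements."""
    total, i, branches = 0, 0, 0
    n = len(values)
    while True:
        branches += 1
        if i + 4 > n:
            break
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    while True:  # clean-up loop for any leftover elements
        branches += 1
        if i >= n:
            break
        total += values[i]
        i += 1
    return total, branches


if __name__ == "__main__":
    data = list(range(1, 9))
    print(sum_rolled(data))         # (36, 9)
    print(sum_unrolled_by_4(data))  # (36, 4)
```

Both versions compute the same sum, but for eight elements the unrolled version tests a loop condition 4 times instead of 9, leaving fewer branches to disturb the pipeline.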
A more costly solution occasionally used is to split the pipeline and begin processing
both branches. This idea is receiving new attention in some of the newest processors.
D. DATA DEPENDENCY
When several instructions are in partial execution, a problem arises if they reference the
same data. We must ensure that a later instruction does not attempt to access data sooner than a
preceding instruction, if this will lead to incorrect results. For example, instruction N+1 must not
be permitted to fetch an operand that is yet to be stored into by instruction N.
To guard against data hazards it is necessary for each stage to be aware of the operands in
use by stages further down the pipeline. The type of use must also be known, since two
successive reads do not conflict and should not be cause to slow the pipeline. Only when writing
is involved is there a possible conflict.
The pipeline is typically equipped with a small associative check memory which can
store the address and operation type (read or write) for each instruction currently in the pipe. The
concept of "address" must be extended to identify registers as well. Each instruction can affect
only a small number of operands, but indirect effects of addressing must not be neglected.
As each instruction prepares to enter the pipe, its operand addresses are compared with
those already stored. If there is a conflict, the instruction (and usually those behind it) must wait.
When there is no conflict, the instruction enters the pipe and its operand addresses are stored in
the check memory. When the instruction completes, these addresses are removed. The memory
must be associative to handle the high-speed lookups required.
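The behavior of the check memory can be modeled in software. The class below is a hypothetical sketch: a real check memory compares all entries in parallel in hardware, whereas this model simply loops over them, and it treats register names and memory addresses alike as "addresses".

```python
class CheckMemory:
    """Tracks the operands of instructions currently in the pipe."""

    def __init__(self):
        self._in_flight = {}  # instruction id -> list of (address, op) pairs

    def conflicts(self, operands):
        """True if any operand clashes with an in-flight operand.
        Two reads of the same address never conflict; any write does."""
        for address, op in operands:
            for held in self._in_flight.values():
                for held_address, held_op in held:
                    if held_address == address and "write" in (op, held_op):
                        return True
        return False

    def try_enter(self, instr_id, operands):
        """Admit the instruction if it has no conflict; otherwise it must wait."""
        if self.conflicts(operands):
            return False
        self._in_flight[instr_id] = list(operands)
        return True

    def complete(self, instr_id):
        """Remove the operand addresses of a finished instruction."""
        self._in_flight.pop(instr_id, None)


if __name__ == "__main__":
    cm = CheckMemory()
    print(cm.try_enter("N", [("R1", "write"), ("R2", "read")]))  # True
    print(cm.try_enter("N+1", [("R1", "read")]))                 # False
    cm.complete("N")
    print(cm.try_enter("N+1", [("R1", "read")]))                 # True
```

Instruction N+1 is held back while N still has a pending write to R1, and it enters the pipe once N completes. The sketch does not model the instructions queued behind a waiting instruction.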
While the instruction unit is fetching instruction I+K+1, the instruction queue holds
instructions I+1, I+2, ..., I+K, and the execution unit executes instruction I. In this sense,
the CPU is a good example of a linear pipeline.
A processing unit may include multiple execution units and sequencer logic that is disposed
downstream of instruction buffer logic and is responsive to a sequencer instruction present
in an instruction stream. Each execution unit may contain multiple functional pipelines for
arithmetic logic functions. In response to a sequencer instruction, the sequencer logic issues
a plurality of instructions associated with a long latency operation to one execution unit,
while blocking instructions from the instruction buffer logic from being issued to that
execution unit.
In addition, the blocking of instructions from being issued to the execution unit does not
affect the issuance of instructions to any other execution unit, and as such, other instructions
from the instruction buffer logic are still capable of being issued to and executed by other
execution units even while the sequencer logic is issuing the plurality of instructions associated
with the long latency operation.