CH17 COA9e Parallel Processing
William Stallings
Computer Organization and Architecture
9th Edition
+
Chapter 17
Parallel Processing
+
17.1 Multiple Processor Organization
Types of Parallel Processor Systems
Advantages of the bus organization:
Simplicity
Simplest approach to multiprocessor organization
Flexibility
Generally easy to expand the system by attaching more processors to the
bus
Reliability
The bus is essentially a passive medium and the failure of any attached
device should not cause failure of the whole system
+
Multiprocessor operating system design considerations:
Scheduling
Any processor may perform scheduling, so conflicts must be avoided
Scheduler must assign ready processes to available processors
Synchronization
With multiple active processes having potential access to shared address spaces or I/O resources, care must be
taken to provide effective synchronization
Synchronization is a facility that enforces mutual exclusion and event ordering (see the sketch after this list)
Memory management
In addition to dealing with all of the issues found on uniprocessor machines, the OS needs to exploit the
available hardware parallelism to achieve the best performance
Paging mechanisms on different processors must be coordinated to enforce consistency when several processors
share a page or segment and to decide on page replacement
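As a minimal illustration of the mutual exclusion mentioned above, the following sketch (assuming POSIX threads; the shared counter, loop count, and thread count are illustrative) uses a mutex so that only one thread at a time updates a shared variable.

/* Mutual exclusion sketch: two threads increment one shared counter. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);              /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);           /* always 200000 */
    return 0;
}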
+
Solutions to the Cache Coherence Problem
Software solutions:
Attempt to avoid the need for additional hardware circuitry and logic by relying on the compiler and operating system to deal with the problem
Hardware solutions (cache coherence protocols):
Because the problem is only dealt with when it actually arises, there is more effective use of caches, leading to improved performance over a software approach
Approaches are transparent to the programmer and the compiler, reducing the software development burden
MESI is a write-invalidate protocol: when a write is required, all other caches of the line are invalidated
Table 17.1
MESI Cache Line States
Modified: The line in the cache has been modified and is available only in this cache
Exclusive: The line in the cache is the same as that in main memory and is not present in any other cache
Shared: The line in the cache is the same as that in main memory and may be present in another cache
Invalid: The line in the cache does not contain valid data
MESI State Transition Diagram
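As a concrete illustration of the states in Table 17.1, here is a minimal C sketch of the transitions for a single cache line under a simplified bus-snooping model; the type names, event names, and the shared_elsewhere flag are illustrative assumptions, not taken from the text.

/* Simplified MESI transitions for one cache line. */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

typedef enum {
    LOCAL_READ,   /* this processor reads the line                      */
    LOCAL_WRITE,  /* this processor writes the line                     */
    BUS_READ,     /* a read by another processor is snooped on the bus  */
    BUS_WRITE     /* a write/invalidate by another processor is snooped */
} mesi_event_t;

/* shared_elsewhere: snoop response saying some other cache holds the line;
   used on a read miss to choose Exclusive vs Shared. */
static mesi_state_t next_state(mesi_state_t s, mesi_event_t e, int shared_elsewhere)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                       /* read miss: line is fetched   */
            return shared_elsewhere ? SHARED : EXCLUSIVE;
        return s;                               /* read hit: state unchanged    */
    case LOCAL_WRITE:
        return MODIFIED;                        /* other copies are invalidated */
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE)    /* another reader now shares it */
            return SHARED;                      /* (Modified also writes back)  */
        return s;
    case BUS_WRITE:
        return INVALID;                         /* another writer invalidates us */
    }
    return s;
}

int main(void)
{
    mesi_state_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);   /* Invalid   -> Exclusive */
    s = next_state(s, LOCAL_WRITE, 0);  /* Exclusive -> Modified  */
    s = next_state(s, BUS_READ, 0);     /* Modified  -> Shared    */
    s = next_state(s, BUS_WRITE, 0);    /* Shared    -> Invalid   */
    printf("final state: %d\n", s);     /* prints 3 (INVALID)     */
    return 0;
}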
+
17.4 Multithreading and Chip Multiprocessors
Processor performance can be measured by the rate at which it executes
instructions
Multithreading
Allows for a high degree of instruction-level parallelism without increasing
circuit complexity or power consumption
Instruction stream is divided into several smaller streams, known as threads,
that can be executed in parallel
Definitions of Threads and Processes
Thread in multithreaded processors may or may not be the same as the concept of software threads in a multiprogrammed operating system
Thread is concerned with scheduling and execution, whereas a process is concerned with both scheduling/execution and resource ownership
Thread:
• Dispatchable unit of work within a process
• Includes processor context (which includes the program counter and stack pointer) and data area for stack
• Executes sequentially and is interruptible so that the processor can turn to another thread
Process:
• An instance of a program running on a computer
• Two key characteristics:
• Resource ownership
• Scheduling/execution
Thread switch
• The act of switching processor control between threads within the same process
• Typically less costly than a process switch
Process switch
• Operation that switches the processor from one process to another by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second
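To make these definitions concrete, here is a minimal sketch assuming POSIX threads: several threads are dispatched within a single process and, because they share the process's address space, each can fill its own slice of the same array without copying data. The array, slice structure, and worker function are illustrative.

/* Several threads of one process cooperating on shared data. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N 1000

static double data[N];               /* shared by all threads in the process */

struct slice { int lo, hi; };

static void *worker(void *arg)
{
    struct slice *s = (struct slice *)arg;
    for (int i = s->lo; i < s->hi; i++)
        data[i] = 2.0 * i;           /* each thread fills its own slice */
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    struct slice sl[N_THREADS];

    for (int t = 0; t < N_THREADS; t++) {
        sl[t].lo = t * (N / N_THREADS);
        sl[t].hi = (t + 1) * (N / N_THREADS);
        pthread_create(&tid[t], NULL, worker, &sl[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);

    printf("data[%d] = %.1f\n", N - 1, data[N - 1]);   /* 1998.0 */
    return 0;
}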
Interleaved multithreading (fine-grained)
• Processor deals with two or more thread contexts at a time
• Switching thread at each clock cycle
• If a thread is blocked it is skipped
Blocked multithreading (coarse-grained)
• Thread executed until an event causes delay
• Effective on an in-order processor
• Avoids pipeline stall
Simultaneous multithreading (SMT)
• Instructions are simultaneously issued from multiple threads to execution units of a superscalar processor
Chip multiprocessing
• Processor is replicated on a single chip
• Each processor handles separate threads
• Advantage is that the available logic area on a chip is used effectively
+
Approaches to Executing Multiple Threads
+
Example Systems
Cluster Configurations
Table 17.2
Clustering Methods: Benefits and Limitations
+
Operating System Design Issues
Failover
The function of switching applications and data resources over from a failed system to an
alternative system in the cluster
Failback
Restoration of applications and data resources to the original system once it has been
fixed
Load balancing (see the round-robin sketch after this list)
Incremental scalability
Automatically include new computers in scheduling
Middleware needs to recognize that processes may switch between machines
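A minimal sketch of the load-balancing idea, assuming a hypothetical round-robin dispatcher in C: requests rotate across the current list of cluster nodes, and incremental scalability amounts to adding an entry to that list. The node names and request count are illustrative.

/* Round-robin assignment of requests to cluster nodes. */
#include <stdio.h>

int main(void)
{
    const char *nodes[] = { "node0", "node1", "node2" };   /* cluster members */
    int n_nodes = (int)(sizeof nodes / sizeof nodes[0]);
    int next = 0;

    for (int request = 0; request < 7; request++) {
        printf("request %d -> %s\n", request, nodes[next]);
        next = (next + 1) % n_nodes;        /* rotate through available nodes */
    }
    return 0;
}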
Parallelizing Computation
SMP
• Easier to manage and configure
• Much closer to the original single-processor model for which nearly all applications are written
• Less physical space and lower power consumption
Clustering
• Far superior in terms of incremental and absolute scalability
• Superior in terms of availability
• All components of the system can readily be made highly redundant
CC-NUMA Organization
+
NUMA Pros and Cons
+
Vector Computation
Need for high precision and a program that repetitively performs floating-point arithmetic calculations on large arrays of numbers
Most of these problems fall into the category known as continuous-field simulation
Array processor
Designed to address the need for vector computation
Configured as peripheral devices by both mainframe and minicomputer users to
run the vectorized portions of programs
Vector Addition Example
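A minimal C sketch of the vector addition C(i) = A(i) + B(i): the scalar loop below performs one add per iteration, which is the work a vector facility would instead issue as a single vector instruction over all N elements. N and the operand values are illustrative.

/* Scalar form of C = A + B; a vector processor executes the loop body
   as one N-element vector operation. */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* set up sample operands */
        a[i] = i;
        b[i] = 10.0 * i;
    }

    for (int i = 0; i < N; i++)     /* one add per iteration on a scalar machine */
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("c[%d] = %.1f\n", i, c[i]);
    return 0;
}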
+
Matrix Multiplication
(C = A * B)
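A minimal C sketch of the matrix product C = A * B for N x N matrices using the standard triple loop; a vector facility would express the inner work as vector operations over whole rows or columns. N and the sample operands are illustrative (B is the identity, so C should reproduce A).

/* Triple-loop matrix multiplication: c[i][j] = sum over k of a[i][k] * b[k][j]. */
#include <stdio.h>

#define N 3

int main(void)
{
    double a[N][N], b[N][N], c[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;                    /* sample operand            */
            b[i][j] = (i == j) ? 1.0 : 0.0;     /* identity matrix           */
        }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++)         /* accumulate the dot product */
                c[i][j] += a[i][k] * b[k][j];
        }

    for (int i = 0; i < N; i++)
        printf("c[%d][0..2] = %.1f %.1f %.1f\n", i, c[i][0], c[i][1], c[i][2]);
    return 0;
}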
+
Approaches to Vector Computation
+
Pipelined Processing of Floating-Point Operations
A Taxonomy of Computer Organizations
+
Vector computation
Approaches to vector computation
IBM 3090 vector facility
+
Key Terms
Chapter 17