CS405 Computer System Architecture - Module 1

"ARALLEL COMPUTER MODELS * Parallel processing has emerged as a key enabling technology in modern computers, driven by the ever- Increasing demand for higher performance, lower costs, and sustained productivity in real-life applications. * Concurrent events are taking place in today’s high- performance computers due to the common practice of rmultiprogramming, muliprocessing, or multi computing. * Parallelism appears in various forms, such as pipelining, vectorization, concurrency, —simultaneity, data parallelism, parttioning, interleaving, overlapping, multiplicity, replication, time sharing, space sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels. EVOLUTION OF COMPUTER ARCHITECTURE The study of computer architecture invoves both programming/software requirements and hardware organization. Therefore the study of architecture covers both instruction set architectures and machine implementation organizations. ‘As shown in figure below, Evolution Started with the von Neumann architecture built as 2 sequential machine executing scalar data . The sequential computer was improved from bit-serial to word—parallel operations, and from fixed—point to floating point operations. The von Neumann architecture is slow due to sequential execution of instructions in programs. lookahead, parallelism, and pipelining: Lookshead techniques were introduced to prefetch instructions in order to overlap W/E (instruction fetch/ decode and execution} operations and to enable functional parallelism. Functional parallelism was supported by two approaches: One |s to use multiple functional units simultaneously, and the other Is to practice pipelining at. various processing levels. The latter includes pipelined instruction execution, Pipelined arithmetic computations, and memory-access operations. Pipelining has proven especially attractive in performing identical operations repeatedly over vector data strings. ‘A vector is one dimensional array of numbers. A vector processor is CPU that implements an instruction set containing instructions that operate on one dimensional arrays of data called vectors. \ector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors. ‘There are two Families of pipelined vector processors: ‘+ Memory -to-memory- architecture supports the pipelined flow of vector operands directly from the memory to pipelines and then back to the memory. + Register-to register architecture uses vector registers to Interface between the memory and functional pipelines ‘SYSTEM ATTRIBUTES TO PERFORMANCE System Attributes versus Performance Factors The eal performance of a computer system requires a perfect match between machine capability and program behaviour. Machine capability can be enhanced with better hardware technology, however program behavioutis difficult to precict due to its dependence on application and run-time conditions. Below are the five fundamental factors for projecting the performance of a computer. CPU is driven by a clock of constant clock with a cycle time (tr). The inverse of cycle time is the clock rate (f=1/ 1) Size of the program is determined by the Instruction Count(Ic). Different instructions in a particular program may require different number of clock cycles to ‘execute. 
SYSTEM ATTRIBUTES TO PERFORMANCE

System Attributes versus Performance Factors

The real performance of a computer system requires a perfect match between machine capability and program behaviour. Machine capability can be enhanced with better hardware technology; program behaviour, however, is difficult to predict because it depends on the application and on run-time conditions. The five fundamental factors for projecting the performance of a computer are introduced below.

- Clock cycle time (τ): the CPU is driven by a clock with a constant cycle time τ. The inverse of the cycle time is the clock rate, f = 1/τ.
- Instruction count (Ic): the size of the program, measured as the total number of instructions executed.
- Cycles per instruction (CPI): different instructions in a program may require different numbers of clock cycles to execute, so CPI is an important parameter for measuring the average time needed to execute an instruction.

Execution time / CPU time (T): with Ic the instruction count (total number of instructions in the program), the execution time or CPU time is

    T = Ic × CPI × τ

The execution of an instruction involves instruction fetch, decode, operand fetch, execution, and storing the results. The instruction cycle can therefore be decomposed into processor cycles and memory cycles:

    CPI = instruction cycle = p + m × k

where
    p = number of processor cycles per instruction
    m = number of memory references per instruction
    k = latency factor (how much slower the memory is relative to the CPU)

Therefore,

    T = Ic × (p + m × k) × τ

From the above equation, the five factors affecting performance are Ic, p, m, k, and τ.

MIPS rate: processor speed is often measured in millions of instructions per second (MIPS); we simply call it the MIPS rate of a given processor. With C the total number of clock cycles needed to execute the program (C = Ic × CPI):

    MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)

    CPU time T = (Ic × 10^-6) / MIPS
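A small numerical sketch of these formulas; the values of Ic, p, m, k, and τ are made up purely to exercise the equations, not taken from the text:

    def cpi_from_cycles(p, m, k):
        # CPI = p + m*k: processor cycles plus memory references
        # weighted by the memory/processor speed ratio k.
        return p + m * k

    def cpu_time(ic, cpi, tau):
        # T = Ic * CPI * tau, in seconds.
        return ic * cpi * tau

    def mips_rate(ic, t):
        # MIPS = Ic / (T * 10^6).
        return ic / (t * 1e6)

    ic = 200_000                             # hypothetical instruction count
    tau = 25e-9                              # 25 ns cycle time, i.e. f = 40 MHz
    cpi = cpi_from_cycles(p=2, m=1, k=4)     # 2 CPU cycles + 1 memory reference, memory 4x slower
    t = cpu_time(ic, cpi, tau)
    print(cpi, t, mips_rate(ic, t))          # 6, 0.03 s, about 6.67 MIPS

The same MIPS value also falls out of f / (CPI × 10^6) = 40 × 10^6 / (6 × 10^6) ≈ 6.67, as expected.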
FLYNN'S CLASSIFICATION

Michael Flynn (1972) introduced a classification of computer architectures based on the notions of instruction and data streams. A stream denotes a sequence of items (instructions or data) executed or operated upon by a single processor. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit; the data stream is defined as the data traffic exchanged between the memory and the processing unit.

Notation used in the figures: CU = Control Unit, PU = Processing Unit, MU = Memory Unit, IS = Instruction Stream, DS = Data Stream, PE = Processing Element, LM = Local Memory.

1. SISD (Single Instruction stream, Single Data stream)
Conventional sequential machines are called SISD computers. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining).
(Figure: SISD uniprocessor architecture)

2. SIMD (Single Instruction stream, Multiple Data streams)
Represents vector computers/array processors equipped with scalar and vector hardware. There are multiple processing elements supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams.
(Figure: SIMD architecture with distributed memory)

3. MIMD (Multiple Instruction streams, Multiple Data streams)
The most popular model. The term "parallel computer" is usually reserved for MIMD machines.
(Figure: MIMD architecture)

4. MISD (Multiple Instruction streams, Single Data stream)
The same data stream flows through a linear array of processors executing different instruction streams. This architecture is also known as a systolic array, used for pipelined execution of specific algorithms.

Of the four machine models, most parallel computers built in the past assumed the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model applied in commercial machines.

MFLOPS

Most compute-intensive applications in science and engineering make heavy use of floating-point operations. For such applications a more relevant measure of performance is floating-point operations per second, abbreviated flops, with prefixes mega (10^6), giga (10^9), tera (10^12), or peta (10^15). Floating-point performance is expressed as millions of floating-point operations per second (MFLOPS), defined over floating-point instructions only:

    MFLOPS = (number of executed floating-point operations in a program) / (execution time × 10^6)

THROUGHPUT RATE

The number of programs executed per unit time is called the system throughput Ws (in programs per second). In a multiprogrammed system, the system throughput is often lower than the CPU throughput Wp, defined by

    Wp = 1/T = f / (Ic × CPI)    or equivalently    Wp = (MIPS × 10^6) / Ic

IMPLICIT PARALLELISM

1. Implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs.
2. Uses conventional languages such as C, C++, FORTRAN, or Pascal to write the source program.
3. The sequentially coded source program is translated into parallel object code by a parallelizing compiler.
4. The compiler detects parallelism and assigns target machine resources.
5. Success relies on the intelligence of the parallelizing compiler; it requires less effort from programmers.
6. Applied in shared-memory multiprocessors.

EXPLICIT PARALLELISM

1. In computer programming, explicit parallelism is the representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls. Most parallel primitives are related to process synchronization, communication, or task partitioning.
2. Requires more effort by programmers to develop a source program using parallel dialects of languages such as C, C++, Fortran, and Pascal.
3. Parallelism is explicitly specified in the user programs.
4. The burden on the compiler is reduced because parallelism is specified explicitly.
5. The programmer's effort is greater; special software tools are needed to make the environment friendlier to user groups.
6. Applied in loosely coupled multiprocessors as well as tightly coupled ones.

AMDAHL'S LAW

- Named after computer scientist Gene Amdahl (a computer architect from IBM and the Amdahl Corporation); it is also known as Amdahl's argument.
- It is a formula which gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved.
- In other words, it is a formula used to find the maximum improvement possible by improving only a particular part of a system.
- It is often used in parallel computing to predict the theoretical speedup when using multiple processors.
- Speedup is defined as the ratio of performance for the entire task using the enhancement to performance for the entire task without using the enhancement.
- Equivalently, speedup can be defined as the ratio of execution time for the entire task without the enhancement to execution time for the entire task using the enhancement.
- If Pe is the performance for the entire task using the enhancement when possible, Pw is the performance without the enhancement, Ew is the execution time without the enhancement, and Ee is the execution time using the enhancement when possible, then
      Speedup = Pe/Pw = Ew/Ee
- Amdahl's law uses two factors to find the speedup: the fraction of the task that benefits from the enhancement, and the factor by which that fraction is sped up (a worked sketch follows below).
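A minimal sketch of Amdahl's law in that two-factor form, assuming the usual formulation Speedup = 1 / ((1 - F) + F/S), where F is the enhanced (e.g. parallelizable) fraction and S is the speedup applied to that fraction (e.g. the number of processors). The 0.9 fraction below is an arbitrary example value:

    def amdahl_speedup(f, s):
        # f: fraction of the task that benefits from the enhancement
        # s: speedup factor applied to that fraction (e.g. number of processors)
        return 1.0 / ((1.0 - f) + f / s)

    for n in (2, 4, 16, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
    # 2 -> 1.82, 4 -> 3.08, 16 -> 6.4, 1024 -> 9.91
    # Even with 1024 processors, a 10% sequential part caps the speedup near 10x.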
MULTIPROCESSORS AND MULTICOMPUTERS

Multiprocessors
1. A single computer with multiple processors.
2. The processing elements (PEs/CPUs) do not have their own individual memories; memory and I/O resources are shared. Hence they are called shared-memory multiprocessors.
3. Communication between PEs is a must.
4. Tightly coupled, due to the high degree of resource sharing.
5. Use a dynamic network, so the communication links can be reconfigured.
6. Example: Sequent Symmetry S-81.
7. Three types: the UMA model, the NUMA model, and the COMA model.

Multicomputers
1. Multiple autonomous computers.
2. Each PE has its own memory and resources, with no sharing. Hence they are called distributed-memory multicomputers.
3. Communication between PEs is not mandatory.
4. Loosely coupled, as there is no resource sharing.
5. Use a static network; the connection of the switching units is fixed.
6. Example: message-passing multicomputers.
7. NORMA model / distributed-memory multicomputer.

THE UMA MODEL
- Physical memory is uniformly shared by all processors.
- All processors (PE1 ... PEn) take equal access time to memory, hence the term Uniform Memory Access computers.
- Each PE can have its own private cache.
- High degree of resource sharing (memory and I/O); tightly coupled.
- The interconnection network can be a common bus, a crossbar switch, or a multistage network (discussed later).
- When all PEs have equal access to all peripheral devices, the system is a symmetric multiprocessor. In an asymmetric multiprocessor, only a subset of processors has peripheral access; master processors control the slave (attached) processors.
(Figure: the UMA multiprocessor model - processors connected through a system interconnect (bus, crossbar, or multistage network) to shared memories SM1 ... SMm and shared I/O)

Applications of the UMA model
- Suitable for general-purpose and time-sharing applications by multiple users.
- Can be used to speed up the execution of a single program in time-critical applications.

Disadvantages
- Interacting processes cause simultaneous accesses to the same locations; this causes a problem when an update is followed by a read operation (the old value may be read).
- Poor scalability: as the number of processors increases, contention for the shared memory grows and the network becomes a bottleneck. The number of processors is usually in the range 10-100.

THE NUMA MODEL (Example: BBN Butterfly)
(Figure: two NUMA models for multiprocessor systems - (a) shared local memories, e.g. the BBN Butterfly; (b) a hierarchical cluster model, e.g. the Cedar system at the University of Illinois)
- Access time varies with the location of the memory.
- Shared memory is distributed to all processors as local memories; the collection of all local memories forms a global memory space accessible by all processors.
- It is faster to access content within the local memory of a processor than to access remote memory attached to another processor (because of the delay through the interconnection network). Hence the name Non-Uniform Memory Access: the access time depends on whether the data is available in the local memory of the processor itself or not.
- Advantage: reduces the read/write bottleneck that occurs in UMA, because each processor has a direct access path to its attached local memory.
- Three types of memory access patterns:
  1. Fastest: the local memory of the PE itself.
  2. Next fastest: global memory shared by the PEs/cluster.
  3. Slowest: remote memory (the local memory of a PE in another cluster).

THE COMA MODEL
(Figure: the COMA model of a multiprocessor - P: processor, C: cache, D: directory)
- Multiprocessor + cache-only memory = COMA model, i.e. a multiprocessor using cache-only memory. Examples: the Data Diffusion Machine, the KSR-1 machine.
- A special case of the NUMA machine in which the distributed main memories are converted into caches; all the caches together form a global address space.
- Remote cache access is assisted by distributed cache directories (D in the figure above).

Applications of COMA: general-purpose multiuser applications.
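UMA, NUMA, and COMA are all shared-memory designs: processors communicate by reading and writing a common address space, which is also why the update-then-read hazard noted under the UMA disadvantages requires synchronization. The sketch below only illustrates that programming style with Python threads inside one process (threads share a single address space); it is not a model of any particular multiprocessor.

    import threading

    counter = 0                     # shared variable: all threads see the same memory location
    lock = threading.Lock()

    def worker(n):
        global counter
        for _ in range(n):
            with lock:              # synchronization is needed precisely because memory is shared
                counter += 1

    threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                  # 40000: every thread updated the same shared counter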
DISTRIBUTED MEMORY / NORMA / MULTICOMPUTERS
- Consists of multiple computers, called nodes.
- A node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals.
- Nodes are interconnected by a message-passing network, which can be a mesh, ring, torus, hypercube, etc. (discussed later).
(Figure: generic model of a message-passing multicomputer)
- The interconnection provides point-to-point static connections among the nodes.
- Local memories are private and accessible only by the local processor. For this reason, multicomputers are also called No-Remote-Memory-Access (NORMA) machines (the difference from UMA and NUMA).
- Communication between nodes, when required, is carried out by passing messages through the static connection network.

Advantages over shared memory
- Scalable and flexible: more CPUs (nodes) can be added.
- Reliable and available: with shared memory, a single failure can bring the whole system down.

Disadvantage
- Considered harder to program, because programmers are used to programming on common (shared) memory systems.
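By contrast with the shared-memory sketch above, a NORMA-style program keeps all data private to each node and exchanges only messages. The sketch below imitates that style with Python processes (separate address spaces) and a queue standing in for the message-passing network; the partial-sum workload is just an illustrative example, not drawn from the text.

    from multiprocessing import Process, Queue

    def node(rank, outbox):
        # Each process is a "node" with private memory; other nodes can see
        # this data only if it is sent to them as a message.
        local_data = list(range(rank * 4, rank * 4 + 4))
        outbox.put(sum(local_data))          # send the partial result as a message

    if __name__ == "__main__":
        results = Queue()
        nodes = [Process(target=node, args=(r, results)) for r in range(4)]
        for p in nodes:
            p.start()
        for p in nodes:
            p.join()
        print(sum(results.get() for _ in range(4)))   # global sum of 0..15 = 120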
