Ch-8 Shared Memory Multiprocessors
Advanced computer architecture
This chapter deals with multiprocessors and multicomputers, basic issues related with multiprocessing, and static and dynamic interconnection networks. After studying this chapter, students will be able to understand how the various processors communicate with each other and how they are connected using different interconnection schemes.

8.1 INTRODUCTION

Multiprocessors are classified as multiple instruction stream, multiple data stream (MIMD) systems. The most obvious form of multiprocessor architecture is the shared memory multiprocessor. Multiprocessors may share memory by either
1. Sharing a common data cache.
2. Sharing a common bus to memory.
3. Sharing a network which interconnects the processors to all of memory.

CHARACTERISTICS OF MULTIPROCESSORS

Multiprocessing is the use of two or more central processing units within a single computer system; the term refers to the interconnection of two or more CPUs with memory and input-output equipment. Multiprocessing sometimes refers to the execution of multiple concurrent software processes, as opposed to a single process at any one instant. A system can have both multiprocessing and multiprogramming, only one of the two, or neither of them.

Fig. 8.1 Multiprocessor.
Fig. 8.2 MIMD architecture. Here CU_1 ... CU_n represent control units, PU_1 ... PU_n represent processing units, I_1 ... I_n are the instruction streams and D_1 ... D_n are the data streams.

There are two terms, multiprocessor and multicomputer. The similarity between a multiprocessor and a multicomputer is that both support concurrent operations, but there are some differences.

Table 8.1 Difference between multiprocessors and multicomputers.
- Multiprocessor: a multiprocessor system is controlled by one operating system that provides the interaction between the processors, and all the components of the system cooperate in the execution of a problem. Multiprocessors share a common memory.
- Multicomputer: in multicomputers, many computers are interconnected with each other by means of communication lines to form a computer network. The network consists of several autonomous computers that may or may not communicate with each other. These systems are also known as cluster computers and COWs (Clusters of Workstations). The source of high performance is the interconnection network. Multicomputers have memories of their own and do not share them.

Advantages and disadvantages of multicomputers (loosely coupled CPUs whose memories are not shared):
1. Multicomputers are highly scalable.
2. Message passing solves the memory access synchronization problem.
But there is a load balancing problem between processes, deadlock may occur in message passing, and there may be a need to physically copy data between processes.

Fig. 8.4 Multiprocessor architecture (processors connected to the memories through an interconnection network).

Advantages and disadvantages of the multiprocessor (shared memory system):
- Communication between processors is efficient.
- Synchronized access to shared data in memory is needed. Synchronizing constructs (semaphores, conditional critical regions, monitors) result in nondeterministic behaviour, which can lead to programming errors that are difficult to discover.
- There is a lack of scalability due to the (memory) contention problem.

Multicomputers improve the reliability of the system, so that a failure in one part has a limited effect on the rest of the system. The benefit derived from a multiprocessor organization is improved system performance.
High performance is achieved because computations can proceed in parallel in one of two ways:
1. Multiple independent jobs can be made to operate in parallel.
2. A single job can be partitioned into multiple parallel tasks.

There exists a third type of system called distributed systems. Distributed systems are multicomputers using virtual shared memory. In these systems the local memories of the multicomputer are components of a global address space, i.e., any processor can access the local memory of any other processor.

Table 8.2 Comparison of three kinds of multiple-CPU systems.

Shared memory MIMD machines are further classified according to how memory is accessed:
1. NUMA: non-uniform memory access machines.
2. COMA: cache-only memory access machines.
3. CC-NUMA: cache-coherent non-uniform memory access machines.
COMA and CC-NUMA are used in newer generations of computers, as is clear from Fig. 8.7.

Fig. 8.7 CC-NUMA architecture.

CLASSIFICATION OF MULTIPROCESSORS

Multiprocessors are classified into two categories: 1. tightly-coupled systems and 2. loosely-coupled systems.

Table 8.3 Classification of multiprocessors.

Tightly-coupled systems:
1. Tightly-coupled multiprocessor systems contain multiple CPUs that are connected at the bus level. These CPUs may have access to a central shared memory (SMP or UMA), or may participate in a memory hierarchy with both local and shared memory (NUMA).
2. Tightly-coupled systems perform better and are physically smaller than loosely-coupled systems.
3. Tightly-coupled systems are more expensive.
4. In a tightly-coupled system, the delay experienced when a message is sent from one computer to another is short and the data rate is high; that is, the number of bits per second that can be transferred is large.
5. For example, multiple processors on the same printed circuit board are tightly coupled.

Loosely-coupled systems:
1. Loosely-coupled multiprocessor systems (often referred to as clusters) are based on multiple standalone single or dual processor commodity computers interconnected via a high-speed communication system.
2. A loosely-coupled system is just the opposite of a tightly-coupled system.
3. Loosely-coupled systems are less expensive.
4. In a loosely-coupled system the opposite is true: the inter-machine message delay is large and the data rate is low.
5. For example, two computers connected by a 2400 bit/s modem over the telephone system are loosely coupled.
Fig. 8.29 System bus (data, address and control buses) connecting processors P_1 to P_n.

A bus is of two types: synchronous and asynchronous.

Synchronous bus: As already said, synchronization means that a global clock is available. Thus on a synchronous bus each data item is transferred during a fixed time interval; this time interval must be known to both source and destination. If a global clock is not available to the whole system and the units have separate clocks of approximately the same frequency, then synchronization signals must be transmitted periodically in order to keep all the clocks synchronized with each other.

Asynchronous bus: Here the data item being transferred is accompanied by handshaking control signals, which indicate when the data has been transferred from the source and received by the destination.

The control lines provide signals for controlling the information transfer between units. Timing signals indicate the validity of data and address information, and command signals specify the operations to be performed. Control lines include transfer signals such as memory read and write, interrupt request and acknowledge, and bus request and grant signals.

If interrupts occur, they may be detected by hardware or software.
1. Software: A polling procedure is used to identify the interrupt source with the highest priority. One common branch address is used for all interrupts; the sources are polled in sequence, so the priority of each source is determined by the order in which it is tested. The highest-priority interrupt signal is tested first and, once identified, its service routine is executed.
2. Hardware: A hardware priority interrupt unit functions as an overall manager in an interrupt system environment. The unit accepts interrupt requests from many sources, determines which request has the highest priority, and issues an interrupt request to the computer based on this determination. To speed up the operation, each interrupt source has its own interrupt vector address to access its own service routine directly.

Serial Arbitration

Arbitration procedures service all requests on the basis of priority. Serial and parallel are the two connection schemes for arbitration. In serial arbitration the priority resolving technique is obtained from a daisy-chain connection.

Fig. 8.30 Daisy-chain method (serial arbitration).

Working of the daisy-chain method:
1. All devices that can request an interrupt are connected serially, in priority order, with the highest-priority device placed closest to the CPU and the lowest-priority device placed last in the chain.
2. First, any (or all) of the devices signal an interrupt on the interrupt request line.
3. Next, the CPU acknowledges the interrupt on the interrupt acknowledge line.
4. A device on the line passes the interrupt acknowledge signal to the next lower-priority device only if it is not itself requesting service; a requesting device blocks the signal, so a device with a 0 on its priority-in (PI) input generates a 0 on its priority-out (PO) output.
Thus, the device with PI = 1 and PO = 0 is the one with the highest priority that is requesting an interrupt, and this device places its VAD (vector address) on the data bus.
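The priority-passing behaviour of the daisy chain can be illustrated with a short C sketch. This is a simplified software model of the hardware chain, not the circuit itself; the number of devices and the request pattern are made-up values.

    #include <stdio.h>

    #define NDEV 4

    /* Software model of a daisy chain: devices are stored in priority order,
     * index 0 being closest to the CPU (highest priority). Each device either
     * claims the acknowledge or passes it (PI -> PO) to its neighbour. */
    int daisy_chain_grant(const int request[NDEV])
    {
        int pi = 1;                         /* acknowledge entering the chain  */
        for (int dev = 0; dev < NDEV; dev++) {
            int po = request[dev] ? 0 : pi; /* a requesting device blocks PO   */
            if (pi == 1 && po == 0)
                return dev;                 /* PI = 1, PO = 0: this device wins */
            pi = po;                        /* pass acknowledge down the chain */
        }
        return -1;                          /* no device was requesting        */
    }

    int main(void)
    {
        int request[NDEV] = {0, 1, 0, 1};   /* devices 1 and 3 request together */
        int winner = daisy_chain_grant(request);
        printf("device %d places its VAD on the data bus\n", winner);  /* 1 */
        return 0;
    }

With devices 1 and 3 requesting at the same time, the device nearer the CPU (device 1) receives the grant, which is exactly the PI/PO rule described above.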
8.6.2 Parallel Arbitration

Fig. 8.31 Parallel arbitration (bus arbiters with request and acknowledge lines, a priority encoder, a decoder and the bus busy line).

As is clear from the figure, this technique uses an external priority encoder and a decoder. Here the size is taken as 4*2 because there are four bus arbiters; it can be changed accordingly. Each arbiter has a bus request output line and a bus acknowledge input line. If an arbiter wants to access the system bus, it enables its request line. The 4*2 encoder generates a 2-bit code which represents the highest-priority unit among those requesting the bus.

A priority encoder is a circuit that compresses multiple binary inputs into a smaller number of outputs. The output of a priority encoder is the binary representation of the ordinal number, starting from zero, of the most significant active input bit. Priority encoders are often used to control interrupt requests by acting on the highest-priority request.

Table 8.6 Priority encoder truth table (the highest-priority active request determines the 2-bit output code).

If its bus acknowledge line is enabled, the processor takes control of the bus. A 2*4 decoder enables the proper acknowledge line to grant access to the highest-priority unit. A decoder is a device which performs the reverse operation of an encoder, undoing the encoding so that the original information can be retrieved; the same method used to encode is usually just reversed in order to decode. It is a combinational circuit that converts binary information from n input lines to a maximum of 2^n unique output lines.

The bus busy line indicates that the bus is being used; thus, this line provides for the orderly transfer of control.
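The request-to-grant path through the encoder and decoder of Fig. 8.31 can be modelled with a few lines of C. This is a software illustration only; treating request 0 as the highest priority is an assumption made here, and the exact order is the one given by Table 8.6.

    #include <stdio.h>

    /* 4-to-2 priority encoder: returns the index of the highest-priority
     * active request (request 0 assumed highest here), or -1 if none. */
    int priority_encode(const int req[4])
    {
        for (int i = 0; i < 4; i++)
            if (req[i])
                return i;          /* the 2-bit code 00, 01, 10, 11 as an int */
        return -1;
    }

    /* 2-to-4 decoder: drives exactly one acknowledge line from the code. */
    void decode_ack(int code, int ack[4])
    {
        for (int i = 0; i < 4; i++)
            ack[i] = (i == code);  /* only the winning arbiter is granted */
    }

    int main(void)
    {
        int req[4] = {0, 1, 1, 0}; /* arbiters 1 and 2 request the bus    */
        int ack[4];
        int code = priority_encode(req);
        decode_ack(code, ack);
        printf("grant code %d -> ack lines %d %d %d %d\n",
               code, ack[0], ack[1], ack[2], ack[3]);   /* 1 -> 0 1 0 0   */
        return 0;
    }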
8.7 BASIC MULTIPROCESSORS

We have studied multiprocessors, with their advantages and disadvantages. The multiprocessor configuration introduces speedup potential as well as additional sources of delay and slowdown of performance. The efficiency of a multiprocessor also depends on the way the program is decomposed into smaller subprograms to allow concurrent execution on multiple processors. Three basic issues in multiprocessors are considered:
1. Partitioning
2. Scheduling of tasks
3. Synchronization.

8.7.1 Partitioning

As the name indicates, partitioning is the process of dividing a program into tasks, each of which can be assigned to an individual processor for execution at run time. The result of partitioning can be represented by a graph in which the nodes are the tasks and the arcs specify the dependences among them.

Fig. 8.32 Partitioning of a process P.

Granularity is defined by the relationship between the size of a task and the amount of data on which it operates, i.e., by the number of data elements mapped to each processing element (PE). A system in which only one (or a few) data element maps to each PE is regarded as having fine granularity; a system in which many data elements, and on the order of thousands of instructions, map to each PE is coarse-grained.

Fig. 8.33 Classification of granularity (fine grained: one data element maps to each PE; coarse grained: many data elements map to each PE).

It is obvious that when partitioning is done there are program overheads, and overheads affect the exploitable parallelism. Overheads for parallel versions are always greater than overheads for the serial version. The overhead time is configuration dependent, i.e., it depends on the cache size and organisation and on the way resources are shared, so different shared memory multiprocessors have different task overheads associated with them. As parallelism goes on increasing, granularity becomes finer and thus the overheads become larger.

There are three types of dependencies in a parallel architecture:
- Data dependency
- Control dependency
- Resource dependency.

Data dependences are of five types, defined below:
1. Flow dependence (I1 -> I2): instruction I2 is flow dependent on I1 if an execution path exists from I1 to I2 and the output of I1 feeds an input of I2.
2. Antidependence (I1 -> I2): instruction I2 is antidependent on I1 if I2 follows I1 in program order and the output of I2 overlaps the input of I1.
3. Output dependence (I1 -> I2): I1 and I2 are output dependent if they write to the same variable.
4. I/O dependence: read and write are I/O statements; I/O dependence occurs when the same file is referenced by both I/O statements.
5. Unknown dependence: if the dependence relation cannot be determined, then it is called an unknown dependence.

Table 8.7 Implicit and explicit parallelism.
- Implicit parallelism: this approach was promoted by David Kuck. It uses a conventional language such as C, Fortran, Lisp or Pascal to write the source program; the parallelism is then detected by a parallelizing compiler, which also assigns the target machine resources. Thus in the implicit approach the burden of detecting parallelism is placed on the compiler. This approach has been applied in programming shared memory multiprocessors.
- Explicit parallelism: this approach was advocated by Charles Seitz. Rather than relying on a parallelizing compiler, parallelism is explicitly specified in the source program, so the burden on the compiler is reduced. This approach requires more effort by the programmer, who develops the source program using a parallel version of C, Fortran, Pascal, etc.

Fig. 8.35 Implicit parallelism.
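As an illustration of the explicit approach, the following loop is annotated with an OpenMP directive, one widely used way of stating parallelism explicitly in C; the loop itself is a made-up example, not taken from the text.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* Explicit parallelism: the programmer states that the iterations are
         * independent and may be divided among the available processors.    */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[10] = %.1f, up to %d threads\n", a[10], omp_get_max_threads());
        return 0;
    }

Built with an OpenMP-capable compiler (e.g. gcc -fopenmp), the directive distributes the loop across processors; in the implicit approach the same plain serial loop would instead be handed to a parallelizing compiler to discover the parallelism.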
8.7.2 Scheduling

Scheduling is defined as the method by which threads, processes or data flows are given access to system resources (e.g., processor time, communication bandwidth). This is usually done to load balance a system effectively or to achieve a target quality of service.

Need for a scheduling algorithm: there is a need for scheduling because most modern systems perform multitasking (execute more than one process at a time) and multiplexing (transmit multiple flows simultaneously).

The scheduler is concerned mainly with:
- Throughput: the total number of processes that complete their execution per time unit.
- Latency, specifically (a) turnaround time, the total time between submission of a process and its completion, and (b) response time, the amount of time from when a request was submitted until the first response is produced.
- Fairness/waiting time: equal CPU time to each process, or more generally times appropriate to each process's priority; the waiting time is the time for which the process remains in the ready queue.

Scheduling can be static (done at compile time) or dynamic (done at run time).

Fig. 8.37 Static and dynamic scheduling.

Table 8.8 Static and dynamic scheduling.

Static scheduling: the schedule is fixed at compile time. In a statically scheduled pipeline, when there is a stall (hazard) no instruction behind the stalled instruction can proceed, and the stall has to be resolved before execution continues. A static schedule is some kind of list containing the tasks and the durations in which they are scheduled. For example, a simple time-multiplexing device could have the following static schedule in pseudo-code:
    repeat forever:
        execute task 1 for 10 ms
        execute task 2 for 20 ms
        execute task 1 for 5 ms
        execute task 3 for 15 ms
Static scheduling is the simplest technique and is handled by the compiler, so the hardware is simpler (fewer transistors, a faster clock), but it may not be fault tolerant. In terms of processor design it corresponds to in-order instruction issue: if there is a dependence, the pipeline stalls.

Dynamic scheduling: instructions are scheduled by the hardware at run time. The hardware decides in what order instructions can be executed, and instructions behind a stalled instruction can pass it. When performing dynamic scheduling of tasks, whenever the scheduler decides which task to execute next (and for how long), it looks at the list of tasks requesting the processor at that point in time and then decides which to run next; an example is the "earliest-deadline-first" scheduler. Here the schedule changes if some task has nothing to do.

The advantages of dynamic scheduling are that it is
(a) better at hiding latencies, with less processor stalling, and
(b) gives higher utilization of the functional units.
These advantages are gained at the cost of a significant increase in hardware complexity. The disadvantage of run-time scheduling is run-time overhead. Run-time scheduling can be performed in a number of ways: the scheduler itself may run on a particular processor or on any processor, and scheduling may be done by one processor or by several processors; the first case is called centralized and the latter distributed scheduling. The run-time overheads include information gathering, the scheduling itself, dynamic data management and dynamic execution control.

8.7.2.4 Dynamic Scheduling Technique (MoU: Dec 2007, 2008, 2017)

The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called instruction issue. Split the ID stage of the simple 5-stage pipeline into two stages:
1. Issue: decode instructions, check for structural hazards.
2. Read operands: wait until there are no data hazards, then read the operands.
Instruction fetch proceeds ahead of the issue stage and may fetch either into a single-entry latch or into a queue; instructions are then issued from the latch or queue. The EX stage operates after the read-operands stage. As in the floating-point pipeline, execution may take multiple cycles, depending on the operation. Thus there is a need to distinguish when an instruction begins execution and when it completes; between the two times the instruction is in execution. This allows multiple instructions to be in execution at the same time.

Many techniques may be used for dynamic scheduling; which technique is used depends on the amount of available program information.
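The "earliest-deadline-first" decision mentioned above can be sketched in a few lines of C. The task list, ready flags and deadlines are hypothetical values used only to show the selection step.

    #include <stdio.h>

    struct task {
        const char *name;
        int ready;      /* is the task requesting the processor right now? */
        int deadline;   /* absolute deadline in ms (smaller = more urgent) */
    };

    /* Earliest-deadline-first: at every scheduling point, look only at the
     * tasks that are ready and pick the one whose deadline is nearest. */
    int edf_pick(const struct task *t, int n)
    {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (t[i].ready && (best < 0 || t[i].deadline < t[best].deadline))
                best = i;
        return best;    /* -1 when nothing is ready */
    }

    int main(void)
    {
        struct task tasks[] = {
            {"task1", 1, 40}, {"task2", 1, 25}, {"task3", 0, 10},
        };
        int next = edf_pick(tasks, 3);
        if (next >= 0)
            printf("run %s next\n", tasks[next].name);   /* task2 */
        return 0;
    }

Unlike the fixed cyclic schedule in the pseudo-code above, the choice here changes whenever the set of ready tasks or their deadlines change.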
Scheduling for load balancing relies on estimating the amount of computation needed within each concurrent subtask, so that the workload is distributed more evenly among the available processors.

1. Load balancing: load balancing means distributing workloads across multiple processors or, in a cluster, across network links, central processing units, disk drives or other resources. Successful load balancing optimizes resource use, maximizes throughput, minimizes response time and avoids overload. (Figure: how load balancing distributes website visitors across servers on the Internet.)

2. System load balancing: there is a difference between load balancing and system load balancing. More overheads are incurred in load balancing than in system load balancing, because system load balancing uses only the number of available concurrent processors for task scheduling; the system load will then be balanced.

Depending on the interprocess communication and the information available on the computational requirements of the processes, there are three further scheduling techniques:

3. Compiler-assisted scheduling: block-level dynamic program information is gathered at run time and used in conjunction with the interprocess communication and the information on the computational requirements of each task to determine the schedule of concurrent tasks.

4. Static scheduling: if the program is static, the exact interprocess communication and the information on the computational requirements can be determined at compile time, and an optimum schedule can be represented as a directed acyclic graph.

5. Clustering: if the interprocess communication between the processes and the information on the computational requirements are known, then some tasks may be optimally clustered together before assignment to the available processors.

8.7.3 SYNCHRONIZATION AND COHERENCY

Interprocess communication (IPC) is the mechanism by which processes communicate and synchronize their actions. In a shared memory multiprocessor system, the most common procedure is to set aside a portion of memory that is accessible by all the processors: the processors leave their messages in that area for the other processors and pick up the messages intended for them. The IPC facility provides two operations: send message (the message size may be fixed or variable) and receive message. In addition to shared memory, other resources such as magnetic disks are also shared in a multiprocessor system.

If two processes P and Q wish to communicate, they need to:
1. Establish a communication link between them.
2. Exchange messages via send/receive.
The implementation of the communication link may be physical (e.g., shared memory, a hardware bus) or logical (defined by logical properties). Communication may be direct or indirect. In direct communication, processes must name each other explicitly:
- send(P, message): send a message to process P.
- receive(Q, message): receive a message from process Q.

In indirect communication, messages are sent to and received from mailboxes (also referred to as ports). Each mailbox has a unique id and is created by the kernel on request. Processes can communicate only if they share a mailbox.
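A minimal sketch of indirect communication through a mailbox follows, assuming a fixed-capacity ring buffer shared by the communicating processes. Names such as mailbox_send are illustrative rather than an operating-system API, and a real implementation would also need the synchronization discussed next.

    #include <stdio.h>
    #include <string.h>

    #define SLOTS 4
    #define MSG_LEN 32

    /* A mailbox (port): a fixed number of message slots shared by senders
     * and receivers; processes communicate only if they share the mailbox. */
    struct mailbox {
        char msg[SLOTS][MSG_LEN];
        int head, tail, count;
    };

    int mailbox_send(struct mailbox *mb, const char *text)
    {
        if (mb->count == SLOTS)
            return -1;                           /* mailbox full     */
        strncpy(mb->msg[mb->tail], text, MSG_LEN - 1);
        mb->msg[mb->tail][MSG_LEN - 1] = '\0';
        mb->tail = (mb->tail + 1) % SLOTS;
        mb->count++;
        return 0;
    }

    int mailbox_receive(struct mailbox *mb, char *out)
    {
        if (mb->count == 0)
            return -1;                           /* nothing waiting  */
        strcpy(out, mb->msg[mb->head]);
        mb->head = (mb->head + 1) % SLOTS;
        mb->count--;
        return 0;
    }

    int main(void)
    {
        struct mailbox mb = {0};
        char buf[MSG_LEN];
        mailbox_send(&mb, "hello from P");       /* process P leaves a message */
        if (mailbox_receive(&mb, buf) == 0)      /* process Q picks it up      */
            printf("Q received: %s\n", buf);
        return 0;
    }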
When memory or another resource is shared by several processors, there must be a provision for resolving conflicts over the use of the same resource, and this is done by the operating system.

Fig. 8.38 Organisation of the operating system.

Table 8.9 Organisation of the operating system in a multiprocessor.
- Master-slave configuration: one processor is the master, which always executes the operating system functions. If a slave processor wants some service, it must request it by interrupting the master and waiting until the current program can be interrupted.
- Separate operating system: this organization is suitable for loosely coupled systems. Every processor can execute the operating system routines it needs. Communication between the processors is by means of message passing through I/O channels, since there is no shared memory.
- Distributed operating system: the operating system routines are distributed among the various processors. Each particular operating system function is assigned to only one processor in the distributed operating system.

Advantages and disadvantages of the master-slave organization:
- Its main advantage is its simplicity.
- Throughput is higher than for a single-processor system.
- There can be poor use of resources, because if a slave processor becomes idle while the master is busy, the slave must wait until the master can assign more work to it.
- The whole system fails if the master processor fails.
- There is excessive use of interrupts, because all slaves must interrupt the master every time they need an operating system service (e.g., I/O requests).

Fig. 8.40 Loosely coupled organization.
In this scheme there are no catastrophic system failures, because even when a single processor fails the others can continue, each processor working independently; however, it is difficult to detect when a processor has failed.

Mutual exclusion among the processors is enforced with a semaphore set by a test-and-set instruction. The lock signal must be active during the execution of the test-and-set instruction; it does not have to be active once the semaphore is set. Thus the lock mechanism prevents other processors from accessing memory while the semaphore is being set, and the semaphore itself, when set, prevents other processes from accessing the shared memory while one processor is executing its critical section.

Problems with test-and-set:
- When many processes are waiting to enter a critical region, starvation can occur because processes gain access in an arbitrary fashion; unless a first-come first-served policy is set up, some processes could be favoured over others.
- Waiting processes remain in unproductive, resource-consuming wait loops (busy waiting), which consumes valuable processor time.
- It relies on the competing processes to test the key.

WAIT and SIGNAL locking
This is a modification of test-and-set. Two new operations, which are mutually exclusive and become part of the process scheduler's set of operations, are added:
- WAIT
- SIGNAL
The operations WAIT and SIGNAL free processes from the busy-waiting dilemma and return control to the OS, which can then run other jobs while the waiting processes are idle.
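The busy-wait lock that WAIT and SIGNAL improve upon can be sketched in C using the GCC/Clang atomic built-ins as a stand-in for the hardware test-and-set instruction; the function names lock and unlock are illustrative.

    #include <stdio.h>

    /* The semaphore: 0 = free, nonzero = set. __atomic_test_and_set reads the
     * old value and writes the "set" value in one indivisible step, which is
     * the role the bus lock plays during the hardware instruction. */
    static char semaphore = 0;

    static void lock(void)
    {
        /* busy waiting: spin until the value read back was 0 (free) */
        while (__atomic_test_and_set(&semaphore, __ATOMIC_ACQUIRE))
            ;
    }

    static void unlock(void)
    {
        __atomic_clear(&semaphore, __ATOMIC_RELEASE);
    }

    int main(void)
    {
        lock();
        printf("inside the critical section\n"); /* shared memory accessed safely */
        unlock();
        return 0;
    }

The spin loop is exactly the busy-waiting drawback listed above; WAIT and SIGNAL avoid it by letting the scheduler suspend the waiting process instead of leaving it spinning.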
8.7.4 Effect of Partitioning and Scheduling Overheads

We know there are two types of parallelism: hardware and software. A brief comparison is presented here.

Table 8.10 Hardware and software parallelism.
- Hardware parallelism: as the name implies, it is defined by the hardware (machine architecture). 1. It is a function of cost and performance trade-offs. 2. It indicates the peak performance of the processor resources. 3. Hardware parallelism in a processor is characterized by the number of instructions issued per cycle.
- Software parallelism: this type of parallelism is defined by the program flow graph, i.e., by the control and data dependences of the program. 1. The degree of parallelism is characterized by the program flow graph. 2. It is a function of the algorithms, compiler optimization, etc. 3. Parallelism in a program varies during the execution period.

Software parallelism limits the performance of the processor; the achieved degree of parallelism reflects the extent to which the software parallelism matches the hardware parallelism.

Degree of parallelism (DOP): when a program is partitioned into tasks, the maximum number of concurrent tasks can be determined. This is simply the maximum number of tasks that can be executed at one time, and it is called the DOP. Since one processor is needed for each task, the number of processors used to execute a program in each time period is also the DOP.

Parallelism profile: when the DOP is plotted with respect to time, the plot is called the parallelism profile.

Speedup: even when there is a high degree of parallelism, a corresponding degree of speedup may not be achieved. In parallel computing, speedup refers to how much faster a parallel algorithm is than the corresponding sequential algorithm. Speedup is defined by the formula
    S_p = T_1 / T_p
where p is the number of processors, T_1 is the execution time of the sequential algorithm, and T_p is the execution time of the parallel algorithm with p processors. Linear speedup or ideal speedup is obtained when S_p = p: when running an algorithm with linear speedup, doubling the number of processors doubles the speed. As this is ideal, it is considered very good scalability.

Efficiency is a performance metric defined as
    E_p = S_p / p = T_1 / (p * T_p).
It is a value, typically between zero and one, estimating how well utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization. Algorithms with linear speedup and algorithms running on a single processor have an efficiency of 1, while many difficult-to-parallelize algorithms have an efficiency that approaches zero as the number of processors increases. In theory, E_p <= 1.

Superlinear speedup: if we are using p processors, then we expect a speedup of at most p. But sometimes a speedup of more than p when using p processors is observed in parallel computing, which is called superlinear speedup. There are certain reasons for superlinear speedup. One possible reason is the cache effect resulting from the different memory hierarchies of a modern computer: in parallel computing, not only does the number of processors change, but so does the size of the accumulated caches of the different processors. With the larger accumulated cache size, more or even all of the working set can fit into the caches and the memory access time reduces dramatically, which causes extra speedup in addition to that from the actual computation. Superlinear speedups can also occur when performing backtracking in parallel: an exception in one thread can cause several other threads to backtrack early, before they reach the exception themselves. A similar effect arises if we allow the memory system to scale, that is, to be p times larger for the parallel processor system than for the single-processor system.
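The speedup and efficiency definitions above can be checked numerically with a small C sketch; the timings are made-up figures used only for illustration.

    #include <stdio.h>

    int main(void)
    {
        double t1 = 120.0;     /* sequential execution time in seconds (hypothetical) */
        double tp = 18.0;      /* parallel execution time on p processors (hypothetical) */
        int    p  = 8;

        double speedup    = t1 / tp;       /* S_p = T_1 / T_p          */
        double efficiency = speedup / p;   /* E_p = S_p / p, at most 1 */

        printf("speedup    = %.2f\n", speedup);     /* 6.67 */
        printf("efficiency = %.2f\n", efficiency);  /* 0.83 */
        return 0;
    }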
Performance Laws

Three performance laws are defined here:
1. Amdahl's law: based on a fixed problem size (fixed workload).
2. Gustafson's law: for scaled (scalable) problems.
3. Sun and Ni's model: for scaled problems bounded by memory.

1. Amdahl's Law
Amdahl's law is used in parallel computing to predict the theoretical maximum expected improvement to an overall system when using multiple processors. The law assumes that the computational workload is fixed as part of the problem size; as the number of processors increases, the fixed-size results are obtained in a comparatively shorter time. For parallelization, Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization) and (1 - P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is
    S(N) = 1 / ((1 - P) + P/N).
In the limit, as N tends to infinity, the maximum speedup tends to 1/(1 - P). In practice, the performance gain falls rapidly as N is increased once there is even a small component of (1 - P). As an example, if P is 90%, then (1 - P) is 10%, and the problem can be sped up by a maximum factor of 10, no matter how large a value of N is used. For this reason, parallel computing is only useful for either small numbers of processors or problems with very high values of P: so-called embarrassingly parallel problems. A great part of the craft of parallel programming consists of attempting to reduce the component (1 - P) to the smallest possible value.

P can be estimated by using the measured speedup (SU) on a specific number of processors (NP):
    P_estimated = (1/SU - 1) / (1/NP - 1).
P estimated in this way can then be used in Amdahl's law to predict the speedup for a different number of processors.

Amdahl's law also applies when only a part of a sequential program is improved. Assume that a task has two independent parts, A and B, and that B takes roughly 25% of the time of the whole computation, so that the fraction of time spent in A is t_A / (t_A + t_B) = 0.75.
- If part B is made five times faster, the maximum speedup is 1 / (0.75 + 0.25/5) = 1.25.
- If part A is made to run twice as fast, the maximum speedup is 1 / (0.75/2 + 0.25) = 1.6.
Therefore, making A twice as fast is better than making B five times faster. The percentage improvement in speed follows from the speedup factor: improving part A by a factor of two increases the overall program speed by a factor of 1.6, which makes it 37.5% faster than the original computation, whereas improving part B by a factor of five, which presumably requires more effort, only achieves an overall speedup factor of 1.25, making it 20% faster.

2. Gustafson's Law
This law is for scaled problems. Gustafson's law addresses the shortcoming of Amdahl's law, which does not fully exploit the computing power that becomes available as the number of machines increases. Gustafson's law instead proposes that programmers tend to set the size of problems to use the available equipment to solve problems within a practical fixed time; therefore, if faster (more parallel) equipment is available, larger problems can be solved in the same time:
    S(P) = P - a(P - 1),
where P is the number of processors, S is the speedup, and a is the non-parallelizable fraction of any parallel process. The execution time of the program on a parallel computer is decomposed into a + b, where a is the sequential time and b is the parallel time on any of the P processors (overhead is ignored). The implication of Gustafson's law is that b, the per-process parallel time, should be held fixed as P increases; the amount of work to be done in parallel then grows with P, and the time for sequential processing of the same assignment on a single processor would be a + P*b. The scaled speedup is therefore (a + P*b) / (a + b). Thus, if a is small, the speedup is approximately P, as desired. It may even be the case that a diminishes as P (together with the problem size) increases; if that holds true, then S approaches P monotonically with the growth of P.

3. Sun and Ni Model
The Sun and Ni model is a generalization of Amdahl's law and Gustafson's law; the idea is to solve the largest possible problem limited by the memory space. This also demands a scaled workload, providing higher speedup, higher accuracy and thus better resource utilization.
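The worked example and the two laws above can be reproduced with a short C sketch, a simple calculator rather than a measurement; the Gustafson inputs (a = 0.05 and 64 processors) are example values, not figures from the text.

    #include <stdio.h>

    /* Amdahl: P = improvable (parallelizable) fraction, n = speedup of that part */
    double amdahl(double P, double n)     { return 1.0 / ((1.0 - P) + P / n); }

    /* Gustafson: alpha = non-parallelizable fraction, p = number of processors */
    double gustafson(double alpha, int p) { return p - alpha * (p - 1); }

    int main(void)
    {
        /* Part B is 25% of the runtime and is made 5 times faster ...          */
        printf("speed up B by 5: %.2f\n", amdahl(0.25, 5.0));   /* 1.25         */
        /* ... while doubling the speed of part A (75% of the runtime) wins.    */
        printf("speed up A by 2: %.2f\n", amdahl(0.75, 2.0));   /* 1.60         */

        /* Gustafson's scaled speedup for example values alpha=0.05, 64 CPUs.   */
        printf("scaled speedup : %.2f\n", gustafson(0.05, 64)); /* 60.85        */
        return 0;
    }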
8.8 CONCLUSIONS

1. Multiprogramming: the operating system can run more than one program; it divides the use of the CPU among the programs on a time-sharing basis. For example, say there are two programs waiting in the pool to be executed by the CPU. The OS picks the first program and executes it; if the program has some I/O (input/output) operations involved, the OS puts this program in the queue and picks the second program and executes it, while the first program receives its input. Since the OS and CPU work fast, it looks as if both programs are executed simultaneously.

2. Multitasking: this is considered an extension of multiprogramming; here the computer can perform more than one task. For example, while the computer is printing a document of 100 pages, we can still do other jobs like typing a new document, so more than one task is performed. One of the main differences between multiprogramming and multitasking is that in multiprogramming the user cannot interact (everything is decided by the OS, such as picking the next program and sharing on a time basis), whereas in multitasking a user can interact with the system (you can type a letter while the other task of printing is going on).

3. Multiprocessing: there is more than one processor, so a single program can be divided into pieces (like modules) and processed by the multiple processors, or sometimes each processor can handle an individual program. (If you have a multiprocessing system, it can do multiprogramming as well as multitasking as well as multithreading.) In multiprocessing, each process has its own system resources.

4. Multithreading: this type of programming helps when more than one client uses a program. For example, consider a database: while one user is entering data, someone else may be doing the same type of job; if the database does not have a multithreading option, then no more than one person would be able to do that job at a time. It looks like multiprocessing, that is, a single program processed by more than one processor, but in multithreading there is no requirement for several processors with individual resources, which would actually increase the overhead of the operating system in starting and stopping the resources every time.

5. Types of multiprocessors:
(a) Loosely coupled multicomputer systems: IPC by message passing; typically PC or workstation clusters; physically distributed components; characterized by longer message delays and limited bandwidth. (Fig. 8.41 Loosely coupled systems.)
(b) Closely coupled multicomputer systems: shared memory multiprocessors; processors connected via a common bus or fast network; characterized by short message delays and high bandwidth; IPC via shared memory. (Fig. 8.42 Closely coupled system.)

6. Issues for multiprocessor networks:
- Total bandwidth: the amount of data which can be moved from somewhere to somewhere per unit time.
- Link bandwidth: the amount of data which can be moved along one link per unit time.
- Message latency: the time from the start of sending a message until it is received.
- Bisection bandwidth: the amount of data which can move from one half of the network to the other per unit time.

7. Switching schemes:
- Store-and-forward packet switching: messages are broken into packets, and packets move from one switch to the next. This scheme has an increasing latency (delay) problem because of the store-and-forwarding in the intermediate switches.
- Circuit switching: a path is established from the source to the destination; once this path is set up, bits are pumped from source to destination without buffering in the intermediate switches.

Interconnection topology parameters (torus and higher-dimensional meshes):
(d) Torus: 2k^2 links; node degree 4; diameter k; bisection width 2k.
(e) d-dimensional mesh or torus (1D torus, 2D mesh, 3D mesh, ...): N = k_(d-1) x ... x k_1 x k_0 nodes, each described by a d-vector of coordinates (i_(d-1), ..., i_0) with 0 <= i_j < k_j for 0 <= j < d. A d-dimensional k-ary mesh has N = k^d nodes, i.e., k is the d-th root of N, each node is described by a d-vector of radix-k coordinates, and the diameter is d(k - 1).

12. Scheduling is important in shared memory multiprocessors, and it can be done either statically (at compile time) or dynamically (at run time); usually scheduling is done at both times.

13. Static scheduling information can be derived on the basis of probable critical paths, but this alone is insufficient to ensure overall speedup or even fault tolerance.
14. Dynamic scheduling: in dynamic scheduling the hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behaviour.
- It handles cases where dependences are unknown at compile time.
- It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve.
- It allows code that was compiled for one pipeline to run efficiently on a different pipeline.
- It simplifies the compiler.
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling.

15. The major run-time overheads in run-time scheduling include information gathering, scheduling, dynamic execution control and dynamic data management.

16. There are five techniques of run-time scheduling: load balancing, system load balancing, clustering, scheduling with compiler assistance, and static scheduling.

17. Parallelism is defined as computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). Two types of parallelism are defined here:
- Explicit parallelism: the programmer must explicitly state which instructions can be executed in parallel.
- Implicit parallelism: automatic detection by the compiler of instructions that can be performed in parallel.

18. Semaphores are needed while solving the cache coherency problem. A semaphore is a hardware or software flag variable whose value indicates the status of a common resource; its purpose is to lock the resource being used. A process which needs the resource checks the semaphore to determine the status of the resource, followed by the decision on whether to proceed. In multitasking operating systems, the activities are synchronized by using semaphore techniques.

19. Multiprocessors may be symmetric or asymmetric. A symmetric multiprocessor is an operating system concept in which any processor may run the OS; each processor has a copy of the OS and any processor may arrange the communication among the processors as required. In an asymmetric multiprocessor, a designated master processor controls the system and directs the tasks and communication among the remaining processors.

Q.1. What is the difference between multiprocessing and multitasking?
Q.2. Discuss the difference between tightly coupled multiprocessors and loosely coupled multiprocessors. Also state the advantages and disadvantages of one over the other.
Q.3. What do you mean by an interconnection structure? What are the types of static and dynamic connection structures?
Q.4. How many switch points are there in a crossbar switch network that connects p processors and m memory modules?
Solution: p*m switch points.
Q.5. What do you mean by a bus arbiter? Discuss the daisy-chaining mechanism.
Q.6. What are the disadvantages of using a time-shared bus organization?
Q.7. Discuss the cache coherence problem. How can it be overcome?
Q.8. The 8*8 omega network has 3 stages and each stage has 4 switches, for a total of 12 switches. How many stages and switches are needed in an n*n omega switching network? Hint: 8 = 2^3, thus 3 stages in the network.
Q.9. Construct a diagram for a 4*4 omega switching network. Show the switch setting required to connect input 3 to output 1.
Q.10. Distinguish between multiprocessors and multicomputers based on their structure, resource sharing and interprocessor communication. Also explain the difference between UMA, NUMA, CC-NUMA and COMA.
Q.11. Explain with a diagram Flynn's classification of computers.
Q.12. Distinguish between fine-grained, medium-grained and coarse-grained multicomputers.
Q.13. Distinguish between hardware and software parallelism.
Q.14. Compare linear array, ring, completely connected, star and hypercube connection schemes with a total of 12