Parallel and Distributed Systems
Topic Overview
• Motivating Parallelism
• Scope of Parallel Computing Applications
• Organization and Contents of the Course
Motivating Parallelism
The role of parallelism in accelerating computing speeds has been recognized for several decades. Its role in providing multiplicity of datapaths and increased access to storage elements has been significant in commercial applications. The scalable performance and lower cost of parallel platforms are reflected in the wide variety of applications. Developing parallel hardware and software has traditionally been time and effort intensive. If one views this in the context of rapidly improving uniprocessor speeds, one is tempted to question the need for parallel computing.
There are some unmistakable trends in hardware design, which indicate that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future.
The emergence of standardized parallel programming environments, libraries, and hardware has significantly reduced time to (parallel) solution.
...means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.''
Moore attributed this doubling rate to the exponential behavior of die sizes, finer minimum dimensions, and ``circuit and device cleverness''. In 1975, he revised this law as follows: ``There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions.'' He revised his rate of circuit complexity doubling to 18 months and projected from 1975 onwards at this reduced rate.

If one is to buy into Moore's law, the question still remains - how does one translate transistors into useful OPS (operations per second)? The logical recourse is to rely on parallelism, both implicit and explicit. Most serial (or seemingly serial) processors rely extensively on implicit parallelism. We focus in this class, for the most part, on explicit parallelism.
Principles of locality of data reference and bulk access, which guide parallel algorithm design, also apply to memory optimization. Some of the fastest growing applications of parallel computing utilize not their raw computational speed, but rather their ability to pump data to memory and disk faster.
In many other applications (typically databases and data mining) the volume of data is such that it cannot be moved. Any analyses on this data must be performed over the network using parallel techniques.
Scientific Applications
• Functional and structural characterization of genes and proteins.
• Advances in computational physics and chemistry have explored new materials, understanding of chemical pathways, and more efficient processes.
• Applications in astrophysics have explored the evolution of galaxies, thermonuclear processes, and the analysis of extremely large datasets from telescopes.
• Weather modeling, mineral prospecting, flood prediction, etc., are other important applications.
• Bioinformatics and astrophysics also present some of the most challenging problems with respect to analyzing extremely large datasets.
Commercial Applications
• Some of the largest parallel computers power Wall Street!
• Data mining and analysis for optimizing business and marketing decisions.
• Large scale servers (mail and web servers) are often implemented using parallel platforms.
• Applications such as information retrieval and search are typically powered by large clusters.
• Network intrusion detection, cryptography, and multiparty computations are some of the core users of parallel computing techniques.
• Embedded systems increasingly rely on distributed control algorithms. A modern automobile consists of tens of processors communicating to perform complex tasks for optimizing handling and performance.
• Conventional structured peer-to-peer networks impose overlay networks and utilize algorithms directly from parallel computing.
Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and the datapath. Each of these components presents significant performance bottlenecks. Parallelism addresses each of these components in significant ways. Different applications utilize different aspects of parallelism - e.g., data intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance. It is important to understand each of these performance bottlenecks.
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors. The question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
In the above example, there is some wastage of resources due to data dependencies. The example also illustrates that different instruction mixes with identical semantics can take significantly different execution time.
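As a minimal sketch of the kind of example meant here (variable and function names are illustrative, not taken from the original figure): both fragments compute the same sum of four values, but the first forms a single dependency chain, while the second exposes two independent additions that a two-issue superscalar processor could issue in the same cycle.

    /* Two semantically identical ways of adding four doubles. */
    double chain_sum(double a, double b, double c, double d) {
        double sum = a;
        sum = sum + b;      /* each add depends on the previous one */
        sum = sum + c;
        sum = sum + d;
        return sum;         /* three dependent add latencies        */
    }

    double tree_sum(double a, double b, double c, double d) {
        double t1 = a + b;  /* t1 and t2 are independent and can    */
        double t2 = c + d;  /* issue in the same cycle              */
        return t1 + t2;     /* only one further dependent add       */
    }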
Superscalar Execution
Scheduling of instructions is determined by a number of factors:
• True Data Dependency: The result of one operation is an input to the next.
• Resource Dependency: Two operations require the same resource.
• Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
• The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
• The complexity of this hardware is an important constraint on superscalar processors.
"ue to limited parallelism in t pical instruction traces$ dependencies$ or the inabilit of the scheduler to e#tract parallelism$ the performance of superscalar processors is eventuall limited. Conventional microprocessors t picall support four6!a superscalar e#ecution.
Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
• The peak processor rating is 4 GFLOPS.
• Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.
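As a worked illustration of the consequence (the dot-product workload is an assumption used only for illustration; the numbers follow from the stated 1 GHz clock and 100 ns latency): in a dot product, each multiply-add needs an operand fetched from memory, so the processor completes roughly one floating-point operation per 100 ns memory access, i.e., about 10 MFLOPS against a 4 GFLOPS peak.

    /* Back-of-the-envelope model for the example above. */
    #include <stdio.h>

    int main(void) {
        double mem_latency_ns = 100.0;  /* DRAM latency, no caches           */
        double peak_gflops    = 4.0;    /* 4 instructions/cycle, 2 MAC units */

        /* Roughly one floating-point operation completes per memory access:
           (1e9 / latency_ns) operations per second, expressed in MFLOPS.   */
        double effective_mflops = (1e9 / mem_latency_ns) / 1e6;

        printf("peak: %.1f GFLOPS, memory-bound: %.1f MFLOPS\n",
               peak_gflops, effective_mflops);
        return 0;
    }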
Impact of Caches
Repeated references to the same data item correspond to temporal locality. In our example, we had O(n^2) data accesses and O(n^3) computation. This asymptotic difference makes the above example particularly desirable for caches.
The code fragment sums columns of the matrix b into a vector column_sum.
The vector column_sum is small and easily fits into the cache. The matrix b is accessed in column order.
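The fragment itself is not reproduced above; the following sketch shows the column-order version being described (a 1000 x 1000 matrix and the names b and column_sum are assumptions). Because C stores b in row-major order, the inner loop strides through memory and touches a new cache line on nearly every access.

    #define N 1000
    double b[N][N];
    double column_sum[N];

    /* Sum the columns of b into column_sum, accessing b column by column. */
    void sum_columns_column_order(void) {
        for (int i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (int i = 0; i < N; i++)          /* for each column i ...       */
            for (int j = 0; j < N; j++)      /* ... walk down the rows      */
                column_sum[i] += b[j][i];    /* stride-N access into b      */
    }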
Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
In this case, the matrix is traversed in row order and performance can be expected to be significantly better.
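A sketch of the row-order alternative, reusing the assumed names from the previous fragment: interchanging the loops makes the inner loop walk along a row of b, so consecutive accesses fall in the same cache lines.

    /* Same result, but b is traversed row by row (unit-stride accesses). */
    void sum_columns_row_order(void) {
        for (int i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)          /* for each row j ...           */
            for (int i = 0; i < N; i++)      /* ... sweep across its columns */
                column_sum[i] += b[j][i];    /* unit-stride access into b    */
    }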
• we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
• we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to load, we could be reading others; or
• we access a whole bunch of pages in one go - amortizing the latency across various accesses.
The first approach is called prefetching, the second multithreading, and the third one corresponds to spatial locality in accessing memory words.
Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
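A runnable version of this idea using POSIX threads; create_thread, dot_product, and get_row above are pseudocode, so the structure and names below (task_t, matvec_threaded) are illustrative. Spawning one thread per row mirrors the fragment and is meant to show the concurrency, not to be an efficient implementation.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        const double *row;   /* one row of the matrix a        */
        const double *b;     /* the shared vector b            */
        double       *out;   /* where to store the dot product */
        int           n;     /* vector length                  */
    } task_t;

    static void *dot_product(void *arg) {
        task_t *t = (task_t *)arg;
        double sum = 0.0;
        for (int k = 0; k < t->n; k++)
            sum += t->row[k] * t->b[k];
        *t->out = sum;
        return NULL;
    }

    /* Compute c = a * b, where a is n x n in row-major order. */
    void matvec_threaded(const double *a, const double *b, double *c, int n) {
        pthread_t *tid  = malloc(n * sizeof *tid);
        task_t    *task = malloc(n * sizeof *task);
        for (int i = 0; i < n; i++) {
            task[i] = (task_t){ a + (size_t)i * n, b, &c[i], n };
            pthread_create(&tid[i], NULL, dot_product, &task[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        free(tid);
        free(task);
    }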
Why not advance the loads so that by the time the data is actually needed, it is already there! The only drawback is that you might need more space to store advanced loads. However, if the advanced loads are overwritten, we are no worse off than before!
SIMD Processors
Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an ``activity mask'', which determines if a processor should participate in a computation or not.
Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
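A typical conditional of this kind is if (B == 0) C = A; else C = A / B;. The sketch below emulates how an SIMD machine handles it in two steps with an activity mask (four processors and these array names are assumptions for illustration): the "then" branch runs with only the processors where B is zero active, then the mask is complemented and the "else" branch runs; idle processors simply do nothing in each step.

    #define P 4   /* number of SIMD processing elements (illustrative) */

    void simd_conditional(const double A[P], const double B[P], double C[P]) {
        int active[P];

        /* Step 1: activity mask selects processors with B == 0; they
           execute the "then" branch while the others stay idle.       */
        for (int p = 0; p < P; p++) active[p] = (B[p] == 0.0);
        for (int p = 0; p < P; p++)
            if (active[p]) C[p] = A[p];

        /* Step 2: complement the mask and execute the "else" branch.  */
        for (int p = 0; p < P; p++) active[p] = !active[p];
        for (int p = 0; p < P; p++)
            if (active[p]) C[p] = A[p] / B[p];
    }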
MIMD Processors
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison
SIM" computers re%uire less hard!are than MIM" computers 'single control unit(. 3o!ever$ since SIM" processors ae speciall designed$ the tend to be e#pensive and have long design c cles. Cot all applications are naturall suited to SIM" processors. In contrast$ platforms supporting the SPM" paradigm can be built from ine#pensive off6 the6shelf components !ith relativel little effort in a short amount of time.
Shared-Address-Space Platforms
Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space.
If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
Message-Passing Platforms
These platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives.
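A minimal sketch of the send/receive style using MPI (only the standard MPI_Send and MPI_Recv calls are used; the payload and tag values are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends one integer to rank 1, which receives and prints it. */
    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                  /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }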
What does concurrent write mean, anyway?
• Common: write only if all values are identical.
• Arbitrary: write the data from a randomly selected processor.
• Priority: follow a predetermined priority order.
• Sum: write the sum of all data items.
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Interconnection Networks
Switches map a fixed number of inputs to outputs. The total number of ports on a switch is the degree of the switch. The cost of a switch grows as the square of its degree, the peripheral hardware grows linearly with the degree, and the packaging costs grow linearly with the number of pins.
Network Topologies
A variety of network topologies have been proposed and implemented. These topologies trade off performance for cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Bus-based interconnects: (a) with no local caches; (b) with local memory/caches. Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
A complete omega network connecting eight inputs and eight outputs. An omega network has p/2 x log p switching nodes, and the cost of such a network grows as Θ(p log p).
The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover. This process is repeated for each of the log p switching stages. Note that this is not a non-blocking switch.
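A sketch of this routing rule in code for an omega network with p = 2^k inputs (the function name and the printout are illustrative; the rule itself is the bit comparison described above):

    #include <stdio.h>

    /* At stage i, compare bit (k-1-i) of source s and destination d:
       equal bits mean pass-through, different bits mean crossover.   */
    void omega_route(unsigned s, unsigned d, unsigned k) {
        for (unsigned stage = 0; stage < k; stage++) {
            unsigned bit = k - 1 - stage;   /* most significant bit first */
            unsigned sb  = (s >> bit) & 1u;
            unsigned db  = (d >> bit) & 1u;
            printf("stage %u: %s\n", stage,
                   (sb == db) ? "pass-through" : "crossover");
        }
    }

    int main(void) {
        /* One of the messages from the blocking example below: 010 to 111. */
        omega_route(2u, 7u, 3);
        return 0;
    }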
An example of blocking in an omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant called a fat tree fattens the links as we go up the tree. Trees can be laid out in 2-D with no wire crossings. This is an attractive property of trees.
Static interconnection networks compared by diameter, bisection width, and arc connectivity (the entries are the standard values for a network of p nodes):

Network                     Diameter          Bisection Width   Arc Connectivity
Completely-connected        1                 p^2/4             p - 1
Star                        2                 1                 1
Complete binary tree        2 log((p+1)/2)    1                 1
Linear array                p - 1             1                 1
2-D mesh, no wraparound     2(√p - 1)         √p                2
2-D wraparound mesh         2⌊√p/2⌋           2√p               4
Hypercube                   log p             p/2               log p
Wraparound k-ary d-cube     d⌊k/2⌋            2k^(d-1)          2d

The same three metrics (diameter, bisection width, and arc connectivity) are also used to evaluate dynamic interconnection networks.
When the value of a variable is changed, all its copies must either be invalidated or updated.
Cache coherence in multiprocessor systems: (a) invalidate protocol; (b) update protocol for shared variables.
Example of parallel program execution with the simple three-state coherence protocol.
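The protocol itself is not spelled out in the caption; the sketch below gives the usual invalidate-based three-state machine as a transition function, assuming the conventional invalid/shared/dirty states.

    /* State of one cache line under a simple invalidate-based protocol. */
    typedef enum { INVALID, SHARED, DIRTY } line_state;
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

    line_state next_state(line_state s, event_t e) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)  return SHARED;    /* fetch a clean copy      */
            if (e == LOCAL_WRITE) return DIRTY;     /* fetch and mark modified */
            return INVALID;                         /* remote traffic ignored  */
        case SHARED:
            if (e == LOCAL_WRITE)  return DIRTY;    /* invalidate other copies */
            if (e == REMOTE_WRITE) return INVALID;  /* our copy is now stale   */
            return SHARED;                          /* reads keep it shared    */
        case DIRTY:
            if (e == REMOTE_READ)  return SHARED;   /* flush, then share       */
            if (e == REMOTE_WRITE) return INVALID;  /* flush, then invalidate  */
            return DIRTY;                           /* local accesses hit      */
        }
        return INVALID;
    }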
Architecture of typical directory-based systems: (a) a centralized directory; and (b) a distributed directory.
The total time to transfer a message over a network comprises the following:
• Startup time (ts): Time spent at sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
• Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
• Per-word transfer time (tw): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.
Store-and-Forward Routing
A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. The total communication cost for a message of size m words to traverse l communication links is

    tcomm = ts + (m tw + th) l.

In most platforms, th is small and the above expression can be approximated by

    tcomm = ts + m l tw.
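A small sketch of this cost model in code (all parameter values are placeholders chosen only to show the shape of the formulas):

    #include <stdio.h>

    /* Store-and-forward: tcomm = ts + (m*tw + th)*l, approximated by
       ts + m*l*tw when the per-hop time th is small.                  */
    int main(void) {
        double ts = 100.0, th = 0.5, tw = 0.1;   /* illustrative times        */
        double m  = 1000.0;                      /* message size in words     */
        double l  = 4.0;                         /* number of links traversed */

        double exact  = ts + (m * tw + th) * l;
        double approx = ts + m * l * tw;

        printf("store-and-forward: exact = %.1f, approx = %.1f\n", exact, approx);
        return 0;
    }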
Routing Techniques
Passing a message from node P0 to node P1: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
Packet Routing
Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. The total communication time for packet routing is approximated by

    tcomm = ts + th l + tw m.
Cut-Through Routing
Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits. Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence. A tracer message first programs all intermediate routers. All flits then take the same route. Error checks are performed on the entire message, as opposed to flits. No sequence numbers are needed.
The total communication time for cut-through routing is approximated by

    tcomm = ts + l th + tw m.
The cost of communicating a message between two nodes l hops away using cut-through routing is given by

    tcomm = ts + l th + tw m.

In this expression, th is typically smaller than ts and tw. For this reason, the second term in the RHS does not show, particularly when m is large. Furthermore, it is often not possible to control routing and placement of tasks. For these reasons, we can approximate the cost of message transfer by

    tcomm = ts + tw m.
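To make the comparison concrete, the sketch below evaluates store-and-forward, cut-through, and the simplified model side by side for the same illustrative (placeholder) parameters:

    #include <stdio.h>

    int main(void) {
        double ts = 100.0, th = 0.5, tw = 0.1;   /* illustrative times    */
        double m  = 1000.0, l = 4.0;             /* message size and hops */

        double store_forward = ts + (m * tw + th) * l;
        double cut_through   = ts + l * th + tw * m;
        double simplified    = ts + tw * m;

        printf("store-and-forward: %.1f\n", store_forward);
        printf("cut-through:       %.1f\n", cut_through);
        printf("simplified model:  %.1f\n", simplified);
        return 0;
    }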
Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.
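A sketch of E-cube routing for a hypercube (the function name and printout are illustrative): the bits in which the current node and the destination differ are corrected one per hop, least significant dimension first.

    #include <stdio.h>

    void ecube_route(unsigned src, unsigned dst) {
        unsigned node = src;
        printf("route: %u", node);
        while (node != dst) {
            unsigned diff = node ^ dst;          /* dimensions still to fix   */
            unsigned bit  = diff & (~diff + 1);  /* least significant set bit */
            node ^= bit;                         /* traverse that dimension   */
            printf(" -> %u", node);
        }
        printf("\n");
    }

    int main(void) {
        ecube_route(2u /* Ps = 010 */, 7u /* Pd = 111 */);  /* prints 2 -> 3 -> 7 */
        return 0;
    }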
(a) A 4 x 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 x 4 mesh embedded into a three-dimensional hypercube. Once again, the congestion, dilation, and expansion of the mapping is 1.
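The standard construction for such a mapping uses reflected Gray codes: node (i, j) of a 2^r x 2^s mesh goes to the hypercube node whose label is G(i, r) concatenated with G(j, s), where G is the binary reflected Gray code. A minimal sketch (function names are illustrative):

    #include <stdio.h>

    /* Binary reflected Gray code of i. */
    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Map mesh node (i, j) of a 2^r x 2^s mesh to a hypercube node by
       concatenating the Gray codes of the row and column indices.      */
    unsigned mesh_to_hypercube(unsigned i, unsigned j, unsigned s) {
        return (gray(i) << s) | gray(j);
    }

    int main(void) {
        /* The 2 x 4 mesh of part (b): r = 1, s = 2, 3-D hypercube. */
        for (unsigned i = 0; i < 2; i++)
            for (unsigned j = 0; j < 4; j++)
                printf("mesh (%u,%u) -> hypercube node %u\n",
                       i, j, mesh_to_hypercube(i, j, 2));
        return 0;
    }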
(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.