Parallel and Distributed Systems
Topic Overview
• Motivating Parallelism
• Scope of Parallel Computing Applications
• Organization and Contents of the Course
Motivating Parallelism
The role of parallelism in accelerating computing speeds has been recognized for several decades. Its role in providing multiplicity of datapaths and increased access to storage elements has been significant in commercial applications. The scalable performance and lower cost of parallel platforms are reflected in the wide variety of applications. Developing parallel hardware and software has traditionally been time and effort intensive. If one views this in the context of rapidly improving uniprocessor speeds, one is tempted to question the need for parallel computing.
There are some unmistakable trends in hardware design, which indicate that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future.
The emergence of standardized parallel programming environments, libraries, and hardware has significantly reduced time to (parallel) solution.
...means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.''
Moore attributed this doubling rate to the exponential behavior of die sizes, finer minimum dimensions, and ``circuit and device cleverness''. In 1975, he revised this law as follows: ``There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions.'' He revised his rate of circuit complexity doubling to 18 months and projected from 1975 onwards at this reduced rate.

If one is to buy into Moore's law, the question still remains - how does one translate transistors into useful OPS (operations per second)? The logical recourse is to rely on parallelism, both implicit and explicit. Most serial (or seemingly serial) processors rely extensively on implicit parallelism. We focus in this class, for the most part, on explicit parallelism.
Principles of locality of data reference and bulk access, which guide parallel algorithm design, also apply to memory optimization. Some of the fastest growing applications of parallel computing utilize not their raw computational speed, but rather their ability to pump data to memory and disk faster.
In many other applications (typically databases and data mining) the volume of data is such that it cannot be moved. Any analyses on this data must be performed over the network using parallel techniques.
Scientific Applications
• Functional and structural characterization of genes and proteins.
• Advances in computational physics and chemistry have explored new materials, understanding of chemical pathways, and more efficient processes.
• Applications in astrophysics have explored the evolution of galaxies, thermonuclear processes, and the analysis of extremely large datasets from telescopes.
• Weather modeling, mineral prospecting, flood prediction, etc., are other important applications.
• Bioinformatics and astrophysics also present some of the most challenging problems with respect to analyzing extremely large datasets.
Commercial Applications
• Some of the largest parallel computers power Wall Street!
• Data mining and analysis for optimizing business and marketing decisions.
• Large scale servers (mail and web servers) are often implemented using parallel platforms.
• Applications such as information retrieval and search are typically powered by large clusters.
• Network intrusion detection, cryptography, and multiparty computations are some of the core users of parallel computing techniques.
• Embedded systems increasingly rely on distributed control algorithms. A modern automobile consists of tens of processors communicating to perform complex tasks for optimizing handling and performance.
• Conventional structured peer-to-peer networks impose overlay networks and utilize algorithms directly from parallel computing.
Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and the datapath. Each of these components presents significant performance bottlenecks. Parallelism addresses each of these components in significant ways. Different applications utilize different aspects of parallelism - e.g., data intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance. It is important to understand each of these performance bottlenecks.
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors. The question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
In the above example, there is some wastage of resources due to data dependencies. The example also illustrates that different instruction mixes with identical semantics can take significantly different execution time.
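As a minimal sketch of the kind of example meant here (variable and function names are illustrative, not taken from the original figure): both fragments compute the same sum of four values, but the first forms a single dependency chain, while the second exposes two independent additions that a two-issue superscalar processor could issue in the same cycle.

    /* Two semantically identical ways of adding four doubles. */
    double chain_sum(double a, double b, double c, double d) {
        double sum = a;
        sum = sum + b;      /* each add depends on the previous one */
        sum = sum + c;
        sum = sum + d;
        return sum;         /* three dependent add latencies        */
    }

    double tree_sum(double a, double b, double c, double d) {
        double t1 = a + b;  /* t1 and t2 are independent and can    */
        double t2 = c + d;  /* issue in the same cycle              */
        return t1 + t2;     /* only one further dependent add       */
    }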
Superscalar Execution
Scheduling of instructions is determined by a number of factors:
• True Data Dependency: The result of one operation is an input to the next.
• Resource Dependency: Two operations require the same resource.
• Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
• The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
• The complexity of this hardware is an important constraint on superscalar processors.
"ue to limited parallelism in t pical instruction traces$ dependencies$ or the inabilit of the scheduler to e#tract parallelism$ the performance of superscalar processors is eventuall limited. Conventional microprocessors t picall support four6!a superscalar e#ecution.
Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
• The peak processor rating is 4 GFLOPS.
• Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.
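As a worked illustration of the consequence (the dot-product workload is an assumption used only for illustration; the numbers follow from the stated 1 GHz clock and 100 ns latency): in a dot product, each multiply-add needs an operand fetched from memory, so the processor completes roughly one floating-point operation per 100 ns memory access, i.e., about 10 MFLOPS against a 4 GFLOPS peak.

    /* Back-of-the-envelope model for the example above. */
    #include <stdio.h>

    int main(void) {
        double mem_latency_ns = 100.0;  /* DRAM latency, no caches           */
        double peak_gflops    = 4.0;    /* 4 instructions/cycle, 2 MAC units */

        /* Roughly one floating-point operation completes per memory access:
           (1e9 / latency_ns) operations per second, expressed in MFLOPS.   */
        double effective_mflops = (1e9 / mem_latency_ns) / 1e6;

        printf("peak: %.1f GFLOPS, memory-bound: %.1f MFLOPS\n",
               peak_gflops, effective_mflops);
        return 0;
    }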
Impact of Caches
Repeated references to the same data item correspond to temporal locality. In our example, we had O(n^2) data accesses and O(n^3) computation. This asymptotic difference makes the above example particularly desirable for caches.
The code fragment sums columns of the matrix b into a vector column_sum.
The vector column_sum is small and easily fits into the cache. The matrix b is accessed in column order.
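The fragment itself is not reproduced above; the following sketch shows the column-order version being described (a 1000 x 1000 matrix and the names b and column_sum are assumptions). Because C stores b in row-major order, the inner loop strides through memory and touches a new cache line on nearly every access.

    #define N 1000
    double b[N][N];
    double column_sum[N];

    /* Sum the columns of b into column_sum, accessing b column by column. */
    void sum_columns_column_order(void) {
        for (int i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (int i = 0; i < N; i++)          /* for each column i ...       */
            for (int j = 0; j < N; j++)      /* ... walk down the rows      */
                column_sum[i] += b[j][i];    /* stride-N access into b      */
    }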
Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
In this case, the matrix is traversed in row order and performance can be expected to be significantly better.
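A sketch of the row-order alternative, reusing the assumed names from the previous fragment: interchanging the loops makes the inner loop walk along a row of b, so consecutive accesses fall in the same cache lines.

    /* Same result, but b is traversed row by row (unit-stride accesses). */
    void sum_columns_row_order(void) {
        for (int i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)          /* for each row j ...           */
            for (int i = 0; i < N; i++)      /* ... sweep across its columns */
                column_sum[i] += b[j][i];    /* unit-stride access into b    */
    }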
• we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
• we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to load, we could be reading others; or
• we access a whole bunch of pages in one go - amortizing the latency across various accesses.
The first approach is called prefetching, the second multithreading, and the third one corresponds to spatial locality in accessing memory words.
Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
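A runnable version of this idea using POSIX threads; create_thread, dot_product, and get_row above are pseudocode, so the structure and names below (task_t, matvec_threaded) are illustrative. Spawning one thread per row mirrors the fragment and is meant to show the concurrency, not to be an efficient implementation.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        const double *row;   /* one row of the matrix a        */
        const double *b;     /* the shared vector b            */
        double       *out;   /* where to store the dot product */
        int           n;     /* vector length                  */
    } task_t;

    static void *dot_product(void *arg) {
        task_t *t = (task_t *)arg;
        double sum = 0.0;
        for (int k = 0; k < t->n; k++)
            sum += t->row[k] * t->b[k];
        *t->out = sum;
        return NULL;
    }

    /* Compute c = a * b, where a is n x n in row-major order. */
    void matvec_threaded(const double *a, const double *b, double *c, int n) {
        pthread_t *tid  = malloc(n * sizeof *tid);
        task_t    *task = malloc(n * sizeof *task);
        for (int i = 0; i < n; i++) {
            task[i] = (task_t){ a + (size_t)i * n, b, &c[i], n };
            pthread_create(&tid[i], NULL, dot_product, &task[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        free(tid);
        free(task);
    }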
Why not advance the loads so that by the time the data is actually needed, it is already there! The only drawback is that you might need more space to store advanced loads. However, if the advanced loads are overwritten, we are no worse off than before!
SIMD Processors
Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an ``activity mask'', which determines if a processor should participate in a computation or not.
Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
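A typical conditional of this kind is if (B == 0) C = A; else C = A / B;. The sketch below emulates how an SIMD machine handles it in two steps with an activity mask (four processors and these array names are assumptions for illustration): the "then" branch runs with only the processors where B is zero active, then the mask is complemented and the "else" branch runs; idle processors simply do nothing in each step.

    #define P 4   /* number of SIMD processing elements (illustrative) */

    void simd_conditional(const double A[P], const double B[P], double C[P]) {
        int active[P];

        /* Step 1: activity mask selects processors with B == 0; they
           execute the "then" branch while the others stay idle.       */
        for (int p = 0; p < P; p++) active[p] = (B[p] == 0.0);
        for (int p = 0; p < P; p++)
            if (active[p]) C[p] = A[p];

        /* Step 2: complement the mask and execute the "else" branch.  */
        for (int p = 0; p < P; p++) active[p] = !active[p];
        for (int p = 0; p < P; p++)
            if (active[p]) C[p] = A[p] / B[p];
    }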
MIMD Processors
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison
SIM" computers re%uire less hard!are than MIM" computers 'single control unit(. 3o!ever$ since SIM" processors ae speciall designed$ the tend to be e#pensive and have long design c cles. Cot all applications are naturall suited to SIM" processors. In contrast$ platforms supporting the SPM" paradigm can be built from ine#pensive off6 the6shelf components !ith relativel little effort in a short amount of time.
Shared-Address-Space Platforms
Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space.
If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
Message-Passing Platforms
These platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives.
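A minimal sketch of the send/receive style using MPI (only the standard MPI_Send and MPI_Recv calls are used; the payload and tag values are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends one integer to rank 1, which receives and prints it. */
    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                  /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }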
What does concurrent write mean, anyway?
• Common: write only if all values are identical.
• Arbitrary: write the data from a randomly selected processor.
• Priority: follow a predetermined priority order.
• Sum: write the sum of all data items.
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Interconnection Networks
Switches map a fixed number of inputs to outputs. The total number of ports on a switch is the degree of the switch. The cost of a switch grows as the square of its degree, the peripheral hardware grows linearly with the degree, and the packaging costs grow linearly with the number of pins.
Network Topologies
A variety of network topologies have been proposed and implemented. These topologies trade off performance for cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Bus-based interconnects: (a) with no local caches; (b) with local memory/caches. Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
A complete omega network connecting eight inputs and eight outputs. An omega network has p/2 x log p switching nodes, and the cost of such a network grows as Θ(p log p).
The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover. This process is repeated for each of the log p switching stages. Note that this is not a non-blocking switch.
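A sketch of this routing rule in code for an omega network with p = 2^k inputs (the function name and the printout are illustrative; the rule itself is the bit comparison described above):

    #include <stdio.h>

    /* At stage i, compare bit (k-1-i) of source s and destination d:
       equal bits mean pass-through, different bits mean crossover.   */
    void omega_route(unsigned s, unsigned d, unsigned k) {
        for (unsigned stage = 0; stage < k; stage++) {
            unsigned bit = k - 1 - stage;   /* most significant bit first */
            unsigned sb  = (s >> bit) & 1u;
            unsigned db  = (d >> bit) & 1u;
            printf("stage %u: %s\n", stage,
                   (sb == db) ? "pass-through" : "crossover");
        }
    }

    int main(void) {
        /* One of the messages from the blocking example below: 010 to 111. */
        omega_route(2u, 7u, 3);
        return 0;
    }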
An example of blocking in an omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant called a fat tree fattens the links as we go up the tree. Trees can be laid out in 2-D with no wire crossings. This is an attractive property of trees.
Static interconnection networks compared by diameter, bisection width, and arc connectivity (the entries are the standard values for a network of p nodes):

Network                     Diameter          Bisection Width   Arc Connectivity
Completely-connected        1                 p^2/4             p - 1
Star                        2                 1                 1
Complete binary tree        2 log((p+1)/2)    1                 1
Linear array                p - 1             1                 1
2-D mesh, no wraparound     2(√p - 1)         √p                2
2-D wraparound mesh         2⌊√p/2⌋           2√p               4
Hypercube                   log p             p/2               log p
Wraparound k-ary d-cube     d⌊k/2⌋            2k^(d-1)          2d

The same three metrics (diameter, bisection width, and arc connectivity) are also used to evaluate dynamic interconnection networks.
When the value of a variable is changed, all its copies must either be invalidated or updated.
Cache coherence in multiprocessor systems: (a) invalidate protocol; (b) update protocol for shared variables.
Example of parallel program execution with the simple three-state coherence protocol.
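The protocol itself is not spelled out in the caption; the sketch below gives the usual invalidate-based three-state machine as a transition function, assuming the conventional invalid/shared/dirty states.

    /* State of one cache line under a simple invalidate-based protocol. */
    typedef enum { INVALID, SHARED, DIRTY } line_state;
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

    line_state next_state(line_state s, event_t e) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)  return SHARED;    /* fetch a clean copy      */
            if (e == LOCAL_WRITE) return DIRTY;     /* fetch and mark modified */
            return INVALID;                         /* remote traffic ignored  */
        case SHARED:
            if (e == LOCAL_WRITE)  return DIRTY;    /* invalidate other copies */
            if (e == REMOTE_WRITE) return INVALID;  /* our copy is now stale   */
            return SHARED;                          /* reads keep it shared    */
        case DIRTY:
            if (e == REMOTE_READ)  return SHARED;   /* flush, then share       */
            if (e == REMOTE_WRITE) return INVALID;  /* flush, then invalidate  */
            return DIRTY;                           /* local accesses hit      */
        }
        return INVALID;
    }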
Architecture of typical directory-based systems: (a) a centralized directory; and (b) a distributed directory.
The total time to transfer a message over a network comprises the following:
• Startup time (ts): Time spent at sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
• Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
• Per-word transfer time (tw): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.
Store-and-Forward Routing
A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. The total communication cost for a message of size m words to traverse l communication links is

    tcomm = ts + (m tw + th) l.

In most platforms, th is small and the above expression can be approximated by

    tcomm = ts + m l tw.
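A small sketch of this cost model in code (all parameter values are placeholders chosen only to show the shape of the formulas):

    #include <stdio.h>

    /* Store-and-forward: tcomm = ts + (m*tw + th)*l, approximated by
       ts + m*l*tw when the per-hop time th is small.                  */
    int main(void) {
        double ts = 100.0, th = 0.5, tw = 0.1;   /* illustrative times        */
        double m  = 1000.0;                      /* message size in words     */
        double l  = 4.0;                         /* number of links traversed */

        double exact  = ts + (m * tw + th) * l;
        double approx = ts + m * l * tw;

        printf("store-and-forward: exact = %.1f, approx = %.1f\n", exact, approx);
        return 0;
    }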
Routing Techniques
Passing a message from node P0 to node P1: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
Packet Routing
Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. The total communication time for packet routing is approximated by

    tcomm = ts + th l + tw m.
Cut-Through Routing
Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits. Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence. A tracer message first programs all intermediate routers. All flits then take the same route. Error checks are performed on the entire message, as opposed to flits. No sequence numbers are needed.
The total communication time for cut-through routing is approximated by

    tcomm = ts + l th + tw m.
The cost of communicating a message between two nodes l hops away using cut-through routing is given by

    tcomm = ts + l th + tw m.

In this expression, th is typically smaller than ts and tw. For this reason, the second term in the RHS does not show, particularly when m is large. Furthermore, it is often not possible to control routing and placement of tasks. For these reasons, we can approximate the cost of message transfer by

    tcomm = ts + tw m.
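To make the comparison concrete, the sketch below evaluates store-and-forward, cut-through, and the simplified model side by side for the same illustrative (placeholder) parameters:

    #include <stdio.h>

    int main(void) {
        double ts = 100.0, th = 0.5, tw = 0.1;   /* illustrative times    */
        double m  = 1000.0, l = 4.0;             /* message size and hops */

        double store_forward = ts + (m * tw + th) * l;
        double cut_through   = ts + l * th + tw * m;
        double simplified    = ts + tw * m;

        printf("store-and-forward: %.1f\n", store_forward);
        printf("cut-through:       %.1f\n", cut_through);
        printf("simplified model:  %.1f\n", simplified);
        return 0;
    }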
Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.
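A sketch of E-cube routing for a hypercube (the function name and printout are illustrative): the bits in which the current node and the destination differ are corrected one per hop, least significant dimension first.

    #include <stdio.h>

    void ecube_route(unsigned src, unsigned dst) {
        unsigned node = src;
        printf("route: %u", node);
        while (node != dst) {
            unsigned diff = node ^ dst;          /* dimensions still to fix   */
            unsigned bit  = diff & (~diff + 1);  /* least significant set bit */
            node ^= bit;                         /* traverse that dimension   */
            printf(" -> %u", node);
        }
        printf("\n");
    }

    int main(void) {
        ecube_route(2u /* Ps = 010 */, 7u /* Pd = 111 */);  /* prints 2 -> 3 -> 7 */
        return 0;
    }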
(a) A 4 x 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 x 4 mesh embedded into a three-dimensional hypercube. Once again, the congestion, dilation, and expansion of the mapping is 1.
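The standard construction for such a mapping uses reflected Gray codes: node (i, j) of a 2^r x 2^s mesh goes to the hypercube node whose label is G(i, r) concatenated with G(j, s), where G is the binary reflected Gray code. A minimal sketch (function names are illustrative):

    #include <stdio.h>

    /* Binary reflected Gray code of i. */
    unsigned gray(unsigned i) { return i ^ (i >> 1); }

    /* Map mesh node (i, j) of a 2^r x 2^s mesh to a hypercube node by
       concatenating the Gray codes of the row and column indices.      */
    unsigned mesh_to_hypercube(unsigned i, unsigned j, unsigned s) {
        return (gray(i) << s) | gray(j);
    }

    int main(void) {
        /* The 2 x 4 mesh of part (b): r = 1, s = 2, 3-D hypercube. */
        for (unsigned i = 0; i < 2; i++)
            for (unsigned j = 0; j < 4; j++)
                printf("mesh (%u,%u) -> hypercube node %u\n",
                       i, j, mesh_to_hypercube(i, j, 2));
        return 0;
    }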
(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.