
On the Doubly-Linked List Protocol for

Distributed Shared Memory Multiprocessor Systems

Albert C.K. Lau, Kelvin H.W. Leung, Nelson H.C. Yung and Y.S. Cheung

Department of Electrical and Electronic Engineering


University of Hong Kong
Haking Wong Building, Pokfulam Road, Hong Kong
email: [email protected], [email protected]

Abstract

This paper introduces the Doubly-Linked List (DLL) Protocol for Distributed Shared Memory (DSM) Multiprocessor Systems. The protocol makes use of two linked lists to keep track of valid copies of pages in the system, thus eliminating the use of copy-sets. Simulation studies show that the DLL protocol achieves considerable speed-up for common mathematical problems, including a linear equations solver and a matrix multiplier. A performance improvement of up to 51.9% over the Dynamic Distributed Manager algorithm is obtained. Further improvements and possible modifications of the protocol are also discussed.
1. Introduction

Distributed Shared-Memory (DSM) [1] is becoming an important aspect of Massively Parallel Processing (MPP) because it allows programmers to use the shared-memory programming model, which is much more manageable than the message-passing model used by traditional Massively Parallel Processors. The Doubly-Linked List (DLL) protocol discussed in this paper is a software DSM algorithm that is suitable for implementation in the distributed operating systems of modern MPPs. A typical MPP configuration consists of a large number of processing nodes connected together by an interconnection network. The protocol presented in this paper, however, works for a more generalized hierarchical cluster model [2], which consists of multiple clusters connected by an interconnection network (Figure 1a). Each of these clusters may then have a small number of Processing Elements (PEs), its own local memory, and perhaps a dedicated Communication Processor (CP) (Figure 1b). The typical MPP configuration may be considered a special case of this model in which the number of PEs in a cluster equals 1. DSM is particularly important in this kind of architecture because it is difficult for programmers to handle both the shared-memory and the distributed-memory models at the same time. It is very desirable to hide this complication from the programmers so that they only see a uniform shared-memory model. On the other hand, since memory within a cluster is physically shared by the PEs, the overhead of shared-memory accesses within a cluster is minimal. In order to create a shared-memory environment out of the physically distributed memory, a protocol is needed to handle remote memory accesses as well as to maintain memory coherence.

Figure 1a: The hierarchical cluster model
Figure 1b: A cluster

Ivy [3], one of the first transparent DSM systems, implemented DSM as virtual memory. In this system, when a page fault occurs in a cluster's local memory, instead of loading from disk, the faulting page is fetched from a remote cluster that has a valid copy of the page. It experimented with various DSM algorithms and concluded that the Dynamic Distributed Manager (DDM) algorithm generally had

0-7803-2018-2/95/$4.00 © 1995 IEEE
the best performance. The Fixed Distributed Manager algorithm, also proposed in the same paper, was later used in the Intel iPSC/2 hypercube multicomputers [4]. In the DDM algorithm, pages are migrated freely throughout the system and replicated as needed for shared read accesses by different clusters. Page management is performed by the individual owner cluster of a page, which keeps the copy-set (the set of clusters that have a valid copy of the page). Whenever there is a write access to a page, the owner of the page invalidates all other copies of the page in the system by making use of the copy-set, then transfers the ownership to the cluster that writes to the page.

The DDM algorithm has certain advantages. First, it is simple and easy to implement, and thus can be added to existing systems with minimum effort. Second, since it is an extension of the basic virtual memory system, which is a standard feature supported by virtually all contemporary microprocessors, the overhead caused by the algorithm is small. Third, it is fully transparent to the programmer, so systems with different underlying architectures can use the same simple programming model.

However, there are areas in the DDM algorithm that can be improved so that it fits better in modern MPPs. In fact, the original DDM algorithm was implemented on a network of Apollo workstations, so the idea of storing the copy-set in the owner and having it invalidate other copies in the system worked satisfactorily. However, given the high-speed interconnection networks used by most modern MPPs today, the burst of invalidation messages generated by the owner can cause network congestion around the cluster and degrade network performance significantly. Also, the size of the copy-set varies and can be as large as the number of nodes in the system in the worst case. This greatly limits the scalability of the algorithm because, as the system grows larger, the amount of memory allocated for the copy-sets becomes impractically large. In Li's paper [3], a method to partially distribute the copy-set using trees of clusters was proposed. In this paper, the idea is further developed into the Doubly-Linked List (DLL) Protocol.

The concept of the DLL protocol is built on the existence of two types of links for each page, namely the N-links and the P-links. The N-links are used to locate the current owner of a page and are similar to the probable owner field in the DDM algorithm [3]. The P-links are used to maintain a linked list of clusters that contain valid copies of the page. In other words, following the P-links, we can locate all valid copies of the page in the system. The purpose of this approach is to fully distribute the copy-set using the P-links.

There are several advantages to using a linked list to maintain the copy-set. First, it is simple and it reduces the number of messages needed for invalidation by nearly half. Second, only a small constant storage space is needed to store a link, as compared to the varying space needed to store the copy-set or the tree nodes in [3]. Third, since the copy-set is completely distributed, invalidation is not likely to cause congestion in the interconnection network.

The DLL protocol is explained in detail in the next section. In addition, a feature will be proposed to further enhance the performance of the basic DLL protocol. Furthermore, the performance of the DLL protocol will be compared to various other algorithms by extensive simulation studies. Finally, possible further research on the DLL protocol will be described.

2. The Basic Doubly-Linked List Protocol

In the DLL protocol, each cluster has its own page table, which contains information about all memory pages in the system. Each memory page in the page table can have one of three states: E (exclusive), S (shared) or I (invalid). E state means the cluster has the only copy of the page in the whole system. S state means more than one cluster in the system has copies of the page. I state means the cluster does not have a valid copy of the page.

Every page has an owner, although page ownership is frequently transferred between clusters. The owner of a page is the cluster that most recently acquired the page. It is the responsibility of the owner to supply the page to requesting clusters.

Also contained in the page table are two links for each page: the P-link and the N-link. The P-link points to the cluster that is the previous owner of the page, while the N-link points to the cluster to which the page ownership is given, i.e., the new owner of the page. A null N-link means the cluster is the owner of the page.

When the system is initialized, memory pages are distributed to each cluster's local memory in an interleaved fashion, i.e., pages p0, p4, p8, ... go to cluster c0, pages p1, p5, p9, ... go to cluster c1 and so on. The cluster that contains a page in its local memory at system startup is the initial owner of the page. The page table is initialized as follows: the state of a page is E in its owner's page table, and I in the other clusters' page tables. The P-links of all pages in every

cluster are set to null. The N-link of a page is null in its owner's page table, and points to the owner in the other clusters' page tables. For example, the initial state of part of c0's page table is as shown in Table 1.

        State    P-link    N-link
  p0      E       null      null
  p1      I       null       c1
  p2      I       null       c2
  p3      I       null       c3

Table 1: Initial state of c0's page table

Figure 2: Read request by c1 (solid arrows: messages; dotted arrows: P-links)
Read and write accesses performed to pages with different states will initiate different courses of events. They are explained below.
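As a purely illustrative sketch, the per-page entries and the interleaved startup distribution described above might be modelled as follows; the Python names here are our own, not part of the protocol specification:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageEntry:
    state: str              # 'E' (exclusive), 'S' (shared) or 'I' (invalid)
    p_link: Optional[int]   # previous owner of the page; None models a null link
    n_link: Optional[int]   # probable new owner; None means this cluster is the owner

def init_page_tables(n_clusters, n_pages):
    """Interleaved startup distribution: page p initially lives in cluster p % n_clusters."""
    tables = []
    for c in range(n_clusters):
        table = {}
        for p in range(n_pages):
            owner = p % n_clusters
            if owner == c:
                table[p] = PageEntry('E', None, None)   # initial owner holds the only copy
            else:
                table[p] = PageEntry('I', None, owner)  # N-link points at the initial owner
        tables.append(table)
    return tables
```

With 4 clusters, c0's table then matches Table 1: p0 is E with both links null, while p1-p3 are I with N-links pointing at c1-c3.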

2.1. Read accesses to E pages

If a page is in E state, the cluster has a valid copy of the page and read accesses to the page can be handled locally. No messages will be sent to other clusters and the page table will not be changed.

2.2. Read accesses to S pages

Read accesses to S pages are handled exactly the same way as read accesses to E pages.

Figure 3: Read request by c2

2.3. Read accesses to I pages
In this case, the local memory of the cluster does not have a valid copy of the page being accessed, so a copy of the page must be obtained from the owner. A read-request (RR) message will be sent to the cluster pointed to by the N-link in the page table, requesting the missing page. If the cluster's local memory has no room for the new page, the replacement algorithm, which will be described later in this paper, will be used to make room for the new page.

If the cluster receiving the request is not the owner of the page, it will forward the RR message to the cluster pointed to by the N-link of the page in its own page table. This process repeats until the owner of the page is reached.

On receiving the RR message, the owner will send a read-data (RD) message containing the requested page back to the requesting cluster. Then, it will set the N-link of the page in its own page table to point to the requesting cluster, thus transferring the ownership of the page to the requesting cluster. Finally, it sets the state of the page to S.

The requesting cluster, on receiving the RD message, copies the page to its local memory. Then, it sets the P and N-links of the page to point to the replying cluster (i.e., the previous owner of the page) and to null, respectively. It becomes the new owner of the page and changes the page state from I to S.

The following are examples of read requests. In the examples, we shall assume a small system with 4 clusters, c0-c3. Initially, c0 is the owner of memory page p0. Both the P and N-links of p0 in c0 are set to null and its state is set to E. In all other clusters, the P-link of p0 is null, the N-link is c0 and the state is I.

Now, assume c1 tries to perform a read access to p0. An RR message is sent from c1 to c0. On receiving the message, c0 sends an RD message, containing p0, back to c1, then sets its own N-link to c1, and changes the page state to S. When c1 receives the RD message, it copies the page to its local memory, sets its P-link to c0 and N-link to null, and changes the page state to S. The process is depicted in Figure 2. In all the figures, the solid arrows represent the messages passed between clusters, while the dotted and dashed arrows show the state of the P and N-links after the whole process has been completed.
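The read-fault path just described (forward the RR along the N-links, then hand ownership to the requester) can be sketched as follows. This is our own simplified model, with message passing collapsed into direct table updates:

```python
# Each cluster's table maps page -> {'state', 'p', 'n'}; None models a null link.
def make_tables(n_clusters):
    return [{0: {'state': 'E' if c == 0 else 'I',
                 'p': None,
                 'n': None if c == 0 else 0}}
            for c in range(n_clusters)]

def read_fault(tables, requester, page=0):
    """Read access to an I page: forward the RR along N-links to the owner,
    then transfer ownership to the requester; both copies end up in state S."""
    cur = tables[requester][page]['n']            # first hop: our own N-link
    while tables[cur][page]['n'] is not None:     # not the owner yet: forward the RR
        cur = tables[cur][page]['n']
    tables[cur][page].update(n=requester, state='S')           # owner replies with RD
    tables[requester][page].update(p=cur, n=None, state='S')   # requester is new owner

tables = make_tables(4)
read_fault(tables, 1)   # c1 reads p0 (as in Figure 2)
read_fault(tables, 2)   # c2 reads p0; the RR is forwarded c0 -> c1 (as in Figure 3)
```

After the two reads, the p0 entries match Table 2: c0 is S/null/c1, c1 is S/c0/c2, and the new owner c2 is S/c1/null.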

In the event of another cluster c2 performing a read access to p0, as the state of p0 in c2 is I, c2 sends an RR message to the cluster pointed to by its N-link, i.e., cluster c0. When c0 receives the RR message, since it is no longer the owner of p0 (N-link not null), it forwards the message to the cluster pointed to by its own N-link, i.e., cluster c1. When c1, which is the current owner of p0, receives the RR message, it sends an RD message back to the requesting cluster c2, and then sets its N-link to c2, thus transferring the ownership of p0 to c2. Cluster c2, on receiving the RD message from c1, copies p0 into its local memory, changes the state of p0 to S, and sets the P and N-links to c1 and null, respectively. Cluster c2 has become the new owner of p0. The process is depicted in Figure 3.

Figure 4: Write request by c0

The current states of the p0 entries of the page tables in clusters c0, c1 and c2 are summarized in Table 2. At this point, c0, c1 and c2 each has a copy of p0 in state S. Therefore, read accesses performed by these clusters can be handled locally.
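Given the entries summarized in Table 2, the distributed copy-set can be recovered by walking the P-links from the owner; a small illustrative snippet (ours, not the paper's):

```python
# p0's entries after the two reads: cluster -> (state, P-link, N-link); None is null.
table = {
    0: ('S', None, 1),   # c0: tail of the list (null P-link)
    1: ('S', 0, 2),      # c1: middle of the list
    2: ('S', 1, None),   # c2: current owner (null N-link)
}

def copy_holders(table, owner):
    """Follow P-links from the owner; this visits every cluster holding a valid copy."""
    holders, cur = [], owner
    while cur is not None:
        holders.append(cur)
        cur = table[cur][1]      # hop to the previous owner
    return holders

print(copy_holders(table, 2))    # -> [2, 1, 0]
```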
        State    P-link    N-link
  c0      S       null       c1
  c1      S        c0        c2
  c2      S        c1       null

Table 2: State of p0 in c0, c1 & c2

2.4. Write accesses to E pages

Since the cluster contains the only copy of the page in the whole system, it can write to the page without generating any messages or changes to the page table.

2.5. Write accesses to S pages

When more than one cluster contains copies of the page, a write access to one copy causes the other copies to become obsolete. The DLL protocol uses a write-invalidate algorithm to solve this problem because it generates less traffic [1, 5]. The write-update algorithm used by some other systems [6] is not suitable for this implementation of the DLL protocol, which uses the sequential consistency model [2], owing to the high cost of the write-update messages. The possibility of using write-update in future implementations of the DLL protocol with relaxed consistency models will be discussed in Section 5, Further Research on the DLL Protocol.

When a cluster performs a write access to an S page, a write-invalidate (WI) message will be sent to the cluster pointed to by the N-link. Following the series of N-links, the message will eventually reach the owner of the page. The owner, on receiving the WI message, will send a write-invalidate-forward (WIF) message to the cluster pointed to by its P-link, change the page to I state, and reset its P and N-links to null and to the requesting cluster, respectively. All clusters receiving the WIF message will forward the message to the cluster pointed to by their own P-link, change the page to I state, and reset their P-link to null and N-link to the requesting cluster. The requesting cluster will also receive the WIF message. It will simply ignore the message and forward it to the cluster pointed to by its P-link. Following the P-links, all copies of the page in the system, except the one in the requesting cluster, will be invalidated. When the WIF message reaches the cluster whose P-link is null, that cluster will send a write-invalidate-performed (WIP) message to the requesting cluster.

The requesting cluster, upon receiving the WIP message, will set its P and N-links of the page to null and change the state of the page to E. It becomes the new exclusive owner of the page. At this point, the write access can be performed.

An example of a write request to an S page is as follows. Assume the state of p0 in each cluster is as shown in Table 2 and now c0 performs a write access to p0. Since p0 is in state S in c0, other clusters that have copies of p0 must have their copies invalidated. Therefore, a WI message is sent to the cluster pointed to by the N-link, i.e., cluster c1. Following the N-links, c1 forwards the WI message to c2, which is the current owner of p0. Cluster c2 then sends a WIF message to the cluster pointed to by its

P-link, i.e., cluster c1, changes the state of its p0 to I, and resets its P-link to null and N-link to c0. Cluster c1, upon receiving the WIF message, changes the state of p0 to I, forwards the message to the cluster pointed to by its own P-link, i.e., cluster c0, and resets its P and N-links to null and c0, respectively. When c0 receives the WIF message, as it is the requesting cluster, it ignores the message. Since c0's P-link is null, all copies of p0 in the system, except the one in c0, are invalidated. At this point, c0 should send a WIP message back to the requesting cluster. In this case, however, the requesting cluster is c0 itself, so this message is skipped. Finally, c0 changes the state of p0 to E, sets both its P and N-links to null, and completes the write access. Cluster c0 becomes the new exclusive owner of p0. The process is depicted in Figure 4.
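The WI/WIF/WIP sequence just described (walk the N-links up to the owner, then sweep the P-links to invalidate) can be condensed into a sketch like the one below; again this is our own illustration, with messages modelled as in-place table updates:

```python
# p0's entries from Table 2: cluster -> [state, P-link, N-link]; None is null.
t = {0: ['S', None, 1], 1: ['S', 0, 2], 2: ['S', 1, None]}

def write_s_page(t, req):
    """Write access to an S page by cluster `req` (a sketch)."""
    cur = t[req][2]                       # the WI goes to the cluster on our N-link
    while t[cur][2] is not None:          # forward the WI along N-links to the owner
        cur = t[cur][2]
    while cur is not None:                # the WIF then sweeps back along the P-links
        prev = t[cur][1]
        if cur != req:                    # the requester ignores its own WIF
            t[cur][0] = 'I'
        t[cur][1], t[cur][2] = None, req  # reset links toward the requester
        cur = prev
    t[req] = ['E', None, None]            # WIP received: requester is exclusive owner

write_s_page(t, 0)                        # c0 writes p0, as in Figure 4
```

Afterwards c0 is E with both links null, while c1 and c2 are I with N-links pointing at c0, matching the outcome depicted in Figure 4.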
2.6. Write accesses to I pages

A write access to an I page is handled in a way similar to a write access to an S page, except that in this case the requesting cluster does not have a valid copy of the page. Therefore, the page must be copied from its current owner and, if the requesting cluster has no space for the page, the replacement algorithm must be used to make room for it.

When a write access to an I page occurs, the cluster sends a write-request (WR) message to the cluster pointed to by its N-link. Following the N-links, the message will eventually reach the owner of the page. The owner, on receiving the WR message, will perform three actions. First, a write-data (WD) message, containing a copy of the page, will be sent to the requesting cluster. Second, a write-invalidate-forward (WIF) message will be sent to the cluster pointed to by its P-link. Third, it invalidates its own copy of the page and resets its P and N-links to null and to the requesting cluster, respectively.

Following the P-links, the WIF message will go through every cluster that has a copy of the page, each of which will also invalidate its own copy of the page and reset its P-link to null and N-link to the requesting cluster. Finally, when the WIF message reaches the cluster with a null P-link, that cluster will send a write-invalidate-performed (WIP) message back to the requesting cluster.

When the requesting cluster receives both the WD and the WIP messages, it sets its P and N-links to null and changes the state of the page to E. It becomes the new exclusive owner of the page and the write access can be performed.

Figure 5: Write request by c3

Let us consider an example of a write request to an I page. Again assume the state of the system is as shown in Table 2. Now, another cluster, c3, tries to perform a write access to p0. Since the state of p0 in c3 is I, c3 sends a WR message via the N-links to the owner of p0, i.e., c2 (Figure 5).

When c2 receives the WR message, it first sends a WD message, which contains a copy of p0, to c3. Second, it sends a WIF message to the cluster pointed to by its P-link and changes the state of its own p0 to I. Third, it resets its P and N-links to null and to c3, respectively.

The WIF message, following the P-links, goes through every cluster that contains a copy of p0, i.e., c1 and c0, which also change the state of p0 to I and reset the P and N-links to null and to c3, respectively. When the WIF message reaches c0, whose P-link was originally null, c0 sends a WIP message to c3.

When c3 receives both the WD and the WIP messages, it copies p0 into its local memory and sets both the P and N-links to null. Cluster c3 becomes the new exclusive owner of p0. The process is depicted in Figure 5.

2.7. Replacement Algorithm

When there is no room in a cluster's local memory for a newly requested page, one of the pages currently in the local memory must be replaced and swapped out. To replace a page, two problems must be

considered [1]: Which page should be replaced? Where should the replaced page go?

The first problem is similar to the replacement problem in multiprocessor caches. In this case, a prioritized LRU (least recently used) algorithm is used. Highest priority is given to those S pages whose owners are not the replacing cluster, as nothing needs to be done to invalidate these pages. Second priority is given to those S pages whose owner is the replacing cluster. Replacement of one of these pages involves the transfer of ownership of the page to the cluster pointed to by the P-link of the replacing cluster. The lowest priority is given to E pages, because another cluster must be found to store the replaced page.

This leads to the second problem: Where should the replaced page go? One way is to keep track of free memory in the system and swap out the page to a cluster with enough space. In the DLL protocol, however, since there is no centralized memory manager, it is very expensive to keep track of free memory. Moreover, the case in which an E page must be replaced should be very rare, so we can afford to use a more costly method to find the new owner. Therefore, when an E page must be replaced, the replacing cluster will simply send it to the next cluster, i.e., the nearest cluster. The cluster receiving the page becomes the new owner of the page, even if it might have to replace one of its own pages to make room for the replaced page. This method is guaranteed to work given that the virtual memory size is smaller than or equal to the physical memory size. Of course, if secondary memories such as disks are available for the storage of swapped-out pages, we do not have this limitation.

3. Performance Enhancement Feature

In this section, an enhancement feature called N-links reduction will be described. The feature, when incorporated into the basic DLL protocol, will reduce the overhead caused by inter-cluster communication and thus enhance the overall performance.

3.1. N-links Reduction

In the basic protocol, when a cluster issues a read request, it sends the message to the cluster pointed to by its N-link. The message may then need to go through a number of unrelated clusters, which are the previous owners of the requested page, before it reaches the true owner. This may introduce a significant amount of overhead as the N-links grow longer.

A simple method to reduce this overhead is to periodically broadcast the page ownership to all clusters in the system, so that all read-requests to the page can reach the owner directly without going through unrelated clusters. This method, however, introduces new problems of its own. First, it is difficult to determine how often the ownership should be broadcast. If a fixed time interval is used, many unnecessary broadcasts might be generated. One way is to count the number of unrelated clusters that a read-request message goes through before reaching the owner, and broadcast the ownership if the number is greater than a certain threshold. The second problem is that broadcasting itself generates a lot of traffic. As a page may not be used by all clusters in the system, many of these broadcasts may be unnecessary.

An alternative method is to reduce the N-links on every read request to a page. According to the DLL protocol, the cluster that generates the read request will become the new owner of the page after the request has been serviced. Therefore, all the clusters that are involved in forwarding the read-request message may change their N-links to the requesting cluster, even though the request has not yet been completed. The requesting cluster should lock the page and queue all accesses to it until the read-data message is received. Although this method only partially reduces the N-links (only the N-links of clusters that were involved in forwarding a message are reduced), it is virtually free because it only uses the normal read-request message without adding new information to it. The performance of the protocol with N-links reduction will be compared to that of the original protocol using simulation studies.

4. Simulated Performance of the DLL Protocol

For the purpose of the simulation studies, we have implemented three different DSM algorithms. Apart from the DLL protocol, a version of the Central Server algorithm and the DDM algorithm are also implemented. In the Central Server algorithm, all the page information and remote memory accesses are handled by a central server, which is one of the clusters in the system. In the DDM algorithm [3] that we have implemented, the ownership of a page does not change with read-requests. The owner of a page keeps a copy-set of all clusters that have valid copies of the page, and invalidates them when a write-request is received. All clusters being invalidated will then send an acknowledgment message back to the requesting cluster, which must wait for all the

Figure 6: Speed-up of linear equations solver
Figure 7: Speed-up of matrix multiplier

acknowledgment messages to arrive before the write-request can be completed.

All the algorithms are implemented as user-level programs on a network of workstations running PVM 3 [9]. A network transfer rate of 0.8 byte/cycle (equivalent to 40 MB/s on a 50 MHz system) and a message-passing latency of 500 cycles are assumed [8, 10, 11]. The page size is set to 1 kbyte for all three algorithms. In various studies of interconnection network performance, the latency is shown to rise sharply when the network becomes saturated [10, 11, 12].

Two common problems are solved in the simulation. First, a set of 256 linear equations is solved using the Gauss-Seidel method [7]. Second, two square matrices of size 48x48 are multiplied together using a parallel matrix multiplier. Systems of up to 16 processors have been simulated and the speed-up obtained by the various algorithms is summarized in Figures 6 and 7. The speed-up is calculated by:

    speed-up = (time required by a single processor) / (time required by n processors)

As can be seen from the graphs, all the DSM algorithms, with the exception of the Central Server algorithm, achieve considerable speed-up even with only a moderately large problem size. For the Central Server algorithm, the speed-up obtained using 16 processors is just around 1.3 for both problems. This is due to the system bottleneck at the cluster that acts as the central server. As all remote memory accesses must go through the central server, serious network congestion occurs and the server becomes a serializing point for the whole system, thus defeating the purpose of parallel processing.

For the other three algorithms, the DLL protocol with N-link reduction generally has the best performance. For the linear equations solver, the speed-up improvement over the DDM algorithm is 32.38% and 14.71% for 8 and 16 processors, respectively. For the matrix multiplier, the improvement is 51.94% and 21.84%.

The DLL protocol without N-link reduction performs better than the DDM algorithm when the number of processors used is less than or equal to 8. However, as the number of processors grows to 16, its performance drops and becomes closer to that of the DDM algorithm. This is because when the number of processors in the system grows larger, the chains of N-links can become very long and it can take a long time for memory access request messages to reach the corresponding owner of the page.

On the other hand, the DDM algorithm also achieves considerable speed-up, though not as good as that achieved by the two DLL protocols. This is due to the large number of messages generated by the DDM algorithm on memory write accesses, when the owner sends a burst of messages to invalidate copies of the page in other clusters.

Note from the figures that the speed-up improvement (shown by the gradient of the curve) of the DLL protocol is greater when the number of processors increases from 1 to 8, and is smaller when the number of processors increases beyond 8. The main reason for this is that inter-processor communication overhead increases with the number of processors, and for the moderate-size problems that we

Figure 8: Maximum instantaneous number of messages in system for linear equations solver
Figure 9: Average instantaneous number of messages in system for linear equations solver

are solving, the overhead becomes significant as the number of processors used is more than 8. If larger problems are to be solved, better speed-up improvement will be achieved by adding more processors.

In order to have a clearer picture of the number of messages used by the DLL and the DDM algorithms, the maximum instantaneous number of messages in the system (Figure 8) and the average instantaneous number of messages in the system (Figure 9) are plotted against the number of processors for the linear equations solver. Similar results have been obtained for the matrix multiplier. Note that the Central Server algorithm is not included in the comparison because its performance is too poor and it would be unfair to compare the number of messages it generates with the other algorithms in concern.

As seen from Figures 8 and 9, the DDM algorithm has both the highest maximum instantaneous number and the highest average instantaneous number of messages in the system. This is due to the frequent bursts of invalidation messages generated by the DDM algorithm on memory write accesses. The constantly high number of messages in the system means that the DDM algorithm is more prone to the network congestion problem. This problem will become more significant when the problem size is large, because then more pages will be required to store the data and results, and thus more clusters will need to perform invalidation requests simultaneously.

On the other hand, since the DLL protocol uses the P-links to invalidate other clusters one by one, the maximum instantaneous number of messages in the system is kept below the number of processors. Therefore, network congestion seldom occurs. This is true even when the problem size increases because, according to the DLL protocol, messages are generated one at a time by each cluster, as opposed to the burst of messages generated by the DDM algorithm during invalidation. Hence, it is expected that the DLL protocol scales better with increasing problem size. Also, note that the average number of messages in the system is smaller when N-link reduction is used, owing to the smaller number of messages needed to locate the owner of a page.

Figure 10 is a plot of the total number of messages used to solve the set of 256 linear equations against the number of processors for the DLL and the DDM algorithms. From the graph, the DLL protocol without N-link reduction requires the highest number of messages to solve the equations. By comparing this

By comparing this with the number of messages required by the DLL protocol with N-link reduction, we can see that a large number of messages were actually wasted going through the N-links locating the owner of pages. This shows the importance of the N-link reduction feature to the DLL protocol.

Also from the graph, the number of messages required by the DDM algorithm to solve the equations is significantly larger than that of the DLL protocol with N-link reduction. The extra messages are actually the extra acknowledgment messages used by the DDM algorithm in the invalidation process. In the DDM algorithm, to invalidate copies of a page in n different clusters, 2n messages are needed (n invalidation messages and n acknowledgment messages). In the DLL protocol, however, only n+1 messages are needed (n invalidation messages and 1 acknowledgment message). Therefore, the more processors in the system, the more messages the DLL protocol will save.

5. Further Research on the DLL Protocol

From the simulation results, we can see that the DLL protocol provides a promising and feasible solution to the DSM memory coherence problem. The possibility of implementing the DLL protocol in a wide range of systems calls for further investigation of the protocol.

First, as the DLL protocol was originally designed for use in distributed operating systems running on MPP systems, it is desirable that the protocol be actually implemented so that performance evaluation using the real implementation, rather than just simulation, becomes possible. In fact, our group is currently building an MPP system using the hierarchical cluster model. The system will run a version of Mach [13] with built-in DSM using the DLL protocol.

Second, from the experience of implementing the DLL protocol on a network of workstations running PVM for simulation purposes, we found that it might be feasible to include the protocol as an add-on library for PVM, or as an alternative library, so that users can access the distributed memory of the workstations transparently as a global shared memory. As the costs of powerful workstations drop, this could offer a new, cost-effective, and yet user-friendly way of parallel processing.

Third, although the current implementation of the DLL protocol uses the sequential consistency model, it is possible to convert the protocol to use a relaxed consistency model [2] with only minor modifications. Moreover, it is expected that the DLL protocol will be suitable for relaxed coherence models because then a cluster does not need to wait for the invalidation message to go through a chain of P-links before it can complete a normal write operation. In addition, the use of the write-update algorithm instead of the write-invalidate algorithm for normal write operations will then become possible, although write-invalidate should still be used for synchronized write accesses because of its shorter completion time.

6. Conclusion

This paper has introduced the Doubly-Linked List protocol for Distributed Shared Memory systems. Detailed explanations of the protocol as well as performance evaluation and comparison with other existing protocols using simulation studies have also been presented. By using the linked list of clusters, the DLL protocol provides a fully distributed way to keep track of replicated pages in the system, thus eliminating the network congestion problem caused by the generation of a large number of messages within a short time by a single cluster. It is shown that the DLL protocol provides a high-performance and scalable solution to implementing DSM in multiprocessor systems.

7. Acknowledgment

This project is supported in part by the University of Hong Kong CRGC Grant 337/062/0012.

References

[1] B.Nitzberg and V.Lo, "Distributed Shared Memory: A Survey of Issues and Algorithms," Computer, IEEE, pp. 52-60, Aug. 1991.
[2] K.Hwang, Advanced Computer Architecture, McGraw-Hill, pp. 19-27, pp. 248-256, pp. 487-590, 1993.
[3] K.Li and P.Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM Trans. Computer Systems, Vol. 7 No. 4, pp. 321-359, Nov. 1989.
[4] K.Li and R.Schaefer, "A Hypercube Shared Virtual Memory System," Proceedings of 1989 International Conference on Parallel Processing, pp. I-125 - I-132, Aug. 1989.
[5] M.Stumm and S.Zhou, "Algorithms Implementing Distributed Shared Memory," Computer, IEEE, pp. 54-64, May 1990.
[6] R.Bisiani and M.Ravishankar, "PLUS: A Distributed Shared-Memory System," Proceedings of 17th International Symposium on Computer Architecture, pp. 115-124, 1990.
[7] S.Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989, pp. 203-205.
[8] TMS320C4x User's Guide, Texas Instruments, pp. 1-1 - 1-12, 1992.
[9] A.Geist, A.Beguelin, J.Dongarra, W.Jiang, R.Manchek, V.Sunderam, PVM 3 User's Guide and Reference Manual, Oak Ridge National Laboratory, 1994.
[10] X.Lin, P.K.McKinley, L.M.Ni, "Deadlock-Free Multicast Wormhole Routing in 2-D Mesh Multicomputers," IEEE Transactions on Parallel and Distributed Systems, pp. 793-804, Aug. 1994.
[11] J.Kim, A.Chien, "The Impact of Packetization in Wormhole-Routed Networks," Proceedings of Parallel Architectures and Languages Europe, June 1993.
[12] W.Dally, "Virtual-Channel Flow Control," IEEE Transactions on Parallel and Distributed Systems, pp. 194-205, Mar. 1992.
[13] M.Accetta, R.Baron, W.Bolosky, D.Golub, R.Rashid, A.Tevanian and M.Young, "Mach: A New Kernel Foundation for UNIX Development," Proceedings of Summer 1986 USENIX Conference, pp. 93-113, June 1986.
