
A Fast and Efficient Strategy for Sub-mesh Allocation with Minimal Allocation
Overhead in 3D Mesh Connected Multicomputers

S. Bani-Mohammad (1), M. Ould-Khaoua (1), I. Ababneh (2) and Lewis M. Mackenzie (1)
(1) Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK.
Email: {saad, mohamed, lewis}@dcs.gla.ac.uk
(2) Department of Computing Science, Al al-Bayt University, Mafraq, Jordan.
Email: [email protected]
Abstract-This paper presents a fast and efficient contiguous allocation strategy for 3D mesh multicomputers, referred to as Turning Busy List (TBL for short), which can identify a free sub-mesh of the requested size as long as it exists in the mesh system. Turning means that the orientation of the allocation request is changed when no sub-mesh is available in the requested orientation. The TBL strategy relies on a new approach that maintains a list of allocated sub-meshes to determine all the regions consisting of nodes that cannot be used as base nodes for the requested sub-mesh. These nodes are then subtracted from the right border plane of the allocated sub-meshes to find the nodes that can be used as base nodes for the required sub-mesh size. Results from extensive simulations under a variety of system loads confirm that the TBL strategy incurs much less allocation overhead than all of the existing contiguous allocation strategies for 3D mesh multicomputers and delivers competitive performance in terms of parameters such as the average turnaround times and system utilization. Moreover, the time complexity of the TBL strategy is much lower than that of the existing strategies.

Keywords-Contiguous Allocation, Turnaround Time, Utilization, Allocation Overhead, Switching Request Orientation, Simulation.

I. INTRODUCTION
Multicomputers, consisting of many processing elements (or nodes) connected through a high-speed interconnection network, have been a prevalent computing platform for many real world scientific and engineering applications [3]. The mesh has been one of the most common networks for existing multicomputers due to its simplicity, scalability, structural regularity, and ease of implementation [2, 3, 9, 15].
Efficient processor allocation and job scheduling are critical to harnessing the full computing power of a multicomputer [3, 7, 8, 23]. Processor allocation is responsible for selecting the set of processors on which parallel jobs are executed, while job scheduling is responsible for determining the order in which the jobs are executed [3]. An incoming job specifies the side lengths of the sub-mesh it requires before joining the queue. The job scheduler selects the next job for execution using the underlying scheduling policy and then the processor allocator finds an available sub-mesh for the selected job.
In distributed memory multicomputers, each job is allocated a distinct contiguous processor sub-mesh for the duration of its execution [3, 7, 8, 9, 10, 15, 23, 24]. Most existing research studies [3, 7, 9, 14, 15, 24] on contiguous allocation have been carried out in the context of the 2D mesh network. There has been relatively little work on the 3D version of the mesh. Although the 2D mesh has been used in a number of parallel machines, such as iWARP [4], the Cray XT3 [5], and Delta Touchstone [12], most practical multicomputers, like the MIT J-Machine [25], Cray T3D [18], the IBM BlueGene/L [1, 16], and Cray T3E [6], have used the 3D mesh as the underlying network topology due to its lower diameter and average communication distance [21].
The main shortcoming of existing contiguous allocation strategies for 3D mesh [8, 10, 23] is that they achieve complete sub-mesh recognition capability with high allocation overhead.
Ubiquitous Computing and Communication Journal

An efficient sub-mesh allocation strategy must be efficient in the time it takes for both allocation and deallocation (i.e., allocation overhead). In addition to its own efficiency, the strategy must deliver competitive performance. The performance is measured in terms of overall performance parameters such as the average turnaround time, mean allocation time and mean system utilization. In this paper, we propose a new fast and efficient contiguous allocation strategy that supports the rotation of job requests. The proposed strategy has lower time complexity than the existing strategies, yet the simulation results show that its system performance is as good as that of the promising previously proposed strategies. Moreover, the mean measured allocation time of the proposed strategy is much lower than that of the previous strategies.
The rest of the paper is organised as follows. Section 2 contains a brief summary of allocation strategies previously proposed for the 3D mesh. Section 3 contains a set of relevant preliminaries. Section 4 contains the proposed contiguous allocation strategy, and its complexity analysis is given in Section 5. Simulation results are presented in Section 6. Section 7 concludes this study.

II. RELATED WORK
Contiguous allocation has been investigated for 2D and 3D mesh-connected multicomputers [3, 7, 8, 9, 10, 14, 15, 23, 24]. The main shortcoming of the very few existing contiguous allocation strategies for the 3D mesh is that they achieve complete sub-mesh recognition capability with high allocation overhead. Below we describe some of these strategies.
Contiguous First Fit and Best Fit Allocation for the 3D Mesh: In [10], turning the allocation request is used to improve the performance of contiguous First Fit and Best Fit allocation in the 3D mesh. Simulation results have shown that First Fit with rotation can greatly improve performance in terms of average turnaround time and scheduling effectiveness. Moreover, the performance of First Fit is almost identical to that of Best Fit.
First Fit (FF), Best Fit (BF), and Worst Fit (WF) for the 3D Torus Mesh: The contiguous allocation strategies First Fit, Best Fit and Worst Fit have been investigated and compared using simulation for the 3D torus (a torus is a mesh with wraparound links) and the mesh networks when jobs are scheduled using First-Come-First-Served (FCFS). Changes to request orientation are allowed; that is, if, for example, a request for a × b × c processors could not be satisfied, it could be reoriented to become a × c × b [23]. Simulation results in [23] have shown that these three strategies have comparable performance; the performance of FF is close to that of BF, and it is better than that of WF.
An Efficient First Fit on the 3D torus: An efficient processor allocation scheme for the 3D torus which has complete recognition capability has been used to improve the performance of contiguous First Fit allocation on the 3D torus [8]. The results show that the average allocation time of the new scheme is much smaller than that of the earlier scheme [23] that is based on the Best Fit approach for practical ranges of input load. Moreover, the time complexity of the new scheme is much lower. This is achieved by an efficient search mechanism proposed for finding a free sub-mesh. In [8], different scheduling strategies, such as FCFS and ScanAll, have been studied to avoid potential performance loss due to blocking.
The above allocation strategies consider only contiguous regions for the execution of a job. As a consequence, the length of the communication paths is expected to be minimized in contiguous allocation. Only messages generated by the same job are expected within a sub-mesh and therefore cause no inter-job contention in the network. However, the time complexities of the above allocation strategies grow with the size of the mesh, so they achieve complete sub-mesh recognition capability but with high allocation overhead.
In our proposed strategy, the mean measured allocation time is much lower than that of the previous strategies, and the strategy also delivers competitive performance. Moreover, the time complexity of the allocation and deallocation operations is lower than that of the previous strategies and does not grow with the size of the mesh as in previous strategies.

III. PRELIMINARIES
The target system is a W × D × H 3D mesh, where W is the width of the cubic mesh, D its depth and H its height. Each processor is denoted by a coordinate triple (x, y, z), where 0 ≤ x ≤ W − 1, 0 ≤ y ≤ D − 1 and 0 ≤ z ≤ H − 1 [19]. A processor is connected by bidirectional communication links to its neighbour

processors. The following definitions have been adopted from [7, 19].
Definition 1: A sub-mesh S(w, d, h) of width w, depth d, and height h, where 1 ≤ w ≤ W, 1 ≤ d ≤ D and 1 ≤ h ≤ H, is specified by the coordinates (x, y, z) and (x′, y′, z′), where (x, y, z) are the coordinates of the base of the sub-mesh and (x′, y′, z′) are the coordinates of its end, as shown in Fig. 1.
Definition 2: The size of S(w, d, h) is w × d × h.
Definition 3: An allocated sub-mesh is one whose processors are all allocated to a parallel job.
Definition 4: A free sub-mesh is one whose processors are all unallocated.
Definition 5: A suitable sub-mesh S(x, y, z) is a free sub-mesh that satisfies the conditions x ≥ a, y ≥ b and z ≥ c, assuming that the allocation of S(a, b, c) is requested.
Definition 6: A list of all sub-meshes that are currently allocated to jobs and are not available for allocation to other jobs is called the busy list.
Definition 7: A prohibited region is a region consisting of nodes that cannot be used as base nodes for the requested sub-mesh.
Definition 8: The Right Border Plane (RBP) of a sub-mesh S(x1, y1, z1, x2, y2, z2) with respect to a job J(α × β × γ) is defined as the collection of nodes with address (x2 + 1, y′, z′), where max(y1 − β + 1, 0) ≤ y′ ≤ y2 and max(z1 − γ + 1, 0) ≤ z′ ≤ z2. An RBP of sub-mesh S is a plane located just off the right boundary of S.

[Figure: a sub-mesh drawn inside the 3D mesh with X, Y and Z axes; its base and end nodes are marked.]
Fig. 1: A sub-mesh inside the 3D mesh.

IV. PROPOSED ALLOCATION STRATEGY

The proposed allocation strategy is based on maintaining a busy list of allocated sub-meshes. The list is scanned to determine all prohibited regions.
A prohibited region of job J(α × β × γ) with respect to an allocated sub-mesh S(x1, y1, z1, x2, y2, z2) in the busy list is defined as the sub-mesh represented by the address (x′, y′, z′, x2, y2, z2), where x′ = max(x1 − α + 1, 0), y′ = max(y1 − β + 1, 0) and z′ = max(z1 − γ + 1, 0). For example, if a job J requests the allocation of a sub-mesh of size 2 × 2 × 2, the prohibited region of J(2 × 2 × 2) with respect to the allocated sub-mesh (1,1,0,2,2,1) is the sub-mesh (0,0,0,2,2,1).
The sub-meshes (W − α + 1, 0, 0, W − 1, D − 1, H − 1), (0, D − β + 1, 0, W − 1, D − 1, H − 1), and (0, 0, H − γ + 1, W − 1, D − 1, H − 1) are automatically not available for accommodating the base node of a free α × β × γ sub-mesh for J(α × β × γ), whether the nodes in these sub-meshes are free or not; otherwise, the sub-mesh would grow out of the width, depth or height bounds of M(W, D, H). These three sub-meshes are called automatic prohibited regions of J(α × β × γ) and must always be excluded during the sub-mesh allocation process.
A job J(α × β × γ) is allocatable if there exists at least one node that does not belong to any of the prohibited regions or to the three automatic prohibited regions of J(α × β × γ).
All prohibited regions that result from the allocated sub-meshes are subtracted from each RBP of the allocated sub-meshes to determine the nodes that can be used as base nodes for the required sub-mesh size. Fig. 2 shows all possible cases for subtracting prohibited regions from an RBP. The algorithm that is used to detect the base nodes for any new job request is formally presented in Fig. 3, and the allocation algorithm is presented in Fig. 4.
To facilitate the presentation of the algorithm, we assume that there is a hypothetical allocated sub-mesh b0 with address (−1, 0, 0, −1, D − 1, H − 1) at the head of the busy list. The RBP of the hypothetical allocated sub-mesh is the left boundary of the mesh. The list RBP_Nodes contains a plane if its nodes are available for allocation to a job J(α × β × γ) selected for execution.
The proposed allocation algorithm supports the rotation of the job request. Let (a, b, c) be the width, depth and height of a sub-mesh allocation request. The six

permutations (a, b, c), (a, c, b), (b, a, c), (b, c, a), (c, a, b) and (c, b, a) are, in turn, considered for allocation using the proposed allocation strategy. If allocation succeeds for any of these permutations the process stops. For example, assume a free mesh (3, 3, 2) and the job requests (2, 3, 2) and (3, 2, 1) arrive in this order. The second job request cannot be allocated until it is rotated to (1, 3, 2).

[Figure: an RBP with corners (x, y1, z1) and (x, y2, z2) overlapping a prohibited region with corners (u1, v1, w1) and (u2, v2, w2).]

2.1 ((x<u1)||(x>u2)||(z2<w1)||(z1>w2)||(y2<v1)||(y1>v2))
    In this case the result is RBP itself.
2.2 (u1≤x≤u2)&&(y2==v1)&&(y1<v1)&&(w1≤z2≤w2)&&(z1<w1)
    RBP1 (x, v1, z1, x, y2, w1-1); RBP2 (x, y1, z1, x, v1-1, z2)
2.3 (u1≤x≤u2)&&(y2>v2)&&(y1==v2)&&(w1≤z2≤w2)&&(z1<w1)
    RBP1 (x, y1, z1, x, y2, w1-1); RBP2 (x, v2+1, w1, x, y2, z2)
2.4 (u1≤x≤u2)&&(v1≤y1≤v2)&&(v1≤y2≤v2)&&(z1==w2)&&(z2>w2)
    RBP (x, y1, w2+1, x, y2, z2)
2.5 (u1≤x≤u2)&&(v1≤y1≤v2)&&(y2>v2)&&(z1==w2)&&(z2>w2)
    RBP1 (x, y1, w2+1, x, y2, z2); RBP2 (x, v2+1, z1, x, y2, w2)
2.6 (u1≤x≤u2)&&(v1≤y2≤v2)&&(y1<v1)&&(z1==w2)&&(z2>w2)
    RBP1 (x, y1, w2+1, x, y2, z2); RBP2 (x, y1, z1, x, v1-1, w2)
2.7 (u1≤x≤u2)&&(v1≤y1≤v2)&&(v1≤y2≤v2)&&(z2==w1)&&(z1<w1)
    RBP (x, y1, z1, x, y2, w1-1)
2.8 (u1≤x≤u2)&&(v1≤y1≤v2)&&(y2>v2)&&(z2==w1)&&(z1<w1)
    RBP1 (x, y1, z1, x, y2, w1-1); RBP2 (x, v2+1, w1, x, y2, z2)
2.9 (u1≤x≤u2)&&(v1≤y2≤v2)&&(y1<v1)&&(z2==w1)&&(z1<w1)
    RBP1 (x, y1, z1, x, y2, w1-1); RBP2 (x, y1, w1, x, v1-1, z2)
2.10 (u1≤x≤u2)&&(v1≤y1≤v2)&&(v1≤y2≤v2)&&(w1<z2≤w2)&&(z1<w1)
    RBP (x, y1, z1, x, y2, w1-1)
2.11 (u1≤x≤u2)&&(v1≤y1≤v2)&&(y2>v2)&&(w1<z2≤w2)&&(z1<w1)
    RBP1 (x, y1, z1, x, y2, w1-1); RBP2 (x, v2+1, w1, x, y2, z2)
2.12 (u1≤x≤u2)&&(v1≤y2≤v2)&&(y1<v1)&&(w1<z2≤w2)&&(z1<w1)
    RBP1 (x, y1, z1, x, y2, w1-1); RBP2 (x, y1, w1, x, v1-1, z2)
2.13 (u1≤x≤u2)&&(v1≤y1≤v2)&&(v1≤y2≤v2)&&(w1≤z1≤w2)&&(z2>w2)
    RBP (x, y1, w2+1, x, y2, z2)
2.14 (u1≤x≤u2)&&(v1≤y1≤v2)&&(y2>v2)&&(w1≤z1≤w2)&&(z2>w2)
    RBP1 (x, v2+1, z1, x, y2, w2); RBP2 (x, y1, w2+1, x, y2, z2)
2.15 (u1≤x≤u2)&&(v1≤y2≤v2)&&(y1<v1)&&(w1≤z1≤w2)&&(z2>w2)
    RBP1 (x, y1, z1, x, v1-1, w2); RBP2 (x, y1, w2+1, x, y2, z2)
2.16 (u1≤x≤u2)&&(v1≤y1≤v2)&&(v1≤y2≤v2)&&(z1<w1)&&(z2>w2)
    RBP1 (x, y1, z1, x, y2, w1-1); RBP2 (x, y1, w2+1, x, y2, z2)
2.17 (u1≤x≤u2)&&(v1≤y1≤v2)&&(y2>v2)&&(z1<w1)&&(z2>w2)
    RBP1 (x, y1, z1, x, v2, w1-1); RBP2 (x, v2+1, z1, x, y2, z2)
    RBP3 (x, y1, w2+1, x, v2, z2)
2.18 (u1≤x≤u2)&&(v1≤y2≤v2)&&(y1<v1)&&(z1<w1)&&(z2>w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v1, z1, x, y2, w1-1)
    RBP3 (x, v1, w2+1, x, y2, z2)
2.19 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1<w1)&&(z2>w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v2+1, z1, x, y2, z2)
    RBP3 (x, v1, z1, x, v2, w1-1); RBP4 (x, v1, w2+1, x, v2, z2)
2.20 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1<w1)&&(z2==w2)
    RBP3 (x, v1, z1, x, v2, w1-1)
2.21 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1==w1)&&(z2>w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v2+1, z1, x, y2, z2)
    RBP3 (x, v1, w2+1, x, v2, z2)
2.22 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1==w1)&&(z2==w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v2+1, z1, x, y2, z2)
2.23 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1==w1)&&(z2<w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v2+1, z1, x, y2, z2)
2.24 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1>w1)&&(z2==w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v2+1, z1, x, y2, z2)
2.25 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z1<w1)&&(w1≤z2<w2)
    RBP1 (x, y1, z1, x, v1-1, z2); RBP2 (x, v2+1, z1, x, y2, z2)
    RBP3 (x, v1, z1, x, v2, w1-1)
2.26 (u1≤x≤u2)&&(y2>v2)&&(y1<v1)&&(z2>w2)&&(w1≤z1<w2)
    RBP3 (x, v1, w2+1, x, v2, z2)
2.27 (u1≤x≤u2)&&(v1≤y1≤v2)&&(w1≤z1≤w2)&&(w1≤z2≤w2)
    No RBP in this case.
2.28 (u1≤x≤u2)&&(v1≤y1≤v2)&&(y2>v2)&&(w1≤z1≤w2)&&(w1≤z2≤w2)
    RBP (x, v2+1, z1, x, y2, z2)
2.29 (u1≤x≤u2)&&(v1≤y2≤v2)&&(y1<v1)&&(w1≤z1≤w2)&&(w1≤z2≤w2)
    RBP (x, y1, z1, x, v1-1, z2)

Fig. 2: All possible cases for subtracting a prohibited region from a right border plane
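
The prohibited region, the RBP, the automatic prohibited regions and the six request orientations defined above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' C implementation; the function names are hypothetical, and sub-meshes are assumed to be plain 6-tuples (x1, y1, z1, x2, y2, z2).

```python
from itertools import permutations


def prohibited_region(busy, job):
    """Base nodes ruled out by one allocated sub-mesh (x1, y1, z1, x2, y2, z2)."""
    x1, y1, z1, x2, y2, z2 = busy
    a, b, c = job  # requested width, depth, height (alpha, beta, gamma)
    return (max(x1 - a + 1, 0), max(y1 - b + 1, 0), max(z1 - c + 1, 0), x2, y2, z2)


def right_border_plane(busy, job):
    """Plane just off the right boundary of an allocated sub-mesh (Definition 8)."""
    x1, y1, z1, x2, y2, z2 = busy
    _, b, c = job
    return (x2 + 1, max(y1 - b + 1, 0), max(z1 - c + 1, 0), x2 + 1, y2, z2)


def automatic_prohibited_regions(mesh, job):
    """Regions where a base node would push the request outside the mesh."""
    W, D, H = mesh
    a, b, c = job
    return [
        (W - a + 1, 0, 0, W - 1, D - 1, H - 1),
        (0, D - b + 1, 0, W - 1, D - 1, H - 1),
        (0, 0, H - c + 1, W - 1, D - 1, H - 1),
    ]


def orientations(job):
    """Distinct orientations of a request, original first, in the order listed in the text."""
    seen = []
    for p in permutations(job):
        if p not in seen:  # skip duplicates when two side lengths are equal
            seen.append(p)
    return seen
```

With the values used in the text, `prohibited_region((1, 1, 0, 2, 2, 1), (2, 2, 2))` gives (0, 0, 0, 2, 2, 1), and `orientations((3, 2, 1))` includes the rotated request (1, 3, 2).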

Procedure Detect (α, β, γ):
Begin
{
{
Mesh M(w, d, h); an incoming job J requests an α×β×γ free sub-mesh;
Busy List B = {b0, b1, b2, ..., bm}, where b0 is a hypothetical allocated sub-mesh and bi, 1≤i≤m, are the m already
allocated sub-meshes; the three sub-meshes (0, d-β+1, 0, w-1, d-1, h-1), (w-α+1, 0, 0, w-1, d-1, h-1), and (0, 0, h-γ+1,
w-1, d-1, h-1) are automatic prohibited regions and are never available for accommodating the base
node of a free α×β×γ sub-mesh for J.
}
Step 1. RBP_Nodes←NULL.
Step 2. for each allocated sub-mesh bi from i = 0 to m
Step 2.1. Construct RBP of bi, denoted as RBPi= (xr, yr1, zr1, xr, yr2, zr2), with respect to J where xr=x2+1,
yr1=max(y1-β+1, 0), zr1=max(z1-γ+1,0), yr2=y2 and zr2=z2.
Step 2.2. if RBPi is within any automatic prohibited region then goto Step2.
Step 2.3. for each allocated sub-mesh bj (x1, y1, z1, x2, y2, z2) from j = 1 to m
Construct prohibited region of J with respect to bj, denoted as Fj = (xf1, yf1, zf1, xf2, yf2, zf2) where
xf1=max(x1-α+1, 0), yf1=max(y1-β+1, 0), zf1=max(z1-γ+1, 0), xf2=x2, yf2=y2 and zf2=z2.
Subtract Fj from RBPi as follows:

Determine the case to which the subtraction belongs by comparing the coordinates of RBPi and Fj as in Fig. 2.
Switch (subtraction case)

{
case (1): if (zr1> zf2) then
begin
add the RBP in Fig. 2.1 to RBP_nodes.
goto Step 2.
end
break.
case (2): adjust RBPi as in Fig. 2.2; break. case (3): adjust RBPi as in Fig. 2.3; break.
case (4): add the RBP in Fig. 2.4 to RBP_nodes.; goto Step 2.
case (13): add the RBP in Fig. 2.13 to RBP_nodes.; goto Step 2.
case (26): adjust RBPi as in Fig. 2.26; break. case (27): go to Step 2.
}
goto Step 2.3.
TBL_Allocate(RBP_Nodes, α, β, γ)
}
End.
Fig. 3: Outline of the Detect Procedure in TBL Contiguous Allocation Strategy
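
The detection step can be illustrated with a simplified Python sketch. Note the assumption: instead of the case-by-case plane subtraction of Fig. 2, this sketch tests every RBP node individually against the prohibited regions, so it yields the same kind of candidate base nodes but does not have the time complexity analysed in Section V. The function name `detect_base_nodes` and the tuple representation are hypothetical.

```python
from itertools import product


def detect_base_nodes(mesh, busy_list, job):
    """Return RBP nodes usable as base nodes for `job` (per-node test, not Fig. 2)."""
    W, D, H = mesh
    a, b, c = job
    # Hypothetical sub-mesh b0: its RBP is the left boundary of the mesh.
    b0 = (-1, 0, 0, -1, D - 1, H - 1)

    def prohibited(node):
        x, y, z = node
        # Automatic prohibited regions: the request would leave the mesh.
        if x > W - a or y > D - b or z > H - c:
            return True
        # Prohibited region of each allocated sub-mesh.
        for (x1, y1, z1, x2, y2, z2) in busy_list:
            if (max(x1 - a + 1, 0) <= x <= x2
                    and max(y1 - b + 1, 0) <= y <= y2
                    and max(z1 - c + 1, 0) <= z <= z2):
                return True
        return False

    candidates = []
    for (x1, y1, z1, x2, y2, z2) in [b0] + list(busy_list):
        xr = x2 + 1  # the RBP lies one step to the right of the sub-mesh
        ys = range(max(y1 - b + 1, 0), y2 + 1)
        zs = range(max(z1 - c + 1, 0), z2 + 1)
        for y, z in product(ys, zs):
            node = (xr, y, z)
            if not prohibited(node) and node not in candidates:
                candidates.append(node)
    return candidates
```

For a 4 × 4 × 4 mesh with the single allocated sub-mesh (0,0,0,1,3,3) and a 2 × 1 × 2 request, the sketch returns the nodes of the plane (2,0,0,2,3,2), starting with (2,0,0).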

Procedure TBL_Allocate (RBP_Nodes, α, β, γ):
Begin
{
int botx, boty, botz, w, d, h;
botx=RBP_Nodes.botx; boty=RBP_Nodes.boty;
botz=RBP_Nodes.botz;
for each wi from i = botx to botx + α - 1
for each dj from j = boty to boty + β - 1
for each hk from k = botz to botz + γ - 1
Allocate the node (wi, dj, hk) for the incoming job.
}
End.

Fig. 4: Outline of the TBL Contiguous Allocation Strategy

Example: This example shows how the allocation algorithm works. We assume the mesh is free, and three requests for the allocation of 2×4×4, 2×1×2 and 1×2×1 sub-meshes arrive in this order. Fig. 5 gives the states of the processors of a 4×4×4 mesh. Assume the job request 2×4×4 is allocated the sub-mesh (0,0,0,1,3,3) as shown in Fig. 5, and then the allocation algorithm is called with a 2×1×2 allocation request. In this case, the busy list contains the allocated sub-meshes b0:(-1,0,0,-1,3,3) and b1:(0,0,0,1,3,3). For the second job request 2×1×2, the first RBP (the RBP of the hypothetical allocated sub-mesh b0 in the busy list) is calculated, resulting in (0,0,0,0,3,3). The automatic prohibited regions (0,4,0,3,3,3), (3,0,0,3,3,3) and (0,0,3,3,3,3) with respect to the second allocation request are subtracted from the first RBP, resulting in the plane (0,0,0,0,3,2). The prohibited region of the allocated sub-mesh b1:(0,0,0,1,3,3) with respect to the second allocation request is then calculated, resulting in (0,0,0,1,3,3), which when subtracted from the plane (0,0,0,0,3,2) results in the NULL value, meaning that no node is available for the job request up to this point. Then, the RBP of the allocated sub-mesh b1:(0,0,0,1,3,3) is calculated, resulting in (2,0,0,2,3,3). Again the automatic prohibited regions with respect to the second allocation request are subtracted from the new RBP, resulting in (2,0,0,2,3,2). After that the prohibited region of the allocated sub-mesh b1 is subtracted from this plane, which leaves (2,0,0,2,3,2) unchanged. Now, any node on the plane (2,0,0,2,3,2) can be used as a base node for the second allocation request. In this example, the node (2,0,0) is used as a base node for the second allocation request, the sub-mesh (2,0,0,3,0,1) is allocated to the job, and it is then added to the busy list, resulting in {b0:(-1,0,0,-1,3,3), b1:(0,0,0,1,3,3), b2:(2,0,0,3,0,1)}. The same procedure is repeated to allocate a sub-mesh to the third job. The allocated sub-mesh for the allocation request 2×4×4 is denoted by black circles, the allocated sub-mesh for the allocation request 2×1×2 is denoted by shaded circles and the allocated sub-mesh for the allocation request 1×2×1 is denoted by dotted circles.
In the deallocation operation, the allocated sub-mesh is deallocated by removing its corresponding entry in the busy list. The deallocation algorithm is formally presented in Fig. 6.

[Figure: the 4 × 4 × 4 mesh drawn as four planes h = 0 to h = 3 along the width, depth and height axes, with corner nodes labelled from (0,0,0) to (3,0,3) and (0,3,3).]
Fig. 5: Allocation Example

Procedure TBL_Deallocate ():
Begin
{
jid = id of the departing job;
For all elements in the busy list
if (element's id = jid)
remove the element from the busy list
}
End.

Fig. 6: Outline of the deallocation algorithm
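
The busy-list bookkeeping behind Fig. 4 and Fig. 6 can be sketched as follows. This is a minimal Python illustration, assuming each busy-list entry pairs a job id with a sub-mesh address; the class name `BusyList` and its methods are hypothetical and are not taken from the authors' implementation.

```python
class BusyList:
    """Busy list of allocated sub-meshes, keyed by the id of the owning job."""

    def __init__(self):
        # Each entry is (job_id, (x1, y1, z1, x2, y2, z2)).
        self.entries = []

    def allocate(self, job_id, base, size):
        """Record the sub-mesh that starts at `base` and spans `size` nodes per axis."""
        x, y, z = base
        a, b, c = size
        # The end node is size - 1 steps from the base along each axis.
        submesh = (x, y, z, x + a - 1, y + b - 1, z + c - 1)
        self.entries.append((job_id, submesh))
        return submesh

    def deallocate(self, job_id):
        """Remove the departing job's entry; one pass over the m entries."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e[0] != job_id]
        return before - len(self.entries)
```

Replaying the example, `allocate(1, (0, 0, 0), (2, 4, 4))` records (0,0,0,1,3,3) and `allocate(2, (2, 0, 0), (2, 1, 2))` records (2,0,0,3,0,1).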

V. ALLOCATION AND DEALLOCATION TIME COMPLEXITIES

Firstly, we analyse the allocation algorithm. Assume that there are m allocated sub-meshes in the busy list. The RBP construction operation in Steps 2 and 2.1 requires O(m) time. Scanning the busy list to subtract a prohibited region from an RBP requires O(1) time for each subtraction operation. In the worst case there are at most four RBPs and m prohibited regions for subsequent subtraction; therefore the operation of subtracting m prohibited regions from an RBP in Step 2.3 takes O(m) time. There are a total of 4 × m RBPs and m prohibited regions to be considered; therefore the allocation algorithm takes O(m²) time. Typically, the average values of m do not grow with n, where n is the number of processors in the mesh, as we will see in the simulation results.
The deallocation algorithm requires m iterations to remove the allocated sub-mesh from the busy list. Therefore, the deallocation algorithm takes O(m) time. The proposed algorithm maintains a busy list of m allocated sub-meshes. Therefore, the space complexity of the allocation algorithm is O(m). The space incurred by this strategy is small compared to the improvement in performance in terms of allocation time, as we will see in the simulation results.

VI. SIMULATION RESULTS

Extensive simulation experiments have been carried out to compare the performance of the proposed TBL allocation strategy against the well-known contiguous allocation strategy First Fit [10], with and without change of request orientation. The First Fit strategy allocates an incoming job to the first available sub-mesh that is found [8, 10, 23, 24]. In this study, First Fit has been used to represent the contiguous class of strategies as it has been found to perform well [8, 10, 23, 24]. Switching request orientation has been used in [8, 10, 23].
We have implemented the proposed allocation and deallocation algorithms, including the busy list routines, in the C language, and integrated the software into the ProcSimity simulation tool for processor allocation and scheduling in highly parallel systems [13, 17].
The target mesh is a cube with width W, depth D and height H. Jobs are assumed to have exponential inter-arrival times. They are served on a first-come-first-served (FCFS) basis to preserve fairness [9, 14, 15, 22]. The execution times are assumed to be exponentially distributed with a mean of one time unit. Two distributions are used to generate the width, depth and height of job requests. The first is the uniform distribution over the range from 1 to the mesh side length, where the width, depth and height of the job requests are generated independently. The second distribution is the exponential distribution, where the width, depth and height of the job requests are exponentially distributed with a mean of half the side length of the entire mesh. These distributions have often been used in the literature [3, 7, 9, 10, 11, 14, 15, 19, 20, 23, 24]. Each simulation run consists of one thousand completed jobs. Simulation results are averaged over enough independent runs so that the confidence level is 95% that relative errors are below 5% of the means. The main performance parameters observed are the average turnaround time of jobs, mean system utilization and mean allocation time. The turnaround time is the time that a parallel job spends in the mesh from arrival to departure. The utilization is the percentage of processors that are utilized over time. The allocation time is the time that the allocation algorithm takes to assign a set of jobs to the mesh system. The allocation time reported is the real (measured) time spent detecting the availability of a free sub-mesh for an incoming job request. We recognize that these results are implementation dependent, but the trends shown by the results indicate the features of the strategies. The independent variable in the simulation is the system load. The system load is defined as the inverse of the mean inter-arrival time of jobs. Unless specified otherwise, the performance figures shown below are for an 8 × 8 × 8 mesh.
Figures 7 and 8 show simulation results for average allocation time against job arrival rates in an 8 × 8 × 8 mesh when request side lengths follow the uniform and the exponential distributions. We observe that TBL is superior to TFF in the two figures. In figure 7, for example, the average allocation time of TBL is 0.33 of the average allocation time of TFF under the arrival rate of 4.6 jobs/time unit. It can also be seen in the figures that our proposed TBL strategy consistently takes much less allocation time than the TFF strategy regardless of the system load. Moreover, the difference in allocation time becomes much more significant as the system load increases. Thus, our proposed strategy can be said to be more effective than the other strategies represented by TFF in these figures.
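
The workload model described above can be sketched in Python as follows. This is an illustrative sketch, not the ProcSimity code: the function names are hypothetical, and rounding the exponential side lengths up and clamping them to the range from 1 to the mesh side length is an assumption, since the text does not say how continuous samples are mapped to integer side lengths.

```python
import math
import random


def uniform_sides(rng, side):
    """Width, depth and height drawn independently and uniformly from 1..side."""
    return tuple(rng.randint(1, side) for _ in range(3))


def exponential_sides(rng, side):
    """Side lengths with mean side/2; rounding up and clamping are assumptions."""
    mean = side / 2
    return tuple(
        min(side, max(1, math.ceil(rng.expovariate(1 / mean)))) for _ in range(3)
    )


def generate_jobs(rng, count, load, side, sides_fn):
    """Return (arrival_time, execution_time, (w, d, h)) tuples in arrival order."""
    clock = 0.0
    jobs = []
    for _ in range(count):
        clock += rng.expovariate(load)  # mean inter-arrival time = 1 / load
        service = rng.expovariate(1.0)  # mean execution time = one time unit
        jobs.append((clock, service, sides_fn(rng, side)))
    return jobs
```

A run of one thousand jobs at load 4.6 on an 8 × 8 × 8 mesh would then be `generate_jobs(random.Random(0), 1000, 4.6, 8, uniform_sides)`; the seed value is arbitrary.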
[Figure: average allocation time (msec) plotted against load (0.2 to 4.6) for TBL and TFF.]
Fig. 7: Average allocation times for the contiguous allocation strategies (TBL, TFF) and uniform side lengths distribution in an 8 × 8 × 8 mesh.

[Figure: average allocation time (msec) plotted against load (1 to 9.8) for TBL and TFF.]
Fig. 8: Average allocation times for the contiguous allocation strategies (TBL, TFF) and the exponential side lengths distribution in an 8 × 8 × 8 mesh.

In figures 9 and 10, the average turnaround time of jobs is plotted against the system load in an 8 × 8 × 8 mesh for both job size distributions. It can be seen in these figures that the average turnaround times of our strategy TBL are very close to those of TFF. However, the time complexity of TBL is in O(m²), whereas it is in O(W × D × H) for TFF [9]. Also, the complexity of the TBL strategy does not grow with the size of the mesh as in the TFF strategy. It can also be seen in the figures that TBL is substantially superior to the strategies BL and FF without rotation because it is highly likely that a suitable contiguous sub-mesh is available for allocation to a job when request rotation is allowed. In figure 9, for example, the average turnaround times of TBL are 0.47, 0.53, and 0.56 of the average turnaround times of FF and BL under the arrival rates 3.8, 4.2, and 4.6 jobs/time unit, respectively.

[Figure: average turnaround time plotted against load (0.2 to 4.6) for BL, FF, TBL and TFF.]
Fig. 9: Average turnaround time vs. system load for the contiguous allocation strategies (BL, FF, TBL, TFF) and the uniform side lengths distribution in an 8 × 8 × 8 mesh.

[Figure: average turnaround time plotted against load (1 to 9.8) for BL, FF, TBL and TFF.]
Fig. 10: Average turnaround time vs. system load for the contiguous allocation strategies (BL, FF, TBL, TFF) and the exponential side lengths distribution in an 8 × 8 × 8 mesh.

In figures 11 and 12, the mean system utilization is plotted against the system load for both job size distributions. The simulation results show that all strategies have the same utilization under low loads. For higher loads, the utilization of the strategies that use the rotation of job requests is better than that of the strategies that do not. For both job size distributions, the contiguous allocation strategies that use the rotation of job requests achieve system utilization of 47% to 49%, but the contiguous allocation strategies that do not use the rotation of job requests cannot exceed 36%.
In figures 13 and 14, the average number of allocated sub-meshes (m) in TBL is plotted against the system load for both job size distributions. As expected, the average number of allocated sub-meshes is largest when the side lengths follow the exponential distribution. This is because the average sizes of jobs are smallest in this case. Moreover, and as clarified in

the allocation and deallocation time complexity section, the average number of allocated sub-meshes (m) is much lower than n for both job size distributions. Moreover, experiments for larger mesh sizes show that m does not grow with n for the job size distributions considered in this paper, as expected.

[Figure: utilization plotted against load (0.2 to 4.6) for BL, FF, TBL and TFF.]
Fig. 11: Mean system utilization for the contiguous allocation strategies (BL, FF, TBL, TFF) and the uniform side lengths distribution in an 8 × 8 × 8 mesh.

[Figure: utilization plotted against load (1 to 9.8) for BL, FF, TBL and TFF.]
Fig. 12: Mean system utilization for the contiguous allocation strategies (BL, FF, TBL, TFF) and the exponential side lengths distribution in an 8 × 8 × 8 mesh.

[Figure: average number of allocated sub-meshes plotted against load (0.2 to 11.4) for TBL in an 8x8x8 mesh and a 16x16x16 mesh.]
Fig. 13: Average number of allocated sub-meshes (m) in TBL under the uniform side lengths distribution in a 16 × 16 × 16 mesh and an 8 × 8 × 8 mesh.

[Figure: average number of allocated sub-meshes plotted against load (1 to 21.8) for TBL in an 8x8x8 mesh and a 16x16x16 mesh.]
Fig. 14: Average number of allocated sub-meshes (m) in TBL under the exponential side lengths distribution in a 16 × 16 × 16 mesh and an 8 × 8 × 8 mesh.

VII. CONCLUSION AND FUTURE DIRECTIONS

The existing contiguous allocation strategies for the 3D mesh achieve complete sub-mesh recognition capability, but with high allocation overhead. This study has suggested a fast and efficient contiguous allocation strategy which overcomes the limitations of the existing strategies. To this end, we have proposed a new efficient contiguous allocation strategy, namely the Turning Busy List (TBL) strategy. The TBL strategy can maintain good performance with little allocation overhead. The performance of the TBL strategy has been compared against the existing contiguous allocation strategies. Simulation results have shown that the performance of the proposed allocation strategy TBL is at least as good as that of the previously proposed allocation strategies. Moreover, the mean measured allocation time of the TBL strategy is much lower than that of the previous strategies. We also evaluated switching the request orientation. The results have revealed that the rotation of the job request improves the performance of the contiguous allocation strategies. Moreover, TBL can be efficient because it is implemented using a busy list approach. This approach can be expected to be efficient in practice because job sizes typically grow with the size of the mesh. The length of the busy list can be expected to be small, even when the size of the mesh grows.
As a continuation of this research in the future, it would be interesting to evaluate the performance of the contiguous allocation strategies with different scheduling approaches. It would also be interesting to assess the proposed allocation strategy in other common multicomputer networks, such as torus networks. Another possible line for future research is to implement our strategy based on real workload

traces from different parallel machines and compare it with our results
obtained by means of simulations.

REFERENCES

[1] “Blue Gene Project”, http://www.research.ibm.com/bluegene/index.html, 2005.
[2] A. Al-Dubai, M. Ould-Khaoua, L. M. Mackenzie, An efficient path-based multicast algorithm for mesh networks, Proc. 17th Int. Parallel and Distributed Processing Symposium (IPDPS), Nice, France, IEEE Computer Society Press, pp. 283-290, 22-26 April, 2003.
[3] B.-S. Yoo, C.-R. Das, A Fast and Efficient Processor Allocation Scheme for Mesh-Connected Multicomputers, IEEE Transactions on Parallel & Distributed Systems, vol. 51, no. 1, pp. 46-60, 2002.
[4] C. Peterson, J. Sutton, P. Wiley, iWARP: a 100-MOPS VLIW microprocessor for multicomputers, IEEE Micro, vol. 11, no. 13, 1991.
[5] Cray, Cray XT3 Datasheet, 2004.
[6] E. Anderson, J. Brooks, C. Grassl, S. Scott, Performance of the Cray T3E multiprocessor, Proc. Supercomputing Conference, pp. 19, 1997.
[7] G.-M. Chiu, S.-K. Chen, An efficient submesh allocation scheme for two-dimensional meshes with little overhead, IEEE Transactions on Parallel & Distributed Systems, vol. 10, no. 5, pp. 471-486, 1999.
[8] H. Choo, S. Yoo, H.-Y. Youn, Processor scheduling and allocation for 3D torus multicomputer systems, IEEE Transactions on Parallel & Distributed Systems, vol. 11, no. 5, pp. 475-484, 2000.
[9] I. Ababneh, An Efficient Free-list Submesh Allocation Scheme for two-dimensional mesh-connected multicomputers, Journal of Systems and Software, vol. 79, no. 8, pp. 1168-1179, August 2006.
[10] I. Ababneh, Job scheduling and contiguous processor allocation for three-dimensional mesh multicomputers, AMSE Advances in Modelling & Analysis, vol. 6, no. 4, pp. 43-58, 2001.
[11] I. Ababneh, S. Bani Mohammad, Noncontiguous Processor Allocation for Three-Dimensional Mesh Multicomputers, AMSE Advances in Modelling & Analysis, vol. 8, no. 2, pp. 51-63, 2003.
[12] Intel Corporation, A Touchstone DELTA system description, 1991.
[13] K. Windisch, J. V. Miller, and V. Lo, ProcSimity: an experimental tool for processor allocation and scheduling in highly parallel systems, Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95), Washington, DC, USA, IEEE Computer Society Press, pp. 414-421, 1995.
[14] K.-H. Seo, Fragmentation-Efficient Node Allocation Algorithm in 2D Mesh-Connected Systems, Proceedings of the 8th International Symposium on Parallel Architecture, Algorithms and Networks (ISPAN’05), IEEE Computer Society Press, pp. 318-323, 7-9 December, 2005.
[15] K.-H. Seo, S.-C. Kim, Improving system performance in contiguous processor allocation for mesh-connected parallel systems, The Journal of Systems and Software, vol. 67, no. 1, pp. 45-54, 2003.
[16] M. Blumrich, D. Chen, P. Coteus, A. Gara, M. Giampapa, P. Heidelberger, S. Singh, B. Steinmacher-Burow, T. Takken and P. Vranas, Design and Analysis of the BlueGene/L Torus Interconnection Network, IBM Research Report RC23025, IBM Research Division, Thomas J. Watson Research Center, Dec. 3, 2003.
[17] ProcSimity V4.3 User’s Manual, University of Oregon, 1997.
[18] R.E. Kessler, J.L. Schwarzmeier, Cray T3D: a new dimension for Cray research, Proc. CompCon, pp. 176-182, 1993.
[19] S. Bani-Mohammad, M. Ould-Khaoua, I. Ababneh, and L. Mackenzie, Non-contiguous Processor Allocation Strategy for 2D Mesh Connected Multicomputers Based on Sub-meshes Available for Allocation, Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS’06), Minneapolis, Minnesota, USA, IEEE Computer Society Press, vol. 2, pp. 41-48, 2006.
[20] V. Lo, K. Windisch, W. Liu, and B. Nitzberg, Non-contiguous processor allocation algorithms for mesh-connected multicomputers, IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 7, pp. 712-726, 1997.
[21] W. Athas, C. Seitz, Multicomputers: message-passing concurrent computers, IEEE Computer, vol. 21, no. 8, pp. 9-24, 1988.
[22] W. Mao, J. Chen, W. Watson, Efficient Subtorus Processor Allocation in a Multi-Dimensional Torus, Proceedings of the 8th International Conference on High-Performance Computing in Asia-Pacific Region (HPCASIA’05), IEEE Computer Society Press, pp. 53-60, 30 November - 3 December, 2005.
[23] W. Qiao, L. Ni, Efficient processor allocation for 3D tori, Technical Report, Michigan State University, East Lansing, MI, 48824-1027, 1994.
[24] Y. Zhu, Efficient processor allocation strategies for mesh-connected parallel computers, Journal of Parallel and Distributed Computing, vol. 16, no. 4, pp. 328-337, 1992.
[25] Y.-J. Tsai, P. McKinley, An extended dominating node approach to broadcast and global combine in multiport wormhole-routed mesh networks, IEEE Transactions on Parallel & Distributed Systems, vol. 8, no. 1, pp. 41-58, 1997.

Ubiquitous Computing and Communication Journal
