Library For Matrix Multiplication-Based Data Manipulation On A Mesh-Of-Tori Architecture
Abstract—Recent developments in computational sciences, involving both hardware and software, allow reflection on the way that computers of the future will be assembled and software for them written. In this contribution we combine recent results concerning possible designs of future processors, the ways they will be combined to build scalable (super)computers, and generalized matrix multiplication. As a result, we propose a novel library of routines, based on generalized matrix multiplication, that facilitates (matrix / image) manipulations.

I. INTRODUCTION

SINCE the early 1990's, one of the important factors limiting computer performance has been the ability to feed data to the increasingly faster processors. Already in 1994, the authors of [1] discussed problems caused by the increasing gap between the speeds of memory and processors. Their work was followed, among others, by Burger and Goodman ([2]), who were concerned with the limitations imposed by memory bandwidth on the development of computer systems. In 2002, P. Machanick presented an interesting survey ([3]) in which he considered the combined effects of the doubling of processor speed (predicted by Moore's Law) and the 7% increase in memory speed, compared on the same time scale.

The initial approach to address this problem was the introduction of a memory hierarchy for data reuse (see, for instance, [4]). In addition to the registers, CPUs were equipped with small, fast cache memory. As a result, systems with 4 layers of latency were developed. Data could be replicated and reside in the (1) register, (2) cache, (3) main memory, and (4) external memory. Later on, while the "speed gap" between processors and memory continued to widen, multi-processor computers gained popularity. As a result, systems with an increasing number of latencies have been built. On the large scale, a data element can be replicated and reside in (with each subsequent layer meaning an increasing / different latency of access): (1) register, (2) level 1 cache, (3) level 2 cache, (4) level 3 cache, (5) main memory of a (multi-core / multi-processor) computer, (6) memory of another networked computer (node in the system), (7) external device. Obviously, such a complex structure of a computer system resulted in the need to write complex codes to use it efficiently. Data blocking and reuse became the method of choice for the solution of large computational problems. This method was applied not only to multi-processor computers, but also to computers with processors consisting of multiple computational units (e.g. cores). In this context, let us note that as the number of computational units per processor systematically increases, the inflation-adjusted price of a processor remains the same. As a result, the price per computational operation continues to decrease (see also [5]).

While a number of approaches have been proposed to deal with the memory wall problem (e.g. see the discussion of 3D memory stacking in [6]), they seem to only slow down the process, rather than introduce a radical solution. Note that the introduction of multicore processors resulted in (at least temporarily) sustaining Moore's Law and thus further widening the performance gap (see [3], [7]). Here, it is also worth mentioning a recent approach to reduce memory contention via data encoding (see [8]). The idea is to allow hardware-based encoding and decoding of data to reduce its size. Since this proposal is brand new, time will tell how successful it will be. Note, however, that this proposal is also in line with the general observation that "computational hardware" (i.e. encoders and decoders) is cheap, and should be used to reduce the volume of data moved between processor(s) and memory.

Let us now consider one of the important areas of scientific computing – computational linear algebra. Obviously, here the basic object is a matrix. While one-dimensional matrices (vectors) are indispensable, the fundamental object of the majority of algorithms is a 2D, or a 3D, matrix. Upon reflection, it is easy to realize that there exists a conflict between the structure of a matrix and the way it is stored and processed in most computers. To put it simply, 2D matrices are rectangular (while 3D matrices are cuboidal). However, they are stored in one-dimensional memory (as a long vector). Furthermore, in most cases, they are processed in a vector-oriented fashion (except for the SIMD-style array processors). Finally, they are sent back to be stored in the one-dimensional memory. In other words, the data arrangement natural for the matrix is neither preserved nor taken advantage of, which puts not only a practical, but also a theoretical, limit on the performance of linear algebra codes (for more details, see [9]).

Interestingly, a similar disregard for the natural arrangement of data also concerns many "sensor systems." Here, the input image, which is square or rectangular, is read out serially,
pixel-by-pixel, and is sent to the CPU for processing. This means that the transfer of pixels destroys the 2D integrity of the data (an image, or a frame, ceases to exist in its natural layout). Separately, such a transfer introduces latency caused by serial communication. Here, the need to transfer large data streams to the processor may prohibit their use in applications which require (near) real-time response [10]. Note that large data streams exist not only in scientific applications. For instance, modern digital cameras capture images consisting of 22.3 × 10^6 pixels (Canon EOS 5D Mark III [11]) or even 36.3 × 10^6 pixels (Nikon D800 [12]). What is even more impressive, a recently introduced Nokia phone (Nokia 808 PureView [13]) has a camera capturing 41 × 10^6 pixels.

Among scientific / commercial sensor arrays, the largest seems to be the 2-D pixel matrix detector installed in the Large Hadron Collider at CERN [14]. It has 10^9 sensor cells. A similar number of sensors would be required in a CT scanner array of size approximately 1 m^2, with about 50K pixels per 1 cm^2. In devices of this size, for (near) real-time image and video processing, as well as 3-D reconstruction, it would be natural to load data directly from the sensors to the processing elements (for immediate processing). Thus, a focal-plane I/O, which can map the pixels of an image (or a video frame) directly into the array of processors, allowing data processing to be carried out immediately, is highly desired. The computational elements could store the sensor information (e.g. a single pixel, or an array of pixels) directly in their registers (or the local memory of a processing unit). Such an architecture has two potential advantages. First, cost can be reduced because there is no need for memory buses or a complicated layout. Second, speed can be improved as the integrity of the input data is not destroyed by serial communication. As a result, processing can start as soon as the data is available (e.g. in the registers). Note that proposals for similar hardware architectures have been outlined in [15], [16], [17]. However, all previously proposed focal-plane array processors were envisioned with a mesh-based interconnect, which is good for local data reuse (convolution-like simple algorithms), but is not suited to support global data reuse (matrix-multiplication-based complex algorithms).

Separately, it has been established that computational linear algebra can be extended, through the theory of algebraic semirings, to subsume a large class of problems (including, e.g., a number of well known graph algorithms). The theoretical mechanism is named the Algebraic Path Problem (APP). As shown, for instance, in [18], there is an interesting link between the arithmetical fused multiply and add (FMA) operation, which is supported in modern hardware, and the FMAs originating from other semirings, which are not. Specifically, if it were possible to modify the standard FMA to include operations from other semirings (in a way similar to the proposals of KALRAY; [19]), and thus develop a generalized FMA, it could be possible to speed up a large class of APP problems at least 2 times ([18]).
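To make the generalized FMA concrete, the following minimal Python sketch (ours, for illustration only; the semiring names are not taken from the cited works) shows the single scalar operation, d ← d ⊕ (a ⊗ b), that such a unit would perform under three different semirings:

# A scalar FMA parameterized by a semiring's ("addition", "multiplication")
# pair. (+, *) is the standard FMA; (min, +) and (max, *) are the
# APP-style variants discussed in the text.
STANDARD = (lambda x, y: x + y, lambda x, y: x * y)
MIN_PLUS = (min, lambda x, y: x + y)   # e.g. shortest paths
MAX_TIMES = (max, lambda x, y: x * y)  # e.g. most-reliable paths

def fma(semiring, d, a, b):
    add, mul = semiring
    return add(d, mul(a, b))

print(fma(STANDARD, 1.0, 2.0, 3.0))   # 1 + 2*3 = 7.0
print(fma(MIN_PLUS, 7.0, 2.0, 3.0))   # min(7, 2+3) = 5.0
print(fma(MAX_TIMES, 0.3, 0.9, 0.5))  # max(0.3, 0.45) = 0.45

A hardware EG FMA would keep the per-semiring constants in its registers; in this sketch the pair of operations plays that role.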
Let us now assume the existence of a computational unit that satisfies the above requirements: (1) it accepts input from the sensor(s) and transfers it directly to its operational registers / local memory; (2) it is capable of generalized FMA operations. The latter requirement means that such an FMA should store (in its registers) the constants needed to efficiently perform FMA operations originating from various semirings. Let us name it the extended generalized FMA; EG FMA. Recall that the cost of computational units (of all types) is systematically decreasing ([5]). Therefore, the cost of the EG FMA unit should not be much higher than that of the standard FMAs found in today's processors. Hence, it is easy to imagine m(b)illions of them "purchased" for a reasonable price. As stated above, such EG FMAs should be connected into a square array that will match the shape of the input data. Let us now describe how such a system can be built.

II. MESH-OF-TORI INTERCONNECTION TOPOLOGY

Since the early 1980's, a number of topologies for supercomputer systems have been proposed. Let us omit the unscalable approaches, like a bus, a tree, or a star. The more interesting topologies (from the 1980's and 1990's) were:
• hypercube – scaled up to 64000+ processors in the Connection Machine CM-1,
• mesh – scaled up to 4000 processors in the Intel Paragon,
• processor array – scaled up to 16000+ processors in the MasPar computer,
• rings of rings – scaled up to 1000+ processors in the Kendall Square KSR-1 machines,
• torus – scaled up to 2048 units in the Cray T3D.
However, all of these topologies suffered from the fact that at least some of the elements were reachable with a different latency than the others. This means that algorithms implemented on such machines would have to be asynchronous, which works well, for instance, for Ising-model algorithms similar to [20], but is not acceptable for a large set of computational problems. Otherwise, extra latency had to be introduced by the need to wait for information to be propagated across the system.

To overcome this problem, a new mesh-of-tori (MoTor) multiprocessor system topology has recently been proposed ([21], [22]). The fundamental (indivisible) unit of the MoTor system is a µ-Cell. The µ-Cell consists of four computational units connected into a 2 × 2 doubly folded torus (see Figure 1). Logically, an individual µ-Cell is surrounded by so-called membranes that allow it to be combined into larger elements through the process of cell fusion. Obviously, collections of µ-Cells can be split into smaller structures through cell division. In Figure 1, we see a total of 9 µ-Cells: 4 of them logically fused into a single macro-µ-Cell (combined into a 2 × 2 doubly folded torus), and 5 separate (individual) µ-Cells. Furthermore, in Figure 2 we observe all nine µ-Cells combined into a single system (a 3 × 3 doubly folded torus). Observe that, when the 2 × 2 (or 3 × 3) µ-Cells are logically fused (or divided), the newly formed structure remains a doubly folded torus. In this way, it can be postulated that the single µ-Cell represents the "image" of the whole system. While in earlier publications (e.g. [22], [23], [24], [25])
Figure 1. 9 µ-Cells fused into a single 2 × 2 "system," and 5 separate µ-Cells

Figure 2. 9 µ-Cells fused into a single 6 × 6 EG FMA system
the computational units were mostly treated as "theoretical entities," in the context of this paper we assume that each one of them is the EG FMA described above. However, analysis of the cell connectivity in Figure 1 shows that the model EG FMA proposed in the previous section has to be complemented by four interconnects that allow construction of the MoTor system. Therefore, from here on, we will understand the EG FMA in this way. Furthermore, we will keep in mind that the MoTor architecture is built from indivisible µ-Cells, each consisting of four EG FMAs interconnected into a doubly folded torus.

Let us now observe that the proposed MoTor topology has a restriction similar to that of the array processors from the early 1990's. To keep its favorable properties, the system must be square. While this was considered an important negative (flexibility limiting) factor in the past, this is no longer the case. When the first array processors were built and used, arithmetical operations and memory were "expensive." Therefore, it was necessary to avoid performing "unnecessary" operations (and to maximally reduce memory usage). Today, when a GFlops costs about 50 cents (see [5]) and this price is systematically dropping, and when laptops come with 8 Gbytes of RAM (while some cell phones come with as much as 64 Gbytes of flash memory on a card), it is data movement / access / copying that is "expensive" (see also [26]). Therefore, when matrices (images) are rectangular (rather than square), it is reasonable to assume that one could just pad them out and treat them as square. Obviously, since the µ-Cell is the single indivisible element of the MoTor system, if the matrix is of size N × N then N has to be even.
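As a simple illustration of such padding, the following Python / numpy sketch (ours; the function name is hypothetical) pads a rectangular matrix with the semiring's zero element so that it becomes square, with an even dimension:

import numpy as np

def pad_to_even_square(a, zero=0.0):
    # round the larger dimension up to an even value (mu-Cells are 2x2)
    n = max(a.shape)
    n += n % 2
    out = np.full((n, n), zero, dtype=a.dtype)
    out[:a.shape[0], :a.shape[1]] = a
    return out

a = np.arange(15.0).reshape(3, 5)
print(pad_to_even_square(a).shape)  # (6, 6)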
Observe that there are two sources of inspiration for the MoTor system: (i) matrix computations, and (ii) processing data from, broadly understood, sensor arrays (e.g. images). Furthermore, we have stated that the extended generalized FMA can contain a certain number of data registers to store (a) the needed scalar elements originating from various semirings, (b) elements of special matrices needed for matrix transformations (see below), as well as (c) the data that the FMA is to operate on. However, we also consider the possibility that each FMA may have a "local memory" allowing it to process "blocks of data." This idea is based on the following insights. First, if we define a pixel as "the smallest single component of a digital image" (see [27]), then the data related to a single pixel is very likely to be no larger than a single 24-bit number. Second, in early 2013 the largest number of FMA units combined in a single computer system was 5.2 × 10^6. This means that, if there were a one-to-one correspondence between the number of FMA units and the number of "sensed pixels," then the system could process a stream of data from a 5.2 Megapixel input device (or could process a matrix of size N ≃ 2200).

Let us now consider the development of a MoTor-based system. In the initial works, e.g. in [22], links between cells were conceptualized as programmable abstract links (µ-Cells were surrounded by logical membranes that could be fused or divided as needed, to match the size of the problem). Obviously, in an actual system, the abstract links and membranes could be realized logically, while the whole system would have to be hard-wired to form an actual MoTor system of a specific size. Therefore, to build a large system with M^2 µ-Cells (recall the assumption that the mesh will constitute a square array), it can be expected that groups of them will be combined into separate "processors," similarly to the multicore / multi-FMA processors of today. As far as cell fusion and division are concerned, it will be possible to assemble sub-system(s) of a needed size by logically splitting and/or fusing an appropriate
number of cells within the MoTor system. However, it is worth stressing that, while the theoretical communication latency across the mesh-of-tori system is uniform, this may not be the case when the system is assembled from processors constituting logical (and in some sense also physical) macro-µ-Cells. In this case it may be that the communication within a processor (physical macro-µ-Cell) is slightly faster than between processors. Therefore, the most natural split would be one that involves complete macro-µ-Cells (processors). However, let us stress that the design of the mesh-of-tori topology does not distinguish between the connections that are "within a chip" and "between the chips." Therefore, the communication model used in the algorithms described in [22], [23], and considered in subsequent sections, is independent of the hardware configuration.

Finally, let us consider the input from the sensor array (or the sending of a matrix) into the mesh-of-tori type system. As shown in [22], any input that is in the canonical (square matrix) arrangement is not organized in the way that is needed for matrix processing on a (doubly folded) torus. However, adjusting the data organization (e.g. to complete a 2D N × N DFT) requires 2 matrix multiplications (left and right multiplication by appropriate transformation matrices, see below). These two multiplications require 2N time steps on a MoTor architecture. Next, after the processing is completed, the canonical arrangement can be restored by reversing the original transformation. Here, again, two multiplications are needed and their cost is 2N time steps. For the details about the needed transformations and their realization as a triple matrix multiplication, see [22].

III. DATA MANIPULATIONS IN A MoTor SYSTEM

Let us summarize the points made thus far. First, we have refreshed the arguments that there is an unfulfilled need for computer systems that (1) have focal-plane I/O that, among others, can feed data from sensors directly to the operand registers / memory of extended FMA units (generalized to be capable of performing arithmetical operations originating from different semirings), (2) operate on matrices treating them as square (or cuboidal, e.g. tensor) objects, and (3) are developed in such a way that (i) minimizes data movement / access / copying / replication, (ii) maximizes data reuse, and (iii) is aware of the fact that arithmetical operations are cheap in comparison with any form of data "movement." Such systems are needed not only to process data originating from the Large Hadron Collider, but also for everyday electronics. Here, it is worth mentioning that virtual reality and 3D media (part of the new enterprises, the so-called creative industries) are in the latter category, illustrating where computational power is going to be needed at an increasing rate, beyond the classic domains of scientific computing.

Second, we have briefly outlined the key features of the recently proposed mesh-of-tori topology, which has some favorable features and naturally fits with the proposed EG FMAs. Furthermore, we have pointed to some issues that are likely to be encountered in the development of (large-scale) MoTor based systems. Let us now assume that the just proposed MoTor computer systems have been built. In ([22], [23], [25]) it was shown that a large number of matrix operations / manipulations can be unified through the use of a generalized matrix multiply-and-update (MMU) operation. However, in this context it is important to realize that one more problem we are facing today is the increasing complexity of the codes being developed to take advantage of current computer architectures (see, for instance, [28], [26]). This being the case, a return to simplicity is needed. In the remaining sections of this paper we will illustrate how a MATLAB / MATHEMATICA style (meta-level) approach can be used to build a library of matrix manipulation operations that, among others, can be implemented on the proposed MoTor computer architecture. This library will be uniformly based on the "fused" matrix multiply-and-update (MMU) operation.

A. Basic operations

To proceed, we will use the generalized matrix multiply and update operation in the form (as elaborated in [28]):

C ← MMU[⊗, ⊕](A, B, C) : C ← C ⊕ A^{N/T} ⊗ B^{N/T}.  (1)

Here, A, B and C are square matrices of (even) size N (recall the reasons, presented above, for restricting the class of matrices); the ⊗, ⊕ operations originate from a scalar semiring; and N/T specifies whether a given matrix is to be treated as being in canonical (N) or transposed (T) form, respectively.
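As a reference point, equation (1) can be rendered as the following straightforward Python sketch (ours; a naive triple loop standing in for the MoTor hardware schedule), which we will also reuse in the illustrations below:

import numpy as np

def mmu(otimes, oplus, A, B, C, ta="N", tb="N"):
    # C <- C (+) op(A) (x) op(B); op is identity ("N") or transpose ("T")
    A = A.T if ta == "T" else A
    B = B.T if tb == "T" else B
    n = A.shape[0]
    C = C.copy()
    for i in range(n):
        for j in range(n):
            acc = C[i, j]
            for k in range(n):
                acc = oplus(acc, otimes(A[i, k], B[k, j]))
            C[i, j] = acc
    return C

# with standard arithmetic, MMU reduces to C + A*B:
A, B = np.random.rand(4, 4), np.random.rand(4, 4)
C = mmu(lambda x, y: x * y, lambda x, y: x + y, A, B, np.zeros((4, 4)))
assert np.allclose(C, A @ B)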
In what follows, we present a collection of matrix / image manipulations that can also be achieved through matrix multiplication. While they can be implemented using any matrix multiplication algorithm, we use this as a springboard to further elaborate the idea of the MoTor system, and a library of routines that can complement it. Note that, for simplicity of discussion (and due to the lack of space), in what follows we only discuss the special case when a single EG FMA stores scalar data elements. However, as discussed in [22], [23], [25], [24], all matrix manipulations can be naturally extended to blocked algorithms. Therefore, we actually do not contradict our earlier assumption that each EG FMA holds a block of data (e.g. a pixel array, or a block of a matrix).

1) Reordering for the mesh-of-tori processing: Let us start from the above mentioned fact that the canonical form of the matrix (image) fed to the MoTor system through the focal-plane I/O is not correct for further (parallel) processing on a doubly folded torus. As shown in [23], the proper format can be obtained by the corresponding linear transform, through two matrix-matrix multiplications. Specifically, this is the matrix product of the form M ← R × A × R^T, where A is the original / input (N × N) matrix that is to be transformed, M is the matrix in the format necessary for further processing on the mesh-of-tori system, and R is the format rearranging matrix (for details of the structure of the R matrix, consult [23]). Taking into account the implementation of the generalized MMU proposed in [28], the needed transformation consists of two MMU operations: a left multiplication by R, followed by a right multiplication by R^T.
Matrices LBCAST and RBCAST are zero matrices with ones in column 1 and row 1, respectively. Furthermore, matrix D is a copy of the 0̄ matrix, which is going to be used in both multiplications. In the first step, the selected element is sent to the EG FMA located in position (1, 1) of the MoTor system and stored in the operand register corresponding to D(1, 1). Next, the MMU operation is invoked twice (B ← LBCAST ∗ D ∗ RBCAST) within an ElBcast function, which has the form ElBcast(element), where element specifies the element that should be replicated. As a result, in 2N time steps, the selected element is replicated to all elements of matrix B and thus made available across the MoTor system. Finally, the D(1, 1) element is zeroed. Note that, while matrices LBCAST and RBCAST have to be reinstantiated, the matrix D, being a copy of the zero matrix, remains unchanged during cell fusion / division. The possibility of using the actual zero matrix instead of matrix D has to be further evaluated. Obviously, matrices LBCAST and RBCAST are available only to the implementer, while being hidden from the user.
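In plain numpy (which matches the standard semiring), the ElBcast mechanics can be sketched as follows (ours, for illustration; indices are 0-based where the text uses position (1, 1)):

import numpy as np

def el_bcast(element, n):
    LBCAST = np.zeros((n, n)); LBCAST[:, 0] = 1.0  # ones in column 1
    RBCAST = np.zeros((n, n)); RBCAST[0, :] = 1.0  # ones in row 1
    D = np.zeros((n, n))
    D[0, 0] = element            # step 1: store the element at (1, 1)
    B = LBCAST @ D @ RBCAST      # two multiplications: B <- LBCAST*D*RBCAST
    D[0, 0] = 0.0                # finally, zero the D(1, 1) element
    return B

print(el_bcast(5.0, 4))          # every entry of the 4x4 result equals 5.0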
4) Global reduction and broadcast: The next matrix operation that can be formulated in terms of matrix multiplications is the global reduction and broadcast. As seen in [23], when standard arithmetic is applied, and matrix A is multiplied from both sides by a matrix of ones (a matrix with all elements equal to one; let us name it ONES), then the resulting matrix will have all of its elements equal to the sum of all elements of A. On a mesh-of-tori system, this can be implemented in place, in 2N time steps, when the matrix ONES is available (pre-loaded) in all EG FMAs of the system.

However, recall that our approach is based on the use of the generalized MMU. Thanks to this, we can apply operations originating from different semirings. Here, particularly interesting would be semirings in which the addition and multiplication operations are defined as (×, max) or (×, min). In this case, the "generalized reduction and broadcast" operation is going to consist of two generalized MMUs (represented in the notation of equation (1)):

MMU[⊗, ⊕](A, ONES, TEMP) : TEMP ← TEMP ⊕ A ⊗ ONES;
MMU[⊗, ⊕](ONES, TEMP, RESULT) : RESULT ← RESULT ⊕ ONES ⊗ TEMP.

Here, the operations [⊗, ⊕] are defined in an appropriate semiring, while matrices TEMP and RESULT are initialized as copies of the 0̄ matrix (the zero matrix for a given semiring). Finally, ONES is a matrix of all ones, where the "one" element originates from a given scalar semiring (its element 1̄).

Under these assumptions it is easy to see that we can define at least three functions that have the same general form, while being based on different semirings: AddBcast – realizing summation of all elements in a matrix and broadcasting the result to all processors (based on standard arithmetic); MaxBcast – finding the largest element in a matrix and broadcasting it to all processors (based on the (×, max) semiring); and MinBcast – finding the smallest element in the matrix and broadcasting it to all processors (based on the (×, min) semiring). Each of these functions will be completed in place, in 2N time steps, through 2 generalized matrix multiplications. Observe that, for all practical purposes, matrix ONES does not have to be instantiated. It consists of 1̄ elements that, according to our assumptions, are already stored in the operand registers of the EG FMAs. Finally, note that (regardless of the way it will finally be instantiated in the MoTor system) matrix ONES is independent of the size of the macro-µ-Cell and remains unchanged and available after cell fusion / division operations. As previously, all information about the very "existence" of matrices ONES and TEMP, and the initialization of matrices TEMP and RESULT, is going to be hidden from the user.
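Using the mmu() sketch from Section III-A, the three functions can be rendered as follows (ours; only the 0̄ and 1̄ elements change per semiring, and the (×, max) / (×, min) variants assume non-negative data):

import numpy as np

def reduce_bcast(otimes, oplus, zero, one, A):
    n = A.shape[0]
    ONES = np.full((n, n), one)
    TEMP = np.full((n, n), zero)
    RESULT = np.full((n, n), zero)
    TEMP = mmu(otimes, oplus, A, ONES, TEMP)       # per-row reduction
    return mmu(otimes, oplus, ONES, TEMP, RESULT)  # global reduction + broadcast

mul = lambda x, y: x * y
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(reduce_bcast(mul, lambda x, y: x + y, 0.0, 1.0, A)[0, 0])  # AddBcast: 10.0
print(reduce_bcast(mul, max, 0.0, 1.0, A)[0, 0])                 # MaxBcast: 4.0
print(reduce_bcast(mul, min, np.inf, 1.0, A)[0, 0])              # MinBcast: 1.0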
B. Matrix (image) manipulations

Let us now consider three simple matrix manipulations that can be achieved with the help of matrix multiplication. While they are presented as matrix operations, their actual value can be seen when the underlying matrices represent images (e.g. each matrix element represents a pixel, or a block of pixels).

1) Upside-down swap: The image (matrix) upside-down swap can be achieved by multiplying the matrix from the left hand side by the SWAP matrix (the anti-diagonal matrix of ones), which has the following form:

       [ 0 0 1 ]
SWAP = [ 0 1 0 ]
       [ 1 0 0 ]

Obviously, we assume that on the MoTor system, matrix SWAP will be instantiated when the macro-µ-Cell(s) are created (the appropriate elements will be stored in separate operand registers of the EG FMAs). However, it will not be made available to the user. This means that the upside-down swap will be achieved by calling a UDswap(A) function, and completed in place, in N time steps. Matrix SWAP will have to be re-initialized after each µ-Cell fusion or division.

2) Left-right swap: The left-right image (matrix) swap can be achieved in the same way as the upside-down swap, with the only difference being that the image matrix A is going to be multiplied by the SWAP matrix from the right hand side. Therefore, on the MoTor system, the left-right swap will be completed in place, in N time steps, by calling the LRswap(A) function. All the remaining comments concerning the SWAP matrix, presented above, remain unchanged.

3) Rotation: Interestingly, combining the two swaps into a single operation (multiplication of a given image / matrix A from left and right by the matrix SWAP) results in rotation of the matrix / image A by 180°. Obviously, from the above it follows that on the MoTor system this operation can be completed in place, in 2N steps, using two matrix multiplications, by calling an appropriately defined Rotate(A) function.
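The three manipulations are easy to verify in numpy (our sketch; numpy's standard product stands in for the in-place MoTor multiplications, with SWAP the anti-diagonal matrix of ones of matching size):

import numpy as np

def swap_matrix(n):
    return np.fliplr(np.eye(n))                    # ones on the anti-diagonal

def ud_swap(A): return swap_matrix(len(A)) @ A     # SWAP * A: flips rows
def lr_swap(A): return A @ swap_matrix(len(A))     # A * SWAP: flips columns
def rotate(A):                                     # SWAP * A * SWAP: 180 degrees
    S = swap_matrix(len(A))
    return S @ A @ S

A = np.arange(16.0).reshape(4, 4)
assert np.array_equal(ud_swap(A), np.flipud(A))
assert np.array_equal(lr_swap(A), np.fliplr(A))
assert np.array_equal(rotate(A), np.rot90(A, 2))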
IV. TOWARDS A LIBRARY OF MATRIX-MULTIPLICATION-BASED DATA MANIPULATIONS

Let us now summarize the above considerations from the point of view of the development of a library of operations that can be performed on matrices / images through generalized matrix
Table I
SUMMARY OF FUNCTIONS PROPOSED FOR THE LIBRARY
multiplication. In Table I we combine the proposals presented thus far. There, we present the functionality, the proposed function name, the "special matrices" that have to be instantiated within the MoTor system to complete the operations, and information on whether these matrices have to be reinitialized after a µ-Cell fusion / splitting operation. Observe that, while the first two functions are directly connected with the MoTor architecture, the remaining ones can be seen as "system independent." In other words, they can be implemented for any computer architecture, taking full advantage of the underlying architecture.

This latter observation deserves further attention, and some points have to be made explicit. Only the transformations from the canonical to the MoTor format and back are MoTor architecture specific. The remaining functions are system independent. While the above considerations have the MoTor architecture in mind, the proposed functions use only matrix multiplication and thus can be implemented to run on any computer architecture, using its best-of-breed matrix multiplication algorithm. This being the case, and taking into account the discussion presented in Section I, it may be desirable to implement the functions from Table I on existing computers, using state-of-the-art matrix multiplication algorithms, and consider their efficiency.
A. Object oriented realization

Let us now recall that our main goal is to consider the functions from Table I in the context of the MoTor architecture. However, we also see them as a method of simplifying code writing (by introducing matrix operations represented in a style similar to that found in MATLAB / MATHEMATICA). This being the case, we assume that there may be multiple ways of implementing these routines, and that they are likely to be vendor / hardware specific. Nevertheless, at the time of writing of this paper, object oriented programming is one of the more popular ways of writing codes in scientific computing and image processing. Furthermore, this means that the possible trial implementations, suggested above, are likely to be tried using this paradigm. This being the case, we have decided to conceptualize the top-level object-oriented representation of the library of routines from Table I. Since different OO languages have slightly different syntax (and semantics), we use a generic notation, distinguishing the information that needs to be made available in the interface and in the main class.

We start from the interface (see also [28]).

/* T - type of matrix element */
interface Matrix_interface {
  public Matrix 0(n) {/* generalized zero matrix */}
  public Matrix I(n) {/* generalized identity matrix */}
  public Matrix operator+  /* generalized A+B */
  public Matrix operator*  /* generalized A*B */
  public Matrix Canonical_to_Motor(A);
    /* reordering for the mesh-of-tori processing */
  public Matrix Motor_to_Canonical(A);
    /* inverse of the reordering for the mesh-of-tori processing */
  public Matrix transpose(A) {/* transposition of matrix A */}
  /* generalized permutation of column / row i and j in matrix A */
  public Matrix Column_Permut(A, i, j);
  public Matrix Row_Permut(A, i, j);
  /* generalized element broadcast */
  public Matrix ElBcast(element);
  public Matrix AddBcast(A);
    /* generalized summation of all elements of A and broadcast */
  /* broadcast the largest element of A */
  public Matrix MaxBcast(A);
  /* broadcast the smallest element of A */
  public Matrix MinBcast(A);
  /* Matrix (Image) Manipulation */
  public Matrix UDswap(A); /* upside-down swap */
  public Matrix LRswap(A); /* left-right swap */
  /* image vertical rotation */
  public Matrix Rotate(A); ...
}

The just defined interface is to be used with the following class Matrix. This class summarizes the proposals outlined above.

class Matrix inherits ScalarSemiring implements Matrix_interface {
  T: type of element;      /* double, single, ... */
  private Matrix R(n);     /* matrix for the MoTor transformation */
  private Matrix ONES(n);  /* matrix of ones */
  private Matrix PERMUT(i, j, n);
    /* identity matrix with interchanged columns i and j */
  private Matrix SWAP(n);  /* anti-diagonal matrix of ones */
  // Methods
  public Matrix 0(n) {/* 0 matrix */}
  public Matrix I(n) {/* identity matrix */}
  public Matrix transpose(A: Matrix) {/* MMU-based transposition of A */}
  public Matrix operator+ {A, B: Matrix} { return MMU(A, I(n), B, a, b) }
  public Matrix operator* {A, B: Matrix} { return MMU(A, B, 0(n), a, b) }
  public Matrix Column_Permut(A, i, j) { return MMU(PERMUT(i, j, n), A, 0(n)) }
  public Matrix Row_Permut(A, i, j) { return MMU(A, PERMUT(i, j, n), 0(n)) }
  ...
}