Week 5. Matrix-Matrix Operations

5.1 Opening Remarks
In order to truly appreciate what the FLAME notation and API bring to the table, it helps to look at a programming
problem that on the surface seems straightforward, but turns out to be trickier than expected. When programming with
indices, coming up with an algorithm turns out to be relatively simple. But, when the goal is to, for example, access
memory in a favorable pattern, finding an appropriate algorithm is sometimes more difficult.
In this launch, you experience this by executing algorithms from last week by hand. Then, you examine how
these algorithms can be implemented with for loops and indices. The constraint that the matrices are symmetric is
then added into the mix. Finally, you are asked to find an algorithm that takes advantage of symmetry in storage yet
accesses the elements of the matrix in a beneficial order. The expectation is that this will be a considerable challenge.
In Figure 5.1 we show Variant 1 for y := Ax + y in FLAME notation and, below it, in Figure 5.2, a more traditional implementation in MATLAB. To understand it easily, we use the convention that the index i keeps track of the current row; in the algorithm expressed with FLAME notation, this row is a_1^T. The index j is then used for the loop that updates
$$ \psi_1 := a_1^T x + \psi_1, $$
which you hopefully recognize as a dot product (or, more precisely, a sapdot) operation.
LAFFPfC/Assignments/Week5/matlab/MatVec1.m

Figure 5.2: Function that computes y := Ax + y, returning the result in vector y_out.
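The idea behind MatVec1.m can be sketched outside of MATLAB as well. The following plain-Python version (the name and the Python setting are ours, not the course's) computes y := Ax + y one row at a time, as a sequence of dot products:

```python
def matvec_by_rows(A, x, y):
    """Compute y := A x + y row by row: for each row i, add the dot
    product of row i of A with x into y[i]. A is a list of rows."""
    m, n = len(A), len(x)
    for i in range(m):
        for j in range(n):  # dot product of row i with x
            y[i] += A[i][j] * x[j]
    return y

# The 3 x 3 example used later in this unit:
print(matvec_by_rows([[1, -1, 2], [-2, 2, 0], [-1, 1, -2]],
                     [2, -1, 1], [3, 1, 0]))  # -> [8, -5, -5]
```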
LAFFPfC/Assignments/Week5/matlab/SymMatVec1.m

Figure 5.3: Functions that compute y := Ax + y, returning the result in vector y_out. On the right, matrix A is assumed to be symmetric and only stored in the lower triangular part of array A.
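To see what exploiting symmetry in storage means concretely, here is a plain-Python sketch of the idea behind SymMatVec1.m (our own illustration, not the course's code): whenever an element in the strictly upper triangular part is needed, its mirror image from the lower triangular part is used instead.

```python
def sym_matvec_lower(A, x, y):
    """y := A x + y where A is symmetric and only its lower triangular
    part may be read; A(i, j) with j > i is fetched as A(j, i)."""
    n = len(x)
    for i in range(n):
        for j in range(n):
            aij = A[i][j] if j <= i else A[j][i]  # mirror across diagonal
            y[i] += aij * x[j]
    return y

# Entries strictly above the diagonal (the 9s) are never read:
print(sym_matvec_lower([[2, 9, 9], [1, 3, 9], [-1, 0, 4]],
                       [1, 1, 1], [0, 0, 0]))  # -> [2, 4, 3]
```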
Homework 5.1.1.2 Download the Live Script MatVec1LS.mlx into Assignments/Week5/matlab/ and follow
the directions in it to execute function MatVec1.
* SEE ANSWER
* DO EXERCISE ON edX
Now, if m = n, then matrix A is square; and if the elements indexed with i, j and j, i are equal (A(i, j) = A(j, i)), then A is said to be a symmetric matrix.
Homework 5.1.1.4 Download the Live Script SymVec1LS.mlx into Assignments/Week5/matlab/ and follow
the directions in it to change the given function to only compute with the lower triangular part of the matrix.
* SEE ANSWER
* DO EXERCISE ON edX
Now, MATLAB stores matrices in column-major order, which means that the matrix
$$ \begin{pmatrix} 1 & -1 & 2 \\ -2 & 2 & 0 \\ -1 & 1 & -2 \end{pmatrix} $$
is stored in memory by stacking its columns one after the other:
1, −2, −1, −1, 2, 1, 2, 0, −2.
Computation tends to be more efficient if one accesses memory contiguously. This means that an algorithm that
accesses A by columns often computes the answer faster than one that accesses A by rows.
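The column-major layout can be made concrete with a short sketch (plain Python, our own illustration): element (i, j) of an m x n matrix lives at offset i + j*m in the flat array.

```python
A = [[1, -1, 2],
     [-2, 2, 0],
     [-1, 1, -2]]
m, n = 3, 3
# Stack the columns, as MATLAB (and Fortran) do in memory:
flat = [A[i][j] for j in range(n) for i in range(m)]
print(flat)  # -> [1, -2, -1, -1, 2, 1, 2, 0, -2]

def at(flat, m, i, j):
    """Read A(i, j) from column-major storage."""
    return flat[i + j * m]

assert all(at(flat, m, i, j) == A[i][j] for i in range(m) for j in range(n))
```

Walking down a column advances through consecutive memory locations, which is why column access tends to be faster.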
In a linear algebra course you should have learned that
$$ \begin{pmatrix} 1 & -1 & 2 \\ -2 & 2 & 0 \\ -1 & 1 & -2 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 1 \end{pmatrix} + \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix} = (2)\begin{pmatrix} 1 \\ -2 \\ -1 \end{pmatrix} + (-1)\begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix} + (1)\begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix} + \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix} $$
$$ = \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix} + (2)\begin{pmatrix} 1 \\ -2 \\ -1 \end{pmatrix} + (-1)\begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix} + (1)\begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix}, $$
which is exactly how Variant 3 for computing y := Ax + y, given in Figure 5.4, proceeds. It also means that the
implementation in Figure 5.2 can be rewritten as the one in Figure 5.5. The two implementations in Figures 5.2
and 5.5 differ only in the order of the loops indexed by i and j.
LAFFPfC/Assignments/Week5/matlab/MatVec3.m

Figure 5.5: Function that computes y := Ax + y, returning the result in vector y_out.
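A plain-Python sketch of the column-oriented variant (our own illustration of what MatVec3.m does) makes the contrast clear: the inner loop now performs an axpy with a column of A rather than a dot product with a row:

```python
def matvec_by_columns(A, x, y):
    """Compute y := A x + y as a sum of scaled columns of A:
    for each j, perform y := x[j] * (column j of A) + y."""
    m, n = len(A), len(x)
    for j in range(n):
        for i in range(m):  # touches column j of A top to bottom
            y[i] += x[j] * A[i][j]
    return y

print(matvec_by_columns([[1, -1, 2], [-2, 2, 0], [-1, 1, -2]],
                        [2, -1, 1], [3, 1, 0]))  # -> [8, -5, -5]
```

The result is identical to the row-oriented version; only the order in which the elements of A are visited differs.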
Homework 5.1.1.6 Download the Live Script * SymVec3LS.mlx into Assignments/Week5/matlab/ and follow
the directions in it to change the given function to only compute with the lower triangular part of the matrix.
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.1.1.7 Which algorithm for computing y := Ax + y casts more computation in terms of the columns
of the stored matrix (and is therefore probably higher performing)?
* SEE ANSWER
* DO EXERCISE ON edX
Now we get to two exercises that we believe demonstrate the value of our notation and systematic derivation of algorithms. They are surprisingly hard, even for experts. Don't be disappointed if you can't work them out! The answer comes later in the week.
Homework 5.1.1.8 (Challenge) Download the Live Script SymMatVecByColumnsLS.mlx into
Assignments/Week5/matlab/ and follow the directions in it to change the given function to only com-
pute with the lower triangular part of the matrix and only access the matrix by columns. (Not sort-of-kind-of as in
SymMatVec3.mlx.)
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.1.1.9 (Challenge) Find someone who knows a little (or a lot) about linear algebra and convince this
person that the answer to the last exercise is correct. Alternatively, if you did not manage to come up with an
answer for the last exercise, look at the answer to that exercise and convince yourself it is correct.
* SEE ANSWER
* DO EXERCISE ON edX
• It is difficult to find algorithms with specific (performance) properties even for relatively simple operations. The
problem: the traditional implementation involves a double nested loop, which makes the application of what
you learned in Week 3 bothersome.
• It is still difficult to give a convincing argument that even a relatively simple algorithm is correct, even after you
have completed Week 2. The problem: proving a double loop correct.
One could ask “But isn’t having any algorithm to compute the result good enough?” The graph in Figure 5.6 illustrates
the difference in performance of the different implementations (coded in C). The implementation that corresponds to
SymMatVecByColumns is roughly five times faster than the other implementations. It demonstrates there is a definite
performance gain that results from picking the right algorithm.
What you will find next is that the combination of our new notation and the application of systematic derivation
provides the solution, in Unit 5.2.6.
While we discuss efficiency here, implementing the algorithms as we do in MATLAB generally means they don't execute particularly efficiently. If you execute A * x in MATLAB, this is typically translated into a call to a high-performance implementation. But implementing it yourself in MATLAB, with loops or with our FLAME API, is not particularly efficient. We do it to illustrate algorithms. One would want to implement these same algorithms in a language that enables high performance, like C. We have a FLAME API for C as well.
Figure 5.6: Execution time (top) and speedup (bottom) as a function of matrix size for the different implementations
of symmetric matrix-vector multiplication.
• Enumerate candidate loop invariants for matrix operations from their PMEs and eliminate loop invariants that
do not show promise.
• Accomplish a complete derivation and implementation of an algorithm.
Consider the matrix-vector operation Ax, where A and x are of appropriate sizes so that this multiplication makes sense. Partition
$$ A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad\text{and}\quad x \rightarrow \begin{pmatrix} x_T \\ x_B \end{pmatrix}. $$
Then
$$ Ax = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \begin{pmatrix} x_T \\ x_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{TR} x_B \\ A_{BL} x_T + A_{BR} x_B \end{pmatrix}, $$
provided x_T and x_B have the appropriate sizes for the subexpressions to be well-defined.
Now, if A is symmetric, then A = A^T. For the partitioned matrix this means that
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}^T = \begin{pmatrix} A_{TL}^T & A_{BL}^T \\ A_{TR}^T & A_{BR}^T \end{pmatrix}. $$
If A_{TL} is square (and hence so is A_{BR}, since A itself is), then we conclude that
• A_{TL}^T = A_{TL}, and hence A_{TL} is symmetric.
• A_{BR}^T = A_{BR}, and hence A_{BR} is symmetric.
• A_{TR} = A_{BL}^T and A_{BL} = A_{TR}^T. Thus, if A_{TR} is not stored, one can compute with A_{BL}^T instead. Notice that one need not explicitly transpose the matrix: in MATLAB, the command A' * x will compute A^T x.
Hence, for a partitioned symmetric matrix where A_{TL} is square, one can compute with
$$ \begin{pmatrix} A_{TL} & A_{BL}^T \\ A_{BL} & A_{BR} \end{pmatrix} $$
if A_{TR} is not available (e.g., is not stored), or with
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{TR}^T & A_{BR} \end{pmatrix} $$
if A_{BL} is not available (e.g., is not stored). In the first case,
$$ Ax = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \begin{pmatrix} x_T \\ x_B \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{BL}^T \\ A_{BL} & A_{BR} \end{pmatrix} \begin{pmatrix} x_T \\ x_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B \\ A_{BL} x_T + A_{BR} x_B \end{pmatrix}. $$
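This identity is easy to check numerically. The following plain-Python sketch (our own, with an arbitrary symmetric 3 x 3 matrix and a 1 x 1 A_TL) computes the top part of Ax both with the full matrix and with only the stored lower triangular entries:

```python
A = [[2, 1, -1],
     [1, 3, 0],
     [-1, 0, 4]]          # a symmetric matrix
x = [1, 2, 3]
k = 1                      # A_TL is k x k

# Top part of A x using every entry of A:
full_top = [sum(A[i][j] * x[j] for j in range(3)) for i in range(k)]

# Same result via A_TL x_T + (A_BL)^T x_B, reading only entries
# A[i][j] with i >= j (the stored lower triangle):
via_lower = [sum(A[i][j] * x[j] for j in range(k)) +
             sum(A[i2][i] * x[i2] for i2 in range(k, 3))
             for i in range(k)]
print(full_top, via_lower)  # the two agree
```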
The operation we wish to implement is mathematically given by y := Ax + y, where A is a symmetric matrix (and
hence square) and only the lower triangular part of matrix A can be accessed, because (for example) the strictly upper
triangular part is not stored.
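As a reference point for the derivations that follow, the operation itself can be stated in a few lines (a plain-Python sketch, ours, not the course's code; the strictly upper triangular entries are never read):

```python
def symv_reference(A, x, y):
    """Reference for y := A x + y, A symmetric, lower triangle stored."""
    n = len(x)
    for i in range(n):
        for j in range(n):
            # Only entries on or below the diagonal are accessed:
            y[i] += (A[i][j] if j <= i else A[j][i]) * x[j]
    return y

print(symv_reference([[2, 9, 9], [1, 3, 9], [-1, 0, 4]],
                     [1, 2, 3], [0, 0, 0]))  # -> [1, 7, 11]
```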
We are going to implicitly remember that A is symmetric and only the lower triangular part of the matrix is stored. So, in the postcondition we simply state that y = Ax + ŷ is to be computed. If we partition
$$ A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, $$
where A_{TL} is square, then
• A_{TR} = A_{BL}^T, and
• if we partition
$$ x \rightarrow \begin{pmatrix} x_T \\ x_B \end{pmatrix} \quad\text{and}\quad y \rightarrow \begin{pmatrix} y_T \\ y_B \end{pmatrix}, $$
then entering the partitioned matrix and vectors into the postcondition y = Ax + ŷ yields
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \begin{pmatrix} x_T \\ x_B \end{pmatrix} + \begin{pmatrix} \hat y_T \\ \hat y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{TR} x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}, $$
since A_{TR} is not to be used. Hence
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}. $$
Homework 5.2.2.1 Below on the left you find four loop invariants for computing y := Ax + y where A has no special structure. On the right you find four loop invariants for computing y := Ax + y when A is symmetric and stored in the lower triangular part of A. Match the loop invariants on the right to the loop invariants on the left that you would expect maintain the same values in y before and after each iteration of the loop. (In the video, we mentioned asking you to find two invariants. We think you can handle finding these four!)

(1) y_T = ŷ_T ; y_B = A_B x + ŷ_B
(2) y_T = A_T x + ŷ_T ; y_B = ŷ_B
(3) y = A_L x_T + ŷ
(4) y = A_R x_B + ŷ

(a)–(d): four loop invariants from Figure 5.8, each displayed in the original as the full PME
y_T = A_TL x_T + A_BL^T x_B + ŷ_T ; y_B = A_BL x_T + A_BR x_B + ŷ_B,
with the terms not included in that invariant shown in gray.
* SEE ANSWER
* DO EXERCISE ON edX
Now, how do we come up with possible loop invariants? Each term in the PME is either included or not. This
gives one a table of candidate loop invariants, given in Figure 5.7. But not all of these candidates will lead to a valid
algorithm. In particular, any valid algorithm must include exactly one of the terms AT L xT or ABR xB . The reasons?
• Since AT L and ABR must be square submatrices, when the loop completes one of them must be the entire matrix
A while the other matrix is empty. But that means that one of the two terms must be included in the loop
invariant, since otherwise the loop invariant, together with the loop guard becoming false, will not imply the
postcondition.
• If both AT L xT and ABR xB are in the loop invariant, there is no simple initialization step that places the variables
in a state where the loop invariant is TRUE. Why? Because if one of the two matrices AT L and ABR is empty,
then the other one is the whole matrix A, and hence the final result must be computed as part of the initialization
step.
We conclude that exactly one of the terms AT L xT and ABR xB can and must appear in the loop invariant, leaving us with
the loop invariants tabulated in Figure 5.8.
Each of the four terms of the PME is either included ("Yes") or not ("No"); the resulting candidate loop invariant states what y contains:

     A_TL x_T   A_BL^T x_B   A_BL x_T   A_BR x_B   Candidate: (y_T ; y_B) =
A    No         No           No         No         ( ŷ_T ; ŷ_B )
B    Yes        No           No         No         ( A_TL x_T + ŷ_T ; ŷ_B )
C    No         Yes          No         No         ( A_BL^T x_B + ŷ_T ; ŷ_B )
D    Yes        Yes          No         No         ( A_TL x_T + A_BL^T x_B + ŷ_T ; ŷ_B )
E    No         No           Yes        No         ( ŷ_T ; A_BL x_T + ŷ_B )
F    Yes        No           Yes        No         ( A_TL x_T + ŷ_T ; A_BL x_T + ŷ_B )
G    No         Yes          Yes        No         ( A_BL^T x_B + ŷ_T ; A_BL x_T + ŷ_B )
H    Yes        Yes          Yes        No         ( A_TL x_T + A_BL^T x_B + ŷ_T ; A_BL x_T + ŷ_B )
I    No         No           No         Yes        ( ŷ_T ; A_BR x_B + ŷ_B )
J    Yes        No           No         Yes        ( A_TL x_T + ŷ_T ; A_BR x_B + ŷ_B )
K    No         Yes          No         Yes        ( A_BL^T x_B + ŷ_T ; A_BR x_B + ŷ_B )
L    Yes        Yes          No         Yes        ( A_TL x_T + A_BL^T x_B + ŷ_T ; A_BR x_B + ŷ_B )
M    No         No           Yes        Yes        ( ŷ_T ; A_BL x_T + A_BR x_B + ŷ_B )
N    Yes        No           Yes        Yes        ( A_TL x_T + ŷ_T ; A_BL x_T + A_BR x_B + ŷ_B )
O    No         Yes          Yes        Yes        ( A_BL^T x_B + ŷ_T ; A_BL x_T + A_BR x_B + ŷ_B )
P    Yes        Yes          Yes        Yes        ( A_TL x_T + A_BL^T x_B + ŷ_T ; A_BL x_T + A_BR x_B + ŷ_B )

Figure 5.7: Candidates for loop-invariants for y := Ax + y where A is symmetric and only its lower triangular part is stored.
PME:
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}. $$

     A_TL x_T   A_BL^T x_B   A_BL x_T   A_BR x_B   Loop invariant: (y_T ; y_B) =                      Invariant #
     Yes        No           No         No         ( A_TL x_T + ŷ_T ; ŷ_B )                           1
     Yes        Yes          No         No         ( A_TL x_T + A_BL^T x_B + ŷ_T ; ŷ_B )              2
     Yes        No           Yes        No         ( A_TL x_T + ŷ_T ; A_BL x_T + ŷ_B )                3
     Yes        Yes          Yes        No         ( A_TL x_T + A_BL^T x_B + ŷ_T ; A_BL x_T + ŷ_B )   4
     No         Yes          Yes        Yes        ( A_BL^T x_B + ŷ_T ; A_BL x_T + A_BR x_B + ŷ_B )   5
     No         Yes          No         Yes        ( A_BL^T x_B + ŷ_T ; A_BR x_B + ŷ_B )              6
     No         No           Yes        Yes        ( ŷ_T ; A_BL x_T + A_BR x_B + ŷ_B )                7
     No         No           No         Yes        ( ŷ_T ; A_BR x_B + ŷ_B )                           8

Figure 5.8: Loop-invariants for y := Ax + y where A is symmetric and only its lower triangular part is stored.
Homework 5.2.3.1 You may want to derive the algorithm corresponding to Invariant 1 yourself, consulting the
video if you get stuck. Some resources:
• The * blank worksheet.
• Download * symv unb var1 ws.tex and place it in LAFFPfC/Assignments/Week5/LaTeX/. You will
need * color flatex.tex as well in that directory.
The condition
$$ P_{inv} \wedge \neg G \equiv \left( \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix} \right) \wedge \neg G $$
must imply that
R : y = Ax + ŷ
holds. The loop guard m(A_{TL}) < m(A) has the desired property.
Step 4: Initialization.
When we derived the PME in Step 2, we decided to partition the matrices like
$$ A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad x \rightarrow \begin{pmatrix} x_T \\ x_B \end{pmatrix}, \quad\text{and}\quad y \rightarrow \begin{pmatrix} y_T \\ y_B \end{pmatrix}. $$
The question now is how to choose the sizes of the submatrices and vectors so that the precondition
y = ŷ
implies that the loop invariant
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix} $$
holds after the initialization (and before the loop commences). The initialization that chooses A_{TL} to be 0 × 0, and x_T and y_T to be empty, has the desired property.
We now note that, as part of the computation, A_{TL}, x_T and y_T start by containing no elements and must ultimately equal all of A, x and y, respectively. Thus, as part of the loop in Step 5a, the top elements of x_B and y_B are exposed by
$$ \begin{pmatrix} x_T \\ x_B \end{pmatrix} \rightarrow \begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} y_T \\ y_B \end{pmatrix} \rightarrow \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix}. $$
This is where things become less than straightforward. The repartitionings in Step 5a do not change the contents of y: it is an "indexing" operation. We can thus ask ourselves the question of what the contents of y are in terms of the newly exposed parts of A, x, and y. We can derive this state, P_before, via textual substitution: the repartitionings in Step 5a imply that
$$ A_{TL} = A_{00}, \quad A_{TR} = \begin{pmatrix} a_{01} & A_{02} \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix}, \quad A_{BR} = \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}, $$
$$ x_T = x_0, \quad x_B = \begin{pmatrix} \chi_1 \\ x_2 \end{pmatrix}, \quad y_T = y_0, \quad y_B = \begin{pmatrix} \psi_1 \\ y_2 \end{pmatrix}. $$
If we substitute the expressions on the right of the equalities into the loop invariant, we find that
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix} $$
becomes
$$ \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + \hat y_0 \\ \hat\psi_1 \\ \hat y_2 \end{pmatrix}. $$
Similarly, if we substitute the expressions that correspond to the repartitionings in Step 5b into the loop invariant, we find that
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix} $$
becomes
$$ \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} A_{00} & (a_{10}^T)^T \\ a_{10}^T & \alpha_{11} \end{pmatrix} \begin{pmatrix} x_0 \\ \chi_1 \end{pmatrix} + \begin{pmatrix} \hat y_0 \\ \hat\psi_1 \end{pmatrix} \\ \hat y_2 \end{pmatrix}. $$
Comparing the contents in Step 6 and Step 7 now tells us that the state of y must change from
$$ \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + \hat y_0 \\ \hat\psi_1 \\ \hat y_2 \end{pmatrix} $$
to
$$ \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + (a_{10}^T)^T \chi_1 + \hat y_0 \\ a_{10}^T x_0 + \alpha_{11} \chi_1 + \hat\psi_1 \\ \hat y_2 \end{pmatrix}, $$
which can be accomplished by updating
y_0 := χ_1 (a_10^T)^T + y_0
ψ_1 := a_10^T x_0 + α_11 χ_1 + ψ_1.
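Putting these two updates in a loop yields the unblocked algorithm for Variant 1. A plain-Python sketch of that loop body (the course's actual implementation uses the FLAME API in MATLAB; this is our illustration):

```python
def symv_var1(A, x, y):
    """Unblocked Variant 1 for y := A x + y, A symmetric with only its
    lower triangular part readable. Per step i:
    y0 := chi1 * (a10^T)^T + y0 and
    psi1 := a10^T x0 + alpha11 * chi1 + psi1."""
    n = len(x)
    for i in range(n):
        chi1 = x[i]
        for j in range(i):          # y0 := chi1 * (a10^T)^T + y0
            y[j] += chi1 * A[i][j]
        for j in range(i):          # psi1 := a10^T x0 + psi1
            y[i] += A[i][j] * x[j]
        y[i] += A[i][i] * chi1      # psi1 := alpha11 * chi1 + psi1
    return y

# Strictly upper triangular entries (the 9s) are never read:
print(symv_var1([[2, 9, 9], [1, 3, 9], [-1, 0, 4]],
                [1, 1, 1], [0, 0, 0]))  # -> [2, 4, 3]
```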
It is important to build fluency and contrast a number of different algorithmic variants so you can discover patterns.
So, please take time for the next homework!
Homework 5.2.4.1 Derive algorithms for Variants 2-8, corresponding to the loop invariants in Figure 5.8. (If you don't have time to do all, then we suggest you do at least Variants 2-4 and Variant 8.)
Homework 5.2.4.2 Match the loop invariant (on the left) to the “update” in the loop body (on the right):
Loop invariants (from Figure 5.8):
Invariant 1: y_T = A_TL x_T + ŷ_T ; y_B = ŷ_B
Invariant 2: y_T = A_TL x_T + A_BL^T x_B + ŷ_T ; y_B = ŷ_B
Invariant 3: y_T = A_TL x_T + ŷ_T ; y_B = A_BL x_T + ŷ_B
Invariant 4: y_T = A_TL x_T + A_BL^T x_B + ŷ_T ; y_B = A_BL x_T + ŷ_B
Invariant 8: y_T = ŷ_T ; y_B = A_BR x_B + ŷ_B

Updates in the loop body:
(a) y_0 := χ_1 a_01 + y_0 ; ψ_1 := α_11 χ_1 + ψ_1 ; y_2 := χ_1 a_21 + y_2
(b) ψ_1 := α_11 χ_1 + a_21^T x_2 + ψ_1 ; y_2 := χ_1 a_21 + y_2
(c) y_0 := χ_1 (a_10^T)^T + y_0 ; ψ_1 := α_11 χ_1 + ψ_1 ; y_2 := χ_1 a_21 + y_2
(d) ψ_1 := a_10^T x_0 + α_11 χ_1 + a_21^T x_2 + ψ_1
(e) y_0 := χ_1 (a_10^T)^T + y_0 ; ψ_1 := a_10^T x_0 + α_11 χ_1 + ψ_1
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.2.4.3 Derive algorithms for Variants 2-8, corresponding to the loop invariants in Figure 5.8. (If you
don’t have time to do all, then we suggest you do at least Variants 2-4 and Variant 8). Some resources:
• The * blank worksheet.
• * color flatex.tex.
• Spark webpage.
• * symv unb var2 ws.tex, * symv unb var2 ws.tex,
* symv unb var3 ws.tex, * symv unb var4 ws.tex,
* symv unb var5 ws.tex, * symv unb var6 ws.tex,
* symv unb var7 ws.tex, * symv unb var8 ws.tex.
* SEE ANSWER
* DO EXERCISE ON edX
Now, consider again the PME for y := Ax + y, color coded for the different parts of the matrix:
$$ \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}. $$
Let us consider what computations this represents when AT L is 2 × 2 for our 4 × 4 example:
With this color coding, how the different algorithms perform computation is illustrated in Figure 5.9.
Homework 5.2.6.1 We now return to the launch for this week and the question of how to find an algorithm for
computing y := Ax + y, where A is symmetric and stored only in the lower triangular part of A. Consult Figure 5.10
to answer the question of which invariant(s) yield an algorithm that accesses the matrix by columns.
* SEE ANSWER
* DO EXERCISE ON edX
Figure 5.10: Summary of loop invariants for computing y := Ax + y, where A is symmetric and stored in the lower triangular part of the matrix. To the right of each invariant is the update to y in the derived loop corresponding to that invariant (for example, for Invariant 2 the update is ψ_1 := a_10^T x_0 + α_11 χ_1 + a_21^T x_2 + ψ_1, and for Invariant 3 it is y_0 := χ_1 (a_10^T)^T + y_0 ; ψ_1 := α_11 χ_1 + ψ_1 ; y_2 := χ_1 a_21 + y_2).
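The column-access algorithm that the challenge in Unit 5.1.1 asked for can be sketched as follows (plain Python, our own illustration): at step j, the single stored column below the diagonal is used twice, once for the dot-product contribution to ψ_1 and once for the axpy update of y_2.

```python
def symv_by_columns(A, x, y):
    """y := A x + y, A symmetric with only the lower triangular part
    stored, accessing A strictly by columns. Per step j:
    psi1 := alpha11 chi1 + a21^T x2 + psi1 and y2 := chi1 a21 + y2."""
    n = len(x)
    for j in range(n):
        chi1 = x[j]
        y[j] += A[j][j] * chi1        # alpha11 * chi1
        for i in range(j + 1, n):     # one pass down stored column j
            y[j] += A[i][j] * x[i]    # contribution a21^T x2
            y[i] += A[i][j] * chi1    # contribution chi1 * a21
    return y

# Strictly upper triangular entries (the 9s) are never read:
print(symv_by_columns([[2, 9, 9], [1, 3, 9], [-1, 0, 4]],
                      [1, 1, 1], [0, 0, 0]))  # -> [2, 4, 3]
```

Each stored column is traversed exactly once, top to bottom, which is the memory-friendly access pattern the launch asked for.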
Let A and B have entries α_{i,p} and β_{p,j}, so that
$$ A = \begin{pmatrix} \alpha_{0,0} & \cdots & \alpha_{0,k-1} \\ \vdots & & \vdots \\ \alpha_{m-1,0} & \cdots & \alpha_{m-1,k-1} \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} \beta_{0,0} & \beta_{0,1} & \cdots & \beta_{0,n-1} \\ \beta_{1,0} & \beta_{1,1} & \cdots & \beta_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \beta_{k-1,0} & \beta_{k-1,1} & \cdots & \beta_{k-1,n-1} \end{pmatrix}. $$
Then the result of computing C := AB sets
$$ \gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \times \beta_{p,j} \qquad (5.1) $$
for all 0 ≤ i < m and 0 ≤ j < n. Writing this out in the notation from Weeks 1-3 gives some idea of how messy postconditions and loop invariants for this operation might become using that notation.
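Equation 5.1 translates directly into a triple loop. A plain-Python sketch (ours; the course works in MATLAB and FLAME notation) for the accumulating operation C := AB + C:

```python
def gemm(C, A, B):
    """C := A B + C via gamma_ij := sum_p alpha_ip beta_pj + gamma_ij."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

print(gemm([[0, 0], [0, 0]], [[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# -> [[19, 22], [43, 50]]
```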
Now, if one partitions matrices C, A, and B into submatrices
$$ C = \begin{pmatrix} C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\ C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\ \vdots & \vdots & & \vdots \\ C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1} \end{pmatrix}, \quad A = \begin{pmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,K-1} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,K-1} \\ \vdots & \vdots & & \vdots \\ A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,K-1} \end{pmatrix}, $$
and
$$ B = \begin{pmatrix} B_{0,0} & B_{0,1} & \cdots & B_{0,N-1} \\ B_{1,0} & B_{1,1} & \cdots & B_{1,N-1} \\ \vdots & \vdots & & \vdots \\ B_{K-1,0} & B_{K-1,1} & \cdots & B_{K-1,N-1} \end{pmatrix}, $$
where C_{i,j}, A_{i,p}, and B_{p,j} are m_i × n_j, m_i × k_p, and k_p × n_j, respectively, then
$$ C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j}. $$
The computation with submatrices (blocks) mirrors the computation with the scalars in Equation 5.1:
$$ C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} \quad\text{versus}\quad \gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j}. $$
Thus, to remember how to multiply with partitioned matrices, all you have to do is remember how to multiply with matrix elements, except that A_{i,p} B_{p,j} does not necessarily commute. We will often talk about the constraint on how matrix sizes must match up by saying that the matrices are partitioned conformally.
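That blocked and scalar multiplication agree can be checked directly. The following plain-Python sketch (our own illustration) partitions 4 x 4 matrices into 2 x 2 blocks and verifies C_ij = sum_p A_ip B_pj block by block:

```python
def matmul(A, B):
    """Plain triple-loop product (reference)."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def block(M, i, j, nb):
    """Extract the nb x nb block in block-position (i, j) of M."""
    return [row[j * nb:(j + 1) * nb] for row in M[i * nb:(i + 1) * nb]]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
nb, M = 2, 2
C = matmul(A, B)
# Rebuild each block of C as sum_p A_ip B_pj and compare:
for i in range(M):
    for j in range(M):
        acc = [[0] * nb for _ in range(nb)]
        for p in range(M):
            P = matmul(block(A, i, p, nb), block(B, p, j, nb))
            acc = [[acc[r][c] + P[r][c] for c in range(nb)] for r in range(nb)]
        assert acc == block(C, i, j, nb)
print("blocked and scalar products agree")
```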
There are special cases of this that will be encountered in the subsequent discussions:
$$ \begin{pmatrix} A_L & A_R \end{pmatrix} \begin{pmatrix} B_T \\ B_B \end{pmatrix} = A_L B_T + A_R B_B, \quad \begin{pmatrix} A_T \\ A_B \end{pmatrix} B = \begin{pmatrix} A_T B \\ A_B B \end{pmatrix}, \quad\text{and}\quad A \begin{pmatrix} B_L & B_R \end{pmatrix} = \begin{pmatrix} A B_L & A B_R \end{pmatrix}. $$
Homework 5.3.2.2 Derive Variant 1, the algorithm corresponding to Invariant 1, in the answer to the last home-
work. Assume the algorithm “marches” through the matrix one row or column at a time (meaning you are to derive
an unblocked algorithm).
Some resources:
• The * blank worksheet.
• * color flatex.tex.
• Spark webpage.
• * gemm unb var1 ws.tex
• * GemmUnbVar1LS.mlx
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.2.3 If you feel energetic, repeat the last homework for Invariant 2.
* SEE ANSWER
* DO EXERCISE ON edX
To arrive at a second PME (PME 2) for computing C := AB + C, we partition matrix A by rows:
$$ A \rightarrow \begin{pmatrix} A_T \\ A_B \end{pmatrix}. $$
Homework 5.3.3.1 Identify a second PME (PME 2) that corresponds to the case where A is partitioned by rows.
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.3.2 Identify two loop invariants from this second PME (PME 2). Label these Invariant 3 and
Invariant 4.
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.3.3 Derive Variant 3, the algorithm corresponding to Invariant 3, in the answer to the last home-
work. Assume the algorithm “marches” through the matrix one row or column at a time (meaning you are to derive
an unblocked algorithm).
Some resources:
• * GemmUnbVar3LS.mlx
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.3.4 If you feel energetic, repeat the last homework for Invariant 4.
* SEE ANSWER
* DO EXERCISE ON edX
To arrive at the third PME for computing C := AB + C, we partition matrix A by columns:
$$ A \rightarrow \begin{pmatrix} A_L & A_R \end{pmatrix}. $$
Homework 5.3.4.1 Identify a third PME that corresponds to the case where A is partitioned by columns.
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.4.2 Identify two loop invariants from PME 3. Label these Invariant 5 and Invariant 6.
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.4.3 Derive Variant 5, the algorithm corresponding to Invariant 5, in the answer to the last home-
work. Assume the algorithm “marches” through the matrix one row or column at a time (meaning you are to derive
an unblocked algorithm).
Some resources:
• The * blank worksheet.
• * color flatex.tex.
• Spark webpage.
• * gemm unb var5 ws.tex
• * GemmUnbVar5LS.mlx
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.4.4 If you feel energetic, repeat the last homework for Invariant 6.
* SEE ANSWER
* DO EXERCISE ON edX
In the discussions so far, we always advanced the algorithm one row and/or column at a time:
As will become clear in the enrichment for this week, exposing a block of columns or rows allows one to “block” for
performance:
Such algorithms are usually referred to as “blocked algorithms,” explaining why we referred to previous algorithms
encountered in the course as “unblocked algorithms.”
Homework 5.3.5.1 Derive Variants 1, 3, and 5, the algorithms corresponding to Invariant 1, 3, and 5.
Some resources:
• The * blank worksheet.
• * color flatex.tex.
• Spark webpage.
• * gemm blk var1 ws.tex, * gemm blk var3 ws.tex * gemm blk var5 ws.tex
• * GemmBlkVar1LS.mlx , * GemmBlkVar3LS.mlx , * GemmBlkVar5LS.mlx
* SEE ANSWER
* DO EXERCISE ON edX
Homework 5.3.5.2 If you feel energetic, also derive Blocked Variants 2, 4, and 6.
* SEE ANSWER
* DO EXERCISE ON edX
(Throughout: notice the parallel between this material and that for symmetric matrix-vector multiplication.)
Consider the matrix-matrix operation AB, where A and B are of appropriate sizes so that this multiplication makes sense. Partition
$$ A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad\text{and}\quad B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix}. $$
Then
$$ AB = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \begin{pmatrix} B_T \\ B_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{TR} B_B \\ A_{BL} B_T + A_{BR} B_B \end{pmatrix}, $$
provided B_T and B_B have the appropriate sizes for the subexpressions to be well-defined.
Recall from Unit 5.2.1 that if A is symmetric, then A = A^T. For the partitioned matrix this means that
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} A_{TL}^T & A_{BL}^T \\ A_{TR}^T & A_{BR}^T \end{pmatrix}. $$
If A_{TL} is square (and hence so is A_{BR}, since A itself is), then we conclude, as before, that A_{TL} and A_{BR} are symmetric and that A_{TR} = A_{BL}^T.
5.4.2 Deriving the first PME and corresponding loop invariants * to edX
The operation we wish to implement is mathematically given by C := AB +C, where A is a symmetric matrix (and
hence square) and only the lower triangular part of matrix A can be accessed, because (for example) the strictly upper
triangular part is not stored.
We are going to implicitly remember that A is symmetric and only the lower triangular part of the matrix is stored. So, in the postcondition we simply state that C = AB + Ĉ is TRUE. If we partition A as in Unit 5.4.1, with A_{TL} square, then
• A_{TR} = A_{BL}^T, and
• if we partition
$$ B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix} \quad\text{and}\quad C \rightarrow \begin{pmatrix} C_T \\ C_B \end{pmatrix}, $$
then entering the partitioned matrices into the postcondition C = AB + Ĉ yields
$$ \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{TR} B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix} \quad\text{(since } A_{TR} \text{ is not to be used).} $$
This last observation gives us our first PME for this operation:
$$ \text{PME 1:} \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix}. $$
Homework 5.4.2.1 Create a table of all loop invariants for PME 1, discarding those for which there is no viable loop guard or initialization command. You may want to start with Figure 5.11. The gray text there will help you decide what to include in the loop invariant.
* SEE ANSWER
* DO EXERCISE ON edX
The condition
$$ P_{inv} \wedge \neg G \equiv \left( \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix} \right) \wedge \neg G $$
must imply that
R : C = AB + Ĉ
holds. The loop guard m(A_{TL}) < m(A) has the desired property.
Step 4: Initialization.
When we derived the PME in Step 2, we decided to partition the matrices like
$$ A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix}, \quad\text{and}\quad C \rightarrow \begin{pmatrix} C_T \\ C_B \end{pmatrix}. $$
The question now is how to choose the sizes of the submatrices so that the precondition
C = Ĉ
implies that the loop invariant holds after the initialization (and before the loop commences).
PME:
$$ \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix}. $$

     A_TL B_T   A_BL^T B_B   A_BL B_T   A_BR B_B   Loop invariant: (C_T ; C_B) =                      Invariant #
     Yes        No           No         No         ( A_TL B_T + Ĉ_T ; Ĉ_B )                           1
     Yes        Yes          No         No         ( A_TL B_T + A_BL^T B_B + Ĉ_T ; Ĉ_B )              2
     Yes        No           Yes        No         ( A_TL B_T + Ĉ_T ; A_BL B_T + Ĉ_B )                3
     Yes        Yes          Yes        No         ( A_TL B_T + A_BL^T B_B + Ĉ_T ; A_BL B_T + Ĉ_B )   4
     No         Yes          Yes        Yes        ( A_BL^T B_B + Ĉ_T ; A_BL B_T + A_BR B_B + Ĉ_B )   5
     No         Yes          No         Yes        ( A_BL^T B_B + Ĉ_T ; A_BR B_B + Ĉ_B )              6
     No         No           Yes        Yes        ( Ĉ_T ; A_BL B_T + A_BR B_B + Ĉ_B )                7
     No         No           No         Yes        ( Ĉ_T ; A_BR B_B + Ĉ_B )                           8

Figure 5.11: Table for Homework 5.4.2.1, in which to identify loop-invariants for C := AB + C where A is symmetric and only its lower triangular part is stored.
We now note that, as part of the computation, A_{TL}, B_T and C_T start by containing no elements and must ultimately equal all of A, B and C, respectively. Thus, as part of the loop, rows must be taken from B_B and C_B and must be added to B_T and C_T, respectively, and the quadrant A_{TL} must be expanded every time the loop executes. In Step 5a,
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \rightarrow \begin{pmatrix} B_0 \\ b_1^T \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \rightarrow \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix}, $$
and in Step 5b,
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \leftarrow \begin{pmatrix} B_0 \\ b_1^T \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \leftarrow \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix}. $$
This is where things become less straightforward. The repartitionings in Step 5a do not change the contents of C: it is an "indexing" operation. We can thus ask ourselves the question of what the contents of C are in terms of the newly exposed parts of A, B, and C. We can derive this state, P_before, via textual substitution: the repartitionings in Step 5a imply that
$$ A_{TL} = A_{00}, \quad A_{TR} = \begin{pmatrix} a_{01} & A_{02} \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix}, \quad A_{BR} = \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}, $$
$$ B_T = B_0, \quad B_B = \begin{pmatrix} b_1^T \\ B_2 \end{pmatrix}, \quad C_T = C_0, \quad C_B = \begin{pmatrix} c_1^T \\ C_2 \end{pmatrix}. $$
If we substitute the expressions on the right of the equalities into the loop invariant, we find that
$$ \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix} $$
becomes
$$ \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix}^T \begin{pmatrix} b_1^T \\ B_2 \end{pmatrix} + \hat C_0 \\ \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix} B_0 + \begin{pmatrix} \hat c_1^T \\ \hat C_2 \end{pmatrix} \end{pmatrix} $$
and hence
$$ \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \hat c_1^T \\ A_{20} B_0 + \hat C_2 \end{pmatrix}. $$
Similarly, if we substitute the expressions that correspond to the repartitionings in Step 5b into the loop invariant, we find that
$$ \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix} $$
becomes
$$ \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} A_{00} & (a_{10}^T)^T \\ a_{10}^T & \alpha_{11} \end{pmatrix} \begin{pmatrix} B_0 \\ b_1^T \end{pmatrix} + \begin{pmatrix} A_{20} & a_{21} \end{pmatrix}^T B_2 + \begin{pmatrix} \hat C_0 \\ \hat c_1^T \end{pmatrix} \\ \begin{pmatrix} A_{20} & a_{21} \end{pmatrix} \begin{pmatrix} B_0 \\ b_1^T \end{pmatrix} + \hat C_2 \end{pmatrix}. $$
Comparing the contents in Step 6 and Step 7 now tells us that the state of C must change from
$$ \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \hat c_1^T \\ A_{20} B_0 + \hat C_2 \end{pmatrix} $$
to
$$ \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \alpha_{11} b_1^T + a_{21}^T B_2 + \hat c_1^T \\ A_{20} B_0 + a_{21} b_1^T + \hat C_2 \end{pmatrix}, $$
which can be accomplished by updating
c_1^T := α_11 b_1^T + a_21^T B_2 + c_1^T
C_2 := a_21 b_1^T + C_2.
Homework 5.4.3.1 Derive as many unblocked algorithmic variants as you find useful.
Some resources:
• The * blank worksheet.
• * color flatex.tex.
• Spark webpage.
• * symm l unb var1 ws.tex, * symm l unb var2 ws.tex,
* symm l unb var3 ws.tex, * symm l unb var4 ws.tex,
* symm l unb var5 ws.tex, * symm l unb var6 ws.tex,
* symm l unb var7 ws.tex, * symm l unb var8 ws.tex.
• * SymmLUnbVar1LS.mlx, * SymmLUnbVar2LS.mlx,
* SymmLUnbVar3LS.mlx, * SymmLUnbVar4LS.mlx,
* SymmLUnbVar5LS.mlx, * SymmLUnbVar6LS.mlx,
* SymmLUnbVar7LS.mlx, * SymmLUnbVar8LS.mlx.
* SEE ANSWER
* DO EXERCISE ON edX
We now discuss how to derive a blocked algorithm for symmetric matrix-matrix multiplication. Such an algorithm
casts most computation in terms of matrix-matrix multiplication. If the matrix-matrix multiplication achieves high
performance, then so does the blocked symmetric matrix-matrix multiplication (for large problem sizes).
Step 4: Initialization.
We now note that, as part of the computation, A_{TL}, B_T and C_T start by containing no elements and must ultimately equal all of A, B and C, respectively. Thus, as part of the loop, rows must be taken from B_B and C_B and must be added to B_T and C_T, respectively, and the quadrant A_{TL} must be expanded every time the loop executes. In Step 5a,
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \rightarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \rightarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}, $$
and in Step 5b,
$$ \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \leftarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \leftarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}. $$
This is where things again become less straightforward. The repartitionings in Step 5a do not change the contents of C: it is an "indexing" operation. We can thus ask ourselves the question of what the contents of C are in terms of the newly exposed parts of A, B, and C. We can derive this state, P_before, via textual substitution: the repartitionings in Step 5a imply that
$$ A_{TL} = A_{00}, \quad A_{TR} = \begin{pmatrix} A_{01} & A_{02} \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} A_{10} \\ A_{20} \end{pmatrix}, \quad A_{BR} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, $$
$$ B_T = B_0, \quad B_B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}, \quad C_T = C_0, \quad C_B = \begin{pmatrix} C_1 \\ C_2 \end{pmatrix}. $$
If we substitute the expressions on the right of the equalities into the loop invariant, we find that
$$ \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + \hat C_T \\ \hat C_B \end{pmatrix} $$
becomes
$$ \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + \hat C_0 \\ \hat C_1 \\ \hat C_2 \end{pmatrix}. $$
Similarly, if we substitute the expressions that correspond to the repartitionings in Step 5b into the loop invariant, we find that
$$ \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + \hat C_T \\ \hat C_B \end{pmatrix} $$
becomes
$$ \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} A_{00} & A_{10}^T \\ A_{10} & A_{11} \end{pmatrix} \begin{pmatrix} B_0 \\ B_1 \end{pmatrix} + \begin{pmatrix} \hat C_0 \\ \hat C_1 \end{pmatrix} \\ \hat C_2 \end{pmatrix}. $$
Comparing the contents in Step 6 and Step 7 now tells us that the state of C must change from
$$ \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + \hat C_0 \\ \hat C_1 \\ \hat C_2 \end{pmatrix} $$
to
$$ \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + A_{10}^T B_1 + \hat C_0 \\ A_{10} B_0 + A_{11} B_1 + \hat C_1 \\ \hat C_2 \end{pmatrix}, $$
which can be accomplished by updating
C_0 := A_10^T B_1 + C_0
C_1 := A_10 B_0 + A_11 B_1 + C_1
Discussion
The second update can be split, so that the loop body performs
C_0 := A_10^T B_1 + C_0
C_1 := A_10 B_0 + C_1
C_1 := A_11 B_1 + C_1
Let's assume that matrices C and B are m × n, so that A is m × m, and that the algorithm uses a block size of n_b. For convenience, assume m is an integer multiple of n_b: let m = M n_b. We are going to analyze how much time is spent in each of the assignments.
Assume A_00 is (J n_b) × (J n_b) in size, for 0 ≤ J < M. Then, counting each multiply and each add as a floating point operation (flop):
• C_0 := A_10^T B_1 + C_0: This is a matrix-matrix multiply involving the (J n_b) × n_b matrix A_10^T and the n_b × n matrix B_1, for 2 J n_b^2 n flops.
• C_1 := A_10 B_0 + C_1: This is a matrix-matrix multiply involving the n_b × (J n_b) matrix A_10 and the (J n_b) × n matrix B_0, for 2 J n_b^2 n flops.
• C_1 := A_11 B_1 + C_1: This is a symmetric matrix-matrix multiply involving the n_b × n_b matrix A_11 and the n_b × n matrix B_1, for 2 n_b^2 n flops.
If we aggregate this over all iterations, we get

• All $C_0 := A_{10}^T B_1 + C_0$:
\[
\sum_{J=0}^{M-1} 2 J n_b^2 n \mbox{ flops} \approx 2 \frac{M^2}{2} n_b^2 n \mbox{ flops} = (M n_b)^2 n \mbox{ flops} = m^2 n \mbox{ flops}.
\]
• All $C_1 := A_{10} B_0 + C_1$: matrix-matrix multiplies involving the $n_b \times (J n_b)$ matrix $A_{10}$ and the $(J n_b) \times n$ matrix $B_0$, for
\[
\sum_{J=0}^{M-1} 2 J n_b^2 n \mbox{ flops} \approx 2 \frac{M^2}{2} n_b^2 n \mbox{ flops} = (M n_b)^2 n \mbox{ flops} = m^2 n \mbox{ flops}.
\]
• All $C_1 := A_{11} B_1 + C_1$: symmetric matrix-matrix multiplies involving the $n_b \times n_b$ matrix $A_{11}$ and the $n_b \times n$ matrix $B_1$, for
\[
\sum_{J=0}^{M-1} 2 n_b^2 n \mbox{ flops} = 2 M n_b^2 n \mbox{ flops} = 2 n_b m n \mbox{ flops}.
\]
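A quick numeric sanity check of these totals (our own, not part of the text): summing the per-iteration counts exactly and comparing against the closed forms for a case where $n_b \ll m$.

```python
# Exact flop totals for the three updates, summed over all M iterations,
# for (illustrative, arbitrarily chosen) sizes with nb much smaller than m.
m, n, nb = 1000, 100, 10
M = m // nb                                            # assume m = M * nb

c0_total = sum(2 * J * nb**2 * n for J in range(M))    # all C0 := A10^T B1 + C0
c1_gemm  = sum(2 * J * nb**2 * n for J in range(M))    # all C1 := A10 B0 + C1
c1_symm  = sum(2 * nb**2 * n     for J in range(M))    # all C1 := A11 B1 + C1

# The exact sum is m (m - nb) n, which approaches m^2 n as nb shrinks,
# while the symmetric part is exactly 2 nb m n, which is much smaller.
assert c0_total == m * (m - nb) * n
assert c1_symm == 2 * nb * m * n
assert c1_symm < 0.05 * (c0_total + c1_gemm)
```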
The point: if $n_b$ is much smaller than m, then most computation is performed in the general matrix-matrix multiplications
\[
\begin{array}{l}
C_0 := A_{10}^T B_1 + C_0 \\
C_1 := A_{10} B_0 + C_1
\end{array}
\]
and a relatively small amount in the symmetric matrix-matrix multiplication
\[
C_1 := A_{11} B_1 + C_1.
\]
Thus, one can use a less efficient implementation for this subproblem (for example, using an unblocked algorithm).
Alternatively, since A11 is relatively small, one can create a temporary matrix T in which to copy A11 with its up-
per triangular part explicitly copied as well, so that a general matrix-matrix multiplication can also be used for this
subproblem.
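One possible way to build the temporary T just described (this is our illustration, and `make_symmetric_temp` a hypothetical helper name, not the book's code): copy $A_{11}$ and mirror its lower triangle into the upper triangle, so that a general matrix-matrix multiply can then be applied to the diagonal block.

```python
def make_symmetric_temp(A11):
    """Given a square block whose lower triangle holds a symmetric matrix,
    return a fully populated copy T with T[i][j] == T[j][i]."""
    b = len(A11)
    return [[A11[i][j] if i >= j else A11[j][i] for j in range(b)]
            for i in range(b)]
```

Since $A_{11}$ is only $n_b \times n_b$, the extra copy costs $O(n_b^2)$ memory operations per iteration, negligible next to the $O(n_b^2 n)$ flops it enables to run at general matrix-matrix multiply speed.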
Homework 5.4.5.1 Derive as many blocked algorithmic variants as you find useful.
Some resources:
• The * blank worksheet.
• * color_flatex.tex.
• Spark webpage.
• * symm_l_blk_var1_ws.tex, * symm_l_blk_var2_ws.tex,
* symm_l_blk_var3_ws.tex, * symm_l_blk_var4_ws.tex,
* symm_l_blk_var5_ws.tex, * symm_l_blk_var6_ws.tex,
* symm_l_blk_var7_ws.tex, * symm_l_blk_var8_ws.tex.
• * SymmLBlkVar1LS.mlx, * SymmLBlkVar2LS.mlx,
* SymmLBlkVar3LS.mlx, * SymmLBlkVar4LS.mlx,
(The rest of these are not yet available.)
* SymmLBlkVar5LS.mlx, * SymmLBlkVar6LS.mlx,
* SymmLBlkVar7LS.mlx, * SymmLBlkVar8LS.mlx.
* SEE ANSWER
* DO EXERCISE ON edX
Notice that this is identical to PME 1 for general matrix-matrix multiplication in Unit 5.3.2. The astute reader will recognize that the updates for the resulting variants cast computation in terms of a symmetric matrix-vector multiply,
\[
c_1 := A b_1 + c_1,
\]
for the unblocked algorithms, or a symmetric matrix-matrix multiply,
\[
C_1 := A B_1 + C_1,
\]
for the blocked algorithms.
• Kazushige Goto, Robert A. van de Geijn. “Anatomy of high-performance matrix multiplication.” ACM Trans-
actions on Mathematical Software (TOMS), 2008.
This paper on the GotoBLAS approach for implementing matrix-matrix multiplication is probably the most
frequently cited recent paper on high-performance matrix-matrix multiplication. It was written to be under-
standable by expert and novice alike.
• Field G. Van Zee, Robert A. van de Geijn. “BLIS: A Framework for Rapidly Instantiating BLAS Functionality.”
ACM Transactions on Mathematical Software (TOMS), 2015.
In this paper, the implementation of the GotoBLAS approach is refined, exposing more loops around a “micro-
kernel” so that less code needs to be highly optimized.
Both papers can be accessed from
http://www.cs.utexas.edu/~flame/web/FLAMEPublications.html
for free access. This then led to a more complete treatment in the dissertation
• Tze Meng Low.
A Calculus of Loop Invariants for Dense Linear Algebra Optimization.
Ph.D. Dissertation. The University of Texas at Austin, Department of Computer Science. December 2013.
This work shows how an important complication for compilers, the phase ordering problem, can be side-stepped by looking at the PME and loop invariants.
This dissertation is also available from the same webpage.
We believe you now have the background to understand these works.
For extra practice, the level-3 BLAS (matrix-matrix) operations are a good source. These operations involve two or
more matrices and are special cases of matrix-matrix multiplication.
GEMM.
Earlier this week, you already derived algorithms for the GEMM (general matrix-matrix multiplication) operation:
C := AB +C,
where A, B, and C are all matrices with appropriate sizes. This is a special case of the operation that is part of the
BLAS, which includes all of the following operations:
C := αAB + βC
C := αAT B + βC
C := αABT + βC
C := αAT BT + βC
(Actually, it includes even more if the matrices are complex valued.) The key is that matrices A and B are not to be explicitly transposed, because of the memory operations and/or extra space that this would require. We suggest you ignore α and β. This then yields the unblocked algorithms/functions
• GEMM_NN_UNB_VARX(A, B, C) (no transpose A, no transpose B),
• GEMM_TN_UNB_VARX(A, B, C) (transpose A, no transpose B),
• GEMM_NT_UNB_VARX(A, B, C) (no transpose A, transpose B), and
• GEMM_TT_UNB_VARX(A, B, C) (transpose A, transpose B),
as well as the blocked algorithms/functions
• GEMM_NN_BLK_VARX(A, B, C) (no transpose A, no transpose B),
• GEMM_TN_BLK_VARX(A, B, C) (transpose A, no transpose B),
• GEMM_NT_BLK_VARX(A, B, C) (no transpose A, transpose B), and
• GEMM_TT_BLK_VARX(A, B, C) (transpose A, transpose B).
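To see why A never needs to be explicitly transposed, here is a hypothetical pure-Python illustration (not the course's MATLAB code): the TN case, C := AᵀB + C, simply indexes A as `A[p][i]` where the NN case would use `A[i][p]`; no transposed copy is ever formed.

```python
def gemm_nn(A, B, C):
    """C := A B + C for dense matrices stored as lists of rows."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += sum(A[i][p] * B[p][j] for p in range(len(B)))

def gemm_tn(A, B, C):
    """C := A^T B + C -- same loops, but A is read 'column-wise' in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += sum(A[p][i] * B[p][j] for p in range(len(B)))
```

The NT and TT cases follow the same pattern, swapping the indexing of B instead of (or in addition to) A.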
SYMM.
Earlier this week we discussed C := AB +C where A is symmetric and therefore only stored in the lower triangular part of array A. Obviously, the matrix could instead be stored in the upper triangular part. In addition, the symmetric matrix could appear to the right of matrix B, as in C := BA +C. This then yields the unblocked algorithms/functions
• SYMM_LL_UNB_VARX(A, B, C) (left, lower triangle stored),
• SYMM_LU_UNB_VARX(A, B, C) (left, upper triangle stored),
• SYMM_RL_UNB_VARX(A, B, C) (right, lower triangle stored), and
• SYMM_RU_UNB_VARX(A, B, C) (right, upper triangle stored),
as well as the blocked algorithms/functions
• SYMM_LL_BLK_VARX(A, B, C) (left, lower triangle stored),
• SYMM_LU_BLK_VARX(A, B, C) (left, upper triangle stored),
• SYMM_RL_BLK_VARX(A, B, C) (right, lower triangle stored), and
• SYMM_RU_BLK_VARX(A, B, C) (right, upper triangle stored).
SYRK.
If matrix C is symmetric, then so is the result of what is known as a symmetric rank-k update (SYRK): C := AA^T + C. In this case, only the lower or upper triangular part of C needs to be stored and updated. Alternatively, the rank-k update can compute with the transpose of A, yielding C := A^T A +C. The resulting unblocked algorithms/functions are then
• SYRK_LN_UNB_VARX(A, C) (lower triangle stored, no transpose),
• SYRK_LT_UNB_VARX(A, C) (lower triangle stored, transpose),
• SYRK_UN_UNB_VARX(A, C) (upper triangle stored, no transpose), and
• SYRK_UT_UNB_VARX(A, C) (upper triangle stored, transpose),
while the blocked algorithms/functions are
• SYRK_LN_BLK_VARX(A, C) (lower triangle stored, no transpose),
• SYRK_LT_BLK_VARX(A, C) (lower triangle stored, transpose),
• SYRK_UN_BLK_VARX(A, C) (upper triangle stored, no transpose), and
• SYRK_UT_BLK_VARX(A, C) (upper triangle stored, transpose).
SYR2K.
Similarly, if matrix C is symmetric, then so is the result of what is known as a symmetric rank-2k update (SYR2K): C := AB^T + BA^T + C. In this case, only the lower or upper triangular part of C needs to be stored and updated. Alternatively, the rank-2k update can compute with the transposes of A and B, yielding C := A^T B + B^T A + C. The resulting unblocked algorithms/functions are then
• SYR2K_LN_UNB_VARX(A, B, C) (lower triangle stored, no transpose),
• SYR2K_LT_UNB_VARX(A, B, C) (lower triangle stored, transpose),
• SYR2K_UN_UNB_VARX(A, B, C) (upper triangle stored, no transpose), and
• SYR2K_UT_UNB_VARX(A, B, C) (upper triangle stored, transpose).
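The analogous lower-triangle sketch for the rank-2k case (again our own illustration, not the book's code) accumulates both ABᵀ and BAᵀ contributions per entry:

```python
def syr2k_ln(A, B, C):
    """C := A B^T + B A^T + C, updating only the lower triangle of C.
    A, B: m x k, C: m x m."""
    m, k = len(A), len(A[0])
    for i in range(m):
        for j in range(i + 1):
            C[i][j] += sum(A[i][p] * B[j][p] + B[i][p] * A[j][p]
                           for p in range(k))
```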
5.6. Wrap Up * to edX 237
TRMM.
Another special case of matrix-matrix multiplication is given by B := AB, where A is (lower or upper) triangular. It turns out that the output can overwrite the input matrix B if the computation is carefully ordered. Alternatively, the triangular matrix can be to the right of B. Finally, A can be optionally transposed and/or have a unit or nonunit diagonal. Among the resulting cases are
\[
\begin{array}{l}
B := LB \\
B := L^T B \\
B := UB \\
B := U^T B
\end{array}
\]
where L is a lower triangular (possibly implicitly unit lower triangular) matrix and U is an upper triangular (possibly implicitly unit upper triangular) matrix. This then yields the algorithms/functions
• TRMM_LLNN_UNB_VARX(L, B), where LLNN stands for left, lower triangular, no transpose, nonunit diagonal,
• TRMM_RLNN_UNB_VARX(L, B), where RLNN stands for right, lower triangular, no transpose, nonunit diagonal,
• and so forth.
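An illustration of why B := LB can overwrite B when carefully ordered (our sketch, not the course's implementation): row i of the result needs only rows 0..i of the original B, so sweeping the rows from the bottom up never reads a value that has already been overwritten.

```python
def trmm_llnn(L, B):
    """B := L B in place, L lower triangular (left, no transpose, nonunit).
    Rows are processed bottom to top so that the rows 0..i read when
    computing row i still hold their ORIGINAL values."""
    m, n = len(B), len(B[0])
    for i in range(m - 1, -1, -1):      # bottom to top
        for j in range(n):
            B[i][j] = sum(L[i][p] * B[p][j] for p in range(i + 1))
```

Processing top to bottom would fail: row 0 would be overwritten before the later rows, which depend on it, had read it.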
TRSM.
The final matrix-matrix operation solves AX = B, where A is triangular and the solution X overwrites B. This is known as triangular solve with multiple right-hand sides. We discuss this operation in detail in Week 6.
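As a teaser ahead of Week 6, a minimal sketch (ours, with a hypothetical name) of the left, lower triangular, no-transpose, nonunit-diagonal case: LX = B, overwriting B with X via forward substitution.

```python
def trsm_llnn(L, B):
    """Solve L X = B for X, overwriting B, with L lower triangular and
    nonunit diagonal.  Here rows go top to bottom: row i of X depends on
    rows 0..i-1 of X, which have already been computed in place."""
    m, n = len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            s = sum(L[i][p] * B[p][j] for p in range(i))
            B[i][j] = (B[i][j] - s) / L[i][i]
```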