
Week 5

Matrix-Matrix Operations

5.1 Opening Remarks * to edX

5.1.1 Launch * to edX

* Watch Video on edX


* Watch Video on YouTube

In order to truly appreciate what the FLAME notation and API bring to the table, it helps to look at a programming
problem that on the surface seems straightforward, but turns out to be trickier than expected. When programming with
indices, coming up with an algorithm turns out to be relatively simple. But, when the goal is to, for example, access
memory in a favorable pattern, finding an appropriate algorithm is sometimes more difficult.
In this launch, you experience this by executing algorithms from last week by hand. Then, you examine how
these algorithms can be implemented with for loops and indices. The constraint that the matrices are symmetric is
then added into the mix. Finally, you are asked to find an algorithm that takes advantage of symmetry in storage yet
accesses the elements of the matrix in a beneficial order. The expectation is that this will be a considerable challenge.

Homework 5.1.1.1 Compute
$$
\begin{pmatrix} 1 & -1 & 2 \\ -2 & 2 & 0 \\ -1 & 1 & -2 \end{pmatrix}
\begin{pmatrix} 2 \\ -1 \\ 1 \end{pmatrix}
+
\begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix}
=
$$
using algorithmic Variant 1 given in Figure 5.1.
* SEE ANSWER
* DO EXERCISE ON edX

In Figure 5.1 we show Variant 1 for y := Ax + y in FLAME notation and, below it in Figure 5.2, a more traditional implementation in MATLAB. To understand it easily, we use the convention that the index i keeps track of the current row; in the algorithm expressed with FLAME notation this row is a1^T. The index j is then used for the loop that updates
$$\psi_1 := a_1^T x + \psi_1,$$
which you hopefully recognize as a dot product (or, more precisely, an apdot) operation.

Algorithm: y := GEMV_UNB_VAR1(A, x, y)

  Partition A → ( AT / AB ), y → ( yT / yB )
    where AT has 0 rows, yT has 0 elements
  while m(AT) < m(A) do
    Repartition
      ( AT / AB ) → ( A0 / a1^T / A2 ),   ( yT / yB ) → ( y0 / ψ1 / y2 )
      where a1^T is a row, ψ1 is a scalar

    ψ1 := a1^T x + ψ1

    Continue with
      ( AT / AB ) ← ( A0 / a1^T / A2 ),   ( yT / yB ) ← ( y0 / ψ1 / y2 )
  endwhile

Figure 5.1: Variant 1 for computing y := Ax + y in FLAME notation. Here ( X / Y ) denotes X stacked on top of Y.

function [ y_out ] = MatVec1( A, x, y )
% Compute y := A x + y

% Extract the row and column size of A
[ m, n ] = size( A );

% (Strictly speaking you should check that x is a vector of size n and
% y is a vector of size m...)

% Copy y into y_out
y_out = y;

% Compute y_out = A * x + y_out
for i = 1:m
    for j = 1:n
        y_out( i ) = A( i, j ) * x( j ) + y_out( i );
    end
end

end

LAFFPfC/Assignments/Week5/matlab/MatVec1.m

Figure 5.2: Function that computes y := Ax + y, returning the result in vector y_out.

function [ y_out ] = MatVec1( A, x, y )
% Compute y := A x + y

% Extract the row and column size of A
[ m, n ] = size( A );

% (Strictly speaking you should check that x is a vector of size n and
% y is a vector of size m...)

% Copy y into y_out
y_out = y;

% Compute y_out = A * x + y_out
for i = 1:m
    for j = 1:n
        y_out( i ) = A( i, j ) * x( j ) + y_out( i );
    end
end

end

LAFFPfC/Assignments/Week5/matlab/MatVec1.m

function [ y_out ] = SymMatVec1( A, x, y )
% Compute y := A x + y, assuming A is symmetric and stored in the
% lower triangular part of array A.

% Extract the row and column size of A
[ m, n ] = size( A );

% (Strictly speaking you should check that m = n, x is a vector of
% size n, and y is a vector of size n...)

% Copy y into y_out
y_out = y;

% Compute y_out = A * x + y_out
for i = 1:n
    for j = 1:i
        y_out( i ) = A( i, j ) * x( j ) + y_out( i );
    end
    for j = i+1:n
        y_out( i ) = A( j, i ) * x( j ) + y_out( i );
    end
end

end

LAFFPfC/Assignments/Week5/matlab/SymMatVec1.m

Figure 5.3: Functions that compute y := Ax + y, returning the result in vector y_out. The second function assumes matrix A is symmetric and stored only in the lower triangular part of array A.

Homework 5.1.1.2 Download the Live Script MatVec1LS.mlx into Assignments/Week5/matlab/ and follow
the directions in it to execute function MatVec1.
* SEE ANSWER
* DO EXERCISE ON edX

Now, if m = n then matrix A is square, and if the elements indexed with i, j and j, i are equal (A(i,j) = A(j,i)) then A is said to be a symmetric matrix.

Homework 5.1.1.3 Knowing that the matrix is symmetric, compute
$$
\begin{pmatrix} 1 & ? & ? \\ -2 & 2 & ? \\ -1 & 1 & -2 \end{pmatrix}
\begin{pmatrix} 2 \\ -1 \\ 1 \end{pmatrix}
+
\begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix}
=
$$
using algorithmic Variant 1 given in Figure 5.1.

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.1.1.4 Download the Live Script SymVec1LS.mlx into Assignments/Week5/matlab/ and follow
the directions in it to change the given function to only compute with the lower triangular part of the matrix.
* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

* Watch Video on edX


* Watch Video on YouTube

Now, MATLAB stores matrices in column-major order, which means that a matrix
$$
\begin{pmatrix} 1 & -1 & 2 \\ -2 & 2 & 0 \\ -1 & 1 & -2 \end{pmatrix}
$$
is stored in memory by stacking columns:
$$1, \; -2, \; -1, \; -1, \; 2, \; 1, \; 2, \; 0, \; -2.$$
Computation tends to be more efficient if one accesses memory contiguously. This means that an algorithm that accesses A by columns often computes the answer faster than one that accesses A by rows.
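A quick way to see the column-major layout (an illustrative check we add here; it is not part of the original text) is MATLAB's colon operator, which flattens a matrix in storage order:

A = [  1 -1  2
      -2  2  0
      -1  1 -2 ];
A(:)'    % yields 1 -2 -1 -1 2 1 2 0 -2: the columns, stacked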
In a linear algebra course you should have learned that
$$
\begin{pmatrix} 1 & -1 & 2 \\ -2 & 2 & 0 \\ -1 & 1 & -2 \end{pmatrix}
\begin{pmatrix} 2 \\ -1 \\ 1 \end{pmatrix}
+ \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix}
= (2)\begin{pmatrix} 1 \\ -2 \\ -1 \end{pmatrix}
+ (-1)\begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix}
+ (1)\begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix}
+ \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix}
= \begin{pmatrix} 3 \\ 1 \\ 0 \end{pmatrix}
+ (2)\begin{pmatrix} 1 \\ -2 \\ -1 \end{pmatrix}
+ (-1)\begin{pmatrix} -1 \\ 2 \\ 1 \end{pmatrix}
+ (1)\begin{pmatrix} 2 \\ 0 \\ -2 \end{pmatrix},
$$
which is exactly how Variant 3 for computing y := Ax + y, given in Figure 5.4, proceeds. It also means that the implementation in Figure 5.2 can be rewritten as the one in Figure 5.5. The two implementations in Figures 5.2 and 5.5 differ only in the order of the loops indexed by i and j.

Algorithm: y := GEMV_UNB_VAR3(A, x, y)

  Partition A → ( AL | AR ), x → ( xT / xB )
    where AL has 0 columns, xT has 0 rows
  while n(AL) < n(A) do
    Repartition
      ( AL | AR ) → ( A0 | a1 | A2 ),   ( xT / xB ) → ( x0 / χ1 / x2 )
      where a1 has 1 column, χ1 has 1 row

    y := χ1 a1 + y

    Continue with
      ( AL | AR ) ← ( A0 | a1 | A2 ),   ( xT / xB ) ← ( x0 / χ1 / x2 )
  endwhile

Figure 5.4: Variant 3 for computing y := Ax + y in FLAME notation. Here ( X | Y ) denotes X placed next to Y.

function [ y_out ] = MatVec3( A, x, y )
% Compute y := A x + y

% Extract the row and column size of A
[ m, n ] = size( A );

% (Strictly speaking you should check that x is a vector of size n and
% y is a vector of size m...)

% Copy y into y_out
y_out = y;

% Compute y_out = A * x + y_out
for j = 1:n
    for i = 1:m
        y_out( i ) = A( i, j ) * x( j ) + y_out( i );
    end
end

end

LAFFPfC/Assignments/Week5/matlab/MatVec3.m

Figure 5.5: Function that computes y := Ax + y, returning the result in vector y_out.

Homework 5.1.1.5 Knowing that the matrix is symmetric, compute
$$
\begin{pmatrix} 1 & ? & ? \\ 2 & -2 & ? \\ -2 & 1 & 3 \end{pmatrix}
\begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}
+
\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
=
$$
using algorithmic Variant 3 given in Figure 5.4.

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.1.1.6 Download the Live Script * SymVec3LS.mlx into Assignments/Week5/matlab/ and follow
the directions in it to change the given function to only compute with the lower triangular part of the matrix.
* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

Homework 5.1.1.7 Which algorithm for computing y := Ax + y casts more computation in terms of the columns
of the stored matrix (and is therefore probably higher performing)?
* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

Now we get to two exercises that we believe demonstrate the value of our notation and systematic derivation of algorithms. They are surprisingly hard, even for experts. Don't be disappointed if you can't work it out! The answer comes later in the week.
Homework 5.1.1.8 (Challenge) Download the Live Script SymMatVecByColumnsLS.mlx into
Assignments/Week5/matlab/ and follow the directions in it to change the given function to only com-
pute with the lower triangular part of the matrix and only access the matrix by columns. (Not sort-of-kind-of as in
SymMatVec3.mlx.)
* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.1.1.9 (Challenge) Find someone who knows a little (or a lot) about linear algebra and convince this
person that the answer to the last exercise is correct. Alternatively, if you did not manage to come up with an
answer for the last exercise, look at the answer to that exercise and convince yourself it is correct.
* SEE ANSWER
* DO EXERCISE ON edX

The point of these last two exercises is:

• It is difficult to find algorithms with specific (performance) properties even for relatively simple operations. The problem: the traditional implementation involves a doubly nested loop, which makes the application of what you learned in Week 3 bothersome.

• It is still difficult to give a convincing argument that even a relatively simple algorithm is correct, even after you have completed Week 2. The problem: proving a double loop correct.
One could ask “But isn’t having any algorithm to compute the result good enough?” The graph in Figure 5.6 illustrates
the difference in performance of the different implementations (coded in C). The implementation that corresponds to
SymMatVecByColumns is roughly five times faster than the other implementations. It demonstrates there is a definite
performance gain that results from picking the right algorithm.
What you will find next is that the combination of our new notation and the application of systematic derivation
provides the solution, in Unit 5.2.6.

While we discuss efficiency here, implementing the algorithms as we do in MATLAB generally means they don't execute particularly efficiently. If you execute A * x in MATLAB, this is typically translated into a call to a high-performance implementation. But implementing it yourself in MATLAB, as loops or with our FLAME API, is not particularly efficient. We do it to illustrate algorithms. One would want to implement these same algorithms in a language that enables high performance, like C. We have a FLAME API for C as well.

Figure 5.6: Execution time (top) and speedup (bottom) as a function of matrix size for the different implementations
of symmetric matrix-vector multiplication.

5.1.2 Outline Week 5 * to edX

5.1. Opening Remarks * to edX
  5.1.1. Launch * to edX
  5.1.2. Outline Week 5 * to edX
  5.1.3. What you will learn * to edX
5.2. Partitioning matrices into quadrants * to edX
  5.2.1. Background * to edX
  5.2.2. Example: Deriving algorithms for symmetric matrix-vector multiplication * to edX
  5.2.3. One complete derivation * to edX
  5.2.4. Other variants * to edX
  5.2.5. Visualizing the different algorithms * to edX
  5.2.6. Which variant? * to edX
5.3. Matrix-matrix multiplication * to edX
  5.3.1. Background * to edX
  5.3.2. Matrix-matrix multiplication by columns * to edX
  5.3.3. Matrix-matrix multiplication by rows * to edX
  5.3.4. Matrix-matrix multiplication via rank-1 updates * to edX
  5.3.5. Blocked algorithms * to edX
5.4. Symmetric matrix-matrix multiplication * to edX
  5.4.1. Background * to edX
  5.4.2. Deriving the first PME and corresponding loop invariants * to edX
  5.4.3. Deriving unblocked algorithms corresponding to PME 1 * to edX
  5.4.4. Blocked Algorithms * to edX
  5.4.5. Other blocked algorithms * to edX
  5.4.6. A second PME * to edX
5.5. Enrichment * to edX
  5.5.1. The memory hierarchy * to edX
  5.5.2. The GotoBLAS matrix-matrix multiplication algorithm * to edX
  5.5.3. The PME and loop invariants say it all! * to edX
5.6. Wrap Up * to edX
  5.6.1. Additional exercises * to edX
  5.6.2. Summary * to edX

5.1.3 What you will learn * to edX


While the FLAME notation has allowed us to abstract away from those nasty indices, so far the operations used to
illustrate this have been simple enough that they could have been derived with the techniques from Week 3. We now
see that the FLAME notation facilitates the derivation of families of algorithms for progressively more complicated
operations with matrices and vectors, yielding some algorithms that are not easily found without it.
Upon completion of this week, you should be able to

• Multiply with partitioned matrices to take advantage of special structure.


• Derive partitioned matrix expressions for matrix-vector and matrix-matrix operations.
• Recognize that a particular operation can have several PMEs each with multiple loop invariants.

• Enumerate candidate loop invariants for matrix operations from their PMEs and eliminate loop invariants that
do not show promise.
• Accomplish a complete derivation and implementation of an algorithm.

5.2 Partitioning matrices into quadrants * to edX

5.2.1 Background * to edX

* Watch Video on edX


* Watch Video on YouTube

Consider the matrix-vector operation Ax where A and x are of appropriate sizes so that this multiplication makes sense. Partition
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad\text{and}\quad x \rightarrow \begin{pmatrix} x_T \\ x_B \end{pmatrix}.$$
Then
$$Ax = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} x_T \\ x_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{TR} x_B \\ A_{BL} x_T + A_{BR} x_B \end{pmatrix}$$
provided xT and xB have the appropriate sizes for the subexpressions to be well-defined.

Now, if A is symmetric, then A = A^T. For the partitioned matrix this means that
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}^T = \begin{pmatrix} A_{TL}^T & A_{BL}^T \\ A_{TR}^T & A_{BR}^T \end{pmatrix}.$$
If ATL is square (and hence so is ABR, since A itself is), then we conclude that

• ATL^T = ATL and hence ATL is symmetric.

• ABR^T = ABR and hence ABR is symmetric.

• ATR = ABL^T and ABL = ATR^T. Thus, if ATR is not stored, one can compute with ABL^T instead. Notice that one need not explicitly transpose the matrix: in MATLAB the command A' * x computes A^T x.

Hence, for a partitioned symmetric matrix where ATL is square, one can compute with
$$\begin{pmatrix} A_{TL} & A_{BL}^T \\ A_{BL} & A_{BR} \end{pmatrix}$$
if ATR is not available (e.g., is not stored), or with
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{TR}^T & A_{BR} \end{pmatrix}$$
if ABL is not available (e.g., is not stored). In the first case,
$$Ax = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} x_T \\ x_B \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{BL}^T \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} x_T \\ x_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B \\ A_{BL} x_T + A_{BR} x_B \end{pmatrix}.$$
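The first of these cases is easy to check numerically. The following MATLAB sketch (an illustration we add here; the variable names are ours) represents a symmetric matrix by its lower triangle L and verifies that L x plus the transposed strictly lower part applied to x reproduces A x:

n = 5;
L = tril( rand( n ) );        % only the lower triangle is "stored"
A = L + tril( L, -1 )';       % the full symmetric matrix it represents
x = rand( n, 1 );
norm( A * x - ( L * x + tril( L, -1 )' * x ) )   % ~ 0, up to roundoff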

5.2.2 Example: Deriving algorithms for symmetric matrix-vector multiplication * to edX

* Watch Video on edX


* Watch Video on YouTube

The operation we wish to implement is mathematically given by y := Ax + y, where A is a symmetric matrix (and
hence square) and only the lower triangular part of matrix A can be accessed, because (for example) the strictly upper
triangular part is not stored.

Step 1: Precondition and postcondition

We are going to implicitly remember that A is symmetric and only the lower triangular part of the matrix is stored. So, in the postcondition we simply state that y = Ax + ŷ is to be computed (where ŷ denotes the original contents of y).

Step 2: Deriving loop invariants

Since matrix A is symmetric, we want to partition
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}$$
where ATL is square, since then, because of the symmetry of A, we know that

• ATL and ABR are symmetric,

• ATR = ABL^T, and

• if we partition
$$x \rightarrow \begin{pmatrix} x_T \\ x_B \end{pmatrix} \quad\text{and}\quad y \rightarrow \begin{pmatrix} y_T \\ y_B \end{pmatrix}$$
then entering the partitioned matrix and vectors into the postcondition y = Ax + ŷ yields
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} x_T \\ x_B \end{pmatrix} + \begin{pmatrix} \hat y_T \\ \hat y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{TR} x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix},$$
where the last step holds since ATR is not to be used.

This last observation gives us our PME for this operation:
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}.$$

Homework 5.2.2.1 Below on the left you find four loop invariants for computing y := Ax + y where A has no special structure. On the right you find four loop invariants for computing y := Ax + y when A is symmetric and stored in the lower triangular part of A. Match the loop invariants on the right to the loop invariants on the left that you would expect maintain the same values in y before and after each iteration of the loop. (In the video, we mentioned asking you to find two invariants. We think you can handle finding these four!)

(1) ( yT / yB ) = ( ŷT / AB x + ŷB )
(2) ( yT / yB ) = ( AT x + ŷT / ŷB )
(3) y = AL xT + ŷ
(4) y = AR xB + ŷ

(a) ( yT / yB ) = ( ATL xT + ABL^T xB + ŷT / ŷB )
(b) ( yT / yB ) = ( ATL xT + ŷT / ABL xT + ŷB )
(c) ( yT / yB ) = ( ŷT / ABL xT + ABR xB + ŷB )
(d) ( yT / yB ) = ( ABL^T xB + ŷT / ABR xB + ŷB )

* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

* Watch Video on edX


* Watch Video on YouTube

Now, how do we come up with possible loop invariants? Each term in the PME is either included or not. This gives a table of candidate loop invariants, given in Figure 5.7. But not all of these candidates lead to a valid algorithm. In particular, any valid algorithm must include exactly one of the terms ATL xT and ABR xB. The reasons?

• Since ATL and ABR must be square submatrices, when the loop completes one of them must be the entire matrix A while the other is empty. But that means that one of the two terms must be included in the loop invariant, since otherwise the loop invariant, together with the loop guard becoming false, will not imply the postcondition.

• If both ATL xT and ABR xB are in the loop invariant, there is no simple initialization step that places the variables in a state where the loop invariant is TRUE. Why? Because if one of the two matrices ATL and ABR is empty, then the other one is the whole matrix A, and hence the final result would have to be computed as part of the initialization step.

We conclude that exactly one of the terms ATL xT and ABR xB can and must appear in the loop invariant, leaving us with the loop invariants tabulated in Figure 5.8.

 
     ATL xT   ABL^T xB   ABL xT   ABR xB   ( yT / yB ) =
A    No       No         No       No      ( ŷT / ŷB )
B    Yes      No         No       No      ( ATL xT + ŷT / ŷB )
C    No       Yes        No       No      ( ABL^T xB + ŷT / ŷB )
D    Yes      Yes        No       No      ( ATL xT + ABL^T xB + ŷT / ŷB )
E    No       No         Yes      No      ( ŷT / ABL xT + ŷB )
F    Yes      No         Yes      No      ( ATL xT + ŷT / ABL xT + ŷB )
G    No       Yes        Yes      No      ( ABL^T xB + ŷT / ABL xT + ŷB )
H    Yes      Yes        Yes      No      ( ATL xT + ABL^T xB + ŷT / ABL xT + ŷB )
I    No       No         No       Yes     ( ŷT / ABR xB + ŷB )
J    Yes      No         No       Yes     ( ATL xT + ŷT / ABR xB + ŷB )
K    No       Yes        No       Yes     ( ABL^T xB + ŷT / ABR xB + ŷB )
L    Yes      Yes        No       Yes     ( ATL xT + ABL^T xB + ŷT / ABR xB + ŷB )
M    No       No         Yes      Yes     ( ŷT / ABL xT + ABR xB + ŷB )
N    Yes      No         Yes      Yes     ( ATL xT + ŷT / ABL xT + ABR xB + ŷB )
O    No       Yes        Yes      Yes     ( ABL^T xB + ŷT / ABL xT + ABR xB + ŷB )
P    Yes      Yes        Yes      Yes     ( ATL xT + ABL^T xB + ŷT / ABL xT + ABR xB + ŷB )

Figure 5.7: Candidates for loop invariants for y := Ax + y where A is symmetric and only its lower triangular part is stored. Each candidate includes ("Yes") or excludes ("No") the indicated terms of the PME.

   
PME:
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}.$$

ATL xT   ABL^T xB   ABL xT   ABR xB   ( yT / yB ) =                                Invariant #
Yes      No         No       No      ( ATL xT + ŷT / ŷB )                          1
Yes      Yes        No       No      ( ATL xT + ABL^T xB + ŷT / ŷB )               2
Yes      No         Yes      No      ( ATL xT + ŷT / ABL xT + ŷB )                 3
Yes      Yes        Yes      No      ( ATL xT + ABL^T xB + ŷT / ABL xT + ŷB )      4
No       Yes        Yes      Yes     ( ABL^T xB + ŷT / ABL xT + ABR xB + ŷB )      5
No       Yes        No       Yes     ( ABL^T xB + ŷT / ABR xB + ŷB )               6
No       No         Yes      Yes     ( ŷT / ABL xT + ABR xB + ŷB )                 7
No       No         No       Yes     ( ŷT / ABR xB + ŷB )                          8

Figure 5.8: Loop invariants for y := Ax + y where A is symmetric and only its lower triangular part is stored.

5.2.3 One complete derivation * to edX

In this unit, we continue the derivation started in Unit 5.2.2, with the loop invariant
$$\text{Invariant 1:}\quad \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix}.$$

Homework 5.2.3.1 You may want to derive the algorithm corresponding to Invariant 1 yourself, consulting the video if you get stuck. Some resources:

• The * blank worksheet.
• Download * symv_unb_var1_ws.tex and place it in LAFFPfC/Assignments/Week5/LaTeX/. You will need * color_flatex.tex in that directory as well.
• The * Spark webpage.

Alternatively, you may want to download the completed worksheet (with intermediate steps later in the PDF) * symv_unb_var1_ws_answer.pdf and/or its source * symv_unb_var1_ws_answer.tex.

* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

Step 3: Determining the loop guard.

The condition
$$P_{inv} \wedge \neg G \equiv \left( \begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix} \right) \wedge \neg G$$
must imply that
$$R : y = Ax + \hat y$$
holds. The loop guard m(ATL) < m(A) has the desired property.

Step 4: Initialization.

When we derived the PME in Step 2, we decided to partition the matrices like
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad x \rightarrow \begin{pmatrix} x_T \\ x_B \end{pmatrix}, \quad\text{and}\quad y \rightarrow \begin{pmatrix} y_T \\ y_B \end{pmatrix}.$$
The question now is how to choose the sizes of the submatrices and vectors so that the precondition
$$y = \hat y$$
implies that the loop invariant
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix}$$
holds after the initialization (and before the loop commences). The initialization where ATL is 0 × 0, and xT and yT have 0 elements, has the desired property.

Step 5: Progressing through the matrix and vectors.

We now note that, as part of the computation, ATL, xT and yT start by containing no elements and must ultimately equal all of A, x and y, respectively. Thus, as part of the loop in Step 5a, the top elements of xB and yB are exposed by
$$\begin{pmatrix} x_T \\ x_B \end{pmatrix} \rightarrow \begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} y_T \\ y_B \end{pmatrix} \rightarrow \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix},$$
and they are added to xT and yT with
$$\begin{pmatrix} x_T \\ x_B \end{pmatrix} \leftarrow \begin{pmatrix} x_0 \\ \chi_1 \\ x_2 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} y_T \\ y_B \end{pmatrix} \leftarrow \begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix}.$$
Similarly, rows of A are exposed by
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$$
and "moved", in Step 5b, by
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}.$$

Step 6: Determining the state after repartitioning.

This is where things become less than straightforward. The repartitionings in Step 5a do not change the contents of y: it is an "indexing" operation. We can thus ask ourselves what the contents of y are in terms of the newly exposed parts of A, x, and y. We can derive this state, P_before, via textual substitution: the repartitionings in Step 5a imply that
$$A_{TL} = A_{00}, \quad A_{TR} = \begin{pmatrix} a_{01} & A_{02} \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix}, \quad A_{BR} = \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix},$$
$$x_T = x_0, \quad x_B = \begin{pmatrix} \chi_1 \\ x_2 \end{pmatrix}, \quad y_T = y_0, \quad\text{and}\quad y_B = \begin{pmatrix} \psi_1 \\ y_2 \end{pmatrix}.$$
If we substitute the expressions on the right of the equalities into the loop invariant, we find that
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix}$$
becomes
$$\begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + \hat y_0 \\ \hat\psi_1 \\ \hat y_2 \end{pmatrix}.$$

Step 7: Determining the state after moving the thick lines.

The movement of the thick lines in Step 5b means that now
$$A_{TL} = \begin{pmatrix} A_{00} & a_{01} \\ a_{10}^T & \alpha_{11} \end{pmatrix}, \quad A_{TR} = \begin{pmatrix} A_{02} \\ a_{12}^T \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} A_{20} & a_{21} \end{pmatrix}, \quad A_{BR} = A_{22},$$
$$x_T = \begin{pmatrix} x_0 \\ \chi_1 \end{pmatrix}, \quad x_B = x_2, \quad y_T = \begin{pmatrix} y_0 \\ \psi_1 \end{pmatrix}, \quad\text{and}\quad y_B = y_2.$$
If we substitute the expressions on the right of the equalities into the loop invariant we find that
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + \hat y_T \\ \hat y_B \end{pmatrix}$$
becomes
$$\begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} A_{00} & a_{01} \\ a_{10}^T & \alpha_{11} \end{pmatrix}\begin{pmatrix} x_0 \\ \chi_1 \end{pmatrix} + \begin{pmatrix} \hat y_0 \\ \hat\psi_1 \end{pmatrix} \\ \hat y_2 \end{pmatrix},$$
where we recognize that due to symmetry a01 = (a10^T)^T and hence
$$\begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + (a_{10}^T)^T \chi_1 + \hat y_0 \\ a_{10}^T x_0 + \alpha_{11} \chi_1 + \hat\psi_1 \\ \hat y_2 \end{pmatrix}.$$

Step 8: Determining the update.

Comparing the contents in Step 6 and Step 7 now tells us that the state of y must change from
$$\begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + \hat y_0 \\ \hat\psi_1 \\ \hat y_2 \end{pmatrix}$$
to
$$\begin{pmatrix} y_0 \\ \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_{00} x_0 + (a_{10}^T)^T \chi_1 + \hat y_0 \\ a_{10}^T x_0 + \alpha_{11} \chi_1 + \hat\psi_1 \\ \hat y_2 \end{pmatrix},$$
which can be accomplished by updating
$$y_0 := \chi_1 (a_{10}^T)^T + y_0$$
$$\psi_1 := a_{10}^T x_0 + \alpha_{11} \chi_1 + \psi_1.$$
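These two updates translate directly into MATLAB. The following sketch (our illustration, to be saved as its own .m file; compare it to the worksheet answer rather than treating it as the reference implementation) implements unblocked Variant 1, touching only the lower triangular part of A:

function [ y_out ] = SymMatVecVar1( A, x, y )
% Compute y := A x + y, with A symmetric and only its lower
% triangular part stored, using the Variant 1 updates derived above.
n = size( A, 1 );
y_out = y;
for i = 1:n
    % y0 := chi1 * (a10^T)^T + y0
    y_out( 1:i-1 ) = A( i, 1:i-1 )' * x( i ) + y_out( 1:i-1 );
    % psi1 := a10^T x0 + alpha11 chi1 + psi1
    y_out( i ) = A( i, 1:i-1 ) * x( 1:i-1 ) + A( i, i ) * x( i ) + y_out( i );
end
end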

5.2.4 Other variants * to edX

It is important to build fluency and contrast a number of different algorithmic variants so you can discover patterns.
So, please take time for the next homework!

Homework 5.2.4.1 Derive algorithms for Variants 2-8, corresponding to the loop invariants in Figure 5.8. (If you don't have time to do all, then we suggest you do at least Variants 2-4 and Variant 8.) Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * symv_unb_var2_ws.tex, * symv_unb_var3_ws.tex, * symv_unb_var4_ws.tex, * symv_unb_var5_ws.tex, * symv_unb_var6_ws.tex, * symv_unb_var7_ws.tex, * symv_unb_var8_ws.tex.

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.2.4.2 Match the loop invariant (on the left) to the "update" in the loop body (on the right):

Invariant 1: ( ATL xT + ŷT / ŷB )
Invariant 2: ( ATL xT + ABL^T xB + ŷT / ŷB )
Invariant 3: ( ATL xT + ŷT / ABL xT + ŷB )
Invariant 4: ( ATL xT + ABL^T xB + ŷT / ABL xT + ŷB )
Invariant 8: ( ŷT / ABR xB + ŷB )

(a) y0 := χ1 a01 + y0
    ψ1 := α11 χ1 + ψ1
    y2 := χ1 a21 + y2

(b) ψ1 := α11 χ1 + a21^T x2 + ψ1
    y2 := χ1 a21 + y2

(c) y0 := χ1 (a10^T)^T + y0
    ψ1 := α11 χ1 + ψ1
    y2 := χ1 a21 + y2

(d) ψ1 := a10^T x0 + α11 χ1 + a21^T x2 + ψ1

(e) y0 := χ1 (a10^T)^T + y0
    ψ1 := a10^T x0 + α11 χ1 + ψ1

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.2.4.3 Derive algorithms for Variants 2-8, corresponding to the loop invariants in Figure 5.8. (If you don't have time to do all, then we suggest you do at least Variants 2-4 and Variant 8.) Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * symv_unb_var2_ws.tex, * symv_unb_var3_ws.tex, * symv_unb_var4_ws.tex, * symv_unb_var5_ws.tex, * symv_unb_var6_ws.tex, * symv_unb_var7_ws.tex, * symv_unb_var8_ws.tex.

* SEE ANSWER
* DO EXERCISE ON edX

5.2.5 Visualizing the different algorithms * to edX

Let us reexamine the symmetric matrix-vector multiplication
$$y := Ax + y,$$
where only the lower triangular part of symmetric matrix A is stored.

The PME for this operation is
$$\begin{pmatrix} y_T \\ y_B \end{pmatrix} = \begin{pmatrix} A_{TL} x_T + A_{BL}^T x_B + \hat y_T \\ A_{BL} x_T + A_{BR} x_B + \hat y_B \end{pmatrix}.$$
Consider
$$A = \begin{pmatrix} \alpha_{0,0} & ? & ? & ? \\ \alpha_{1,0} & \alpha_{1,1} & ? & ? \\ \alpha_{2,0} & \alpha_{2,1} & \alpha_{2,2} & ? \\ \alpha_{3,0} & \alpha_{3,1} & \alpha_{3,2} & \alpha_{3,3} \end{pmatrix}, \quad x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \chi_2 \\ \chi_3 \end{pmatrix}, \quad\text{and}\quad y = \begin{pmatrix} \psi_0 \\ \psi_1 \\ \psi_2 \\ \psi_3 \end{pmatrix}.$$
Then all the calculations that need to be performed are given by
$$\begin{array}{l}
\alpha_{0,0}\chi_0 + \alpha_{1,0}\chi_1 + \alpha_{2,0}\chi_2 + \alpha_{3,0}\chi_3 + \hat\psi_0 \\
\alpha_{1,0}\chi_0 + \alpha_{1,1}\chi_1 + \alpha_{2,1}\chi_2 + \alpha_{3,1}\chi_3 + \hat\psi_1 \\
\alpha_{2,0}\chi_0 + \alpha_{2,1}\chi_1 + \alpha_{2,2}\chi_2 + \alpha_{3,2}\chi_3 + \hat\psi_2 \\
\alpha_{3,0}\chi_0 + \alpha_{3,1}\chi_1 + \alpha_{3,2}\chi_2 + \alpha_{3,3}\chi_3 + \hat\psi_3
\end{array}$$
(only entries αi,j with i ≥ j are stored; the others enter via symmetry).

Now, consider again the PME and which part of the matrix each term touches. When ATL is 2 × 2 in our 4 × 4 example, the terms αi,j χj with i, j < 2 come from ATL xT, those with i < 2 ≤ j from ABL^T xB, those with j < 2 ≤ i from ABL xT, and those with i, j ≥ 2 from ABR xB. How the different algorithms step through these computations is illustrated in Figure 5.9.

5.2.6 Which variant? * to edX


Figure 5.10 summarizes all eight loop invariants for computing y := Ax + y for the case where A is symmetric and stored in the lower triangular part of the matrix. In this figure, the algorithms corresponding to Invariants 1-4 move through matrix A from the top-left to the bottom-right, while the algorithms corresponding to Invariants 5-8 move through matrix A from the bottom-right to the top-left. With each pair of invariants, the figure gives the update to y that appears in the loop body of the resulting algorithm. Interestingly, for each algorithmic variant that moves through the matrix from top-left to bottom-right, there is a corresponding variant that moves from the bottom-right to the top-left that results in the same update to vector y.
There is a clear link between the two loop invariants that yield the same update, if you look at how each pair differs
and how the differences relate to the PME. In one of the enrichments, we point you to recent research that explains
what you observe.

Homework 5.2.6.1 We now return to the launch for this week and the question of how to find an algorithm for
computing y := Ax + y, where A is symmetric and stored only in the lower triangular part of A. Consult Figure 5.10
to answer the question of which invariant(s) yield an algorithm that accesses the matrix by columns.
* SEE ANSWER
* DO EXERCISE ON edX
Figure 5.9: Illustration of how computation proceeds when computing y := Ax + y where A is symmetric and stored in the lower triangular part of A. The shaded region shows the computation that is performed in the indicated iteration.

Loop invariants and the corresponding update:

Invariant 1: ( ATL xT + ŷT / ŷB )
Invariant 5: ( ABL^T xB + ŷT / ABL xT + ABR xB + ŷB )
  Update: y0 := χ1 (a10^T)^T + y0
          ψ1 := a10^T x0 + α11 χ1 + ψ1

Invariant 2: ( ATL xT + ABL^T xB + ŷT / ŷB )
Invariant 6: ( ABL^T xB + ŷT / ABR xB + ŷB )
  Update: ψ1 := a10^T x0 + α11 χ1 + a21^T x2 + ψ1

Invariant 3: ( ATL xT + ŷT / ABL xT + ŷB )
Invariant 7: ( ŷT / ABL xT + ABR xB + ŷB )
  Update: y0 := χ1 (a10^T)^T + y0
          ψ1 := α11 χ1 + ψ1
          y2 := χ1 a21 + y2

Invariant 4: ( ATL xT + ABL^T xB + ŷT / ABL xT + ŷB )
Invariant 8: ( ŷT / ABR xB + ŷB )
  Update: ψ1 := α11 χ1 + a21^T x2 + ψ1
          y2 := χ1 a21 + y2

Figure 5.10: Summary of loop invariants for computing y := Ax + y, where A is symmetric and stored in the lower triangular part of the matrix. Each pair of invariants is followed by the update to y in the derived loop body.

5.3 Matrix-matrix multiplication * to edX

5.3.1 Background * to edX


For details on why this operation is defined the way it is and practice with this operation, you may want to consult
Weeks 4-5 of Linear Algebra: Foundations to Frontiers (LAFF). Here we give the briefest of reviews.
Given matrices C, A, and B of sizes m × n, m × k, and k × n, respectively, view these matrices as the two-dimensional arrays that represent them:
$$C = \begin{pmatrix} \gamma_{0,0} & \gamma_{0,1} & \cdots & \gamma_{0,n-1} \\ \gamma_{1,0} & \gamma_{1,1} & \cdots & \gamma_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \gamma_{m-1,0} & \gamma_{m-1,1} & \cdots & \gamma_{m-1,n-1} \end{pmatrix}, \quad A = \begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,k-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,k-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,k-1} \end{pmatrix},$$
and
$$B = \begin{pmatrix} \beta_{0,0} & \beta_{0,1} & \cdots & \beta_{0,n-1} \\ \beta_{1,0} & \beta_{1,1} & \cdots & \beta_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \beta_{k-1,0} & \beta_{k-1,1} & \cdots & \beta_{k-1,n-1} \end{pmatrix}.$$
Then the result of computing C := AB sets
$$\gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \times \beta_{p,j}    \qquad (5.1)$$
for all 0 ≤ i < m and 0 ≤ j < n. In the notation from Weeks 1-3 this is given as
$$(\forall i \mid 0 \le i < m : (\forall j \mid 0 \le j < n : \gamma_{i,j} = (\textstyle\sum p \mid 0 \le p < k : \alpha_{i,p} \times \beta_{p,j}))),$$
which gives some idea of how messy postconditions and loop invariants for this operation might become using that notation.
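For concreteness, Equation 5.1 is the familiar triple loop. In MATLAB (a small sketch we add here, using 1-based indexing instead of the 0-based indexing above, and assuming an m × k array A and a k × n array B are given):

C = zeros( m, n );    % or start from a given C to compute C := A B + C
for i = 1:m
    for j = 1:n
        for p = 1:k
            C( i, j ) = C( i, j ) + A( i, p ) * B( p, j );
        end
    end
end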
Now, if one partitions matrices C, A, and B into submatrices:
$$C = \begin{pmatrix} C_{0,0} & \cdots & C_{0,N-1} \\ \vdots & & \vdots \\ C_{M-1,0} & \cdots & C_{M-1,N-1} \end{pmatrix}, \quad A = \begin{pmatrix} A_{0,0} & \cdots & A_{0,K-1} \\ \vdots & & \vdots \\ A_{M-1,0} & \cdots & A_{M-1,K-1} \end{pmatrix},$$
and
$$B = \begin{pmatrix} B_{0,0} & \cdots & B_{0,N-1} \\ \vdots & & \vdots \\ B_{K-1,0} & \cdots & B_{K-1,N-1} \end{pmatrix},$$
where Ci,j, Ai,p, and Bp,j are mi × nj, mi × kp, and kp × nj, respectively, then
$$C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j}.$$
The computation with submatrices (blocks) mirrors the computation with the scalars in Equation 5.1:
$$C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} \quad\text{versus}\quad \gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j}.$$
Thus, to remember how to multiply with partitioned matrices, all you have to do is to remember how to multiply with matrix elements, except that Ai,p × Bp,j does not necessarily commute. We will often talk about the constraint on how matrix sizes must match up by saying that the matrices are partitioned conformally.
There are special cases of this that will be encountered in the subsequent discussions:
$$\begin{pmatrix} A_L & A_R \end{pmatrix}\begin{pmatrix} B_T \\ B_B \end{pmatrix} = A_L B_T + A_R B_B,$$
$$\begin{pmatrix} A_T \\ A_B \end{pmatrix} B = \begin{pmatrix} A_T B \\ A_B B \end{pmatrix}, \quad\text{and}\quad A \begin{pmatrix} B_L & B_R \end{pmatrix} = \begin{pmatrix} A B_L & A B_R \end{pmatrix}.$$
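The first of these special cases is easy to check numerically. In the following MATLAB sketch (ours, added for illustration), kt is the number of columns of AL, which must equal the number of rows of BT:

m = 4; k = 3; n = 5; kt = 2;
A = rand( m, k ); B = rand( k, n );
AL = A( :, 1:kt );  AR = A( :, kt+1:end );
BT = B( 1:kt, : );  BB = B( kt+1:end, : );
norm( A * B - ( AL * BT + AR * BB ) )   % ~ 0, up to roundoff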

5.3.2 Matrix-matrix multiplication by columns * to edX

* Watch Video on edX


* Watch Video on YouTube

To arrive at a first PME for computing C := AB + C, we partition matrix B by columns:
$$B \rightarrow \begin{pmatrix} B_L & B_R \end{pmatrix}.$$
After placing this in the postcondition C = AB + Ĉ, we notice that C must be conformally partitioned, yielding
$$\begin{pmatrix} C_L & C_R \end{pmatrix} = A \begin{pmatrix} B_L & B_R \end{pmatrix} + \begin{pmatrix} \hat C_L & \hat C_R \end{pmatrix}.$$
But what we learned in the last unit is that then
$$\begin{pmatrix} C_L & C_R \end{pmatrix} = \begin{pmatrix} A B_L + \hat C_L & A B_R + \hat C_R \end{pmatrix}.$$
This is the sought-after PME:

• PME 1: ( CL | CR ) = ( A BL + ĈL | A BR + ĈR ).
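Operationally, PME 1 says that each column of C can be computed independently as a matrix-vector multiply. A minimal MATLAB sketch of that observation (ours, not the derived worksheet answer):

[ m, n ] = size( C );
for j = 1:n
    C( :, j ) = A * B( :, j ) + C( :, j );
end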

Homework 5.3.2.1 Identify two loop invariants from PME 1.


* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.2.2 Derive Variant 1, the algorithm corresponding to Invariant 1 in the answer to the last homework. Assume the algorithm "marches" through the matrix one row or column at a time (meaning you are to derive an unblocked algorithm). Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * gemm_unb_var1_ws.tex
• * GemmUnbVar1LS.mlx

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.2.3 If you feel energetic, repeat the last homework for Invariant 2.
* SEE ANSWER
* DO EXERCISE ON edX

5.3.3 Matrix-matrix multiplication by rows * to edX

* Watch Video on edX


* Watch Video on YouTube

To arrive at a second PME (PME 2) for computing C := AB + C, we partition matrix A by rows:
$$A \rightarrow \begin{pmatrix} A_T \\ A_B \end{pmatrix}.$$
After placing this in the postcondition C = AB + Ĉ, we notice that C must be conformally partitioned.

Homework 5.3.3.1 Identify a second PME (PME 2) that corresponds to the case where A is partitioned by rows.
* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

Homework 5.3.3.2 Identify two loop invariants from this second PME (PME 2). Label these Invariant 3 and
Invariant 4.
* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.3.3 Derive Variant 3, the algorithm corresponding to Invariant 3 in the answer to the last homework. Assume the algorithm "marches" through the matrix one row or column at a time (meaning you are to derive an unblocked algorithm). Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * gemm_unb_var3_ws.tex
• * GemmUnbVar3LS.mlx

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.3.4 If you feel energetic, repeat the last homework for Invariant 4.
* SEE ANSWER
* DO EXERCISE ON edX

5.3.4 Matrix-matrix multiplication via rank-1 updates * to edX

* Watch Video on edX


* Watch Video on YouTube

To arrive at the third PME for computing C := AB + C, we partition matrix A by columns:
$$A \rightarrow \begin{pmatrix} A_L & A_R \end{pmatrix}.$$
After placing this in the postcondition C = AB + Ĉ, what other matrix must be conformally partitioned?

Homework 5.3.4.1 Identify a third PME that corresponds to the case where A is partitioned by columns.
* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

Homework 5.3.4.2 Identify two loop invariants from PME 3. Label these Invariant 5 and Invariant 6.
* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.4.3 Derive Variant 5, the algorithm corresponding to Invariant 5 in the answer to the last homework. Assume the algorithm "marches" through the matrix one row or column at a time (meaning you are to derive an unblocked algorithm). Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * gemm_unb_var5_ws.tex
• * GemmUnbVar5LS.mlx

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.4.4 If you feel energetic, repeat the last homework for Invariant 6.
* SEE ANSWER
* DO EXERCISE ON edX

* Watch Video on edX


* Watch Video on YouTube

5.3.5 Blocked algorithms * to edX

* Watch Video on edX


* Watch Video on YouTube

In the discussions so far, we always advanced the algorithm one row and/or column at a time:

Variant   Step 5a                                   Step 5b
1, 2      ( BL | BR ) → ( B0 | b1 | B2 ), …         ( BL | BR ) ← ( B0 | b1 | B2 ), …
3, 4      ( AT / AB ) → ( A0 / a1^T / A2 ), …       ( AT / AB ) ← ( A0 / a1^T / A2 ), …
5, 6      ( AL | AR ) → ( A0 | a1 | A2 ), …         ( AL | AR ) ← ( A0 | a1 | A2 ), …

As will become clear in the enrichment for this week, exposing a block of columns or rows allows one to "block" for performance:

Variant   Step 5a                                   Step 5b
1, 2      ( BL | BR ) → ( B0 | B1 | B2 ), …         ( BL | BR ) ← ( B0 | B1 | B2 ), …
3, 4      ( AT / AB ) → ( A0 / A1 / A2 ), …         ( AT / AB ) ← ( A0 / A1 / A2 ), …
5, 6      ( AL | AR ) → ( A0 | A1 | A2 ), …         ( AL | AR ) ← ( A0 | A1 | A2 ), …

Such algorithms are usually referred to as “blocked algorithms,” explaining why we referred to previous algorithms
encountered in the course as “unblocked algorithms.”

Homework 5.3.5.1 Derive blocked Variants 1, 3, and 5, the algorithms corresponding to Invariants 1, 3, and 5. Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * gemm_blk_var1_ws.tex, * gemm_blk_var3_ws.tex, * gemm_blk_var5_ws.tex.
• * GemmBlkVar1LS.mlx, * GemmBlkVar3LS.mlx, * GemmBlkVar5LS.mlx.

* SEE ANSWER
* DO EXERCISE ON edX

Homework 5.3.5.2 If you feel energetic, also derive Blocked Variants 2, 4, and 6.
* SEE ANSWER
* DO EXERCISE ON edX

5.4 Symmetric matrix-matrix multiplication * to edX

5.4.1 Background * to edX

* Watch Video on edX


* Watch Video on YouTube

(Throughout: notice the parallel between this material and that for symmetric matrix-vector multiplication.)
Consider the matrix-matrix operation AB where A and B are of appropriate sizes so that this multiplication makes sense. Partition
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad\text{and}\quad B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix}.$$
Then
$$AB = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} B_T \\ B_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{TR} B_B \\ A_{BL} B_T + A_{BR} B_B \end{pmatrix}$$
provided BT and BB have the appropriate sizes for the subexpressions to be well-defined.

Recall from Unit 5.2.1 that if A is symmetric, then A = A^T. For the partitioned matrix this means that
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}^T = \begin{pmatrix} A_{TL}^T & A_{BL}^T \\ A_{TR}^T & A_{BR}^T \end{pmatrix}.$$
If ATL is square (and hence so is ABR, since A itself is), then we conclude that

• ATL^T = ATL and hence ATL is symmetric.

• ABR^T = ABR and hence ABR is symmetric.

• ATR = ABL^T and ABL = ATR^T.

Thus, for a partitioned symmetric matrix where ATL is square, one can compute with
$$\begin{pmatrix} A_{TL} & A_{BL}^T \\ A_{BL} & A_{BR} \end{pmatrix}$$
if ATR is not available (e.g., is not stored), or with
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{TR}^T & A_{BR} \end{pmatrix}$$
if ABL is not available (e.g., is not stored). In the first case,
$$AB = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} B_T \\ B_B \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{BL}^T \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} B_T \\ B_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B \\ A_{BL} B_T + A_{BR} B_B \end{pmatrix}.$$

5.4.2 Deriving the first PME and corresponding loop invariants * to edX

* Watch Video on edX


* Watch Video on YouTube

The operation we wish to implement is mathematically given by C := AB +C, where A is a symmetric matrix (and
hence square) and only the lower triangular part of matrix A can be accessed, because (for example) the strictly upper
triangular part is not stored.

Step 1: Precondition and postcondition

We are going to implicitly remember that A is symmetric and only the lower triangular part of the matrix is stored. So, in the postcondition we simply state that C = AB + Ĉ is TRUE (where Ĉ denotes the original contents of C).

Step 2: Deriving loop-invariants

Since matrix A is symmetric, we want to partition
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}$$
where ATL is square because then, by the symmetry of A, we know that

• ATL and ABR are symmetric,

• ATR = ABL^T, and

• if we partition
$$B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix} \quad\text{and}\quad C \rightarrow \begin{pmatrix} C_T \\ C_B \end{pmatrix}$$
then entering the partitioned matrices into the postcondition C = AB + Ĉ yields
$$\begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}\begin{pmatrix} B_T \\ B_B \end{pmatrix} + \begin{pmatrix} \hat C_T \\ \hat C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{TR} B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix}$$
(since ATR is not to be used).

This last observation gives us our first PME for this operation:
$$\text{PME 1:}\quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix}.$$

Homework 5.4.2.1 Create a table of all loop invariants for PME 1, discarding those for which there is no viable loop guard or initialization command. You may want to start with Figure 5.11. The entries there will help you decide what to include in the loop invariant.
* SEE ANSWER
* DO EXERCISE ON edX

5.4.3 Deriving unblocked algorithms corresponding to PME 1 * to edX

In this unit, we work out the details for Invariant 4, yielding unblocked Variant 4.

Step 3: Determining the loop guard.

The condition
$$P_{inv} \wedge \neg G \equiv \left( \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix} \right) \wedge \neg G$$
must imply that
$$R : C = AB + \hat C$$
holds. We can choose G as m(ATL) < m(A) or, equivalently because the partitioning of the matrices must be conformal, m(CT) < m(C) or m(BT) < m(B).

Step 4: Initialization.

When we derived the PME in Step 2, we decided to partition the matrices like
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix}, \quad\text{and}\quad C \rightarrow \begin{pmatrix} C_T \\ C_B \end{pmatrix}.$$
The question now is how to choose the sizes of the submatrices so that the precondition
$$C = \hat C$$
implies that the loop invariant
$$\begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix}$$
holds after the initialization (and before the loop commences). This leads us to the initialization where ATL is 0 × 0 and BT and CT have no rows.

   
PME:
$$\begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + A_{BR} B_B + \hat C_B \end{pmatrix}.$$

ATL BT   ABL^T BB   ABL BT   ABR BB   ( CT / CB ) =                                Invariant #
Yes      No         No       No      ( ATL BT + ĈT / ĈB )                          1
Yes      Yes        No       No      ( ATL BT + ABL^T BB + ĈT / ĈB )               2
Yes      No         Yes      No      ( ATL BT + ĈT / ABL BT + ĈB )                 3
Yes      Yes        Yes      No      ( ATL BT + ABL^T BB + ĈT / ABL BT + ĈB )      4
No       Yes        Yes      Yes     ( ABL^T BB + ĈT / ABL BT + ABR BB + ĈB )      5
No       Yes        No       Yes     ( ABL^T BB + ĈT / ABR BB + ĈB )               6
No       No         Yes      Yes     ( ĈT / ABL BT + ABR BB + ĈB )                 7
No       No         No       Yes     ( ĈT / ABR BB + ĈB )                          8

Figure 5.11: Table for Homework 5.4.2.1, in which to identify loop invariants for C := AB + C where A is symmetric and only its lower triangular part is stored.

Step 5: Progressing through the matrices.

We now note that, as part of the computation, ATL, BT and CT start by containing no elements and must ultimately equal all of A, B and C, respectively. Thus, as part of the loop, rows must be taken from BB and CB and added to BT and CT, respectively, and the quadrant ATL must be expanded every time the loop executes:

Step 5a:
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \rightarrow \begin{pmatrix} B_0 \\ b_1^T \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \rightarrow \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix}$$
and Step 5b:
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \leftarrow \begin{pmatrix} B_0 \\ b_1^T \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \leftarrow \begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix}.$$

Step 6: Determining the state after repartitioning.

This is where things become less straightforward. The repartitionings in Step 5a do not change the contents of C: it is an "indexing" operation. We can thus ask ourselves what the contents of C are in terms of the newly exposed parts of A, B, and C. We can derive this state, P_before, via textual substitution: the repartitionings in Step 5a imply that
$$A_{TL} = A_{00}, \quad A_{TR} = \begin{pmatrix} a_{01} & A_{02} \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix}, \quad A_{BR} = \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix},$$
$$B_T = B_0, \quad B_B = \begin{pmatrix} b_1^T \\ B_2 \end{pmatrix}, \quad C_T = C_0, \quad\text{and}\quad C_B = \begin{pmatrix} c_1^T \\ C_2 \end{pmatrix}.$$
If we substitute the expressions on the right of the equalities into the loop invariant we find that
$$\begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix}$$
becomes
$$\begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix}^T \begin{pmatrix} b_1^T \\ B_2 \end{pmatrix} + \hat C_0 \\ \begin{pmatrix} a_{10}^T \\ A_{20} \end{pmatrix} B_0 + \begin{pmatrix} \hat c_1^T \\ \hat C_2 \end{pmatrix} \end{pmatrix}$$
and hence
$$\begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \hat c_1^T \\ A_{20} B_0 + \hat C_2 \end{pmatrix}.$$

Step 7: Determining the state after moving the thick lines.

The movement of the thick lines in Step 5b means that now
$$A_{TL} = \begin{pmatrix} A_{00} & a_{01} \\ a_{10}^T & \alpha_{11} \end{pmatrix}, \quad A_{TR} = \begin{pmatrix} A_{02} \\ a_{12}^T \end{pmatrix}, \quad A_{BL} = \begin{pmatrix} A_{20} & a_{21} \end{pmatrix}, \quad A_{BR} = A_{22},$$
$$B_T = \begin{pmatrix} B_0 \\ b_1^T \end{pmatrix}, \quad B_B = B_2, \quad C_T = \begin{pmatrix} C_0 \\ c_1^T \end{pmatrix}, \quad\text{and}\quad C_B = C_2.$$
If we substitute the expressions on the right of the equalities into the loop invariant we find that
$$\begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ A_{BL} B_T + \hat C_B \end{pmatrix}$$
becomes
$$\begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} A_{00} & a_{01} \\ a_{10}^T & \alpha_{11} \end{pmatrix}\begin{pmatrix} B_0 \\ b_1^T \end{pmatrix} + \begin{pmatrix} A_{20} & a_{21} \end{pmatrix}^T B_2 + \begin{pmatrix} \hat C_0 \\ \hat c_1^T \end{pmatrix} \\ \begin{pmatrix} A_{20} & a_{21} \end{pmatrix}\begin{pmatrix} B_0 \\ b_1^T \end{pmatrix} + \hat C_2 \end{pmatrix},$$
where we recognize that due to symmetry a01 = (a10^T)^T and hence
$$\begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \alpha_{11} b_1^T + a_{21}^T B_2 + \hat c_1^T \\ A_{20} B_0 + a_{21} b_1^T + \hat C_2 \end{pmatrix}.$$

Step 8: Determining the update.

Comparing the contents in Step 6 and Step 7 now tells us that the state of C must change from
$$\begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \hat c_1^T \\ A_{20} B_0 + \hat C_2 \end{pmatrix}$$
to
$$\begin{pmatrix} C_0 \\ c_1^T \\ C_2 \end{pmatrix} = \begin{pmatrix} A_{00} B_0 + (a_{10}^T)^T b_1^T + A_{20}^T B_2 + \hat C_0 \\ a_{10}^T B_0 + \alpha_{11} b_1^T + a_{21}^T B_2 + \hat c_1^T \\ A_{20} B_0 + a_{21} b_1^T + \hat C_2 \end{pmatrix},$$
which can be accomplished by updating
$$c_1^T := \alpha_{11} b_1^T + a_{21}^T B_2 + c_1^T$$
$$C_2 := a_{21} b_1^T + C_2.$$
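These updates again translate directly into MATLAB. The following sketch (our illustration, not the book's reference code) implements unblocked Variant 4, accessing only the lower triangular part of A:

function [ C_out ] = SymmLUnbVar4( A, B, C )
% Compute C := A B + C, with A symmetric and only its lower
% triangular part stored, using the Variant 4 updates derived above.
m = size( A, 1 );
C_out = C;
for i = 1:m
    % c1^T := alpha11 b1^T + a21^T B2 + c1^T
    C_out( i, : ) = A( i, i ) * B( i, : ) ...
                  + A( i+1:m, i )' * B( i+1:m, : ) + C_out( i, : );
    % C2 := a21 b1^T + C2
    C_out( i+1:m, : ) = A( i+1:m, i ) * B( i, : ) + C_out( i+1:m, : );
end
end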

Homework 5.4.3.1 Derive as many unblocked algorithmic variants as you find useful. Some resources:

• The * blank worksheet.
• * color_flatex.tex.
• The * Spark webpage.
• * symm_l_unb_var1_ws.tex, * symm_l_unb_var2_ws.tex, * symm_l_unb_var3_ws.tex, * symm_l_unb_var4_ws.tex, * symm_l_unb_var5_ws.tex, * symm_l_unb_var6_ws.tex, * symm_l_unb_var7_ws.tex, * symm_l_unb_var8_ws.tex.
• * SymmLUnbVar1LS.mlx, * SymmLUnbVar2LS.mlx, * SymmLUnbVar3LS.mlx, * SymmLUnbVar4LS.mlx, * SymmLUnbVar5LS.mlx, * SymmLUnbVar6LS.mlx, * SymmLUnbVar7LS.mlx, * SymmLUnbVar8LS.mlx.

* SEE ANSWER
* DO EXERCISE ON edX

5.4.4 Blocked Algorithms * to edX

* Watch Video on edX


* Watch Video on YouTube

We now discuss how to derive a blocked algorithm for symmetric matrix-matrix multiplication. Such an algorithm
casts most computation in terms of matrix-matrix multiplication. If the matrix-matrix multiplication achieves high
performance, then so does the blocked symmetric matrix-matrix multiplication (for large problem sizes).

Step 1: Precondition and postcondition

Same as for the unblocked algorithms!


We are going to implicitly remember that A is symmetric and only the lower triangular part of the matrix is stored. So, in the postcondition we simply state that C = AB + Ĉ is to be computed.

Step 2: Deriving loop-invariants

Same as for the unblocked algorithms!


   
We continue with
$$\text{Invariant 1:}\quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} = \begin{pmatrix} A_{TL} B_T + \hat C_T \\ \hat C_B \end{pmatrix}.$$

Step 3: Determining the loop-guard.

Same as for the unblocked variants!

Step 4: Initialization.

Same as for the unblocked variants!


This leads us to the initialization
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad B \rightarrow \begin{pmatrix} B_T \\ B_B \end{pmatrix}, \quad C \rightarrow \begin{pmatrix} C_T \\ C_B \end{pmatrix},$$
where ATL is 0 × 0 and BT and CT have no rows.

Step 5: Progressing through the matrices.

We now note that, as part of the computation, ATL, BT and CT start by containing no elements and must ultimately equal all of A, B and C, respectively. Thus, as part of the loop, blocks of rows must be taken from BB and CB and added to BT and CT, respectively, and the quadrant ATL must be expanded every time the loop executes:

Step 5a:
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \rightarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \rightarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}$$
and Step 5b:
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \leftarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \leftarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}.$$

Step 6: Determining the state after repartitioning.

This is where things again become less straightforward. The repartitionings in Step 5a do not change the contents of C: they are an “indexing” operation. We can thus ask what the contents of C are in terms of the newly exposed parts of A, B, and C. We can derive this state, $P_{\mathrm{before}}$, via textual substitution: the repartitionings in Step 5a imply that
$$
A_{TL} = A_{00}, \quad
A_{TR} = \left( \begin{array}{c c} A_{01} & A_{02} \end{array} \right), \quad
A_{BL} = \left( \begin{array}{c} A_{10} \\ A_{20} \end{array} \right), \quad
A_{BR} = \left( \begin{array}{c c} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array} \right),
$$
$$
B_T = B_0, \quad B_B = \left( \begin{array}{c} B_1 \\ B_2 \end{array} \right), \quad
C_T = C_0, \quad \mbox{and} \quad C_B = \left( \begin{array}{c} C_1 \\ C_2 \end{array} \right).
$$
If we substitute the expressions on the right of the equalities into the loop-invariant we find that
$$
\left( \begin{array}{c} C_T \\ \hline C_B \end{array} \right)
=
\left( \begin{array}{c} A_{TL} B_T + \widehat{C}_T \\ \hline \widehat{C}_B \end{array} \right)
$$
becomes
$$
\left( \begin{array}{c} C_0 \\ \hline C_1 \\ C_2 \end{array} \right)
=
\left( \begin{array}{c} (A_{00})(B_0) + \widehat{C}_0 \\ \hline \widehat{C}_1 \\ \widehat{C}_2 \end{array} \right)
$$
and hence
$$
\left( \begin{array}{c} C_0 \\ \hline C_1 \\ C_2 \end{array} \right)
=
\left( \begin{array}{c} A_{00} B_0 + \widehat{C}_0 \\ \hline \widehat{C}_1 \\ \widehat{C}_2 \end{array} \right).
$$

Step 7: Determining the state after moving the thick lines.

The movement of the thick lines in Step 5b means that now
$$
A_{TL} = \left( \begin{array}{c c} A_{00} & A_{01} \\ A_{10} & A_{11} \end{array} \right), \quad
A_{TR} = \left( \begin{array}{c} A_{02} \\ A_{12} \end{array} \right), \quad
A_{BL} = \left( \begin{array}{c c} A_{20} & A_{21} \end{array} \right), \quad
A_{BR} = A_{22},
$$
$$
B_T = \left( \begin{array}{c} B_0 \\ B_1 \end{array} \right), \quad B_B = B_2, \quad
C_T = \left( \begin{array}{c} C_0 \\ C_1 \end{array} \right), \quad \mbox{and} \quad C_B = C_2 .
$$
If we substitute the expressions on the right of the equalities into the loop-invariant we find that
$$
\left( \begin{array}{c} C_T \\ \hline C_B \end{array} \right)
=
\left( \begin{array}{c} A_{TL} B_T + \widehat{C}_T \\ \hline \widehat{C}_B \end{array} \right)
$$
becomes
$$
\left( \begin{array}{c} C_0 \\ C_1 \\ \hline C_2 \end{array} \right)
=
\left( \begin{array}{c}
\left( \begin{array}{c c} A_{00} & A_{01} \\ A_{10} & A_{11} \end{array} \right)
\left( \begin{array}{c} B_0 \\ B_1 \end{array} \right)
+ \left( \begin{array}{c} \widehat{C}_0 \\ \widehat{C}_1 \end{array} \right) \\ \hline
\widehat{C}_2
\end{array} \right)
$$
where we recognize that due to symmetry $A_{01} = A_{10}^T$ and hence
$$
\left( \begin{array}{c} C_0 \\ C_1 \\ \hline C_2 \end{array} \right)
=
\left( \begin{array}{c}
A_{00} B_0 + A_{10}^T B_1 + \widehat{C}_0 \\
A_{10} B_0 + A_{11} B_1 + \widehat{C}_1 \\ \hline
\widehat{C}_2
\end{array} \right).
$$
b

Step 8: Determining the update.

Comparing the contents in Step 6 and Step 7 now tells us that the state of C must change from
$$
\left( \begin{array}{c} C_0 \\ \hline C_1 \\ C_2 \end{array} \right)
=
\left( \begin{array}{c} A_{00} B_0 + \widehat{C}_0 \\ \hline \widehat{C}_1 \\ \widehat{C}_2 \end{array} \right)
$$
to
$$
\left( \begin{array}{c} C_0 \\ C_1 \\ \hline C_2 \end{array} \right)
=
\left( \begin{array}{c} A_{00} B_0 + A_{10}^T B_1 + \widehat{C}_0 \\ A_{10} B_0 + A_{11} B_1 + \widehat{C}_1 \\ \hline \widehat{C}_2 \end{array} \right),
$$
which can be accomplished by updating
$$
\begin{array}{l}
C_0 := A_{10}^T B_1 + C_0 \\
C_1 := A_{10} B_0 + A_{11} B_1 + C_1 .
\end{array}
$$
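To make the blocked loop concrete, here is a minimal plain-MATLAB sketch (again with a hypothetical function name and explicit indices rather than the FLAME API), assuming only the lower triangular part of A contains valid data and a block size nb:

function C = symm_l_blk( A, B, C, nb )
% Sketch: blocked C := A * B + C with A symmetric, only tril(A) stored.
  m = size( A, 1 );
  for j = 1:nb:m
    b  = min( nb, m-j+1 );       % size of the current block
    i0 = 1:j-1;                  % indices of the 0-blocks
    i1 = j:j+b-1;                % indices of the 1-blocks
    % C0 := A10^T * B1 + C0
    C( i0, : ) = A( i1, i0 )' * B( i1, : ) + C( i0, : );
    % C1 := A10 * B0 + A11 * B1 + C1; A11 is symmetric, so fill in its
    % upper triangle explicitly so that a general multiply can be used
    A11 = tril( A( i1, i1 ) ) + tril( A( i1, i1 ), -1 )';
    C( i1, : ) = A( i1, i0 ) * B( i0, : ) + A11 * B( i1, : ) + C( i1, : );
  end
end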

Discussion

* Watch Video on edX


* Watch Video on YouTube

Let us discuss the update step


C0 := AT10 B1 +C0
C1 := A10 B0 + A11 B1 +C1
and decompose the second assignment into two steps:

C0 := AT10 B1 +C0
C1 := A10 B0 +C1
C1 := A11 B1 +C1
Let’s assume that matrices $C$ and $B$ are $m \times n$, so that $A$ is $m \times m$, and that the algorithm uses a block size of $n_b$. For convenience, assume $m$ is an integer multiple of $n_b$: let $m = M n_b$. We are going to analyze how much time is spent in each of the assignments.

Assume $A_{00}$ is $(J n_b) \times (J n_b)$ in size, for $0 \leq J \leq M-1$. Then, counting each multiply and each add as a floating point operation (flop):

• $C_0 := A_{10}^T B_1 + C_0$: This is a matrix-matrix multiply involving the $(J n_b) \times n_b$ matrix $A_{10}^T$ and the $n_b \times n$ matrix $B_1$, for $2 J n_b^2 n$ flops.

• $C_1 := A_{10} B_0 + C_1$: This is a matrix-matrix multiply involving the $n_b \times (J n_b)$ matrix $A_{10}$ and the $(J n_b) \times n$ matrix $B_0$, for $2 J n_b^2 n$ flops.

• $C_1 := A_{11} B_1 + C_1$: This is a symmetric matrix-matrix multiply involving the $n_b \times n_b$ matrix $A_{11}$ and the $n_b \times n$ matrix $B_1$, for $2 n_b^2 n$ flops.
If we aggregate this over all iterations, we get

• All $C_0 := A_{10}^T B_1 + C_0$:
$$\sum_{J=0}^{M-1} 2 J n_b^2 n \approx 2 \frac{M^2}{2} n_b^2 n = (M n_b)^2 n = m^2 n \mbox{ flops.}$$

• All $C_1 := A_{10} B_0 + C_1$:
$$\sum_{J=0}^{M-1} 2 J n_b^2 n \approx 2 \frac{M^2}{2} n_b^2 n = (M n_b)^2 n = m^2 n \mbox{ flops.}$$

• All $C_1 := A_{11} B_1 + C_1$:
$$\sum_{J=0}^{M-1} 2 n_b^2 n = 2 M n_b^2 n = 2 n_b m n \mbox{ flops.}$$
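To make these counts concrete with illustrative sizes of our own choosing: for $m = n = 1000$ and $n_b = 100$, each of the two general multiplications performs about $m^2 n = 10^9$ flops, while the symmetric subproblem performs about $2 n_b m n = 2 \times 10^8$ flops, which is under ten percent of the total.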

The point: If $n_b$ is much smaller than $m$, then most computation is being performed in the general matrix-matrix multiplications
$$
\begin{array}{l}
C_0 := A_{10}^T B_1 + C_0 \\
C_1 := A_{10} B_0 + C_1
\end{array}
$$
and a relatively small amount in the symmetric matrix-matrix multiplication
$$C_1 := A_{11} B_1 + C_1 .$$
Thus, one can use a less efficient implementation for this subproblem (for example, an unblocked algorithm). Alternatively, since $A_{11}$ is relatively small, one can create a temporary matrix $T$ into which $A_{11}$ is copied with its upper triangular part explicitly filled in, so that a general matrix-matrix multiplication can be used for this subproblem as well.
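In MATLAB-like notation, and given the blocks A11, B1, and C1 of the current iteration (with valid data only in tril(A11)), the temporary-copy trick is a two-liner:

% Fill in the upper triangle of the symmetric block explicitly,
% then fall back on a general matrix-matrix multiply.
T  = tril( A11 ) + tril( A11, -1 )';
C1 = T * B1 + C1;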

5.4.5 Other blocked algorithms * to edX

* Watch Video on edX


* Watch Video on YouTube

Homework 5.4.5.1 Derive as many blocked algorithmic variants as you find useful.
Some resources:
• The * blank worksheet.
• * color flatex.tex.

• Spark webpage.
• * symm l blk var1 ws.tex, * symm l blk var2 ws.tex,
* symm l blk var3 ws.tex, * symm l blk var4 ws.tex,
* symm l blk var5 ws.tex, * symm l blk var6 ws.tex,
* symm l blk var7 ws.tex, * symm l blk var8 ws.tex.
• * SymmLBlkVar1LS.mlx, * SymmLBlkVar2LS.mlx,
* SymmLBlkVar3LS.mlx, * SymmLBlkVar4LS.mlx,
(The rest of these are not yet available.)
* SymmLBlkVar5LS.mlx, * SymmLBlkVar6LS.mlx,
* SymmLBlkVar7LS.mlx, * SymmLBlkVar8LS.mlx.
* SEE ANSWER
* DO EXERCISE ON edX

5.4.6 A second PME * to edX


   
There is a second PME for this operation. Partition $C \rightarrow \left( \begin{array}{c | c} C_L & C_R \end{array} \right)$ and $B \rightarrow \left( \begin{array}{c | c} B_L & B_R \end{array} \right)$. Then, entering this in the postcondition $C = A B + \widehat{C}$, we find that
$$
\left( \begin{array}{c | c} C_L & C_R \end{array} \right)
=
A \left( \begin{array}{c | c} B_L & B_R \end{array} \right)
+
\left( \begin{array}{c | c} \widehat{C}_L & \widehat{C}_R \end{array} \right),
$$
yielding the second PME
$$
\mbox{PME 2:} \quad
\left( \begin{array}{c | c} C_L & C_R \end{array} \right)
=
\left( \begin{array}{c | c} A B_L + \widehat{C}_L & A B_R + \widehat{C}_R \end{array} \right).
$$

Notice that this is identical to PME 1 for general matrix-matrix multiplication in Unit 5.3.2.
The astute reader will recognize that the updates for the resulting variants cast computation in terms of a symmetric matrix-vector multiply
$$c_1 := A b_1 + c_1$$
for the unblocked algorithms and the symmetric matrix-matrix multiply
$$C_1 := A B_1 + C_1$$
for the blocked algorithms.
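For instance, here is a minimal plain-MATLAB sketch of such an unblocked variant (the function name is ours, and the symmetric matrix-vector multiply is written out so that only tril(A) is referenced):

function C = symm_l_unb_pme2( A, B, C )
% Sketch: march through the columns of B and C; each iteration performs
% c1 := A * b1 + c1 with A symmetric and only its lower triangle stored.
  L = tril( A );          % lower triangle, including the diagonal
  S = tril( A, -1 )';     % the implicitly stored strictly upper triangle
  for j = 1:size( B, 2 )
    C( :, j ) = L * B( :, j ) + S * B( :, j ) + C( :, j );
  end
end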



5.5 Enrichment * to edX

5.5.1 The memory hierarchy * to edX

* Watch Video on edX


* Watch Video on YouTube

5.5.2 The GotoBLAS matrix-matrix multiplication algorithm * to edX

* Watch Video on edX


* Watch Video on YouTube

A number of recent papers on matrix-matrix multiplication are listed below.

• Kazushige Goto, Robert A. van de Geijn. “Anatomy of high-performance matrix multiplication.” ACM Transactions on Mathematical Software (TOMS), 2008.
This paper on the GotoBLAS approach for implementing matrix-matrix multiplication is probably the most
frequently cited recent paper on high-performance matrix-matrix multiplication. It was written to be understandable by expert and novice alike.

• Field G. Van Zee, Robert A. van de Geijn. “BLIS: A Framework for Rapidly Instantiating BLAS Functionality.”
ACM Transactions on Mathematical Software (TOMS), 2015.
In this paper, the implementation of the GotoBLAS approach is refined, exposing more loops around a “micro-kernel” so that less code needs to be highly optimized.

These papers can be accessed for free from

http://www.cs.utexas.edu/~flame/web/FLAMEPublications.html

(Journal papers #11 and #39.)


We can list more reading material upon request.

5.5.3 The PME and loop invariants say it all! * to edX


In Unit 5.2.6, Figure 5.10, you may have noticed a pattern between the PME and the two loop invariants that yielded
the same update in the loop. One of those loop invariants yields an algorithm that marches through the matrix from
top-left to bottom-right while the other one marches from bottom-right to top-left.
Reversing the order in which a loop index changes (e.g., from incrementing to decrementing or vice versa) is known as loop reversal. It is only valid under some circumstances. What Figure 5.10 suggests is that there may be a relation between the PME and a loop invariant that tells us the conditions under which it is legal to reverse a loop.
How the PME and loop invariants give insight into, for example, opportunities for parallelism is first discussed in

• Tze Meng Low, Robert A. van de Geijn, Field G. Van Zee.


“Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications.”
PPoPP ’05: Proceedings of the tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005.

Look for Conference Publication #8 at



http://www.cs.utexas.edu/~flame/web/FLAMEPublications.html
for free access. This then led to a more complete treatment in the dissertation
• Tze Meng Low.
A Calculus of Loop Invariants for Dense Linear Algebra Optimization.
Ph.D. Dissertation. The University of Texas at Austin, Department of Computer Science. December 2013.
This work shows how an important complication for compilers, the phase ordering problem, can be side-stepped by looking at the PME and loop invariants.
This dissertation is also available from the same webpage.
We believe you now have the background to understand these works.

5.6 Wrap Up * to edX

5.6.1 Additional exercises * to edX


Level-3 BLAS (matrix-matrix) operations

For extra practice, the level-3 BLAS (matrix-matrix) operations are a good source. These operations involve two or
more matrices and are special cases of matrix-matrix multiplication.

GEMM .
Earlier this week, you already derived algorithms for the GEMM (general matrix-matrix multiplication) operation:

C := AB +C,

where A, B, and C are all matrices with appropriate sizes. This is a special case of the operation that is part of the
BLAS, which includes all of the following operations:

C := αAB + βC
C := αAT B + βC
C := αABT + βC
C := αAT BT + βC

(Actually, it includes even more operations if the matrices are complex valued.) The key is that matrices A and B are not to be explicitly transposed, because of the memory operations and/or extra space that this would require; a sketch of how the transpose can be avoided is given after the lists below. We suggest you ignore α and β. This then yields the unblocked algorithms/functions
• GEMM NN UNB VAR X(A, B, C) (no transpose A, no transpose B),
• GEMM TN UNB VAR X(A, B, C) (transpose A, no transpose B),
• GEMM NT UNB VAR X(A, B, C) (no transpose A, transpose B), and
• GEMM TT UNB VAR X(A, B, C) (transpose A, transpose B).
as well as the blocked algorithms/functions
• GEMM NN BLK VAR X(A, B, C) (no transpose A, no transpose B),
• GEMM TN BLK VAR X(A, B, C) (transpose A, no transpose B),
• GEMM NT BLK VAR X(A, B, C) (no transpose A, transpose B), and
• GEMM TT BLK VAR X(A, B, C) (transpose A, transpose B).
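As an illustration of avoiding the explicit transpose, here is a minimal plain-MATLAB sketch of the TN case, C := A^T B + C (the function name gemm_tn_unb is ours): instead of forming A^T, row i of A^T is accessed as column i of A.

function C = gemm_tn_unb( A, B, C )
% Sketch: C := A' * B + C without explicitly transposing A.
% A is k x m, B is k x n, C is m x n.
  m = size( A, 2 );
  for i = 1:m
    C( i, : ) = A( :, i )' * B + C( i, : );   % row i of A' is column i of A
  end
end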

SYMM .
Earlier this week we discussed C := AB + C where A is symmetric and therefore stored only in the lower triangular part of the array A. Obviously, the matrix could instead be stored in the upper triangular part. In addition, the symmetric matrix could be on the right of matrix B, as in C := BA + C. This then yields the unblocked algorithms/functions
• SYMM LL UNB VAR X(A, B, C) (left, lower triangle stored),
• SYMM LU UNB VAR X(A, B, C) (left, upper triangle stored),
• SYMM RL UNB VAR X(A, B, C) (right, lower triangle stored), and
• SYMM RU UNB VAR X(A, B, C) (right, upper triangle stored).
and blocked algorithms/functions
• SYMM LL BLK VAR X(A, B, C) (left, lower triangle stored),
• SYMM LU BLK VAR X(A, B, C) (left, upper triangle stored),
• SYMM RL BLK VAR X(A, B, C) (right, lower triangle stored), and
• SYMM RU BLK VAR X(A, B, C) (right, upper triangle stored).

SYRK.
If matrix C is symmetric, then so is the result of what is known as a symmetric rank-k update (SYRK): C := AA^T + C. In this case, only the lower or upper triangular part of C needs to be stored and updated (a sketch of updating only the lower triangle is given after the lists below). Alternatively, the rank-k update can compute with the transpose of A, yielding C := A^T A + C. The resulting unblocked algorithms/functions are then
• SYRK LN UNB VAR X(A, C) (lower triangle stored, no transpose),
• SYRK LT UNB VAR X(A, C) (lower triangle stored, transpose),
• SYRK UN UNB VAR X(A, C) (upper triangle stored, no transpose), and
• SYRK UT UNB VAR X(A, C) (upper triangle stored, transpose),
while the blocked algorithms/functions are
• SYRK LN BLK VAR X(A, C) (lower triangle stored, no transpose),
• SYRK LT BLK VAR X(A, C) (lower triangle stored, transpose),
• SYRK UN BLK VAR X(A, C) (upper triangle stored, no transpose), and
• SYRK UT BLK VAR X(A, C) (upper triangle stored, transpose).
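A minimal plain-MATLAB sketch (hypothetical function name) of the LN case, touching only the lower triangle of C:

function C = syrk_ln_unb( A, C )
% Sketch: C := A * A' + C with C symmetric; only tril(C) is updated.
% A is m x k, C is m x m.
  m = size( A, 1 );
  for j = 1:m
    % update column j of C from the diagonal down
    C( j:m, j ) = A( j:m, : ) * A( j, : )' + C( j:m, j );
  end
end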

SYR 2 K .
Similarly, if matrix C is symmetric, then so is the result of what is known as a symmetric rank-2k update (SYR2K): C := AB^T + BA^T + C. In this case, only the lower or upper triangular part of C needs to be stored and updated (a sketch is given after the lists below). Alternatively, the rank-2k update can compute with the transposes of A and B, yielding C := A^T B + B^T A + C. The resulting unblocked algorithms/functions are then
unblocked algorithms/functions are then
• SYR 2 K LN UNB VAR X(A, B, C) (lower triangle stored, no transpose),
• SYR 2 K LT UNB VAR X(A, B, C) (lower triangle stored, transpose),
• SYR 2 K UN UNB VAR X(A, B, C) (upper triangle stored, no transpose),
• SYR 2 K UT UNB VAR X(A, B, C) (upper triangle stored, transpose),

and the blocked ones are


• SYR 2 K LN BLK VAR X(A, B, C) (lower triangle stored, no transpose),
• SYR 2 K LT BLK VAR X(A, B, C) (lower triangle stored, transpose),

• SYR 2 K UN BLK VAR X(A, B, C) (upper triangle stored, no transpose),


• SYR 2 K UT BLK VAR X(A, B, C) (upper triangle stored, transpose).
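A minimal plain-MATLAB sketch (hypothetical function name) of the LN case, again touching only the lower triangle of C:

function C = syr2k_ln_unb( A, B, C )
% Sketch: C := A * B' + B * A' + C with C symmetric; only tril(C) updated.
% A and B are m x k, C is m x m.
  m = size( A, 1 );
  for j = 1:m
    C( j:m, j ) = A( j:m, : ) * B( j, : )' + B( j:m, : ) * A( j, : )' + C( j:m, j );
  end
end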
You may want to consider deriving algorithms for this operation after Week 6, since there is similarity with operations discussed there.

TRMM .
Another special case of matrix-matrix multiplication is given by B := AB, where A is (lower or upper) triangular.
It turns out that the output can overwrite the input matrix B if the computation is carefully ordered (a sketch of one such ordering is given after the list below). Alternatively,
the triangular matrix can be to the right of B. Finally, A can be optionally transposed and/or have a unit or nonunit
diagonal.

B := LB
B := LT B
B := UB
B := U T B

where L is a lower triangular (possibly implicitly unit lower triangular) matrix and U is an upper triangular (possibly
implicitly unit upper triangular) matrix. This then yields the algorithms/functions
• TRMM LLNN UNB VAR X(L, B) where LLNN stands for left, lower triangular, no transpose, nonunit diagonal,
• TRMM RLNN UNB VAR X(L, B) where RLNN stands for right, lower triangular, no transpose, nonunit diagonal,
• and so forth.
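To illustrate why careful ordering lets B be overwritten, here is a minimal plain-MATLAB sketch (hypothetical function name) of the LLNN case, B := LB: rows of B are computed from the bottom up, so row i is formed from rows 1 through i of the original B, none of which have been overwritten yet.

function B = trmm_llnn_unb( L, B )
% Sketch: B := L * B in place, with L lower triangular (nonunit diagonal).
  m = size( B, 1 );
  for i = m:-1:1
    B( i, : ) = L( i, 1:i ) * B( 1:i, : );
  end
end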

TRSM .
The final matrix-matrix operation solves AX = B where A is triangular, and the solution X overwrites B. This is
known as triangular solve with multiple right-hand sides. We discuss this operation in detail in Week 6.

5.6.2 Summary * to edX


This week, we applied the techniques from the previous weeks to more advanced problems. Not much to summarize!
