The Science of Programming Matrix Computations
Robert A. van de Geijn
The University of Texas at Austin
Enrique S. Quintana-Ortí
Universidad Jaume I
Copyright © 2007 by Robert A. van de Geijn and Enrique S. Quintana-Ortí.
10 9 8 7 6 5 4 3 2 1
All rights reserved. No part of this book may be reproduced, stored, or transmitted in any manner without the
written permission of the publisher. For information, contact either of the authors.
No warranties, express or implied, are made by the publisher, authors, and their employers that the programs
contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem
whose incorrect solution could result in injury to person or property. If the programs are employed in such a
manner, it is at the user’s own risk and the publisher, authors, and their employers disclaim all liability for such
misuse.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are
used in an editorial context only; no infringement of trademark is intended.
Contents

List of Contributors
Preface
1 Motivation
  1.1 A Motivating Example: the LU Factorization
  1.2 Notation
  1.3 Algorithmic Variants
  1.4 Presenting Algorithms in Code
  1.5 High Performance and Blocked Algorithms
  1.6 Numerical Stability
3 Matrix-Vector Operations
  3.1 Notation

List of Contributors
A large number of people have contributed, and continue to contribute, to the FLAME project. For a complete list, please visit http://www.cs.utexas.edu/users/flame/.
Below we list the people who have contributed directly to the knowledge and understanding that is summarized
in this text.
Paolo Bientinesi
Ernie Chan
Kazushige Goto
John A. Gunnels
Margaret E. Myers
Gregorio Quintana-Ortí
Preface
The only effective way to raise the confidence level of a program significantly is
to give a convincing proof of its correctness. But one should not first make the
program and then prove its correctness, because then the requirement of providing
the proof would only increase the poor programmer’s burden. On the contrary: the
programmer should let correctness proof and program grow hand in hand.
– E.W. Dijkstra
This book shows how to put the above words of wisdom into practice when programming algorithms for dense
linear algebra operations.
Programming as a Science
One definition of science is knowledge that has been reduced to a system. In this book we show how for a broad
class of matrix operations the derivation and implementation of algorithms can be made systematic.
Notation
Traditionally, algorithms in this area have been expressed by explicitly exposing the indexing of elements in
matrices and vectors. It is not clear whether this has its roots in how matrix operations were originally coded
in languages like Fortran77 or whether it was because the algorithms could be more concisely stated, something
that may have been important in the days when the typesetting of mathematics was time-consuming and the
printing of mathematical books expensive.
The notation adopted in this book attempts to capture the pictures of matrices and vectors that often
accompany the explanation of an algorithm. Such a picture typically does not expose indexing. Rather, it
captures regions (submatrices and subvectors) that have been, or are to be, updated in a consistent manner.
Similarly, our notation identifies regions in matrices and vectors, hiding indexing details. While algorithms so
expressed require more space on a page, we believe the notation improves the understanding of the algorithm
as well as the opportunity for comparing and contrasting different algorithms that compute the same operation.
Goal-Oriented Programming
The new notation and the APIs for representing the algorithms in code set the stage for growing proof of
correctness and program hand-in-hand, as advocated by Dijkstra. For reasons that will become clear, high-
performance algorithms for computing matrix operations must inherently involve a loop. The key to developing
a loop is the ability to express the state (contents) of the variables, being updated by the loop, before and after
each iteration of the loop. It is the new notation that allows one to concisely express this state, which is called
the loop-invariant in computer science. Equally importantly, the new notation allows one to systematically
identify all reasonable states that can be maintained by a loop that computes the desired matrix operation. As
a result, the derivation of loops for computing matrix operations becomes systematic, allowing hand-in-hand
development of multiple algorithms and their proof of correctness.
High Performance
The scientific computing community insists on attaining high performance on whatever architectures are the
state-of-the-art. The reason is that there is always interest in solving larger problems and computation time
is often the limiting factor. The second half of the book demonstrates that the formal derivation methodol-
ogy facilitates high performance. The key insight is that the matrix-matrix product operation can inherently
achieve high performance, and that most computation-intensive matrix operations can be arranged so that more
computation involves matrix-matrix multiplication.
Intended Audience
This book is in part a portal for accessing research papers, tools, and libraries that were and will be developed
as part of the Formal Linear Algebra Methods Environment (FLAME) project that is being pursued by re-
searchers at The University of Texas at Austin and other institutions. The basic knowledge behind the FLAME
methodology is presented in a way that makes it accessible to novices (e.g., undergraduate students with a
limited background in linear algebra and high-performance computing). However, the approach has been used
to produce state-of-the-art high-performance linear algebra libraries, making the book equally interesting to
experienced researchers and the practicing professional.
The audience of this book extends beyond those interested in the domain of linear algebra algorithms. It
is of interest to students and scholars with interests in the theory of computing since it shows how to make
the formal derivation of algorithms practical. It is of interest to the compiler community because the notation
and APIs present programs at a much higher level of abstraction than traditional code does, which creates new
opportunities for compiler optimizations. It is of interest to the scientific computing community since it shows
how to develop routines for a matrix operation when that matrix operation is not supported by an existing
library. It is of interest to the architecture community since it shows how algorithms and architectures interact.
It is of interest to the generative programming community, since the systematic approach to deriving algorithms
supports the mechanical derivation of algorithms and implementations.
Related Materials
While this book is meant to be self-contained, it may be desirable to use it in conjunction with books and texts
that focus on various other topics related to linear algebra. A brief list follows.
• Gilbert Strang. Linear Algebra and its Applications, Third Edition. Academic Press, 1988.
Discusses the mathematics of linear algebra at a level appropriate for undergraduates.
• Gene H. Golub and Charles F. Van Loan. Matrix Computations, Third Edition. The Johns Hopkins
University Press, 1996.
Advanced text that is best used as a reference or as a text for a class with a more advanced treatment of
the topics.
• Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. SIAM, 2002.
An advanced book on the numerical analysis of linear algebra algorithms.
In addition, we recommend the following manuscripts for those who want to learn more about formal verification
and derivation of programs.
• Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.
Department of Computer Sciences, University of Texas at Austin, August 2006.
Chapter 2 of this dissertation systematically justifies the structure of the worksheet and explains in even
more detail how it relates to formal derivation methods. It shows how the FLAME methodology can be
made mechanical and how it enables the systematic stability analysis of the algorithms that are derived.
We highly recommend reading this dissertation upon finishing this text.
Since formatting the algorithms takes center stage in our approach, we recommend the classic reference for the LaTeX document preparation system:
• Leslie Lamport. LaTeX: A Document Preparation System, Second Edition. Addison-Wesley Publishing Company, Inc., 1994.
User's guide and reference manual for typesetting with LaTeX.
Webpages
A companion webpage has been created for this book. The base address is
http://www.cs.utexas.edu/users/flame/books/TSoPMC/
(TSoPMC: The Science of Programming Matrix Computations). In the text the above path will be referred to
as $BASE/. On these webpages we have posted errata, additional materials, hints for the exercises, tools, and
software. We suggest the reader visit this website at this time to become familiar with its structure and content.
Wiki: www.linearalgebrawiki.org
Many examples of operations, algorithms, derivations, and implementations similar to those discussed in this
book can be found at
http://www.linearalgebrawiki.org/
Why www.lulu.com?
We considered publishing this book through more conventional channels. Indeed three major publishers of
technical books offered to publish it (and two politely declined). The problem, however, is that the cost of
textbooks has spiralled out of control and, given that we envision this book primarily as a reference and a
supplemental text, we could not see ourselves adding to the strain this places on students. By publishing it
ourselves through www.lulu.com, we have reduced the cost of a copy to a level where it is hardly worth printing
it oneself. Since we retain all rights to the material, we may or may not publish future editions the same way,
or through a conventional publisher.
Please visit $BASE/books/TSoPMC/ for details on how to purchase this book.
Acknowledgments
This research was partially sponsored by NSF grants ACI-0305163, CCF-0342369, CCF-0540926, and CCF-
0702714. Additional support came from the J. Tinsley Oden Faculty Fellowship Research Program of the
Institute for Computational Engineering and Sciences (ICES) at UT-Austin, a donation by Dr. James Truchard
(President, CEO, and Co-Founder of National Instruments), and an unrestricted grant from NEC Solutions
(America), Inc.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.
Chapter 1
Motivation
Programming high-performance routines for computing linear algebra operations has long been a fine art. In
this book we show it to be a science by exposing a systematic approach that, given an operation, yields high-
performance implementations for computing it. The methodology builds upon a new notation for expressing
algorithms, new advances regarding the formal derivation of linear algebra algorithms, a new style of coding,
and the use of high-performance implementations of a few key linear algebra operations. In this chapter we
preview the approach.
Don’t Panic: A reader who is less well-versed in linear algebra should not feel intimidated by this chapter:
It is meant to demonstrate to more experienced readers that there is substance to the book. In Chapter 2, we
start over, more slowly. Indeed, a novice may wish to skip Chapter 1, and return to it later.
Figure 1.1: Left: Typical explanation of an algorithm for computing the LU factorization, overwriting A with
L and U . Right: Same algorithm, using our notation.
matrices L and U overwrite the lower and upper triangular parts of A, respectively, and the diagonal of L is not
stored, since all its entries equal one. To show how the algorithm sweeps through the matrix the explanation is
often accompanied by the pictures in Figure 1.2 (left). The thick lines in that figure track the progress through
matrix A as it is updated.
1.2 Notation
In this book, we have adopted a nontraditional notation that captures the pictures that often accompany
the explanation of an algorithm. This notation was developed as part of our Formal Linear Algebra Methods
Environment (FLAME) project [16, 3]. We will show that it facilitates a style of programming that allows the
algorithm to be captured in code as well as the systematic derivation of algorithms [4].
In Figure 1.1(right), we illustrate how the notation is used to express the LU factorization algorithm so
that it reflects the pictures in Figure 1.2. For added clarification we point to Figure 1.2 (right). The next few
chapters will explain the notation in detail, so that for now we leave it to the intuition of the reader.
[Figure 1.2 pictures omitted: for each iteration they show the quadrants ATL, ATR, ABL, ABR of A ("done" and "partially updated" regions), the repartitioning that exposes A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22, and the updates υ11 := α11, u12^T := a12^T, l21 := a21/υ11, and A22 := A22 − l21 u12^T.]
Figure 1.2: Progression of pictures that explain the LU factorization algorithm. Left: As typically presented.
Right: Annotated with labels to explain the notation in Fig. 1.1(right).
Variant 1 (unblocked):  a01 := L00^{-1} a01 (trsv);  a10^T := a10^T U00^{-1} (trsv);  α11 := α11 − a10^T a01 (apdot)
Variant 1 (blocked):    A01 := L00^{-1} A01 (trsm);  A10 := A10 U00^{-1} (trsm);  A11 := LU_unb(A11 − A10 A01) (gemm, LU)

Variant 2 (unblocked):  a10^T := a10^T U00^{-1} (trsv);  α11 := α11 − a10^T a01 (apdot);  a12^T := a12^T − a10^T A02 (gemv)
Variant 2 (blocked):    A10 := A10 U00^{-1} (trsm);  A11 := LU_unb(A11 − A10 A01) (gemm, LU);  A12 := L11^{-1}(A12 − A10 A02) (gemm, trsm)

Variant 3 (unblocked):  a01 := L00^{-1} a01 (trsv);  α11 := α11 − a10^T a01 (apdot);  a21 := (a21 − A20 a01)/α11 (gemv, invscal)
Variant 3 (blocked):    A01 := L00^{-1} A01 (trsm);  A11 := LU_unb(A11 − A10 A01) (gemm, LU);  A21 := (A21 − A20 A01) U11^{-1} (gemm, trsm)

Variant 4 (unblocked):  α11 := α11 − a10^T a01 (apdot);  a21 := (a21 − A20 a01)/α11 (gemv, invscal);  a12^T := a12^T − a10^T A02 (gemv)
Variant 4 (blocked):    A11 := LU_unb(A11 − A10 A01) (gemm, LU);  A21 := (A21 − A20 A01) U11^{-1} (gemm, trsm);  A12 := L11^{-1}(A12 − A10 A02) (gemm, trsm)

Variant 5 (unblocked):  a21 := a21/α11 (invscal);  A22 := A22 − a21 a12^T (ger)
Variant 5 (blocked):    A11 := LU_unb(A11) (LU);  A21 := A21 U11^{-1} (trsm);  A12 := L11^{-1} A12 (trsm);  A22 := A22 − A21 A12 (gemm)

Figure 1.3: Multiple algorithms for computing the LU factorization. Matrices Lii and Uii, i = 0, 1, 2, denote, respectively, the unit lower triangular matrices and upper triangular matrices stored over the corresponding Aii. Expressions involving Lii^{-1} and Uii^{-1} indicate the need to solve a triangular linear system.
Exercise 1.1 Typesetting algorithms like those in Figure 1.1 (right) may seem somewhat intimidating. We have created a webpage that helps generate the LaTeX source as well as a set of LaTeX commands (FLaTeX). Visit $BASE/Chapter1 and follow the directions on the webpage associated with this exercise to try out the tools.
…functional programming languages such as Haskell and Mathematica; and the LabView G graphical programming language.
Exercise 1.2 The formatting in Figure 1.4 is meant to resemble, as closely as possible, the algorithm in Figure 1.1(right). The same webpage that helps generate LaTeX source can also generate an outline for the code. Visit $BASE/Chapter1 and duplicate the code in Figure 1.4 by following the directions on the webpage associated with this exercise.
[Figure 1.5 plot omitted: GFLOPS/sec. versus matrix dimension n (0 to 1500) for a reference implementation and for the unblocked (unb_var1–unb_var5) and blocked (blk_var1–blk_var5) variants.]
Figure 1.5: Performance of unblocked and blocked algorithmic variants for computing the LU factorization.
The key insight is that cache-based architectures, as are currently popular, perform floating-point operations
(flops) at very fast rates, but fetch data from memory at a (relatively) much slower rate. For operations
like the matrix-matrix multiplication (gemm) this memory bandwidth bottleneck can be overcome by moving
submatrices into the processor cache(s) and amortizing this overhead over a large number of flops. This is
facilitated by the fact that gemm involves a number of operations of cubic order on an amount of data that is
of quadratic order. Details of how high performance can be attained for gemm are exposed in Chapter 5.
Given a high-performance implementation of gemm, other operations can attain high performance if the
bulk of the computation can be cast in terms of gemm. This is a property of blocked algorithms. Figure 1.3
(right) displays blocked algorithms for the different algorithmic variants that compute the LU factorization. We
will show that the derivation of blocked algorithms is typically no more complex than the derivation of their
unblocked counterparts.
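To make the contrast concrete, below is a minimal Python/NumPy sketch of the right-looking algorithm (Variant 5 in Figure 1.3) in unblocked and blocked form. This is an illustration only, not the book's FLAME code: the names lu_unb and lu_blk are ours, pivoting is omitted, and explicit inverses stand in for the triangular solves (trsm) a real implementation would use.

    import numpy as np

    def lu_unb(A):
        # Unblocked LU (Variant 5): overwrite A with L (unit lower) and U; no pivoting.
        n = A.shape[0]
        for k in range(n):
            A[k+1:, k] /= A[k, k]                                  # a21 := a21 / alpha11
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])      # A22 := A22 - a21 a12^T (ger)
        return A

    def lu_blk(A, nb=128):
        # Blocked LU (Variant 5): the bulk of the flops is in the final gemm update.
        n = A.shape[0]
        for k in range(0, n, nb):
            b = min(nb, n - k)
            lu_unb(A[k:k+b, k:k+b])                                # A11 := LU_unb(A11)
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            U11 = np.triu(A[k:k+b, k:k+b])
            A[k+b:, k:k+b] = A[k+b:, k:k+b] @ np.linalg.inv(U11)   # A21 := A21 U11^{-1}   (trsm)
            A[k:k+b, k+b:] = np.linalg.inv(L11) @ A[k:k+b, k+b:]   # A12 := L11^{-1} A12   (trsm)
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]       # A22 := A22 - A21 A12  (gemm)
        return A

For a diagonally dominant test matrix A (so that no pivoting is needed), one can check that np.tril(B, -1) + np.eye(n) multiplied by np.triu(B) reproduces A, where B = lu_blk(A.copy()).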
The performance of a code (an implementation of an algorithm) is often measured in terms of the rate at
which flops are performed. The maximal rate that can be attained by a target architecture is given by the
product of the clock rate of the processor times the number of flops that are performed per clock cycle. The
rate of computation for a code is computed by dividing the number of flops required to compute the operation
by the time it takes for it to be computed. A gigaflops (or GFLOPS) indicates a billion flops per second. Thus,
an implementation that computes an operation that requires f flops in t seconds attains a rate of
  (f / t) × 10^{−9} GFLOPS.
Throughout the book we discuss how to compute the cost, in flops, of an algorithm.
The number of flops performed by an LU factorization is about (2/3)n^3, where n is the matrix dimension. In
Figure 1.5 we show the performance attained by implementations of the different algorithms in Figure 1.3 on
an Intel® Pentium® 4 workstation. The clock speed of the particular machine is 1.4 GHz and a Pentium 4
can perform two flops per clock cycle, for a peak performance of 2.8 GFLOPS, which marks the top line in the
graph. The block size nb was taken to equal 128. (We will eventually discuss how to determine a near-optimal
block size.) Note that blocked algorithms attain much better performance than unblocked algorithms and that
not all algorithmic variants attain the same performance.
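As an illustration of how such a rate is measured, the following sketch times the lu_blk routine from the hedged example above (assumed to be in scope) and reports the attained GFLOPS using the (2/3)n^3 flop count; it is not the driver that produced Figure 1.5.

    import time
    import numpy as np

    n = 1500
    A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant, so no pivoting is needed

    start = time.perf_counter()
    lu_blk(A, nb=128)                          # lu_blk as sketched earlier (an assumption, not a library routine)
    t = time.perf_counter() - start

    flops = (2.0 / 3.0) * n**3                 # approximate flop count of the LU factorization
    print(f"{flops / t * 1e-9:.2f} GFLOPS")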
In Chapter 5 it will become clear why we favor loop-based algorithms over recursive algorithms, and how
recursion does enter the picture.
Chapter 2

Derivation of Linear Algebra Algorithms

This chapter introduces the reader to the systematic derivation of algorithms for linear algebra operations.
Through a very simple example we illustrate the core ideas: We describe the notation we will use to ex-
press algorithms; we show how assertions can be used to establish correctness; and we propose a goal-oriented
methodology for the derivation of algorithms. We also discuss how to incorporate an analysis of the cost into
the algorithm.
Finally, we show how to translate algorithms to code so that the correctness of the algorithm implies the
correctness of the implementation.
2.1 A Farewell to Indices

In this section, we introduce a notation for expressing algorithms that avoids the pitfalls of intricate indexing
and will allow us to more easily derive, express, and implement algorithms. We present the notation through
a simple example, the inner product of two vectors, an operation that will be used throughout this chapter for
illustration.
Given two vectors, x and y, of length m, the inner product or dot product (dot) of these vectors is given by
  α := x^T y = Σ_{i=0}^{m−1} χ_i ψ_i .
Algorithm: α := apdot(x, y, α)
  Partition x → [xT; xB], y → [yT; yB]
    where xT and yT have 0 elements
  while m(xT) < m(x) do
    Repartition
      [xT; xB] → [x0; χ1; x2], [yT; yB] → [y0; ψ1; y2]
      where χ1 and ψ1 are scalars
    α := χ1 ψ1 + α
    Continue with
      [xT; xB] ← [x0; χ1; x2], [yT; yB] ← [y0; ψ1; y2]
  endwhile

Figure 2.1: Algorithm for computing α := x^T y + α (apdot).
Remark 2.1 We will use the symbol “:=” (“becomes”) to denote assignment while the symbol “=” is reserved
for equality.
Example 2.2 Let x = (1, 2, 3)^T and y = (2, 4, 1)^T. Then x^T y = 1 · 2 + 2 · 4 + 3 · 1 = 13. Here we make use of the symbol "·" to denote the arithmetic product.
A traditional loop for implementing the updating of a scalar by adding a dot product to it, α := x^T y + α,
is given by
k := 0
while k < m do
α := χk ψk + α
k := k + 1
endwhile
Our notation presents this loop as in Figure 2.1. The name of the algorithm in that figure reflects that it
performs an alpha plus dot product (apdot). To interpret the algorithm in Figure 2.1 note the following:
• We bid farewell to intricate indexing: In this example only indices from the sets {T, B} (Top and Bottom)
and {0, 1, 2} are required.
• Each vector has been subdivided into two subvectors, separated by thick lines. This is how we will
represent systematic movement through vectors (and later matrices).
• Subvectors xT and yT include the “top” elements of x and y that, in this algorithm, have already been
used to compute a partial update to α. Similarly, subvectors xB and yB include the “bottom” elements
of x and y that, in this algorithm, have not yet been used to update α. Referring back to the traditional
loop, xT and yT consist of elements 0, . . . , k − 1 and xB and yB consist of elements k, . . . , m − 1:
  xT = (χ0, . . . , χk−1)^T,  xB = (χk, . . . , χm−1)^T   and   yT = (ψ0, . . . , ψk−1)^T,  yB = (ψk, . . . , ψm−1)^T.
• The loop is executed as long as m(xT ) < m(x) is true, which takes the place of k < m in the traditional
loop. Here m(x) equals the length of vector x so that the loop terminates when xT includes all elements
of x.
• The statement
Repartition
  [xT; xB] → [x0; χ1; x2], [yT; yB] → [y0; ψ1; y2]
  where χ1 and ψ1 are scalars
exposes the top elements of xB and yB , χ1 and ψ1 respectively, which were χk and ψk in the traditional
loop.
• The statement α := χ1 ψ1 + α updates α; in the traditional loop this was the update α := χk ψk + α.
Remark 2.3 It is important not to confuse the single elements exposed in our repartitionings, such as χ1 or
ψ1 , with the second entries of corresponding vectors.
• The statement
Continue with
  [xT; xB] ← [x0; χ1; x2], [yT; yB] ← [y0; ψ1; y2]
moves the top elements of xB and yB to xT and yT , respectively. This means that these elements have
now been used to update α and should therefore be added to xT and yT .
k := m − 1
while k ≥ 0 do
α := χk ψk + α
k := k − 1
endwhile
Modify the algorithm in Figure 2.1 so that it expresses this alternative algorithmic variant. Typeset the resulting
algorithm.
This theorem can be interpreted as follows. Assume that the predicate Pinv holds before and after the
loop-body. Then, if Pinv holds before the loop, obviously it will also hold before the loop-body. The commands
in the loop-body are such that it holds again after the loop-body, which means that it will again be true before
the loop-body in the next iteration. We conclude that it will be true before and after the loop-body every time
through the loop. When G becomes false, Pinv will still be true, and therefore it will also be true after the loop completes (provided the loop can be shown to terminate). Thus we can assert that Pinv ∧ ¬G holds after the completion of the loop, where the symbol "¬" denotes the logical negation. This can be summarized by
  {Pinv}
  while G do
    {Pinv ∧ G}          {Pinv ∧ G}
    S              =⇒   S
    {Pinv}              {Pinv}
  endwhile
  {Pinv ∧ ¬G}
if the loop can be shown to terminate. Here =⇒ stands for “implies”. The assertion Pinv is called the
loop-invariant for this loop.
Let us again consider the computation α := x^T y + α. Let us use α̂ to denote the original contents (or state) of α. Then we define the precondition for the algorithm as
  Ppre : α = α̂ ∧ 0 ≤ m(x) = m(y).
The assertion
  Pinv : (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x))
is a loop-invariant as it holds
1. immediately before the loop (by the initialization and the definition of xT^T yT = 0 when m(xT) = m(yT) = 0),
Now, it is also easy to argue that the loop terminates so that, by the Fundamental Invariance Theorem,
{Pinv ∧ ¬G} holds after termination. Therefore,
  Pinv ∧ ¬G ≡ (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x)) ∧ ¬(m(xT) < m(x))
            =⇒ (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x)) ∧ (m(xT) ≥ m(x))
            =⇒ (α = xT^T yT + α̂) ∧ (m(xT) = m(yT) = m(x))
            =⇒ (α = x^T y + α̂),
since xT and yT are subvectors of x and y, and therefore m(xT) = m(yT) = m(x) implies that xT = x and yT = y. Thus we can claim that the algorithm correctly computes α := x^T y + α.
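The role of the loop-invariant can also be checked numerically. Below is a small Python sketch, an illustration only and not part of the book's tools, that runs the traditional loop for α := x^T y + α and asserts that the invariant α = xT^T yT + α̂ (with xT = x[0:k] and yT = y[0:k]) holds before the loop, at the top and bottom of each iteration, and after the loop.

    import numpy as np

    def apdot_checked(x, y, alpha):
        # Compute alpha := x^T y + alpha while asserting the loop-invariant at every step.
        alpha_hat = alpha                    # original contents of alpha (needed only for the assertions)
        m = len(x)
        k = 0
        assert np.isclose(alpha, x[:k] @ y[:k] + alpha_hat)       # invariant holds before the loop
        while k < m:                                              # loop-guard G : m(xT) < m(x)
            assert np.isclose(alpha, x[:k] @ y[:k] + alpha_hat)   # invariant at the top of the loop-body
            alpha = x[k] * y[k] + alpha                           # update: alpha := chi_1 psi_1 + alpha
            k = k + 1                                             # move the boundary down by one element
            assert np.isclose(alpha, x[:k] @ y[:k] + alpha_hat)   # invariant at the bottom of the loop-body
        assert np.isclose(alpha, x @ y + alpha_hat)               # Pinv and (not G) imply the postcondition
        return alpha

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 1.0])
    print(apdot_checked(x, y, 0.0))          # prints 13.0, matching Example 2.2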
Step 1: Specifying the precondition and postcondition. The statement of the operation to be performed, α := x^T y + α, dictates the precondition and postcondition indicated in Steps 1a and 1b. The precondition is given by
  Ppre : α = α̂ ∧ 0 ≤ m(x) = m(y),
and the postcondition is
  Ppost : α = x^T y + α̂.

Step 2: Determining loop-invariants. Partitioning x → [xT; xB] and y → [yT; yB], with m(xT) = m(yT), and substituting into the postcondition yields
  α = xT^T yT + xB^T yB + α̂.
This partitioned matrix expression (PME) expresses the final value of α in terms of its original value and the partitioned vectors.
Remark 2.9 The partitioned matrix expression (PME) is obtained by substitution of the partitioned operands
into the postcondition.
Now, at an intermediate iteration of the loop, α does not contain its final value. Rather, it contains some
partial result towards that final result. This partial result should be reflected in the loop-invariant. One such
intermediate state is given by
  Pinv : (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x)),
which we note is exactly the loop-invariant that we used to prove the algorithm correct in Figure 2.2.
Remark 2.10 Once it is decided how to partition vectors and matrices into regions that have been updated
and/or used in a consistent fashion, loop-invariants can be systematically determined a priori.
Step 3: Choosing a loop-guard. Upon completion of the loop, the loop-invariant is true and the loop-guard G is false, so that the condition Pinv ∧ ¬G must imply that "Ppost : α = x^T y + α̂" holds. If xT and yT equal all of x and y, respectively, then the loop-invariant implies the postcondition. The choice "G : m(xT) < m(x)" satisfies the desired condition, since Pinv ∧ ¬G implies that m(xT) = m(x), as xT must be a subvector of x, and
  ((α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x))) ∧ (m(xT) ≥ m(x))
    =⇒ α = x^T y + α̂,
as was already argued in the previous section. This loop-guard is entered in Step 3 in Figure 2.2.
Remark 2.11 The loop-invariant and the postcondition together prescribe a (non-unique) loop-guard G.
Step 4: Initialization. The partitioning
  x → [xT; xB], y → [yT; yB],
where xT and yT have no elements, places variables α, xT, xB, yT, and yB in a state where the loop-invariant is satisfied. This initialization appears in Step 4 in Figure 2.2.
Remark 2.12 The loop-invariant and the precondition together prescribe the initialization.
Step 5: Progressing through the vectors. We now note that, as part of the computation, xT and yT start
by containing no elements and must ultimately equal all of x and y, respectively. Thus, as part of the loop,
elements must be taken from xB and yB and must be added to xT and yT , respectively. This is denoted in
Figure 2.2 by the statements
Repartition
  [xT; xB] → [x0; χ1; x2], [yT; yB] → [y0; ψ1; y2]
  where χ1 and ψ1 are scalars,
and
Continue with
  [xT; xB] ← [x0; χ1; x2], [yT; yB] ← [y0; ψ1; y2].
This notation simply captures the movement of χ1 , the top element of xB , from xB to xT . Similarly ψ1 moves
from yB to yT . The movement through the vectors guarantees that the loop eventually terminates, which is
one condition required for the Fundamental Invariance Theorem to apply.
Remark 2.13 The initialization and the loop-guard together prescribe the movement through the vectors.
Step 6: Determining the state after repartitioning. The repartitionings in Step 5a do not change the
contents of α: it is an “indexing” operation. We can thus ask ourselves the question of what the contents of
α are in terms of the exposed parts of x and y. We can derive this state, Pbefore , via textual substitution: The
repartitionings in Step 5a imply that
  xT = x0, xB = [χ1; x2]   and   yT = y0, yB = [ψ1; y2].
If we substitute the expressions on the right of the equalities into the loop-invariant we find that
  α = xT^T yT + α̂
implies that
  α = x0^T y0 + α̂,
which is entered in Step 6 in Figure 2.2.
Step 7: Determining the state after moving the thick lines. The movement of the thick lines in Step 5b
means that now
µ ¶ µ ¶
x0 y0
xT = yT =
χ1 and ψ1 ,
xB = x2 yB = y2
so that
α = xT
T yT + α̂
implies that µ ¶ µ ¶
x0 T y0
α= + α̂ = xT
0 y0 + χ1 ψ1 + α̂,
χ1 ψ1
| {z } | {z }
xT yT
which is then entered as state Pafter in Step 7 in Figure 2.2.
Remark 2.15 The state in Step 7 is determined via textual substitution and the application of the rules of
linear algebra.
Step 8: Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the state
of α must change from
  Pbefore : α = x0^T y0 + α̂
to
  Pafter : α = (x0^T y0 + α̂) + χ1 ψ1   (the term x0^T y0 + α̂ is already in α),
which can be accomplished by updating α as
  α := χ1 ψ1 + α.
Remark 2.16 It is not the case that α̂ (the original contents of α) must be saved, and that the update α := x0^T y0 + χ1 ψ1 + α̂ must be performed. Since α already contains x0^T y0 + α̂, only χ1 ψ1 needs to be added. Thus, α̂ is only needed to be able to reason about the correctness of the algorithm.
Final algorithm. Finally, we note that all the annotations (in the grey boxes) in Figure 2.2 were only introduced
to derive the statements of the algorithm. Deleting these produces the algorithm already stated in Figure 2.1.
Exercise 2.17 Reproduce Figure 2.2 by visiting $BASE/Chapter2/ and following the directions associated with
this exercise.
To assist in the typesetting, some LaTeX macros, from the FLAME-LaTeX (FLaTeX) API, are collected in Figure 2.3.
2.5 Cost Analysis

The cost of the apdot algorithm is given by
  Σ_{k=0}^{m−1} 2 = 2m flops,    (2.1)
where m = m(x).
Let us examine how one would prove the equality in (2.1). There are two approaches: one is to say “well,
that is obvious” while the other proves it rigorously via mathematical induction:
  \FlaOneByThreeR{x_0}{x_1}{x_2}   →   ( x0 | x1 | x2 )
  \FlaOneByThreeL{x_0}{x_1}{x_2}   →   ( x0 | x1 | x2 )
  \FlaTwoByTwo{A_{TL}}{A_{TR}}{A_{BL}}{A_{BR}}   →   2 × 2 partitioning into the quadrants ATL, ATR, ABL, ABR
  \FlaThreeByThreeBR{A_{00}}{A_{01}}{A_{02}}{A_{10}}{A_{11}}{A_{12}}{A_{20}}{A_{21}}{A_{22}}   →   3 × 3 partitioning A00, . . . , A22
  \FlaThreeByThreeBL{...}, \FlaThreeByThreeTR{...}, \FlaThreeByThreeTL{...}   →   the same 3 × 3 partitioning, with the thick lines placed as indicated by the suffix

Figure 2.3: Some FLaTeX macros and the partitionings they typeset.
• Assume Σ_{k=0}^{m−1} 2 = 2m. Show that Σ_{k=0}^{(m+1)−1} 2 = 2(m + 1):
    Σ_{k=0}^{(m+1)−1} 2 = ( Σ_{k=0}^{m−1} 2 ) + 2 = 2m + 2 = 2(m + 1).
We conclude by the Principle of Mathematical Induction that Σ_{k=0}^{m−1} 2 = 2m.
This inductive proof can be incorporated into the worksheet as illustrated in Figure 2.4, yielding the cost
worksheet. In that figure, we introduce Csf which stands for “Cost-so-far”. Assertions are added to the
worksheet indicating the computation cost incurred so far at the specified point in the algorithm. In Step 1a,
the cost is given by Csf = 0. At Step 2, just before the loop, this translates to Csf = 2m(xT ) since m(xT ) = 0
and the operation in Step 4 is merely an indexing operation, which does not represent useful computation
and is therefore not counted. This is analogous to the base case in our inductive proof. The assertion that
Csf = 2m(xT ) is true at the top of the loop-body is equivalent to the induction hypothesis. We will refer to this
cost as the cost-invariant of the loop. We need to show that it is again true at the bottom of the loop-body,
where m(xT ) is one greater than m(xT ) at the top of the loop. We do so by inserting Csf = 2m(x0 ) in Step 6,
which follows by textual substitution and the fact that the operations in Step 5a are indexing operations and
do not count towards Csf. The fact that two flops are performed in Step 8 and that the operations in Step 5b are indexing operations means that Csf = 2m(x0) + 2 at Step 7. Since one element has been added to xT, at the bottom of the loop-body m(xT) = m(x0) + 1, so that Csf = 2m(xT) holds there. Thus, as was true for the loop-invariant, Csf = 2m(xT) upon leaving the loop. Since there m(xT) = m(x), the total cost of the algorithm is 2m(x) flops.
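The cost-invariant can be checked in the same mechanical way. The following sketch (an illustration only) adds a counter to the loop and asserts that Csf = 2 m(xT) holds at the top and bottom of every iteration, so that Csf = 2 m(x) holds on completion.

    def apdot_cost(x, y, alpha):
        # Count the flops performed by the apdot loop, checking the cost-invariant Csf = 2*k.
        m = len(x)
        csf = 0                            # "cost-so-far"
        k = 0
        while k < m:
            assert csf == 2 * k            # cost-invariant at the top of the loop-body
            alpha = x[k] * y[k] + alpha    # one multiply and one add: 2 flops
            csf += 2
            k += 1
            assert csf == 2 * k            # cost-invariant at the bottom of the loop-body
        assert csf == 2 * m                # total cost: 2 m(x) flops
        return alpha, csf

For example, apdot_cost([1.0, 2.0, 3.0], [2.0, 4.0, 1.0], 0.0) returns (13.0, 6), i.e., 2m = 6 flops.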
The above analysis demonstrates the link between a loop-invariant, a cost-invariant, and an inductive hy-
pothesis. The proof of the Fundamental Invariance Theorem employs mathematical induction [15].
2.6 Summary
In this chapter, we have introduced the fundamentals of the FLAME approach in the setting of a simple example,
the apdot. Let us recap the highlights so far.
• In our notation algorithms are expressed without detailed indexing. By partitioning vectors into subvec-
tors, the boundary between those subvectors can be used to indicate how far into the vector indexing has
reached. Elements near that boundary are of interest since they may move across the boundary as they
are updated and/or used in the current iteration. It is this insight that allows us to restrict indexing only
to the sets {T, B} and {0, 1, 2} when tracking vectors.
• Assertions naturally express the state in which variables should be at a given point in an algorithm.
• Loop-invariants are systematically identified a priori from the postcondition, which is the specification of
the computation to be performed. This makes the approach goal-oriented.
• Given a precondition, postcondition, and a specific loop-invariant, all other steps of the derivation are
prescribed. The systematic method for deriving all these parts is embodied in Figure 2.5, which we will
refer to as the worksheet from here on.
Step 1a:  {Ppre}
Step 4:   Partition
            where
Step 2:   {Pinv}
Step 3:   while G do
Step 2,3:   {(Pinv) ∧ (G)}
Step 5a:    Repartition
              where
Step 6:     {Pbefore}
Step 8:     SU (the update)
Step 5b:    Continue with
Step 7:     {Pafter}
Step 2:     {Pinv}
          endwhile
Step 2,3: {(Pinv) ∧ ¬(G)}
Step 1b:  {Ppost}

Figure 2.5: Worksheet for deriving linear algebra algorithms.
• An expression for the cost of an algorithm can be determined by summing the cost of the operations in
the loop-body. A closed-form expression for this summation can then be proven correct by annotating
the worksheet with a cost-invariant.
Remark 2.18 A blank worksheet, to be used in subsequent exercises, can be obtained by visiting $BASE/Chapter2/.
Figure 2.6: Vector-vector operations. Here, α is a scalar while x and y are vectors of length m.
Chapter 3
Matrix-Vector Operations
The previous chapter introduced the FLAME approach to deriving and implementing linear algebra algorithms.
The primary example chosen for that chapter was an operation that involved scalars and vectors only, as did
the exercises in that chapter. Such operations are referred to as vector-vector operations. In this chapter, we
move on to simple operations that combine matrices and vectors and that are thus referred to as matrix-vector
operations.
Matrices differ from vectors in that their two-dimensional shape permits systematic traversal in multiple
directions: While vectors are naturally accessed from top to bottom or vice-versa, matrices can be accessed
row-wise, column-wise, and by quadrants, as we will see in this chapter. This multitude of ways in which
matrices can be partitioned leads to a much richer set of algorithms.
We focus on three common matrix-vector operations, namely, the matrix-vector product, the rank-1 update,
and the solution of a triangular linear system of equations. The latter will also be used to illustrate the
derivation of a blocked variant of an algorithm, a technique that supports performance and modularity. We will
see that these operations build on the vector-vector operations encountered in the previous chapter and become
themselves building-blocks for blocked algorithms for matrix-vector operations, and more complex operations
in later chapters.
3.1 Notation
A vector x ∈ R^m is an ordered tuple of m real numbers.¹ It is written as a column of elements, with parentheses around it:

    x = ( χ0    )
        ( χ1    )
        ( ...   )
        ( χm−1  ).

The parentheses are there only for visual effect. In some cases, for clarity, we will include bars that separate the elements of the column.
We adopt the convention that lowercase Roman letters are used for vector variables and lowercase Greek letters
for scalars. Also, the elements of a vector are denoted by the Greek letter that corresponds to the Roman letter
used to denote the vector. A table of corresponding letters is given in Appendix A.
If x is a column vector, then the row vector with identical elements organized as a row is denoted by x^T:
  x^T = (χ0, χ1, . . . , χm−1).
The "T" superscript stands for transposition. Sometimes we will leave out the commas that separate the elements, replacing them with a blank instead, or we will use separation bars:
  x^T = ( χ0  χ1  · · ·  χm−1 ) = ( χ0 | χ1 | · · · | χm−1 ).
Often it will be space-consuming to have a column vector in a sentence written as a column of its elements. Thus, rather than writing x as a column we will then write x = (χ0, χ1, . . . , χm−1)^T.
A matrix A ∈ R^{m×n} is a two-dimensional array of elements where its (i, j) element is given by αij:

    A = ( α00       α01       ...  α0,n−1      )
        ( α10       α11       ...  α1,n−1      )
        ( ...       ...            ...         )
        ( αm−1,0    αm−1,1    ...  αm−1,n−1    ).
1 Throughout the book we will take scalars, vectors, and matrices to be real valued. Most of the results also apply to the case where they are complex valued.
We adopt the convention that matrices are denoted by uppercase Roman letters. Frequently, we will partition A by columns or rows:

    A = ( a0, a1, . . . , an−1 ) = ( ǎ0^T    )
                                   ( ǎ1^T    )
                                   ( ...     )
                                   ( ǎm−1^T  ),
Remark 3.1 Lowercase Greek letters and Roman letters will be used to denote scalars and vectors, respec-
tively. Uppercase Roman letters will be used for matrices.
Exceptions to this rule are variables that denote the (integer) dimensions of the vectors and matrices which
are denoted by Roman lowercase letters to follow the traditional convention.
During an algorithm one or more variables (scalars, vectors, or matrices) will be modified so that they no
longer contain their original values (contents). Whenever we need to refer to the original contents of a variable
we will put a “ˆ” symbol on top of it. For example, Â, â, and α̂ will refer to the original contents (those before
the algorithm commences) of A, a, and α, respectively.
Remark 3.2 A variable name with a “ˆ” symbol on top of it refers to the original contents of that variable.
This will be used for scalars, vectors, matrices, and also for parts (elements, subvectors, submatrices) of these
variables.
3.2 Linear Transformations and Matrices

While the reader has likely been exposed to the definition of the matrix-vector product before, we believe it to
be a good idea to review why it is defined as it is.
Definition 3.3 Let F : Rn → Rm be a function that maps a vector from Rn to a vector in Rm . Then F is said
to be a linear transformation if F(αx + y) = αF(x) + F(y) for all α ∈ R and x, y ∈ Rn .
Consider the unit basis vectors, ej ∈ R^n, 0 ≤ j < n, which are defined by the vectors of all zeroes except for the jth element, which equals 1:
  ej = ( 0, . . . , 0, 1, 0, . . . , 0 )^T,
where the "1" appears as element j, preceded by j zeroes (elements 0, 1, . . . , j − 1) and followed by n − j − 1 zeroes (elements j + 1, j + 2, . . . , n − 1).
Any vector x = (χ0, χ1, . . . , χn−1)^T ∈ R^n can then be written as a linear combination of these vectors:
  x = χ0 e0 + χ1 e1 + · · · + χn−1 en−1.
If F : Rn → Rm is a linear transformation, then
F(x) = F(χ0 e0 + χ1 e1 + · · · + χn−1 en−1 ) = χ0 F(e0 ) + χ1 F(e1 ) + · · · + χn−1 F(en−1 )
= χ0 a0 + χ1 a1 + · · · + χn−1 an−1 ,
where aj = F(ej ) ∈ Rm , 0 ≤ j < n. Thus, we conclude that if we know how the linear transformation F acts
on the unit basis vectors, we can evaluate F(x) as a linear combination of the vectors aj , 0 ≤ j < n, with the
coefficients given by the elements of x. The matrix A ∈ Rm×n that has aj , 0 ≤ j < n, as its jth column thus
represents the linear transformation F, and the matrix-vector product Ax is defined as
  Ax ≡ F(x) = χ0 a0 + χ1 a1 + · · · + χn−1 an−1

            = ( α00 χ0 + α01 χ1 + · · · + α0,n−1 χn−1              )
              ( α10 χ0 + α11 χ1 + · · · + α1,n−1 χn−1              )
              ( ...                                                )
              ( αm−1,0 χ0 + αm−1,1 χ1 + · · · + αm−1,n−1 χn−1      ).      (3.1)
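The definition can be checked directly with a short NumPy sketch (an illustration, not part of the book's notation): multiplying A by the jth unit basis vector extracts the jth column, and Ax equals the linear combination of the columns of A with the coefficients taken from x.

    import numpy as np

    m, n = 4, 3
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, n))
    x = rng.standard_normal(n)

    # A e_j = a_j: the jth unit basis vector picks out the jth column of A.
    for j in range(n):
        e_j = np.zeros(n)
        e_j[j] = 1.0
        assert np.allclose(A @ e_j, A[:, j])

    # A x = chi_0 a_0 + chi_1 a_1 + ... + chi_{n-1} a_{n-1}.
    lin_comb = sum(x[j] * A[:, j] for j in range(n))
    assert np.allclose(A @ x, lin_comb)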
Exercise 3.4 Let A = (a0 , a1 , . . . , an−1 ) ∈ Rm×n be a partitioning of A by columns. Show that Aej = aj ,
0 ≤ j < n, from the definition of the matrix-vector product in (3.1).
Exercise 3.5 Let F : Rn → Rn have the property that F(x) = x for all x ∈ Rn . Show that
(a) F is a linear transformation.
(b) F(x) = In x, where In is the identity matrix defined as In = (e0 , e1 , . . . , en−1 ).
The following two exercises relate to the distributive property of the matrix-vector product.
Exercise 3.6 Let F : Rn → Rm and G : Rn → Rm both be linear transformations. Show that H : Rn → Rm
defined by H(x) = F(x) + G(x), is also a linear transformation. Next, let A, B, and C equal the matrices that
represent F, G, and H, respectively. Explain why C = A + B should be defined as the matrix that results from
adding corresponding elements of A and B.
Exercise 3.7 Let A ∈ Rm×n and x, y ∈ Rn . Show that A(x + y) = Ax + Ay; that is, the matrix-vector product
is distributive with respect to vector addition.
Two important definitions, which will be used later in the book, are the following.
Definition 3.8 A set of vectors v1 , v2 , . . . , vn ∈ Rm is said to be linearly independent if
ν1 v1 + ν2 v2 + · · · + νn vn = 0,
with ν1 , ν2 , . . . , νn ∈ R, implies ν1 = ν2 = . . . = νn = 0.
Definition 3.9 The column (row) rank of a matrix A is the maximal number of linearly independent column
(row) vectors of A.
Note that the row and column rank of a matrix are always equal.
3.3 Algorithms for the Matrix-Vector Product

Let us consider the operation y := αAx + βy, with x ∈ R^n, y ∈ R^m partitioned into elements, and
A ∈ Rm×n partitioned into elements, columns, and rows as discussed in Section 3.1. This is a more general
form of the matrix-vector product and will be referred to as gemv from here on. For simplicity, we consider
α = β = 1 in this section.
From
y := Ax + y = χ0 a0 + χ1 a1 + · · · + χn−1 an−1 + y = [[[[χ0 a0 + χ1 a1 ] + · · ·] + χn−1 an−1 ] + y] ,
we note that gemv can be computed by repeatedly performing axpy operations. Because of the commutative
property of vector addition, the axpys in this expression can be performed in any order.
Next, we show that gemv can be equally well computed as a series of apdots involving the rows of matrix
A, vector x, and the elements of y:
    y := Ax + y = ( α00 χ0 + · · · + α0,n−1 χn−1          )   ( ψ0    )   ( ǎ0^T x + ψ0      )
                  ( α10 χ0 + · · · + α1,n−1 χn−1          ) + ( ψ1    ) = ( ǎ1^T x + ψ1      )
                  ( ...                                    )   ( ...   )   ( ...              )
                  ( αm−1,0 χ0 + · · · + αm−1,n−1 χn−1     )   ( ψm−1  )   ( ǎm−1^T x + ψm−1  ).
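The two orderings can be made concrete with a short NumPy sketch (an illustration only): the first loop computes y := Ax + y as a sequence of axpy operations with the columns of A, the second as a sequence of apdot operations with the rows of A.

    import numpy as np

    def gemv_axpy(A, x, y):
        # y := A x + y computed column by column (a sequence of axpys).
        y = y.copy()
        for j in range(A.shape[1]):
            y += x[j] * A[:, j]          # axpy: y := chi_j a_j + y
        return y

    def gemv_dots(A, x, y):
        # y := A x + y computed row by row (a sequence of apdots).
        y = y.copy()
        for i in range(A.shape[0]):
            y[i] = A[i, :] @ x + y[i]    # apdot: psi_i := a_i^T x + psi_i
        return y

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 3)); x = rng.standard_normal(3); y = rng.standard_normal(4)
    assert np.allclose(gemv_axpy(A, x, y), A @ x + y)
    assert np.allclose(gemv_dots(A, x, y), A @ x + y)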
where Ai,j ∈ R^{mi×nj}, xj ∈ R^{nj}, and yi ∈ R^{mi}. Then, the ith subvector of y = Ax is given by
  yi = Σ_{j=0}^{ν−1} Ai,j xj .    (3.2)
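A minimal NumPy sketch of (3.2), with arbitrarily chosen (hypothetical) row and column block boundaries, accumulates each subvector yi from the products Ai,j xj and compares the result with Ax.

    import numpy as np

    def gemv_blocked(A, x, row_splits, col_splits):
        # Compute y = A x block by block: y_i = sum_j A_{i,j} x_j, as in (3.2).
        m, n = A.shape
        rows = np.split(np.arange(m), row_splits)     # index sets of the row blocks
        cols = np.split(np.arange(n), col_splits)     # index sets of the column blocks
        y = np.zeros(m)
        for I in rows:
            for J in cols:
                y[I] += A[np.ix_(I, J)] @ x[J]        # y_i := y_i + A_{i,j} x_j
        return y

    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 7)); x = rng.standard_normal(7)
    assert np.allclose(gemv_blocked(A, x, row_splits=[2], col_splits=[3, 5]), A @ x)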
Remark 3.15 Subscripts “L” and “R” will serve to specify the Left and Right submatrices/subvectors of a
matrix/vector, respectively. Similarly, subscripts “T” and “B” will be used to specify the Top and Bottom
submatrices/subvectors.
Exercise 3.16 Prove Corollary 3.13.
Exercise 3.17 Prove Corollary 3.14.
Remark 3.18 Corollaries 3.13 and 3.14 pose certain restrictions on the dimensions of the partitioned matri-
ces/vectors so that the matrix-vector product is “consistently” defined for these partitioned elements.
Let us share a few hints on what a conformal partitioning is for the type of expressions encountered in gemv.
Consider a matrix A, two vectors x, y, and an operation which relates them as:
1. x + y or x := y; then, m(x) = m(y), and any partitioning by elements in one of the two operands must
be done conformally (with the same dimensions) in the other operand.
2. Ax; then, n(A) = m(x) and any partitioning by columns in A must be conformally performed by elements
in x.
Variant 1:  ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons        Variant 2:  ( [yT; yB] = [ŷT; AB x + ŷB] ) ∧ Pcons
Variant 3:  ( y = AL xT + ŷ ) ∧ Pcons                     Variant 4:  ( y = AR xB + ŷ ) ∧ Pcons

Figure 3.1: Four loop-invariants for computing y := Ax + y (gemv).
3.3.1 Derivation
We now derive algorithms for gemv using the eight steps in the worksheet (Figure 2.5).
Remark 3.19 In order to derive the following algorithms, we do not assume the reader is a priori aware of
any method for computing gemv. Rather, we apply systematically the steps in the worksheet to derive two
different algorithms, which correspond to the computation of gemv via a series of axpys or apdots.
Step 1: Specifying the precondition and postcondition. The precondition for the algorithm is given by
  Ppre : (A ∈ R^{m×n}) ∧ (y ∈ R^m) ∧ (x ∈ R^n),
and the postcondition by
  Ppost : y = Ax + ŷ.
Step 2: Determining loop-invariants. Corollaries 3.13 and 3.14 provide us with two PMEs from which loop-
invariants can be determined:
  [yT; yB] = [AT x + ŷT; AB x + ŷB] ∧ Pcons    and    y = AL xT + AR xB + ŷ ∧ Pcons.
Here Pcons : m(yT ) = m(AT ) for the first PME and Pcons : m(xT ) = n(AL ) for the second PME.
Remark 3.20 We will often use the consistency predicate “Pcons ” to establish conditions on the partitionings
of the operands that ensure that operations are well-defined.
A loop-invariant inherently describes an intermediate result towards the final result computed by a loop.
The observation that only part of the computations have been performed before each iteration yields the four
different loop-invariants in Figure 3.1.
Let us focus on Invariant 1:
  Pinv-1 : ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons ,    (3.3)
which reflects a state where elements of yT have already been updated with the final result while elements of
yB have not. We will see next how the partitioning of A by rows together with this loop-invariant fixes all
remaining steps in the worksheet and leads us to the algorithm identified as Variant 1 in Figure 3.2.
Step 3: Choosing a loop-guard. Upon completion of the loop, the loop-invariant is true, the loop-guard G is
false, and the condition
  Pinv ∧ ¬G ≡ ( ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons ) ∧ ¬G    (3.4)
must imply that y = Ax + ŷ. Now, if yT equals all of y then, by consistency, AT equals all of A, and (3.4)
implies that the postcondition is true. Therefore, we adopt “G : m(yT ) < m(y)” as the required loop-guard G
for the worksheet.
Step 4: Initialization. Next, we must find an initialization that, ideally with a minimum amount of computa-
tions, sets the variables of the algorithm in a state where the loop-invariant (including the consistency condition)
holds.
We note that the partitioning
  A → [AT; AB],   y → [yT; yB],
where AT has 0 rows and yT has no elements, sets the variables AT , AB , yT , and yB in a state where the
loop-invariant is satisfied. This initialization, which only involves indexing operations, appears in Step 4 in
Figure 3.2.
Step 5: Progressing through the operands. As part of the computation, AT and yT , start by having no
elements, but must ultimately equal all of A and y, respectively. Thus, as part of the loop, rows must be taken
from AB and added to AT while elements must be moved from yB to yT . This is denoted in Figure 3.2 by the
repartitioning statements²
  [AT; AB] → [A0; a1^T; A2]   and   [yT; yB] → [y0; ψ1; y2],
This manner of moving the elements ensures that Pcons holds and that the loop terminates.
Step 6: Determining the state after repartitioning. The contents of y in terms of the partitioned matrix
and vectors, Pbefore in the worksheet in Figure 2.5, is determined via textual substitution as follows. From the
partitionings in Step 5a,
  AT = A0, AB = [a1^T; A2]   and   yT = y0, yB = [ψ1; y2].
If we substitute the quantities on the right of the equalities into the loop-invariant,
  ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons,
we find that
  [y0; ψ1; y2] = [A0 x + ŷ0; ψ̂1; ŷ2],
as entered in Step 6 in Figure 3.2.
Step 7: Determining the state after moving the thick lines. After moving the thick lines, in Step 5b
  [yT; yB] = [AT x + ŷT; ŷB]

² In the partitionings we do not use the superscript "ˇ" for the row a1^T as, in this case, there is no possible confusion with a column of the matrix.
implies that
  [y0; ψ1; y2] = [ [A0; a1^T] x + [ŷ0; ψ̂1]; ŷ2 ],   or,   [y0; ψ1; y2] = [A0 x + ŷ0; a1^T x + ψ̂1; ŷ2].
This is entered as the state Pafter in the worksheet in Figure 2.5, as shown in Step 7 in Figure 3.2.
Step 8: Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the contents
of y must be updated from
y0 A0 x + ŷ0 y0 A0 x + ŷ0
ψ1 = ψ̂1 to ψ1 = a T
1 x+ψ̂1
.
y2 ŷ2 y2 ŷ2
ψ1 := aT
1 x + ψ1 ,
Final algorithm. By deleting the annotations (assertions) we finally obtain the algorithm for gemv (Variant
1) given in Figure 3.3. All the arithmetic operations in this algorithm are performed in terms of apdot.
Remark 3.21 The partitionings together with the loop-invariant prescribe steps 3–8 of the worksheet.
Exercise 3.22 Derive an algorithm for computing y := Ax + y using the Invariant 2 in Figure 3.1.
Exercise 3.23 Consider Invariant 3 in Figure 3.1. Provide all steps that justify the worksheet in Figure 3.4.
State the algorithm without assertions.
Exercise 3.24 Derive an algorithm for computing y := Ax + y using the Invariant 4 in Figure 3.1.
  ψ1 := a1^T x + ψ1
  Continue with
    [AT; AB] ← [A0; a1^T; A2],  [yT; yB] ← [y0; ψ1; y2]
endwhile
Consider now Figure 3.5, where assertions are added indicating the computation cost incurred so far at
the specified points in the algorithm. In Step 1a, the cost is given by Csf = 0. At Step 2, just before the
loop, this translates to the cost-invariant Csf = 2m(yT )n since m(yT ) = 0. We need to show that the cost-
invariant, which is true at the top of the loop, is again true at the bottom of the loop-body, where m(yT ) is
one greater than m(yT ) at the top of the loop. We do so by inserting Csf = 2m(y0 )n in Step 6, which follows
by textual substitution and the fact that Step 5a is composed of indexing operations with no cost. As 2n flops
are performed in Step 8 and the operations in Step 5b are indexing operations, Csf = 2(m(y0 ) + 1)n at Step
7. Since m(yT) = m(y0) + 1 in Step 2, due to the fact that one element has been added to yT, it follows that
Csf = 2m(yT )n at the bottom of the loop-body. Thus, as was true for the loop-invariant, Csf = 2m(yT )n upon
leaving the loop. Since there m(yT ) = m(y), we establish that the total cost of the algorithm is 2mn flops.
Exercise 3.25 Prove that the costs of the algorithms corresponding to Variant 2–4 are also 2mn flops.
3.4 Rank-1 Update

Consider the vectors x ∈ R^n, y ∈ R^m, and the matrix A ∈ R^{m×n} partitioned as in Section 3.1. A second operation that plays a critical role in linear algebra is the rank-1 update (ger), defined as
  A := A + α y x^T .    (3.6)
For simplicity in this section we consider α = 1. In this operation the (i, j) element of A is updated as
αi,j := αi,j + ψi χj , 0 ≤ i < m, 0 ≤ j < n.
The term rank-1 update comes from the fact that the rank of the matrix y x^T is at most one. Indeed,
  y x^T = (χ0 y, χ1 y, . . . , χn−1 y)
clearly shows that all columns of this matrix are multiples of the same vector y, and thus there can be at most
one linearly independent column.
Now, we note that

  A := A + y x^T
     = ( α00 + ψ0 χ0          α01 + ψ0 χ1          · · ·  α0,n−1 + ψ0 χn−1           )
       ( α10 + ψ1 χ0          α11 + ψ1 χ1          · · ·  α1,n−1 + ψ1 χn−1           )
       ( ...                                                                          )
       ( αm−1,0 + ψm−1 χ0     αm−1,1 + ψm−1 χ1     · · ·  αm−1,n−1 + ψm−1 χn−1       )

     = ( a0 + χ0 y,  a1 + χ1 y,  . . . ,  an−1 + χn−1 y )  =  ( ǎ0^T + ψ0 x^T      )
                                                              ( ǎ1^T + ψ1 x^T      )
                                                              ( ...                )
                                                              ( ǎm−1^T + ψm−1 x^T  ),

which shows that, in the computation of A + y x^T, column aj, 0 ≤ j < n, is replaced by aj + χj y while row ǎi^T, 0 ≤ i < m, is replaced by ǎi^T + ψi x^T.
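A minimal NumPy sketch (an illustration only) exercises the three equivalent views of the rank-1 update: by columns, elementwise, and by rows.

    import numpy as np

    def ger(A, y, x):
        # A := A + y x^T, updating one column at a time: a_j := a_j + chi_j y.
        A = A.copy()
        for j in range(A.shape[1]):
            A[:, j] += x[j] * y
        return A

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 3)); y = rng.standard_normal(4); x = rng.standard_normal(3)
    B = ger(A, y, x)
    assert np.allclose(B, A + np.outer(y, x))          # elementwise view: alpha_ij + psi_i chi_j
    assert np.allclose(B[1, :], A[1, :] + y[1] * x)    # row view: row i is replaced by a_i^T + psi_i x^T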
Based on the above observations the next two corollaries give the PMEs that can be used to derive the
algorithms for computing ger.
Corollary 3.27 Partition matrix A and vector x as
  A → ( AL | AR )   and   x → [xT; xB],
with n(AL) = m(xT). Then
  A + y x^T = ( AL | AR ) + y [xT; xB]^T = ( AL + y xT^T | AR + y xB^T ).
Corollary 3.28 Partition matrix A and vector y as
  A → [AT; AB]   and   y → [yT; yB],
with m(AT) = m(yT). Then
  A + y x^T = [AT; AB] + [yT; yB] x^T = [AT + yT x^T; AB + yB x^T].
Remark 3.29 Corollaries 3.27 and 3.28 again pose restrictions on the dimensions of the partitioned matri-
ces/vectors so that an operation is “consistently” defined for these partitioned elements.
We now give a few rules that apply to the partitionings performed on the operands that arise in ger. Consider
two matrices A, B, and an operation which relates them as:
1. A + B or A := B; then, m(A) = m(B), n(A) = n(B) and any partitioning by rows/columns in one of
the two operands must be done conformally (with the same dimensions) in the other operand.
Consider now a matrix A, two vectors x, y and the ger operation
1. A + yxT ; then, m(A) = m(y), n(A) = m(x), and any partitioning by rows/columns in A must be
conformally performed by elements in y/x (and vice versa).
Exercise 3.33 Derive two different algorithms for ger using the partitionings
  A → [AT; AB],   y → [yT; yB].
Exercise 3.34 Prove that all of the previous four algorithms for ger incur 2mn flops.
3.5 Solving Triangular Linear Systems of Equations

Consider the linear system Ax = b. Here, A ∈ R^{m×n} is the coefficient matrix, b ∈ R^m is the right-hand side vector, and x ∈ R^n is the vector of unknowns.
Let us now define the diagonal elements of the matrix A as those elements of the form αi,i, 0 ≤ i < min(m, n). In this section we study a simple case of a linear system which appears when the coefficient matrix is square and has zeros in all its elements above the diagonal; we then say that the coefficient matrix is lower triangular and we prefer to denote it using L instead of A, where L stands for Lower:

  ( λ00        0          ...   0           ) ( χ0    )   ( β0    )
  ( λ10        λ11        ...   0           ) ( χ1    ) = ( β1    )   ≡   Lx = b.
  ( ...        ...              ...         ) ( ...   )   ( ...   )
  ( λn−1,0     λn−1,1     ...   λn−1,n−1    ) ( χn−1  )   ( βn−1  )
Remark 3.35 Lower/upper triangular matrices will be denoted by letters such as L/U for Lower/Upper.
Lower/upper triangular matrices are square.
We next proceed to derive algorithms for computing this operation (hereafter, trsv) by filling out the
worksheet in Figure 2.5. During the presentation one should think of x as the vector that represents the final
solution, which ultimately will overwrite b upon completion of the loop.
Remark 3.36 In order to emphasize that the methodology allows one to derive algorithms for a given linear
algebra operation without an a priori knowledge of a method, we directly proceed with the derivation of an
algorithm for the solution of triangular linear systems, and delay the presentation of a concrete example until
the end of this section.
Step 1: Specifying the precondition and postcondition. The precondition for the algorithm is given by
  Ppre : (b = b̂) ∧ TrLw(L).
Here, the predicate TrLw(L) is true if L is a lower triangular matrix. (A similar predicate, TrUp(U), will play an analogous role for upper triangular matrices.) The postcondition is that
  Ppost : b = x, where Lx = b̂;
in other words, upon completion the contents of b equal those of x, where x is the solution of the lower triangular linear system Lx = b̂. This is indicated in Steps 1a and 1b in Figure 3.6.
Next, let us use L to introduce a new type of partitioning, into quadrants:
  L → [LTL, 0; LBL, LBR],
with x and b partitioned correspondingly into x → [xT; xB] and b → [bT; bB], where "Pcons : n(LTL) = m(xT) = m(bT)" holds. Furthermore, we will require that both LTL and LBR are themselves lower triangular matrices, that is,
  Pstruct : TrLw(LTL) ∧ TrLw(LBR)
holds.
Remark 3.37 We will often use the structural predicate “Pstruct ” to establish conditions on the structure of
the exposed blocks.
Remark 3.38 When dealing with triangular matrices, in order for the exposed diagonal blocks (submatrices) to themselves be triangular, we always partition this type of matrix into quadrants, with square blocks on the diagonal.
Although we employ predicates P_cons and P_struct during the derivation of the algorithm, in order to condense the assertions for this algorithm, we do not include these two predicates as part of the invariant in the presentation of the corresponding worksheet.
Variant 1:
$$ \left( \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) = \left( \begin{array}{c} x_T \\ \hline \hat b_B \end{array} \right) \right) \wedge (L_{TL} x_T = \hat b_T) \wedge P_{cons} \wedge P_{struct} $$
Variant 2:
$$ \left( \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) = \left( \begin{array}{c} x_T \\ \hline \hat b_B - L_{BL} x_T \end{array} \right) \right) \wedge (L_{TL} x_T = \hat b_T) \wedge P_{cons} \wedge P_{struct} $$
This shows that xT can be computed from the first equality (the one at the top), after which b̂B must be
updated by subtracting LBL xT from it, before xB can be computed using the second equality. This constraint
on the order in which subresults must be computed yields the two loop-invariants in Figure 3.7.
Step 3: Choosing a Loop-guard. For either of the two loop-invariants, the loop-guard “G : m(bT ) < m(b)”
has the property that (Pinv ∧ ¬G) ⇒ Ppost .
Step 4: Determining the initialization. The initialization, where L_{TL} is 0 × 0 and x_T, b_T have 0 elements, has the property that it sets the variables in a state where the loop-invariant holds.
Step 5: Progressing through the operands. For either of the two loop-invariants, the repartitioning shown in Step 5a in Figure 3.6³, followed by moving the thick lines as in Step 5b in the same figure, denotes that progress is made through the operands so that the loop eventually terminates. It also ensures that P_cons and P_struct hold.
Only now does the derivation become dependent on the loop-invariant that we choose. Let us choose
Invariant 2, which will produce the algorithm identified as Variant 2 for this operation.
Step 6: Determining the state after repartitioning. Invariant 2 and the repartitioning of the partitioned
matrix and vectors imply that
$$ \left( \begin{array}{c} b_0 \\ \hline \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \left( \begin{array}{c} \hat\beta_1 \\ \hat b_2 \end{array} \right) - \left( \begin{array}{c} l_{10}^T \\ L_{20} \end{array} \right) x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0) \;\equiv\; \left( \begin{array}{c} b_0 \\ \hline \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \hat\beta_1 - l_{10}^T x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0), $$
which is entered in Step 6 as in Figure 3.6.
Step 7: Determining the state after moving the thick lines. In Step 5b, Invariant 2 and the moving of the
thick lines imply that
$$ \left( \begin{array}{c} b_0 \\ \beta_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} \left( \begin{array}{c} x_0 \\ \chi_1 \end{array} \right) \\ \hline \hat b_2 - \left( \begin{array}{cc} L_{20} & l_{21} \end{array} \right) \left( \begin{array}{c} x_0 \\ \chi_1 \end{array} \right) \end{array} \right) \ \wedge\ \left( \left( \begin{array}{cc} L_{00} & 0 \\ l_{10}^T & \lambda_{11} \end{array} \right) \left( \begin{array}{c} x_0 \\ \chi_1 \end{array} \right) = \left( \begin{array}{c} \hat b_0 \\ \hat\beta_1 \end{array} \right) \right) $$
$$ \equiv\; \left( \begin{array}{c} b_0 \\ \beta_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \chi_1 \\ \hline \hat b_2 - L_{20} x_0 - \chi_1 l_{21} \end{array} \right) \ \wedge\ \left( \begin{array}{c} L_{00} x_0 = \hat b_0 \\ l_{10}^T x_0 + \lambda_{11} \chi_1 = \hat\beta_1 \end{array} \right), $$
which is entered in the corresponding step as in Figure 3.6.
Step 8. Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the contents
of b must be updated from
$$ \left( \begin{array}{c} b_0 \\ \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hat\beta_1 - l_{10}^T x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \quad\mbox{to}\quad \left( \begin{array}{c} b_0 \\ \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \chi_1 \\ \hat b_2 - L_{20} x_0 - \chi_1 l_{21} \end{array} \right), $$
³In the repartitioning of L the superscript “T” denotes that l_{10}^T is a row vector, as corresponds to λ_{11} being a scalar.
where
χ_1 := β_1/λ_{11} and
b_2 := b_2 − χ_1 l_{21}.
Final algorithm. By deleting the temporary variable x, which is only used for the purpose of proving the
algorithm correct while it is constructed, we arrive at the algorithm in Figure 3.8. In Section 4.2, we discuss
an API for representing algorithms in Matlab M-script code, FLAME@lab. The FLAME@lab code for the
algorithm in Figure 3.8 is given in Figure 3.9.
Example 3.39 Let us now illustrate how this algorithm proceeds. Consider a triangular linear system defined
by
$$ L = \left( \begin{array}{cccc} 2 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 2 & 1 & 2 & 0 \\ 0 & 2 & 1 & 3 \end{array} \right), \qquad b = \left( \begin{array}{c} 2 \\ 3 \\ 10 \\ 19 \end{array} \right). $$
From a little manipulation we can see that the solution to this system is given by
χ0 := ( 2 )/2 = 1,
χ1 := ( 3−1·1 )/1 = 2,
χ2 := ( 10 − 2 · 1 − 1 · 2 )/2 = 3,
χ3 := ( 19 − 0 · 1 − 2 · 2 − 1 · 3 )/3 = 4.
In Figure 3.10 we show the initial contents of each quadrant (iteration labeled as 0) as well as the contents
as computation proceeds from the first to the fourth (and final) iteration. In the figure, faces of normal size
indicate data and operations/results that have already been performed/computed, while the small faces indicate
operations that have yet to be performed.
The way the solver classified as Variant 2 works corresponds to what is called an “eager” algorithm, in the sense that once an unknown is computed, it is immediately “eliminated” from the remaining equations.
Sometimes this algorithm is also classified as the “column-oriented” algorithm of forward substitution as, at
each iteration, it utilizes a column of L in the update of the remaining independent terms by using a saxpy
operation. It is sometimes called forward substitution for reasons that will become clear in Chapter 6.
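To make the eager variant concrete, the following plain-C fragment (a sketch of our own, not part of any library; the matrix is stored as a simple two-dimensional array for readability) applies exactly this column-oriented update to the system of Example 3.39.

#include <stdio.h>

int main( void )
{
  /* The lower triangular system of Example 3.39. */
  double L[4][4] = { { 2, 0, 0, 0 },
                     { 1, 1, 0, 0 },
                     { 2, 1, 2, 0 },
                     { 0, 2, 1, 3 } };
  double b[4] = { 2, 3, 10, 19 };
  int n = 4, i, j;

  /* Eager (column-oriented) forward substitution: once chi_j is known,
     it is immediately eliminated from the remaining equations via an
     axpy with the subdiagonal part of column j of L.                   */
  for ( j = 0; j < n; j++ ){
    b[ j ] = b[ j ] / L[ j ][ j ];            /* beta_1 := beta_1 / lambda_11 */
    for ( i = j + 1; i < n; i++ )
      b[ i ] = b[ i ] - L[ i ][ j ] * b[ j ]; /* b_2 := b_2 - beta_1 l_21     */
  }

  printf( "x = ( %g, %g, %g, %g )\n", b[0], b[1], b[2], b[3] );  /* 1, 2, 3, 4 */
  return 0;
}

Running it overwrites b with (1, 2, 3, 4), the solution computed by hand above.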
Exercise 3.40 Prove that the cost of the triangular linear system solver formulated in Figure 3.8 is n² + n ≈ n² flops. Hint: Use $C_{sf} = m(x_0) + \sum_{k=0}^{m(x_0)-1} 2(n - k - 1)$ flops.
β_1 := β_1/λ_{11}
b_2 := b_2 − β_1 l_{21} (axpy)
Continue with
( L_{TL} 0 ; L_{BL} L_{BR} ) ← ( L_{00} 0 0 ; l_{10}^T λ_{11} 0 ; L_{20} l_{21} L_{22} ),  ( b_T ; b_B ) ← ( b_0 ; β_1 ; b_2 )
endwhile
Remark 3.41 When dealing with cost expressions we will generally neglect lower order terms.
Exercise 3.42 Derive an algorithm for solving Lx = b by choosing Invariant 1 in Figure 3.7. The solution to this exercise corresponds to an algorithm that is “lazy” (for each equation, it does not eliminate previous unknowns until it becomes necessary) or row-oriented (accesses to L are by rows, in the form of apdots).
Exercise 3.43 Prove that the cost of the triangular linear system solver for the lazy algorithm obtained as the
solution to Exercise 3.42 is n2 flops.
Exercise 3.44 Derive algorithms for the solution of the following triangular linear systems:
1. U x = b.
2. LT x = b.
3. U T x = b.
Figure 3.9: FLAME@lab code for solving Lx = b, overwriting b with x (unblocked Variant 2).
3.6 Blocked Algorithms

Key objectives when designing and implementing linear algebra libraries are modularity and performance. In
this section we show how both can be accommodated by casting algorithms in terms of gemv. The idea is to
derive so-called blocked algorithms which differ from the algorithms derived so far in that they move the thick
lines more than one element, row, and/or column at a time. We illustrate this technique by revisiting the trsv
operation.
[Figure 3.10 shows, for #Iter. = 0, 1, 2, 3, 4, the partitioning ( L_{TL} 0 ; L_{BL} L_{BR} ) of L and the corresponding contents of b, ( L_{TL}^{-1} b_T ; b_B − L_{BL} ( L_{TL}^{-1} b_T ) ) = ( x_T ; b_B − L_{BL} x_T ), for the system of Example 3.39: b successively contains (2, 3, 10, 19), (1, 2, 8, 19), (1, 2, 6, 15), (1, 2, 3, 12), and (1, 2, 3, 4).]
Figure 3.10: Example of the computation of (b := x) ∧ (Lx = b) (Variant 2). Computations yet to be performed
are in tiny font.
Remark 3.45 The derivation of blocked algorithms is identical to that of unblocked algorithms up to and including Step 4.
Let us choose Invariant 2, which will produce now the worksheet of a blocked algorithm identified as Variant
2 in Figure 3.11.
Step 5: Progressing through the operands. We now choose to move through vectors x and b by nb elements
per iteration. Here nb is the (ideal) block size of the algorithm. In other words, at each iteration of the loop,
nb elements are taken from xB and bB and moved to xT , bT , respectively. For consistency then, a block of
dimension n_b × n_b must also be moved from L_{BR} to L_{TL}. We can proceed in this manner by first repartitioning
$$ \left( \begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array} \right) \rightarrow \left( \begin{array}{c|cc} L_{00} & 0 & 0 \\ \hline L_{10} & L_{11} & 0 \\ L_{20} & L_{21} & L_{22} \end{array} \right), \quad \left( \begin{array}{c} x_T \\ \hline x_B \end{array} \right) \rightarrow \left( \begin{array}{c} x_0 \\ \hline x_1 \\ x_2 \end{array} \right), \quad \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) \rightarrow \left( \begin{array}{c} b_0 \\ \hline b_1 \\ b_2 \end{array} \right), $$
where L_{11} is a block of dimension n_b × n_b, and x_1, b_1 have n_b elements each. These blocks/elements are then moved to the corresponding parts of the matrix/vectors as indicated by
$$ \left( \begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array} \right) \leftarrow \left( \begin{array}{cc|c} L_{00} & 0 & 0 \\ L_{10} & L_{11} & 0 \\ \hline L_{20} & L_{21} & L_{22} \end{array} \right), \quad \left( \begin{array}{c} x_T \\ \hline x_B \end{array} \right) \leftarrow \left( \begin{array}{c} x_0 \\ x_1 \\ \hline x_2 \end{array} \right), \quad \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) \leftarrow \left( \begin{array}{c} b_0 \\ b_1 \\ \hline b_2 \end{array} \right). $$
This movement ensures that the loop eventually terminates and that both Pcons and Pstruct hold.
Remark 3.46 In practice, the block size is adjusted at each iteration as the minimum between the algorithmic
(or optimal) block size and the number of remaining elements.
Step 6: Determining the state after repartitioning. Invariant 2 and the definition of the repartitioning for
the blocked algorithm imply that
$$ \left( \begin{array}{c} b_0 \\ \hline b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \left( \begin{array}{c} \hat b_1 \\ \hat b_2 \end{array} \right) - \left( \begin{array}{c} L_{10} \\ L_{20} \end{array} \right) x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0), $$
or
$$ \left( \begin{array}{c} b_0 \\ \hline b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \hat b_1 - L_{10} x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0). $$
Step 7: Determining the state after moving the thick lines. In Step 5b the Invariant 2 implies that
$$ \left( \begin{array}{c} b_0 \\ b_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} \left( \begin{array}{c} x_0 \\ x_1 \end{array} \right) \\ \hline \hat b_2 - \left( \begin{array}{cc} L_{20} & L_{21} \end{array} \right) \left( \begin{array}{c} x_0 \\ x_1 \end{array} \right) \end{array} \right) \ \wedge\ \left( \left( \begin{array}{cc} L_{00} & 0 \\ L_{10} & L_{11} \end{array} \right) \left( \begin{array}{c} x_0 \\ x_1 \end{array} \right) = \left( \begin{array}{c} \hat b_0 \\ \hat b_1 \end{array} \right) \right), $$
or
$$ \left( \begin{array}{c} b_0 \\ b_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ x_1 \\ \hline \hat b_2 - L_{20} x_0 - L_{21} x_1 \end{array} \right) \ \wedge\ \left( \begin{array}{c} L_{00} x_0 = \hat b_0 \\ L_{10} x_0 + L_{11} x_1 = \hat b_1 \end{array} \right), $$
which is entered in the corresponding step as in Figure 3.11.
Step 8. Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the contents
of b must be updated from
$$ \left( \begin{array}{c} b_0 \\ b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hat b_1 - L_{10} x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \quad\mbox{to}\quad \left( \begin{array}{c} b_0 \\ b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ x_1 \\ \hat b_2 - L_{20} x_0 - L_{21} x_1 \end{array} \right). $$
From the last equation we find that L_{11} x_1 = (\hat b_1 − L_{10} x_0). Since b_1 already contains (\hat b_1 − L_{10} x_0), we conclude that in the update we first need to solve the triangular linear system
L_{11} x_1 = b_1, (3.7)
overwriting b_1 with the solution, after which b_2 must be updated as b_2 := b_2 − L_{21} b_1 (a gemv operation).
Final algorithm. By deleting the assertions and the temporary variable x, we obtain the blocked algorithm in
Figure 3.12. If nb is set to 1 in this algorithm, then it performs exactly the same operations and in the same
order as the corresponding unblocked algorithm.
b_1 := trsv(L_{11}, b_1)
b_2 := b_2 − L_{21} b_1 (gemv)
Continue with
( L_{TL} 0 ; L_{BL} L_{BR} ) ← ( L_{00} 0 0 ; L_{10} L_{11} 0 ; L_{20} L_{21} L_{22} ),  ( b_T ; b_B ) ← ( b_0 ; b_1 ; b_2 )
endwhile
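For readers who prefer to see the blocked algorithm outside of the FLAME notation, here is a minimal plain-C sketch of the algorithm in Figure 3.12; the function name and calling sequence are our own, and we assume L is stored in column-major order with leading dimension ldl.

/* Blocked forward substitution (cf. Figure 3.12): L is n x n lower
   triangular, column-major with leading dimension ldl; b is overwritten
   with the solution of L x = b.  nb_alg is the algorithmic block size.   */
void trsv_lower_blk_var2( int n, int nb_alg, const double *L, int ldl, double *b )
{
  int i, j, k;
  for ( k = 0; k < n; k += nb_alg ){
    int nb = ( n - k < nb_alg ? n - k : nb_alg );

    /* b1 := inv( L11 ) b1: unblocked solve with the nb x nb diagonal block. */
    for ( j = k; j < k + nb; j++ ){
      b[ j ] = b[ j ] / L[ j + j * ldl ];
      for ( i = j + 1; i < k + nb; i++ )
        b[ i ] = b[ i ] - L[ i + j * ldl ] * b[ j ];
    }

    /* b2 := b2 - L21 b1 (gemv with the block below the diagonal block).    */
    for ( j = k; j < k + nb; j++ )
      for ( i = k + nb; i < n; i++ )
        b[ i ] = b[ i ] - L[ i + j * ldl ] * b[ j ];
  }
}

Setting nb_alg to 1 recovers exactly the operations of the unblocked Variant 2, as observed above.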
For the blocked algorithm for trsv in Figure 3.12, we consider that n = ν n_b with n_b ≪ n. The algorithm thus iterates ν times, with a triangular linear system of fixed order n_b (L_{11}^{-1} b_1) being solved and a gemv operation of decreasing size (b_2 := b_2 − L_{21} b_1) being performed at each iteration. As a matter of fact, the row dimension of the matrix involved in the gemv operation decreases by n_b rows per iteration so that, at iteration k, L_{21} is of dimension (ν − k − 1)n_b × n_b. Thus, the cost of solving the triangular linear system using the blocked algorithm is approximately
$$ \sum_{k=0}^{\nu-1} \left( n_b^2 + 2(\nu - k - 1)\, n_b^2 \right) \approx 2 n_b^2 \, \frac{\nu^2}{2} = n^2 \mbox{ flops}. $$
The cost of the blocked variant of trsv is equal to that of the unblocked version. This is true for most of the blocked algorithms that we will derive in this book. Nevertheless, be aware that there exists a class of blocked algorithms, related to the computation of orthogonal factorizations, which do not satisfy this property.
Exercise 3.47 Show that all unblocked and blocked algorithms for computing the solution to Lx = b have exactly the same operation count by performing an exact analysis of the operation count.
1 #include "FLAME.h"
2
3 int Trsv_lower_blk_var2( FLA_Obj L, FLA_Obj b, int nb_alg )
4 {
5 FLA_Obj LTL, LTR, L00, L01, L02,
6 LBL, LBR, L10, L11, L12,
7 L20, L21, L22;
8 FLA_Obj bT, b0,
9 bB, b1,
10 b2;
11        int nb;
12
13        FLA_Part_2x2( L, &LTL, &LTR,
14 &LBL, &LBR, 0, 0, FLA_TL );
15 FLA_Part_2x1( b, &bT,
16 &bB, 0, FLA_TOP );
17
18 while ( FLA_Obj_length( LTL ) < FLA_Obj_length( L ) ){
19          nb = min( FLA_Obj_length( LBR ), nb_alg );
20
21 FLA_Repart_2x2_to_3x3( LTL, /**/ LTR, &L00, /**/ &L01, &L02,
22 /* ************* */ /* ******************** */
23 &L10, /**/ &L11, &L12,
24 LBL, /**/ LBR, &L20, /**/ &L21, &L22,
25                                 nb, nb, FLA_BR );
26 FLA_Repart_2x1_to_3x1( bT, &b0,
27 /* ** */ /* ** */
28 &b1,
29                                 bB, &b2, nb, FLA_BOTTOM );
30 /*------------------------------------------------------------*/
31 Trsv_lower_unb_var2( L11, b1 ); /* b1 := inv( L11 ) * b1 */
32 FLA_Gemv( FLA_NO_TRANSPOSE, /* b2 := b2 - L21 * b1 */
33 ONE, L21, b1, ONE, b2 )
34 /*------------------------------------------------------------*/
35          FLA_Cont_with_3x3_to_2x2( &LTL, /**/ &LTR,  L00, L01, /**/ L02,
36 L10, L11, /**/ L12,
37 /* ************** */ /* ****************** */
38 &LBL, /**/ &LBR, L20, L21, /**/ L22,
39 FLA_TL );
40 FLA_Cont_with_3x1_to_2x1( &bT, b0,
41 b1,
42 /* ** */ /* ** */
43 &bB, b2, FLA_TOP );
44 }
45 return FLA_SUCCESS;
46 }
Figure 3.13: FLAME/C code for solving Lx = b, overwriting b with x (Blocked Variant 2).
3.7 Summary
Let us recap the highlights of this chapter.
• Most computations in unblocked algorithms for matrix-vector operations are expressed as axpys and
apdots.
• Operations involving matrices typically yield more algorithmic variants than those involving only vectors
due to the fact that matrices can be traversed in multiple directions.
• Blocked algorithms for matrix-vector operations can typically be cast in terms of gemv and/or ger.
High-performance can be achieved in a modular manner by optimizing these two operations, and casting
other operations in terms of them.
• The derivation of blocked algorithms is no more complex than that of unblocked algorithms.
• Algorithms for all matrix-vector operations that are discussed can be derived using the methodology
presented in Chapter 2.
• Again, we note that the derivation of loop-invariants is systematic, and that the algorithm is prescribed once a loop-invariant is chosen, although now a remaining choice is whether to derive an unblocked algorithm or a blocked one.
Figure 3.14: Basic operations combining matrices and vectors. Cost is approximate.
Chapter 4
The FLAME Application Programming
Interfaces
In this chapter we present two Application Programming Interfaces (APIs) for coding linear algebra algorithms.
While these APIs are almost trivial extensions of the M-script language and the C programming language, they
greatly simplify the task of typesetting, programming, and maintaining families of algorithms for a broad spec-
trum of linear algebra operations. In combination with the FLAME methodology for deriving algorithms, these
APIs facilitate the rapid derivation, verification, documentation, and implementation of a family of algorithms
for a single linear algebra operation. Since the algorithms are expressed in code much like they are explained
in a classroom setting, the APIs become not just a tool for implementing libraries, but also a valuable resource
for teaching the algorithms that are incorporated in the libraries.
y_1 := y_1 + A_1 x
Continue with
( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( y_T ; y_B ) ← ( y_0 ; y_1 ; y_2 )
endwhile

y := y + A_1 x_1
Continue with
( A_L | A_R ) ← ( A_0 | A_1 | A_2 ),  ( x_T ; x_B ) ← ( x_0 ; x_1 ; x_2 )
endwhile
We start with the M-script function that partitions a matrix (or vector) into two submatrices:
[ AT,...
AB ] = FLA_Part_2x1( A, mb, side )
Purpose: Partition matrix A into a top and a bottom side where the side indicated by side has mb rows.
A – matrix to be partitioned
mb – row dimension of side indicated by side
side – side for which row dimension is given
AT, AB – matrices for Top and Bottom parts
Here side can take on the values (character strings) ’FLA_TOP’ or ’FLA_BOTTOM’ to indicate that mb is the row dimension of the Top matrix AT, or the Bottom matrix AB, respectively. The routine can also be used to partition a (column) vector.
As an example of the use of this routine, the translation of the algorithm fragment from Figure 4.1 on the
left results in the code on the right:
Partition A → ( A_T ; A_B ) where A_T has 0 rows

[ AT,...
  AB ] = FLA_Part_2x1( A,...
                       0, 'FLA_TOP' )
Remark 4.2 The above example stresses the fact that the formatting of the code can be used to help represent
the algorithm in code. Clearly, some of the benefit of the API would be lost if in the example the code appeared
as
[ AT, AB ] = FLA_Part_2x1( A, 0, ’FLA_TOP’ )
For some of the subsequent calls this becomes even more dramatic.
Also from Figure 4.1, we notice that it is necessary to be able to take a 2 × 1 partitioning of a given matrix
A (or vector y) and repartition that into a 3 × 1 partitioning so that the submatrices that need to be updated
and/or used for computation can be identified. To support this, we introduce the M-script function
[ A0,...
A1,...
A2 ] = FLA_Repart_2x1_to_3x1( AT,...
AB, mb, side )
Purpose: Repartition a 2 × 1 partitioning of a matrix into a 3 × 1 partitioning where submatrix A1 with mb rows is split from the side indicated by side.
AT, AB – matrices for Top and Bottom parts
mb – row dimension of A1
side – side from which A1 is partitioned
A0, A1, A2 – matrices for A0 , A1 , A2
Here side can take on the values ’FLA_TOP’ or ’FLA_BOTTOM’ to indicate that submatrix A1, with mb rows, is partitioned from AT or AB, respectively.
Thus, for example, the translation of the algorithm fragment from Figure 4.1 on the left results in the code
on the right:
Repartition
( A_T ; A_B ) → ( A_0 ; A_1 ; A_2 ) where A_1 has m_b rows

[ A0,...
  A1,...
  A2 ] = FLA_Repart_2x1_to_3x1( AT,...
                                AB, mb, 'FLA_BOTTOM' )

where parameter mb has the value m_b.
Remark 4.3 Similarly to what is expressed in Remark 4.1, the invocation of the M-script function
[ A0,...
  A1,...
  A2 ] = FLA_Repart_2x1_to_3x1( AT,...
                                AB, mb, side )
creates three new matrices and any modification of the contents of A0, A1, A2 does not affect the original matrix A nor the two submatrices AT, AB. Readability is greatly reduced if it were typeset like
[ A0, A1, A2 ] = FLA_Repart_2x1_to_3x1( AT, AB, mb, side )
Remark 4.4 Choosing variable names can further relate the code to the algorithm, as is illustrated by comparing
( A_0 ; A_1 ; A_2 ) and A0, A1, A2; and ( y_0 ; ψ_1 ; y_2 ) and y0, psi1, y2.
Once the contents of the so-identified submatrices have been updated, AT and AB must be updated to reflect
that progress is being made, in terms of the regions indicated by the thick lines. This movement of the thick
lines is accomplished by a call to the M-script function
[ AT,...
AB ] = FLA_Cont_with_3x1_to_2x1( A0,...
A1,...
A2, side )
Purpose: Update the 2 × 1 partitioning of a matrix by moving the boundaries so that A1 is joined to the side
indicated by side.
A0, A1, A2 – matrices for A0 , A1 , A2
side – side to which A1 is joined
AT, AB – matrices for Top and Bottom parts
For example, the algorithm fragment from Figure 4.1 on the left results in the code on the right:
Continue with
( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 )

[ AT,...
  AB ] = FLA_Cont_with_3x1_to_2x1( A0,...
                                   A1,...
                                   A2, 'FLA_TOP' )
The translation of the algorithm in Figure 4.1 to M-script code is now given in Figure 4.3. In the implementation, the parameter mb_alg holds the algorithmic block size (the number of elements of y that will be computed at each iteration). As, in general, m(y)(= m(A)) will not be a multiple of this block size, at each iteration mb elements are computed, with mb determined as min(m(y_B), mb_alg)(= min(m(A_B), mb_alg)). Also, we use there a different variable for the input vector, y, and the output vector (result), y_out. The reason for this is that it will allow the FLAME@lab code to be more easily translated to FLAME/C, the C API.
In M-script, size( A, 1 ) and size( A, 2 ) return the row and column dimension of array A, respectively.
Placing a “;” at the end of a statement suppresses the printing of the value computed in the statement. The
final statement
y_out = [ yT
yB ];
sets the output variable y_out to the vector that results from concatenating yT and yB.
Exercise 4.5 Visit $BASE/Chapter4/ and follow the directions to reproduce and execute the code in Fig. 4.1.
Figure 4.3: M-script code for the blocked algorithm for computing y := Ax + y (Variant 1).
Figure 4.4: M-script code for the blocked algorithm for computing y := Ax + y (Variant 3).
[ ATL, ATR,...
ABL, ABR ] = FLA_Cont_with_3x3_to_2x2( A00, A01, A02,...
A10, A11, A12,...
A20, A21, A22, quadrant )
Purpose: Update the 2 × 2 partitioning of a matrix by moving the boundaries so that A11 is joined to the
quadrant indicated by quadrant.
A00-A22 – matrices for A00 –A22
quadrant – quadrant to which A11 is to be joined
ATL, ATR, ABL, ABR – matrices for TL, TR, BL, and BR quadrants
Remark 4.7 The routines described in this section for the Matlab M-script language suffice to implement
a broad range of algorithms encountered in dense linear algebra.
Exercise 4.8 Visit $BASE/Chapter4/ and follow the directions to download FLAME@lab and to code and
execute the algorithm in Figure 3.8 for solving the lower triangular linear system Lx = b.
Subprograms (BLAS) [21, 10, 9]. In this section we introduce a set of library routines that allow us to capture
linear algebra algorithms presented in the format used in FLAME in C code.
Readers familiar with MPI [23], PETSc [1], or PLAPACK [28] will recognize the programming style, object-
based programming, as being very similar to that used by those (and other) interfaces. It is this style of
programming that allows us to hide the indexing details much like FLAME@lab does. We will see that a more
substantial infrastructure must be provided in addition to the routines that partition and repartition matrix
objects.
FLA_INT, FLA_DOUBLE, FLA_FLOAT, FLA_DOUBLE_COMPLEX, FLA_COMPLEX
for the obvious datatypes that are commonly encountered. The leading dimension of the array that is used to
store the matrix in column-major order is itself determined inside of this call.
Remark 4.9 For simplicity, we chose to limit the storage of matrices to column-major storage. The leading
dimension of a matrix can be thought of as the dimension of the array in which the matrix is embedded (which
is often larger than the row-dimension of the matrix) or as the increment (in elements) required to address
consecutive elements in a row of the matrix. Column-major storage is chosen to be consistent with Fortran,
which is often still the choice of language for linear algebra applications.
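As a small illustration of column-major storage and the role of the leading dimension, the following fragment (our own example, not a FLAME/C routine) addresses the (i, j) element of a matrix embedded in an array buff exactly as the BUFFER macro of the fill_matrix routine shown below does.

#define ENTRY( buff, lda, i, j ) ( (buff)[ (j)*(lda) + (i) ] )

/* Sum the m entries of column j of a matrix stored in column-major order
   in array buff with leading dimension lda; the elements of a column are
   contiguous in memory.                                                   */
double column_sum( const double *buff, int lda, int m, int j )
{
  double sum = 0.0;
  int i;
  for ( i = 0; i < m; i++ )
    sum += ENTRY( buff, lda, i, j );
  return sum;
}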
FLAME/C treats vectors as special cases of matrices: an n × 1 matrix or a 1 × n matrix. Thus, to create
an object for a vector x of n double-precision real numbers either of the following calls suffices:
FLA_Obj_create( FLA_DOUBLE, n, 1, &x );    or    FLA_Obj_create( FLA_DOUBLE, 1, n, &x );
Here n is an integer variable with value n and x is an object of type FLA_Obj.
Similarly, FLAME/C treats a scalar as a 1 × 1 matrix. Thus, to create an object for a scalar α the following call is made:
FLA_Obj_create( FLA_DOUBLE, 1, 1, &alpha );
where alpha is an object of type FLA_Obj. A number of scalars occur frequently and are therefore predefined
by FLAME/C: FLA_MINUS_ONE, FLA_ZERO, and FLA_ONE.
If an object is created with FLA_Obj_create, a call to FLA_Obj_free is required to ensure that all space
associated with the object is properly released:
FLA_Error FLA_Obj_free( FLA_Obj *matrix )
Purpose: Free all space allocated to store data associated with matrix.
matrix – descriptor for the object
1 #include "FLAME.h"
2
3 #define BUFFER( i, j ) buff[ (j)*lda + (i) ]
4
5 void fill_matrix( FLA_Obj A )
6 {
7 FLA_Datatype
8 datatype;
9 int
10 m, n, lda;
11
12 datatype = FLA_Obj_datatype( A );
13 m = FLA_Obj_length( A );
14 n = FLA_Obj_width ( A );
15 lda = FLA_Obj_ldim ( A );
16
17 if ( datatype == FLA_DOUBLE ){
18 double *buff;
19 int i, j;
20
21 buff = ( double * ) FLA_Obj_buffer( A );
22
23 for ( j=0; j<n; j++ )
24 for ( i=0; i<m; i++ )
25 BUFFER( i, j ) = i+j*0.01;
26 }
27 else
28 FLA_Check_error_code( FLA_NOT_YET_IMPLEMENTED );
29 }
• line 1: FLAME/C program files start by including the FLAME.h header file.
• lines 5–6: FLAME/C objects A, x, and y, which hold matrix A and vectors x and y, respectively, are declared to be of type FLA_Obj.
• line 10: Before any calls to FLAME/C routines can be made, the environment must be initialized by a call to FLA_Init.
1 #include "FLAME.h"
2
3 void main()
4 {
5 FLA_Obj
6 A, x, y;
7 int
8 m, n;
9
10 FLA_Init( );
11
12 printf( "enter matrix dimensions m and n:" );
13 scanf( "%d%d", &m, &n );
14
15 FLA_Obj_create( FLA_DOUBLE, m, n, &A );
16 FLA_Obj_create( FLA_DOUBLE, n, 1, &x );
17 FLA_Obj_create( FLA_DOUBLE, m, 1, &y );
18
19 fill_matrix( A );
20 fill_matrix( x );
21 fill_matrix( y );
22
23 FLA_Obj_show( "y = [", y, "%lf", "]" );
24
25 matvec_blk_var1( A, x, y );
26
27 FLA_Obj_show( "A = [", A, "%lf", "]" );
28 FLA_Obj_show( "x = [", x, "%lf", "]" );
29 FLA_Obj_show( "y = [", y, "%lf", "]" );
30
31 FLA_Obj_free( &A );
32 FLA_Obj_free( &y );
33 FLA_Obj_free( &x );
34
35 FLA_Finalize( );
36 }
• lines 12–13: In our example, the user inputs the row and column dimension of matrix A.
• lines 15–17: Descriptors are created for A, x, and y.
• lines 19–21: The routine in Figure 4.5 is used to fill A, x, and y with values.
• line 25: Compute y := y + Ax using the routine for performing that operation (to be discussed later).
• lines 23, 27–29: Print out the contents of A, x, and (both the initial and final) y.
• lines 31–33: Free the objects.
• line 35: Finalize FLAME/C.
Exercise 4.10 Visit $BASE/Chapter4/ and follow the directions on how to download the libFLAME library.
Then compile and execute the sample driver as directed.
Once the contents of the so-identified submatrices have been updated, the contents of AT and AB must be updated to reflect that progress is being made, in terms of the regions indicated by the thick lines. This movement of the thick lines is accomplished by a call to the C routine FLA_Cont_with_3x1_to_2x1.
Thus, the algorithm fragment from Figure 4.1 on the left results in the code on the right:
Continue with
( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 )

FLA_Cont_with_3x1_to_2x1( &AT, A0,
                               A1,
                        /* ** */ /* ** */
                          &AB, A2, FLA_TOP );
Using the three routines for horizontal partitioning, the algorithm in Figure 4.1 is translated into the C code
in Figure 4.7.
Figure 4.2 illustrates that, when stating a linear algebra algorithm, one may wish to proceed by columns.
Therefore, we introduce the following pair of C routines for partitioning and repartitioning a matrix (or vector)
vertically:
FLA_Error FLA_Part_1x2( FLA_Obj A, FLA_Obj *AL, FLA_Obj *AR, int nb, FLA_Side side )
Purpose: Partition matrix A into a left and right side where the side indicated by side has nb columns.
A – matrix to be partitioned
nb – column dimension of side indicated by side
side – side for which column dimension is given
AL, AR – views of Left and Right parts
and
1 #include "FLAME.h"
2
3 void MATVEC_BLK_VAR1( FLA_Obj A, FLA_Obj x, FLA_Obj y, int mb_alg )
4 {
5 FLA_Obj AT, A0, yT, y0,
6 AB, A1, yB, y1,
7 A2, y2;
8
9 int mb;
10
11 FLA_Part_2x1( A, &AT,
12 &AB, 0, FLA_TOP );
13
14 FLA_Part_2x1( y, &yT,
15 &yB, 0, FLA_TOP );
16
17 while ( FLA_Obj_length( yT ) < FLA_Obj_length( y ) ){
18 mb = min( FLA_Obj_length( AB ), mb_alg );
19
20 FLA_Repart_2x1_to_3x1( AT, &A0,
21 /* ** */ /* ** */
22 &A1,
23 AB, &A2, mb, FLA_BOTTOM );
24
25 FLA_Repart_2x1_to_3x1( yT, &y0,
26 /* ** */ /* ** */
27 &y1,
28 yB, &y2, mb, FLA_BOTTOM );
29 /*------------------------------------------------------------*/
30 MATVEC_VAR1( A1, x, y1 );
31 /*------------------------------------------------------------*/
32 FLA_Cont_with_3x1_to_2x1( &AT, A0,
33 A1,
34 /* ** */ /* ** */
35 &AB, A2, FLA_TOP );
36
37 FLA_Cont_with_3x1_to_2x1( &yT, y0,
38 y1,
39 /* ** */ /* ** */
40 &yB, y2, FLA_TOP );
41 }
42 }
1 #include "FLAME.h"
2
3 void MATVEC_BLK_VAR3( FLA_Obj A, FLA_Obj x, FLA_Obj y, int nb_alg )
4 {
5 FLA_Obj AL, AR, A0, A1, A2;
6
7 FLA_Obj xT, x0,
8 xB, x1,
9 x2;
10
11 int nb;
12
13 FLA_Part_1x2( A, &AL, &AR,
14 0, FLA_LEFT );
15
16 FLA_Part_2x1( x, &xT,
17 &xB,
18 0, FLA_TOP );
19
20 while ( FLA_Obj_length( xT ) < FLA_Obj_length( x ) ){
21          nb = min( FLA_Obj_width( AR ), nb_alg );
22
23 FLA_Repart_1x2_to_1x3( AL, /**/ AR, &A0, /**/ &A1, &A2,
24 nb, FLA_RIGHT );
25
26 FLA_Repart_2x1_to_3x1( xT, &x0,
27 /* ** */ /* ** */
28 &x1,
29 xB, &x2,
30 nb, FLA_BOTTOM );
31 /*------------------------------------------------------------*/
32 MATVEC_VAR2( A1, x1, y );
33 /*------------------------------------------------------------*/
34 FLA_Cont_with_1x3_to_1x2( &AL, /**/ &AR, A0, A1, /**/ A2,
35 FLA_LEFT );
36
37 FLA_Cont_with_3x1_to_2x1( &xT, x0,
38 x1,
39 /* ** */ /* ** */
40 &xB, x2,
41 FLA_TOP );
42 }
43 }
Here quadrant can take on the values FLA_TL, FLA_TR, FLA_BL, and FLA_BR (defined in FLAME.h) to indicate that mb and nb specify the dimensions of the Top-Left, Top-Right, Bottom-Left, or Bottom-Right quadrant, respectively.
Given that a matrix is already partitioned into a 2 × 2 partitioning, it can be further repartitioned into a 3 × 3 partitioning with the C routine:
FLA_Error FLA_Repart_2x2_to_3x3
( FLA_Obj ATL, FLA_Obj ATR, FLA_Obj *A00, FLA_Obj *A01, FLA_Obj *A02,
                            FLA_Obj *A10, FLA_Obj *A11, FLA_Obj *A12,
  FLA_Obj ABL, FLA_Obj ABR, FLA_Obj *A20, FLA_Obj *A21, FLA_Obj *A22,
  int mb, int nb, FLA_Quadrant quadrant )
Purpose: Repartition a 2 × 2 partitioning of matrix A into a 3 × 3 partitioning where mb × nb submatrix A11
is split from the quadrant indicated by quadrant.
ATL, ATR, ABL, ABR – views of TL, TR, BL, and BR quadrants
mb, nb – row and column dimensions of A11
quadrant – quadrant from which A11 is partitioned
A00-A22 – views of A00 –A22
Here quadrant can again take on the values FLA_TL, FLA_TR, FLA_BL, and FLA_BR to indicate that the mb × nb submatrix A11 is split from submatrix ATL, ATR, ABL, or ABR, respectively.
Given a 3 × 3 partitioning, the middle submatrix can be appended to either of the four quadrants, ATL, ATR,
ABL, and ABR, of the corresponding 2 × 2 partitioning with the C routine
FLA_Error FLA_Cont_with_3x3_to_2x2
( FLA_Obj *ATL, FLA_Obj *ATR, FLA_Obj A00, FLA_Obj A01, FLA_Obj A02,
FLA_Obj A10, FLA_Obj A11, FLA_Obj A12,
FLA_Obj *ABL, FLA_Obj *ABR, FLA_Obj A20, FLA_Obj A21, FLA_Obj A22,
FLA_Quadrant quadrant )
Purpose: Update the 2 × 2 partitioning of matrix A by moving the boundaries so that A11 is joined to the
quadrant indicated by quadrant.
ATL, ATR, ABL, ABR – views of TL, TR, BL, and BR quadrants
A00-A22 – views of A00 –A22
quadrant – quadrant to which A11 is to be joined
Here the value of quadrant (FLA_TL, FLA_TR, FLA_BL, or FLA_BR) specifies the quadrant to which submatrix A11 is to be joined.
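The following skeleton (a sketch assembled from the routines just described, mirroring the structure of Figure 3.13) shows how a square matrix A is traversed through its quadrants; the update on the exposed blocks would go where indicated, and, as in Figure 3.13, a min macro or function is assumed to be available.

#include "FLAME.h"

void traverse_quadrants( FLA_Obj A, int nb_alg )
{
  FLA_Obj ATL, ATR,   A00, A01, A02,
          ABL, ABR,   A10, A11, A12,
                      A20, A21, A22;
  int b;

  FLA_Part_2x2( A, &ATL, &ATR,
                   &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */ /* ********************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------*/
    /* ... update A11 (and A10, A21, A12, A01, ...) here ...       */
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,  A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                           /* ************** */ /* ****************** */
                              &ABL, /**/ &ABR,  A20, A21, /**/ A22,
                              FLA_TL );
  }
}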
• Subroutines coded using the FLAME/C interface (including, possibly, a recursive call);
Naturally these are actually three points on a spectrum of possibilities, since one can mix these techniques.
A number of matrix and/or vector operations have been identified to be frequently used by the linear algebra
community. Many of these are part of the BLAS. Since highly optimized implementations of these operations
are supported by widely available library implementations, it makes sense to provide a set of subroutines that
are simply wrappers to the BLAS. An example of a wrapper routine to the level 2 BLAS routine cblas dgemv,
a commonly available kernel for computing a matrix-vector multiplication, is given in Figure 4.9.
For additional information on supported functionality see Appendix B or visit the webpage
http://www.cs.utexas.edu/users/flame/
1 #include "FLAME.h"
2 #include "cblas.h"
3
4 void matvec_wrapper( FLA_Obj A, FLA_Obj x, FLA_Obj y )
5 {
6 FLA_Datatype
7 datatype_A;
8 int
9      m_A, n_A, ldim_A, m_x, n_x, inc_x, m_y, n_y, inc_y;
10
11 datatype_A = FLA_Obj_datatype( A );
12 m_A = FLA_Obj_length( A );
13 n_A = FLA_Obj_width ( A );
14 ldim_A = FLA_Obj_ldim ( A );
15
16 m_x = FLA_Obj_length( x );
17 n_x = FLA_Obj_width ( x );
18
19 m_y = FLA_Obj_length( y );
20 n_y = FLA_Obj_width ( y );
21
22 if ( m_x == 1 ) {
23 m_x = n_x;
24 inc_x = FLA_Obj_ldim( x );
25 }
26 else inc_x = 1;
27
28 if ( m_y == 1 ) {
29 m_y = n_y;
30 inc_y = FLA_Obj_ldim( y );
31 }
32 else inc_y = 1;
33
34 if ( datatype_A == FLA_DOUBLE ){
35 double *buff_A, *buff_x, *buff_y;
36
37 buff_A = ( double * ) FLA_Obj_buffer( A );
38 buff_x = ( double * ) FLA_Obj_buffer( x );
39 buff_y = ( double * ) FLA_Obj_buffer( y );
40
41      cblas_dgemv( CblasColMajor, CblasNoTrans, m_A, n_A,
42                   1.0, buff_A, ldim_A, buff_x, inc_x,
43                   1.0, buff_y, inc_y );
44 }
45 else FLA_Abort( "Datatype not yet supported", __LINE__, __FILE__ );
46 }
Figure 4.9: A sample matrix-vector multiplication routine. This routine is implemented as a wrapper to the CBLAS routine cblas_dgemv for matrix-vector multiplications.
Exercise 4.14 Use the routines in the FLAME/C API to implement the algorithm in Figure 3.8 for solving the lower triangular system Lx = b, overwriting b with the solution x.
4.4 Summary
The FLAME@lab and FLAME/C APIs illustrate how, by raising the level of abstraction at which one codes,
intricate indexing can be avoided in the code, therefore reducing the opportunity for the introduction of errors
and raising the confidence in correctness of the code. Thus, the proven correctness of those algorithms derived
using the FLAME methodology translates to a high degree of confidence in the implementation.
The two APIs that we presented are simple ones and serve to illustrate the issues. Similar interfaces to
more elaborate programming languages (e.g., C++, Java, and LabView’s G graphical programming language)
can be easily defined allowing special features of those languages to be used to even further raise the level of
abstraction at which one codes.
Chapter 5

High Performance Algorithms

Dense linear algebra operations are often at the heart of scientific computations that stress even the fastest
computers available. As a result, it is important that routines that compute these operations attain high
performance in the sense that they perform near the minimal number of operations and achieve near the
highest possible rate of execution. In this chapter we show that high performance can be achieved by casting
computation as much as possible in terms of the matrix-matrix product operation (gemm). We also expose that
for many matrix-matrix operations in linear algebra the derivation techniques discussed so far yield algorithms
that are rich in gemm.
Remark 5.1 Starting from this chapter, we adopt a more concise manner of presenting the derivation of the
algorithms where we only specify the partitioning of the operands and the loop-invariant. Recall that these two
elements prescribe the remaining derivation procedure of the worksheet and, therefore, the algorithm.
Operation                          flops    memops    flops/memops
Vector-vector operations
  scal    x := αx                    n        n          1/1
  add     x := x + y                 n        n          1/1
  dot     α := x^T y                 2n       2n         1/1
  apdot   α := α + x^T y             2n       2n         1/1
  axpy    y := αx + y                2n       3n         2/3
Matrix-vector operations
  gemv    y := αAx + βy              2n²      n²         2/1
  ger     A := αyx^T + A             2n²      2n²        1/1
  trsv    x := T^{-1} b              n²       n²/2       2/1
Matrix-matrix operations
  gemm    C := αAB + βC              2n³      4n²        n/2
Figure 5.1: Analysis of the cost of various operations. Here α ∈ R, x, y, b ∈ Rn , and A, B, C, T ∈ Rn×n , with T
being triangular.
data must be fetched from the memory to the registers in the CPU, and results must eventually be returned
to the memory. The fundamental obstacle to high performance (executing useful computations at the rate at
which the CPU can process) is the speed of memory: fetching and/or storing a data item from/to the memory
requires more time than it takes to perform a flop with it. This is known as the memory bandwidth bottleneck.
The solution has been to introduce a small cache memory, which is fast enough to keep up with the CPU, but
small enough to be economical (e.g., in terms of space required inside the processor). The pyramid in Figure 5.2
depicts the resulting model of the memory architecture. The model is greatly simplified in that currently most
architectures have several layers of cache and often also present additional, even slower, levels of memory. The
model is sufficient to explain the main issues behind achieving high performance.
[Pyramid diagram: Registers (fast, small) at the top, Cache in the middle, RAM (slow, large) at the bottom.]
Figure 5.2: Simplified model of the memory architecture used to illustrate the high-performance implementation
of gemm.
Next, consider the gemv operation y := Ax + y, where x, y ∈ Rn and A ∈ Rn×n . This operation involves
roughly n2 data (for the matrix), initially stored in memory, and 2n2 flops. Thus, an optimal implementation
will fetch every element of A exactly once, yielding a ratio of one memop for every two flops. Although this is
better than the ratio for the axpy, memops still dominate the cost of the algorithm if they are much slower
than flops. Figure 5.1 summarizes the analysis for other matrix-vector operations.
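A straightforward (unoptimized) gemv illustrates the point. In the sketch below, which assumes column-major storage with leading dimension lda (our own example, not a library routine), each element of A is loaded exactly once and participates in two flops, the 2/1 ratio listed in Figure 5.1 when only the traffic on A is counted.

void gemv_ref( int m, int n, const double *A, int lda,
               const double *x, double *y )
{
  int i, j;
  for ( j = 0; j < n; j++ )        /* y := x_j a_j + y (axpy with column j of A) */
    for ( i = 0; i < m; i++ )
      y[ i ] += A[ i + j * lda ] * x[ j ];
}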
It is by casting linear algebra algorithms in terms of the matrix-matrix product, gemm, that there is the
opportunity to overcome this memory bottleneck. Consider the product C := AB + C where all three matrices
are square of order n. This operation involves 4n2 memops (A and B must be fetched from memory while C
must be both fetched and stored) and requires 2n3 flops1 for a ratio of 4n2 /2n3 = 2/n memops/flops. Thus, if
n is large enough, the cost of performing memops is small relative to that of performing useful computations
with the data, and there is an opportunity to amortize the cost of fetching data into the cache over many
computations.
The gemm operation is defined as
$$ C := \alpha\, \mbox{op}(A)\, \mbox{op}(B) + \beta C, $$
where α, β ∈ R, op(A) ∈ R^{m×k}, op(B) ∈ R^{k×n}, C ∈ R^{m×n}, and op(X) is one of X or X^T. That is, gemm computes one of
$$ C := \alpha A B + \beta C, \quad C := \alpha A^T B + \beta C, \quad C := \alpha A B^T + \beta C, \quad \mbox{or} \quad C := \alpha A^T B^T + \beta C. $$
In the remainder of this chapter we will focus on the special case where α = β = 1 and matrices A and B are
not transposed. All insights can be easily extended to the other cases.
Throughout this chapter, unless otherwise stated, we will assume that A ∈ R^{m×k}, B ∈ R^{k×n}, and C ∈ R^{m×n}. These matrices will be partitioned into rows, columns, and elements using the conventions discussed in
Section 3.1.
5.2.1 Definition
The reader is likely familiar with the matrix-matrix product operation. Nonetheless, it is our experience that
it is useful to review why the matrix-matrix product is defined as it is.
Like the matrix-vector product, the matrix-matrix product is related to the properties of linear transforma-
tions. In particular, the matrix-matrix product AB equals the matrix that corresponds to the composition of
the transformations represented by A and B.
We start by reviewing the definition of the composition of two transformations.
Definition 5.2 Consider two linear transformations F : Rn → Rk and G : Rk → Rm . Then the composition of
these transformations (G ◦ F ) : Rn → Rm is defined by (G ◦ F)(x) = G(F(x)) for all x ∈ Rn .
The next theorem shows that if both G and F are linear transformations then so is their composition.
Theorem 5.3 Consider two linear transformations F : Rn → Rk and G : Rk → Rm . Then (G ◦ F) is also a
linear transformation.
Proof: Let α ∈ R and x, y ∈ Rn . Then
(G ◦ F)(αx + y) = G(F(αx + y)) = G(αF(x) + F(y)) = αG(F(x)) + G(F(y)) = α(G ◦ F)(x) + (G ◦ F)(y). □
With these observations we are ready to relate the composition of linear transformations to the matrix-matrix
product.
Assume A and B equal the matrices that correspond to the linear transformations G and F, respectively.
Since (G ◦ F ) is also a linear transformation, there exists a matrix C so that Cx = (G ◦ F )(x) for all x ∈ Rn .
The question now becomes how C relates to A and B. The key is the observation that Cx = (G ◦ F )(x) =
G(F(x)) = A(Bx), by the definition of composition and the relation between the matrix-vector product and
linear transformations.
Let ei , ej ∈ Rn denote, respectively, the ith, jth unit basis vector. This observation defines the jth column
of C as follows
$$ c_j = C e_j = (G \circ F)(e_j) = A(B e_j) = A b_j = \left( \begin{array}{c} \check a_0^T \\ \check a_1^T \\ \vdots \\ \check a_{m-1}^T \end{array} \right) b_j = \left( \begin{array}{c} \check a_0^T b_j \\ \check a_1^T b_j \\ \vdots \\ \check a_{m-1}^T b_j \end{array} \right), \qquad (5.1) $$
Exercise 5.5 Show that the cost of computing the matrix-matrix product is 2mnk flops.
Exercise 5.6 Show that the ith row of C is given by $\check c_i^T = \check a_i^T B$.
Exercise 5.7 Show that A(BC) = (AB)C. (This motivates the fact that no parentheses are needed when more than two matrices are multiplied together: ABC = A(BC) = (AB)C.)
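A reference implementation that follows the definition directly can be useful for checking the variants derived later. The sketch below (our own, assuming column-major storage with the indicated leading dimensions) updates column j of C with A b_j and performs the 2mnk flops of Exercise 5.5.

void gemm_ref( int m, int n, int k,
               const double *A, int lda,
               const double *B, int ldb,
               double       *C, int ldc )
{
  int i, j, p;
  for ( j = 0; j < n; j++ )        /* for each column c_j of C       */
    for ( p = 0; p < k; p++ )      /* c_j := beta_{p,j} a_p + c_j    */
      for ( i = 0; i < m; i++ )
        C[ i + j * ldc ] += A[ i + p * lda ] * B[ p + j * ldb ];
}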
where Ai,p ∈ Rmi ×kp , Bp,j ∈ Rkp ×nj , and Ci,j ∈ Rmi ×nj . Then, the (i, j) block of C = AB is given by
$$ C_{i,j} = \sum_{p=0}^{\kappa-1} A_{i,p} B_{p,j}. \qquad (5.3) $$
The proof of this theorem is tedious. We therefore resort to Exercise 5.9 to demonstrate why it is true
without giving a rigorous proof in this text.
Exercise 5.9 Show that
$$ \left( \begin{array}{cc|c} 1 & -1 & 3 \\ 2 & 0 & -1 \\ \hline -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{cc|c} -1 & 0 & 2 \\ 1 & -1 & 1 \\ \hline 2 & 1 & -1 \end{array} \right) = \left( \begin{array}{c|c} \left( \begin{array}{cc} 1 & -1 \\ 2 & 0 \end{array} \right) \left( \begin{array}{cc} -1 & 0 \\ 1 & -1 \end{array} \right) + \left( \begin{array}{c} 3 \\ -1 \end{array} \right) \left( \begin{array}{cc} 2 & 1 \end{array} \right) & \left( \begin{array}{cc} 1 & -1 \\ 2 & 0 \end{array} \right) \left( \begin{array}{c} 2 \\ 1 \end{array} \right) + \left( \begin{array}{c} 3 \\ -1 \end{array} \right) (-1) \\ \hline \left( \begin{array}{cc} -1 & 2 \\ 0 & 1 \end{array} \right) \left( \begin{array}{cc} -1 & 0 \\ 1 & -1 \end{array} \right) + \left( \begin{array}{c} 1 \\ 2 \end{array} \right) \left( \begin{array}{cc} 2 & 1 \end{array} \right) & \left( \begin{array}{cc} -1 & 2 \\ 0 & 1 \end{array} \right) \left( \begin{array}{c} 2 \\ 1 \end{array} \right) + \left( \begin{array}{c} 1 \\ 2 \end{array} \right) (-1) \end{array} \right). $$
Remark 5.10 Multiplying two partitioned matrices is exactly like multiplying two matrices with scalar ele-
ments, but with the individual elements replaced by submatrices. However, since the product of matrices does
not commute, the order of the submatrices of A and B in the product is important: While αi,p βp,j = βp,j αi,p ,
Ai,p Bp,j is generally not the same as Bp,j Ai,p . Also, the partitioning of A, B, and C must be conformal:
m(Ai,p ) = m(Ci,j ), n(Bp,j ) = n(Ci,j ), and n(Ai,p ) = m(Bp,j ), for 0 ≤ i < µ, 0 ≤ j < ν, 0 ≤ p < κ.
Remark 5.11 In the next section we will see that “small” is linked to the dimension of a block of a matrix that fits in the cache of the target architecture.
[Table with columns “m”, “n”, “k”, “Illustration”, and “Label”, giving, for each combination of large/small dimensions, the shape of the matrices involved in gemm and its name.]
Figure 5.4: Naming convention for the shape of matrices involved in gemm.
Left (unblocked Variant 1):
  C := a_1 b_1^T + C (ger)
  Continue with ( A_L | A_R ) ← ( A_0 | a_1 | A_2 ),  ( B_T ; B_B ) ← ( B_0 ; b_1^T ; B_2 )
  endwhile
Right (blocked Variant 1):
  C := A_1 B_1 + C (gepp)
  Continue with ( A_L | A_R ) ← ( A_0 | A_1 | A_2 ),  ( B_T ; B_B ) ← ( B_0 ; B_1 ; B_2 )
  endwhile
Figure 5.5: Left: gemm implemented as a sequence of ger operations (unblocked Variant 1). Right: gemm
implemented as a sequence of gepp operations (blocked Variant 1).
That is, each column of C is obtained from a gemv of A and the corresponding column of B. In line with
Remark 5.10, this can be viewed as partitioning B and C by columns and updating C as if A were a scalar and
B and C were row vectors.
Exercise 5.14 Show that
$$ \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{ccc} -1 & 0 & 2 \\ 1 & -1 & 1 \\ 2 & 1 & -1 \end{array} \right) = \left( \begin{array}{c|c|c} \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{c} -1 \\ 1 \\ 2 \end{array} \right) & \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{c} 0 \\ -1 \\ 1 \end{array} \right) & \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{c} 2 \\ 1 \\ -1 \end{array} \right) \end{array} \right). $$
The following corollary of Theorem 5.8 yields a second PME from which another unblocked and a blocked
variant for gemm can be derived.
Corollary 5.15 Partition B → ( B_L | B_R ) and C → ( C_L | C_R ), where n(B_L) = n(C_L). Then
$$ AB + C = A \left( \begin{array}{c|c} B_L & B_R \end{array} \right) + \left( \begin{array}{c|c} C_L & C_R \end{array} \right) = \left( \begin{array}{c|c} A B_L + C_L & A B_R + C_R \end{array} \right). $$
Left (unblocked Variant 2):
  Continue with ( B_L | B_R ) ← ( B_0 | b_1 | B_2 ),  ( C_L | C_R ) ← ( C_0 | c_1 | C_2 )
  endwhile
Right (blocked Variant 2):
  Continue with ( B_L | B_R ) ← ( B_0 | B_1 | B_2 ),  ( C_L | C_R ) ← ( C_0 | C_1 | C_2 )
  endwhile
Figure 5.6: Left: gemm implemented as a sequence of gemv operations (unblocked Variant 2). Right: gemm
implemented as a sequence of gemp operations (blocked Variant 2).
$$ P_{inv} : \left( \left( \begin{array}{c|c} C_L & C_R \end{array} \right) = \left( \begin{array}{c|c} A B_L + \hat C_L & \hat C_R \end{array} \right) \right) \wedge P_{cons}, $$
with “P_cons : n(B_L) = n(C_L)”, we arrive at the algorithms for gemm in Figure 5.6. The unblocked variant in this case is composed of gemv operations, and the blocked variant of gemp operations as, in this case (see Table 5.3), both B_1 and C_1 are “narrow” blocks of columns (panels) with only n_b columns, n_b ≪ m, k.
$$ e_i^T (AB) = (e_i^T A) B = \check a_i^T B. $$
contents of $\check c_i^T$.
Left (unblocked Variant 3):
  c_1^T := a_1^T B + c_1^T (gevm)
  Continue with ( A_T ; A_B ) ← ( A_0 ; a_1^T ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; c_1^T ; C_2 )
  endwhile
Right (blocked Variant 3):
  C_1 := A_1 B + C_1 (gepm)
  Continue with ( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; C_1 ; C_2 )
  endwhile
Figure 5.7: Left: gemm implemented as a sequence of gevm operations (unblocked Variant 3). Right: gemm
implemented as a sequence of gepm operations (blocked Variant 3).
with “P_cons : m(A_T) = m(C_T)”, we obtain the algorithms for gemm in Figure 5.7. The unblocked algorithm there consists of gevm operations, and the blocked variant of the corresponding generalization in the form of gepm as now A_1 and C_1 are blocks of m_b rows and m_b ≪ n, k (see Table 5.3).
Remark 5.18 The blocked variants in Figures 5.5, 5.6, and 5.7 compute gemm in terms of gepp, gemp, and
gepm operations, respectively. In Section 5.4 it will be shown how these three operations can be implemented
to achieve high performance. As a consequence, the blocked algorithms for gemm based on such operations
also achieve high performance.
5.3.4 Performance
Performance of the algorithms presented in Figures 5.5–5.7 on an Intel Pentium 4 is given in Figure 5.8. (See Section 1.5 for details on the architecture, compiler, etc.) The line labeled as “Simple” refers to a traditional implementation of gemm, consisting of three nested loops fully optimized by the compiler. Highly optimized implementations of the ger, gemv, and gevm operations were called by the gemm_unb_var1, gemm_unb_var2, and gemm_unb_var3 implementations. Similarly, highly optimized implementations for the gepp, gemp, and gepm operations were called by implementations of the blocked algorithms, which used a
[Figure 5.8: Performance, in GFLOPS, of the gemm implementations (“Simple”, and gemm via gepp, gemp, gepm, ger, gemv, and gevm) as a function of the matrix dimensions m = n = k (square matrices), for dimensions up to 2000.]
• Among the unblocked algorithms, the one that utilizes ger is the slowest. This is not surprising: it
generates twice as much data traffic between main memory and the registers as either of the other two.
• The “hump” in the performance of the unblocked algorithms coincides with problem sizes that fit in the
cache. For illustration, consider the gemv-based implementation in gemm unb var2. Performance of
the code is reasonable when the dimensions are relatively small since from one gemv to the next matrix
A then remains in the cache.
• The blocked algorithms attain a substantial percentage of peak performance (2.8 GFLOPS), that ranges
between 70 and 85%.
• Among the blocked algorithms, the variant that utilizes gepp is the fastest. On almost all architectures
at this writing high performance implementations of gepp outperform those of gemp and gepm. The
reasons for this go beyond the scope of this text.
• On different architectures, the relative performance of the algorithms may be quite different. However,
blocked algorithms invariably outperform unblocked ones.
C := A B + C
• The dimensions mc , kc are small enough so that A, a column from B, and a column from C together fit
in the cache.
• If A and the two columns are in the cache then gemv can be computed at the peak rate of the CPU.
Under these assumptions, the approach to implementing algorithm gebp in Figure 5.9 amortizes the cost
of moving data between the main memory and the cache as follows. The total cost of updating C is mc kc +
(2mc + kc )n memops for 2mc kc n flops. Now, let c = mc ≈ kc . Then, the ratio between computation and data
movement is
$$ \frac{2 c^2 n \mbox{ flops}}{(c^2 + 3cn) \mbox{ memops}} \approx \frac{2c}{3}\,\frac{\mbox{flops}}{\mbox{memops}} \quad \mbox{when } c \ll n. $$
If c ≈ n/100 then even if memops are 10 times slower than flops, the memops add only about 10% overhead to
the computation.
We note the similarity between algorithm gebp unb var2 in Figure 5.9 and the unblocked Variant 2 for
gemm in Figure 5.6 (right).
Remark 5.19 In the highest-performance implementations of gebp, both A and B are typically copied into
a contiguous buffer and/or transposed. For complete details on this observation, see [13].
Repartition
( B_L | B_R ) → ( B_0 | b_1 | B_2 ),  ( C_L | C_R ) → ( C_0 | c_1 | C_2 )
where b_1 and c_1 are columns
Continue with
( B_L | B_R ) ← ( B_0 | b_1 | B_2 ),  ( C_L | C_R ) ← ( C_0 | c_1 | C_2 )
endwhile
Figure 5.9: gebp implemented as a sequence of gemv, with indication of the memops and flops costs. Note
that a program typically has no explicit control over the loading of a cache. Instead it is in using the data that
the cache is loaded, and by carefully ordering the computation that the architecture is encouraged to keep data
in the cache.
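A plain-C sketch of such a gebp kernel is given below. It is not a high-performance implementation, but it makes the two ingredients explicit: A (mc × kc) is first packed into a contiguous buffer, as mentioned in Remark 5.19, and C is then updated one column at a time with a gemv that reuses the packed A. The function name and the column-major storage with the indicated leading dimensions are our assumptions.

#include <stdlib.h>
#include <string.h>

void gebp_packed( int mc, int kc, int n,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double       *C, int ldc )
{
  int i, j, p;
  double *Apack = (double *) malloc( (size_t) mc * kc * sizeof( double ) );

  /* Pack A: column p of A becomes column p of a contiguous mc x kc buffer. */
  for ( p = 0; p < kc; p++ )
    memcpy( &Apack[ p * mc ], &A[ p * lda ], mc * sizeof( double ) );

  /* c_j := A b_j + c_j; the packed A is reused for every column of C.      */
  for ( j = 0; j < n; j++ )
    for ( p = 0; p < kc; p++ )
      for ( i = 0; i < mc; i++ )
        C[ i + j * ldc ] += Apack[ i + p * mc ] * B[ p + j * ldb ];

  free( Apack );
}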
Left:
  C_1 := A_1 B + C_1 (gebp)
  Continue with ( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; C_1 ; C_2 )
  endwhile
Right:
  C_1 := A_1 B + C_1 (gepm)
  Continue with ( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; C_1 ; C_2 )
  endwhile
Figure 5.10: Left: gepp implemented as a sequence of gebp operations. Right: gemm implemented as a
sequence of gepm operations.
Exercise 5.20 Propose a similar scheme for the gepb operation, where A ∈ Rm×kc , B ∈ Rkc ×nc , and C ∈
Rm×nc . State your assumptions carefully. Analyze the ratio of flops to memops.
Exercise 5.21 Propose a similar scheme for the gepdot operation, where A ∈ Rmc ×k , B ∈ Rk×nc , and
C ∈ Rmc ×nc . State your assumptions carefully. Analyze the ratio of flops to memops.
Consider the gepp operation C := AB + C, where A ∈ Rm×kb , B ∈ Rkb ×n , and C ∈ Rm×n . By partitioning
the matrices into two different directions, we will obtain two algorithms for this operation, based on gebp or
gepb. We will review here the first variant while the second one is proposed as an exercise.
Assume m is an exact multiple of mb and partition matrices A and C into blocks of mb rows so that the
product takes the form
$$ \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right) := \left( \begin{array}{c} A_0 \\ A_1 \\ \vdots \end{array} \right) B + \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right). $$
Then, each Ci can be computed as a gebp of the form Ci := Ai B + Ci . Since it was argued that gebp can
attain high performance, provided mb = mc and kb = kc , so can the gepp.
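Under the same (assumed) column-major conventions, this loop over gebp updates can be sketched in C as follows; gebp_packed refers to the kernel sketched in the previous section.

void gebp_packed( int mc, int kc, int n, const double *A, int lda,
                  const double *B, int ldb, double *C, int ldc );

void gepp_by_gebp( int m, int n, int kb, int mb,
                   const double *A, int lda,
                   const double *B, int ldb,
                   double       *C, int ldc )
{
  int i;
  for ( i = 0; i < m; i += mb ){
    int b = ( m - i < mb ? m - i : mb );
    /* C_i := A_i B + C_i (gebp on a block of b rows). */
    gebp_packed( b, kb, n, &A[ i ], lda, B, ldb, &C[ i ], ldc );
  }
}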
Remark 5.22 In the implementation of gepp based on gebp there is complete freedom to choose mb = mc. Also, kb is usually set by the routine that invokes gepp (e.g., gemm_blk_var1), so that it can be chosen there
as kb = kc . An analogous situation will occur for the alternative implementation of gepp based on gepb, and
for all implementations of gemp and gepm.
The algorithm for this is given in Figure 5.10 (left). For comparison, we repeat Variant 3 for computing
gemm in the same figure (right). The two algorithms are identical except that the constraint on the row
dimension of A and column dimension of B changes the update from a gepm to a gebp operation.
Exercise 5.23 For the gepp operation assume n is an exact multiple of nb , and partition B and C into blocks
of nb columns so that
$$ \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right) := A \left( \begin{array}{c|c|c} B_0 & B_1 & \cdots \end{array} \right) + \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right). $$
Propose an alternative high-performance algorithm for computing gepp based on gepb. Compare the resulting
algorithm to the three variants for computing gemm. Which variant does it match?
$$ C := \left( \begin{array}{c|c|c} A_0 & A_1 & \cdots \end{array} \right) \left( \begin{array}{c} B_0 \\ B_1 \\ \vdots \end{array} \right) + C. $$
Then, C can be computed as repeated updates of C with gepb operations, C := A_p B_p + C. The algorithm is identical to gemm_blk_var1 in Figure 5.5 except that the update changes from a gepp to a gepb operation. If gepb attains high performance, provided n_b = n_c and k_b = k_c, so will this algorithm for computing gemp.
Exercise 5.24 For the gemp operation assume m is an exact multiple of mb , and partition A and C by blocks
of mb rows as
$$ \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right) := \left( \begin{array}{c} A_0 \\ A_1 \\ \vdots \end{array} \right) B + \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right). $$
$$ \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right) := A \left( \begin{array}{c|c|c} B_0 & B_1 & \cdots \end{array} \right) + \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right). $$
Then each block of C can be computed as C_j := A B_j + C_j using the gepdot operation. The algorithm is identical to gemm_blk_var2 in Figure 5.6 except that the update changes from a gemp to a gepdot operation. If gepdot attains high performance, provided m_b ≈ m_c and n_b = n_c, so will this algorithm for computing gepm.
Exercise 5.25 For the gepm operation assume k is an exact multiple of kb , and partition A and B by blocks
of kb rows and columns, respectively, so that
$$ C := \left( \begin{array}{c|c|c} A_0 & A_1 & \cdots \end{array} \right) \left( \begin{array}{c} B_0 \\ B_1 \\ \vdots \end{array} \right) + C. $$
Remark 5.26 If one variant of matrix-matrix multiplication is used at one level, that same variant does not
occur at the next level. There are theoretical reasons for this that go beyond the scope of this text. For details,
see [17].
5.5 Modularity and Performance via gemm: Implementing symm

In this section we consider the symmetric matrix-matrix multiplication (symm)
C := AB + C,
where A is symmetric.
Remark 5.28 Unless otherwise stated, we assume hereafter that it is the lower part of the symmetric matrix
A, including the diagonal, that contains the relevant entries of the matrix. In our notation this is denoted as
SyLw(A). The algorithms that are derived will not make any reference to the contents of the strictly upper
triangular part (superdiagonals) of symmetric matrices.
[Diagram relating the implementations of gemm; the labels include gemm_blk_var2 (gepdot, m_b, n_b, k), gemm_unb_var1 (ger, m_b, n_b), gemm_blk_var3 (gepm, m_b, n, k), gemm_blk_var1 (gebp, m_b, n, k_b), and gemm_unb_var2 (gemv, m_b, k_b).]
Figure 5.11: Implementations of gemm. The legend on top of each figure indicates the algorithm that is invoked in that case, and (between parentheses) the shape and dimensions of the subproblems the case is decomposed into. For instance, in the case marked with “?”, the product is performed via algorithm gemm_blk_var1, which is then decomposed into matrix-matrix products of shape gepp and dimensions m, n, and k_b.
When dealing with symmetric matrices, in general only the upper or lower part of the matrix is actually
stored. One option is to copy the stored part of A into both the upper and lower triangular part of a temporary matrix and to then use gemm. This is undesirable if A is large, since it requires temporary space.
The precondition for the symmetric matrix-matrix product, symm, is given by
Ppre : (A ∈ Rm×m ) ∧ SyLw(A) ∧ (B, C ∈ Rm×n ),
while the postcondition is that
Ppost : C = AB + Ĉ.
We next formulate a partitioning and a collection of loop-invariants that potentially yield algorithms for
symm. Let us partition the symmetric matrix A into quadrants as
$$ A \rightarrow \left( \begin{array}{c|c} A_{TL} & A_{BL}^T \\ \hline A_{BL} & A_{BR} \end{array} \right). $$
Then, from the postcondition, C = AB + Ĉ, a consistent partitioning of matrices B and C is given by
$$ B \rightarrow \left( \begin{array}{c} B_T \\ \hline B_B \end{array} \right), \quad C \rightarrow \left( \begin{array}{c} C_T \\ \hline C_B \end{array} \right), $$
where “Pcons : n(AT L ) = m(BT ) ∧ m(AT L ) = m(CT )” holds. (A different possibility would be to also partition
B and C into quadrants, a case that is proposed as an exercise at the end of this section.) Very much as what
we do for triangular matrices, for symmetric matrices we also require the blocks in the diagonal to be square
(and therefore) symmetric. Thus, in the previous partitioning of A we want that
Pstruct : SyLw(AT L ) ∧ (m(AT L ) = n(AT L )) ∧ SyLw(ABR ) ∧ (m(ABR ) = n(ABR ))
holds. Indeed, because SyLw(A), it is sufficient to define
Pstruct : SyLw(AT L ).
Remark 5.29 When dealing with symmetric matrices, in order for the diagonal blocks that are exposed to
be themselves symmetric, we always partition this type of matrices into quadrants, with square blocks in the
diagonal.
The PME is given by
$$ \left( \begin{array}{c} C_T \\ \hline C_B \end{array} \right) = \left( \begin{array}{c|c} A_{TL} & A_{BL}^T \\ \hline A_{BL} & A_{BR} \end{array} \right) \left( \begin{array}{c} B_T \\ \hline B_B \end{array} \right) + \left( \begin{array}{c} \hat C_T \\ \hline \hat C_B \end{array} \right), $$
which is equivalent to
$$ \left( \begin{array}{c} C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ \hline C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B \end{array} \right). $$
Recall that loop-invariants result by assuming that some computation is yet to be performed. A systematic enumeration of subresults, each of which is a potential loop-invariant, is given in Figure 5.12. We are only interested in feasible loop-invariants:
Definition 5.30 A feasible loop-invariant is a loop-invariant that yields a correct algorithm when the derivation
methodology is applied. If a loop-invariant is not feasible, it is infeasible.
In the column marked by “Comment” reasons are given why a loop-invariant is not feasible.
Among the feasible loop-invariants in Figure 5.12, we now choose
õ ¶ à !!
CT AT L BT + ĈT
= ∧ Pcons ∧ Pstruct ,
CB ĈB
for the remainder of this section. This invariant yields the blocked algorithm in Figure 5.13. As part of the update, in this algorithm the symmetric matrix-matrix multiplication A_{11} B_1 needs to be computed (being a square block on the diagonal of A, A_{11} = A_{11}^T). In order to do so, we can apply an unblocked version of the algorithm which, of course, would not reference the strictly upper triangular part of A_{11}. The remaining two updates require the computation of two gemms, A_{10}^T B_1 and A_{10} B_0, and do not reference any block in the strictly upper triangular part of A either.
Exercise 5.31 Show that the cost of the algorithm for symm in Figure 5.13 is 2m2 n flops.
Exercise 5.32 Derive a pair of blocked algorithms for computing C := AB + Ĉ, with SyLw(A), by partitioning
all three matrices into quadrants and choosing two feasible loop-invariants found for this case.
Exercise 5.33 Derive a blocked algorithm for computing C := BA + Ĉ, with SyLw(A), A ∈ Rm×m , C, B ∈
Rn×m , by partitioning both B and C in a single dimension and choosing a feasible loop-invariant found for this
case.
5.5.1 Performance
Consider now m to be an exact multiple of $m_b$, $m = \mu m_b$. The algorithm in Figure 5.13 requires $\mu$ iterations,
with $2 m_b^2 n$ flops being performed as a symmetric matrix multiplication ($A_{11} B_1$) at each iteration, while the rest
of the computation is in terms of two gemms ($A_{10}^T B_1$ and $A_{10} B_0$). The amount of computation carried out as
symmetric matrix multiplications, $2 m\, m_b\, n$ flops, is only a minor part of the total cost of the algorithm, $2 m^2 n$
flops (provided $m_b \ll m$). Thus, given an efficient implementation of gemm, high performance can be expected
from this algorithm.
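As a quick check of the arithmetic (a short derivation added here for clarity, under the stated assumption $m = \mu m_b$):
$$\mu\,(2 m_b^2 n) = \frac{m}{m_b}\,(2 m_b^2 n) = 2 m\, m_b\, n, \qquad \frac{2 m\, m_b\, n}{2 m^2 n} = \frac{m_b}{m},$$
so the fraction of flops not cast in terms of gemm is only $m_b/m$, which is indeed small when $m_b \ll m$.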
5.6 Summary
The highlights of this chapter are:
• A high level description of the architectural features of a computer that affect high-performance imple-
mentation of the matrix-matrix multiply.
Each row below records which of the four terms of the PME ($A_{TL} B_T$, $A_{BL}^T B_B$, $A_{BL} B_T$, $A_{BR} B_B$) has already been computed (Y/N), the corresponding $P_{inv}$ (the contents of $C_T$ and $C_B$), and a comment:

N N N N: $C_T = \hat C_T$, $C_B = \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y N N N: $C_T = A_{TL} B_T + \hat C_T$, $C_B = \hat C_B$. Variant 1 (Fig. 5.13).
N Y N N: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y Y N N: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = \hat C_B$. Variant 2.
N N Y N: $C_T = \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y N Y N: $C_T = A_{TL} B_T + \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. Variant 3 (leads to an alternative algorithm).
N Y Y N: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y Y Y N: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. Variant 4.
N N N Y: $C_T = \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. Variant 5.
Y N N Y: $C_T = A_{TL} B_T + \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
N Y N Y: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. Variant 6.
Y Y N Y: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
N N Y Y: $C_T = \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. Variant 7.
Y N Y Y: $C_T = A_{TL} B_T + \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
N Y Y Y: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. Variant 8.
Y Y Y Y: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
Figure 5.12: Potential loop-invariants for C := AB + C, with SyLw(A), using the partitioning in Section 5.5.
Potential invariants are derived from the PME by systematically including/excluding (Y/N) a term.
$C_0 := A_{10}^T B_1 + C_0$
$C_1 := A_{10} B_0 + A_{11} B_1 + C_1$

Continue with
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \leftarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \leftarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}$$
endwhile
Figure 5.13: Algorithm for computing C := AB + C, with SyLw(A) (blocked Variant 1).
• The hierarchical anatomy of the implementation of this operation that exploits the hierarchical organiza-
tion of multilevel memories of current architectures.
• The very high performance that is attained by this particular operation.
• How to cast algorithms for linear algebra operations in terms of the matrix-matrix multiply.
• The modular high performance that results.
A recurrent theme of this and subsequent chapters will be that blocked algorithms for all major linear algebra
operations can be derived that cast most computations in terms of gepp, gemp, and gepm. The block size for
these is tied to the size of cache memory.
6 The LU and Cholesky Factorizations

A commonly employed strategy for solving (dense) linear systems starts with the factorization of the coefficient
matrix of the system into the product of two triangular matrices, followed by solves with the resulting
triangular systems. In this chapter we review two such factorizations, the LU and Cholesky factorizations.
The LU factorization (combined with pivoting) is the most commonly used method for solving general linear
systems. The Cholesky factorization plays an analogous role for systems with a symmetric positive definite
(SPD) coefficient matrix.
Throughout this chapter, and unless otherwise stated explicitly, we assume the coefficient matrix (and
therefore the triangular matrices resulting from the factorization) to be nonsingular with n rows and columns.
LU factorization in this section. In particular, the symbol α11 will denote the same element in the first step of both methods.
Gaussian elimination starts by subtracting a multiple of the first row of the matrix from the second row so as
to annihilate element α21 . To do so, the multiplier λ21 = α21 /α11 is first computed, after which the first row
times the multiplier λ21 is subtracted from the second row of A, resulting in
$$\begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1,n} \\ 0 & \alpha_{22} - \lambda_{21}\alpha_{12} & \cdots & \alpha_{2,n} - \lambda_{21}\alpha_{1,n} \\ \alpha_{31} & \alpha_{32} & \cdots & \alpha_{3,n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n,1} & \alpha_{n,2} & \cdots & \alpha_{n,n} \end{pmatrix}.$$
Next, the multiplier to eliminate the element α31 is computed as λ31 = α31 /α11 , and the first row times λ31 is
subtracted from the third row of A to obtain
$$\begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1,n} \\ 0 & \alpha_{22} - \lambda_{21}\alpha_{12} & \cdots & \alpha_{2,n} - \lambda_{21}\alpha_{1,n} \\ 0 & \alpha_{32} - \lambda_{31}\alpha_{12} & \cdots & \alpha_{3,n} - \lambda_{31}\alpha_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n,1} & \alpha_{n,2} & \cdots & \alpha_{n,n} \end{pmatrix}.$$
Repeating this for the remaining rows yields
$$\begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1,n} \\ 0 & \alpha_{22} - \lambda_{21}\alpha_{12} & \cdots & \alpha_{2,n} - \lambda_{21}\alpha_{1,n} \\ 0 & \alpha_{32} - \lambda_{31}\alpha_{12} & \cdots & \alpha_{3,n} - \lambda_{31}\alpha_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \alpha_{n,2} - \lambda_{n,1}\alpha_{12} & \cdots & \alpha_{n,n} - \lambda_{n,1}\alpha_{1,n} \end{pmatrix}. \qquad (6.1)$$
Typically, the multipliers $\lambda_{21}, \lambda_{31}, \ldots, \lambda_{n,1}$ are stored over the zeroes that are introduced. After this, the process
continues with the bottom right $(n-1) \times (n-1)$ quadrant of the matrix until eventually A becomes an upper
triangular matrix.
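The process just described can be captured in a few lines. The following is a minimal M-script sketch, not taken from the book, with the hypothetical name GaussianElimination; it stores the multipliers over the zeroes they introduce so that, on return, A holds the multipliers below the diagonal and the resulting upper triangular matrix on and above it.

function A = GaussianElimination( A )
% Gaussian elimination without pivoting; multipliers overwrite the
% entries they annihilate.
  n = size( A, 1 );
  for k = 1:n-1
    A( k+1:n, k ) = A( k+1:n, k ) / A( k, k );                          % multipliers
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) - A( k+1:n, k ) * A( k, k+1:n );
  end
end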
We remind the reader of the following theorem, found in any standard linear algebra text, which gives a number
of equivalent characterizations of a nonsingular matrix:
Theorem 6.2 Given a square matrix A, the following are equivalent:
• A is nonsingular.
• Ax = 0 if and only if x = 0.
Proof: We delay the proof of this theorem until after algorithms for computing the LU factorization have been
given.
Equating corresponding submatrices on the left and the right of this equation yields the following insights:
• According to (6.2), the first row of U equals the first row of A, just like the first row of A is left untouched
in Gaussian elimination.
which are the same operations that are performed during Gaussian elimination on the n − 1 × n − 1 bottom
right submatrix of A.
• After these computations have been performed, both the LU factorization and Gaussian elimination
proceed (recursively) with the n − 1 × n − 1 bottom right quadrant of A.
We conclude that Gaussian elimination and the described algorithm for computing the LU factorization perform
exactly the same computations.
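Assuming no zero pivots are encountered, this equivalence can be checked numerically with the GaussianElimination sketch given above (again a sketch for illustration, not the book's code):

n = 5;  A = rand( n ) + n * eye( n );    % make the diagonal dominant so no pivoting is needed
LU = GaussianElimination( A );
L = tril( LU, -1 ) + eye( n );
U = triu( LU );
disp( norm( L * U - A ) )                % should be of the order of machine precision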
Exercise 6.5 Gaussian elimination is usually applied to the augmented matrix $(A \;\; b)$ so that, upon completion,
A is overwritten by U, b is overwritten by an intermediate result y, and the solution of the linear system
$Ax = b$ is obtained from $Ux = y$. Use the system defined by
$$A = \begin{pmatrix} 3 & -1 & 2 \\ -3 & 3 & -1 \\ 6 & 0 & 4 \end{pmatrix}, \qquad b = \begin{pmatrix} 7 \\ 0 \\ 18 \end{pmatrix},$$
to illustrate this procedure and to verify that the intermediate result y satisfies $Ly = b$.
The previous exercise illustrates that in applying Gaussian elimination to the augmented system, both the
LU factorization of the matrix and the solution of the unit lower triangular system are performed simultaneously
(in the augmented matrix, b is overwritten with the solution of Ly = b). On the other hand, when solving a
linear system via the LU factorization, the matrix is first decomposed into the triangular matrices L and U ,
and then two linear systems are solved: y is computed from Ly = b, and then the solution x is obtained from
U x = y.
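A minimal M-script sketch, not from the book, of this second approach, assuming A and b hold the coefficient matrix and right-hand side and that the factors are stored in the packed form produced by the hypothetical GaussianElimination sketch above:

LU = GaussianElimination( A );           % A = L U, no pivoting
L  = tril( LU, -1 ) + eye( size( A ) );  % unit lower triangular factor
U  = triu( LU );
y  = L \ b;                              % forward substitution: L y = b
x  = U \ y;                              % back substitution:    U x = y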
6.2.2 Variants
Let us examine next how to derive different variants for computing the LU factorization. The precondition$^2$
and postcondition for this operation are given, respectively, by
$$P_{pre}: A = \hat A$$
$$P_{post}: (A = \{L\backslash U\}) \wedge (LU = \hat A),$$
where the notation A = {L\U } in the postcondition indicates that L overwrites the elements of A below the
diagonal while U overwrites those on and above the diagonal. (The unit elements on the diagonal entries of L
will not be stored since they are implicitly known.) The requirement that L and U overwrite specific parts of
A implicitly defines the dimensions and triangular structure of these factors.
In order to determine a collection of feasible loop-invariants, we start by choosing a partitioning of the
matrices involved in the factorization. The triangular form of L and U requires them to be partitioned into
quadrants with square diagonal blocks so that the off-diagonal block of zeroes can be cleanly identified. This
then requires A to be conformally partitioned into quadrants as well. Thus,
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}, \quad U \rightarrow \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix},$$
2 Strictly speaking, the precondition should also assert that A is square and has nonsingular leading principal submatrices (see
Theorem 6.4).
where
$$P_{cons}: m(A_{TL}) = n(A_{TL}) = m(L_{TL}) = n(L_{TL}) = m(U_{TL}) = n(U_{TL})$$
holds. Substituting these into the postcondition yields
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ L_{BL} & \{L\backslash U\}_{BR} \end{pmatrix} \;\wedge\; \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix} = \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix},$$
from which, multiplying out the second expression, we obtain the PME for LU factorization:
$$\begin{array}{l l} L_{TL} U_{TL} = \hat A_{TL} & L_{TL} U_{TR} = \hat A_{TR} \\ L_{BL} U_{TL} = \hat A_{BL} & L_{BR} U_{BR} = \hat A_{BR} - L_{BL} U_{TR}. \end{array}$$
These equations exhibit data dependences which dictate an order for the computations: $\hat A_{TL}$ must be factored
into $L_{TL} U_{TL}$ before $U_{TR} := L_{TL}^{-1} \hat A_{TR}$ and $L_{BL} := \hat A_{BL} U_{TL}^{-1}$ can be computed, and these two triangular
systems need to be solved before the update $\hat A_{BR} - L_{BL} U_{TR}$ can be carried out. Taking these
dependences into account, the PME yields the five feasible loop-invariants for the LU factorization in Figure 6.1.
Exercise 6.6 Derive unblocked and blocked algorithms corresponding to each of the five loop-invariants in
Figure 6.1.
Note that the resulting algorithms are exactly those given in Figure 1.3.
The loop-invariants in Figure 6.1 yield all the algorithms depicted on the cover of, and discussed in, G.W. Stewart's
book on matrix factorization [25]. All these algorithms perform the same computations, but in a different order.
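For concreteness, here is a minimal M-script sketch, not the book's code, of a blocked algorithm corresponding to Variant 5; the name LU_blk_var5 and the block size nb are hypothetical, and the factorization of the diagonal block reuses the GaussianElimination sketch given earlier.

function A = LU_blk_var5( A, nb )
% Blocked right-looking LU factorization (Variant 5), no pivoting.
  n = size( A, 1 );
  for k = 1:nb:n
    b  = min( nb, n-k+1 );
    i1 = k:k+b-1;   i2 = k+b:n;
    A( i1, i1 ) = GaussianElimination( A( i1, i1 ) );        % A11 := {L\U}11
    L11 = tril( A( i1, i1 ), -1 ) + eye( b );
    U11 = triu( A( i1, i1 ) );
    A( i1, i2 ) = L11 \ A( i1, i2 );                         % U12 := inv(L11) A12
    A( i2, i1 ) = A( i2, i1 ) / U11;                         % L21 := A21 inv(U11)
    A( i2, i2 ) = A( i2, i2 ) - A( i2, i1 ) * A( i1, i2 );   % A22 := A22 - L21 U12
  end
end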
where n = m(A) = n(A). The base case comes from the fact that for a 0 x 0 matrix no computation needs to
be performed. The recurrence results from the cost of the updates in the loop-body: $l_{21} := a_{21}/\upsilon_{11}$ costs
$n - k - 1$ flops and $A_{22} := A_{22} - a_{21} a_{12}^T$ costs $2(n-k-1)^2$ flops. Thus, the total cost for Variant 5 is given by$^3$
$$C_{lu5}(n) = \sum_{k=0}^{n-1} \left( (n-k-1) + 2(n-k-1)^2 \right) = \sum_{k=0}^{n-1} \left( k + 2k^2 \right)$$
3 When $A_{TL}$ equals the whole matrix, the loop-guard evaluates to false and no update is performed, so that $C_{lu}(n) = C_{lu}(n-1)$.
Variant 1
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 2
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 3
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \hat A_{TR} \\ L_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 4
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ L_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 5
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ L_{BL} & \hat A_{BR} - L_{BL} U_{TR} \end{pmatrix}$$
Figure 6.1: Five loop-invariants for the LU factorization.
Continuing the derivation,
$$\sum_{k=0}^{n-1} \left( k + 2k^2 \right) = 2\sum_{k=0}^{n-1} k^2 + \sum_{k=0}^{n-1} k = 2\left( \sum_{k=1}^{n} k^2 - n^2 \right) + \left( \sum_{k=1}^{n} k - n \right) = 2\left( \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6} - n^2 \right) + \frac{n^2}{2} - \frac{n}{2} = \frac{2}{3} n^3 - \frac{1}{2} n^2 - \frac{1}{6} n \approx \frac{2}{3} n^3 \ \text{flops}.$$
6.2.4 Performance
The performance of the LU factorization was already discussed in Section 1.5.
Exercise 6.11 Consider a set of Gauss transforms Lk , 0 ≤ k < n, defined as in (6.5). Show that
3. L0 L1 · · · Ln−1 ek = Lk ek , 0 ≤ k < n.
Hint: Use Result 2.
Definition 6.12 We will refer to
$$L_{ac,k} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & I_{n-k} \end{pmatrix}, \qquad (6.6)$$
where $L_{TL} \in \mathbb{R}^{k \times k}$ is unit lower triangular, as an accumulated Gauss transform.
Remark 6.13 In subsequent discussion, often we will not explicitly define the dimensions of Gauss transforms
and accumulated Gauss transforms, since they can be deduced from the dimension of the matrix or vector to
which the transformation is applied.
The name of this transform signifies that the product Lac,k B is equivalent to computing L0 L1 · · · Lk−1 B,
where the jth column of the Gauss transform Lj , 0 ≤ j < k, equals that of Lac,k .
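This property is easy to check numerically. The following minimal M-script sketch, not from the book, builds k Gauss transforms with arbitrary multipliers and compares their product with the accumulated Gauss transform assembled column by column.

n = 6;  k = 3;
Lacc = eye( n );                       % accumulated Gauss transform L_{ac,k}
Prod = eye( n );                       % product L_0 L_1 ... L_{k-1}
for j = 1:k
  Lj = eye( n );
  Lj( j+1:n, j ) = rand( n-j, 1 );     % multipliers in column j
  Lacc( j+1:n, j ) = Lj( j+1:n, j );   % L_{ac,k} shares this column
  Prod = Prod * Lj;
end
disp( norm( Prod - Lacc ) )            % should be zero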
Exercise 6.14 Show that the accumulated Gauss transform $L_{ac,k} = L_0 L_1 \cdots L_{k-1}$, defined as in (6.6), satisfies
$$L_{ac,k}^{-1} = \begin{pmatrix} L_{TL}^{-1} & 0 \\ -L_{BL} L_{TL}^{-1} & I_{n-k} \end{pmatrix} \quad\text{and}\quad L_{ac,k}^{-1} A = L_{k-1}^{-1} \cdots L_1^{-1} L_0^{-1} A.$$
We are now ready to describe the LU factorization in terms of the application of Gauss transforms. Partition
A and the first Gauss transform L0 :
$$A \rightarrow \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}, \qquad L_0 \rightarrow \begin{pmatrix} 1 & 0 \\ l_{21} & I_{n-1} \end{pmatrix},$$
where $\alpha_{11}$ is a scalar. Next, observe the result of applying the inverse of $L_0$ to A:
$$L_0^{-1} A = \begin{pmatrix} 1 & 0 \\ -l_{21} & I_{n-1} \end{pmatrix} \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} - \alpha_{11} l_{21} & A_{22} - l_{21} a_{12}^T \end{pmatrix}.$$
By choosing l21 := a21 /α11 , we obtain a21 − α11 l21 = 0 and A is updated exactly as in the first iteration of the
unblocked algorithm for overwriting the matrix with its LU factorization via Variant 5.
Next, assume that after k steps A has been overwritten by $L_{k-1}^{-1} \cdots L_1^{-1} L_0^{-1} \hat A$ so that, by careful selection
of the Gauss transforms,
$$A := L_{k-1}^{-1} \cdots L_1^{-1} L_0^{-1} \hat A = \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & A_{BR} \end{pmatrix},$$
where $U_{TL} \in \mathbb{R}^{k \times k}$ is upper triangular. Repartition
$$\begin{pmatrix} U_{TL} & U_{TR} \\ 0 & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} U_{00} & u_{01} & U_{02} \\ 0 & \alpha_{11} & a_{12}^T \\ 0 & a_{21} & A_{22} \end{pmatrix} \quad\text{and}\quad L_k \rightarrow \begin{pmatrix} I_k & 0 & 0 \\ 0 & 1 & 0 \\ 0 & l_{21} & I_{n-k-1} \end{pmatrix},$$
A simple calculation shows that $x = (1, 1, 1)^T$ is the exact solution of the system ($Ax - b = 0$).
Now, assume we use a computer where all the operations are done in four-digit decimal floating-point arithmetic.
Computing the LU factorization of A on this machine then yields
$$L = \begin{pmatrix} 1.000 & & \\ 598.0 & 1.000 & \\ 737.5 & 1.233 & 1.000 \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} 0.002 & 1.231 & 2.471 \\ & -732.9 & -1475 \\ & & -1820 \end{pmatrix},$$
which shows two large multipliers in L and the consequent element growth in U.
If we next employ these factors to solve the system, applying forward substitution to $Ly = b$, we obtain
$$y = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 3.704 \\ -2208 \\ -2000 \end{pmatrix},$$
which presents multipliers in L of magnitude less than one, and no dramatic element growth in U.
Using these factors, from $Ly = \bar b$ and $Ux = y$, we obtain, respectively,
$$y = \begin{pmatrix} 7.888 \\ 3.693 \\ 1.407 \end{pmatrix} \quad\text{and}\quad x = \begin{pmatrix} 1.000 \\ 1.000 \\ 1.000 \end{pmatrix}.$$
is the permutation matrix that, when applied to the vector $x = (\chi_0, \chi_1, \ldots, \chi_{n-1})^T$, yields $Px = (\chi_{\pi_0}, \chi_{\pi_1}, \ldots, \chi_{\pi_{n-1}})^T$.
The following exercise recalls a few essential properties of permutation matrices.
Exercise 6.18 Consider $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, and let $P \in \mathbb{R}^{n \times n}$ be a permutation matrix. Show that
2. $PA$ rearranges the rows of A exactly in the same order as the elements of x are rearranged by $Px$. Hint:
Partition P as in (6.7) and recall that row $\pi$ of A is given by $e_\pi^T A$.
3. $AP^T$ rearranges the columns of A exactly in the same order as the elements of x are rearranged by $Px$.
Hint: Consider $(PA^T)^T$.
We will frequently employ permutation matrices that swap the first element of a vector with element π of
that vector:
Definition 6.19 The permutation that, when applied to a vector, swaps the first element with element π is
defined as
$$P(\pi) = \begin{cases} I_n & \text{if } \pi = 0, \\[4pt] \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & I_{\pi-1} & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & I_{n-\pi-1} \end{pmatrix} & \text{otherwise.} \end{cases}$$
Definition 6.20 Given $p = (\pi_0, \pi_1, \ldots, \pi_{k-1})^T$, a permutation of $\{0, 1, \ldots, k-1\}$, $P(p)$ denotes the permutation
$$P(p) = \begin{pmatrix} I_{k-1} & 0 \\ 0 & P(\pi_{k-1}) \end{pmatrix} \begin{pmatrix} I_{k-2} & 0 \\ 0 & P(\pi_{k-2}) \end{pmatrix} \cdots \begin{pmatrix} 1 & 0 \\ 0 & P(\pi_1) \end{pmatrix} P(\pi_0).$$
Remark 6.21 In the previous definition, and from here on, we will typically not explicitly denote the di-
mension of a permutation matrix, since it can be deduced from the dimension of the matrix or the vector the
permutation is applied to.
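A minimal M-script sketch, not from the book, of applying P(p) to the rows of a matrix; here, as in the definitions above, each entry of p is a zero-based offset indicating which row of the remaining submatrix is swapped with the current row.

function B = ApplyPivots( p, B )
% Apply P(p) to the rows of B; p(i) = 0 leaves row i in place.
  for i = 1:length( p )
    j = i + p( i );
    B( [ i, j ], : ) = B( [ j, i ], : );
  end
end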
Algorithm: [A, p] := LUP_unb_var5_B(A) (left, basic) and [A, p] := LUP_unb_var5(A) (right, high-performance).
Both algorithms partition
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \qquad p \rightarrow \begin{pmatrix} p_T \\ p_B \end{pmatrix},$$
where $A_{TL}$ is $0 \times 0$ and $p_T$ has 0 elements, and, while $n(A_{TL}) < n(A)$, repartition
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \qquad \begin{pmatrix} p_T \\ p_B \end{pmatrix} \rightarrow \begin{pmatrix} p_0 \\ \pi_1 \\ p_2 \end{pmatrix},$$
where $\alpha_{11}$ and $\pi_1$ are scalars. The loop-bodies are:

Basic algorithm (left):
$\pi_1 := \mathrm{PivIndex}\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}$
$\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} := P(\pi_1) \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}$
$a_{21} := a_{21}/\alpha_{11}$
$A_{22} := A_{22} - a_{21} a_{12}^T$

High-performance algorithm (right):
$\pi_1 := \mathrm{PivIndex}\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}$
$\begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} := P(\pi_1) \begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$
$a_{21} := a_{21}/\alpha_{11}$
$A_{22} := A_{22} - a_{21} a_{12}^T$
Figure 6.2: Unblocked algorithms for the LU factorization with partial pivoting (Variant 5). Left: basic
algorithm. Right: High-performance algorithm.
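A minimal M-script sketch, not the book's code, of the high-performance algorithm on the right of Figure 6.2; the name LUP_unb_var5 is borrowed from the figure, the pivot entries are stored as zero-based offsets, and entire rows are swapped so that the final factors satisfy P(p) A = L U.

function [ A, p ] = LUP_unb_var5( A )
  n = size( A, 1 );
  p = zeros( n, 1 );
  for k = 1:n
    [ amax, idx ] = max( abs( A( k:n, k ) ) );      % PivIndex
    p( k ) = idx - 1;                               % zero-based offset
    j = k + p( k );
    A( [ k, j ], : ) = A( [ j, k ], : );            % swap entire rows
    A( k+1:n, k ) = A( k+1:n, k ) / A( k, k );      % a21 := a21/alpha11
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) - A( k+1:n, k ) * A( k, k+1:n );
  end
end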
In words, there exists a Gauss transform L̄k , of the same dimension and structure as Lk , such that P Lk = L̄k P .
Exercise 6.23 Prove Lemma 6.22.
The above lemma supports the following observation: according to (6.8), the basic LU factorization with
partial pivoting yields
$$L_{n-1}^{-1} P_{n-1} \cdots L_1^{-1} P_1 L_0^{-1} P_0 A = U,$$
or, equivalently,
$$A = P_0 L_0 P_1 L_1 \cdots P_{n-1} L_{n-1} U.$$
From the lemma, there exist Gauss transforms $L_k^{(j)}$, $0 \le k \le j < n$, such that
$$\begin{aligned}
A &= P_0 L_0 P_1 L_1 P_2 L_2 \cdots P_{n-1} L_{n-1} U \\
  &= P_0 P_1 L_0^{(1)} L_1 P_2 L_2 \cdots P_{n-1} L_{n-1} U \\
  &= P_0 P_1 L_0^{(1)} P_2 L_1^{(1)} L_2 \cdots P_{n-1} L_{n-1} U \\
  &= P_0 P_1 P_2 L_0^{(2)} L_1^{(2)} \cdots P_{n-1} L_{n-1} U \\
  &= \cdots \\
  &= P_0 P_1 P_2 \cdots P_{n-1} L_0^{(n-1)} L_1^{(n-1)} \cdots L_{n-1}^{(n-1)} U.
\end{aligned}$$
This observation establishes the following result: if the Gauss transforms and permutations computed by the basic algorithm satisfy
$$L_{n-1}^{-1} P_{n-1} \cdots L_1^{-1} P_1 L_0^{-1} P_0 A = U,$$
then there exists a lower triangular matrix L such that $P(p)A = LU$, with $p = (\pi_0, \pi_1, \ldots, \pi_{n-1})^T$.
Proof: L is given by $L = L_0^{(n-1)} L_1^{(n-1)} \cdots L_{n-1}^{(n-1)}$.
If L, U, and p satisfy $P(p)A = LU$, then $Ax = b$ can be solved by applying all permutations to b, followed
by two (clean) triangular solves:
$$Ax = b \;\Longrightarrow\; \underbrace{PA}_{LU}\, x = \underbrace{Pb}_{\bar b} \;\Longrightarrow\; LUx = \bar b.$$
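Combining the hypothetical sketches given earlier (none of which are the book's code), the complete solution procedure for given A and b then reads:

[ LU, p ] = LUP_unb_var5( A );
bbar = ApplyPivots( p, b );                    % bbar = P(p) b
L = tril( LU, -1 ) + eye( size( A ) );         % unit lower triangular factor
U = triu( LU );
y = L \ bbar;                                  % forward substitution
x = U \ y;                                     % back substitution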
Remark 6.25 It will become obvious that the family of algorithms for the LU factorization with pivoting
can be derived without the introduction of Gauss transforms or knowledge about how Gauss transforms and
permutation matrices can be reordered. They result from systematic application of the derivation techniques
to the operation that computes p, L, and U that satisfy the postcondition.
As usual, we start by deriving a PME for this operation: Partition A, L, and U into quadrants,
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}, \quad U \rightarrow \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix},$$
Theorem 6.26 The expressions in (6.13) and (6.14) are equivalent to the simultaneous equations
$$\begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}, \qquad (6.15)$$
$$\bar L_{BL} = P(p_B)^T L_{BL}, \qquad (6.16)$$
$$\begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1, \qquad (6.17)$$
$$L_{TL} U_{TR} = \bar A_{TR}, \qquad (6.18)$$
$$L_{BR} U_{BR} = P(p_B)\left( \bar A_{BR} - \bar L_{BL} U_{TR} \right) \;\wedge\; |L_{BR}| \le 1, \qquad (6.19)$$
which together represent the PME for the LU factorization with partial pivoting.
Exercise 6.27 Prove Theorem 6.26.
Equations (6.15)–(6.19) have the following interpretation:
• Equations (6.15) and (6.16) are included for notational convenience. Equation (6.16) states that L̄BL
equals the final LBL except that its rows have not yet been permuted according to future computation.
• Equation (6.17) denotes an LU factorization with partial pivoting of the submatrices to the left of the
thick line in (6.13).
• Equations (6.18) and (6.19) indicate that $U_{TR}$ and $\{L\backslash U\}_{BR}$ result from permuting the submatrices to the
right of the thick line in (6.14), after which $U_{TR}$ is computed as $L_{TL}^{-1} \bar A_{TR}$,
and $\{L\backslash U\}_{BR}$ results from updating $\bar A_{BR}$ and performing an LU factorization with partial pivoting of that
quadrant. Equation (6.16) resurfaces here, since the permutations $p_B$ must also be applied to $\bar L_{BL}$ to
yield $L_{BL}$.
Variant 3a
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \hat A_{TR} \\ \bar L_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} \\ \hat A_{BL} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Variant 3b
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \bar A_{TR} \\ \bar L_{BL} & \bar A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Variant 4
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ \bar L_{BL} & \bar A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; L_{TL} U_{TR} = \bar A_{TR} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Variant 5
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ \bar L_{BL} & \bar A_{BR} - \bar L_{BL} U_{TR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; L_{TL} U_{TR} = \bar A_{TR} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Figure 6.3: Four loop-invariants for the LU factorization with partial pivoting.
Equations (6.15)–(6.19) dictate an inherent order in which the computations must proceed:
• If one of pT , LT L , UT L , or L̄BL has been computed, so have the others. In other words, any loop invariant
must include the computation of all four of these results.
• In the loop-invariant, $A_{BR}$ can contain $\bar A_{BR} - \bar L_{BL} U_{TR}$ only if $A_{TR} = U_{TR}$, since that update requires $U_{TR}$.
These constraints yield the four feasible loop-invariants given in Figure 6.3. In that figure, the variant number
reflects the variant for the LU factorization without pivoting (Figure 6.1) that is most closely related. Variants 3a
and 3b differ only in that for Variant 3b the pivots computed so far have also been applied to the columns to
the right of the thick line in (6.14). Variants 1 and 2 from Figure 6.1 have no correspondence here, as pivoting
affects the entire rows of $\begin{pmatrix} A_{TL} \\ A_{BL} \end{pmatrix}$. In other words, in these two variants $L_{TL}$ and $U_{TL}$ have been computed
but $\bar L_{BL}$ has not, which, one can argue, is not possible for a feasible loop-invariant.
Let us fully elaborate the case labeled as Variant 5. Partitioning to expose the next rows and columns of A,
L, U and the next element of p, as usual, and substituting into the loop invariant yields
$$\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{00} & u_{01} & U_{02} \\ \bar l_{10}^T & \bar\alpha_{11} - \bar l_{10}^T u_{01} & \bar a_{12}^T - \bar l_{10}^T U_{02} \\ \bar L_{20} & \bar a_{21} - \bar L_{20} u_{01} & \bar A_{22} - \bar L_{20} U_{02} \end{pmatrix} \qquad (6.20)$$
$$\wedge\; \begin{pmatrix} \bar A_{00} & \bar a_{01} & \bar A_{02} \\ \bar a_{10}^T & \bar\alpha_{11} & \bar a_{12}^T \\ \bar A_{20} & \bar a_{21} & \bar A_{22} \end{pmatrix} = P(p_0) \begin{pmatrix} \hat A_{00} & \hat a_{01} & \hat A_{02} \\ \hat a_{10}^T & \hat\alpha_{11} & \hat a_{12}^T \\ \hat A_{20} & \hat a_{21} & \hat A_{22} \end{pmatrix} \;\wedge\; \begin{pmatrix} L_{00} \\ \bar l_{10}^T \\ \bar L_{20} \end{pmatrix} U_{00} = \begin{pmatrix} \bar A_{00} \\ \bar a_{10}^T \\ \bar A_{20} \end{pmatrix} \qquad (6.21)$$
$$\wedge\; L_{00} \begin{pmatrix} u_{01} & U_{02} \end{pmatrix} = \begin{pmatrix} \bar a_{01} & \bar A_{02} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{00} \\ \bar l_{10}^T \\ \bar L_{20} \end{pmatrix} \right| \le 1. \qquad (6.22)$$
Similarly, after moving the thick lines, substitution into the loop-invariant yields
$$\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{00} & u_{01} & U_{02} \\ l_{10}^T & \upsilon_{11} & u_{12}^T \\ \bar{\bar L}_{20} & \bar{\bar l}_{21} & \bar{\bar A}_{22} - \bar{\bar L}_{20} U_{02} - \bar{\bar l}_{21} u_{12}^T \end{pmatrix} \qquad (6.23)$$
$$\wedge\; \begin{pmatrix} \bar{\bar A}_{00} & \bar{\bar a}_{01} & \bar{\bar A}_{02} \\ \bar{\bar a}_{10}^T & \bar{\bar\alpha}_{11} & \bar{\bar a}_{12}^T \\ \bar{\bar A}_{20} & \bar{\bar a}_{21} & \bar{\bar A}_{22} \end{pmatrix} = P\!\left( \begin{pmatrix} p_0 \\ \pi_1 \end{pmatrix} \right) \begin{pmatrix} \hat A_{00} & \hat a_{01} & \hat A_{02} \\ \hat a_{10}^T & \hat\alpha_{11} & \hat a_{12}^T \\ \hat A_{20} & \hat a_{21} & \hat A_{22} \end{pmatrix} \qquad (6.24)$$
$$\wedge\; \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & 1 \\ \bar{\bar L}_{20} & \bar{\bar l}_{21} \end{pmatrix} \begin{pmatrix} U_{00} & u_{01} \\ 0 & \upsilon_{11} \end{pmatrix} = \begin{pmatrix} \bar{\bar A}_{00} & \bar{\bar a}_{01} \\ \bar{\bar a}_{10}^T & \bar{\bar\alpha}_{11} \\ \bar{\bar A}_{20} & \bar{\bar a}_{21} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & 1 \end{pmatrix} \begin{pmatrix} U_{02} \\ u_{12}^T \end{pmatrix} = \begin{pmatrix} \bar{\bar A}_{02} \\ \bar{\bar a}_{12}^T \end{pmatrix} \qquad (6.25)$$
$$\wedge\; \left| \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & 1 \\ \bar{\bar L}_{20} & \bar{\bar l}_{21} \end{pmatrix} \right| \le 1. \qquad (6.26)$$
A careful manipulation of the conditions after repartitioning, in (6.20)–(6.22), and the conditions after moving
the thick line, in (6.23)–(6.26), shows that the current contents of A must be updated by the steps
1. $\pi_1 := \mathrm{PivIndex}\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}$,
2. $\begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} := P(\pi_1) \begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$,
3. $a_{21} := a_{21}/\alpha_{11}$,
4. $A_{22} := A_{22} - a_{21} a_{12}^T$,
which are exactly the loop-body updates of the high-performance algorithm in Figure 6.2 (right).
The algorithms corresponding to Variants 3a, 4, and 5 are given in Figure 6.4. There, trilu(Ai,i ) stands
for the unit lower triangular matrix stored in Ai,i , i = 0, 1.
Figure 6.4: Unblocked and blocked algorithms for the LU factorization with partial pivoting.
6.5 The Cholesky Factorization

Let us examine how to derive different variants for computing this factorization. The precondition$^4$ and
postcondition of the operation are expressed, respectively, as
$$P_{pre}: A = \hat A$$
$$P_{post}: (\mathrm{tril}(A) = L) \wedge (L L^T = \hat A),$$
where, as usual, $\mathrm{tril}(A)$ denotes the lower triangular part of A. The postcondition implicitly specifies the
dimensions and lower triangular structure of L.
The triangular structure of L requires it to be partitioned into quadrants with square diagonal blocks, and
this requires A to be conformally partitioned into quadrants as well:
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad\text{and}\quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix},$$
where
$$P_{cons}: m(A_{TL}) = n(A_{TL}) = m(L_{TL}) = n(L_{TL})$$
holds. Substituting these into the postcondition yields
$$\mathrm{tril}\!\left( \begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} \right) = \begin{pmatrix} \mathrm{tril}(A_{TL}) & 0 \\ A_{BL} & \mathrm{tril}(A_{BR}) \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \begin{pmatrix} L_{TL}^T & L_{BL}^T \\ 0 & L_{BR}^T \end{pmatrix} = \begin{pmatrix} \hat A_{TL} & ? \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}.$$
The “?” symbol is used in this expression and from now on to indicate a part of a symmetric matrix that is not
referenced. The second part of the postcondition can then be rewritten as the PME
$$\begin{array}{l l} L_{TL} L_{TL}^T = \hat A_{TL} & ? \\ L_{BL} L_{TL}^T = \hat A_{BL} & L_{BR} L_{BR}^T = \hat A_{BR} - L_{BL} L_{BL}^T, \end{array}$$
showing that $\hat A_{TL}$ must be factored before $L_{BL} := \hat A_{BL} L_{TL}^{-T}$ can be solved, and $L_{BL}$ itself is needed in order
to compute the update $\hat A_{BR} - L_{BL} L_{BL}^T$. These dependences result in the three feasible loop-invariants for the
Cholesky factorization in Figure 6.5. We present the unblocked and blocked algorithms that result from these
three invariants in Figure 6.6.
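For reference, a minimal M-script sketch, not the book's code, of the unblocked algorithm corresponding to Variant 1 (cf. Figure 6.6), which references and updates only the lower triangle of A:

function A = Chol_unb_var1( A )
% Right-looking Cholesky factorization; L overwrites the lower triangle.
  n = size( A, 1 );
  for k = 1:n
    A( k, k ) = sqrt( A( k, k ) );
    A( k+1:n, k ) = A( k+1:n, k ) / A( k, k );
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) ...
        - tril( A( k+1:n, k ) * A( k+1:n, k )' );   % update the lower triangle only
  end
end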
Exercise 6.29 Using the worksheet, show that the unblocked and blocked algorithms corresponding to the three
loop-invariants in Figure 6.5 are those given in Figure 6.6.
Exercise 6.30 Identify the type of operations that are performed in the blocked algorithms for the Cholesky
factorization in Figure 6.6 (right) as one of these types: trsm, gemm, chol, or syrk.
4 A complete precondition would also assert that A is positive definite in order to guarantee existence of the factorization.
Variant 1
$$\begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & ? \\ L_{BL} & \hat A_{BR} - L_{BL} L_{BL}^T \end{pmatrix}$$
Variant 2
$$\begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & ? \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 3
$$\begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & ? \\ L_{BL} & \hat A_{BR} \end{pmatrix}$$
Figure 6.5: Three loop-invariants for the Cholesky factorization.
$$C_{chol}(n) = \frac{n^3}{3} \ \text{flops}.$$
Exercise 6.32 Show that the cost of the blocked algorithms for the Cholesky factorization is the same as that
of the unblocked algorithms.
Considering that n is an exact multiple of $n_b$ with $n_b \ll n$, what is the amount of flops that are performed in
terms of gemm?
6.5.2 Performance
The performance of the Cholesky factorization is similar to that of the LU factorization, which was studied in
Section 1.5.
6.6 Summary
In this chapter it was demonstrated that
• The FLAME techniques for deriving algorithms extend to more complex linear algebra operations.
• Algorithms for factorization operations can be cast in terms of matrix-matrix multiply, and its special
cases, so that high performance can be attained.
• Complex operations, like the LU factorization with partial pivoting, fit the mold.
This chapter completes the discussion of the basic techniques that underlie the FLAME methodology.
Unblocked algorithms (loop-body updates):
Variant 1: $\alpha_{11} := \sqrt{\alpha_{11}}$; $a_{21} := a_{21}/\alpha_{11}$; $A_{22} := A_{22} - a_{21} a_{21}^T$.
Variant 2: $a_{10}^T := a_{10}^T\, \mathrm{tril}(A_{00})^{-T}$; $\alpha_{11} := \alpha_{11} - a_{10}^T a_{10}$; $\alpha_{11} := \sqrt{\alpha_{11}}$.
Variant 3: $\alpha_{11} := \alpha_{11} - a_{10}^T a_{10}$; $\alpha_{11} := \sqrt{\alpha_{11}}$; $a_{21} := a_{21} - A_{20} a_{10}$; $a_{21} := a_{21}/\alpha_{11}$.

Blocked algorithms (loop-body updates):
Variant 1: $A_{11} := \mathrm{Chol\_unb}(A_{11})$; $A_{21} := A_{21}\, \mathrm{tril}(A_{11})^{-T}$; $A_{22} := A_{22} - A_{21} A_{21}^T$.
Variant 2: $A_{10} := A_{10}\, \mathrm{tril}(A_{00})^{-T}$; $A_{11} := A_{11} - A_{10} A_{10}^T$; $A_{11} := \mathrm{Chol\_unb}(A_{11})$.
Variant 3: $A_{11} := A_{11} - A_{10} A_{10}^T$; $A_{11} := \mathrm{Chol\_unb}(A_{11})$; $A_{21} := A_{21} - A_{20} A_{10}^T$; $A_{21} := A_{21}\, \mathrm{tril}(A_{11})^{-T}$.

Figure 6.6: Unblocked (left) and blocked (right) algorithms for the Cholesky factorization.
Appendix A
The Use of Letters

We attempt to be very consistent with our notation in this book as well as in FLAME related papers, the
FLAME website, and the linear algebra wiki.
As mentioned in Remark 3.1, lowercase Greek letters and lowercase Roman letters will be used to denote scalars and
vectors, respectively. Uppercase Roman letters will be used for matrices. Exceptions to this rule are variables
that denote the (integer) dimensions of the vectors and matrices, which are denoted by lowercase Roman letters
to follow the traditional convention.
The letters used for a matrix, vectors that appear as submatrices of that matrix (e.g., its columns), and
elements of that matrix are chosen in a consistent fashion. Similarly, letters used for a vector and elements of
that vector are chosen to correspond. This consistent choice is indicated in Figure A.1. In that table we do not
claim that the Greek letters used are the Greek letters that correspond to the indicated Roman letters; we are
merely indicating what letters we chose.
Figure A.1: Correspondence between letters used for matrices (uppercase Roman), vectors (lowercase Roman)
and the symbols used to denote their scalar entries (lowercase Greek letters).
Appendix B
Summary of FLAME/C Routines
In this appendix, we list a number of routines supported as part of the current implementation of the FLAME
library.
Additional Information
For information on the library, libFLAME, that uses the APIs and techniques discussed in this book, and on the
functionality supported by that library, visit
A Quick Reference guide can be downloaded from
http://www.cs.utexas.edu/users/flame/Publications/
B.1 Parameters
A number of parameters can be passed in that indicate how FLAME objects are to be used. These are
summarized in Fig. B.1.
FLA_Init( )
Initialize FLAME.
FLA_Finalize( )
Finalize FLAME.
Partitioning, etc.
General operations
Note: the name of the FLA Axpy routine comes from the BLAS routine daxpy, which stands for double precision
alpha times vector x plus vector y. We have generalized this routine to also work with matrices.
Scalar operations
Vector-vector operations
Note: some of the below operations also appear above under “General operations”. Traditional users of the
BLAS would expect them to appear under the heading “Vector-vector operations,” which is why we repeat
them.
Matrix-vector operations
As for the vector-vector operations, we adopt a naming convention that is very familiar to those who have used
traditional level-2 BLAS routines. The name FLA XXYY encodes the following information:
XX Meaning
Ge General rectangular matrix.
Tr One of the operands is a triangular matrix.
Sy One of the operands is a symmetric matrix.
YY Meaning
mv Matrix-vector multiplication.
sv Solution of a linear system.
r Rank-1 update.
r2 Rank-2 update.
FLA_Gemv( FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
y := αoptrans (A)x + βy.
FLA_Ger( FLA_Obj alpha, FLA_Obj x, FLA_Obj y, FLA_Obj A )
A := αxy T + A.
FLA_Symv( FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
y := αAx + βy, where A is symmetric and stored in the upper or lower triangular part of A, as indicated by uplo.
FLA_Syr( FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj x, FLA_Obj A )
A := αxxT + A, where A is symmetric and stored in the upper or lower triangular part of A, as indicated by uplo.
FLA_Syr2( FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj x, FLA_Obj y, FLA_Obj A )
A := αxy T + αyxT + A, where A is symmetric and stored in the upper or lower triangular part of A, as indicated by uplo.
FLA_Syr2k( FLA_Uplo uplo, FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := α(optrans (A)optrans (B)T + optrans (B)optrans (A)T ) + βC, where C is symmetric and stored in the upper or lower triangular
part of C, as indicated by uplo.
FLA_Trmv( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj A, FLA_Obj x )
x := optrans (A)x, where A is upper or lower triangular, as indicated by uplo.
FLA_Trmv_x( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
Update y := αoptrans (A)x + βy, where A is upper or lower triangular, as indicated by uplo.
FLA_Trsv( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj A, FLA_Obj x )
x := optrans (A)−1 x, where A is upper or lower triangular, as indicated by uplo.
FLA_Trsv_x( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
y := αoptrans (A)−1 x + βy, where A is upper or lower triangular, as indicated by uplo.
Matrix-matrix operations
As for the vector-vector and matrix-vector operations, we adopt a naming convention that is very familiar to
those who have used traditional level-3 BLAS routines. FLA XXYY in the name encodes
XX Meaning
Ge General rectangular matrix.
Tr One of the operands is a triangular matrix.
Sy One of the operands is a symmetric matrix.
YY Meaning
mm Matrix-matrix multiplication.
sm Solution of a linear system with multiple right-hand sides.
rk Rank-k update.
r2k Rank-2k update.
FLA_Gemm( FLA_Trans transA, FLA_Trans transB, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αoptransA (A)optransB (B) + βC.
FLA_Symm( FLA_Side side, FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αAB + βC or C := αBA + βC, where A is symmetric, side indicates the side from which A multiplies B, uplo indicates
whether A is stored in the upper or lower triangular part of A.
FLA_Syr2k( FLA_Uplo uplo, FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := α(optrans (A)optrans (B)T + optrans (B)optrans (A)T ) + βC, where C is symmetric and stored in the upper or lower triangular
part of C, as indicated by uplo.
FLA_Syrk( FLA_Uplo uplo, FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj beta, FLA_Obj C )
C := αoptrans (A)optrans (A)T + βC, where C is symmetric and stored in the upper or lower triangular part of C, as indicated
by uplo.
FLA_Trmm( FLA_Side side, FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj alpha, FLA_Obj A, FLA_Obj B )
B := αoptrans (A)B (side == FLA LEFT) or B := αBoptrans (A) (side == FLA RIGHT), where A is upper or lower triangular, as
indicated by uplo.
FLA_Trmm_x( FLA_Side side, FLA_Uplo uplo, FLA_Trans transA, FLA_Trans transB, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αoptransA (A)optransB (B) + βC (side == FLA LEFT) or C := αoptransB (B)optransA (A) + βC (side == FLA RIGHT) where A
is upper or lower triangular, as indicated by uplo.
FLA_Trsm( FLA_Side side, FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj alpha, FLA_Obj A, FLA_Obj B )
B := αoptrans (A)−1 B (SIDE == FLA LEFT) or B := αBoptrans (A)−1 (SIDE == FLA RIGHT) where A is upper or lower triangular,
as indicated by uplo.
FLA_Trsm_x( FLA_Side side, FLA_Uplo uplo, FLA_Trans transA, FLA_Trans transB, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αoptransA (A)−1 optransB (B) + βC (SIDE == FLA LEFT) or C := αoptransB (B)optransA (A)−1 + βC (SIDE == FLA RIGHT)
where A is upper or lower triangular, as indicated by uplo.
Bibliography
[1] Satish Balay, William Gropp, Lois Curfman McInnes, and Barry Smith. PETSc 2.0 Users Manual. Technical
Report ANL-95/11, Argonne National Laboratory, Oct. 1996. 4.3
[2] Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.
PhD thesis, 2006. 1.6
[3] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortı́, and Robert A. van de
Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 31(1):1–26, 2005.
1.2
[4] Paolo Bientinesi, Enrique S. Quintana-Ortı́, and Robert A. van de Geijn. Representing linear algebra
algorithms in code: the FLAME application program interfaces. ACM Trans. Math. Soft., 31(1):27–59,
2005. 1.2
[5] P. D. Crout. A short method for evaluating determinants and solving systems of linear equations with real
or complex coefficients. AIEE Trans., 60:1235–1240, 1941. 1.3
[6] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[7] E. W. Dijkstra. A constructive approach to the problem of program correctness. BIT, 8:174–186, 1968. 1.3
[8] E. W. Dijkstra. A discipline of programming. Prentice-Hall, 1976. 1.3
[9] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra
subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990. 4.3, B.5
[10] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of
FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1–17, March 1988. 4.3, B.5
[11] R. W. Floyd. Assigning meanings to programs. In J. T. Schwartz, editor, Symposium on Applied Mathe-
matics, volume 19, pages 19–32. American Mathematical Society, 1967. 1.3
[12] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press,
Baltimore, 3rd edition, 1996.
[13] Kazushige Goto and Robert A. van de Geijn. On reducing TLB misses in matrix multiplication. ACM
Trans. Math. Soft., 2006. To appear. 5.19
[14] David Gries. The Science of Programming. Springer-Verlag, 1981.
[15] David Gries and Fred B. Schneider. A Logical Approach to Discrete Math. Texts and Monographs in
Computer Science. Springer-Verlag, 1992. 2.3, 2.5
[16] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear
algebra methods environment. ACM Trans. Math. Soft., 27(4):422–455, December 2001. 1.2
[17] John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. A family of high-performance matrix
multiplication algorithms. In Vassil N. Alexandrov, Jack J. Dongarra, Benjoe A. Juliano, René S. Renner,
and C.J. Kenneth Tan, editors, Computational Science - ICCS 2001, Part I, Lecture Notes in Computer
Science 2073, pages 51–60. Springer-Verlag, 2001. 5.26
[18] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, second edition, 2002.
[19] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, pages
576–580, October 1969. 1.3
[20] Leslie Lamport. LATEX: A Document Preparation System. Addison-Wesley, Reading, MA, 2nd edition,
1994.
[21] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran
usage. ACM Trans. Math. Soft., 5(3):308–323, Sept. 1979. 4.3, B.5
[22] C. Moler, J. Little, and S. Bangert. Pro-Matlab, User’s Guide. The Mathworks, Inc., 1987. 4.2
[23] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The
Complete Reference. The MIT Press, 1996. 4.3
[24] G. W. Stewart. Introduction to Matrix Computations. Academic Press, Orlando, Florida, 1973.
[25] G. W. Stewart. Matrix Algorithms Volume 1: Basic Decompositions. SIAM, 1998. 6.2.2
[26] Gilbert Strang. Linear Algebra and its Application, Third Edition. Academic Press, 1988.
[27] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.
[28] Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997. 4.3
[29] David S. Watkins. Fundamentals of Matrix Computations. John Wiley & Sons, Inc., New York, 2nd edition,
2002. 6.3
Index
ej , 30 download, 77
element, 28 FLAME/C, 139
element growth, 122 bidimensional partitioning, 81
equality, 10 creators, 141
equations, 43 destructors, 141
finalizing, 71, 141
factorization FLA Cont with 1x3 to 1x2, 81
Cholesky, 113 FLA Cont with 3x1 to 2x1, 79
LU, 113 FLA Cont with 3x3 to 2x2, 83
fetch, 88 FLA Finalize, 71
final result, 17 FLA Init, 71, 141
FLA Cont with 1x3 to 1x2 FLA Merge 1x2, 84
FLAME@lab, 68 FLA Merge 2x1, 84
FLAME/C, 81 FLA Merge 2x2, 84
FLA Cont with 3x1 to 2x1 FLA Obj attach buffer, 73
FLAME@lab, 65 FLA Obj buffer, 73
FLAME/C, 79 FLA Obj create, 71
FLA Cont with 3x3 to 2x2 FLA Obj create without buffer, 73
FLAME@lab, 69 FLA Obj datatype, 73
FLAME/C, 83 FLA Obj free, 72
FLA Finalize, 71 FLA Obj free without buffer, 75
FLA Init, 71, 141 FLA Obj ldim, 73
lab FLA Obj length, 73
download, 70 FLA Obj show, 75
FLAME project, 2 FLA Obj width, 73
FLAME@lab, 62 FLA Part 1x2, 79
bidimensional partitioning, 68 FLA Part 2x1, 77
FLA Cont with 1x3 to 1x2, 68 FLA Part 2x2, 81
FLA Cont with 3x1 to 2x1, 65 FLA Repart 1x2 to 1x3, 79
FLA Cont with 3x3 to 2x2, 69 FLA Repart 2x1 to 3x1, 78
FLA Part 1x2, 66 FLA Repart 2x2 to 3x3, 83
FLA Part 2x1, 63 horizontal partitioning, 77
FLA Part 2x2, 68 initializing, 71, 141
FLA Repart 1x2 to 1x3, 66 inquiry routines, 141
FLA Repart 2x1 to 3x1, 64 manipulating objects, 71, 141
FLA Repart 2x2 to 3x3, 68 object
horizontal partitioning, 62 print contents, 75, 143
vertical partitioning, 66 operations
FLAMEC Cholesky factorization, 143
peak, 8 scalar, 28
permutation, 124 dot
PivIndex(·), 125 cost, 88
pivot row, 123 shape
PME, 17 gebp, 93
postcondition, 13, 15, 16 gemm, 93
precondition, 13, 15, 16 gemp, 93
predicate, 13 gemv, 93
Preface, v gepb, 93
Principle of Mathematical Induction, 22 gepdot, 93
processor gepm, 93
model, 87 gepp, 93
programming language ger, 93
C, 5, 61 side, 63
Fortran, 5 small, 92
Haskell, 6 SPD, 113
LabView G, 6 S, 13
M-script, 5, 61 stability, 8
Mathematica, 6 analysis, 8
proof of correctness, 15 state, 13
after moving the thick lines, 19
quadrant, 68 determining, 19
Quick Reference Guide, 139 after repartitioning, 18
RAM, 87 determining, 18
rank, 31 store, 88
rank-1 update, 27, 40 SyLw(·), 105
ger, 40 SyLw(·), 132
cost, 88 symm, 111
definition, 40 blocked Variant 1, 110
PMEs, 42 cost, 108
real number, 20, 28 loop-invariants, 109
recursive, 8, 55 performance, 108
reference, 77 PME 1, 107
registers, 88 symmetric positive definite, 113
Repartition ..., 12 symv, 59
right-hand side vector, 44 syr, 59
syr2, 59
scal, 25 syr2k, 111
cost, 88 syrk, 111
blocked Variant 2, 53
unblocked Variant 2, 45
www.linearalgebrawiki.org, ix