The Science of Programming Matrix Computations
Robert A. van de Geijn
The University of Texas at Austin
Enrique S. Quintana-Ortí
Universidad Jaume I
Copyright © 2007 by Robert A. van de Geijn and Enrique S. Quintana-Ortí.
10 9 8 7 6 5 4 3 2 1
All rights reserved. No part of this book may be reproduced, stored, or transmitted in any manner without the
written permission of the publisher. For information, contact either of the authors.
No warranties, express or implied, are made by the publisher, authors, and their employers that the programs
contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem
whose incorrect solution could result in injury to person or property. If the programs are employed in such a
manner, it is at the user’s own risk and the publisher, authors, and their employers disclaim all liability for such
misuse.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are
used in an editorial context only; no infringement of trademark is intended.
Contents

List of Contributors
Preface
1 Motivation
  1.1 A Motivating Example: the LU Factorization
  1.2 Notation
  1.3 Algorithmic Variants
  1.4 Presenting Algorithms in Code
  1.5 High Performance and Blocked Algorithms
  1.6 Numerical Stability
3 Matrix-Vector Operations
  3.1 Notation

List of Contributors
A large number of people have contributed, and continue to contribute, to the FLAME project. For a complete list, please visit http://www.cs.utexas.edu/users/flame/.
Below we list the people who have contributed directly to the knowledge and understanding that is summarized
in this text.
Paolo Bientinesi
Ernie Chan
Kazushige Goto
John A. Gunnels
Margaret E. Myers
Gregorio Quintana-Ortí
Preface
The only effective way to raise the confidence level of a program significantly is
to give a convincing proof of its correctness. But one should not first make the
program and then prove its correctness, because then the requirement of providing
the proof would only increase the poor programmer’s burden. On the contrary: the
programmer should let correctness proof and program grow hand in hand.
– E.W. Dijkstra
This book shows how to put the above words of wisdom into practice when programming algorithms for dense
linear algebra operations.
Programming as a Science
One definition of science is knowledge that has been reduced to a system. In this book we show how for a broad
class of matrix operations the derivation and implementation of algorithms can be made systematic.
Notation
Traditionally, algorithms in this area have been expressed by explicitly exposing the indexing of elements in
matrices and vectors. It is not clear whether this has its roots in how matrix operations were originally coded
in languages like Fortran77 or whether it was because the algorithms could be more concisely stated, something
that may have been important in the days when the typesetting of mathematics was time-consuming and the
printing of mathematical books expensive.
The notation adopted in this book attempts to capture the pictures of matrices and vectors that often
accompany the explanation of an algorithm. Such a picture typically does not expose indexing. Rather, it
captures regions (submatrices and subvectors) that have been, or are to be, updated in a consistent manner.
Similarly, our notation identifies regions in matrices and vectors, hiding indexing details. While algorithms so
expressed require more space on a page, we believe the notation improves the understanding of the algorithm
as well as the opportunity for comparing and contrasting different algorithms that compute the same operation.
Goal-Oriented Programming
The new notation and the APIs for representing the algorithms in code set the stage for growing proof of
correctness and program hand-in-hand, as advocated by Dijkstra. For reasons that will become clear, high-
performance algorithms for computing matrix operations must inherently involve a loop. The key to developing
a loop is the ability to express the state (contents) of the variables, being updated by the loop, before and after
each iteration of the loop. It is the new notation that allows one to concisely express this state, which is called
the loop-invariant in computer science. Equally importantly, the new notation allows one to systematically
identify all reasonable states that can be maintained by a loop that computes the desired matrix operation. As
a result, the derivation of loops for computing matrix operations becomes systematic, allowing hand-in-hand
development of multiple algorithms and their proof of correctness.
High Performance
The scientific computing community insists on attaining high performance on whatever architectures are the
state-of-the-art. The reason is that there is always interest in solving larger problems and computation time
is often the limiting factor. The second half of the book demonstrates that the formal derivation methodol-
ogy facilitates high performance. The key insight is that the matrix-matrix product operation can inherently
achieve high performance, and that most computation-intensive matrix operations can be arranged so that more
computation involves matrix-matrix multiplication.
Intended Audience
This book is in part a portal for accessing research papers, tools, and libraries that were and will be developed
as part of the Formal Linear Algebra Methods Environment (FLAME) project that is being pursued by re-
searchers at The University of Texas at Austin and other institutions. The basic knowledge behind the FLAME
methodology is presented in a way that makes it accessible to novices (e.g., undergraduate students with a
limited background in linear algebra and high-performance computing). However, the approach has been used
to produce state-of-the-art high-performance linear algebra libraries, making the book equally interesting to
experienced researchers and the practicing professional.
The audience of this book extends beyond those interested in the domain of linear algebra algorithms. It
is of interest to students and scholars with interests in the theory of computing since it shows how to make
the formal derivation of algorithms practical. It is of interest to the compiler community because the notation
and APIs present programs at a much higher level of abstraction than traditional code does, which creates new
opportunities for compiler optimizations. It is of interest to the scientific computing community since it shows
how to develop routines for a matrix operation when that matrix operation is not supported by an existing
library. It is of interest to the architecture community since it shows how algorithms and architectures interact.
It is of interest to the generative programming community, since the systematic approach to deriving algorithms
supports the mechanical derivation of algorithms and implementations.
Related Materials
While this book is meant to be self-contained, it may be desirable to use it in conjunction with books and texts
that focus on various other topics related to linear algebra. A brief list follows.
• Gilbert Strang. Linear Algebra and its Applications, Third Edition. Academic Press, 1988.
Discusses the mathematics of linear algebra at a level appropriate for undergraduates.
• Gene H. Golub and Charles F. Van Loan. Matrix Computations, Third Edition. The Johns Hopkins
University Press, 1996.
Advanced text that is best used as a reference or as a text for a class with a more advanced treatment of
the topics.
• Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. SIAM, 2002.
An advanced book on the numerical analysis of linear algebra algorithms.
In addition, we recommend the following manuscripts for those who want to learn more about formal verification
and derivation of programs.
• Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.
Department of Computer Sciences, University of Texas at Austin, August 2006.
Chapter 2 of this dissertation systematically justifies the structure of the worksheet and explains in even
more detail how it relates to formal derivation methods. It shows how the FLAME methodology can be
made mechanical and how it enables the systematic stability analysis of the algorithms that are derived.
We highly recommend reading this dissertation upon finishing this text.
Since formatting the algorithms takes center stage in our approach, we recommend the classic reference for the LaTeX document preparation system:
• Leslie Lamport. LaTeX: A Document Preparation System, Second Edition. Addison-Wesley Publishing Company, Inc., 1994.
User's guide and reference manual for typesetting with LaTeX.
Webpages
A companion webpage has been created for this book. The base address is
http://www.cs.utexas.edu/users/flame/books/TSoPMC/
(TSoPMC: The Science of Programming Matrix Computations). In the text the above path will be referred to
as $BASE/. On these webpages we have posted errata, additional materials, hints for the exercises, tools, and
software. We suggest the reader visit this website at this time to become familiar with its structure and content.
Wiki: www.linearalgebrawiki.org
Many examples of operations, algorithms, derivations, and implementations similar to those discussed in this
book can be found at
http://www.linearalgebrawiki.org/
Why www.lulu.com?
We considered publishing this book through more conventional channels. Indeed three major publishers of
technical books offered to publish it (and two politely declined). The problem, however, is that the cost of
textbooks has spiralled out of control and, given that we envision this book primarily as a reference and a
supplemental text, we could not see ourselves adding to the strain this places on students. By publishing it
ourselves through www.lulu.com, we have reduced the cost of a copy to a level where it is hardly worth printing
it oneself. Since we retain all rights to the material, we may or may not publish future editions the same way,
or through a conventional publisher.
Please visit $BASE/books/TSoPMC/ for details on how to purchase this book.
Acknowledgments
This research was partially sponsored by NSF grants ACI-0305163, CCF-0342369, CCF-0540926, and CCF-
0702714. Additional support came from the J. Tinsley Oden Faculty Fellowship Research Program of the
Institute for Computational Engineering and Sciences (ICES) at UT-Austin, a donation by Dr. James Truchard
(President, CEO, and Co-Founder of National Instruments), and an unrestricted grant from NEC Solutions
(America), Inc.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.
Chapter 1
Motivation
Programming high-performance routines for computing linear algebra operations has long been a fine art. In
this book we show it to be a science by exposing a systematic approach that, given an operation, yields high-
performance implementations for computing it. The methodology builds upon a new notation for expressing
algorithms, new advances regarding the formal derivation of linear algebra algorithms, a new style of coding,
and the use of high-performance implementations of a few key linear algebra operations. In this chapter we
preview the approach.
Don’t Panic: A reader who is less well-versed in linear algebra should not feel intimidated by this chapter:
It is meant to demonstrate to more experienced readers that there is substance to the book. In Chapter 2, we
start over, more slowly. Indeed, a novice may wish to skip Chapter 1, and return to it later.
Figure 1.1: Left: Typical explanation of an algorithm for computing the LU factorization, overwriting A with
L and U . Right: Same algorithm, using our notation.
matrices L and U overwrite the lower and upper triangular parts of A, respectively, and the diagonal of L is not
stored, since all its entries equal one. To show how the algorithm sweeps through the matrix the explanation is
often accompanied by the pictures in Figure 1.2 (left). The thick lines in that figure track the progress through
matrix A as it is updated.
1.2 Notation
In this book, we have adopted a nontraditional notation that captures the pictures that often accompany
the explanation of an algorithm. This notation was developed as part of our Formal Linear Algebra Methods
Environment (FLAME) project [16, 3]. We will show that it facilitates a style of programming that allows the
algorithm to be captured in code as well as the systematic derivation of algorithms [4].
In Figure 1.1(right), we illustrate how the notation is used to express the LU factorization algorithm so
that it reflects the pictures in Figure 1.2. For added clarification we point to Figure 1.2 (right). The next few
chapters will explain the notation in detail, so that for now we leave it to the intuition of the reader.
[Figure 1.2 pictures omitted: for each iteration they show the quadrants ATL, ATR, ABL, ABR of A ("done" and "partially updated" regions), the repartitioning that exposes A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22, and the updates υ11 := α11, u12^T := a12^T, l21 := a21/υ11, and A22 := A22 − l21 u12^T.]
Figure 1.2: Progression of pictures that explain the LU factorization algorithm. Left: As typically presented.
Right: Annotated with labels to explain the notation in Fig. 1.1(right).
Variant 1 (unblocked):  a01 := L00^{-1} a01 (trsv);  a10^T := a10^T U00^{-1} (trsv);  α11 := α11 − a10^T a01 (apdot)
Variant 1 (blocked):    A01 := L00^{-1} A01 (trsm);  A10 := A10 U00^{-1} (trsm);  A11 := LU_unb(A11 − A10 A01) (gemm, LU)

Variant 2 (unblocked):  a10^T := a10^T U00^{-1} (trsv);  α11 := α11 − a10^T a01 (apdot);  a12^T := a12^T − a10^T A02 (gemv)
Variant 2 (blocked):    A10 := A10 U00^{-1} (trsm);  A11 := LU_unb(A11 − A10 A01) (gemm, LU);  A12 := L11^{-1}(A12 − A10 A02) (gemm, trsm)

Variant 3 (unblocked):  a01 := L00^{-1} a01 (trsv);  α11 := α11 − a10^T a01 (apdot);  a21 := (a21 − A20 a01)/α11 (gemv, invscal)
Variant 3 (blocked):    A01 := L00^{-1} A01 (trsm);  A11 := LU_unb(A11 − A10 A01) (gemm, LU);  A21 := (A21 − A20 A01) U11^{-1} (gemm, trsm)

Variant 4 (unblocked):  α11 := α11 − a10^T a01 (apdot);  a21 := (a21 − A20 a01)/α11 (gemv, invscal);  a12^T := a12^T − a10^T A02 (gemv)
Variant 4 (blocked):    A11 := LU_unb(A11 − A10 A01) (gemm, LU);  A21 := (A21 − A20 A01) U11^{-1} (gemm, trsm);  A12 := L11^{-1}(A12 − A10 A02) (gemm, trsm)

Variant 5 (unblocked):  a21 := a21/α11 (invscal);  A22 := A22 − a21 a12^T (ger)
Variant 5 (blocked):    A11 := LU_unb(A11) (LU);  A21 := A21 U11^{-1} (trsm);  A12 := L11^{-1} A12 (trsm);  A22 := A22 − A21 A12 (gemm)

Figure 1.3: Multiple algorithms for computing the LU factorization. Matrices Lii and Uii, i = 0, 1, 2, denote, respectively, the unit lower triangular matrices and upper triangular matrices stored over the corresponding Aii. Expressions involving Lii^{-1} and Uii^{-1} indicate the need to solve a triangular linear system.
Exercise 1.1 Typesetting algorithms like those in Figure 1.1 (right) may seem somewhat intimidating. We have created a webpage that helps generate the LaTeX source as well as a set of LaTeX commands (FLaTeX). Visit $BASE/Chapter1 and follow the directions on the webpage associated with this exercise to try out the tools.
…functional programming languages such as Haskell and Mathematica; and the LabView G graphical programming language.
Exercise 1.2 The formatting in Figure 1.4 is meant to resemble, as closely as possible, the algorithm in Figure 1.1(right). The same webpage that helps generate LaTeX source can also generate an outline for the code. Visit $BASE/Chapter1 and duplicate the code in Figure 1.4 by following the directions on the webpage associated with this exercise.
[Figure 1.5 plot omitted: GFLOPS/sec. versus matrix dimension n (0 to 1500) for a reference implementation and for the unblocked (unb_var1–unb_var5) and blocked (blk_var1–blk_var5) variants.]
Figure 1.5: Performance of unblocked and blocked algorithmic variants for computing the LU factorization.
The key insight is that cache-based architectures, as are currently popular, perform floating-point operations
(flops) at very fast rates, but fetch data from memory at a (relatively) much slower rate. For operations
like the matrix-matrix multiplication (gemm) this memory bandwidth bottleneck can be overcome by moving
submatrices into the processor cache(s) and amortizing this overhead over a large number of flops. This is
facilitated by the fact that gemm involves a number of operations of cubic order on an amount of data that is
of quadratic order. Details of how high performance can be attained for gemm are exposed in Chapter 5.
Given a high-performance implementation of gemm, other operations can attain high performance if the
bulk of the computation can be cast in terms of gemm. This is a property of blocked algorithms. Figure 1.3
(right) displays blocked algorithms for the different algorithmic variants that compute the LU factorization. We
will show that the derivation of blocked algorithms is typically no more complex than the derivation of their
unblocked counterparts.
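To make the contrast concrete, below is a minimal Python/NumPy sketch of the right-looking algorithm (Variant 5 in Figure 1.3) in unblocked and blocked form. This is an illustration only, not the book's FLAME code: the names lu_unb and lu_blk are ours, pivoting is omitted, and explicit inverses stand in for the triangular solves (trsm) a real implementation would use.

    import numpy as np

    def lu_unb(A):
        # Unblocked LU (Variant 5): overwrite A with L (unit lower) and U; no pivoting.
        n = A.shape[0]
        for k in range(n):
            A[k+1:, k] /= A[k, k]                                  # a21 := a21 / alpha11
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])      # A22 := A22 - a21 a12^T (ger)
        return A

    def lu_blk(A, nb=128):
        # Blocked LU (Variant 5): the bulk of the flops is in the final gemm update.
        n = A.shape[0]
        for k in range(0, n, nb):
            b = min(nb, n - k)
            lu_unb(A[k:k+b, k:k+b])                                # A11 := LU_unb(A11)
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            U11 = np.triu(A[k:k+b, k:k+b])
            A[k+b:, k:k+b] = A[k+b:, k:k+b] @ np.linalg.inv(U11)   # A21 := A21 U11^{-1}   (trsm)
            A[k:k+b, k+b:] = np.linalg.inv(L11) @ A[k:k+b, k+b:]   # A12 := L11^{-1} A12   (trsm)
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]       # A22 := A22 - A21 A12  (gemm)
        return A

For a diagonally dominant test matrix A (so that no pivoting is needed), one can check that np.tril(B, -1) + np.eye(n) multiplied by np.triu(B) reproduces A, where B = lu_blk(A.copy()).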
The performance of a code (an implementation of an algorithm) is often measured in terms of the rate at
which flops are performed. The maximal rate that can be attained by a target architecture is given by the
product of the clock rate of the processor times the number of flops that are performed per clock cycle. The
rate of computation for a code is computed by dividing the number of flops required to compute the operation
by the time it takes for it to be computed. A gigaflops (or GFLOPS) indicates a billion flops per second. Thus,
an implementation that computes an operation that requires f flops in t seconds attains a rate of
  (f / t) × 10^{−9} GFLOPS.
Throughout the book we discuss how to compute the cost, in flops, of an algorithm.
The number of flops performed by an LU factorization is about (2/3)n^3, where n is the matrix dimension. In
Figure 1.5 we show the performance attained by implementations of the different algorithms in Figure 1.3 on
an Intel® Pentium® 4 workstation. The clock speed of the particular machine is 1.4 GHz and a Pentium 4
can perform two flops per clock cycle, for a peak performance of 2.8 GFLOPS, which marks the top line in the
graph. The block size nb was taken to equal 128. (We will eventually discuss how to determine a near-optimal
block size.) Note that blocked algorithms attain much better performance than unblocked algorithms and that
not all algorithmic variants attain the same performance.
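As an illustration of how such a rate is measured, the following sketch times the lu_blk routine from the hedged example above (assumed to be in scope) and reports the attained GFLOPS using the (2/3)n^3 flop count; it is not the driver that produced Figure 1.5.

    import time
    import numpy as np

    n = 1500
    A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant, so no pivoting is needed

    start = time.perf_counter()
    lu_blk(A, nb=128)                          # lu_blk as sketched earlier (an assumption, not a library routine)
    t = time.perf_counter() - start

    flops = (2.0 / 3.0) * n**3                 # approximate flop count of the LU factorization
    print(f"{flops / t * 1e-9:.2f} GFLOPS")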
In Chapter 5 it will become clear why we favor loop-based algorithms over recursive algorithms, and how
recursion does enter the picture.
Chapter 2

Derivation of Linear Algebra Algorithms

This chapter introduces the reader to the systematic derivation of algorithms for linear algebra operations.
Through a very simple example we illustrate the core ideas: We describe the notation we will use to ex-
press algorithms; we show how assertions can be used to establish correctness; and we propose a goal-oriented
methodology for the derivation of algorithms. We also discuss how to incorporate an analysis of the cost into
the algorithm.
Finally, we show how to translate algorithms to code so that the correctness of the algorithm implies the
correctness of the implementation.
2.1 A Farewell to Indices

In this section, we introduce a notation for expressing algorithms that avoids the pitfalls of intricate indexing
and will allow us to more easily derive, express, and implement algorithms. We present the notation through
a simple example, the inner product of two vectors, an operation that will be used throughout this chapter for
illustration.
Given two vectors, x and y, of length m, the inner product or dot product (dot) of these vectors is given by
  α := x^T y = Σ_{i=0}^{m−1} χ_i ψ_i .
Algorithm: α := apdot(x, y, α)
  Partition x → [xT; xB], y → [yT; yB]
    where xT and yT have 0 elements
  while m(xT) < m(x) do
    Repartition
      [xT; xB] → [x0; χ1; x2], [yT; yB] → [y0; ψ1; y2]
      where χ1 and ψ1 are scalars
    α := χ1 ψ1 + α
    Continue with
      [xT; xB] ← [x0; χ1; x2], [yT; yB] ← [y0; ψ1; y2]
  endwhile

Figure 2.1: Algorithm for computing α := x^T y + α (apdot).
Remark 2.1 We will use the symbol “:=” (“becomes”) to denote assignment while the symbol “=” is reserved
for equality.
Example 2.2 Let x = (1, 2, 3)^T and y = (2, 4, 1)^T. Then x^T y = 1 · 2 + 2 · 4 + 3 · 1 = 13. Here we make use of the symbol "·" to denote the arithmetic product.
A traditional loop for implementing the updating of a scalar by adding a dot product to it, α := x^T y + α,
is given by
k := 0
while k < m do
α := χk ψk + α
k := k + 1
endwhile
Our notation presents this loop as in Figure 2.1. The name of the algorithm in that figure reflects that it
performs an alpha plus dot product (apdot). To interpret the algorithm in Figure 2.1 note the following:
• We bid farewell to intricate indexing: In this example only indices from the sets {T, B} (Top and Bottom)
and {0, 1, 2} are required.
• Each vector has been subdivided into two subvectors, separated by thick lines. This is how we will
represent systematic movement through vectors (and later matrices).
• Subvectors xT and yT include the “top” elements of x and y that, in this algorithm, have already been
used to compute a partial update to α. Similarly, subvectors xB and yB include the “bottom” elements
of x and y that, in this algorithm, have not yet been used to update α. Referring back to the traditional
loop, xT and yT consist of elements 0, . . . , k − 1 and xB and yB consist of elements k, . . . , m − 1:
  xT = (χ0, . . . , χk−1)^T,  xB = (χk, . . . , χm−1)^T   and   yT = (ψ0, . . . , ψk−1)^T,  yB = (ψk, . . . , ψm−1)^T.
• The loop is executed as long as m(xT ) < m(x) is true, which takes the place of k < m in the traditional
loop. Here m(x) equals the length of vector x so that the loop terminates when xT includes all elements
of x.
• The statement
Repartition
  [xT; xB] → [x0; χ1; x2], [yT; yB] → [y0; ψ1; y2]
  where χ1 and ψ1 are scalars
exposes the top elements of xB and yB , χ1 and ψ1 respectively, which were χk and ψk in the traditional
loop.
• The statement α := χ1 ψ1 + α updates α; in the traditional loop this was the update α := χk ψk + α.
Remark 2.3 It is important not to confuse the single elements exposed in our repartitionings, such as χ1 or
ψ1 , with the second entries of corresponding vectors.
• The statement
Continue with
  [xT; xB] ← [x0; χ1; x2], [yT; yB] ← [y0; ψ1; y2]
moves the top elements of xB and yB to xT and yT , respectively. This means that these elements have
now been used to update α and should therefore be added to xT and yT .
k := m − 1
while k ≥ 0 do
α := χk ψk + α
k := k − 1
endwhile
Modify the algorithm in Figure 2.1 so that it expresses this alternative algorithmic variant. Typeset the resulting
algorithm.
This theorem can be interpreted as follows. Assume that the predicate Pinv holds before and after the
loop-body. Then, if Pinv holds before the loop, obviously it will also hold before the loop-body. The commands
in the loop-body are such that it holds again after the loop-body, which means that it will again be true before
the loop-body in the next iteration. We conclude that it will be true before and after the loop-body every time
through the loop. When G becomes false, Pinv will still be true, and therefore it will also be true after the loop completes (provided the loop can be shown to terminate). Thus we can assert that Pinv ∧ ¬G holds after the completion of the loop, where the symbol "¬" denotes the logical negation. This can be summarized by
  {Pinv}
  while G do
    {Pinv ∧ G}          {Pinv ∧ G}
    S              =⇒   S
    {Pinv}              {Pinv}
  endwhile
  {Pinv ∧ ¬G}
if the loop can be shown to terminate. Here =⇒ stands for “implies”. The assertion Pinv is called the
loop-invariant for this loop.
Let us again consider the computation α := x^T y + α. Let us use α̂ to denote the original contents (or state) of α. Then we define the precondition for the algorithm as
  Ppre : α = α̂ ∧ 0 ≤ m(x) = m(y).
The assertion
  Pinv : (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x))
is a loop-invariant as it holds
1. immediately before the loop (by the initialization and the definition of xT^T yT = 0 when m(xT) = m(yT) = 0),
Now, it is also easy to argue that the loop terminates so that, by the Fundamental Invariance Theorem,
{Pinv ∧ ¬G} holds after termination. Therefore,
  Pinv ∧ ¬G ≡ (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x)) ∧ ¬(m(xT) < m(x))
            =⇒ (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x)) ∧ (m(xT) ≥ m(x))
            =⇒ (α = xT^T yT + α̂) ∧ (m(xT) = m(yT) = m(x))
            =⇒ (α = x^T y + α̂),
since xT and yT are subvectors of x and y, and therefore m(xT) = m(yT) = m(x) implies that xT = x and yT = y. Thus we can claim that the algorithm correctly computes α := x^T y + α.
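The role of the loop-invariant can also be checked numerically. Below is a small Python sketch, an illustration only and not part of the book's tools, that runs the traditional loop for α := x^T y + α and asserts that the invariant α = xT^T yT + α̂ (with xT = x[0:k] and yT = y[0:k]) holds before the loop, at the top and bottom of each iteration, and after the loop.

    import numpy as np

    def apdot_checked(x, y, alpha):
        # Compute alpha := x^T y + alpha while asserting the loop-invariant at every step.
        alpha_hat = alpha                    # original contents of alpha (needed only for the assertions)
        m = len(x)
        k = 0
        assert np.isclose(alpha, x[:k] @ y[:k] + alpha_hat)       # invariant holds before the loop
        while k < m:                                              # loop-guard G : m(xT) < m(x)
            assert np.isclose(alpha, x[:k] @ y[:k] + alpha_hat)   # invariant at the top of the loop-body
            alpha = x[k] * y[k] + alpha                           # update: alpha := chi_1 psi_1 + alpha
            k = k + 1                                             # move the boundary down by one element
            assert np.isclose(alpha, x[:k] @ y[:k] + alpha_hat)   # invariant at the bottom of the loop-body
        assert np.isclose(alpha, x @ y + alpha_hat)               # Pinv and (not G) imply the postcondition
        return alpha

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 1.0])
    print(apdot_checked(x, y, 0.0))          # prints 13.0, matching Example 2.2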
Step 1: Specifying the precondition and postcondition. The statement of the operation to be performed, α := x^T y + α, dictates the precondition and postcondition indicated in Steps 1a and 1b. The precondition is given by
  Ppre : α = α̂ ∧ 0 ≤ m(x) = m(y),
and the postcondition is
  Ppost : α = x^T y + α̂.

Step 2: Determining loop-invariants. Partitioning x → [xT; xB] and y → [yT; yB], with m(xT) = m(yT), and substituting into the postcondition yields
  α = xT^T yT + xB^T yB + α̂.
This partitioned matrix expression (PME) expresses the final value of α in terms of its original value and the partitioned vectors.
Remark 2.9 The partitioned matrix expression (PME) is obtained by substitution of the partitioned operands
into the postcondition.
Now, at an intermediate iteration of the loop, α does not contain its final value. Rather, it contains some
partial result towards that final result. This partial result should be reflected in the loop-invariant. One such
intermediate state is given by
  Pinv : (α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x)),
which we note is exactly the loop-invariant that we used to prove the algorithm correct in Figure 2.2.
Remark 2.10 Once it is decided how to partition vectors and matrices into regions that have been updated
and/or used in a consistent fashion, loop-invariants can be systematically determined a priori.
Step 3: Choosing a loop-guard. Upon completion of the loop, the loop-invariant is true and the loop-guard G is false, so that the condition Pinv ∧ ¬G must imply that "Ppost : α = x^T y + α̂" holds. If xT and yT equal all of x and y, respectively, then the loop-invariant implies the postcondition. The choice "G : m(xT) < m(x)" satisfies the desired condition, since Pinv ∧ ¬G implies that m(xT) = m(x), as xT must be a subvector of x, and
  ((α = xT^T yT + α̂) ∧ (0 ≤ m(xT) = m(yT) ≤ m(x))) ∧ (m(xT) ≥ m(x))
    =⇒ α = x^T y + α̂,
as was already argued in the previous section. This loop-guard is entered in Step 3 in Figure 2.2.
Remark 2.11 The loop-invariant and the postcondition together prescribe a (non-unique) loop-guard G.
Step 4: Initialization. The partitioning
  x → [xT; xB], y → [yT; yB],
where xT and yT have no elements, places variables α, xT, xB, yT, and yB in a state where the loop-invariant is satisfied. This initialization appears in Step 4 in Figure 2.2.
Remark 2.12 The loop-invariant and the precondition together prescribe the initialization.
Step 5: Progressing through the vectors. We now note that, as part of the computation, xT and yT start
by containing no elements and must ultimately equal all of x and y, respectively. Thus, as part of the loop,
elements must be taken from xB and yB and must be added to xT and yT , respectively. This is denoted in
Figure 2.2 by the statements
Repartition
  [xT; xB] → [x0; χ1; x2], [yT; yB] → [y0; ψ1; y2]
  where χ1 and ψ1 are scalars,
and
Continue with
  [xT; xB] ← [x0; χ1; x2], [yT; yB] ← [y0; ψ1; y2].
This notation simply captures the movement of χ1 , the top element of xB , from xB to xT . Similarly ψ1 moves
from yB to yT . The movement through the vectors guarantees that the loop eventually terminates, which is
one condition required for the Fundamental Invariance Theorem to apply.
Remark 2.13 The initialization and the loop-guard together prescribe the movement through the vectors.
Step 6: Determining the state after repartitioning. The repartitionings in Step 5a do not change the
contents of α: it is an “indexing” operation. We can thus ask ourselves the question of what the contents of
α are in terms of the exposed parts of x and y. We can derive this state, Pbefore , via textual substitution: The
repartitionings in Step 5a imply that
  xT = x0, xB = [χ1; x2]   and   yT = y0, yB = [ψ1; y2].
If we substitute the expressions on the right of the equalities into the loop-invariant we find that
  α = xT^T yT + α̂
implies that
  α = x0^T y0 + α̂,
which is entered in Step 6 in Figure 2.2.
Step 7: Determining the state after moving the thick lines. The movement of the thick lines in Step 5b
means that now
µ ¶ µ ¶
x0 y0
xT = yT =
χ1 and ψ1 ,
xB = x2 yB = y2
so that
α = xT
T yT + α̂
implies that µ ¶ µ ¶
x0 T y0
α= + α̂ = xT
0 y0 + χ1 ψ1 + α̂,
χ1 ψ1
| {z } | {z }
xT yT
which is then entered as state Pafter in Step 7 in Figure 2.2.
Remark 2.15 The state in Step 7 is determined via textual substitution and the application of the rules of
linear algebra.
Step 8: Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the state
of α must change from
  Pbefore : α = x0^T y0 + α̂
to
  Pafter : α = (x0^T y0 + α̂) + χ1 ψ1   (the term x0^T y0 + α̂ is already in α),
which can be accomplished by updating α as
  α := χ1 ψ1 + α.
Remark 2.16 It is not the case that α̂ (the original contents of α) must be saved, and that the update α := x0^T y0 + χ1 ψ1 + α̂ must be performed. Since α already contains x0^T y0 + α̂, only χ1 ψ1 needs to be added. Thus, α̂ is only needed to be able to reason about the correctness of the algorithm.
Final algorithm. Finally, we note that all the annotations (in the grey boxes) in Figure 2.2 were only introduced
to derive the statements of the algorithm. Deleting these produces the algorithm already stated in Figure 2.1.
Exercise 2.17 Reproduce Figure 2.2 by visiting $BASE/Chapter2/ and following the directions associated with
this exercise.
To assist in the typesetting, some LaTeX macros, from the FLAME-LaTeX (FLaTeX) API, are collected in Figure 2.3.
2.5 Cost Analysis

The cost of the apdot algorithm is given by
  Σ_{k=0}^{m−1} 2 = 2m flops,    (2.1)
where m = m(x).
Let us examine how one would prove the equality in (2.1). There are two approaches: one is to say “well,
that is obvious” while the other proves it rigorously via mathematical induction:
  \FlaOneByThreeR{x_0}{x_1}{x_2}   →   ( x0 | x1 | x2 )
  \FlaOneByThreeL{x_0}{x_1}{x_2}   →   ( x0 | x1 | x2 )
  \FlaTwoByTwo{A_{TL}}{A_{TR}}{A_{BL}}{A_{BR}}   →   2 × 2 partitioning into the quadrants ATL, ATR, ABL, ABR
  \FlaThreeByThreeBR{A_{00}}{A_{01}}{A_{02}}{A_{10}}{A_{11}}{A_{12}}{A_{20}}{A_{21}}{A_{22}}   →   3 × 3 partitioning A00, . . . , A22
  \FlaThreeByThreeBL{...}, \FlaThreeByThreeTR{...}, \FlaThreeByThreeTL{...}   →   the same 3 × 3 partitioning, with the thick lines placed as indicated by the suffix

Figure 2.3: Some FLaTeX macros and the partitionings they typeset.
• Assume Σ_{k=0}^{m−1} 2 = 2m. Show that Σ_{k=0}^{(m+1)−1} 2 = 2(m + 1):
    Σ_{k=0}^{(m+1)−1} 2 = ( Σ_{k=0}^{m−1} 2 ) + 2 = 2m + 2 = 2(m + 1).
We conclude by the Principle of Mathematical Induction that Σ_{k=0}^{m−1} 2 = 2m.
This inductive proof can be incorporated into the worksheet as illustrated in Figure 2.4, yielding the cost
worksheet. In that figure, we introduce Csf which stands for “Cost-so-far”. Assertions are added to the
worksheet indicating the computation cost incurred so far at the specified point in the algorithm. In Step 1a,
the cost is given by Csf = 0. At Step 2, just before the loop, this translates to Csf = 2m(xT ) since m(xT ) = 0
and the operation in Step 4 is merely an indexing operation, which does not represent useful computation
and is therefore not counted. This is analogous to the base case in our inductive proof. The assertion that
Csf = 2m(xT ) is true at the top of the loop-body is equivalent to the induction hypothesis. We will refer to this
cost as the cost-invariant of the loop. We need to show that it is again true at the bottom of the loop-body,
where m(xT ) is one greater than m(xT ) at the top of the loop. We do so by inserting Csf = 2m(x0 ) in Step 6,
which follows by textual substitution and the fact that the operations in Step 5a are indexing operations and
do not count towards Csf. The fact that two flops are performed in Step 8 and that the operations in Step 5b are indexing operations means that Csf = 2m(x0) + 2 at Step 7. Since one element has been added to xT, at the bottom of the loop-body m(xT) = m(x0) + 1, so that Csf = 2m(xT) holds there. Thus, as was true for the loop-invariant, Csf = 2m(xT) upon leaving the loop. Since there m(xT) = m(x), the total cost of the algorithm is 2m(x) flops.
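The cost-invariant can be checked in the same mechanical way. The following sketch (an illustration only) adds a counter to the loop and asserts that Csf = 2 m(xT) holds at the top and bottom of every iteration, so that Csf = 2 m(x) holds on completion.

    def apdot_cost(x, y, alpha):
        # Count the flops performed by the apdot loop, checking the cost-invariant Csf = 2*k.
        m = len(x)
        csf = 0                            # "cost-so-far"
        k = 0
        while k < m:
            assert csf == 2 * k            # cost-invariant at the top of the loop-body
            alpha = x[k] * y[k] + alpha    # one multiply and one add: 2 flops
            csf += 2
            k += 1
            assert csf == 2 * k            # cost-invariant at the bottom of the loop-body
        assert csf == 2 * m                # total cost: 2 m(x) flops
        return alpha, csf

For example, apdot_cost([1.0, 2.0, 3.0], [2.0, 4.0, 1.0], 0.0) returns (13.0, 6), i.e., 2m = 6 flops.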
The above analysis demonstrates the link between a loop-invariant, a cost-invariant, and an inductive hy-
pothesis. The proof of the Fundamental Invariance Theorem employs mathematical induction [15].
2.6 Summary
In this chapter, we have introduced the fundamentals of the FLAME approach in the setting of a simple example,
the apdot. Let us recap the highlights so far.
• In our notation algorithms are expressed without detailed indexing. By partitioning vectors into subvec-
tors, the boundary between those subvectors can be used to indicate how far into the vector indexing has
reached. Elements near that boundary are of interest since they may move across the boundary as they
are updated and/or used in the current iteration. It is this insight that allows us to restrict indexing only
to the sets {T, B} and {0, 1, 2} when tracking vectors.
• Assertions naturally express the state in which variables should be at a given point in an algorithm.
• Loop-invariants are systematically identified a priori from the postcondition, which is the specification of
the computation to be performed. This makes the approach goal-oriented.
• Given a precondition, postcondition, and a specific loop-invariant, all other steps of the derivation are
prescribed. The systematic method for deriving all these parts is embodied in Figure 2.5, which we will
refer to as the worksheet from here on.
Step 1a:  {Ppre}
Step 4:   Partition
            where
Step 2:   {Pinv}
Step 3:   while G do
Step 2,3:   {(Pinv) ∧ (G)}
Step 5a:    Repartition
              where
Step 6:     {Pbefore}
Step 8:     SU (the update)
Step 5b:    Continue with
Step 7:     {Pafter}
Step 2:     {Pinv}
          endwhile
Step 2,3: {(Pinv) ∧ ¬(G)}
Step 1b:  {Ppost}

Figure 2.5: Worksheet for deriving linear algebra algorithms.
• An expression for the cost of an algorithm can be determined by summing the cost of the operations in
the loop-body. A closed-form expression for this summation can then be proven correct by annotating
the worksheet with a cost-invariant.
Remark 2.18 A blank worksheet, to be used in subsequent exercises, can be obtained by visiting $BASE/Chapter2/.
Figure 2.6: Vector-vector operations. Here, α is a scalar while x and y are vectors of length m.
Chapter 3
Matrix-Vector Operations
The previous chapter introduced the FLAME approach to deriving and implementing linear algebra algorithms.
The primary example chosen for that chapter was an operation that involved scalars and vectors only, as did
the exercises in that chapter. Such operations are referred to as vector-vector operations. In this chapter, we
move on to simple operations that combine matrices and vectors and that are thus referred to as matrix-vector
operations.
Matrices differ from vectors in that their two-dimensional shape permits systematic traversal in multiple
directions: While vectors are naturally accessed from top to bottom or vice-versa, matrices can be accessed
row-wise, column-wise, and by quadrants, as we will see in this chapter. This multitude of ways in which
matrices can be partitioned leads to a much richer set of algorithms.
We focus on three common matrix-vector operations, namely, the matrix-vector product, the rank-1 update,
and the solution of a triangular linear system of equations. The latter will also be used to illustrate the
derivation of a blocked variant of an algorithm, a technique that supports performance and modularity. We will
see that these operations build on the vector-vector operations encountered in the previous chapter and become
themselves building-blocks for blocked algorithms for matrix-vector operations, and more complex operations
in later chapters.
3.1 Notation
A vector x ∈ R^m is an ordered tuple of m real numbers.¹ It is written as a column of elements, with parentheses around it:

    x = ( χ0    )
        ( χ1    )
        ( ...   )
        ( χm−1  ).

The parentheses are there only for visual effect. In some cases, for clarity, we will include bars that separate the elements of the column.
We adopt the convention that lowercase Roman letters are used for vector variables and lowercase Greek letters
for scalars. Also, the elements of a vector are denoted by the Greek letter that corresponds to the Roman letter
used to denote the vector. A table of corresponding letters is given in Appendix A.
If x is a column vector, then the row vector with identical elements organized as a row is denoted by x^T:
  x^T = (χ0, χ1, . . . , χm−1).
The "T" superscript stands for transposition. Sometimes we will leave out the commas that separate the elements, replacing them with a blank instead, or we will use separation bars:
  x^T = ( χ0  χ1  · · ·  χm−1 ) = ( χ0 | χ1 | · · · | χm−1 ).
Often it will be space-consuming to have a column vector in a sentence written as a column of its elements. Thus, rather than writing x as a column we will then write x = (χ0, χ1, . . . , χm−1)^T.
A matrix A ∈ R^{m×n} is a two-dimensional array of elements where its (i, j) element is given by αij:

    A = ( α00       α01       ...  α0,n−1      )
        ( α10       α11       ...  α1,n−1      )
        ( ...       ...            ...         )
        ( αm−1,0    αm−1,1    ...  αm−1,n−1    ).
1 Throughout the book we will take scalars, vectors, and matrices to be real valued. Most of the results also apply to the case where they are complex valued.
We adopt the convention that matrices are denoted by uppercase Roman letters. Frequently, we will partition A by columns or rows:

    A = ( a0, a1, . . . , an−1 ) = ( ǎ0^T    )
                                   ( ǎ1^T    )
                                   ( ...     )
                                   ( ǎm−1^T  ),
Remark 3.1 Lowercase Greek letters and Roman letters will be used to denote scalars and vectors, respec-
tively. Uppercase Roman letters will be used for matrices.
Exceptions to this rule are variables that denote the (integer) dimensions of the vectors and matrices which
are denoted by Roman lowercase letters to follow the traditional convention.
During an algorithm one or more variables (scalars, vectors, or matrices) will be modified so that they no
longer contain their original values (contents). Whenever we need to refer to the original contents of a variable
we will put a “ˆ” symbol on top of it. For example, Â, â, and α̂ will refer to the original contents (those before
the algorithm commences) of A, a, and α, respectively.
Remark 3.2 A variable name with a “ˆ” symbol on top of it refers to the original contents of that variable.
This will be used for scalars, vectors, matrices, and also for parts (elements, subvectors, submatrices) of these
variables.
3.2 Linear Transformations and Matrices

While the reader has likely been exposed to the definition of the matrix-vector product before, we believe it to
be a good idea to review why it is defined as it is.
Definition 3.3 Let F : Rn → Rm be a function that maps a vector from Rn to a vector in Rm . Then F is said
to be a linear transformation if F(αx + y) = αF(x) + F(y) for all α ∈ R and x, y ∈ Rn .
Consider the unit basis vectors, ej ∈ R^n, 0 ≤ j < n, which are defined by the vectors of all zeroes except for the jth element, which equals 1:
  ej = ( 0, . . . , 0, 1, 0, . . . , 0 )^T,
where the "1" appears as element j, preceded by j zeroes (elements 0, 1, . . . , j − 1) and followed by n − j − 1 zeroes (elements j + 1, j + 2, . . . , n − 1).
Any vector x = (χ0, χ1, . . . , χn−1)^T ∈ R^n can then be written as a linear combination of these vectors:
  x = χ0 e0 + χ1 e1 + · · · + χn−1 en−1.
If F : Rn → Rm is a linear transformation, then
F(x) = F(χ0 e0 + χ1 e1 + · · · + χn−1 en−1 ) = χ0 F(e0 ) + χ1 F(e1 ) + · · · + χn−1 F(en−1 )
= χ0 a0 + χ1 a1 + · · · + χn−1 an−1 ,
where aj = F(ej ) ∈ Rm , 0 ≤ j < n. Thus, we conclude that if we know how the linear transformation F acts
on the unit basis vectors, we can evaluate F(x) as a linear combination of the vectors aj , 0 ≤ j < n, with the
coefficients given by the elements of x. The matrix A ∈ Rm×n that has aj , 0 ≤ j < n, as its jth column thus
represents the linear transformation F, and the matrix-vector product Ax is defined as
  Ax ≡ F(x) = χ0 a0 + χ1 a1 + · · · + χn−1 an−1

            = ( α00 χ0 + α01 χ1 + · · · + α0,n−1 χn−1              )
              ( α10 χ0 + α11 χ1 + · · · + α1,n−1 χn−1              )
              ( ...                                                )
              ( αm−1,0 χ0 + αm−1,1 χ1 + · · · + αm−1,n−1 χn−1      ).      (3.1)
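The definition can be checked directly with a short NumPy sketch (an illustration, not part of the book's notation): multiplying A by the jth unit basis vector extracts the jth column, and Ax equals the linear combination of the columns of A with the coefficients taken from x.

    import numpy as np

    m, n = 4, 3
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, n))
    x = rng.standard_normal(n)

    # A e_j = a_j: the jth unit basis vector picks out the jth column of A.
    for j in range(n):
        e_j = np.zeros(n)
        e_j[j] = 1.0
        assert np.allclose(A @ e_j, A[:, j])

    # A x = chi_0 a_0 + chi_1 a_1 + ... + chi_{n-1} a_{n-1}.
    lin_comb = sum(x[j] * A[:, j] for j in range(n))
    assert np.allclose(A @ x, lin_comb)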
Exercise 3.4 Let A = (a0 , a1 , . . . , an−1 ) ∈ Rm×n be a partitioning of A by columns. Show that Aej = aj ,
0 ≤ j < n, from the definition of the matrix-vector product in (3.1).
Exercise 3.5 Let F : Rn → Rn have the property that F(x) = x for all x ∈ Rn . Show that
(a) F is a linear transformation.
(b) F(x) = In x, where In is the identity matrix defined as In = (e0 , e1 , . . . , en−1 ).
The following two exercises relate to the distributive property of the matrix-vector product.
Exercise 3.6 Let F : Rn → Rm and G : Rn → Rm both be linear transformations. Show that H : Rn → Rm
defined by H(x) = F(x) + G(x), is also a linear transformation. Next, let A, B, and C equal the matrices that
represent F, G, and H, respectively. Explain why C = A + B should be defined as the matrix that results from
adding corresponding elements of A and B.
Exercise 3.7 Let A ∈ Rm×n and x, y ∈ Rn . Show that A(x + y) = Ax + Ay; that is, the matrix-vector product
is distributive with respect to vector addition.
Two important definitions, which will be used later in the book, are the following.
Definition 3.8 A set of vectors v1 , v2 , . . . , vn ∈ Rm is said to be linearly independent if
ν1 v1 + ν2 v2 + · · · + νn vn = 0,
with ν1 , ν2 , . . . , νn ∈ R, implies ν1 = ν2 = . . . = νn = 0.
Definition 3.9 The column (row) rank of a matrix A is the maximal number of linearly independent column
(row) vectors of A.
Note that the row and column rank of a matrix are always equal.
3.3 Algorithms for the Matrix-Vector Product

Let us consider the operation y := αAx + βy, with x ∈ R^n, y ∈ R^m partitioned into elements, and
A ∈ Rm×n partitioned into elements, columns, and rows as discussed in Section 3.1. This is a more general
form of the matrix-vector product and will be referred to as gemv from here on. For simplicity, we consider
α = β = 1 in this section.
From
y := Ax + y = χ0 a0 + χ1 a1 + · · · + χn−1 an−1 + y = [[[[χ0 a0 + χ1 a1 ] + · · ·] + χn−1 an−1 ] + y] ,
we note that gemv can be computed by repeatedly performing axpy operations. Because of the commutative
property of vector addition, the axpys in this expression can be performed in any order.
Next, we show that gemv can be equally well computed as a series of apdots involving the rows of matrix
A, vector x, and the elements of y:
    y := Ax + y = ( α00 χ0 + · · · + α0,n−1 χn−1          )   ( ψ0    )   ( ǎ0^T x + ψ0      )
                  ( α10 χ0 + · · · + α1,n−1 χn−1          ) + ( ψ1    ) = ( ǎ1^T x + ψ1      )
                  ( ...                                    )   ( ...   )   ( ...              )
                  ( αm−1,0 χ0 + · · · + αm−1,n−1 χn−1     )   ( ψm−1  )   ( ǎm−1^T x + ψm−1  ).
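The two orderings can be made concrete with a short NumPy sketch (an illustration only): the first loop computes y := Ax + y as a sequence of axpy operations with the columns of A, the second as a sequence of apdot operations with the rows of A.

    import numpy as np

    def gemv_axpy(A, x, y):
        # y := A x + y computed column by column (a sequence of axpys).
        y = y.copy()
        for j in range(A.shape[1]):
            y += x[j] * A[:, j]          # axpy: y := chi_j a_j + y
        return y

    def gemv_dots(A, x, y):
        # y := A x + y computed row by row (a sequence of apdots).
        y = y.copy()
        for i in range(A.shape[0]):
            y[i] = A[i, :] @ x + y[i]    # apdot: psi_i := a_i^T x + psi_i
        return y

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 3)); x = rng.standard_normal(3); y = rng.standard_normal(4)
    assert np.allclose(gemv_axpy(A, x, y), A @ x + y)
    assert np.allclose(gemv_dots(A, x, y), A @ x + y)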
where Ai,j ∈ R^{mi×nj}, xj ∈ R^{nj}, and yi ∈ R^{mi}. Then, the ith subvector of y = Ax is given by
  yi = Σ_{j=0}^{ν−1} Ai,j xj .    (3.2)
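A minimal NumPy sketch of (3.2), with arbitrarily chosen (hypothetical) row and column block boundaries, accumulates each subvector yi from the products Ai,j xj and compares the result with Ax.

    import numpy as np

    def gemv_blocked(A, x, row_splits, col_splits):
        # Compute y = A x block by block: y_i = sum_j A_{i,j} x_j, as in (3.2).
        m, n = A.shape
        rows = np.split(np.arange(m), row_splits)     # index sets of the row blocks
        cols = np.split(np.arange(n), col_splits)     # index sets of the column blocks
        y = np.zeros(m)
        for I in rows:
            for J in cols:
                y[I] += A[np.ix_(I, J)] @ x[J]        # y_i := y_i + A_{i,j} x_j
        return y

    rng = np.random.default_rng(2)
    A = rng.standard_normal((5, 7)); x = rng.standard_normal(7)
    assert np.allclose(gemv_blocked(A, x, row_splits=[2], col_splits=[3, 5]), A @ x)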
Remark 3.15 Subscripts “L” and “R” will serve to specify the Left and Right submatrices/subvectors of a
matrix/vector, respectively. Similarly, subscripts “T” and “B” will be used to specify the Top and Bottom
submatrices/subvectors.
Exercise 3.16 Prove Corollary 3.13.
Exercise 3.17 Prove Corollary 3.14.
Remark 3.18 Corollaries 3.13 and 3.14 pose certain restrictions on the dimensions of the partitioned matri-
ces/vectors so that the matrix-vector product is “consistently” defined for these partitioned elements.
Let us share a few hints on what a conformal partitioning is for the type of expressions encountered in gemv.
Consider a matrix A, two vectors x, y, and an operation which relates them as:
1. x + y or x := y; then, m(x) = m(y), and any partitioning by elements in one of the two operands must
be done conformally (with the same dimensions) in the other operand.
2. Ax; then, n(A) = m(x) and any partitioning by columns in A must be conformally performed by elements
in x.
Variant 1:  ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons        Variant 2:  ( [yT; yB] = [ŷT; AB x + ŷB] ) ∧ Pcons
Variant 3:  ( y = AL xT + ŷ ) ∧ Pcons                     Variant 4:  ( y = AR xB + ŷ ) ∧ Pcons

Figure 3.1: Four loop-invariants for computing y := Ax + y (gemv).
3.3.1 Derivation
We now derive algorithms for gemv using the eight steps in the worksheet (Figure 2.5).
Remark 3.19 In order to derive the following algorithms, we do not assume the reader is a priori aware of
any method for computing gemv. Rather, we apply systematically the steps in the worksheet to derive two
different algorithms, which correspond to the computation of gemv via a series of axpys or apdots.
Step 1: Specifying the precondition and postcondition. The precondition for the algorithm is given by
  Ppre : (A ∈ R^{m×n}) ∧ (y ∈ R^m) ∧ (x ∈ R^n),
and the postcondition by
  Ppost : y = Ax + ŷ.
Step 2: Determining loop-invariants. Corollaries 3.13 and 3.14 provide us with two PMEs from which loop-
invariants can be determined:
  [yT; yB] = [AT x + ŷT; AB x + ŷB] ∧ Pcons    and    y = AL xT + AR xB + ŷ ∧ Pcons.
Here Pcons : m(yT ) = m(AT ) for the first PME and Pcons : m(xT ) = n(AL ) for the second PME.
Remark 3.20 We will often use the consistency predicate “Pcons ” to establish conditions on the partitionings
of the operands that ensure that operations are well-defined.
A loop-invariant inherently describes an intermediate result towards the final result computed by a loop.
The observation that only part of the computations have been performed before each iteration yields the four
different loop-invariants in Figure 3.1.
Let us focus on Invariant 1:
  Pinv-1 : ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons ,    (3.3)
which reflects a state where elements of yT have already been updated with the final result while elements of
yB have not. We will see next how the partitioning of A by rows together with this loop-invariant fixes all
remaining steps in the worksheet and leads us to the algorithm identified as Variant 1 in Figure 3.2.
Step 3: Choosing a loop-guard. Upon completion of the loop, the loop-invariant is true, the loop-guard G is
false, and the condition
  Pinv ∧ ¬G ≡ ( ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons ) ∧ ¬G    (3.4)
must imply that y = Ax + ŷ. Now, if yT equals all of y then, by consistency, AT equals all of A, and (3.4)
implies that the postcondition is true. Therefore, we adopt “G : m(yT ) < m(y)” as the required loop-guard G
for the worksheet.
Step 4: Initialization. Next, we must find an initialization that, ideally with a minimum amount of computa-
tions, sets the variables of the algorithm in a state where the loop-invariant (including the consistency condition)
holds.
We note that the partitioning
  A → [AT; AB],   y → [yT; yB],
where AT has 0 rows and yT has no elements, sets the variables AT , AB , yT , and yB in a state where the
loop-invariant is satisfied. This initialization, which only involves indexing operations, appears in Step 4 in
Figure 3.2.
Step 5: Progressing through the operands. As part of the computation, AT and yT , start by having no
elements, but must ultimately equal all of A and y, respectively. Thus, as part of the loop, rows must be taken
from AB and added to AT while elements must be moved from yB to yT . This is denoted in Figure 3.2 by the
repartitioning statements²
  [AT; AB] → [A0; a1^T; A2]   and   [yT; yB] → [y0; ψ1; y2],
This manner of moving the elements ensures that Pcons holds and that the loop terminates.
Step 6: Determining the state after repartitioning. The contents of y in terms of the partitioned matrix
and vectors, Pbefore in the worksheet in Figure 2.5, is determined via textual substitution as follows. From the
partitionings in Step 5a,
  AT = A0, AB = [a1^T; A2]   and   yT = y0, yB = [ψ1; y2].
If we substitute the quantities on the right of the equalities into the loop-invariant,
  ( [yT; yB] = [AT x + ŷT; ŷB] ) ∧ Pcons,
we find that
  [y0; ψ1; y2] = [A0 x + ŷ0; ψ̂1; ŷ2],
as entered in Step 6 in Figure 3.2.
Step 7: Determining the state after moving the thick lines. After moving the thick lines, in Step 5b
  [yT; yB] = [AT x + ŷT; ŷB]

² In the partitionings we do not use the superscript "ˇ" for the row a1^T as, in this case, there is no possible confusion with a column of the matrix.
implies that
  [y0; ψ1; y2] = [ [A0; a1^T] x + [ŷ0; ψ̂1]; ŷ2 ],   or,   [y0; ψ1; y2] = [A0 x + ŷ0; a1^T x + ψ̂1; ŷ2].
This is entered as the state Pafter in the worksheet in Figure 2.5, as shown in Step 7 in Figure 3.2.
Step 8: Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the contents
of y must be updated from
y0 A0 x + ŷ0 y0 A0 x + ŷ0
ψ1 = ψ̂1 to ψ1 = a T
1 x+ψ̂1
.
y2 ŷ2 y2 ŷ2
ψ1 := aT
1 x + ψ1 ,
Final algorithm. By deleting the annotations (assertions) we finally obtain the algorithm for gemv (Variant
1) given in Figure 3.3. All the arithmetic operations in this algorithm are performed in terms of apdot.
Remark 3.21 The partitionings together with the loop-invariant prescribe steps 3–8 of the worksheet.
Exercise 3.22 Derive an algorithm for computing y := Ax + y using the Invariant 2 in Figure 3.1.
Exercise 3.23 Consider Invariant 3 in Figure 3.1. Provide all steps that justify the worksheet in Figure 3.4.
State the algorithm without assertions.
Exercise 3.24 Derive an algorithm for computing y := Ax + y using the Invariant 4 in Figure 3.1.
  ψ1 := a1^T x + ψ1
  Continue with
    [AT; AB] ← [A0; a1^T; A2],  [yT; yB] ← [y0; ψ1; y2]
endwhile
Consider now Figure 3.5, where assertions are added indicating the computation cost incurred so far at
the specified points in the algorithm. In Step 1a, the cost is given by Csf = 0. At Step 2, just before the
loop, this translates to the cost-invariant Csf = 2m(yT )n since m(yT ) = 0. We need to show that the cost-
invariant, which is true at the top of the loop, is again true at the bottom of the loop-body, where m(yT ) is
one greater than m(yT ) at the top of the loop. We do so by inserting Csf = 2m(y0 )n in Step 6, which follows
by textual substitution and the fact that Step 5a is composed of indexing operations with no cost. As 2n flops
are performed in Step 8 and the operations in Step 5b are indexing operations, Csf = 2(m(y0 ) + 1)n at Step
7. Since m(yT) = m(y0) + 1 in Step 2, due to the fact that one element has been added to yT, it follows that
Csf = 2m(yT )n at the bottom of the loop-body. Thus, as was true for the loop-invariant, Csf = 2m(yT )n upon
leaving the loop. Since there m(yT ) = m(y), we establish that the total cost of the algorithm is 2mn flops.
Exercise 3.25 Prove that the costs of the algorithms corresponding to Variant 2–4 are also 2mn flops.
3.4 Rank-1 Update

Consider the vectors x ∈ R^n, y ∈ R^m, and the matrix A ∈ R^{m×n} partitioned as in Section 3.1. A second operation that plays a critical role in linear algebra is the rank-1 update (ger), defined as
  A := A + α y x^T .    (3.6)
For simplicity in this section we consider α = 1. In this operation the (i, j) element of A is updated as
αi,j := αi,j + ψi χj , 0 ≤ i < m, 0 ≤ j < n.
The term rank-1 update comes from the fact that the rank of the matrix y x^T is at most one. Indeed,
  y x^T = (χ0 y, χ1 y, . . . , χn−1 y)
clearly shows that all columns of this matrix are multiples of the same vector y, and thus there can be at most
one linearly independent column.
Now, we note that

  A := A + y x^T
     = ( α00 + ψ0 χ0          α01 + ψ0 χ1          · · ·  α0,n−1 + ψ0 χn−1           )
       ( α10 + ψ1 χ0          α11 + ψ1 χ1          · · ·  α1,n−1 + ψ1 χn−1           )
       ( ...                                                                          )
       ( αm−1,0 + ψm−1 χ0     αm−1,1 + ψm−1 χ1     · · ·  αm−1,n−1 + ψm−1 χn−1       )

     = ( a0 + χ0 y,  a1 + χ1 y,  . . . ,  an−1 + χn−1 y )  =  ( ǎ0^T + ψ0 x^T      )
                                                              ( ǎ1^T + ψ1 x^T      )
                                                              ( ...                )
                                                              ( ǎm−1^T + ψm−1 x^T  ),

which shows that, in the computation of A + y x^T, column aj, 0 ≤ j < n, is replaced by aj + χj y while row ǎi^T, 0 ≤ i < m, is replaced by ǎi^T + ψi x^T.
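A minimal NumPy sketch (an illustration only) exercises the three equivalent views of the rank-1 update: by columns, elementwise, and by rows.

    import numpy as np

    def ger(A, y, x):
        # A := A + y x^T, updating one column at a time: a_j := a_j + chi_j y.
        A = A.copy()
        for j in range(A.shape[1]):
            A[:, j] += x[j] * y
        return A

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 3)); y = rng.standard_normal(4); x = rng.standard_normal(3)
    B = ger(A, y, x)
    assert np.allclose(B, A + np.outer(y, x))          # elementwise view: alpha_ij + psi_i chi_j
    assert np.allclose(B[1, :], A[1, :] + y[1] * x)    # row view: row i is replaced by a_i^T + psi_i x^T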
Based on the above observations the next two corollaries give the PMEs that can be used to derive the
algorithms for computing ger.
Corollary 3.27 Partition matrix A and vector x as
  A → ( AL | AR )   and   x → [xT; xB],
with n(AL) = m(xT). Then
  A + y x^T = ( AL | AR ) + y [xT; xB]^T = ( AL + y xT^T | AR + y xB^T ).
Corollary 3.28 Partition matrix A and vector y as
  A → [AT; AB]   and   y → [yT; yB],
with m(AT) = m(yT). Then
  A + y x^T = [AT; AB] + [yT; yB] x^T = [AT + yT x^T; AB + yB x^T].
Remark 3.29 Corollaries 3.27 and 3.28 again pose restrictions on the dimensions of the partitioned matri-
ces/vectors so that an operation is “consistently” defined for these partitioned elements.
We now give a few rules that apply to the partitionings performed on the operands that arise in ger. Consider
two matrices A, B, and an operation which relates them as:
1. A + B or A := B; then, m(A) = m(B), n(A) = n(B) and any partitioning by rows/columns in one of
the two operands must be done conformally (with the same dimensions) in the other operand.
Consider now a matrix A, two vectors x, y and the ger operation
1. A + yxT ; then, m(A) = m(y), n(A) = m(x), and any partitioning by rows/columns in A must be
conformally performed by elements in y/x (and vice versa).
Exercise 3.33 Derive two different algorithms for ger using the partitionings
  A → [AT; AB],   y → [yT; yB].
Exercise 3.34 Prove that all of the previous four algorithms for ger incur 2mn flops.
3.5 Solving Triangular Linear Systems of Equations

Consider the linear system Ax = b. Here, A ∈ R^{m×n} is the coefficient matrix, b ∈ R^m is the right-hand side vector, and x ∈ R^n is the vector of unknowns.
Let us now define the diagonal elements of the matrix A as those elements of the form αi,i, 0 ≤ i < min(m, n). In this section we study a simple case of a linear system which appears when the coefficient matrix is square and has zeros in all its elements above the diagonal; we then say that the coefficient matrix is lower triangular and we prefer to denote it using L instead of A, where L stands for Lower:

  ( λ00        0          ...   0           ) ( χ0    )   ( β0    )
  ( λ10        λ11        ...   0           ) ( χ1    ) = ( β1    )   ≡   Lx = b.
  ( ...        ...              ...         ) ( ...   )   ( ...   )
  ( λn−1,0     λn−1,1     ...   λn−1,n−1    ) ( χn−1  )   ( βn−1  )
Remark 3.35 Lower/upper triangular matrices will be denoted by letters such as L/U for Lower/Upper.
Lower/upper triangular matrices are square.
We next proceed to derive algorithms for computing this operation (hereafter, trsv) by filling out the
worksheet in Figure 2.5. During the presentation one should think of x as the vector that represents the final
solution, which ultimately will overwrite b upon completion of the loop.
Remark 3.36 In order to emphasize that the methodology allows one to derive algorithms for a given linear
algebra operation without an a priori knowledge of a method, we directly proceed with the derivation of an
algorithm for the solution of triangular linear systems, and delay the presentation of a concrete example until
the end of this section.
Step 1: Specifying the precondition and postcondition. The precondition for the algorithm is given by
  Ppre : (b = b̂) ∧ TrLw(L).
Here, the predicate TrLw(L) is true if L is a lower triangular matrix. (A similar predicate, TrUp(U), will play an analogous role for upper triangular matrices.) The postcondition is that
  Ppost : b = x, where Lx = b̂;
in other words, upon completion the contents of b equal those of x, where x is the solution of the lower triangular linear system Lx = b̂. This is indicated in Steps 1a and 1b in Figure 3.6.
Next, let us use L to introduce a new type of partitioning, into quadrants:
  L → [LTL, 0; LBL, LBR],
with x and b partitioned correspondingly into x → [xT; xB] and b → [bT; bB], where "Pcons : n(LTL) = m(xT) = m(bT)" holds. Furthermore, we will require that both LTL and LBR are themselves lower triangular matrices, that is,
  Pstruct : TrLw(LTL) ∧ TrLw(LBR)
holds.
Remark 3.37 We will often use the structural predicate “Pstruct ” to establish conditions on the structure of
the exposed blocks.
Remark 3.38 When dealing with triangular matrices, in order for the exposed diagonal blocks (submatrices) to themselves be triangular, we always partition this type of matrix into quadrants, with square blocks on the diagonal.
Although we employ predicates P_cons and P_struct during the derivation of the algorithm, in order to condense the assertions for this algorithm, we do not include these two predicates as part of the invariant in the presentation of the corresponding worksheet.
Variant 1:
$$ \left( \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) = \left( \begin{array}{c} x_T \\ \hline \hat b_B \end{array} \right) \right) \wedge (L_{TL} x_T = \hat b_T) \wedge P_{cons} \wedge P_{struct} $$
Variant 2:
$$ \left( \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) = \left( \begin{array}{c} x_T \\ \hline \hat b_B - L_{BL} x_T \end{array} \right) \right) \wedge (L_{TL} x_T = \hat b_T) \wedge P_{cons} \wedge P_{struct} $$
This shows that xT can be computed from the first equality (the one at the top), after which b̂B must be
updated by subtracting LBL xT from it, before xB can be computed using the second equality. This constraint
on the order in which subresults must be computed yields the two loop-invariants in Figure 3.7.
Step 3: Choosing a Loop-guard. For either of the two loop-invariants, the loop-guard “G : m(bT ) < m(b)”
has the property that (Pinv ∧ ¬G) ⇒ Ppost .
Step 4: Determining the initialization. The initialization, where L_{TL} is 0 × 0 and x_T, b_T have 0 elements, has the property that it sets the variables in a state where the loop-invariant holds.
Step 5: Progressing through the operands. For either of the two loop-invariants, the repartitioning shown in Step 5a in Figure 3.6³, followed by moving the thick lines as in Step 5b in the same figure, denotes that progress is made through the operands so that the loop eventually terminates. It also ensures that P_cons and P_struct hold.
Only now does the derivation become dependent on the loop-invariant that we choose. Let us choose
Invariant 2, which will produce the algorithm identified as Variant 2 for this operation.
Step 6: Determining the state after repartitioning. Invariant 2 and the repartitioning of the partitioned
matrix and vectors imply that
$$ \left( \begin{array}{c} b_0 \\ \hline \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \left( \begin{array}{c} \hat\beta_1 \\ \hat b_2 \end{array} \right) - \left( \begin{array}{c} l_{10}^T \\ L_{20} \end{array} \right) x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0) \;\equiv\; \left( \begin{array}{c} b_0 \\ \hline \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \hat\beta_1 - l_{10}^T x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0), $$
which is entered in Step 6 as in Figure 3.6.
Step 7: Determining the state after moving the thick lines. In Step 5b, Invariant 2 and the moving of the
thick lines imply that
$$ \left( \begin{array}{c} b_0 \\ \beta_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} \left( \begin{array}{c} x_0 \\ \chi_1 \end{array} \right) \\ \hline \hat b_2 - \left( \begin{array}{cc} L_{20} & l_{21} \end{array} \right) \left( \begin{array}{c} x_0 \\ \chi_1 \end{array} \right) \end{array} \right) \ \wedge\ \left( \left( \begin{array}{cc} L_{00} & 0 \\ l_{10}^T & \lambda_{11} \end{array} \right) \left( \begin{array}{c} x_0 \\ \chi_1 \end{array} \right) = \left( \begin{array}{c} \hat b_0 \\ \hat\beta_1 \end{array} \right) \right) $$
$$ \equiv\; \left( \begin{array}{c} b_0 \\ \beta_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \chi_1 \\ \hline \hat b_2 - L_{20} x_0 - \chi_1 l_{21} \end{array} \right) \ \wedge\ \left( \begin{array}{c} L_{00} x_0 = \hat b_0 \\ l_{10}^T x_0 + \lambda_{11} \chi_1 = \hat\beta_1 \end{array} \right), $$
which is entered in the corresponding step as in Figure 3.6.
Step 8. Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the contents
of b must be updated from
$$ \left( \begin{array}{c} b_0 \\ \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hat\beta_1 - l_{10}^T x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \quad\mbox{to}\quad \left( \begin{array}{c} b_0 \\ \beta_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \chi_1 \\ \hat b_2 - L_{20} x_0 - \chi_1 l_{21} \end{array} \right), $$
³In the repartitioning of L the superscript “T” denotes that l_{10}^T is a row vector, as corresponds to λ_{11} being a scalar.
where
χ_1 := β_1/λ_{11} and
b_2 := b_2 − χ_1 l_{21}.
Final algorithm. By deleting the temporary variable x, which is only used for the purpose of proving the
algorithm correct while it is constructed, we arrive at the algorithm in Figure 3.8. In Section 4.2, we discuss
an API for representing algorithms in Matlab M-script code, FLAME@lab. The FLAME@lab code for the
algorithm in Figure 3.8 is given in Figure 3.9.
Example 3.39 Let us now illustrate how this algorithm proceeds. Consider a triangular linear system defined
by
$$ L = \left( \begin{array}{cccc} 2 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 2 & 1 & 2 & 0 \\ 0 & 2 & 1 & 3 \end{array} \right), \qquad b = \left( \begin{array}{c} 2 \\ 3 \\ 10 \\ 19 \end{array} \right). $$
From a little manipulation we can see that the solution to this system is given by
χ0 := ( 2 )/2 = 1,
χ1 := ( 3−1·1 )/1 = 2,
χ2 := ( 10 − 2 · 1 − 1 · 2 )/2 = 3,
χ3 := ( 19 − 0 · 1 − 2 · 2 − 1 · 3 )/3 = 4.
In Figure 3.10 we show the initial contents of each quadrant (iteration labeled as 0) as well as the contents
as computation proceeds from the first to the fourth (and final) iteration. In the figure, faces of normal size
indicate data and operations/results that have already been performed/computed, while the small faces indicate
operations that have yet to be performed.
The way the solver classified as Variant 2 works corresponds to what is called an “eager” algorithm, in the sense that once an unknown is computed, it is immediately “eliminated” from the remaining equations.
Sometimes this algorithm is also classified as the “column-oriented” algorithm of forward substitution as, at
each iteration, it utilizes a column of L in the update of the remaining independent terms by using a saxpy
operation. It is sometimes called forward substitution for reasons that will become clear in Chapter 6.
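To make the eager variant concrete, the following plain-C fragment (a sketch of our own, not part of any library; the matrix is stored as a simple two-dimensional array for readability) applies exactly this column-oriented update to the system of Example 3.39.

#include <stdio.h>

int main( void )
{
  /* The lower triangular system of Example 3.39. */
  double L[4][4] = { { 2, 0, 0, 0 },
                     { 1, 1, 0, 0 },
                     { 2, 1, 2, 0 },
                     { 0, 2, 1, 3 } };
  double b[4] = { 2, 3, 10, 19 };
  int n = 4, i, j;

  /* Eager (column-oriented) forward substitution: once chi_j is known,
     it is immediately eliminated from the remaining equations via an
     axpy with the subdiagonal part of column j of L.                   */
  for ( j = 0; j < n; j++ ){
    b[ j ] = b[ j ] / L[ j ][ j ];            /* beta_1 := beta_1 / lambda_11 */
    for ( i = j + 1; i < n; i++ )
      b[ i ] = b[ i ] - L[ i ][ j ] * b[ j ]; /* b_2 := b_2 - beta_1 l_21     */
  }

  printf( "x = ( %g, %g, %g, %g )\n", b[0], b[1], b[2], b[3] );  /* 1, 2, 3, 4 */
  return 0;
}

Running it overwrites b with (1, 2, 3, 4), the solution computed by hand above.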
Exercise 3.40 Prove that the cost of the triangular linear system solver formulated in Figure 3.8 is n² + n ≈ n² flops. Hint: Use $C_{sf} = m(x_0) + \sum_{k=0}^{m(x_0)-1} 2(n - k - 1)$ flops.
β_1 := β_1/λ_{11}
b_2 := b_2 − β_1 l_{21} (axpy)
Continue with
( L_{TL} 0 ; L_{BL} L_{BR} ) ← ( L_{00} 0 0 ; l_{10}^T λ_{11} 0 ; L_{20} l_{21} L_{22} ),  ( b_T ; b_B ) ← ( b_0 ; β_1 ; b_2 )
endwhile
Remark 3.41 When dealing with cost expressions we will generally neglect lower order terms.
Exercise 3.42 Derive an algorithm for solving Lx = b by choosing Invariant 1 in Figure 3.7. The solution to this exercise corresponds to an algorithm that is “lazy” (for each equation, it does not eliminate previous unknowns until it becomes necessary) or row-oriented (accesses to L are by rows, in the form of apdots).
Exercise 3.43 Prove that the cost of the triangular linear system solver for the lazy algorithm obtained as the
solution to Exercise 3.42 is n2 flops.
Exercise 3.44 Derive algorithms for the solution of the following triangular linear systems:
1. U x = b.
2. LT x = b.
3. U T x = b.
Figure 3.9: FLAME@lab code for solving Lx = b, overwriting b with x (unblocked Variant 2).
3.6 Blocked Algorithms

Key objectives when designing and implementing linear algebra libraries are modularity and performance. In
this section we show how both can be accommodated by casting algorithms in terms of gemv. The idea is to
derive so-called blocked algorithms which differ from the algorithms derived so far in that they move the thick
lines more than one element, row, and/or column at a time. We illustrate this technique by revisiting the trsv
operation.
[Figure 3.10 shows, for #Iter. = 0, 1, 2, 3, 4, the partitioning ( L_{TL} 0 ; L_{BL} L_{BR} ) of L and the corresponding contents of b, ( L_{TL}^{-1} b_T ; b_B − L_{BL} ( L_{TL}^{-1} b_T ) ) = ( x_T ; b_B − L_{BL} x_T ), for the system of Example 3.39: b successively contains (2, 3, 10, 19), (1, 2, 8, 19), (1, 2, 6, 15), (1, 2, 3, 12), and (1, 2, 3, 4).]
Figure 3.10: Example of the computation of (b := x) ∧ (Lx = b) (Variant 2). Computations yet to be performed
are in tiny font.
Remark 3.45 The derivation of blocked algorithms is identical to that of unblocked algorithms up to and including Step 4.
Let us choose Invariant 2, which will produce now the worksheet of a blocked algorithm identified as Variant
2 in Figure 3.11.
Step 5: Progressing through the operands. We now choose to move through vectors x and b by nb elements
per iteration. Here nb is the (ideal) block size of the algorithm. In other words, at each iteration of the loop,
nb elements are taken from xB and bB and moved to xT , bT , respectively. For consistency then, a block of
dimension n_b × n_b must also be moved from L_{BR} to L_{TL}. We can proceed in this manner by first repartitioning
$$ \left( \begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array} \right) \rightarrow \left( \begin{array}{c|cc} L_{00} & 0 & 0 \\ \hline L_{10} & L_{11} & 0 \\ L_{20} & L_{21} & L_{22} \end{array} \right), \quad \left( \begin{array}{c} x_T \\ \hline x_B \end{array} \right) \rightarrow \left( \begin{array}{c} x_0 \\ \hline x_1 \\ x_2 \end{array} \right), \quad \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) \rightarrow \left( \begin{array}{c} b_0 \\ \hline b_1 \\ b_2 \end{array} \right), $$
where L_{11} is a block of dimension n_b × n_b, and x_1, b_1 have n_b elements each. These blocks/elements are then moved to the corresponding parts of the matrix/vectors as indicated by
$$ \left( \begin{array}{c|c} L_{TL} & 0 \\ \hline L_{BL} & L_{BR} \end{array} \right) \leftarrow \left( \begin{array}{cc|c} L_{00} & 0 & 0 \\ L_{10} & L_{11} & 0 \\ \hline L_{20} & L_{21} & L_{22} \end{array} \right), \quad \left( \begin{array}{c} x_T \\ \hline x_B \end{array} \right) \leftarrow \left( \begin{array}{c} x_0 \\ x_1 \\ \hline x_2 \end{array} \right), \quad \left( \begin{array}{c} b_T \\ \hline b_B \end{array} \right) \leftarrow \left( \begin{array}{c} b_0 \\ b_1 \\ \hline b_2 \end{array} \right). $$
This movement ensures that the loop eventually terminates and that both Pcons and Pstruct hold.
Remark 3.46 In practice, the block size is adjusted at each iteration as the minimum between the algorithmic
(or optimal) block size and the number of remaining elements.
Step 6: Determining the state after repartitioning. Invariant 2 and the definition of the repartitioning for
the blocked algorithm imply that
$$ \left( \begin{array}{c} b_0 \\ \hline b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \left( \begin{array}{c} \hat b_1 \\ \hat b_2 \end{array} \right) - \left( \begin{array}{c} L_{10} \\ L_{20} \end{array} \right) x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0), $$
or
$$ \left( \begin{array}{c} b_0 \\ \hline b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hline \hat b_1 - L_{10} x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \ \wedge\ (L_{00} x_0 = \hat b_0). $$
Step 7: Determining the state after moving the thick lines. In Step 5b the Invariant 2 implies that
$$ \left( \begin{array}{c} b_0 \\ b_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} \left( \begin{array}{c} x_0 \\ x_1 \end{array} \right) \\ \hline \hat b_2 - \left( \begin{array}{cc} L_{20} & L_{21} \end{array} \right) \left( \begin{array}{c} x_0 \\ x_1 \end{array} \right) \end{array} \right) \ \wedge\ \left( \left( \begin{array}{cc} L_{00} & 0 \\ L_{10} & L_{11} \end{array} \right) \left( \begin{array}{c} x_0 \\ x_1 \end{array} \right) = \left( \begin{array}{c} \hat b_0 \\ \hat b_1 \end{array} \right) \right), $$
or
$$ \left( \begin{array}{c} b_0 \\ b_1 \\ \hline b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ x_1 \\ \hline \hat b_2 - L_{20} x_0 - L_{21} x_1 \end{array} \right) \ \wedge\ \left( \begin{array}{c} L_{00} x_0 = \hat b_0 \\ L_{10} x_0 + L_{11} x_1 = \hat b_1 \end{array} \right), $$
which is entered in the corresponding step as in Figure 3.11.
Step 8. Determining the update. Comparing the contents in Step 6 and Step 7 now tells us that the contents
of b must be updated from
$$ \left( \begin{array}{c} b_0 \\ b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ \hat b_1 - L_{10} x_0 \\ \hat b_2 - L_{20} x_0 \end{array} \right) \quad\mbox{to}\quad \left( \begin{array}{c} b_0 \\ b_1 \\ b_2 \end{array} \right) = \left( \begin{array}{c} x_0 \\ x_1 \\ \hat b_2 - L_{20} x_0 - L_{21} x_1 \end{array} \right). $$
From the last equation we find that L_{11} x_1 = (\hat b_1 − L_{10} x_0). Since b_1 already contains (\hat b_1 − L_{10} x_0), we conclude that in the update we first need to solve the triangular linear system
L_{11} x_1 = b_1, (3.7)
overwriting b_1 with the solution, after which b_2 must be updated as b_2 := b_2 − L_{21} b_1 (a gemv operation).
Final algorithm. By deleting the assertions and the temporary variable x, we obtain the blocked algorithm in
Figure 3.12. If nb is set to 1 in this algorithm, then it performs exactly the same operations and in the same
order as the corresponding unblocked algorithm.
b_1 := trsv(L_{11}, b_1)
b_2 := b_2 − L_{21} b_1 (gemv)
Continue with
( L_{TL} 0 ; L_{BL} L_{BR} ) ← ( L_{00} 0 0 ; L_{10} L_{11} 0 ; L_{20} L_{21} L_{22} ),  ( b_T ; b_B ) ← ( b_0 ; b_1 ; b_2 )
endwhile
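For readers who prefer to see the blocked algorithm outside of the FLAME notation, here is a minimal plain-C sketch of the algorithm in Figure 3.12; the function name and calling sequence are our own, and we assume L is stored in column-major order with leading dimension ldl.

/* Blocked forward substitution (cf. Figure 3.12): L is n x n lower
   triangular, column-major with leading dimension ldl; b is overwritten
   with the solution of L x = b.  nb_alg is the algorithmic block size.   */
void trsv_lower_blk_var2( int n, int nb_alg, const double *L, int ldl, double *b )
{
  int i, j, k;
  for ( k = 0; k < n; k += nb_alg ){
    int nb = ( n - k < nb_alg ? n - k : nb_alg );

    /* b1 := inv( L11 ) b1: unblocked solve with the nb x nb diagonal block. */
    for ( j = k; j < k + nb; j++ ){
      b[ j ] = b[ j ] / L[ j + j * ldl ];
      for ( i = j + 1; i < k + nb; i++ )
        b[ i ] = b[ i ] - L[ i + j * ldl ] * b[ j ];
    }

    /* b2 := b2 - L21 b1 (gemv with the block below the diagonal block).    */
    for ( j = k; j < k + nb; j++ )
      for ( i = k + nb; i < n; i++ )
        b[ i ] = b[ i ] - L[ i + j * ldl ] * b[ j ];
  }
}

Setting nb_alg to 1 recovers exactly the operations of the unblocked Variant 2, as observed above.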
For the blocked algorithm for trsv in Figure 3.12, we consider that n = ν n_b with n_b ≪ n. The algorithm thus iterates ν times, with a triangular linear system of fixed order n_b (L_{11}^{-1} b_1) being solved and a gemv operation of decreasing size (b_2 := b_2 − L_{21} b_1) being performed at each iteration. As a matter of fact, the row dimension of the matrix involved in the gemv operation decreases by n_b rows per iteration so that, at iteration k, L_{21} is of dimension (ν − k − 1)n_b × n_b. Thus, the cost of solving the triangular linear system using the blocked algorithm is approximately
$$ \sum_{k=0}^{\nu-1} \left( n_b^2 + 2(\nu - k - 1)\, n_b^2 \right) \approx 2 n_b^2 \, \frac{\nu^2}{2} = n^2 \mbox{ flops}. $$
The cost of the blocked variant of trsv is equal to that of the unblocked version. This is true for most of the blocked algorithms that we will derive in this book. Nevertheless, be aware that there exists a class of blocked algorithms, related to the computation of orthogonal factorizations, which do not satisfy this property.
Exercise 3.47 Show that all unblocked and blocked algorithms for computing the solution to Lx = b have exactly the same operation count by performing an exact analysis of the operation count.
1 #include "FLAME.h"
2
3 int Trsv_lower_blk_var2( FLA_Obj L, FLA_Obj b, int nb_alg )
4 {
5 FLA_Obj LTL, LTR, L00, L01, L02,
6 LBL, LBR, L10, L11, L12,
7 L20, L21, L22;
8 FLA_Obj bT, b0,
9 bB, b1,
10 b2;
11        int nb;
12
13        FLA_Part_2x2( L, &LTL, &LTR,
14 &LBL, &LBR, 0, 0, FLA_TL );
15 FLA_Part_2x1( b, &bT,
16 &bB, 0, FLA_TOP );
17
18 while ( FLA_Obj_length( LTL ) < FLA_Obj_length( L ) ){
19          nb = min( FLA_Obj_length( LBR ), nb_alg );
20
21 FLA_Repart_2x2_to_3x3( LTL, /**/ LTR, &L00, /**/ &L01, &L02,
22 /* ************* */ /* ******************** */
23 &L10, /**/ &L11, &L12,
24 LBL, /**/ LBR, &L20, /**/ &L21, &L22,
25                                 nb, nb, FLA_BR );
26 FLA_Repart_2x1_to_3x1( bT, &b0,
27 /* ** */ /* ** */
28 &b1,
29                                 bB, &b2, nb, FLA_BOTTOM );
30 /*------------------------------------------------------------*/
31 Trsv_lower_unb_var2( L11, b1 ); /* b1 := inv( L11 ) * b1 */
32 FLA_Gemv( FLA_NO_TRANSPOSE, /* b2 := b2 - L21 * b1 */
33 ONE, L21, b1, ONE, b2 )
34 /*------------------------------------------------------------*/
35          FLA_Cont_with_3x3_to_2x2( &LTL, /**/ &LTR,  L00, L01, /**/ L02,
36 L10, L11, /**/ L12,
37 /* ************** */ /* ****************** */
38 &LBL, /**/ &LBR, L20, L21, /**/ L22,
39 FLA_TL );
40 FLA_Cont_with_3x1_to_2x1( &bT, b0,
41 b1,
42 /* ** */ /* ** */
43 &bB, b2, FLA_TOP );
44 }
45 return FLA_SUCCESS;
46 }
Figure 3.13: FLAME/C code for solving Lx = b, overwriting b with x (Blocked Variant 2).
3.7 Summary
Let us recap the highlights of this chapter.
• Most computations in unblocked algorithms for matrix-vector operations are expressed as axpys and
apdots.
• Operations involving matrices typically yield more algorithmic variants than those involving only vectors
due to the fact that matrices can be traversed in multiple directions.
• Blocked algorithms for matrix-vector operations can typically be cast in terms of gemv and/or ger.
High-performance can be achieved in a modular manner by optimizing these two operations, and casting
other operations in terms of them.
• The derivation of blocked algorithms is no more complex than that of unblocked algorithms.
• Algorithms for all matrix-vector operations that are discussed can be derived using the methodology
presented in Chapter 2.
• Again, we note that the derivation of loop-invariants is systematic, and that the algorithm is prescribed once a loop-invariant is chosen, although now a remaining choice is whether to derive an unblocked algorithm or a blocked one.
Figure 3.14: Basic operations combining matrices and vectors. Cost is approximate.
Chapter 4
The FLAME Application Programming
Interfaces
In this chapter we present two Application Programming Interfaces (APIs) for coding linear algebra algorithms.
While these APIs are almost trivial extensions of the M-script language and the C programming language, they
greatly simplify the task of typesetting, programming, and maintaining families of algorithms for a broad spec-
trum of linear algebra operations. In combination with the FLAME methodology for deriving algorithms, these
APIs facilitate the rapid derivation, verification, documentation, and implementation of a family of algorithms
for a single linear algebra operation. Since the algorithms are expressed in code much like they are explained
in a classroom setting, the APIs become not just a tool for implementing libraries, but also a valuable resource
for teaching the algorithms that are incorporated in the libraries.
y_1 := y_1 + A_1 x
Continue with
( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( y_T ; y_B ) ← ( y_0 ; y_1 ; y_2 )
endwhile

y := y + A_1 x_1
Continue with
( A_L | A_R ) ← ( A_0 | A_1 | A_2 ),  ( x_T ; x_B ) ← ( x_0 ; x_1 ; x_2 )
endwhile
We start with the M-script function that partitions a matrix (or vector) into two submatrices:
[ AT,...
AB ] = FLA_Part_2x1( A, mb, side )
Purpose: Partition matrix A into a top and a bottom side where the side indicated by side has mb rows.
A – matrix to be partitioned
mb – row dimension of side indicated by side
side – side for which row dimension is given
AT, AB – matrices for Top and Bottom parts
Here side can take on the values (character strings) ’FLA_TOP’ or ’FLA_BOTTOM’ to indicate that mb is the row dimension of the Top matrix AT, or the Bottom matrix AB, respectively. The routine can also be used to partition a (column) vector.
As an example of the use of this routine, the translation of the algorithm fragment from Figure 4.1 on the
left results in the code on the right:
Partition A → ( A_T ; A_B ) where A_T has 0 rows

[ AT,...
  AB ] = FLA_Part_2x1( A,...
                       0, 'FLA_TOP' )
Remark 4.2 The above example stresses the fact that the formatting of the code can be used to help represent
the algorithm in code. Clearly, some of the benefit of the API would be lost if in the example the code appeared
as
[ AT, AB ] = FLA_Part_2x1( A, 0, ’FLA_TOP’ )
For some of the subsequent calls this becomes even more dramatic.
Also from Figure 4.1, we notice that it is necessary to be able to take a 2 × 1 partitioning of a given matrix
A (or vector y) and repartition that into a 3 × 1 partitioning so that the submatrices that need to be updated
and/or used for computation can be identified. To support this, we introduce the M-script function
[ A0,...
A1,...
A2 ] = FLA_Repart_2x1_to_3x1( AT,...
AB, mb, side )
Purpose: Repartition a 2 × 1 partitioning of a matrix into a 3 × 1 partitioning where submatrix A1 with mb rows is split from the side indicated by side.
AT, AB – matrices for Top and Bottom parts
mb – row dimension of A1
side – side from which A1 is partitioned
A0, A1, A2 – matrices for A0 , A1 , A2
Here side can take on the values ’FLA_TOP’ or ’FLA_BOTTOM’ to indicate that submatrix A1, with mb rows, is partitioned from AT or AB, respectively.
Thus, for example, the translation of the algorithm fragment from Figure 4.1 on the left results in the code
on the right:
Repartition
( A_T ; A_B ) → ( A_0 ; A_1 ; A_2 ) where A_1 has m_b rows

[ A0,...
  A1,...
  A2 ] = FLA_Repart_2x1_to_3x1( AT,...
                                AB, mb, 'FLA_BOTTOM' )

where parameter mb has the value m_b.
Remark 4.3 Similarly to what is expressed in Remark 4.1, the invocation of the M-script function
[ A0,...
  A1,...
  A2 ] = FLA_Repart_2x1_to_3x1( AT,...
                                AB, mb, side )
creates three new matrices and any modification of the contents of A0, A1, A2 does not affect the original matrix A nor the two submatrices AT, AB. Readability is greatly reduced if it were typeset like
[ A0, A1, A2 ] = FLA_Repart_2x1_to_3x1( AT, AB, mb, side )
Remark 4.4 Choosing variable names can further relate the code to the algorithm, as is illustrated by comparing
( A_0 ; A_1 ; A_2 ) and A0, A1, A2; and ( y_0 ; ψ_1 ; y_2 ) and y0, psi1, y2.
Once the contents of the so-identified submatrices have been updated, AT and AB must be updated to reflect
that progress is being made, in terms of the regions indicated by the thick lines. This movement of the thick
lines is accomplished by a call to the M-script function
[ AT,...
AB ] = FLA_Cont_with_3x1_to_2x1( A0,...
A1,...
A2, side )
Purpose: Update the 2 × 1 partitioning of a matrix by moving the boundaries so that A1 is joined to the side
indicated by side.
A0, A1, A2 – matrices for A0 , A1 , A2
side – side to which A1 is joined
AT, AB – matrices for Top and Bottom parts
For example, the algorithm fragment from Figure 4.1 on the left results in the code on the right:
Continue with
( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 )

[ AT,...
  AB ] = FLA_Cont_with_3x1_to_2x1( A0,...
                                   A1,...
                                   A2, 'FLA_TOP' )
The translation of the algorithm in Figure 4.1 to M-script code is now given in Figure 4.3. In the implementation, the parameter mb_alg holds the algorithmic block size (the number of elements of y that will be computed at each iteration). As, in general, m(y)(= m(A)) will not be a multiple of this block size, at each iteration mb elements are computed, with mb determined as min(m(y_B), mb_alg)(= min(m(A_B), mb_alg)). Also, we use there a different variable for the input vector, y, and the output vector (result), y_out. The reason for this is that it will allow the FLAME@lab code to be more easily translated to FLAME/C, the C API.
In M-script, size( A, 1 ) and size( A, 2 ) return the row and column dimension of array A, respectively.
Placing a “;” at the end of a statement suppresses the printing of the value computed in the statement. The
final statement
y_out = [ yT
yB ];
sets the output variable y_out to the vector that results from concatenating yT and yB.
Exercise 4.5 Visit $BASE/Chapter4/ and follow the directions to reproduce and execute the code in Fig. 4.1.
Figure 4.3: M-script code for the blocked algorithm for computing y := Ax + y (Variant 1).
Figure 4.4: M-script code for the blocked algorithm for computing y := Ax + y (Variant 3).
[ ATL, ATR,...
ABL, ABR ] = FLA_Cont_with_3x3_to_2x2( A00, A01, A02,...
A10, A11, A12,...
A20, A21, A22, quadrant )
Purpose: Update the 2 × 2 partitioning of a matrix by moving the boundaries so that A11 is joined to the
quadrant indicated by quadrant.
A00-A22 – matrices for A00 –A22
quadrant – quadrant to which A11 is to be joined
ATL, ATR, ABL, ABR – matrices for TL, TR, BL, and BR quadrants
Remark 4.7 The routines described in this section for the Matlab M-script language suffice to implement
a broad range of algorithms encountered in dense linear algebra.
Exercise 4.8 Visit $BASE/Chapter4/ and follow the directions to download FLAME@lab and to code and
execute the algorithm in Figure 3.8 for solving the lower triangular linear system Lx = b.
Subprograms (BLAS) [21, 10, 9]. In this section we introduce a set of library routines that allow us to capture
linear algebra algorithms presented in the format used in FLAME in C code.
Readers familiar with MPI [23], PETSc [1], or PLAPACK [28] will recognize the programming style, object-
based programming, as being very similar to that used by those (and other) interfaces. It is this style of
programming that allows us to hide the indexing details much like FLAME@lab does. We will see that a more
substantial infrastructure must be provided in addition to the routines that partition and repartition matrix
objects.
FLA_INT, FLA_DOUBLE, FLA_FLOAT, FLA_DOUBLE_COMPLEX, FLA_COMPLEX
for the obvious datatypes that are commonly encountered. The leading dimension of the array that is used to
store the matrix in column-major order is itself determined inside of this call.
Remark 4.9 For simplicity, we chose to limit the storage of matrices to column-major storage. The leading
dimension of a matrix can be thought of as the dimension of the array in which the matrix is embedded (which
is often larger than the row-dimension of the matrix) or as the increment (in elements) required to address
consecutive elements in a row of the matrix. Column-major storage is chosen to be consistent with Fortran,
which is often still the choice of language for linear algebra applications.
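As a small illustration of column-major storage and the role of the leading dimension, the following fragment (our own example, not a FLAME/C routine) addresses the (i, j) element of a matrix embedded in an array buff exactly as the BUFFER macro of the fill_matrix routine shown below does.

#define ENTRY( buff, lda, i, j ) ( (buff)[ (j)*(lda) + (i) ] )

/* Sum the m entries of column j of a matrix stored in column-major order
   in array buff with leading dimension lda; the elements of a column are
   contiguous in memory.                                                   */
double column_sum( const double *buff, int lda, int m, int j )
{
  double sum = 0.0;
  int i;
  for ( i = 0; i < m; i++ )
    sum += ENTRY( buff, lda, i, j );
  return sum;
}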
FLAME/C treats vectors as special cases of matrices: an n × 1 matrix or a 1 × n matrix. Thus, to create
an object for a vector x of n double-precision real numbers either of the following calls suffices:
FLA_Obj_create( FLA_DOUBLE, n, 1, &x );    or    FLA_Obj_create( FLA_DOUBLE, 1, n, &x );
Here n is an integer variable with value n and x is an object of type FLA_Obj.
Similarly, FLAME/C treats a scalar as a 1 × 1 matrix. Thus, to create an object for a scalar α the following call is made:
FLA_Obj_create( FLA_DOUBLE, 1, 1, &alpha );
where alpha is an object of type FLA_Obj. A number of scalars occur frequently and are therefore predefined
by FLAME/C: FLA_MINUS_ONE, FLA_ZERO, and FLA_ONE.
If an object is created with FLA_Obj_create, a call to FLA_Obj_free is required to ensure that all space
associated with the object is properly released:
FLA_Error FLA_Obj_free( FLA_Obj *matrix )
Purpose: Free all space allocated to store data associated with matrix.
matrix – descriptor for the object
1 #include "FLAME.h"
2
3 #define BUFFER( i, j ) buff[ (j)*lda + (i) ]
4
5 void fill_matrix( FLA_Obj A )
6 {
7 FLA_Datatype
8 datatype;
9 int
10 m, n, lda;
11
12 datatype = FLA_Obj_datatype( A );
13 m = FLA_Obj_length( A );
14 n = FLA_Obj_width ( A );
15 lda = FLA_Obj_ldim ( A );
16
17 if ( datatype == FLA_DOUBLE ){
18 double *buff;
19 int i, j;
20
21 buff = ( double * ) FLA_Obj_buffer( A );
22
23 for ( j=0; j<n; j++ )
24 for ( i=0; i<m; i++ )
25 BUFFER( i, j ) = i+j*0.01;
26 }
27 else
28 FLA_Check_error_code( FLA_NOT_YET_IMPLEMENTED );
29 }
• line 1: FLAME/C program files start by including the FLAME.h header file.
• lines 5–6: FLAME/C objects A, x, and y, which hold matrix A and vectors x and y, respectively, are declared to be of type FLA_Obj.
• line 10: Before any calls to FLAME/C routines can be made, the environment must be initialized by a call to FLA_Init.
1 #include "FLAME.h"
2
3 void main()
4 {
5 FLA_Obj
6 A, x, y;
7 int
8 m, n;
9
10 FLA_Init( );
11
12 printf( "enter matrix dimensions m and n:" );
13 scanf( "%d%d", &m, &n );
14
15 FLA_Obj_create( FLA_DOUBLE, m, n, &A );
16 FLA_Obj_create( FLA_DOUBLE, n, 1, &x );
17 FLA_Obj_create( FLA_DOUBLE, m, 1, &y );
18
19 fill_matrix( A );
20 fill_matrix( x );
21 fill_matrix( y );
22
23 FLA_Obj_show( "y = [", y, "%lf", "]" );
24
25 matvec_blk_var1( A, x, y );
26
27 FLA_Obj_show( "A = [", A, "%lf", "]" );
28 FLA_Obj_show( "x = [", x, "%lf", "]" );
29 FLA_Obj_show( "y = [", y, "%lf", "]" );
30
31 FLA_Obj_free( &A );
32 FLA_Obj_free( &y );
33 FLA_Obj_free( &x );
34
35 FLA_Finalize( );
36 }
• lines 12–13: In our example, the user inputs the row and column dimension of matrix A.
• lines 15–17: Descriptors are created for A, x, and y.
• lines 19–21: The routine in Figure 4.5 is used to fill A, x, and y with values.
• line 25: Compute y := y + Ax using the routine for performing that operation (to be discussed later).
• lines 23, 27–29: Print out the contents of A, x, and (both the initial and final) y.
• lines 31–33: Free the objects.
• line 35: Finalize FLAME/C.
Exercise 4.10 Visit $BASE/Chapter4/ and follow the directions on how to download the libFLAME library.
Then compile and execute the sample driver as directed.
Once the contents of the so-identified submatrices have been updated, the contents of AT and AB must be updated to reflect that progress is being made, in terms of the regions indicated by the thick lines. This movement of the thick lines is accomplished by a call to the C routine FLA_Cont_with_3x1_to_2x1.
Thus, the algorithm fragment from Figure 4.1 on the left results in the code on the right:
Continue with
( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 )

FLA_Cont_with_3x1_to_2x1( &AT, A0,
                               A1,
                        /* ** */ /* ** */
                          &AB, A2, FLA_TOP );
Using the three routines for horizontal partitioning, the algorithm in Figure 4.1 is translated into the C code
in Figure 4.7.
Figure 4.2 illustrates that, when stating a linear algebra algorithm, one may wish to proceed by columns.
Therefore, we introduce the following pair of C routines for partitioning and repartitioning a matrix (or vector)
vertically:
FLA_Error FLA_Part_1x2( FLA_Obj A, FLA_Obj *AL, FLA_Obj *AR, int nb, FLA_Side side )
Purpose: Partition matrix A into a left and right side where the side indicated by side has nb columns.
A – matrix to be partitioned
nb – column dimension of side indicated by side
side – side for which column dimension is given
AL, AR – views of Left and Right parts
and
1 #include "FLAME.h"
2
3 void MATVEC_BLK_VAR1( FLA_Obj A, FLA_Obj x, FLA_Obj y, int mb_alg )
4 {
5 FLA_Obj AT, A0, yT, y0,
6 AB, A1, yB, y1,
7 A2, y2;
8
9 int mb;
10
11 FLA_Part_2x1( A, &AT,
12 &AB, 0, FLA_TOP );
13
14 FLA_Part_2x1( y, &yT,
15 &yB, 0, FLA_TOP );
16
17 while ( FLA_Obj_length( yT ) < FLA_Obj_length( y ) ){
18 mb = min( FLA_Obj_length( AB ), mb_alg );
19
20 FLA_Repart_2x1_to_3x1( AT, &A0,
21 /* ** */ /* ** */
22 &A1,
23 AB, &A2, mb, FLA_BOTTOM );
24
25 FLA_Repart_2x1_to_3x1( yT, &y0,
26 /* ** */ /* ** */
27 &y1,
28 yB, &y2, mb, FLA_BOTTOM );
29 /*------------------------------------------------------------*/
30 MATVEC_VAR1( A1, x, y1 );
31 /*------------------------------------------------------------*/
32 FLA_Cont_with_3x1_to_2x1( &AT, A0,
33 A1,
34 /* ** */ /* ** */
35 &AB, A2, FLA_TOP );
36
37 FLA_Cont_with_3x1_to_2x1( &yT, y0,
38 y1,
39 /* ** */ /* ** */
40 &yB, y2, FLA_TOP );
41 }
42 }
1 #include "FLAME.h"
2
3 void MATVEC_BLK_VAR3( FLA_Obj A, FLA_Obj x, FLA_Obj y, int nb_alg )
4 {
5 FLA_Obj AL, AR, A0, A1, A2;
6
7 FLA_Obj xT, x0,
8 xB, x1,
9 x2;
10
11 int nb;
12
13 FLA_Part_1x2( A, &AL, &AR,
14 0, FLA_LEFT );
15
16 FLA_Part_2x1( x, &xT,
17 &xB,
18 0, FLA_TOP );
19
20 while ( FLA_Obj_length( xT ) < FLA_Obj_length( x ) ){
21          nb = min( FLA_Obj_width( AR ), nb_alg );
22
23 FLA_Repart_1x2_to_1x3( AL, /**/ AR, &A0, /**/ &A1, &A2,
24 nb, FLA_RIGHT );
25
26 FLA_Repart_2x1_to_3x1( xT, &x0,
27 /* ** */ /* ** */
28 &x1,
29 xB, &x2,
30 nb, FLA_BOTTOM );
31 /*------------------------------------------------------------*/
32 MATVEC_VAR2( A1, x1, y );
33 /*------------------------------------------------------------*/
34 FLA_Cont_with_1x3_to_1x2( &AL, /**/ &AR, A0, A1, /**/ A2,
35 FLA_LEFT );
36
37 FLA_Cont_with_3x1_to_2x1( &xT, x0,
38 x1,
39 /* ** */ /* ** */
40 &xB, x2,
41 FLA_TOP );
42 }
43 }
Here quadrant can take on the values FLA_TL, FLA_TR, FLA_BL, and FLA_BR (defined in FLAME.h) to indicate that mb and nb specify the dimensions of the Top-Left, Top-Right, Bottom-Left, or Bottom-Right quadrant, respectively.
Given that a matrix is already partitioned into a 2 × 2 partitioning, it can be further repartitioned into a 3 × 3 partitioning with the C routine:
FLA_Error FLA_Repart_2x2_to_3x3
( FLA_Obj ATL, FLA_Obj ATR, FLA_Obj *A00, FLA_Obj *A01, FLA_Obj *A02,
                            FLA_Obj *A10, FLA_Obj *A11, FLA_Obj *A12,
  FLA_Obj ABL, FLA_Obj ABR, FLA_Obj *A20, FLA_Obj *A21, FLA_Obj *A22,
  int mb, int nb, FLA_Quadrant quadrant )
Purpose: Repartition a 2 × 2 partitioning of matrix A into a 3 × 3 partitioning where mb × nb submatrix A11
is split from the quadrant indicated by quadrant.
ATL, ATR, ABL, ABR – views of TL, TR, BL, and BR quadrants
mb, nb – row and column dimensions of A11
quadrant – quadrant from which A11 is partitioned
A00-A22 – views of A00 –A22
Here quadrant can again take on the values FLA_TL, FLA_TR, FLA_BL, and FLA_BR to indicate that the mb × nb submatrix A11 is split from submatrix ATL, ATR, ABL, or ABR, respectively.
Given a 3 × 3 partitioning, the middle submatrix can be appended to either of the four quadrants, ATL, ATR,
ABL, and ABR, of the corresponding 2 × 2 partitioning with the C routine
FLA_Error FLA_Cont_with_3x3_to_2x2
( FLA_Obj *ATL, FLA_Obj *ATR, FLA_Obj A00, FLA_Obj A01, FLA_Obj A02,
FLA_Obj A10, FLA_Obj A11, FLA_Obj A12,
FLA_Obj *ABL, FLA_Obj *ABR, FLA_Obj A20, FLA_Obj A21, FLA_Obj A22,
FLA_Quadrant quadrant )
Purpose: Update the 2 × 2 partitioning of matrix A by moving the boundaries so that A11 is joined to the
quadrant indicated by quadrant.
ATL, ATR, ABL, ABR – views of TL, TR, BL, and BR quadrants
A00-A22 – views of A00 –A22
quadrant – quadrant to which A11 is to be joined
Here the value of quadrant (FLA_TL, FLA_TR, FLA_BL, or FLA_BR) specifies the quadrant to which submatrix A11 is to be joined.
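The following skeleton (a sketch assembled from the routines just described, mirroring the structure of Figure 3.13) shows how a square matrix A is traversed through its quadrants; the update on the exposed blocks would go where indicated, and, as in Figure 3.13, a min macro or function is assumed to be available.

#include "FLAME.h"

void traverse_quadrants( FLA_Obj A, int nb_alg )
{
  FLA_Obj ATL, ATR,   A00, A01, A02,
          ABL, ABR,   A10, A11, A12,
                      A20, A21, A22;
  int b;

  FLA_Part_2x2( A, &ATL, &ATR,
                   &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */ /* ********************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------*/
    /* ... update A11 (and A10, A21, A12, A01, ...) here ...       */
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,  A00, A01, /**/ A02,
                                                A10, A11, /**/ A12,
                           /* ************** */ /* ****************** */
                              &ABL, /**/ &ABR,  A20, A21, /**/ A22,
                              FLA_TL );
  }
}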
• Subroutines coded using the FLAME/C interface (including, possibly, a recursive call);
Naturally these are actually three points on a spectrum of possibilities, since one can mix these techniques.
A number of matrix and/or vector operations have been identified to be frequently used by the linear algebra
community. Many of these are part of the BLAS. Since highly optimized implementations of these operations
are supported by widely available library implementations, it makes sense to provide a set of subroutines that
are simply wrappers to the BLAS. An example of a wrapper routine to the level 2 BLAS routine cblas dgemv,
a commonly available kernel for computing a matrix-vector multiplication, is given in Figure 4.9.
For additional information on supported functionality see Appendix B or visit the webpage
http://www.cs.utexas.edu/users/flame/
1 #include "FLAME.h"
2 #include "cblas.h"
3
4 void matvec_wrapper( FLA_Obj A, FLA_Obj x, FLA_Obj y )
5 {
6 FLA_Datatype
7 datatype_A;
8 int
9      m_A, n_A, ldim_A, m_x, n_x, inc_x, m_y, n_y, inc_y;
10
11 datatype_A = FLA_Obj_datatype( A );
12 m_A = FLA_Obj_length( A );
13 n_A = FLA_Obj_width ( A );
14 ldim_A = FLA_Obj_ldim ( A );
15
16 m_x = FLA_Obj_length( x );
17 n_x = FLA_Obj_width ( x );
18
19 m_y = FLA_Obj_length( y );
20 n_y = FLA_Obj_width ( y );
21
22 if ( m_x == 1 ) {
23 m_x = n_x;
24 inc_x = FLA_Obj_ldim( x );
25 }
26 else inc_x = 1;
27
28 if ( m_y == 1 ) {
29 m_y = n_y;
30 inc_y = FLA_Obj_ldim( y );
31 }
32 else inc_y = 1;
33
34 if ( datatype_A == FLA_DOUBLE ){
35 double *buff_A, *buff_x, *buff_y;
36
37 buff_A = ( double * ) FLA_Obj_buffer( A );
38 buff_x = ( double * ) FLA_Obj_buffer( x );
39 buff_y = ( double * ) FLA_Obj_buffer( y );
40
41      cblas_dgemv( CblasColMajor, CblasNoTrans, m_A, n_A,
42                   1.0, buff_A, ldim_A, buff_x, inc_x,
43                   1.0, buff_y, inc_y );
44 }
45 else FLA_Abort( "Datatype not yet supported", __LINE__, __FILE__ );
46 }
Figure 4.9: A sample matrix-vector multiplication routine. This routine is implemented as a wrapper to the CBLAS routine cblas_dgemv for matrix-vector multiplications.
Exercise 4.14 Use the routines in the FLAME/C API to implement the algorithm in Figure 3.8 for solving the lower triangular system Lx = b, overwriting b with the solution x.
4.4 Summary
The FLAME@lab and FLAME/C APIs illustrate how, by raising the level of abstraction at which one codes,
intricate indexing can be avoided in the code, therefore reducing the opportunity for the introduction of errors
and raising the confidence in correctness of the code. Thus, the proven correctness of those algorithms derived
using the FLAME methodology translates to a high degree of confidence in the implementation.
The two APIs that we presented are simple ones and serve to illustrate the issues. Similar interfaces to
more elaborate programming languages (e.g., C++, Java, and LabView’s G graphical programming language)
can be easily defined allowing special features of those languages to be used to even further raise the level of
abstraction at which one codes.
Chapter 5

High Performance Algorithms

Dense linear algebra operations are often at the heart of scientific computations that stress even the fastest
computers available. As a result, it is important that routines that compute these operations attain high
performance in the sense that they perform near the minimal number of operations and achieve near the
highest possible rate of execution. In this chapter we show that high performance can be achieved by casting
computation as much as possible in terms of the matrix-matrix product operation (gemm). We also expose that
for many matrix-matrix operations in linear algebra the derivation techniques discussed so far yield algorithms
that are rich in gemm.
Remark 5.1 Starting from this chapter, we adopt a more concise manner of presenting the derivation of the
algorithms where we only specify the partitioning of the operands and the loop-invariant. Recall that these two
elements prescribe the remaining derivation procedure of the worksheet and, therefore, the algorithm.
Operation                          flops    memops    flops/memops
Vector-vector operations
  scal    x := αx                    n        n          1/1
  add     x := x + y                 n        n          1/1
  dot     α := x^T y                 2n       2n         1/1
  apdot   α := α + x^T y             2n       2n         1/1
  axpy    y := αx + y                2n       3n         2/3
Matrix-vector operations
  gemv    y := αAx + βy              2n²      n²         2/1
  ger     A := αyx^T + A             2n²      2n²        1/1
  trsv    x := T^{-1} b              n²       n²/2       2/1
Matrix-matrix operations
  gemm    C := αAB + βC              2n³      4n²        n/2
Figure 5.1: Analysis of the cost of various operations. Here α ∈ R, x, y, b ∈ Rn , and A, B, C, T ∈ Rn×n , with T
being triangular.
data must be fetched from the memory to the registers in the CPU, and results must eventually be returned
to the memory. The fundamental obstacle to high performance (executing useful computations at the rate at
which the CPU can process) is the speed of memory: fetching and/or storing a data item from/to the memory
requires more time than it takes to perform a flop with it. This is known as the memory bandwidth bottleneck.
The solution has been to introduce a small cache memory, which is fast enough to keep up with the CPU, but
small enough to be economical (e.g., in terms of space required inside the processor). The pyramid in Figure 5.2
depicts the resulting model of the memory architecture. The model is greatly simplified in that currently most
architectures have several layers of cache and often also present additional, even slower, levels of memory. The
model is sufficient to explain the main issues behind achieving high performance.
[Pyramid diagram: Registers (fast, small) at the top, Cache in the middle, RAM (slow, large) at the bottom.]
Figure 5.2: Simplified model of the memory architecture used to illustrate the high-performance implementation
of gemm.
Next, consider the gemv operation y := Ax + y, where x, y ∈ Rn and A ∈ Rn×n . This operation involves
roughly n2 data (for the matrix), initially stored in memory, and 2n2 flops. Thus, an optimal implementation
will fetch every element of A exactly once, yielding a ratio of one memop for every two flops. Although this is
better than the ratio for the axpy, memops still dominate the cost of the algorithm if they are much slower
than flops. Figure 5.1 summarizes the analysis for other matrix-vector operations.
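A straightforward (unoptimized) gemv illustrates the point. In the sketch below, which assumes column-major storage with leading dimension lda (our own example, not a library routine), each element of A is loaded exactly once and participates in two flops, the 2/1 ratio listed in Figure 5.1 when only the traffic on A is counted.

void gemv_ref( int m, int n, const double *A, int lda,
               const double *x, double *y )
{
  int i, j;
  for ( j = 0; j < n; j++ )        /* y := x_j a_j + y (axpy with column j of A) */
    for ( i = 0; i < m; i++ )
      y[ i ] += A[ i + j * lda ] * x[ j ];
}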
It is by casting linear algebra algorithms in terms of the matrix-matrix product, gemm, that there is the
opportunity to overcome this memory bottleneck. Consider the product C := AB + C where all three matrices
are square of order n. This operation involves 4n2 memops (A and B must be fetched from memory while C
must be both fetched and stored) and requires 2n3 flops1 for a ratio of 4n2 /2n3 = 2/n memops/flops. Thus, if
n is large enough, the cost of performing memops is small relative to that of performing useful computations
with the data, and there is an opportunity to amortize the cost of fetching data into the cache over many
computations.
The gemm operation is defined as
$$ C := \alpha\, \mbox{op}(A)\, \mbox{op}(B) + \beta C, $$
where α, β ∈ R, op(A) ∈ R^{m×k}, op(B) ∈ R^{k×n}, C ∈ R^{m×n}, and op(X) is one of X or X^T. That is, gemm computes one of
$$ C := \alpha A B + \beta C, \quad C := \alpha A^T B + \beta C, \quad C := \alpha A B^T + \beta C, \quad \mbox{or} \quad C := \alpha A^T B^T + \beta C. $$
In the remainder of this chapter we will focus on the special case where α = β = 1 and matrices A and B are
not transposed. All insights can be easily extended to the other cases.
Throughout this chapter, unless otherwise stated, we will assume that A ∈ R^{m×k}, B ∈ R^{k×n}, and C ∈ R^{m×n}. These matrices will be partitioned into rows, columns, and elements using the conventions discussed in
Section 3.1.
5.2.1 Definition
The reader is likely familiar with the matrix-matrix product operation. Nonetheless, it is our experience that
it is useful to review why the matrix-matrix product is defined as it is.
Like the matrix-vector product, the matrix-matrix product is related to the properties of linear transforma-
tions. In particular, the matrix-matrix product AB equals the matrix that corresponds to the composition of
the transformations represented by A and B.
We start by reviewing the definition of the composition of two transformations.
Definition 5.2 Consider two linear transformations F : Rn → Rk and G : Rk → Rm . Then the composition of
these transformations (G ◦ F ) : Rn → Rm is defined by (G ◦ F)(x) = G(F(x)) for all x ∈ Rn .
The next theorem shows that if both G and F are linear transformations then so is their composition.
Theorem 5.3 Consider two linear transformations F : Rn → Rk and G : Rk → Rm . Then (G ◦ F) is also a
linear transformation.
Proof: Let α ∈ R and x, y ∈ Rn . Then
(G ◦ F)(αx + y) = G(F(αx + y)) = G(αF(x) + F(y)) = αG(F(x)) + G(F(y)) = α(G ◦ F)(x) + (G ◦ F)(y). □
With these observations we are ready to relate the composition of linear transformations to the matrix-matrix
product.
Assume A and B equal the matrices that correspond to the linear transformations G and F, respectively.
Since (G ◦ F ) is also a linear transformation, there exists a matrix C so that Cx = (G ◦ F )(x) for all x ∈ Rn .
The question now becomes how C relates to A and B. The key is the observation that Cx = (G ◦ F )(x) =
G(F(x)) = A(Bx), by the definition of composition and the relation between the matrix-vector product and
linear transformations.
Let ei , ej ∈ Rn denote, respectively, the ith, jth unit basis vector. This observation defines the jth column
of C as follows
$$ c_j = C e_j = (G \circ F)(e_j) = A(B e_j) = A b_j = \left( \begin{array}{c} \check a_0^T \\ \check a_1^T \\ \vdots \\ \check a_{m-1}^T \end{array} \right) b_j = \left( \begin{array}{c} \check a_0^T b_j \\ \check a_1^T b_j \\ \vdots \\ \check a_{m-1}^T b_j \end{array} \right), \qquad (5.1) $$
Exercise 5.5 Show that the cost of computing the matrix-matrix product is 2mnk flops.
Exercise 5.6 Show that the ith row of C is given by $\check c_i^T = \check a_i^T B$.
Exercise 5.7 Show that A(BC) = (AB)C. (This motivates the fact that no parentheses are needed when more than two matrices are multiplied together: ABC = A(BC) = (AB)C.)
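A reference implementation that follows the definition directly can be useful for checking the variants derived later. The sketch below (our own, assuming column-major storage with the indicated leading dimensions) updates column j of C with A b_j and performs the 2mnk flops of Exercise 5.5.

void gemm_ref( int m, int n, int k,
               const double *A, int lda,
               const double *B, int ldb,
               double       *C, int ldc )
{
  int i, j, p;
  for ( j = 0; j < n; j++ )        /* for each column c_j of C       */
    for ( p = 0; p < k; p++ )      /* c_j := beta_{p,j} a_p + c_j    */
      for ( i = 0; i < m; i++ )
        C[ i + j * ldc ] += A[ i + p * lda ] * B[ p + j * ldb ];
}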
where Ai,p ∈ Rmi ×kp , Bp,j ∈ Rkp ×nj , and Ci,j ∈ Rmi ×nj . Then, the (i, j) block of C = AB is given by
$$ C_{i,j} = \sum_{p=0}^{\kappa-1} A_{i,p} B_{p,j}. \qquad (5.3) $$
The proof of this theorem is tedious. We therefore resort to Exercise 5.9 to demonstrate why it is true
without giving a rigorous proof in this text.
Exercise 5.9 Show that
$$ \left( \begin{array}{cc|c} 1 & -1 & 3 \\ 2 & 0 & -1 \\ \hline -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{cc|c} -1 & 0 & 2 \\ 1 & -1 & 1 \\ \hline 2 & 1 & -1 \end{array} \right) = \left( \begin{array}{c|c} \left( \begin{array}{cc} 1 & -1 \\ 2 & 0 \end{array} \right) \left( \begin{array}{cc} -1 & 0 \\ 1 & -1 \end{array} \right) + \left( \begin{array}{c} 3 \\ -1 \end{array} \right) \left( \begin{array}{cc} 2 & 1 \end{array} \right) & \left( \begin{array}{cc} 1 & -1 \\ 2 & 0 \end{array} \right) \left( \begin{array}{c} 2 \\ 1 \end{array} \right) + \left( \begin{array}{c} 3 \\ -1 \end{array} \right) (-1) \\ \hline \left( \begin{array}{cc} -1 & 2 \\ 0 & 1 \end{array} \right) \left( \begin{array}{cc} -1 & 0 \\ 1 & -1 \end{array} \right) + \left( \begin{array}{c} 1 \\ 2 \end{array} \right) \left( \begin{array}{cc} 2 & 1 \end{array} \right) & \left( \begin{array}{cc} -1 & 2 \\ 0 & 1 \end{array} \right) \left( \begin{array}{c} 2 \\ 1 \end{array} \right) + \left( \begin{array}{c} 1 \\ 2 \end{array} \right) (-1) \end{array} \right). $$
Remark 5.10 Multiplying two partitioned matrices is exactly like multiplying two matrices with scalar ele-
ments, but with the individual elements replaced by submatrices. However, since the product of matrices does
not commute, the order of the submatrices of A and B in the product is important: While αi,p βp,j = βp,j αi,p ,
Ai,p Bp,j is generally not the same as Bp,j Ai,p . Also, the partitioning of A, B, and C must be conformal:
m(Ai,p ) = m(Ci,j ), n(Bp,j ) = n(Ci,j ), and n(Ai,p ) = m(Bp,j ), for 0 ≤ i < µ, 0 ≤ j < ν, 0 ≤ p < κ.
Remark 5.11 In the next section we will see that “small” is linked to the dimension of a block of a matrix that fits in the cache of the target architecture.
[Table with columns “m”, “n”, “k”, “Illustration”, and “Label”, giving, for each combination of large/small dimensions, the shape of the matrices involved in gemm and its name.]
Figure 5.4: Naming convention for the shape of matrices involved in gemm.
Left (unblocked Variant 1):
  C := a_1 b_1^T + C (ger)
  Continue with ( A_L | A_R ) ← ( A_0 | a_1 | A_2 ),  ( B_T ; B_B ) ← ( B_0 ; b_1^T ; B_2 )
  endwhile
Right (blocked Variant 1):
  C := A_1 B_1 + C (gepp)
  Continue with ( A_L | A_R ) ← ( A_0 | A_1 | A_2 ),  ( B_T ; B_B ) ← ( B_0 ; B_1 ; B_2 )
  endwhile
Figure 5.5: Left: gemm implemented as a sequence of ger operations (unblocked Variant 1). Right: gemm
implemented as a sequence of gepp operations (blocked Variant 1).
That is, each column of C is obtained from a gemv of A and the corresponding column of B. In line with
Remark 5.10, this can be viewed as partitioning B and C by columns and updating C as if A were a scalar and
B and C were row vectors.
Exercise 5.14 Show that
$$ \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{ccc} -1 & 0 & 2 \\ 1 & -1 & 1 \\ 2 & 1 & -1 \end{array} \right) = \left( \begin{array}{c|c|c} \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{c} -1 \\ 1 \\ 2 \end{array} \right) & \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{c} 0 \\ -1 \\ 1 \end{array} \right) & \left( \begin{array}{ccc} 1 & -1 & 3 \\ 2 & 0 & -1 \\ -1 & 2 & 1 \\ 0 & 1 & 2 \end{array} \right) \left( \begin{array}{c} 2 \\ 1 \\ -1 \end{array} \right) \end{array} \right). $$
The following corollary of Theorem 5.8 yields a second PME from which another unblocked and a blocked
variant for gemm can be derived.
Corollary 5.15 Partition B → ( B_L | B_R ) and C → ( C_L | C_R ), where n(B_L) = n(C_L). Then
$$ AB + C = A \left( \begin{array}{c|c} B_L & B_R \end{array} \right) + \left( \begin{array}{c|c} C_L & C_R \end{array} \right) = \left( \begin{array}{c|c} A B_L + C_L & A B_R + C_R \end{array} \right). $$
Left (unblocked Variant 2):
  Continue with ( B_L | B_R ) ← ( B_0 | b_1 | B_2 ),  ( C_L | C_R ) ← ( C_0 | c_1 | C_2 )
  endwhile
Right (blocked Variant 2):
  Continue with ( B_L | B_R ) ← ( B_0 | B_1 | B_2 ),  ( C_L | C_R ) ← ( C_0 | C_1 | C_2 )
  endwhile
Figure 5.6: Left: gemm implemented as a sequence of gemv operations (unblocked Variant 2). Right: gemm
implemented as a sequence of gemp operations (blocked Variant 2).
$$ P_{inv} : \left( \left( \begin{array}{c|c} C_L & C_R \end{array} \right) = \left( \begin{array}{c|c} A B_L + \hat C_L & \hat C_R \end{array} \right) \right) \wedge P_{cons}, $$
with “P_cons : n(B_L) = n(C_L)”, we arrive at the algorithms for gemm in Figure 5.6. The unblocked variant in this case is composed of gemv operations, and the blocked variant of gemp operations as, in this case (see Table 5.3), both B_1 and C_1 are “narrow” blocks of columns (panels) with only n_b columns, n_b ≪ m, k.
$$ e_i^T (AB) = (e_i^T A) B = \check a_i^T B. $$
contents of $\check c_i^T$.
Left (unblocked Variant 3):
  c_1^T := a_1^T B + c_1^T (gevm)
  Continue with ( A_T ; A_B ) ← ( A_0 ; a_1^T ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; c_1^T ; C_2 )
  endwhile
Right (blocked Variant 3):
  C_1 := A_1 B + C_1 (gepm)
  Continue with ( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; C_1 ; C_2 )
  endwhile
Figure 5.7: Left: gemm implemented as a sequence of gevm operations (unblocked Variant 3). Right: gemm
implemented as a sequence of gepm operations (blocked Variant 3).
with “P_cons : m(A_T) = m(C_T)”, we obtain the algorithms for gemm in Figure 5.7. The unblocked algorithm there consists of gevm operations, and the blocked variant of the corresponding generalization in the form of gepm as now A_1 and C_1 are blocks of m_b rows and m_b ≪ n, k (see Table 5.3).
Remark 5.18 The blocked variants in Figures 5.5, 5.6, and 5.7 compute gemm in terms of gepp, gemp, and
gepm operations, respectively. In Section 5.4 it will be shown how these three operations can be implemented
to achieve high performance. As a consequence, the blocked algorithms for gemm based on such operations
also achieve high performance.
5.3.4 Performance
Performance of the algorithms presented in Figures 5.5–5.7 on an Intel Pentium 4 is given in Figure 5.8. (See Section 1.5 for details on the architecture, compiler, etc.) The line labeled as “Simple” refers to a traditional implementation of gemm, consisting of three nested loops fully optimized by the compiler. Highly optimized implementations of the ger, gemv, and gevm operations were called by the gemm_unb_var1, gemm_unb_var2, and gemm_unb_var3 implementations. Similarly, highly optimized implementations for the gepp, gemp, and gepm operations were called by implementations of the blocked algorithms, which used a
[Figure 5.8: Performance, in GFLOPS, of the gemm implementations (“Simple”, and gemm via gepp, gemp, gepm, ger, gemv, and gevm) as a function of the matrix dimensions m = n = k (square matrices), for dimensions up to 2000.]
• Among the unblocked algorithms, the one that utilizes ger is the slowest. This is not surprising: it
generates twice as much data traffic between main memory and the registers as either of the other two.
• The “hump” in the performance of the unblocked algorithms coincides with problem sizes that fit in the
cache. For illustration, consider the gemv-based implementation in gemm unb var2. Performance of
the code is reasonable when the dimensions are relatively small since from one gemv to the next matrix
A then remains in the cache.
• The blocked algorithms attain a substantial percentage of peak performance (2.8 GFLOPS), that ranges
between 70 and 85%.
• Among the blocked algorithms, the variant that utilizes gepp is the fastest. On almost all architectures
at this writing high performance implementations of gepp outperform those of gemp and gepm. The
reasons for this go beyond the scope of this text.
• On different architectures, the relative performance of the algorithms may be quite different. However,
blocked algorithms invariably outperform unblocked ones.
C := A B + C
• The dimensions mc , kc are small enough so that A, a column from B, and a column from C together fit
in the cache.
• If A and the two columns are in the cache then gemv can be computed at the peak rate of the CPU.
Under these assumptions, the approach to implementing algorithm gebp in Figure 5.9 amortizes the cost
of moving data between the main memory and the cache as follows. The total cost of updating C is mc kc +
(2mc + kc )n memops for 2mc kc n flops. Now, let c = mc ≈ kc . Then, the ratio between computation and data
movement is
$$ \frac{2 c^2 n \mbox{ flops}}{(c^2 + 3cn) \mbox{ memops}} \approx \frac{2c}{3}\,\frac{\mbox{flops}}{\mbox{memops}} \quad \mbox{when } c \ll n. $$
If c ≈ n/100 then even if memops are 10 times slower than flops, the memops add only about 10% overhead to
the computation.
We note the similarity between algorithm gebp unb var2 in Figure 5.9 and the unblocked Variant 2 for
gemm in Figure 5.6 (right).
Remark 5.19 In the highest-performance implementations of gebp, both A and B are typically copied into
a contiguous buffer and/or transposed. For complete details on this observation, see [13].
Repartition
( B_L | B_R ) → ( B_0 | b_1 | B_2 ),  ( C_L | C_R ) → ( C_0 | c_1 | C_2 )
where b_1 and c_1 are columns
Continue with
( B_L | B_R ) ← ( B_0 | b_1 | B_2 ),  ( C_L | C_R ) ← ( C_0 | c_1 | C_2 )
endwhile
Figure 5.9: gebp implemented as a sequence of gemv, with indication of the memops and flops costs. Note
that a program typically has no explicit control over the loading of a cache. Instead it is in using the data that
the cache is loaded, and by carefully ordering the computation that the architecture is encouraged to keep data
in the cache.
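A plain-C sketch of such a gebp kernel is given below. It is not a high-performance implementation, but it makes the two ingredients explicit: A (mc × kc) is first packed into a contiguous buffer, as mentioned in Remark 5.19, and C is then updated one column at a time with a gemv that reuses the packed A. The function name and the column-major storage with the indicated leading dimensions are our assumptions.

#include <stdlib.h>
#include <string.h>

void gebp_packed( int mc, int kc, int n,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double       *C, int ldc )
{
  int i, j, p;
  double *Apack = (double *) malloc( (size_t) mc * kc * sizeof( double ) );

  /* Pack A: column p of A becomes column p of a contiguous mc x kc buffer. */
  for ( p = 0; p < kc; p++ )
    memcpy( &Apack[ p * mc ], &A[ p * lda ], mc * sizeof( double ) );

  /* c_j := A b_j + c_j; the packed A is reused for every column of C.      */
  for ( j = 0; j < n; j++ )
    for ( p = 0; p < kc; p++ )
      for ( i = 0; i < mc; i++ )
        C[ i + j * ldc ] += Apack[ i + p * mc ] * B[ p + j * ldb ];

  free( Apack );
}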
Left:
  C_1 := A_1 B + C_1 (gebp)
  Continue with ( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; C_1 ; C_2 )
  endwhile
Right:
  C_1 := A_1 B + C_1 (gepm)
  Continue with ( A_T ; A_B ) ← ( A_0 ; A_1 ; A_2 ),  ( C_T ; C_B ) ← ( C_0 ; C_1 ; C_2 )
  endwhile
Figure 5.10: Left: gepp implemented as a sequence of gebp operations. Right: gemm implemented as a
sequence of gepm operations.
Exercise 5.20 Propose a similar scheme for the gepb operation, where A ∈ Rm×kc , B ∈ Rkc ×nc , and C ∈
Rm×nc . State your assumptions carefully. Analyze the ratio of flops to memops.
Exercise 5.21 Propose a similar scheme for the gepdot operation, where A ∈ Rmc ×k , B ∈ Rk×nc , and
C ∈ Rmc ×nc . State your assumptions carefully. Analyze the ratio of flops to memops.
Consider the gepp operation C := AB + C, where A ∈ Rm×kb , B ∈ Rkb ×n , and C ∈ Rm×n . By partitioning
the matrices into two different directions, we will obtain two algorithms for this operation, based on gebp or
gepb. We will review here the first variant while the second one is proposed as an exercise.
Assume m is an exact multiple of mb and partition matrices A and C into blocks of mb rows so that the
product takes the form
$$ \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right) := \left( \begin{array}{c} A_0 \\ A_1 \\ \vdots \end{array} \right) B + \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right). $$
Then, each Ci can be computed as a gebp of the form Ci := Ai B + Ci . Since it was argued that gebp can
attain high performance, provided mb = mc and kb = kc , so can the gepp.
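Under the same (assumed) column-major conventions, this loop over gebp updates can be sketched in C as follows; gebp_packed refers to the kernel sketched in the previous section.

void gebp_packed( int mc, int kc, int n, const double *A, int lda,
                  const double *B, int ldb, double *C, int ldc );

void gepp_by_gebp( int m, int n, int kb, int mb,
                   const double *A, int lda,
                   const double *B, int ldb,
                   double       *C, int ldc )
{
  int i;
  for ( i = 0; i < m; i += mb ){
    int b = ( m - i < mb ? m - i : mb );
    /* C_i := A_i B + C_i (gebp on a block of b rows). */
    gebp_packed( b, kb, n, &A[ i ], lda, B, ldb, &C[ i ], ldc );
  }
}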
Remark 5.22 In the implementation of gepp based on gebp there is complete freedom to choose mb = mc. Also, kb is usually set by the routine that invokes gepp (e.g., gemm_blk_var1), so that it can be chosen there
as kb = kc . An analogous situation will occur for the alternative implementation of gepp based on gepb, and
for all implementations of gemp and gepm.
The algorithm for this is given in Figure 5.10 (left). For comparison, we repeat Variant 3 for computing
gemm in the same figure (right). The two algorithms are identical except that the constraint on the row
dimension of A and column dimension of B changes the update from a gepm to a gebp operation.
Exercise 5.23 For the gepp operation assume n is an exact multiple of nb , and partition B and C into blocks
of nb columns so that
$$ \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right) := A \left( \begin{array}{c|c|c} B_0 & B_1 & \cdots \end{array} \right) + \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right). $$
Propose an alternative high-performance algorithm for computing gepp based on gepb. Compare the resulting
algorithm to the three variants for computing gemm. Which variant does it match?
$$ C := \left( \begin{array}{c|c|c} A_0 & A_1 & \cdots \end{array} \right) \left( \begin{array}{c} B_0 \\ B_1 \\ \vdots \end{array} \right) + C. $$
Then, C can be computed as repeated updates of C with gepb operations, C := A_p B_p + C. The algorithm is identical to gemm_blk_var1 in Figure 5.5 except that the update changes from a gepp to a gepb operation. If gepb attains high performance, provided n_b = n_c and k_b = k_c, so will this algorithm for computing gemp.
Exercise 5.24 For the gemp operation assume m is an exact multiple of mb , and partition A and C by blocks
of mb rows as
$$ \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right) := \left( \begin{array}{c} A_0 \\ A_1 \\ \vdots \end{array} \right) B + \left( \begin{array}{c} C_0 \\ C_1 \\ \vdots \end{array} \right). $$
$$ \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right) := A \left( \begin{array}{c|c|c} B_0 & B_1 & \cdots \end{array} \right) + \left( \begin{array}{c|c|c} C_0 & C_1 & \cdots \end{array} \right). $$
Then each block of C can be computed as C_j := A B_j + C_j using the gepdot operation. The algorithm is identical to gemm_blk_var2 in Figure 5.6 except that the update changes from a gemp to a gepdot operation. If gepdot attains high performance, provided m_b ≈ m_c and n_b = n_c, so will this algorithm for computing gepm.
Exercise 5.25 For the gepm operation assume k is an exact multiple of kb , and partition A and B by blocks
of kb rows and columns, respectively, so that
$$ C := \left( \begin{array}{c|c|c} A_0 & A_1 & \cdots \end{array} \right) \left( \begin{array}{c} B_0 \\ B_1 \\ \vdots \end{array} \right) + C. $$
Remark 5.26 If one variant of matrix-matrix multiplication is used at one level, that same variant does not
occur at the next level. There are theoretical reasons for this that go beyond the scope of this text. For details,
see [17].
5.5 Modularity and Performance via gemm: Implementing symm

In this section we consider the symmetric matrix-matrix multiplication (symm)
C := AB + C,
where A is symmetric.
Remark 5.28 Unless otherwise stated, we assume hereafter that it is the lower part of the symmetric matrix
A, including the diagonal, that contains the relevant entries of the matrix. In our notation this is denoted as
SyLw(A). The algorithms that are derived will not make any reference to the contents of the strictly upper
triangular part (superdiagonals) of symmetric matrices.
[Diagram relating the implementations of gemm; the labels include gemm_blk_var2 (gepdot, m_b, n_b, k), gemm_unb_var1 (ger, m_b, n_b), gemm_blk_var3 (gepm, m_b, n, k), gemm_blk_var1 (gebp, m_b, n, k_b), and gemm_unb_var2 (gemv, m_b, k_b).]
Figure 5.11: Implementations of gemm. The legend on top of each figure indicates the algorithm that is invoked in that case, and (between parentheses) the shape and dimensions of the subproblems the case is decomposed into. For instance, in the case marked with “?”, the product is performed via algorithm gemm_blk_var1, which is then decomposed into matrix-matrix products of shape gepp and dimensions m, n, and k_b.
When dealing with symmetric matrices, in general only the upper or lower part of the matrix is actually
stored. One option is to copy the stored part of A into both the upper and lower triangular part of a temporary matrix and to then use gemm. This is undesirable if A is large, since it requires temporary space.
The precondition for the symmetric matrix-matrix product, symm, is given by
Ppre : (A ∈ Rm×m ) ∧ SyLw(A) ∧ (B, C ∈ Rm×n ),
while the postcondition is that
Ppost : C = AB + Ĉ.
We next formulate a partitioning and a collection of loop-invariants that potentially yield algorithms for
symm. Let us partition the symmetric matrix A into quadrants as
$$ A \rightarrow \left( \begin{array}{c|c} A_{TL} & A_{BL}^T \\ \hline A_{BL} & A_{BR} \end{array} \right). $$
Then, from the postcondition, C = AB + Ĉ, a consistent partitioning of matrices B and C is given by
$$ B \rightarrow \left( \begin{array}{c} B_T \\ \hline B_B \end{array} \right), \quad C \rightarrow \left( \begin{array}{c} C_T \\ \hline C_B \end{array} \right), $$
where “Pcons : n(AT L ) = m(BT ) ∧ m(AT L ) = m(CT )” holds. (A different possibility would be to also partition
B and C into quadrants, a case that is proposed as an exercise at the end of this section.) Very much as what
we do for triangular matrices, for symmetric matrices we also require the blocks in the diagonal to be square
(and therefore) symmetric. Thus, in the previous partitioning of A we want that
Pstruct : SyLw(AT L ) ∧ (m(AT L ) = n(AT L )) ∧ SyLw(ABR ) ∧ (m(ABR ) = n(ABR ))
holds. Indeed, because SyLw(A), it is sufficient to define
Pstruct : SyLw(AT L ).
Remark 5.29 When dealing with symmetric matrices, in order for the diagonal blocks that are exposed to
be themselves symmetric, we always partition this type of matrices into quadrants, with square blocks in the
diagonal.
The PME is given by
$$ \left( \begin{array}{c} C_T \\ \hline C_B \end{array} \right) = \left( \begin{array}{c|c} A_{TL} & A_{BL}^T \\ \hline A_{BL} & A_{BR} \end{array} \right) \left( \begin{array}{c} B_T \\ \hline B_B \end{array} \right) + \left( \begin{array}{c} \hat C_T \\ \hline \hat C_B \end{array} \right), $$
which is equivalent to
$$ \left( \begin{array}{c} C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T \\ \hline C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B \end{array} \right). $$
Recall that loop-invariants result by assuming that some computation is yet to be performed. A systematic enumeration of subresults, each of which is a potential loop-invariant, is given in Figure 5.12. We are only interested in feasible loop-invariants:
Definition 5.30 A feasible loop-invariant is a loop-invariant that yields a correct algorithm when the derivation
methodology is applied. If a loop-invariant is not feasible, it is infeasible.
In the column marked by “Comment” reasons are given why a loop-invariant is not feasible.
Among the feasible loop-invariants in Figure 5.12, we now choose
õ ¶ à !!
CT AT L BT + ĈT
= ∧ Pcons ∧ Pstruct ,
CB ĈB
for the remainder of this section. This invariant yields the blocked algorithm in Figure 5.13. As part of the update, in this algorithm the symmetric matrix-matrix multiplication A_{11} B_1 needs to be computed (being a square block on the diagonal of A, A_{11} = A_{11}^T). In order to do so, we can apply an unblocked version of the algorithm which, of course, would not reference the strictly upper triangular part of A_{11}. The remaining two updates require the computation of two gemms, A_{10}^T B_1 and A_{10} B_0, and do not reference any block in the strictly upper triangular part of A either.
Exercise 5.31 Show that the cost of the algorithm for symm in Figure 5.13 is 2m2 n flops.
Exercise 5.32 Derive a pair of blocked algorithms for computing C := AB + Ĉ, with SyLw(A), by partitioning
all three matrices into quadrants and choosing two feasible loop-invariants found for this case.
Exercise 5.33 Derive a blocked algorithm for computing C := BA + Ĉ, with SyLw(A), A ∈ Rm×m , C, B ∈
Rn×m , by partitioning both B and C in a single dimension and choosing a feasible loop-invariant found for this
case.
5.5.1 Performance
Consider now m to be an exact multiple of $m_b$, $m = \mu m_b$. The algorithm in Figure 5.13 requires $\mu$ iterations,
with $2 m_b^2 n$ flops being performed as a symmetric matrix multiplication ($A_{11} B_1$) at each iteration, while the rest
of the computation is in terms of two gemms ($A_{10}^T B_1$ and $A_{10} B_0$). The amount of computation carried out as
symmetric matrix multiplications, $2 m\, m_b\, n$ flops, is only a minor part of the total cost of the algorithm, $2 m^2 n$
flops (provided $m_b \ll m$). Thus, given an efficient implementation of gemm, high performance can be expected
from this algorithm.
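As a quick check of the arithmetic (a short derivation added here for clarity, under the stated assumption $m = \mu m_b$):
$$\mu\,(2 m_b^2 n) = \frac{m}{m_b}\,(2 m_b^2 n) = 2 m\, m_b\, n, \qquad \frac{2 m\, m_b\, n}{2 m^2 n} = \frac{m_b}{m},$$
so the fraction of flops not cast in terms of gemm is only $m_b/m$, which is indeed small when $m_b \ll m$.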
5.6 Summary
The highlights of this chapter are:
• A high level description of the architectural features of a computer that affect high-performance imple-
mentation of the matrix-matrix multiply.
Each row below records which of the four terms of the PME ($A_{TL} B_T$, $A_{BL}^T B_B$, $A_{BL} B_T$, $A_{BR} B_B$) has already been computed (Y/N), the corresponding $P_{inv}$ (the contents of $C_T$ and $C_B$), and a comment:

N N N N: $C_T = \hat C_T$, $C_B = \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y N N N: $C_T = A_{TL} B_T + \hat C_T$, $C_B = \hat C_B$. Variant 1 (Fig. 5.13).
N Y N N: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y Y N N: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = \hat C_B$. Variant 2.
N N Y N: $C_T = \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y N Y N: $C_T = A_{TL} B_T + \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. Variant 3 (leads to an alternative algorithm).
N Y Y N: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. No loop-guard exists so that $P_{inv} \wedge \neg G \Rightarrow P_{post}$.
Y Y Y N: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + \hat C_B$. Variant 4.
N N N Y: $C_T = \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. Variant 5.
Y N N Y: $C_T = A_{TL} B_T + \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
N Y N Y: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. Variant 6.
Y Y N Y: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
N N Y Y: $C_T = \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. Variant 7.
Y N Y Y: $C_T = A_{TL} B_T + \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
N Y Y Y: $C_T = A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. Variant 8.
Y Y Y Y: $C_T = A_{TL} B_T + A_{BL}^T B_B + \hat C_T$, $C_B = A_{BL} B_T + A_{BR} B_B + \hat C_B$. No simple initialization exists to achieve this state.
Figure 5.12: Potential loop-invariants for C := AB + C, with SyLw(A), using the partitioning in Section 5.5.
Potential invariants are derived from the PME by systematically including/excluding (Y/N) a term.
$C_0 := A_{10}^T B_1 + C_0$
$C_1 := A_{10} B_0 + A_{11} B_1 + C_1$

Continue with
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad \begin{pmatrix} B_T \\ B_B \end{pmatrix} \leftarrow \begin{pmatrix} B_0 \\ B_1 \\ B_2 \end{pmatrix}, \quad \begin{pmatrix} C_T \\ C_B \end{pmatrix} \leftarrow \begin{pmatrix} C_0 \\ C_1 \\ C_2 \end{pmatrix}$$
endwhile
Figure 5.13: Algorithm for computing C := AB + C, with SyLw(A) (blocked Variant 1).
• The hierarchical anatomy of the implementation of this operation that exploits the hierarchical organiza-
tion of multilevel memories of current architectures.
• The very high performance that is attained by this particular operation.
• How to cast algorithms for linear algebra operations in terms of the matrix-matrix multiply.
• The modular high performance that results.
A recurrent theme of this and subsequent chapters will be that blocked algorithms for all major linear algebra
operations can be derived that cast most computations in terms of gepp, gemp, and gepm. The block size for
these is tied to the size of cache memory.
6 The LU and Cholesky Factorizations

A commonly employed strategy for solving (dense) linear systems starts with the factorization of the coefficient
matrix of the system into the product of two triangular matrices, followed by solves with the resulting
triangular systems. In this chapter we review two such factorizations, the LU and Cholesky factorizations.
The LU factorization (combined with pivoting) is the most commonly used method for solving general linear
systems. The Cholesky factorization plays an analogous role for systems with a symmetric positive definite
(SPD) coefficient matrix.
Throughout this chapter, and unless otherwise stated explicitly, we assume the coefficient matrix (and
therefore the triangular matrices resulting from the factorization) to be nonsingular with n rows and columns.
LU factorization in this section. In particular, the symbol α11 will denote the same element in the first step of both methods.
Gaussian elimination starts by subtracting a multiple of the first row of the matrix from the second row so as
to annihilate element α21 . To do so, the multiplier λ21 = α21 /α11 is first computed, after which the first row
times the multiplier λ21 is subtracted from the second row of A, resulting in
$$\begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1,n} \\ 0 & \alpha_{22} - \lambda_{21}\alpha_{12} & \cdots & \alpha_{2,n} - \lambda_{21}\alpha_{1,n} \\ \alpha_{31} & \alpha_{32} & \cdots & \alpha_{3,n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n,1} & \alpha_{n,2} & \cdots & \alpha_{n,n} \end{pmatrix}.$$
Next, the multiplier to eliminate the element α31 is computed as λ31 = α31 /α11 , and the first row times λ31 is
subtracted from the third row of A to obtain
$$\begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1,n} \\ 0 & \alpha_{22} - \lambda_{21}\alpha_{12} & \cdots & \alpha_{2,n} - \lambda_{21}\alpha_{1,n} \\ 0 & \alpha_{32} - \lambda_{31}\alpha_{12} & \cdots & \alpha_{3,n} - \lambda_{31}\alpha_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n,1} & \alpha_{n,2} & \cdots & \alpha_{n,n} \end{pmatrix}.$$
Repeating this for the remaining rows yields
$$\begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1,n} \\ 0 & \alpha_{22} - \lambda_{21}\alpha_{12} & \cdots & \alpha_{2,n} - \lambda_{21}\alpha_{1,n} \\ 0 & \alpha_{32} - \lambda_{31}\alpha_{12} & \cdots & \alpha_{3,n} - \lambda_{31}\alpha_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \alpha_{n,2} - \lambda_{n,1}\alpha_{12} & \cdots & \alpha_{n,n} - \lambda_{n,1}\alpha_{1,n} \end{pmatrix}. \qquad (6.1)$$
Typically, the multipliers $\lambda_{21}, \lambda_{31}, \ldots, \lambda_{n,1}$ are stored over the zeroes that are introduced. After this, the process
continues with the bottom right $(n-1) \times (n-1)$ quadrant of the matrix until eventually A becomes an upper
triangular matrix.
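The process just described can be captured in a few lines. The following is a minimal M-script sketch, not taken from the book, with the hypothetical name GaussianElimination; it stores the multipliers over the zeroes they introduce so that, on return, A holds the multipliers below the diagonal and the resulting upper triangular matrix on and above it.

function A = GaussianElimination( A )
% Gaussian elimination without pivoting; multipliers overwrite the
% entries they annihilate.
  n = size( A, 1 );
  for k = 1:n-1
    A( k+1:n, k ) = A( k+1:n, k ) / A( k, k );                          % multipliers
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) - A( k+1:n, k ) * A( k, k+1:n );
  end
end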
We remind the reader of the following theorem, found in any standard linear algebra text, which gives a number
of equivalent characterizations of a nonsingular matrix:
Theorem 6.2 Given a square matrix A, the following are equivalent:
• A is nonsingular.
• Ax = 0 if and only if x = 0.
Proof: We delay the proof of this theorem until after algorithms for computing the LU factorization have been
given.
Equating corresponding submatrices on the left and the right of this equation yields the following insights:
• According to (6.2), the first row of U equals the first row of A, just like the first row of A is left untouched
in Gaussian elimination.
which are the same operations that are performed during Gaussian elimination on the n − 1 × n − 1 bottom
right submatrix of A.
• After these computations have been performed, both the LU factorization and Gaussian elimination
proceed (recursively) with the n − 1 × n − 1 bottom right quadrant of A.
We conclude that Gaussian elimination and the described algorithm for computing the LU factorization perform
exactly the same computations.
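Assuming no zero pivots are encountered, this equivalence can be checked numerically with the GaussianElimination sketch given above (again a sketch for illustration, not the book's code):

n = 5;  A = rand( n ) + n * eye( n );    % make the diagonal dominant so no pivoting is needed
LU = GaussianElimination( A );
L = tril( LU, -1 ) + eye( n );
U = triu( LU );
disp( norm( L * U - A ) )                % should be of the order of machine precision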
Exercise 6.5 Gaussian elimination is usually applied to the augmented matrix $(A \;\; b)$ so that, upon completion,
A is overwritten by U, b is overwritten by an intermediate result y, and the solution of the linear system
$Ax = b$ is obtained from $Ux = y$. Use the system defined by
$$A = \begin{pmatrix} 3 & -1 & 2 \\ -3 & 3 & -1 \\ 6 & 0 & 4 \end{pmatrix}, \qquad b = \begin{pmatrix} 7 \\ 0 \\ 18 \end{pmatrix},$$
to illustrate this procedure and to verify that the intermediate result y satisfies $Ly = b$.
The previous exercise illustrates that in applying Gaussian elimination to the augmented system, both the
LU factorization of the matrix and the solution of the unit lower triangular system are performed simultaneously
(in the augmented matrix, b is overwritten with the solution of Ly = b). On the other hand, when solving a
linear system via the LU factorization, the matrix is first decomposed into the triangular matrices L and U ,
and then two linear systems are solved: y is computed from Ly = b, and then the solution x is obtained from
U x = y.
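A minimal M-script sketch, not from the book, of this second approach, assuming A and b hold the coefficient matrix and right-hand side and that the factors are stored in the packed form produced by the hypothetical GaussianElimination sketch above:

LU = GaussianElimination( A );           % A = L U, no pivoting
L  = tril( LU, -1 ) + eye( size( A ) );  % unit lower triangular factor
U  = triu( LU );
y  = L \ b;                              % forward substitution: L y = b
x  = U \ y;                              % back substitution:    U x = y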
6.2.2 Variants
Let us examine next how to derive different variants for computing the LU factorization. The precondition$^2$
and postcondition for this operation are given, respectively, by
$$P_{pre}: A = \hat A$$
$$P_{post}: (A = \{L\backslash U\}) \wedge (LU = \hat A),$$
where the notation A = {L\U } in the postcondition indicates that L overwrites the elements of A below the
diagonal while U overwrites those on and above the diagonal. (The unit elements on the diagonal entries of L
will not be stored since they are implicitly known.) The requirement that L and U overwrite specific parts of
A implicitly defines the dimensions and triangular structure of these factors.
In order to determine a collection of feasible loop-invariants, we start by choosing a partitioning of the
matrices involved in the factorization. The triangular form of L and U requires them to be partitioned into
quadrants with square diagonal blocks so that the off-diagonal block of zeroes can be cleanly identified. This
then requires A to be conformally partitioned into quadrants as well. Thus,
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}, \quad U \rightarrow \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix},$$
2 Strictly speaking, the precondition should also assert that A is square and has nonsingular leading principal submatrices (see
Theorem 6.4).
where
$$P_{cons}: m(A_{TL}) = n(A_{TL}) = m(L_{TL}) = n(L_{TL}) = m(U_{TL}) = n(U_{TL})$$
holds. Substituting these into the postcondition yields
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ L_{BL} & \{L\backslash U\}_{BR} \end{pmatrix} \;\wedge\; \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix} = \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix},$$
from which, multiplying out the second expression, we obtain the PME for LU factorization:
$$\begin{array}{l l} L_{TL} U_{TL} = \hat A_{TL} & L_{TL} U_{TR} = \hat A_{TR} \\ L_{BL} U_{TL} = \hat A_{BL} & L_{BR} U_{BR} = \hat A_{BR} - L_{BL} U_{TR}. \end{array}$$
These equations exhibit data dependences which dictate an order for the computations: $\hat A_{TL}$ must be factored
into $L_{TL} U_{TL}$ before $U_{TR} := L_{TL}^{-1} \hat A_{TR}$ and $L_{BL} := \hat A_{BL} U_{TL}^{-1}$ can be computed, and these two triangular
systems need to be solved before the update $\hat A_{BR} - L_{BL} U_{TR}$ can be carried out. Taking these
dependences into account, the PME yields the five feasible loop-invariants for the LU factorization in Figure 6.1.
Exercise 6.6 Derive unblocked and blocked algorithms corresponding to each of the five loop-invariants in
Figure 6.1.
Note that the resulting algorithms are exactly those given in Figure 1.3.
The loop-invariants in Figure 6.1 yield all the algorithms depicted on the cover of, and discussed in, G.W. Stewart's
book on matrix factorization [25]. All these algorithms perform the same computations, but in a different order.
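For concreteness, here is a minimal M-script sketch, not the book's code, of a blocked algorithm corresponding to Variant 5; the name LU_blk_var5 and the block size nb are hypothetical, and the factorization of the diagonal block reuses the GaussianElimination sketch given earlier.

function A = LU_blk_var5( A, nb )
% Blocked right-looking LU factorization (Variant 5), no pivoting.
  n = size( A, 1 );
  for k = 1:nb:n
    b  = min( nb, n-k+1 );
    i1 = k:k+b-1;   i2 = k+b:n;
    A( i1, i1 ) = GaussianElimination( A( i1, i1 ) );        % A11 := {L\U}11
    L11 = tril( A( i1, i1 ), -1 ) + eye( b );
    U11 = triu( A( i1, i1 ) );
    A( i1, i2 ) = L11 \ A( i1, i2 );                         % U12 := inv(L11) A12
    A( i2, i1 ) = A( i2, i1 ) / U11;                         % L21 := A21 inv(U11)
    A( i2, i2 ) = A( i2, i2 ) - A( i2, i1 ) * A( i1, i2 );   % A22 := A22 - L21 U12
  end
end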
where n = m(A) = n(A). The base case comes from the fact that for a 0 x 0 matrix no computation needs to
be performed. The recurrence results from the cost of the updates in the loop-body: $l_{21} := a_{21}/\upsilon_{11}$ costs
$n - k - 1$ flops and $A_{22} := A_{22} - a_{21} a_{12}^T$ costs $2(n-k-1)^2$ flops. Thus, the total cost for Variant 5 is given by$^3$
$$C_{lu5}(n) = \sum_{k=0}^{n-1} \left( (n-k-1) + 2(n-k-1)^2 \right) = \sum_{k=0}^{n-1} \left( k + 2k^2 \right)$$
3 When $A_{TL}$ equals the whole matrix, the loop-guard evaluates to false and no update is performed, so that $C_{lu}(n) = C_{lu}(n-1)$.
Variant 1
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 2
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 3
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \hat A_{TR} \\ L_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 4
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ L_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 5
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ L_{BL} & \hat A_{BR} - L_{BL} U_{TR} \end{pmatrix}$$
Figure 6.1: Five loop-invariants for the LU factorization.
Continuing the derivation,
$$\sum_{k=0}^{n-1} \left( k + 2k^2 \right) = 2\sum_{k=0}^{n-1} k^2 + \sum_{k=0}^{n-1} k = 2\left( \sum_{k=1}^{n} k^2 - n^2 \right) + \left( \sum_{k=1}^{n} k - n \right) = 2\left( \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6} - n^2 \right) + \frac{n^2}{2} - \frac{n}{2} = \frac{2}{3} n^3 - \frac{1}{2} n^2 - \frac{1}{6} n \approx \frac{2}{3} n^3 \ \text{flops}.$$
6.2.4 Performance
The performance of the LU factorization was already discussed in Section 1.5.
Exercise 6.11 Consider a set of Gauss transforms Lk , 0 ≤ k < n, defined as in (6.5). Show that
3. L0 L1 · · · Ln−1 ek = Lk ek , 0 ≤ k < n.
Hint: Use Result 2.
Definition 6.12 We will refer to
$$L_{ac,k} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & I_{n-k} \end{pmatrix}, \qquad (6.6)$$
where $L_{TL} \in \mathbb{R}^{k \times k}$ is unit lower triangular, as an accumulated Gauss transform.
Remark 6.13 In subsequent discussion, often we will not explicitly define the dimensions of Gauss transforms
and accumulated Gauss transforms, since they can be deduced from the dimension of the matrix or vector to
which the transformation is applied.
The name of this transform signifies that the product Lac,k B is equivalent to computing L0 L1 · · · Lk−1 B,
where the jth column of the Gauss transform Lj , 0 ≤ j < k, equals that of Lac,k .
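This property is easy to check numerically. The following minimal M-script sketch, not from the book, builds k Gauss transforms with arbitrary multipliers and compares their product with the accumulated Gauss transform assembled column by column.

n = 6;  k = 3;
Lacc = eye( n );                       % accumulated Gauss transform L_{ac,k}
Prod = eye( n );                       % product L_0 L_1 ... L_{k-1}
for j = 1:k
  Lj = eye( n );
  Lj( j+1:n, j ) = rand( n-j, 1 );     % multipliers in column j
  Lacc( j+1:n, j ) = Lj( j+1:n, j );   % L_{ac,k} shares this column
  Prod = Prod * Lj;
end
disp( norm( Prod - Lacc ) )            % should be zero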
Exercise 6.14 Show that the accumulated Gauss transform $L_{ac,k} = L_0 L_1 \cdots L_{k-1}$, defined as in (6.6), satisfies
$$L_{ac,k}^{-1} = \begin{pmatrix} L_{TL}^{-1} & 0 \\ -L_{BL} L_{TL}^{-1} & I_{n-k} \end{pmatrix} \quad\text{and}\quad L_{ac,k}^{-1} A = L_{k-1}^{-1} \cdots L_1^{-1} L_0^{-1} A.$$
We are now ready to describe the LU factorization in terms of the application of Gauss transforms. Partition
A and the first Gauss transform L0 :
$$A \rightarrow \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}, \qquad L_0 \rightarrow \begin{pmatrix} 1 & 0 \\ l_{21} & I_{n-1} \end{pmatrix},$$
where $\alpha_{11}$ is a scalar. Next, observe the result of applying the inverse of $L_0$ to A:
$$L_0^{-1} A = \begin{pmatrix} 1 & 0 \\ -l_{21} & I_{n-1} \end{pmatrix} \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} - \alpha_{11} l_{21} & A_{22} - l_{21} a_{12}^T \end{pmatrix}.$$
By choosing l21 := a21 /α11 , we obtain a21 − α11 l21 = 0 and A is updated exactly as in the first iteration of the
unblocked algorithm for overwriting the matrix with its LU factorization via Variant 5.
Next, assume that after k steps A has been overwritten by $L_{k-1}^{-1} \cdots L_1^{-1} L_0^{-1} \hat A$ so that, by careful selection
of the Gauss transforms,
$$A := L_{k-1}^{-1} \cdots L_1^{-1} L_0^{-1} \hat A = \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & A_{BR} \end{pmatrix},$$
where $U_{TL} \in \mathbb{R}^{k \times k}$ is upper triangular. Repartition
$$\begin{pmatrix} U_{TL} & U_{TR} \\ 0 & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} U_{00} & u_{01} & U_{02} \\ 0 & \alpha_{11} & a_{12}^T \\ 0 & a_{21} & A_{22} \end{pmatrix} \quad\text{and}\quad L_k \rightarrow \begin{pmatrix} I_k & 0 & 0 \\ 0 & 1 & 0 \\ 0 & l_{21} & I_{n-k-1} \end{pmatrix},$$
A simple calculation shows that $x = (1, 1, 1)^T$ is the exact solution of the system ($Ax - b = 0$).
Now, assume we use a computer where all the operations are done in four-digit decimal floating-point arithmetic.
Computing the LU factorization of A on this machine then yields
$$L = \begin{pmatrix} 1.000 & & \\ 598.0 & 1.000 & \\ 737.5 & 1.233 & 1.000 \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} 0.002 & 1.231 & 2.471 \\ & -732.9 & -1475 \\ & & -1820 \end{pmatrix},$$
which shows two large multipliers in L and the consequent element growth in U.
If we next employ these factors to solve the system, applying forward substitution to $Ly = b$, we obtain
$$y = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} 3.704 \\ -2208 \\ -2000 \end{pmatrix},$$
which presents multipliers in L of magnitude less than one, and no dramatic element growth in U.
Using these factors, from $Ly = \bar b$ and $Ux = y$, we obtain, respectively,
$$y = \begin{pmatrix} 7.888 \\ 3.693 \\ 1.407 \end{pmatrix} \quad\text{and}\quad x = \begin{pmatrix} 1.000 \\ 1.000 \\ 1.000 \end{pmatrix}.$$
is the permutation matrix that, when applied to the vector $x = (\chi_0, \chi_1, \ldots, \chi_{n-1})^T$, yields $Px = (\chi_{\pi_0}, \chi_{\pi_1}, \ldots, \chi_{\pi_{n-1}})^T$.
The following exercise recalls a few essential properties of permutation matrices.
Exercise 6.18 Consider $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, and let $P \in \mathbb{R}^{n \times n}$ be a permutation matrix. Show that
2. $PA$ rearranges the rows of A exactly in the same order as the elements of x are rearranged by $Px$. Hint:
Partition P as in (6.7) and recall that row $\pi$ of A is given by $e_\pi^T A$.
3. $AP^T$ rearranges the columns of A exactly in the same order as the elements of x are rearranged by $Px$.
Hint: Consider $(PA^T)^T$.
We will frequently employ permutation matrices that swap the first element of a vector with element π of
that vector:
Definition 6.19 The permutation that, when applied to a vector, swaps the first element with element π is
defined as
$$P(\pi) = \begin{cases} I_n & \text{if } \pi = 0, \\[4pt] \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & I_{\pi-1} & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & I_{n-\pi-1} \end{pmatrix} & \text{otherwise.} \end{cases}$$
Definition 6.20 Given $p = (\pi_0, \pi_1, \ldots, \pi_{k-1})^T$, a permutation of $\{0, 1, \ldots, k-1\}$, $P(p)$ denotes the permutation
$$P(p) = \begin{pmatrix} I_{k-1} & 0 \\ 0 & P(\pi_{k-1}) \end{pmatrix} \begin{pmatrix} I_{k-2} & 0 \\ 0 & P(\pi_{k-2}) \end{pmatrix} \cdots \begin{pmatrix} 1 & 0 \\ 0 & P(\pi_1) \end{pmatrix} P(\pi_0).$$
Remark 6.21 In the previous definition, and from here on, we will typically not explicitly denote the di-
mension of a permutation matrix, since it can be deduced from the dimension of the matrix or the vector the
permutation is applied to.
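A minimal M-script sketch, not from the book, of applying P(p) to the rows of a matrix; here, as in the definitions above, each entry of p is a zero-based offset indicating which row of the remaining submatrix is swapped with the current row.

function B = ApplyPivots( p, B )
% Apply P(p) to the rows of B; p(i) = 0 leaves row i in place.
  for i = 1:length( p )
    j = i + p( i );
    B( [ i, j ], : ) = B( [ j, i ], : );
  end
end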
Algorithm: [A, p] := LUP_unb_var5_B(A) (left, basic) and [A, p] := LUP_unb_var5(A) (right, high-performance).
Both algorithms partition
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \qquad p \rightarrow \begin{pmatrix} p_T \\ p_B \end{pmatrix},$$
where $A_{TL}$ is $0 \times 0$ and $p_T$ has 0 elements, and, while $n(A_{TL}) < n(A)$, repartition
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}, \qquad \begin{pmatrix} p_T \\ p_B \end{pmatrix} \rightarrow \begin{pmatrix} p_0 \\ \pi_1 \\ p_2 \end{pmatrix},$$
where $\alpha_{11}$ and $\pi_1$ are scalars. The loop-bodies are:

Basic algorithm (left):
$\pi_1 := \mathrm{PivIndex}\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}$
$\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} := P(\pi_1) \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}$
$a_{21} := a_{21}/\alpha_{11}$
$A_{22} := A_{22} - a_{21} a_{12}^T$

High-performance algorithm (right):
$\pi_1 := \mathrm{PivIndex}\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}$
$\begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} := P(\pi_1) \begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$
$a_{21} := a_{21}/\alpha_{11}$
$A_{22} := A_{22} - a_{21} a_{12}^T$
Figure 6.2: Unblocked algorithms for the LU factorization with partial pivoting (Variant 5). Left: basic
algorithm. Right: High-performance algorithm.
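A minimal M-script sketch, not the book's code, of the high-performance algorithm on the right of Figure 6.2; the name LUP_unb_var5 is borrowed from the figure, the pivot entries are stored as zero-based offsets, and entire rows are swapped so that the final factors satisfy P(p) A = L U.

function [ A, p ] = LUP_unb_var5( A )
  n = size( A, 1 );
  p = zeros( n, 1 );
  for k = 1:n
    [ amax, idx ] = max( abs( A( k:n, k ) ) );      % PivIndex
    p( k ) = idx - 1;                               % zero-based offset
    j = k + p( k );
    A( [ k, j ], : ) = A( [ j, k ], : );            % swap entire rows
    A( k+1:n, k ) = A( k+1:n, k ) / A( k, k );      % a21 := a21/alpha11
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) - A( k+1:n, k ) * A( k, k+1:n );
  end
end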
In words, there exists a Gauss transform L̄k , of the same dimension and structure as Lk , such that P Lk = L̄k P .
Exercise 6.23 Prove Lemma 6.22.
The above lemma supports the following observation: according to (6.8), the basic LU factorization with
partial pivoting yields
$$L_{n-1}^{-1} P_{n-1} \cdots L_1^{-1} P_1 L_0^{-1} P_0 A = U,$$
or, equivalently,
$$A = P_0 L_0 P_1 L_1 \cdots P_{n-1} L_{n-1} U.$$
From the lemma, there exist Gauss transforms $L_k^{(j)}$, $0 \le k \le j < n$, such that
$$\begin{aligned}
A &= P_0 L_0 P_1 L_1 P_2 L_2 \cdots P_{n-1} L_{n-1} U \\
  &= P_0 P_1 L_0^{(1)} L_1 P_2 L_2 \cdots P_{n-1} L_{n-1} U \\
  &= P_0 P_1 L_0^{(1)} P_2 L_1^{(1)} L_2 \cdots P_{n-1} L_{n-1} U \\
  &= P_0 P_1 P_2 L_0^{(2)} L_1^{(2)} \cdots P_{n-1} L_{n-1} U \\
  &= \cdots \\
  &= P_0 P_1 P_2 \cdots P_{n-1} L_0^{(n-1)} L_1^{(n-1)} \cdots L_{n-1}^{(n-1)} U.
\end{aligned}$$
This observation establishes the following result: if the Gauss transforms and permutations computed by the basic algorithm satisfy
$$L_{n-1}^{-1} P_{n-1} \cdots L_1^{-1} P_1 L_0^{-1} P_0 A = U,$$
then there exists a lower triangular matrix L such that $P(p)A = LU$, with $p = (\pi_0, \pi_1, \ldots, \pi_{n-1})^T$.
Proof: L is given by $L = L_0^{(n-1)} L_1^{(n-1)} \cdots L_{n-1}^{(n-1)}$.
If L, U, and p satisfy $P(p)A = LU$, then $Ax = b$ can be solved by applying all permutations to b, followed
by two (clean) triangular solves:
$$Ax = b \;\Longrightarrow\; \underbrace{PA}_{LU}\, x = \underbrace{Pb}_{\bar b} \;\Longrightarrow\; LUx = \bar b.$$
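Combining the hypothetical sketches given earlier (none of which are the book's code), the complete solution procedure for given A and b then reads:

[ LU, p ] = LUP_unb_var5( A );
bbar = ApplyPivots( p, b );                    % bbar = P(p) b
L = tril( LU, -1 ) + eye( size( A ) );         % unit lower triangular factor
U = triu( LU );
y = L \ bbar;                                  % forward substitution
x = U \ y;                                     % back substitution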
Remark 6.25 It will become obvious that the family of algorithms for the LU factorization with pivoting
can be derived without the introduction of Gauss transforms or knowledge about how Gauss transforms and
permutation matrices can be reordered. They result from systematic application of the derivation techniques
to the operation that computes p, L, and U that satisfy the postcondition.
As usual, we start by deriving a PME for this operation: Partition A, L, and U into quadrants,
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}, \quad U \rightarrow \begin{pmatrix} U_{TL} & U_{TR} \\ 0 & U_{BR} \end{pmatrix},$$
Theorem 6.26 The expressions in (6.13) and (6.14) are equivalent to the simultaneous equations
$$\begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}, \qquad (6.15)$$
$$\bar L_{BL} = P(p_B)^T L_{BL}, \qquad (6.16)$$
$$\begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1, \qquad (6.17)$$
$$L_{TL} U_{TR} = \bar A_{TR}, \qquad (6.18)$$
$$L_{BR} U_{BR} = P(p_B)\left( \bar A_{BR} - \bar L_{BL} U_{TR} \right) \;\wedge\; |L_{BR}| \le 1, \qquad (6.19)$$
which together represent the PME for the LU factorization with partial pivoting.
Exercise 6.27 Prove Theorem 6.26.
Equations (6.15)–(6.19) have the following interpretation:
• Equations (6.15) and (6.16) are included for notational convenience. Equation (6.16) states that L̄BL
equals the final LBL except that its rows have not yet been permuted according to future computation.
• Equation (6.17) denotes an LU factorization with partial pivoting of the submatrices to the left of the
thick line in (6.13).
• Equations (6.18) and (6.19) indicate that $U_{TR}$ and $\{L\backslash U\}_{BR}$ result from permuting the submatrices to the
right of the thick line in (6.14), after which $U_{TR}$ is computed as $L_{TL}^{-1} \bar A_{TR}$,
and $\{L\backslash U\}_{BR}$ results from updating $\bar A_{BR}$ and performing an LU factorization with partial pivoting of that
quadrant. Equation (6.16) resurfaces here, since the permutations $p_B$ must also be applied to $\bar L_{BL}$ to
yield $L_{BL}$.
Variant 3a
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \hat A_{TR} \\ \bar L_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} \\ \hat A_{BL} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Variant 3b
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & \bar A_{TR} \\ \bar L_{BL} & \bar A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Variant 4
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ \bar L_{BL} & \bar A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; L_{TL} U_{TR} = \bar A_{TR} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Variant 5
$$\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{TL} & U_{TR} \\ \bar L_{BL} & \bar A_{BR} - \bar L_{BL} U_{TR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} \bar A_{TL} & \bar A_{TR} \\ \bar A_{BL} & \bar A_{BR} \end{pmatrix} = P(p_T) \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} U_{TL} = \begin{pmatrix} \bar A_{TL} \\ \bar A_{BL} \end{pmatrix} \;\wedge\; L_{TL} U_{TR} = \bar A_{TR} \;\wedge\; \left| \begin{pmatrix} L_{TL} \\ \bar L_{BL} \end{pmatrix} \right| \le 1$$
Figure 6.3: Four loop-invariants for the LU factorization with partial pivoting.
Equations (6.15)–(6.19) dictate an inherent order in which the computations must proceed:
• If one of pT , LT L , UT L , or L̄BL has been computed, so have the others. In other words, any loop invariant
must include the computation of all four of these results.
• In the loop-invariant, $A_{BR}$ can contain $\bar A_{BR} - \bar L_{BL} U_{TR}$ only if $A_{TR} = U_{TR}$, since that update requires $U_{TR}$.
These constraints yield the four feasible loop-invariants given in Figure 6.3. In that figure, the variant number
reflects the variant for the LU factorization without pivoting (Figure 6.1) that is most closely related. Variants 3a
and 3b differ only in that for Variant 3b the pivots computed so far have also been applied to the columns to
the right of the thick line in (6.14). Variants 1 and 2 from Figure 6.1 have no correspondence here, as pivoting
affects the entire rows of $\begin{pmatrix} A_{TL} \\ A_{BL} \end{pmatrix}$. In other words, in these two variants $L_{TL}$ and $U_{TL}$ have been computed
but $\bar L_{BL}$ has not, which, one can argue, is not possible for a feasible loop-invariant.
Let us fully elaborate the case labeled as Variant 5. Partitioning to expose the next rows and columns of A,
L, U and the next element of p, as usual, and substituting into the loop invariant yields
$$\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{00} & u_{01} & U_{02} \\ \bar l_{10}^T & \bar\alpha_{11} - \bar l_{10}^T u_{01} & \bar a_{12}^T - \bar l_{10}^T U_{02} \\ \bar L_{20} & \bar a_{21} - \bar L_{20} u_{01} & \bar A_{22} - \bar L_{20} U_{02} \end{pmatrix} \qquad (6.20)$$
$$\wedge\; \begin{pmatrix} \bar A_{00} & \bar a_{01} & \bar A_{02} \\ \bar a_{10}^T & \bar\alpha_{11} & \bar a_{12}^T \\ \bar A_{20} & \bar a_{21} & \bar A_{22} \end{pmatrix} = P(p_0) \begin{pmatrix} \hat A_{00} & \hat a_{01} & \hat A_{02} \\ \hat a_{10}^T & \hat\alpha_{11} & \hat a_{12}^T \\ \hat A_{20} & \hat a_{21} & \hat A_{22} \end{pmatrix} \;\wedge\; \begin{pmatrix} L_{00} \\ \bar l_{10}^T \\ \bar L_{20} \end{pmatrix} U_{00} = \begin{pmatrix} \bar A_{00} \\ \bar a_{10}^T \\ \bar A_{20} \end{pmatrix} \qquad (6.21)$$
$$\wedge\; L_{00} \begin{pmatrix} u_{01} & U_{02} \end{pmatrix} = \begin{pmatrix} \bar a_{01} & \bar A_{02} \end{pmatrix} \;\wedge\; \left| \begin{pmatrix} L_{00} \\ \bar l_{10}^T \\ \bar L_{20} \end{pmatrix} \right| \le 1. \qquad (6.22)$$
Similarly, after moving the thick lines, substitution into the loop-invariant yields
$$\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \{L\backslash U\}_{00} & u_{01} & U_{02} \\ l_{10}^T & \upsilon_{11} & u_{12}^T \\ \bar{\bar L}_{20} & \bar{\bar l}_{21} & \bar{\bar A}_{22} - \bar{\bar L}_{20} U_{02} - \bar{\bar l}_{21} u_{12}^T \end{pmatrix} \qquad (6.23)$$
$$\wedge\; \begin{pmatrix} \bar{\bar A}_{00} & \bar{\bar a}_{01} & \bar{\bar A}_{02} \\ \bar{\bar a}_{10}^T & \bar{\bar\alpha}_{11} & \bar{\bar a}_{12}^T \\ \bar{\bar A}_{20} & \bar{\bar a}_{21} & \bar{\bar A}_{22} \end{pmatrix} = P\!\left( \begin{pmatrix} p_0 \\ \pi_1 \end{pmatrix} \right) \begin{pmatrix} \hat A_{00} & \hat a_{01} & \hat A_{02} \\ \hat a_{10}^T & \hat\alpha_{11} & \hat a_{12}^T \\ \hat A_{20} & \hat a_{21} & \hat A_{22} \end{pmatrix} \qquad (6.24)$$
$$\wedge\; \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & 1 \\ \bar{\bar L}_{20} & \bar{\bar l}_{21} \end{pmatrix} \begin{pmatrix} U_{00} & u_{01} \\ 0 & \upsilon_{11} \end{pmatrix} = \begin{pmatrix} \bar{\bar A}_{00} & \bar{\bar a}_{01} \\ \bar{\bar a}_{10}^T & \bar{\bar\alpha}_{11} \\ \bar{\bar A}_{20} & \bar{\bar a}_{21} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & 1 \end{pmatrix} \begin{pmatrix} U_{02} \\ u_{12}^T \end{pmatrix} = \begin{pmatrix} \bar{\bar A}_{02} \\ \bar{\bar a}_{12}^T \end{pmatrix} \qquad (6.25)$$
$$\wedge\; \left| \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & 1 \\ \bar{\bar L}_{20} & \bar{\bar l}_{21} \end{pmatrix} \right| \le 1. \qquad (6.26)$$
A careful manipulation of the conditions after repartitioning, in (6.20)–(6.22), and the conditions after moving
the thick line, in (6.23)–(6.26), shows that the current contents of A must be updated by the steps
1. $\pi_1 := \mathrm{PivIndex}\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}$,
2. $\begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix} := P(\pi_1) \begin{pmatrix} a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$,
3. $a_{21} := a_{21}/\alpha_{11}$,
4. $A_{22} := A_{22} - a_{21} a_{12}^T$,
which are exactly the loop-body updates of the high-performance algorithm in Figure 6.2 (right).
The algorithms corresponding to Variants 3a, 4, and 5 are given in Figure 6.4. There, trilu(Ai,i ) stands
for the unit lower triangular matrix stored in Ai,i , i = 0, 1.
Figure 6.4: Unblocked and blocked algorithms for the LU factorization with partial pivoting.
6.5 The Cholesky Factorization

Let us examine how to derive different variants for computing this factorization. The precondition$^4$ and
postcondition of the operation are expressed, respectively, as
$$P_{pre}: A = \hat A$$
$$P_{post}: (\mathrm{tril}(A) = L) \wedge (L L^T = \hat A),$$
where, as usual, $\mathrm{tril}(A)$ denotes the lower triangular part of A. The postcondition implicitly specifies the
dimensions and lower triangular structure of L.
The triangular structure of L requires it to be partitioned into quadrants with square diagonal blocks, and
this requires A to be conformally partitioned into quadrants as well:
$$A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \quad\text{and}\quad L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix},$$
where
$$P_{cons}: m(A_{TL}) = n(A_{TL}) = m(L_{TL}) = n(L_{TL})$$
holds. Substituting these into the postcondition yields
$$\mathrm{tril}\!\left( \begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} \right) = \begin{pmatrix} \mathrm{tril}(A_{TL}) & 0 \\ A_{BL} & \mathrm{tril}(A_{BR}) \end{pmatrix} = \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\;\wedge\; \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \begin{pmatrix} L_{TL}^T & L_{BL}^T \\ 0 & L_{BR}^T \end{pmatrix} = \begin{pmatrix} \hat A_{TL} & ? \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}.$$
The “?” symbol is used in this expression and from now on to indicate a part of a symmetric matrix that is not
referenced. The second part of the postcondition can then be rewritten as the PME
$$\begin{array}{l l} L_{TL} L_{TL}^T = \hat A_{TL} & ? \\ L_{BL} L_{TL}^T = \hat A_{BL} & L_{BR} L_{BR}^T = \hat A_{BR} - L_{BL} L_{BL}^T, \end{array}$$
showing that $\hat A_{TL}$ must be factored before $L_{BL} := \hat A_{BL} L_{TL}^{-T}$ can be solved, and $L_{BL}$ itself is needed in order
to compute the update $\hat A_{BR} - L_{BL} L_{BL}^T$. These dependences result in the three feasible loop-invariants for the
Cholesky factorization in Figure 6.5. We present the unblocked and blocked algorithms that result from these
three invariants in Figure 6.6.
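For reference, a minimal M-script sketch, not the book's code, of the unblocked algorithm corresponding to Variant 1 (cf. Figure 6.6), which references and updates only the lower triangle of A:

function A = Chol_unb_var1( A )
% Right-looking Cholesky factorization; L overwrites the lower triangle.
  n = size( A, 1 );
  for k = 1:n
    A( k, k ) = sqrt( A( k, k ) );
    A( k+1:n, k ) = A( k+1:n, k ) / A( k, k );
    A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) ...
        - tril( A( k+1:n, k ) * A( k+1:n, k )' );   % update the lower triangle only
  end
end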
Exercise 6.29 Using the worksheet, show that the unblocked and blocked algorithms corresponding to the three
loop-invariants in Figure 6.5 are those given in Figure 6.6.
Exercise 6.30 Identify the type of operations that are performed in the blocked algorithms for the Cholesky
factorization in Figure 6.6 (right) as one of these types: trsm, gemm, chol, or syrk.
4 A complete precondition would also assert that A is positive definite in order to guarantee existence of the factorization.
Variant 1
$$\begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & ? \\ L_{BL} & \hat A_{BR} - L_{BL} L_{BL}^T \end{pmatrix}$$
Variant 2
$$\begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & ? \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}$$
Variant 3
$$\begin{pmatrix} A_{TL} & ? \\ A_{BL} & A_{BR} \end{pmatrix} = \begin{pmatrix} L_{TL} & ? \\ L_{BL} & \hat A_{BR} \end{pmatrix}$$
Figure 6.5: Three loop-invariants for the Cholesky factorization.
$$C_{chol}(n) = \frac{n^3}{3} \ \text{flops}.$$
Exercise 6.32 Show that the cost of the blocked algorithms for the Cholesky factorization is the same as that
of the unblocked algorithms.
Considering that n is an exact multiple of $n_b$ with $n_b \ll n$, what is the amount of flops that are performed in
terms of gemm?
6.5.2 Performance
The performance of the Cholesky factorization is similar to that of the LU factorization, which was studied in
Section 1.5.
6.6 Summary
In this chapter it was demonstrated that
• The FLAME techniques for deriving algorithms extend to more complex linear algebra operations.
• Algorithms for factorization operations can be cast in terms of matrix-matrix multiply, and its special
cases, so that high performance can be attained.
• Complex operations, like the LU factorization with partial pivoting, fit the mold.
This chapter completes the discussion of the basic techniques that underlie the FLAME methodology.
Unblocked algorithms (loop-body updates):
Variant 1: $\alpha_{11} := \sqrt{\alpha_{11}}$; $a_{21} := a_{21}/\alpha_{11}$; $A_{22} := A_{22} - a_{21} a_{21}^T$.
Variant 2: $a_{10}^T := a_{10}^T\, \mathrm{tril}(A_{00})^{-T}$; $\alpha_{11} := \alpha_{11} - a_{10}^T a_{10}$; $\alpha_{11} := \sqrt{\alpha_{11}}$.
Variant 3: $\alpha_{11} := \alpha_{11} - a_{10}^T a_{10}$; $\alpha_{11} := \sqrt{\alpha_{11}}$; $a_{21} := a_{21} - A_{20} a_{10}$; $a_{21} := a_{21}/\alpha_{11}$.

Blocked algorithms (loop-body updates):
Variant 1: $A_{11} := \mathrm{Chol\_unb}(A_{11})$; $A_{21} := A_{21}\, \mathrm{tril}(A_{11})^{-T}$; $A_{22} := A_{22} - A_{21} A_{21}^T$.
Variant 2: $A_{10} := A_{10}\, \mathrm{tril}(A_{00})^{-T}$; $A_{11} := A_{11} - A_{10} A_{10}^T$; $A_{11} := \mathrm{Chol\_unb}(A_{11})$.
Variant 3: $A_{11} := A_{11} - A_{10} A_{10}^T$; $A_{11} := \mathrm{Chol\_unb}(A_{11})$; $A_{21} := A_{21} - A_{20} A_{10}^T$; $A_{21} := A_{21}\, \mathrm{tril}(A_{11})^{-T}$.

Figure 6.6: Unblocked (left) and blocked (right) algorithms for the Cholesky factorization.
Appendix A
The Use of Letters

We attempt to be very consistent with our notation in this book as well as in FLAME related papers, the
FLAME website, and the linear algebra wiki.
As mentioned in Remark 3.1, lowercase Greek letters and lowercase Roman letters will be used to denote scalars and
vectors, respectively. Uppercase Roman letters will be used for matrices. Exceptions to this rule are variables
that denote the (integer) dimensions of the vectors and matrices, which are denoted by lowercase Roman letters
to follow the traditional convention.
The letters used for a matrix, vectors that appear as submatrices of that matrix (e.g., its columns), and
elements of that matrix are chosen in a consistent fashion. Similarly, letters used for a vector and elements of
that vector are chosen to correspond. This consistent choice is indicated in Figure A.1. In that table we do not
claim that the Greek letters used are the Greek letters that correspond to the indicated Roman letters; we are
merely indicating what letters we chose.
Figure A.1: Correspondence between letters used for matrices (uppercase Roman), vectors (lowercase Roman)
and the symbols used to denote their scalar entries (lowercase Greek letters).
Appendix B
Summary of FLAME/C Routines
In this appendix, we list a number of routines supported as part of the current implementation of the FLAME
library.
Additional Information
For information on the library, libFLAME, that uses the APIs and techniques discussed in this book, and on the
functionality supported by that library, visit
A Quick Reference guide can be downloaded from
http://www.cs.utexas.edu/users/flame/Publications/
B.1 Parameters
A number of parameters can be passed in that indicate how FLAME objects are to be used. These are
summarized in Fig. B.1.
FLA_Init( )
Initialize FLAME.
FLA_Finalize( )
Finalize FLAME.
Partitioning, etc.
General operations
Note: the name of the FLA Axpy routine comes from the BLAS routine daxpy, which stands for double precision
alpha times vector x plus vector y. We have generalized this routine to also work with matrices.
Scalar operations
Vector-vector operations
Note: some of the below operations also appear above under “General operations”. Traditional users of the
BLAS would expect them to appear under the heading “Vector-vector operations,” which is why we repeat
them.
Matrix-vector operations
As for the vector-vector operations, we adopt a naming convention that is very familiar to those who have used
traditional level-2 BLAS routines. The name FLA XXYY encodes the following information:
XX Meaning
Ge General rectangular matrix.
Tr One of the operands is a triangular matrix.
Sy One of the operands is a symmetric matrix.
YY Meaning
mv Matrix-vector multiplication.
sv Solution of a linear system.
r Rank-1 update.
r2 Rank-2 update.
FLA_Gemv( FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
y := αoptrans (A)x + βy.
FLA_Ger( FLA_Obj alpha, FLA_Obj x, FLA_Obj y, FLA_Obj A )
A := αxy T + A.
FLA_Symv( FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
y := αAx + βy, where A is symmetric and stored in the upper or lower triangular part of A, as indicated by uplo.
FLA_Syr( FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj x, FLA_Obj A )
A := αxxT + A, where A is symmetric and stored in the upper or lower triangular part of A, as indicated by uplo.
FLA_Syr2( FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj x, FLA_Obj y, FLA_Obj A )
A := αxy T + αyxT + A, where A is symmetric and stored in the upper or lower triangular part of A, as indicated by uplo.
FLA_Syr2k( FLA_Uplo uplo, FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := α(optrans (A)optrans (B)T + optrans (B)optrans (A)T ) + βC, where C is symmetric and stored in the upper or lower triangular
part of C, as indicated by uplo.
FLA_Trmv( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj A, FLA_Obj x )
x := optrans (A)x, where A is upper or lower triangular, as indicated by uplo.
FLA_Trmv_x( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
Update y := αoptrans (A)x + βy, where A is upper or lower triangular, as indicated by uplo.
FLA_Trsv( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj A, FLA_Obj x )
x := optrans (A)−1 x, where A is upper or lower triangular, as indicated by uplo.
FLA_Trsv_x( FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj x, FLA_Obj beta, FLA_Obj y )
y := αoptrans (A)−1 x + βy, where A is upper or lower triangular, as indicated by uplo.
Matrix-matrix operations
As for the vector-vector and matrix-vector operations, we adopt a naming convention that is very familiar to
those who have used traditional level-3 BLAS routines. FLA XXYY in the name encodes
XX Meaning
Ge General rectangular matrix.
Tr One of the operands is a triangular matrix.
Sy One of the operands is a symmetric matrix.
YY Meaning
mm Matrix-matrix multiplication.
sm Solution of a linear system with multiple right-hand sides.
rk Rank-k update.
r2k Rank-2k update.
FLA_Gemm( FLA_Trans transA, FLA_Trans transB, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αoptransA (A)optransB (B) + βC.
FLA_Symm( FLA_Side side, FLA_Uplo uplo, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αAB + βC or C := αBA + βC, where A is symmetric, side indicates the side from which A multiplies B, uplo indicates
whether A is stored in the upper or lower triangular part of A.
FLA_Syr2k( FLA_Uplo uplo, FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := α(optrans (A)optrans (B)T + optrans (B)optrans (A)T ) + βC, where C is symmetric and stored in the upper or lower triangular
part of C, as indicated by uplo.
FLA_Syrk( FLA_Uplo uplo, FLA_Trans trans, FLA_Obj alpha, FLA_Obj A, FLA_Obj beta, FLA_Obj C )
C := αoptrans (A)optrans (A)T + βC, where C is symmetric and stored in the upper or lower triangular part of C, as indicated
by uplo.
FLA_Trmm( FLA_Side side, FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj alpha, FLA_Obj A, FLA_Obj B )
B := αoptrans (A)B (side == FLA LEFT) or B := αBoptrans (A) (side == FLA RIGHT), where A is upper or lower triangular, as
indicated by uplo.
FLA_Trmm_x( FLA_Side side, FLA_Uplo uplo, FLA_Trans transA, FLA_Trans transB, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αoptransA (A)optransB (B) + βC (side == FLA LEFT) or C := αoptransB (B)optransA (A) + βC (side == FLA RIGHT) where A
is upper or lower triangular, as indicated by uplo.
FLA_Trsm( FLA_Side side, FLA_Uplo uplo, FLA_Trans trans, FLA_Diag diag, FLA_Obj alpha, FLA_Obj A, FLA_Obj B )
B := αoptrans (A)−1 B (SIDE == FLA LEFT) or B := αBoptrans (A)−1 (SIDE == FLA RIGHT) where A is upper or lower triangular,
as indicated by uplo.
FLA_Trsm_x( FLA_Side side, FLA_Uplo uplo, FLA_Trans transA, FLA_Trans transB, FLA_Diag diag,
FLA_Obj alpha, FLA_Obj A, FLA_Obj B, FLA_Obj beta, FLA_Obj C )
C := αoptransA (A)−1 optransB (B) + βC (SIDE == FLA LEFT) or C := αoptransB (B)optransA (A)−1 + βC (SIDE == FLA RIGHT)
where A is upper or lower triangular, as indicated by uplo.
Bibliography
[1] Satish Balay, William Gropp, Lois Curfman McInnes, and Barry Smith. PETSc 2.0 Users Manual. Technical
Report ANL-95/11, Argonne National Laboratory, Oct. 1996. 4.3
[2] Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.
PhD thesis, 2006. 1.6
[3] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortı́, and Robert A. van de
Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Soft., 31(1):1–26, 2005.
1.2
[4] Paolo Bientinesi, Enrique S. Quintana-Ortı́, and Robert A. van de Geijn. Representing linear algebra
algorithms in code: the FLAME application program interfaces. ACM Trans. Math. Soft., 31(1):27–59,
2005. 1.2
[5] P. D. Crout. A short method for evaluating determinants and solving systems of linear equations with real
or complex coefficients. AIEE Trans., 60:1235–1240, 1941. 1.3
[6] James W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[7] E. W. Dijkstra. A constructive approach to the problem of program correctness. BIT, 8:174–186, 1968. 1.3
[8] E. W. Dijkstra. A discipline of programming. Prentice-Hall, 1976. 1.3
[9] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 basic linear algebra
subprograms. ACM Trans. Math. Soft., 16(1):1–17, March 1990. 4.3, B.5
[10] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of
FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1–17, March 1988. 4.3, B.5
[11] R. W. Floyd. Assigning meanings to programs. In J. T. Schwartz, editor, Symposium on Applied Mathe-
matics, volume 19, pages 19–32. American Mathematical Society, 1967. 1.3
[12] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press,
Baltimore, 3rd edition, 1996.
[13] Kazushige Goto and Robert A. van de Geijn. On reducing TLB misses in matrix multiplication. ACM
Trans. Math. Soft., 2006. To appear. 5.19
[14] David Gries. The Science of Programming. Springer-Verlag, 1981.
[15] David Gries and Fred B. Schneider. A Logical Approach to Discrete Math. Texts and Monographs in
Computer Science. Springer-Verlag, 1992. 2.3, 2.5
[16] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal linear
algebra methods environment. ACM Trans. Math. Soft., 27(4):422–455, December 2001. 1.2
[17] John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. A family of high-performance matrix
multiplication algorithms. In Vassil N. Alexandrov, Jack J. Dongarra, Benjoe A. Juliano, René S. Renner,
and C.J. Kenneth Tan, editors, Computational Science - ICCS 2001, Part I, Lecture Notes in Computer
Science 2073, pages 51–60. Springer-Verlag, 2001. 5.26
[18] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, second edition, 2002.
[19] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, pages
576–580, October 1969. 1.3
[20] Leslie Lamport. LATEX: A Document Preparation System. Addison-Wesley, Reading, MA, 2nd edition,
1994.
[21] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran
usage. ACM Trans. Math. Soft., 5(3):308–323, Sept. 1979. 4.3, B.5
[22] C. Moler, J. Little, and S. Bangert. Pro-Matlab, User’s Guide. The Mathworks, Inc., 1987. 4.2
[23] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The
Complete Reference. The MIT Press, 1996. 4.3
[24] G. W. Stewart. Introduction to Matrix Computations. Academic Press, Orlando, Florida, 1973.
[25] G. W. Stewart. Matrix Algorithms Volume 1: Basic Decompositions. SIAM, 1998. 6.2.2
[26] Gilbert Strang. Linear Algebra and its Application, Third Edition. Academic Press, 1988.
[27] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.
[28] Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997. 4.3
[29] David S. Watkins. Fundamentals of Matrix Computations. John Wiley & Sons, Inc., New York, 2nd edition,
2002. 6.3
Index
ej , 30 download, 77
element, 28 FLAME/C, 139
element growth, 122 bidimensional partitioning, 81
equality, 10 creators, 141
equations, 43 destructors, 141
finalizing, 71, 141
factorization FLA Cont with 1x3 to 1x2, 81
Cholesky, 113 FLA Cont with 3x1 to 2x1, 79
LU, 113 FLA Cont with 3x3 to 2x2, 83
fetch, 88 FLA Finalize, 71
final result, 17 FLA Init, 71, 141
FLA Cont with 1x3 to 1x2 FLA Merge 1x2, 84
FLAME@lab, 68 FLA Merge 2x1, 84
FLAME/C, 81 FLA Merge 2x2, 84
FLA Cont with 3x1 to 2x1 FLA Obj attach buffer, 73
FLAME@lab, 65 FLA Obj buffer, 73
FLAME/C, 79 FLA Obj create, 71
FLA Cont with 3x3 to 2x2 FLA Obj create without buffer, 73
FLAME@lab, 69 FLA Obj datatype, 73
FLAME/C, 83 FLA Obj free, 72
FLA Finalize, 71 FLA Obj free without buffer, 75
FLA Init, 71, 141 FLA Obj ldim, 73
lab FLA Obj length, 73
download, 70 FLA Obj show, 75
FLAME project, 2 FLA Obj width, 73
FLAME@lab, 62 FLA Part 1x2, 79
bidimensional partitioning, 68 FLA Part 2x1, 77
FLA Cont with 1x3 to 1x2, 68 FLA Part 2x2, 81
FLA Cont with 3x1 to 2x1, 65 FLA Repart 1x2 to 1x3, 79
FLA Cont with 3x3 to 2x2, 69 FLA Repart 2x1 to 3x1, 78
FLA Part 1x2, 66 FLA Repart 2x2 to 3x3, 83
FLA Part 2x1, 63 horizontal partitioning, 77
FLA Part 2x2, 68 initializing, 71, 141
FLA Repart 1x2 to 1x3, 66 inquiry routines, 141
FLA Repart 2x1 to 3x1, 64 manipulating objects, 71, 141
FLA Repart 2x2 to 3x3, 68 object
horizontal partitioning, 62 print contents, 75, 143
vertical partitioning, 66 operations
FLAMEC Cholesky factorization, 143
peak, 8 scalar, 28
permutation, 124 dot
PivIndex(·), 125 cost, 88
pivot row, 123 shape
PME, 17 gebp, 93
postcondition, 13, 15, 16 gemm, 93
precondition, 13, 15, 16 gemp, 93
predicate, 13 gemv, 93
Preface, v gepb, 93
Principle of Mathematical Induction, 22 gepdot, 93
processor gepm, 93
model, 87 gepp, 93
programming language ger, 93
C, 5, 61 side, 63
Fortran, 5 small, 92
Haskell, 6 SPD, 113
LabView G, 6 S, 13
M-script, 5, 61 stability, 8
Mathematica, 6 analysis, 8
proof of correctness, 15 state, 13
after moving the thick lines, 19
quadrant, 68 determining, 19
Quick Reference Guide, 139 after repartitioning, 18
RAM, 87 determining, 18
rank, 31 store, 88
rank-1 update, 27, 40 SyLw(·), 105
ger, 40 SyLw(·), 132
cost, 88 symm, 111
definition, 40 blocked Variant 1, 110
PMEs, 42 cost, 108
real number, 20, 28 loop-invariants, 109
recursive, 8, 55 performance, 108
reference, 77 PME 1, 107
registers, 88 symmetric positive definite, 113
Repartition ..., 12 symv, 59
right-hand side vector, 44 syr, 59
syr2, 59
scal, 25 syr2k, 111
cost, 88 syrk, 111
blocked Variant 2, 53
unblocked Variant 2, 45
www.linearalgebrawiki.org, ix