
Low Rank Approximation

Lecture 1
Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
[email protected]

1
Organizational aspects

- Lectures: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
- Exercises: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
- Exam: Miniproject + oral exam.
- Webpage: https://anchp.epfl.ch/lowrank
- [email protected], [email protected]

2
From http://www.niemanlab.org:

... his [Aleksandr Kogan's] message went on to confirm that his approach was indeed similar to SVD or other matrix factorization methods, like in the Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model.

3
Rank and basic properties
For a field F, let A ∈ F^{m×n}. Then

    rank(A) := dim(range(A)).

For simplicity, F = R throughout the lecture and often m ≥ n.

Lemma
Let A ∈ R^{m×n}. Then
1. rank(A^T) = rank(A);
2. rank(PAQ) = rank(A) for invertible matrices P ∈ R^{m×m}, Q ∈ R^{n×n};
3. rank(AB) ≤ min{rank(A), rank(B)} for any matrix B ∈ R^{n×p};
4. rank([A11 A12; 0 A22]) ≥ rank(A11) + rank(A22) for A11 ∈ R^{m1×n1}, A12 ∈ R^{m1×n2}, A22 ∈ R^{m2×n2}.
Proof: See Linear Algebra 1 / Exercises.

4
Rank and matrix factorizations
Let B = {b1, . . . , br} ⊂ R^m with r = rank(A) be a basis of range(A).
Then each of the columns of A = [a1, a2, . . . , an] can be expressed as a
linear combination of B:

    ai = b1 ci1 + b2 ci2 + · · · + br cir = [b1, . . . , br] [ci1; . . . ; cir],

for some coefficients cij ∈ R with i = 1, . . . , n, j = 1, . . . , r.

Stacking these relations column by column gives

    [a1, . . . , an] = [b1, . . . , br] [c11 · · · cn1; . . . ; c1r · · · cnr].

5
Rank and matrix factorizations
Lemma. A matrix A ∈ R^{m×n} of rank r admits a factorization of the form

    A = BC^T,   B ∈ R^{m×r},   C ∈ R^{n×r}.

We say that A has low rank if rank(A) ≪ m, n.

Illustration of low-rank factorization:

              A     BC^T
    #entries  mn    mr + nr

- Generically (and in most applications), A has full rank, that is, rank(A) = min{m, n}.
- Aim instead at approximating A by a low-rank matrix; a storage comparison is sketched below.
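A minimal MATLAB sketch of the storage comparison above; the sizes m, n, r below are hypothetical and chosen only for illustration:

    % build a rank-r matrix from its factors and compare storage
    m = 2000; n = 1000; r = 20;
    B = randn(m, r); C = randn(n, r);
    A = B*C';                        % full matrix: m*n = 2,000,000 entries
    entries_full     = m*n;          % 2,000,000
    entries_factored = m*r + n*r;    % 60,000, i.e. roughly 33x fewer
    rank(A)                          % returns 20 (up to roundoff)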
6
Questions addressed in lecture series

What? Theoretical foundations of low-rank approximation.
When? A priori and a posteriori estimates for low-rank approximation. Situations that allow for low-rank approximation techniques.
Why? Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How? State-of-the-art algorithms for performing and working with low-rank approximations.

Will cover both matrices and tensors.

7
Literature for Lecture 1

Golub/Van Loan'2013: Golub, Gene H.; Van Loan, Charles F. Matrix Computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.
Horn/Johnson'2013: Horn, Roger A.; Johnson, Charles R. Matrix Analysis. Second edition. Cambridge University Press, 2013.
+ References on slides.

8
1. Fundamental tools
- SVD
- Relation to eigenvalues
- Norms
- Best low-rank approximation

9
The singular value decomposition
Theorem (SVD). Let A ∈ R^{m×n} with m ≥ n. Then there are orthogonal
matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

    A = UΣV^T,   with Σ = [diag(σ1, . . . , σn); 0] ∈ R^{m×n}

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.
- σ1, . . . , σn are called singular values
- u1, . . . , un are called left singular vectors
- v1, . . . , vn are called right singular vectors
- Avi = σi ui, A^T ui = σi vi for i = 1, . . . , n (verified numerically below).
- Singular values are always uniquely defined by A.
- Singular vectors are never unique. If σ1 > σ2 > · · · > σn > 0, then they are unique up to ui ← ±ui, vi ← ±vi.
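As a quick sanity check (not part of the original slides), the relations above can be verified in MATLAB on a random matrix of hypothetical size:

    m = 8; n = 5;
    A = randn(m, n);
    [U, S, V] = svd(A);                 % U: m-by-m, S: m-by-n, V: n-by-n
    sigma = diag(S);                    % singular values, decreasing
    norm(A*V(:,1) - sigma(1)*U(:,1))    % ~1e-15, i.e. A*v1 = sigma1*u1
    norm(A'*U(:,1) - sigma(1)*V(:,1))   % ~1e-15, i.e. A'*u1 = sigma1*v1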

10
SVD: Sketch of proof
Induction over n. n = 1 trivial.
For general n, let v1 solve max{‖Av‖2 : ‖v‖2 = 1} =: ‖A‖2. Set σ1 := ‖A‖2
and u1 := Av1/σ1.¹ By definition,

    Av1 = σ1 u1.

After completion to orthogonal matrices U1 = [u1, U⊥] ∈ R^{m×m} and
V1 = [v1, V⊥] ∈ R^{n×n}:

    U1^T A V1 = [u1^T A v1, u1^T A V⊥; U⊥^T A v1, U⊥^T A V⊥] = [σ1, w^T; 0, A1],

with w := V⊥^T A^T u1 and A1 := U⊥^T A V⊥. Because ‖·‖2 is invariant under
orthogonal transformations,

    σ1 = ‖A‖2 = ‖U1^T A V1‖2 = ‖[σ1, w^T; 0, A1]‖2 ≥ √(σ1^2 + ‖w‖2^2).

Hence, w = 0. Proof completed by applying induction to A1.

¹ If σ1 = 0, choose an arbitrary u1.
11
Very basic properties of the SVD

- r = rank(A) is the number of nonzero singular values of A (numerical illustration below).
- kernel(A) = span{v_{r+1}, . . . , vn}
- range(A) = span{u1, . . . , ur}
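A small MATLAB sketch of these properties; the test matrix and the tolerance below are illustrative assumptions (in floating-point arithmetic, exact zero singular values show up as tiny nonzero values, so one counts values above a tolerance):

    A = randn(6, 4) * randn(4, 5);      % rank 4 by construction
    s = svd(A);
    tol = max(size(A)) * eps(s(1));     % tolerance as used by MATLAB's rank
    r = sum(s > tol)                    % returns 4
    [U, ~, V] = svd(A);
    range_basis  = U(:, 1:r);           % orthonormal basis of range(A)
    kernel_basis = V(:, r+1:end);       % orthonormal basis of kernel(A)
    norm(A * kernel_basis)              % ~1e-15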

12
SVD: Computation (for small dense matrices)
Computation of SVD proceeds in two steps:
1. Reduction to bidiagonal form: By applying n Householder reflectors from
   the left and n − 1 Householder reflectors from the right, compute
   orthogonal matrices U1, V1 such that

       U1^T A V1 = B = [B1; 0],

   that is, B1 ∈ R^{n×n} is an upper bidiagonal matrix.
2. Reduction to diagonal form: Use Divide&Conquer to compute orthogonal
   matrices U2, V2 such that Σ = U2^T B1 V2 is diagonal.
Set U = U1 U2 and V = V1 V2.
Step 1 is usually the most expensive. Remarks on Step 1:
- If m is significantly larger than n, say, m ≥ 3n/2, first computing a QR
  decomposition of A reduces the cost (see the sketch below).
- Most modern implementations reduce A successively via banded form to
  bidiagonal form.²

² Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581–601.
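The first remark can be sketched in MATLAB as follows; this is a hedged illustration of the idea, not the actual library implementation. For a tall matrix one first computes an economy QR decomposition and then the SVD of the small triangular factor:

    m = 3000; n = 100;                  % hypothetical sizes with m >= 3n/2
    A = randn(m, n);
    [Q, R] = qr(A, 0);                  % economy QR: Q is m-by-n, R is n-by-n
    [UR, S, V] = svd(R);                % SVD of the small n-by-n factor
    U = Q*UR;                           % left singular vectors of A
    norm(A - U*S*V')                    % ~1e-12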


13
SVD: Computation (for small dense matrices)
In most applications, the vectors u_{n+1}, . . . , um are not of interest. By
omitting these vectors one obtains the following variant of the SVD.
Theorem (Economy size SVD). Let A ∈ R^{m×n} with m ≥ n. Then there is a
matrix U ∈ R^{m×n} with orthonormal columns and an orthogonal matrix
V ∈ R^{n×n} such that

    A = UΣV^T,   with Σ = diag(σ1, . . . , σn) ∈ R^{n×n}

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.
Computed by MATLAB's [U,S,V] = svd(A,'econ').
Complexity:
                          memory         operations
    singular values only  O(mn)          O(mn^2)
    economy size SVD      O(mn)          O(mn^2)
    (full) SVD            O(m^2 + mn)    O(m^2 n + mn^2)
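For comparison, a small MATLAB illustration of the three variants listed in the table (sizes are hypothetical):

    m = 1000; n = 50;
    A = randn(m, n);
    s = svd(A);                     % singular values only
    [U, S, V] = svd(A, 'econ');     % economy size: U is 1000-by-50, S and V are 50-by-50
    [Uf, Sf, Vf] = svd(A);          % full SVD: Uf is 1000-by-1000, Sf is 1000-by-50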

14
SVD: Computation (for small dense matrices)
Beware of roundoff error when interpreting singular value plots.
Example: semilogy(svd(hilb(100)))

[Figure: semilog plot of the computed singular values of hilb(100); they decay rapidly and then flatten out (a kink) around the roundoff level.]

- The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
- Exact singular values are known to decay exponentially.³
- Sometimes more accuracy is possible.⁴

³ Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.
⁴ Drmač, Z.; Veselić, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342.


15
Singular/eigenvalue relations: symmetric matrices

Symmetric A = A^T ∈ R^{n×n} admits a spectral decomposition

    A = U diag(λ1, λ2, . . . , λn) U^T

with an orthogonal matrix U.

After reordering, we may assume |λ1| ≥ |λ2| ≥ · · · ≥ |λn|. The spectral
decomposition can be turned into an SVD A = UΣV^T by defining

    Σ = diag(|λ1|, . . . , |λn|),   V = U diag(sign(λ1), . . . , sign(λn)).

Remark: This extends to the more general case of normal matrices (e.g., orthogonal or skew-symmetric matrices) via complex spectral or real Schur decompositions. A small numerical sketch of the symmetric case follows below.
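A minimal MATLAB sketch of the symmetric case, assuming (as is generic) that no eigenvalue is zero; eig does not order eigenvalues by magnitude, so they are reordered first:

    n = 5;
    A = randn(n); A = (A + A')/2;             % symmetric test matrix
    [U, L] = eig(A); lambda = diag(L);
    [~, p] = sort(abs(lambda), 'descend');    % reorder by decreasing |lambda_i|
    U = U(:, p); lambda = lambda(p);
    Sigma = diag(abs(lambda));
    V = U * diag(sign(lambda));
    norm(A - U*Sigma*V')                      % ~1e-15: a valid SVD of A
    norm(svd(A) - abs(lambda))                % singular values are |eigenvalues|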

16
Singular/eigenvalue relations: general matrices
Consider the SVD A = UΣV^T of A ∈ R^{m×n} with m ≥ n. We then have:
1. Spectral decomposition of the Gramian

       A^T A = V Σ^T Σ V^T = V diag(σ1^2, . . . , σn^2) V^T

   A^T A has eigenvalues σ1^2, . . . , σn^2;
   the right singular vectors of A are eigenvectors of A^T A.
2. Spectral decomposition of the Gramian

       A A^T = U Σ Σ^T U^T = U diag(σ1^2, . . . , σn^2, 0, . . . , 0) U^T

   A A^T has eigenvalues σ1^2, . . . , σn^2 and, additionally, m − n zero
   eigenvalues;
   the first n left singular vectors of A are eigenvectors of A A^T.
3. Decomposition of the Golub-Kahan matrix

       [0 A; A^T 0] = [U 0; 0 V] [0 Σ; Σ^T 0] [U 0; 0 V]^T

   The eigenvalues of the Golub-Kahan matrix are ±σj, j = 1, . . . , n, and
   zero (m − n times); see the numerical check below.
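Item 3 can be checked numerically in MATLAB; the sizes below are illustrative:

    m = 6; n = 4;
    A = randn(m, n);
    GK = [zeros(m) A; A' zeros(n)];           % Golub-Kahan matrix
    ev = sort(eig(GK), 'descend');
    sv = svd(A);
    norm(ev(1:n) - sv)                        % ~1e-15: largest n eigenvalues are +sigma_j
    norm(ev(end-n+1:end) + flip(sv))          % ~1e-15: smallest n eigenvalues are -sigma_j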

17
Norms: Spectral and Frobenius norm

Given the SVD A = UΣV^T, one defines:

- Spectral norm: ‖A‖2 = σ1.
- Frobenius norm: ‖A‖F = √(σ1^2 + · · · + σn^2).

Basic properties:
- ‖A‖2 = max{‖Av‖2 : ‖v‖2 = 1} (see proof of SVD).
- ‖·‖2 and ‖·‖F are both (submultiplicative) matrix norms.
- ‖·‖2 and ‖·‖F are both unitarily invariant, that is,

      ‖QAZ‖2 = ‖A‖2,   ‖QAZ‖F = ‖A‖F

  for any orthogonal matrices Q, Z.
- ‖A‖2 ≤ ‖A‖F ≤ √r ‖A‖2, where r = rank(A).
- ‖AB‖F ≤ min{‖A‖2 ‖B‖F, ‖A‖F ‖B‖2} (quick numerical checks below).
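These properties are easy to check numerically in MATLAB (the test matrices and sizes are arbitrary):

    A = randn(7, 5);
    s = svd(A);
    abs(norm(A) - s(1))                             % spectral norm equals sigma_1
    abs(norm(A, 'fro') - norm(s))                   % Frobenius norm equals sqrt(sum sigma_i^2)
    [Q, ~] = qr(randn(7)); [Z, ~] = qr(randn(5));   % random orthogonal factors
    abs(norm(Q*A*Z) - norm(A))                      % unitary invariance, ~1e-15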

18
Euclidean geometry on matrices
Let B ∈ R^{n×n} have eigenvalues λ1, . . . , λn ∈ C. Then

    trace(B) := b11 + · · · + bnn = λ1 + · · · + λn.

In turn,

    ‖A‖F^2 = trace(A^T A) = trace(A A^T) = Σ_{i,j} a_{ij}^2.

Two simple consequences:

- ‖·‖F is the norm induced by the matrix inner product

      ⟨A, B⟩ := trace(A B^T),   A, B ∈ R^{m×n}.

- Partition A = [a1, a2, . . . , an] and define the vectorization

      vec(A) = [a1; . . . ; an] ∈ R^{mn}.

  Then ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖F = ‖vec(A)‖2 (checked numerically below).
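A quick MATLAB consistency check of the last two identities; the matrices are arbitrary, and A(:) plays the role of vec(A):

    A = randn(4, 3); B = randn(4, 3);
    abs(trace(A*B') - A(:)'*B(:))         % <A,B> = trace(A*B') = <vec(A),vec(B)>
    abs(norm(A, 'fro') - norm(A(:)))      % ||A||_F = ||vec(A)||_2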


19
Von Neumann’s trace inequality
Theorem
For m ≥ n, let A, B ∈ R^{m×n} have singular values σ1(A) ≥ · · · ≥ σn(A)
and σ1(B) ≥ · · · ≥ σn(B), respectively. Then

    |⟨A, B⟩| ≤ σ1(A)σ1(B) + · · · + σn(A)σn(B).

Proof of von Neumann's trace inequality in the lecture notes.⁵

Consequence:

    ‖A − B‖F^2 = ⟨A − B, A − B⟩ = ‖A‖F^2 − 2⟨A, B⟩ + ‖B‖F^2
               ≥ ‖A‖F^2 − 2 Σ_{i=1}^{n} σi(A)σi(B) + ‖B‖F^2
               = Σ_{i=1}^{n} (σi(A) − σi(B))^2.

⁵ This proof follows [Grigorieff, R. D. Note on von Neumann's trace inequality. Math. Nachr. 151 (1991), 327–328]. For Mirsky's ingenious proof based on doubly stochastic matrices, see Theorem 8.7.6 in [Horn/Johnson'2013].
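A numerical illustration of the inequality on random matrices (a sanity check, not a proof):

    A = randn(6, 4); B = randn(6, 4);
    lhs = abs(trace(A*B'));               % |<A,B>|
    rhs = svd(A)' * svd(B);               % sum_i sigma_i(A) sigma_i(B)
    lhs <= rhs                            % returns logical 1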
20
Schatten norms
There are other unitarily invariant matrix norms.⁶
Let s(A) = (σ1, . . . , σn). The p-Schatten norm defined by

    ‖A‖(p) := ‖s(A)‖p

is a matrix norm for any 1 ≤ p ≤ ∞.

p = ∞: spectral norm, p = 2: Frobenius norm, p = 1: nuclear norm.
Definition
The dual of a matrix norm ‖·‖ on R^{m×n} is defined by

    ‖A‖^D = max{⟨A, B⟩ : ‖B‖ = 1}.

Lemma
Let p, q ∈ [1, ∞] such that p^{-1} + q^{-1} = 1. Then

    ‖A‖(p)^D = ‖A‖(q).

⁶ Complete characterization via symmetric gauge functions in [Horn/Johnson'2013].
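In MATLAB, the p-Schatten norm can be evaluated directly from the vector of singular values; the anonymous function below is a hypothetical helper for illustration, not a built-in:

    A = randn(6, 4);
    s = svd(A);
    schatten = @(p) norm(s, p);    % p-Schatten norm of A
    schatten(Inf)                  % spectral norm, equals norm(A)
    schatten(2)                    % Frobenius norm, equals norm(A,'fro')
    schatten(1)                    % nuclear norm, equals sum(s)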


21
Best low-rank approximation

Consider k < n and let

    Uk := [u1 · · · uk],   Σk := diag(σ1, . . . , σk),   Vk := [v1 · · · vk].

Then

    Tk(A) := Uk Σk Vk^T

has rank at most k. For any unitarily invariant norm ‖·‖:

    ‖Tk(A) − A‖ = ‖diag(0, . . . , 0, σk+1, . . . , σn)‖.

In particular, for the spectral norm and the Frobenius norm:

    ‖A − Tk(A)‖2 = σk+1,   ‖A − Tk(A)‖F = √(σk+1^2 + · · · + σn^2).

The two are nearly equal if and only if the singular values decay
sufficiently quickly. (A numerical check of both formulas follows below.)
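A minimal MATLAB sketch of the truncation Tk(A) and of both error formulas (sizes and k are illustrative):

    A = randn(8, 6); k = 2;
    [U, S, V] = svd(A);
    Tk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';        % rank-k truncation
    s = diag(S);
    abs(norm(A - Tk) - s(k+1))                     % spectral error = sigma_{k+1}
    abs(norm(A - Tk, 'fro') - norm(s(k+1:end)))    % Frobenius error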

22
Best low-rank approximation
Theorem (Schmidt-Mirsky). Let A ∈ R^{m×n}. Then

    ‖A − Tk(A)‖ = min{‖A − B‖ : B ∈ R^{m×n} has rank at most k}

holds for any unitarily invariant norm ‖·‖.

Proof⁷ for ‖·‖F: Follows directly from the consequence of von Neumann's
trace inequality.
Proof for ‖·‖2: For any B ∈ R^{m×n} of rank ≤ k, kernel(B) has dimension
≥ n − k. Hence, there exists w ∈ kernel(B) ∩ range(V_{k+1}) with ‖w‖2 = 1.
Then

    ‖A − B‖2^2 ≥ ‖(A − B)w‖2^2 = ‖Aw‖2^2 = ‖A V_{k+1} V_{k+1}^T w‖2^2
              = ‖U_{k+1} Σ_{k+1} V_{k+1}^T w‖2^2
              = Σ_{j=1}^{k+1} σj^2 |vj^T w|^2 ≥ σ_{k+1}^2 Σ_{j=1}^{k+1} |vj^T w|^2 = σ_{k+1}^2.

⁷ See Section 7.4.9 in [Horn/Johnson'2013] for the general case.


23
Best low-rank approximation
Uniqueness:
- If σk > σk+1, the best rank-k approximation with respect to the Frobenius
  norm is unique.
- If σk = σk+1, the best rank-k approximation is never unique. For example,
  I3 has several best rank-two approximations:

      [1 0 0; 0 1 0; 0 0 0],   [1 0 0; 0 0 0; 0 0 1],   [0 0 0; 0 1 0; 0 0 1].

- With respect to the spectral norm, the best rank-k approximation is only
  unique if σk+1 = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has
  infinitely many best rank-two approximations (checked numerically below):

      [2 0 0; 0 1 0; 0 0 0],   [2−ε/2 0 0; 0 1−ε/2 0; 0 0 0],
      [2−ε/3 0 0; 0 1−ε/3 0; 0 0 0],   . . .
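The non-uniqueness in the spectral norm can be verified numerically in MATLAB; ε = 0.5 below is an arbitrary choice:

    e = 0.5; A = diag([2 1 e]);
    B1 = diag([2 1 0]);
    B2 = diag([2-e/2, 1-e/2, 0]);
    B3 = diag([2-e/3, 1-e/3, 0]);
    [norm(A-B1), norm(A-B2), norm(A-B3)]   % all equal e = sigma_3
    [rank(B1), rank(B2), rank(B3)]         % all equal 2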

24
Approximating the range of a matrix
Aim at finding a matrix Q ∈ R^{m×k} with orthonormal columns such that

    range(Q) ≈ range(A).

I − QQ^T is the orthogonal projector onto range(Q)^⊥. Aim at minimizing

    ‖(I − QQ^T)A‖ = ‖A − QQ^T A‖

for a unitarily invariant norm ‖·‖. Because rank(QQ^T A) ≤ k,

    ‖A − QQ^T A‖ ≥ ‖A − Tk(A)‖.

Setting Q = Uk one obtains

    Uk Uk^T A = Uk Uk^T U Σ V^T = Uk Σk Vk^T = Tk(A).

Q = Uk is optimal; a small numerical illustration follows below.
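A small MATLAB illustration of the optimal choice Q = Uk; the test matrix and k are hypothetical:

    A = randn(50, 30) * randn(30, 40);
    k = 10;
    [U, S, V] = svd(A);
    Q = U(:, 1:k);
    s = diag(S);
    abs(norm(A - Q*(Q'*A)) - s(k+1))      % projection error equals sigma_{k+1}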

25
Approximating the range of a matrix

Variation:

    max{‖Q^T A‖F : Q^T Q = Ik}.

This is equivalent to

    max{|⟨A A^T, Q Q^T⟩| : Q^T Q = Ik}.

By von Neumann's trace inequality and the correspondence between the
eigenvectors of A A^T and the left singular vectors of A, an optimal Q is
given by Uk.

26
