
Low Rank Approximation

Lecture 1
Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
[email protected]

1
Organizational aspects

I Lectures: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
I Exercises: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
I Exam: Miniproject + oral exam.
I Webpage: https://anchp.epfl.ch/lowrank.
I [email protected], [email protected]

2
From http://www.niemanlab.org:

... his [Aleksandr Kogan’s] message went on to confirm that his approach was
indeed similar to SVD or other matrix factorization methods, like in the
Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model.
Dimensionality reduction of Facebook data was the core of his model.

3
Rank and basic properties
For a field F, let A ∈ Fm×n . Then

rank(A) := dim(range(A)).

For simplicity, F = R throughout the lecture and often m ≥ n.


Lemma
Let A ∈ Rm×n . Then
1. rank(AT ) = rank(A);
2. rank(PAQ) = rank(A) for invertible matrices P ∈ Rm×m ,
Q ∈ Rn×n ;
3. rank(AB) ≤ min{rank(A), rank(B)} for any matrix B ∈ Rn×p ;
4. rank([A11 A12 ; 0 A22 ]) ≥ rank(A11 ) + rank(A22 ) for A11 ∈ Rm1×n1 ,
   A12 ∈ Rm1×n2 , A22 ∈ Rm2×n2 (with equality, e.g., if A12 = 0).
Proof: See Linear Algebra 1 / Exercises.
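
A quick numerical sanity check of properties 1-3 of the lemma above; this is
only an illustrative sketch in MATLAB (not part of the original slides), using
random test matrices of arbitrary size:

  % Sanity check of rank properties on random matrices.
  m = 6; n = 4; p = 5;
  A = randn(m, 2) * randn(2, n);      % rank-2 matrix A in R^{m x n}
  P = randn(m); Q = randn(n);         % invertible with probability 1
  B = randn(n, p);
  [rank(A), rank(A')]                 % property 1: both equal 2
  [rank(P*A*Q), rank(A)]              % property 2: both equal 2
  rank(A*B) <= min(rank(A), rank(B))  % property 3: returns true (1)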

4
Rank and matrix factorizations
Let B = {b1 , . . . , br } ⊂ Rm with r = rank(A) be a basis of range(A).
Then each column of A = [a1 , a2 , . . . , an ] can be expressed as a linear
combination of B:

   ai = b1 ci1 + b2 ci2 + · · · + br cir = [b1 , . . . , br ] [ci1 ; . . . ; cir ],

for some coefficients cij ∈ R with i = 1, . . . , n, j = 1, . . . , r .


Stacking these relations column by column:

   [a1 , . . . , an ] = [b1 , . . . , br ] [c11 · · · cn1 ; . . . ; c1r · · · cnr ],

that is, A = BC T with B = [b1 , . . . , br ] ∈ Rm×r and C = (cij ) ∈ Rn×r .

5
Rank and matrix factorizations
Lemma. A matrix A ∈ Rm×n of rank r admits a factorization of the
form
A = BC T , B ∈ Rm×r , C ∈ Rn×r .

We say that A has low rank if rank(A) ≪ m, n.


Illustration of low-rank factorization (storage):

              A     BC T
   #entries   mn    mr + nr

I Generically (and in most applications), A has full rank, that is,
rank(A) = min{m, n}.
I Aim instead at approximating A by a low-rank matrix.
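
A small illustration of the storage count in the table above; a sketch in
MATLAB with arbitrarily chosen sizes, not part of the original slides:

  % Store a rank-r matrix as two thin factors instead of mn entries.
  m = 1000; n = 800; r = 10;
  B = randn(m, r); C = randn(n, r);
  A = B * C';                        % full matrix: m*n entries
  fprintf('full: %d entries, factored: %d entries\n', m*n, m*r + n*r);
  rank(A)                            % equals r = 10 (up to roundoff)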
6
Questions addressed in lecture series

What? Theoretical foundations of low-rank approximation.


When? A priori and a posteriori estimates for low-rank
approximation. Situations that allow for low-rank
approximation techniques.
Why? Applications in engineering, scientific computing, data
analysis, ... where low-rank approximation plays a
central role.
How? State-of-the-art algorithms for performing and working
with low-rank approximations.

Will cover both matrices and tensors.

7
Literature for Lecture 1

Golub/Van Loan’2013  Golub, Gene H.; Van Loan, Charles F. Matrix computations.
    Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.
Horn/Johnson’2013  Horn, Roger A.; Johnson, Charles R. Matrix analysis.
    Second edition. Cambridge University Press, 2013.
+ References on slides.

8
1. Fundamental tools
I SVD
I Relation to eigenvalues
I Norms
I Best low-rank approximation

9
The singular value decomposition
Theorem (SVD). Let A ∈ Rm×n with m ≥ n. Then there are
orthogonal matrices U ∈ Rm×m and V ∈ Rn×n such that

   A = UΣV T ,  with Σ = [diag(σ1 , . . . , σn ) ; 0] ∈ Rm×n
and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.
I σ1 , . . . , σn are called singular values
I u1 , . . . , un are called left singular vectors
I v1 , . . . , vn are called right singular vectors
I Avi = σi ui , AT ui = σi vi for i = 1, . . . , n (checked numerically below).
I Singular values are always uniquely defined by A.
I Singular vectors are never unique. If σ1 > σ2 > · · · > σn > 0, they are
  unique up to ui ← ±ui , vi ← ±vi .
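
A brief numerical check of the theorem and of the relations Avi = σi ui,
AT ui = σi vi ; a sketch on a random matrix (MATLAB, not from the original
slides):

  % Verify the SVD relations on a random 7-by-4 matrix.
  m = 7; n = 4;
  A = randn(m, n);
  [U, S, V] = svd(A);                        % U: m-by-m, S: m-by-n, V: n-by-n
  sigma = diag(S);                           % singular values, descending
  norm(A*V - U(:,1:n)*diag(sigma))           % A*v_i = sigma_i*u_i, ~1e-15
  norm(A'*U(:,1:n) - V*diag(sigma))          % A'*u_i = sigma_i*v_i, ~1e-15
  norm(U'*U - eye(m)) + norm(V'*V - eye(n))  % orthogonality, ~1e-15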

10
SVD: Sketch of proof
Induction over n. n = 1 trivial.
For general n, let v1 solve max{kAv k2 : kv k2 = 1} =: kAk2 . Set
σ1 := kAk2 and u1 := Av1 /σ1 .1 By definition, Av1 = σ1 u1 .

After completion to orthogonal matrices U1 = [u1 , U⊥ ] ∈ Rm×m and
V1 = [v1 , V⊥ ] ∈ Rn×n :

   U1T AV1 = [u1T Av1 , u1T AV⊥ ; U⊥T Av1 , U⊥T AV⊥ ] = [σ1 , w T ; 0 , A1 ],

with w := V⊥T AT u1 and A1 := U⊥T AV⊥ . Because k · k2 is invariant under
orthogonal transformations,

   σ1 = kAk2 = kU1T AV1 k2 = k[σ1 , w T ; 0 , A1 ]k2 ≥ sqrt(σ1² + kwk2²).

Hence, w = 0. Proof completed by applying induction to A1 .

1 If σ1 = 0, choose an arbitrary u1 .
11
Very basic properties of the SVD

I r = rank(A) is the number of nonzero singular values of A (numerical
  illustration below).
I kernel(A) = span{vr+1 , . . . , vn }
I range(A) = span{u1 , . . . , ur }
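
In floating-point arithmetic, counting "nonzero" singular values requires a
tolerance. A minimal sketch (the tolerance choice mirrors common practice and
is an assumption, not a prescription from the slides):

  % Numerical rank, range and kernel bases from the SVD.
  A = randn(8, 3) * randn(3, 6);         % rank 3 by construction
  [U, S, V] = svd(A);
  s = diag(S);
  tol = max(size(A)) * eps(s(1));        % tolerance relative to sigma_1
  r = nnz(s > tol)                       % numerical rank: 3
  rangeA  = U(:, 1:r);                   % orthonormal basis of range(A)
  kernelA = V(:, r+1:end);               % orthonormal basis of kernel(A)
  norm(A * kernelA)                      % ~1e-15: A vanishes on its kernel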

12
SVD: Computation (for small dense matrices)
Computation of SVD proceeds in two steps:
1. Reduction to bidiagonal form: By applying n Householder
   reflectors from the left and n − 1 Householder reflectors from the right,
   compute orthogonal matrices U1 , V1 such that

      U1T AV1 = B = [B1 ; 0],

   that is, B1 ∈ Rn×n is an upper bidiagonal matrix.
2. Reduction to diagonal form: Use Divide&Conquer to compute
orthogonal matrices U2 , V2 such that Σ = U2T B1 V2 is diagonal.
Set U = U1 U2 and V = V1 V2 .
Step 1 is usually the most expensive. Remarks on Step 1:
I If m is significantly larger than n, say, m ≥ 3n/2, first computing a
  QR decomposition of A reduces the cost (see the sketch below).
I Most modern implementations reduce A successively via banded
  form to bidiagonal form.2
2 Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM

Trans. Math. Software 26 (2000), no. 4, 581–601.
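
A sketch of the QR-first idea mentioned above for tall matrices; this is an
illustrative MATLAB recipe, not the exact LAPACK implementation:

  % Economy SVD of a tall matrix via an initial QR decomposition.
  m = 3000; n = 200;
  A = randn(m, n);
  [Q, R] = qr(A, 0);                 % economy QR: Q is m-by-n, R is n-by-n
  [UR, S, V] = svd(R);               % small n-by-n SVD
  U = Q * UR;                        % left singular vectors of A
  norm(A - U*S*V', 'fro') / norm(A, 'fro')   % ~1e-15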


13
SVD: Computation (for small dense matrices)
In most applications, vectors un+1 , . . . , um are not of interest. By
omitting these vectors one obtains the following variant of the SVD.
Theorem (Economy size SVD). Let A ∈ Rm×n with m ≥ n. Then
there is a matrix U ∈ Rm×n with orthonormal columns and an
orthogonal matrix V ∈ Rn×n such that

   A = UΣV T ,  with Σ = diag(σ1 , . . . , σn ) ∈ Rn×n

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.
Computed by MATLAB’s [U,S,V] = svd(A,’econ’).

Complexity:
                           memory        operations
   singular values only    O(mn)         O(mn²)
   economy size SVD        O(mn)         O(mn²)
   (full) SVD              O(m² + mn)    O(m²n + mn²)

14
SVD: Computation (for small dense matrices)
Beware of roundoff error when interpreting singular value plots.
Example: semilogy(svd(hilb(100)))

[Figure: singular values of hilb(100) on a logarithmic scale (10^0 down to
10^-20); the computed values decay rapidly and then flatten near 10^-20.]

I Kink is caused by roundoff error and does not reflect true


behavior of singular values.
I Exact singular values are known to decay exponentially.3
I Sometimes more accuracy is possible.4
3 Beckermann, B. The condition number of real Vandermonde, Krylov and positive

definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.


4 Drmač, Z.; Veselić, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J.

Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342


15
Singular/eigenvalue relations: symmetric matrices

Symmetric A = AT ∈ Rn×n admits spectral decomposition

A = U diag(λ1 , λ2 , . . . , λn )U T

with orthogonal matrix U.


After reordering may assume |λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |. Spectral
decomposition can be turned into SVD A = UΣV T by defining

Σ = diag(|λ1 |, . . . , |λn |), V = U diag(sign(λ1 ), . . . , sign(λn )).

Remark: This extends to the more general case of normal matrices
(e.g., orthogonal or skew-symmetric) via the complex spectral or the real
Schur decomposition.
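
A small sketch of this construction in MATLAB (assumes a random symmetric test
matrix; sign(0) is mapped to 1 so that V stays orthogonal):

  % Build an SVD of a symmetric matrix from its spectral decomposition.
  A = randn(5); A = (A + A') / 2;        % symmetric test matrix
  [U, D] = eig(A); d = diag(D);
  [~, p] = sort(abs(d), 'descend');      % reorder: |lambda_1| >= ... >= |lambda_n|
  U = U(:, p); d = d(p);
  s = sign(d); s(s == 0) = 1;            % avoid zero signs
  Sigma = diag(abs(d)); V = U * diag(s);
  norm(A - U*Sigma*V', 'fro')            % ~1e-15: reproduces A
  norm(V'*V - eye(5))                    % V is orthogonal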

16
Singular/eigenvalue relations: general matrices
Consider SVD A = UΣV T of A ∈ Rm×n with m ≥ n. We then have:
1. Spectral decomposition of the Gramian AT A:
      AT A = V ΣT ΣV T = V diag(σ1², . . . , σn²)V T
   AT A has eigenvalues σ1², . . . , σn²;
   the right singular vectors of A are eigenvectors of AT A.
2. Spectral decomposition of the Gramian AAT :
      AAT = UΣΣT U T = U diag(σ1², . . . , σn², 0, . . . , 0)U T
   AAT has eigenvalues σ1², . . . , σn² and, additionally, m − n zero
   eigenvalues;
   the first n left singular vectors of A are eigenvectors of AAT .
3. Decomposition of the Golub-Kahan matrix:
      [0 , A ; AT , 0] = [U , 0 ; 0 , V ] [0 , Σ ; ΣT , 0] [U , 0 ; 0 , V ]T
   the eigenvalues of the Golub-Kahan matrix are ±σj , j = 1, . . . , n, and
   zero (m − n times) (see the numerical check below).
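
These relations are easy to confirm numerically; a sketch on a random
rectangular matrix, comparing eigenvalues after sorting (not part of the
original slides):

  % Eigenvalues of A'*A and of the Golub-Kahan matrix vs. singular values.
  m = 6; n = 4;
  A = randn(m, n);
  s = svd(A);
  eigGram = sort(eig(A'*A), 'descend');
  norm(eigGram - s.^2)                   % eigenvalues of A'A are sigma_i^2
  GK = [zeros(m), A; A', zeros(n)];      % Golub-Kahan matrix
  eigGK = sort(eig(GK), 'descend');
  norm(eigGK(1:n) - s)                   % largest eigenvalues are +sigma_i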

17
Norms: Spectral and Frobenius norm

Given SVD A = UΣV T , one defines:


I Spectral norm: kAk2 = σ1 .
I Frobenius norm: kAkF = sqrt(σ1² + · · · + σn²).
Basic properties:
I kAk2 = max{kAv k2 : kv k2 = 1} (see proof of SVD).
I k · k2 and k · kF are both (submultiplicative) matrix norms.
I k · k2 and k · kF are both unitarily invariant, that is

kQAZ k2 = kAk2 , kQAZ kF = kAkF

for any orthogonal matrices Q, Z .



I kAk2 ≤ kAkF ≤ √r kAk2 , where r = rank(A) (checked numerically below)
I kABkF ≤ min{kAk2 kBkF , kAkF kBk2 }
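
A quick numerical confirmation of unitary invariance and the two inequalities
above; a sketch where Q and Z are random orthogonal factors obtained from QR:

  % Check unitary invariance and the norm inequalities on random matrices.
  A = randn(6, 4); B = randn(4, 5);
  [Q, ~] = qr(randn(6)); [Z, ~] = qr(randn(4));    % random orthogonal matrices
  abs(norm(Q*A*Z) - norm(A))                       % spectral norm invariant, ~1e-15
  abs(norm(Q*A*Z, 'fro') - norm(A, 'fro'))         % Frobenius norm invariant
  norm(A) <= norm(A, 'fro')                        % true
  norm(A, 'fro') <= sqrt(rank(A)) * norm(A)        % true
  norm(A*B, 'fro') <= norm(A) * norm(B, 'fro')     % true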

18
Euclidean geometry on matrices
Let B ∈ Rn×n have eigenvalues λ1 , . . . , λn ∈ C. Then

   trace(B) := b11 + · · · + bnn = λ1 + · · · + λn .

In turn,

   kAkF² = trace(AT A) = trace(AAT ) = Σi,j aij² .

Two simple consequences:


I k · kF is the norm induced by the matrix inner product

hA, Bi := trace(AB T ), A, B ∈ Rm×n .


 
I Partition A = [a1 , a2 , . . . , an ] and define the vectorization

      vec(A) = [a1 ; a2 ; . . . ; an ] ∈ Rmn .

  Then hA, Bi = hvec(A), vec(B)i and kAkF = k vec(A)k2 ; a quick numerical
  check follows.
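
A one-line check of these identities (sketch; A(:) is MATLAB's vectorization):

  % Trace inner product vs. vectorized inner product.
  A = randn(4, 3); B = randn(4, 3);
  abs(trace(A*B') - A(:)'*B(:))          % ~1e-15: <A,B> = <vec(A),vec(B)>
  abs(norm(A, 'fro') - norm(A(:)))       % ~1e-16: ||A||_F = ||vec(A)||_2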


19
Von Neumann’s trace inequality
Theorem
For m ≥ n, let A, B ∈ Rm×n have singular values σ1 (A) ≥ · · · ≥ σn (A)
and σ1 (B) ≥ · · · ≥ σn (B), respectively. Then

|hA, Bi| ≤ σ1 (A)σ1 (B) + · · · + σn (A)σn (B).

Proof of Von Neumann’s trace inequality in lecture notes.5


Consequence:

   kA − BkF² = hA − B, A − Bi = kAkF² − 2hA, Bi + kBkF²
             ≥ kAkF² − 2 Σi σi (A)σi (B) + kBkF²
             = Σi (σi (A) − σi (B))² ,

with sums over i = 1, . . . , n.

5 This proof follows [Grigorieff, R. D. Note on von Neumann’s trace inequality. Math.
Nachr. 151 (1991), 327–328]. For Mirsky’s ingenious proof based on doubly stochastic
matrices, see Theorem 8.7.6 in [Horn/Johnson’2013].
20
Schatten norms
There are other unitarily invariant matrix norms.6
Let s(A) = (σ1 , . . . , σn ). The p-Schatten norm defined by

kAk(p) := ks(A)kp

is a matrix norm for any 1 ≤ p ≤ ∞.


p = ∞: spectral norm, p = 2: Frobenius norm, p = 1: nuclear norm.
Definition
The dual of a matrix norm k · k on Rm×n is defined by

kAkD = max{hA, Bi : kBk = 1}.

Lemma
Let p, q ∈ [1, ∞] such that 1/p + 1/q = 1. Then

   kAk(p)D = kAk(q) ,

i.e., the dual of the p-Schatten norm is the q-Schatten norm.

6 Complete characterization via symmetric gauge functions in [Horn/Johnson’2013].


21
Best low-rank approximation

Consider k < n and let

   Uk := [u1 · · · uk ],  Σk := diag(σ1 , . . . , σk ),  Vk := [v1 · · · vk ].

Then
   Tk (A) := Uk Σk VkT
has rank at most k . For any unitarily invariant norm k · k:

   kTk (A) − Ak = k diag(0, . . . , 0, σk+1 , . . . , σn )k.

In particular, for the spectral norm and the Frobenius norm:

   kA − Tk (A)k2 = σk+1 ,   kA − Tk (A)kF = sqrt(σk+1² + · · · + σn²).

The two errors are nearly equal if and only if the singular values decay
sufficiently quickly.
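
A sketch of Tk (A) in MATLAB and a check of the two error formulas (random
test matrix, k chosen arbitrarily; not part of the original slides):

  % Truncated SVD T_k(A) and its approximation errors.
  A = randn(50, 30); k = 5;
  [U, S, V] = svd(A, 'econ');
  Tk = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';     % best rank-k approximation
  s = diag(S);
  abs(norm(A - Tk) - s(k+1))                  % spectral error = sigma_{k+1}
  abs(norm(A - Tk, 'fro') - norm(s(k+1:end))) % Frobenius error = tail of sigmas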

22
Best low-rank approximation
Theorem (Schmidt-Mirsky). Let A ∈ Rm×n . Then

   kA − Tk (A)k = min{kA − Bk : B ∈ Rm×n has rank at most k}
holds for any unitarily invariant norm k · k.

Proof7 for k · kF : Follows directly from the consequence of Von
Neumann’s trace inequality.
Proof for k · k2 : For any B ∈ Rm×n of rank ≤ k , kernel(B) has
dimension ≥ n − k . Hence, ∃w ∈ kernel(B) ∩ range(Vk+1 ) with
kwk2 = 1. Then

   kA − Bk2² ≥ k(A − B)wk2² = kAwk2² = kAVk+1 Vk+1^T wk2²
            = kUk+1 Σk+1 Vk+1^T wk2²
            = Σj=1..k+1 σj² |vj^T w|² ≥ σk+1² Σj=1..k+1 |vj^T w|² = σk+1² .

7 See Section 7.4.9 in [Horn/Johnson’2013] for the general case.


23
Best low-rank approximation
Uniqueness:
I If σk > σk+1 the best rank-k approximation with respect to the
  Frobenius norm is unique.
I If σk = σk+1 the best rank-k approximation is never unique. For
  example, I3 has several best rank-two approximations:

     diag(1, 1, 0),   diag(1, 0, 1),   diag(0, 1, 1).

I With respect to the spectral norm the best rank-k approximation is
  only unique if σk+1 = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has
  infinitely many best rank-two approximations:

     diag(2, 1, 0),  diag(2 − ε/2, 1 − ε/2, 0),  diag(2 − ε/3, 1 − ε/3, 0),  . . .

24
Approximating the range of a matrix
Aim at finding a matrix Q ∈ Rm×k with orthonormal columns such that

range(Q) ≈ range(A).

I − QQ T is the orthogonal projector onto range(Q)⊥


Aim at minimizing

k(I − QQ T )Ak = kA − QQ T Ak

for unitarily invariant norm k · k. Because rank(QQ T A) ≤ k ,

kA − QQ T Ak ≥ kA − Tk (A)k.

Setting Q = Uk one obtains

Uk UkT A = Uk UkT UΣV T = Uk Σk VkT = Tk (A).

Q = Uk is optimal.
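
A small numerical illustration of the optimality of Q = Uk ; a sketch that
compares the projection error against σk+1 and against a random orthonormal Q:

  % Projection error ||A - Q*Q'*A|| for Q = U_k versus a random Q.
  A = randn(60, 40); k = 8;
  [U, S, ~] = svd(A, 'econ'); s = diag(S);
  Qopt = U(:, 1:k);
  norm(A - Qopt*(Qopt'*A)) - s(k+1)          % ~1e-15: attains the lower bound
  [Qrand, ~] = qr(randn(60, k), 0);          % random orthonormal columns
  norm(A - Qrand*(Qrand'*A)) >= s(k+1)       % true: cannot beat sigma_{k+1}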

25
Approximating the range of a matrix

Variation:
max{kQ T AkF : Q T Q = Ik }.
Equivalent to
max{|hAAT , QQ T i| : Q T Q = Ik }.
By Von Neumann’s trace inequality and the correspondence between
eigenvectors of AAT and left singular vectors of A, an optimal Q is given by
Uk .

26
