CS 532 Lecture Notes
Rebekah Dix
Contents
1 Elements of Machine Learning
4 Least Squares
4.1 Geometric Approach
4.2 Vector Calculus Approach
4.2.1 Review of Vector Calculus
4.2.2 Application to Least Squares
4.3 Positive Definite Matrices
4.4 Subspaces
4.5 Least Squares with Orthonormal Basis for Subspace
4.5.1 Orthogonal Matrices and Orthonormal Basis
4.5.2 Back to LS
4.5.3 Gram-Schmidt Orthogonalization Algorithm
7 Singular Value Decomposition
7.1 Interpretation of SVD
7.2 Low-Rank Approximation
9 Matrix Completion
9.1 Iterative Singular Value Thresholding
10 Iterative Solvers
10.1 Gradient Descent/Landweber Iteration
11 Regularized Regression
11.1 Proximal Gradient Algorithm
11.2 LASSO (Least absolute selection and shrinkage operator)
1 Elements of Machine Learning
1. Collect data
4. Generate training samples: a large collection of examples we can use to learn the
model.
5. Loss function: To learn the model, we choose a loss function (i.e. a measure of how
well a model fits the data)
6. Learn the model: Search over a collection of candidate models or model parameters
to find one that minimizes the loss on training data.
7. Characterize generalization error (the error of our predictions on new data that was
not used for training).
$$Xw = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} w = \begin{bmatrix} x_1^T w \\ x_2^T w \\ \vdots \\ x_n^T w \end{bmatrix} \quad (2)$$
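A small NumPy sketch of equation (2): each entry of $Xw$ is the inner product of one row of X with w. The matrix and vector below are arbitrary examples.

import numpy as np

# Equation (2): X @ w stacks the inner products x_i^T w.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # rows x_1^T, ..., x_4^T
w = rng.standard_normal(3)

row_by_row = np.array([X[i] @ w for i in range(X.shape[0])])
print(np.allclose(X @ w, row_by_row))   # True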
Matrix-matrix multiplication:
Tw is the expected preferences of a customer who weights the representative taste profiles
of T with the weights given in w.
• The jth column of X is a weighted sum of the columns of T, where the jth column of W tells us the weights:
$$x_j = T w_j \quad (4)$$
That is, the tastes (preferences) of the jth customer.
• The ith row of X is $x_i^T = t_i^T W$, where $t_i^T$ is the ith row of T. This gives us how much each customer likes movie i.
$$TW = \begin{bmatrix} T_1 & T_2 & \cdots & T_r \end{bmatrix} \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_r^T \end{bmatrix} = \sum_{k=1}^{r} T_k w_k^T \quad (6)$$
(a sum of rank-1 matrices; TW has rank r if and only if the columns of T and the rows of W are each linearly independent). In this representation, we can think of $T_k$ as the kth representative taste profile and $w_k^T$ as the kth row of W, i.e. the affinity of each customer with the kth representative profile.
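A small NumPy sketch of equation (6): the product TW equals the sum of the rank-1 outer products $T_k w_k^T$. The sizes and entries below are arbitrary examples.

import numpy as np

# Equation (6): T @ W as a sum of rank-1 outer products, where T_k is the
# k-th column of T and w_k^T is the k-th row of W.
rng = np.random.default_rng(1)
n, r, p = 5, 3, 4          # illustrative sizes (movies x profiles x customers)
T = rng.standard_normal((n, r))
W = rng.standard_normal((r, p))

rank_one_sum = sum(np.outer(T[:, k], W[k, :]) for k in range(r))
print(np.allclose(T @ W, rank_one_sum))   # True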
Definition 2. (Matrix rank) The rank of a matrix is the maximum number of linearly independent columns. The rank of a matrix is at most the smaller of its two dimensions.
3 Linear Systems and Vector Norms
Example 2. (Condition on rank(A) for existence of an exact solution)
Consider the linear system of equations Ax = b. This means that b is a weighted sum of the columns of A. Suppose A is full rank. Now consider the matrix $[A \; b]$. If the rank of $[A \; b]$ were greater than the rank of A (since the number of columns of the matrix increased by 1 and A is assumed full rank, this would imply the rank is rank(A) + 1), this would mean that b could not be written as a linear combination of the columns of A, and that the system would not have an exact solution. Therefore, we must have that $\mathrm{rank}([A \; b]) = \mathrm{rank}(A)$ in order for the system Ax = b to have an exact solution.
To see how the definition of linear independence applies here, observe that $Ax = b \implies Ax - b = 0$. Therefore
$$\begin{bmatrix} A & b \end{bmatrix} \begin{bmatrix} x \\ -1 \end{bmatrix} = 0 \quad (8)$$
Thus, if Ax = b has an exact solution, then $[A \; b]$ does not have linearly independent columns.
We want to solve Ax = b.
• This system has an exact solution, since $\mathrm{rank}(A) = \mathrm{rank}([A \; b])$. This follows since the columns of A are linearly dependent, so A has rank 1, and b is a multiple of the columns of A, so the rank of $[A \; b]$ is also 1.
• Note that $1 = \mathrm{rank}(A) < \dim(x) = 2$. Therefore this system does not have a unique solution.
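A NumPy sketch of these rank checks. The specific A and b below are hypothetical stand-ins chosen to match the description (a rank-1 A whose columns are equal, and a b that is a multiple of those columns).

import numpy as np

# Hypothetical stand-ins for the example's A and b.
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])
b = np.array([3.0, 6.0])

rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))
print(rank_A, rank_Ab)        # 1 1 -> an exact solution exists
print(rank_A < A.shape[1])    # True -> the solution is not unique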
1. $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$.
2. $\|x\| = 0$ if and only if $x = 0$.
4 Least Squares
We are given:
1. Vector of labels $y \in \mathbb{R}^n$
2. Allows us to compute derivatives
Figure 2: Geometry of LS in R2
Observations:
• The question we're trying to answer: What is the point in $\mathrm{col}(X)$ that has the shortest distance to y? In $\mathbb{R}^2$, what are the weights $\beta_1$ and $\beta_2$ such that $\beta_1 x_1 + \beta_2 x_2$ has the shortest distance to y?
• $\mathrm{col}(X)$ is the space of all vectors that can be written as $\alpha x_1 + \beta x_2$ for some $\alpha, \beta \in \mathbb{R}$, that is, the span of the columns of X. y may not lie in this space.
• The residual vector will form a right angle with $\mathrm{col}(X)$, because any other angle would correspond to a longer distance.
4.2 Vector Calculus Approach
4.2.1 Review of Vector Calculus
Let w be a p-vector and let f be a function of w that maps $\mathbb{R}^p$ to $\mathbb{R}$. Then the gradient of f with respect to w is
$$\nabla_w f(w) = \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_p} \end{bmatrix} \quad (12)$$
$$f(x) = x^T Q x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i Q_{ij} x_j$$
Therefore, the term $x_i Q_{ij} x_j$ contributes to $[\nabla_x f]_k = \frac{\partial f}{\partial x_k}$
$$\begin{cases} 2Q_{kk} x_k & i = j = k \\ Q_{kj} x_j & i = k,\; j \neq k \\ Q_{ik} x_i & j = k,\; i \neq k \end{cases} \quad (15)$$
Therefore
$$\nabla_x f = (Q + Q^T) x \quad (16)$$
If Q is symmetric, then this equals $2Qx$.
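A NumPy sketch that checks equation (16) numerically against a central finite difference; Q and x are arbitrary test values.

import numpy as np

# Check: the gradient of f(x) = x^T Q x is (Q + Q^T) x.
rng = np.random.default_rng(2)
n = 4
Q = rng.standard_normal((n, n))
x = rng.standard_normal(n)

analytic = (Q + Q.T) @ x
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ Q @ (x + eps * e) - (x - eps * e) @ Q @ (x - eps * e)) / (2 * eps)
    for e in np.eye(n)      # central differences along each coordinate
])
print(np.allclose(analytic, numeric, atol=1e-4))   # True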
4.2.2 Application to Least Squares
Let $f(w) = \|y - Xw\|_2^2$. Then the least squares problem is
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2$$
and
$$\nabla_w f(w) = -2X^T y + 2X^T X w$$
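Setting this gradient to zero leads to the normal equations $X^T X w = X^T y$. The following NumPy sketch solves them on synthetic data; the sizes, seed, and "true" weights are arbitrary choices.

import numpy as np

# Minimal sketch: solve the normal equations X^T X w = X^T y on synthetic data.
rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_hat, 2))                                  # close to w_true
print(np.allclose(X.T @ (y - X @ w_hat), 0.0, atol=1e-8))  # gradient of f is ~0 at w_hat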
Proof.
1. If $P \succ 0$ and $Q \succ 0$, then $P + Q \succ 0$.
5. Notation: $A \succeq B$ means $A - B \succeq 0$.
Example 8. Let
$$X = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} \quad (19)$$
Then
$$X^T X = \begin{bmatrix} 3 & 3 \\ 3 & 3 \end{bmatrix} \quad (20)$$
Consider the vector $a = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$. Then $a^T X^T X a = 0$. Therefore $X^T X$ is not positive definite.
4.4 Subspaces
Definition 6. (Subspace) A set of points $S \subseteq \mathbb{R}^n$ is a subspace if
2. If x, y ∈ S, then x + y ∈ S
3. If x ∈ S, α ∈ R, then αx ∈ S.
Observations:
• The rank r of the subspace must satisfy r ≤ min (n, p). r is the number of linearly
independent columns of X.
2. $(UV)^T UV = V^T U^T U V = V^T V = I$.
4.5.2 Back to LS
Suppose U is an orthonormal basis matrix for our data matrix X. Then the least-squares problem is
$$\hat{v} = \arg\min_v \|y - Uv\|_2^2 \quad (22)$$
1. Input: $X = [x_1 \; \dots \; x_p] \in \mathbb{R}^{n \times p}$. Output: $U = [u_1 \; \dots \; u_r] \in \mathbb{R}^{n \times r}$.
3. For $j = 2, 3, \dots, p$:
Let $x_j'$ be the components of $x_j$ not represented by $u_1, \dots, u_{j-1}$:
$$x_j' = x_j - \sum_{i=1}^{j-1} (u_i^T x_j)\, u_i \quad (23)$$
Here $(u_i^T x_j)$ is the least squares weight for $u_i$. Then
$$u_j = \begin{cases} \dfrac{x_j'}{\|x_j'\|_2} & x_j' \neq 0 \\ 0 & x_j' = 0 \end{cases} \quad (24)$$
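A Python sketch of the Gram-Schmidt procedure above (equations (23)-(24)). The numerical tolerance used to test $x_j' = 0$ is an implementation choice, and columns with (numerically) zero residual are skipped rather than stored as zero vectors, so U has $r \le p$ columns.

import numpy as np

def gram_schmidt(X, tol=1e-10):
    """Orthonormal basis U for col(X) via classical Gram-Schmidt."""
    n, p = X.shape
    basis = []
    for j in range(p):
        xj_prime = X[:, j].copy()
        for u in basis:                      # subtract components already
            xj_prime -= (u @ X[:, j]) * u    # represented by u_1, ..., u_{j-1}
        norm = np.linalg.norm(xj_prime)
        if norm > tol:                       # x_j' != 0: normalize and keep
            basis.append(xj_prime / norm)
    return np.column_stack(basis)

# Quick check: U has orthonormal columns.
rng = np.random.default_rng(4)
X = rng.standard_normal((6, 3))
U = gram_schmidt(X)
print(np.allclose(U.T @ U, np.eye(U.shape[1])))   # True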
$$U = XA \quad (26)$$
Thus, $u_i = X a_i$.
Now, suppose $w \in \mathbb{R}^p$ is the vector of weights we found using LS, and, as above, v is our vector of weights found using LS with an orthonormal basis matrix. We have two expressions for the predicted label $\hat{y}$:
$$\begin{aligned}
\hat{y} &= w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \\
&= v_1 u_1 + v_2 u_2 + \cdots + v_r u_r \\
&= v_1 X a_1 + v_2 X a_2 + \cdots + v_r X a_r \\
&= v_1 (\alpha_{11} x_1 + \alpha_{12} x_2 + \cdots + \alpha_{1p} x_p) + \cdots + v_r (\alpha_{r1} x_1 + \alpha_{r2} x_2 + \cdots + \alpha_{rp} x_p) \\
&= x_1 (v_1 \alpha_{11} + \cdots + v_r \alpha_{r1}) + \cdots + x_p (v_1 \alpha_{1p} + \cdots + v_r \alpha_{rp})
\end{aligned}$$
Notice that
$$w_1 = v_1 \alpha_{11} + \cdots + v_r \alpha_{r1}, \quad \dots, \quad w_p = v_1 \alpha_{1p} + \cdots + v_r \alpha_{rp}$$
Therefore
$$\hat{y} = XAv = Xw \quad (28)$$
so that $Av = w$.
In sum, given a new sample $x_{new} \in \mathbb{R}^p$, we have two ways to predict the label $y_{new}$:
1. $\hat{y}_{new} = \langle x_{new}, w \rangle$.
2. Using an orthonormal basis U, we know that $U = XA$. Therefore, $u_{new}^T = x_{new}^T A$, or equivalently $u_{new} = A^T x_{new}$. Then $\hat{y}_{new} = \langle u_{new}, v \rangle$.
If the columns of X are linearly independent ($r = p$), we can calculate $a_i$ using LS (recalling $u_i = X a_i$):
$$a_i = (X^T X)^{-1} X^T u_i \quad (29)$$
Theorem 2. Let $X \in \mathbb{R}^{n \times p}$, $n \ge p$, be full rank (the p columns of X are linearly independent) and $y \in \mathbb{R}^n$. Let $u_1, \dots, u_p$ be orthonormal basis vectors such that $\mathrm{span}(\{x_1, \dots, x_p\}) = \mathrm{span}(\{u_1, \dots, u_p\})$. Then $\hat{y} = X\hat{w}$, where $\hat{w} = \arg\min_w \|y - Xw\|_2^2$, is given by $\hat{y} = UU^T y$, where $U = [u_1 \; u_2 \; \dots \; u_p]$.
Proof.
$$\hat{y} = X\hat{w} = X(X^T X)^{-1} X^T y \quad (30)$$
where $P_X = X(X^T X)^{-1} X^T$ is a projection matrix. Since $\mathrm{span}(\{x_1, \dots, x_p\}) = \mathrm{span}(\{u_1, \dots, u_p\})$, projecting any y onto either span gives the same result, so
$$P_X y = P_U y \quad (31)$$
which (holding for every y) implies $P_X = P_U$. Thus
$$P_X = P_U = U(U^T U)^{-1} U^T = UU^T \quad (32)$$
Finally,
$$\hat{y} = P_X y = P_U y = UU^T y \quad (33)$$
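A NumPy sketch of Theorem 2: with U an orthonormal basis for col(X) (here obtained from NumPy's QR factorization rather than Gram-Schmidt), $UU^T y$ matches $X(X^T X)^{-1} X^T y$. The data is synthetic.

import numpy as np

# Verify that projecting with an orthonormal basis gives the LS prediction.
rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p))          # full rank with probability 1
y = rng.standard_normal(n)

U, _ = np.linalg.qr(X)                   # U in R^{n x p} with orthonormal columns
y_hat_proj = U @ (U.T @ y)
y_hat_ls = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat_proj, y_hat_ls))   # True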
In words, this says the label for observation i is a linear combination of the features for example i.
The steps to complete least squares classification in this environment are as follows:
1. Build a data matrix (or feature matrix) and label vector
$$X = \begin{bmatrix} x_1^T & 1 \\ x_2^T & 1 \\ \vdots & \vdots \\ x_n^T & 1 \end{bmatrix} \in \mathbb{R}^{n \times p}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad (35)$$
(this last equality makes it clear that we are minimizing the sum of squared residuals). If the columns of X are linearly independent, then $X^T X$ is positive definite, and therefore invertible. In sum, if $X^T X$ is positive definite, then there exists a unique LS solution
$$\hat{w} = (X^T X)^{-1} X^T y \quad (37)$$
The predicted labels are
$$\hat{y} = X\hat{w} = X(X^T X)^{-1} X^T y$$
1. If X is full rank (i.e. the columns of X are linearly independent), then $\hat{w}_{LS}$ is unique and
$$\hat{w}_{LS} = (X^T X)^{-1} X^T y \quad (39)$$
2. If X is not full rank, then $X^T X$ is not invertible. $\hat{w}_{LS}$ is not unique; there are infinitely many solutions.
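A NumPy sketch of the second case. The X below is an arbitrary example with a repeated column, so $X^T X$ is singular and the LS solution is not unique; the pseudoinverse returns the minimum-norm solution, and adding any null-space vector gives another solution with the same predictions.

import numpy as np

# Rank-deficient least squares: infinitely many solutions with identical fit.
X = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 1.0],
              [3.0, 3.0, 0.0],
              [4.0, 4.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

w_min_norm = np.linalg.pinv(X) @ y
w_other = w_min_norm + np.array([1.0, -1.0, 0.0])   # a null-space direction of X
print(np.allclose(X @ w_min_norm, X @ w_other))     # True: same predictions
print(np.linalg.matrix_rank(X.T @ X))               # 2 < 3, so X^T X is not invertible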
6.1 Tikhonov Regularization Derivation
In this second case (and it can also be useful in the first), we can define a new objective
where ky − Xwk22 measures the fit to the data, λ > 0 is a regularization parameter or
tuning parameter, and kwk22 is a regularizer. kwk22 measures the energy in w.
Observations about this problem:
2. Even when X is full rank, $X^T X$ can be ill-conditioned, and regularization adjusts for this.
$$\begin{aligned}
f(w) &= y^T y - 2w^T X^T y + w^T X^T X w + \lambda w^T w \\
&= y^T y - 2w^T X^T y + w^T (X^T X + \lambda I) w
\end{aligned}$$
Then
$$\nabla_w f(w) = -2X^T y + 2(X^T X + \lambda I) w \quad (41)$$
If $(X^T X + \lambda I)$ is invertible, then $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$. BUT, $(X^T X + \lambda I)$ is always invertible. Recall that if a matrix is positive definite, then it is invertible. We can show that $(X^T X + \lambda I)$ is indeed positive definite and hence invertible. To see this, fix $0 \neq a \in \mathbb{R}^p$; then
$$\begin{aligned}
a^T (X^T X + \lambda I) a &= a^T X^T X a + \lambda a^T a \\
&= \|Xa\|_2^2 + \lambda \|a\|_2^2
\end{aligned}$$
Now note that $\|Xa\|_2^2 \ge 0$ (it could be 0 if X is not full rank and a is in the null space of X; this is what causes trouble with LS), but $\lambda \|a\|_2^2 > 0$. Therefore, $(X^T X + \lambda I)$ is positive definite.
Therefore,
$$\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y} \quad (43)$$
where
$$\tilde{X}^T \tilde{X} = X^T X + \lambda I \quad (44)$$
and
$$\tilde{X}^T \tilde{y} = X^T y \quad (45)$$
Thus this is equivalent to the derivation above.
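A NumPy sketch of the ridge solution and of this equivalence, taking $\tilde{X} = [X;\, \sqrt{\lambda}\, I]$ and $\tilde{y} = [y;\, 0]$ (one construction consistent with (44)-(45)); the data is synthetic.

import numpy as np

# Tikhonov/ridge: closed form vs. ordinary LS on an augmented system.
rng = np.random.default_rng(6)
n, p, lam = 20, 4, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(p)])   # X_tilde^T X_tilde = X^T X + lam I
y_tilde = np.concatenate([y, np.zeros(p)])           # X_tilde^T y_tilde = X^T y
w_aug, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
print(np.allclose(w_ridge, w_aug))   # True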
$$X = U \Sigma V^T \quad (46)$$
where
Figure 3: SVD
Example 9. (Netflix) Let $X \in \mathbb{R}^{n \times p}$ be a (full rank) matrix whose columns are taste profiles of customers and whose rows each contain a single movie's ratings across customers.
1. The ith column of U is a basis vector in Rn and is the ith representative customer
taste profile (i.e. vector of normalized movie ratings).
2. The jth column of V T (the jth row of V) is the relative importance of each represen-
tative taste profile to predicting customer j’s preferences.
3. The ith row of V T (the ith column of V) is the vector of users’ affinities to the ith
representative profile.
8 Power Iteration and Page Rank
8.1 SVD: Connection to Eigenvalues/vectors
Suppose $X = U \Sigma V^T \in \mathbb{R}^{p \times n}$. Then
$$A := X^T X = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T = V \Lambda V^T$$
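A minimal power iteration sketch on $A = X^T X$ (a generic version, since the notes' exact formulation is not shown here): repeatedly multiplying by A and normalizing converges, given an eigengap, to the top eigenvector of A, which is the top right singular vector of X. The sizes and iteration count are arbitrary choices.

import numpy as np

# Power iteration on A = X^T X to recover the top right singular vector of X.
rng = np.random.default_rng(7)
X = rng.standard_normal((30, 5))
A = X.T @ X

v = rng.standard_normal(5)
for _ in range(200):
    v = A @ v
    v /= np.linalg.norm(v)

_, _, Vt = np.linalg.svd(X)
v1 = Vt[0]                                            # top right singular vector
print(np.allclose(np.abs(v @ v1), 1.0, atol=1e-6))    # aligned up to sign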
9 Matrix Completion
9.1 Iterative Singular Value Thresholding
10 Iterative Solvers
Let $\tau > 0$ be the step size.
Algorithm 2 Landweber Iteration
Require: $w^{(0)}$
for $k \leftarrow 0, 1, \dots$ do
    $w^{(k+1)} \leftarrow w^{(k)} - \tau X^T (Xw^{(k)} - y)$
end for
$$\begin{aligned}
f(w) &= \|Xw - y\|_2^2 \\
\nabla_w f(w) &= \nabla_w (Xw - y)^T (Xw - y) \\
&= \nabla_w \left( w^T X^T X w - 2 w^T X^T y + y^T y \right) \\
&= 2 X^T X w - 2 X^T y \\
&= 2 X^T (Xw - y)
\end{aligned}$$
Thus the new iterate equals the old iterate plus a step in direction of negative gradient.
Recall that the iteration is given by $w^{(k+1)} = w^{(k)} - \tau X^T (Xw^{(k)} - y)$. Then
$$\begin{aligned}
\left\|Xw^{(k+1)} - y\right\|_2^2 &= \left\|X\left(w^{(k)} - \tau X^T (Xw^{(k)} - y)\right) - y\right\|_2^2 \\
&= \left\|Xw^{(k)} - y - \tau X X^T (Xw^{(k)} - y)\right\|_2^2 \\
&= \left\|Xw^{(k)} - y\right\|_2^2 + \tau^2 \left\|X X^T (Xw^{(k)} - y)\right\|_2^2 - 2\tau\, (Xw^{(k)} - y)^T X X^T (Xw^{(k)} - y)
\end{aligned}$$
Therefore
$$\begin{aligned}
\left\|Xw^{(k+1)} - y\right\|_2^2 &\le \left\|Xw^{(k)} - y\right\|_2^2 + \tau \left( \tau \|X\|_{op}^2 \left\|X^T (Xw^{(k)} - y)\right\|_2^2 - 2 \left\|X^T (Xw^{(k)} - y)\right\|_2^2 \right) \\
&= \left\|Xw^{(k)} - y\right\|_2^2 + \tau \left\|X^T (Xw^{(k)} - y)\right\|_2^2 \left( \tau \|X\|_{op}^2 - 2 \right)
\end{aligned}$$
2
2
Thus, if(τ k X k2op
− 2) < 0, then
Xw ( k + 1 ) ( k )
− y
≤
Xw − y
. Therefore, for conver-
2 2
gence we require that
2
0<τ< (51)
k X k2op
in order to ensure convergence. Under this condition, then
w ( k ) → ( X T X ) −1 X T y (52)
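A NumPy sketch of the Landweber iteration (Algorithm 2) with a step size satisfying (51); on synthetic full-rank data the iterates approach the LS solution in (52). The data, seed, and iteration count are arbitrary choices.

import numpy as np

# Landweber iteration with 0 < tau < 2 / ||X||_op^2.
rng = np.random.default_rng(8)
n, p = 40, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

tau = 1.0 / np.linalg.norm(X, ord=2) ** 2     # ||X||_op is the largest singular value
w = np.zeros(p)
for _ in range(5000):
    w = w - tau * (X.T @ (X @ w - y))

w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_ls, atol=1e-6))        # True: iterates approach the LS solution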
Claim 3. (Bound on the 2-norm of a matrix-vector product) Let X be a matrix and w a conformable vector. Then
$$\|Xw\|_2 \le \|X\|_{op} \|w\|_2 \quad (53)$$
Proof. Recall: the 2-norm (here, the operator norm) of a matrix is its largest singular value. Writing $X = U \Sigma V^T$ and using the fact that U and V have orthonormal columns,
$$\|Xw\|_2 = \|\Sigma V^T w\|_2 = \left( \sum_i \sigma_i^2 (V^T w)_i^2 \right)^{1/2} \le \sigma_{\max} \|V^T w\|_2 = \|X\|_{op} \|w\|_2$$
11 Regularized Regression
11.1 Proximal Gradient Algorithm
which is equivalent to
$$\hat{w}_L = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1 \quad (56)$$
In the figure below, the rhombuses and circles show the loci of points for which the weight vector has a particular norm (in the L1 and L2 norms, respectively). More precisely, they are $\{w : \|w\|_1 = \tau_1\}$ and $\{w : \|w\|_2 = \tau_2\}$. The red line is $\{w : y = Xw\}$.
Example 10. ($r(w) = \|w\|_1 = \sum_i |w_i|$) where $\lambda, \tau > 0$.
(a) If $z_i > \frac{\lambda\tau}{2}$, then $\hat{w}_i = z_i - \frac{\lambda\tau}{2}$.
(b) If $z_i < \frac{\lambda\tau}{2}$, then $\hat{w}_i = 0$.
In sum,
$$\hat{w}_i = \left( z_i - \frac{\lambda\tau}{2} \right)_+ \quad (58)$$
In the other case,
$$w_i = z_i + \frac{\lambda\tau}{2} \quad (59)$$
Thus
(a) If $z_i < -\frac{\lambda\tau}{2}$, then $\hat{w}_i = z_i + \frac{\lambda\tau}{2}$.
(b) If $z_i > -\frac{\lambda\tau}{2}$, then $\hat{w}_i = 0$.
In sum,
$$\hat{w}_i = -\left( |z_i| - \frac{\lambda\tau}{2} \right)_+ \quad (60)$$
We can combine these two cases to get
$$\hat{w}_i = \left( |z_i| - \frac{\lambda\tau}{2} \right)_+ \operatorname{sign}(z_i) \quad (61)$$
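A NumPy sketch of the soft-thresholding rule (61). As a check, it is compared against a brute-force grid minimization of the per-coordinate objective $(w - z)^2 + \lambda\tau|w|$, an assumed form consistent with the threshold $\frac{\lambda\tau}{2}$ above.

import numpy as np

# Soft thresholding: w_hat_i = (|z_i| - lam*tau/2)_+ * sign(z_i).
def soft_threshold(z, lam, tau):
    return np.sign(z) * np.maximum(np.abs(z) - lam * tau / 2.0, 0.0)

lam, tau = 1.0, 0.8                      # threshold lam*tau/2 = 0.4
z = np.array([-2.0, -0.3, 0.1, 1.5])
w_hat = soft_threshold(z, lam, tau)
print(w_hat)                             # entries with |z_i| <= 0.4 are set to 0

# Brute-force check of the (assumed) per-coordinate objective (w - z)^2 + lam*tau*|w|.
grid = np.linspace(-3, 3, 600001)
brute = np.array([grid[np.argmin((grid - zi) ** 2 + lam * tau * np.abs(grid))] for zi in z])
print(np.allclose(w_hat, brute, atol=1e-3))   # True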
Figure 6: A convex function lies entirely above tangent lines
Figure 7: Subgradients
written as minimizing the following sum of indicator variables:
$$\sum_{i=1}^{n} \mathbb{I}_{y_i \neq \operatorname{sign}(w^T x_i)} \quad (64)$$
LS actually minimizes
$$\sum_{i=1}^{n} (y_i - w^T x_i)^2 \quad (65)$$
We want to choose a convex function that mimics the ideal loss. We will use the hinge loss, which is defined by
$$\ell(w) = \sum_{i=1}^{n} (1 - y_i x_i^T w)_+ \quad (66)$$
where
$$(a)_+ = \begin{cases} a & a > 0 \\ 0 & \text{otherwise} \end{cases} \quad (67)$$
Definition 10. (Support Vector Machine) If we minimize
$$\sum_{i=1}^{n} (1 - y_i x_i^T w)_+ + \lambda \|w\|_2^2 \quad (68)$$
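A NumPy sketch comparing the ideal 0-1 loss (64), the hinge loss (66), and the SVM objective (68) for a fixed weight vector, on synthetic labels $y_i \in \{-1, +1\}$; the data-generating direction and candidate weight vector are arbitrary examples.

import numpy as np

# Evaluate the 0-1 loss, hinge loss, and SVM objective for one weight vector.
rng = np.random.default_rng(9)
n, p, lam = 200, 2, 0.1
X = rng.standard_normal((n, p))
y = np.where(X @ np.array([2.0, -1.0]) > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.array([1.5, -0.75])                       # an arbitrary candidate weight vector
zero_one = np.sum(y != np.sign(X @ w))           # ideal loss (64)
hinge = np.sum(np.maximum(1 - y * (X @ w), 0))   # hinge loss (66)
svm_obj = hinge + lam * (w @ w)                  # SVM objective (68)
print(zero_one, round(hinge, 2), round(svm_obj, 2))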