CS 532 Lecture Notes
Rebekah Dix
Contents
1 Elements of Machine Learning
4 Least Squares
4.1 Geometric Approach
4.2 Vector Calculus Approach
4.2.1 Review of Vector Calculus
4.2.2 Application to Least Squares
4.3 Positive Definite Matrices
4.4 Subspaces
4.5 Least Squares with Orthonormal Basis for Subspace
4.5.1 Orthogonal Matrices and Orthonormal Basis
4.5.2 Back to LS
4.5.3 Gram-Schmidt Orthogonalization Algorithm
7 Singular Value Decomposition
7.1 Interpretation of SVD
7.2 Low-Rank Approximation
9 Matrix Completion
9.1 Iterative Singular Value Thresholding
10 Iterative Solvers
10.1 Gradient Descent/Landweber Iteration
11 Regularized Regression
11.1 Proximal Gradient Algorithm
11.2 LASSO (Least absolute selection and shrinkage operator)
1 Elements of Machine Learning
1. Collect data
4. Generate training samples: a large collection of examples we can use to learn the
model.
5. Loss function: To learn the model, we choose a loss function (i.e. a measure of how
well a model fits the data)
6. Learn the model: Search over a collection of candidate models or model parameters
to find one that minimizes the loss on training data.
7. Characterize generalization error (the error of our predictions on new data that was
not used for training).
$$Xw = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} w = \begin{bmatrix} x_1^T w \\ x_2^T w \\ \vdots \\ x_n^T w \end{bmatrix} \quad (2)$$
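A small NumPy sketch of equation (2): each entry of $Xw$ is the inner product of one row of X with w. The matrix and vector below are arbitrary examples.

import numpy as np

# Equation (2): X @ w stacks the inner products x_i^T w.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # rows x_1^T, ..., x_4^T
w = rng.standard_normal(3)

row_by_row = np.array([X[i] @ w for i in range(X.shape[0])])
print(np.allclose(X @ w, row_by_row))   # True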
Matrix-matrix multiplication:
Tw is the expected preferences of a customer who weights the representative taste profiles
of T with the weights given in w.
• The jth column of X is a weighted sum of the columns of T, where the jth column of W tells us the weights:
$$x_j = T w_j \quad (4)$$
That is, the tastes (preferences) of the jth customer.
• The ith row of X is $x_i^T = t_i^T W$, where $t_i^T$ is the ith row of T. This gives us how much each customer likes movie i.
$$TW = \begin{bmatrix} T_1 & T_2 & \cdots & T_r \end{bmatrix} \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_r^T \end{bmatrix} = \sum_{k=1}^{r} T_k w_k^T \quad (6)$$
(a sum of rank-1 matrices; TW has rank r if and only if the columns of T and the rows of W are each linearly independent). In this representation, we can think of $T_k$ as the kth representative taste profile and $w_k^T$ as the kth row of W, i.e. the affinity of each customer with the kth representative profile.
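A small NumPy sketch of equation (6): the product TW equals the sum of the rank-1 outer products $T_k w_k^T$. The sizes and entries below are arbitrary examples.

import numpy as np

# Equation (6): T @ W as a sum of rank-1 outer products, where T_k is the
# k-th column of T and w_k^T is the k-th row of W.
rng = np.random.default_rng(1)
n, r, p = 5, 3, 4          # illustrative sizes (movies x profiles x customers)
T = rng.standard_normal((n, r))
W = rng.standard_normal((r, p))

rank_one_sum = sum(np.outer(T[:, k], W[k, :]) for k in range(r))
print(np.allclose(T @ W, rank_one_sum))   # True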
Definition 2. (Matrix rank) The rank of a matrix is the maximum number of linearly independent columns. The rank of a matrix is at most the smaller of its two dimensions.
3 Linear Systems and Vector Norms
Example 2. (Condition on rank(A) for existence of an exact solution)
Consider the linear system of equations Ax = b. This means that b is a weighted sum of the columns of A. Suppose A is full rank. Now consider the matrix $[A \; b]$. If the rank of $[A \; b]$ were greater than the rank of A (since the number of columns of the matrix increased by 1 and A is assumed full rank, this would imply the rank is rank(A) + 1), this would mean that b could not be written as a linear combination of the columns of A, and that the system would not have an exact solution. Therefore, we must have that $\mathrm{rank}([A \; b]) = \mathrm{rank}(A)$ in order for the system Ax = b to have an exact solution.
To see how the definition of linear independence applies here, observe that $Ax = b \implies Ax - b = 0$. Therefore
$$\begin{bmatrix} A & b \end{bmatrix} \begin{bmatrix} x \\ -1 \end{bmatrix} = 0 \quad (8)$$
Thus, if Ax = b has an exact solution, then $[A \; b]$ does not have linearly independent columns.
We want to solve Ax = b.
• This system has an exact solution, since $\mathrm{rank}(A) = \mathrm{rank}([A \; b])$. This follows since the columns of A are linearly dependent, so A has rank 1, and b is a multiple of the columns of A, so the rank of $[A \; b]$ is also 1.
• Note that $1 = \mathrm{rank}(A) < \dim(x) = 2$. Therefore this system does not have a unique solution.
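A NumPy sketch of these rank checks. The specific A and b below are hypothetical stand-ins chosen to match the description (a rank-1 A whose columns are equal, and a b that is a multiple of those columns).

import numpy as np

# Hypothetical stand-ins for the example's A and b.
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])
b = np.array([3.0, 6.0])

rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))
print(rank_A, rank_Ab)        # 1 1 -> an exact solution exists
print(rank_A < A.shape[1])    # True -> the solution is not unique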
1. $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$.
2. $\|x\| = 0$ if and only if $x = 0$.
4 Least Squares
We are given:
1. Vector of labels $y \in \mathbb{R}^n$
2. Allows us to compute derivatives
Figure 2: Geometry of LS in R2
Observations:
• The question we're trying to answer: What is the point in $\mathrm{col}(X)$ that has the shortest distance to y? In $\mathbb{R}^2$, what are the weights $\beta_1$ and $\beta_2$ such that $\beta_1 x_1 + \beta_2 x_2$ has the shortest distance to y?
• $\mathrm{col}(X)$ is the space of all vectors that can be written as $\alpha x_1 + \beta x_2$ for some $\alpha, \beta \in \mathbb{R}$, that is, the span of the columns of X. y may not lie in this space.
• The residual vector will form a right angle with $\mathrm{col}(X)$, because any other angle would correspond to a longer distance.
4.2 Vector Calculus Approach
4.2.1 Review of Vector Calculus
Let w be a p-vector and let f be a function of w that maps $\mathbb{R}^p$ to $\mathbb{R}$. Then the gradient of f with respect to w is
$$\nabla_w f(w) = \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_p} \end{bmatrix} \quad (12)$$
$$f(x) = x^T Q x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i Q_{ij} x_j$$
Therefore, the term $x_i Q_{ij} x_j$ contributes to $[\nabla_x f]_k = \frac{\partial f}{\partial x_k}$
$$\begin{cases} 2Q_{kk} x_k & i = j = k \\ Q_{kj} x_j & i = k,\; j \neq k \\ Q_{ik} x_i & j = k,\; i \neq k \end{cases} \quad (15)$$
Therefore
$$\nabla_x f = (Q + Q^T) x \quad (16)$$
If Q is symmetric, then this equals $2Qx$.
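A NumPy sketch that checks equation (16) numerically against a central finite difference; Q and x are arbitrary test values.

import numpy as np

# Check: the gradient of f(x) = x^T Q x is (Q + Q^T) x.
rng = np.random.default_rng(2)
n = 4
Q = rng.standard_normal((n, n))
x = rng.standard_normal(n)

analytic = (Q + Q.T) @ x
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ Q @ (x + eps * e) - (x - eps * e) @ Q @ (x - eps * e)) / (2 * eps)
    for e in np.eye(n)      # central differences along each coordinate
])
print(np.allclose(analytic, numeric, atol=1e-4))   # True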
4.2.2 Application to Least Squares
Let $f(w) = \|y - Xw\|_2^2$. Then the least squares problem is
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2$$
and
$$\nabla_w f(w) = -2X^T y + 2X^T X w$$
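Setting this gradient to zero leads to the normal equations $X^T X w = X^T y$. The following NumPy sketch solves them on synthetic data; the sizes, seed, and "true" weights are arbitrary choices.

import numpy as np

# Minimal sketch: solve the normal equations X^T X w = X^T y on synthetic data.
rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_hat, 2))                                  # close to w_true
print(np.allclose(X.T @ (y - X @ w_hat), 0.0, atol=1e-8))  # gradient of f is ~0 at w_hat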
Proof.
1. If $P \succ 0$ and $Q \succ 0$, then $P + Q \succ 0$.
5. Notation: $A \succeq B$ means $A - B \succeq 0$.
Example 8. Let
$$X = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} \quad (19)$$
Then
$$X^T X = \begin{bmatrix} 3 & 3 \\ 3 & 3 \end{bmatrix} \quad (20)$$
Consider the vector $a = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$. Then $a^T X^T X a = 0$. Therefore $X^T X$ is not positive definite.
4.4 Subspaces
Definition 6. (Subspace) A set of points $S \subseteq \mathbb{R}^n$ is a subspace if
2. If x, y ∈ S, then x + y ∈ S
3. If x ∈ S, α ∈ R, then αx ∈ S.
Observations:
• The rank r of the subspace must satisfy r ≤ min (n, p). r is the number of linearly
independent columns of X.
2. $(UV)^T UV = V^T U^T U V = V^T V = I$.
4.5.2 Back to LS
Suppose U is an orthonormal basis matrix for our data matrix X. Then the least-squares problem is
$$\hat{v} = \arg\min_v \|y - Uv\|_2^2 \quad (22)$$
1. Input: $X = [x_1 \; \dots \; x_p] \in \mathbb{R}^{n \times p}$. Output: $U = [u_1 \; \dots \; u_r] \in \mathbb{R}^{n \times r}$.
3. For $j = 2, 3, \dots, p$:
Let $x_j'$ be the components of $x_j$ not represented by $u_1, \dots, u_{j-1}$:
$$x_j' = x_j - \sum_{i=1}^{j-1} (u_i^T x_j)\, u_i \quad (23)$$
Here $(u_i^T x_j)$ is the least squares weight for $u_i$. Then
$$u_j = \begin{cases} \dfrac{x_j'}{\|x_j'\|_2} & x_j' \neq 0 \\ 0 & x_j' = 0 \end{cases} \quad (24)$$
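A Python sketch of the Gram-Schmidt procedure above (equations (23)-(24)). The numerical tolerance used to test $x_j' = 0$ is an implementation choice, and columns with (numerically) zero residual are skipped rather than stored as zero vectors, so U has $r \le p$ columns.

import numpy as np

def gram_schmidt(X, tol=1e-10):
    """Orthonormal basis U for col(X) via classical Gram-Schmidt."""
    n, p = X.shape
    basis = []
    for j in range(p):
        xj_prime = X[:, j].copy()
        for u in basis:                      # subtract components already
            xj_prime -= (u @ X[:, j]) * u    # represented by u_1, ..., u_{j-1}
        norm = np.linalg.norm(xj_prime)
        if norm > tol:                       # x_j' != 0: normalize and keep
            basis.append(xj_prime / norm)
    return np.column_stack(basis)

# Quick check: U has orthonormal columns.
rng = np.random.default_rng(4)
X = rng.standard_normal((6, 3))
U = gram_schmidt(X)
print(np.allclose(U.T @ U, np.eye(U.shape[1])))   # True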
$$U = XA \quad (26)$$
Thus, $u_i = X a_i$.
Now, suppose $w \in \mathbb{R}^p$ is the vector of weights we found using LS, and, as above, v is our vector of weights found using LS with an orthonormal basis matrix. We have two expressions for the predicted label $\hat{y}$:
$$\begin{aligned}
\hat{y} &= w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \\
&= v_1 u_1 + v_2 u_2 + \cdots + v_r u_r \\
&= v_1 X a_1 + v_2 X a_2 + \cdots + v_r X a_r \\
&= v_1 (\alpha_{11} x_1 + \alpha_{12} x_2 + \cdots + \alpha_{1p} x_p) + \cdots + v_r (\alpha_{r1} x_1 + \alpha_{r2} x_2 + \cdots + \alpha_{rp} x_p) \\
&= x_1 (v_1 \alpha_{11} + \cdots + v_r \alpha_{r1}) + \cdots + x_p (v_1 \alpha_{1p} + \cdots + v_r \alpha_{rp})
\end{aligned}$$
Notice that
$$w_1 = v_1 \alpha_{11} + \cdots + v_r \alpha_{r1}, \quad \dots, \quad w_p = v_1 \alpha_{1p} + \cdots + v_r \alpha_{rp}$$
Therefore
$$\hat{y} = XAv = Xw \quad (28)$$
so that $Av = w$.
In sum, given a new sample $x_{new} \in \mathbb{R}^p$, we have two ways to predict the label $y_{new}$:
1. $\hat{y}_{new} = \langle x_{new}, w \rangle$.
2. Using an orthonormal basis U, we know that $U = XA$. Therefore, $u_{new}^T = x_{new}^T A$, or equivalently $u_{new} = A^T x_{new}$. Then $\hat{y}_{new} = \langle u_{new}, v \rangle$.
If the columns of X are linearly independent ($r = p$), we can calculate $a_i$ using LS (recalling $u_i = X a_i$):
$$a_i = (X^T X)^{-1} X^T u_i \quad (29)$$
Theorem 2. Let $X \in \mathbb{R}^{n \times p}$, $n \ge p$, be full rank (the p columns of X are linearly independent) and $y \in \mathbb{R}^n$. Let $u_1, \dots, u_p$ be orthonormal basis vectors such that $\mathrm{span}(\{x_1, \dots, x_p\}) = \mathrm{span}(\{u_1, \dots, u_p\})$. Then $\hat{y} = X\hat{w}$, where $\hat{w} = \arg\min_w \|y - Xw\|_2^2$, is given by $\hat{y} = UU^T y$, where $U = [u_1 \; u_2 \; \dots \; u_p]$.
Proof.
$$\hat{y} = X\hat{w} = X(X^T X)^{-1} X^T y \quad (30)$$
where $P_X = X(X^T X)^{-1} X^T$ is a projection matrix. Since $\mathrm{span}(\{x_1, \dots, x_p\}) = \mathrm{span}(\{u_1, \dots, u_p\})$, projecting any y onto either span gives the same result, so
$$P_X y = P_U y \quad (31)$$
which (holding for every y) implies $P_X = P_U$. Thus
$$P_X = P_U = U(U^T U)^{-1} U^T = UU^T \quad (32)$$
Finally,
$$\hat{y} = P_X y = P_U y = UU^T y \quad (33)$$
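A NumPy sketch of Theorem 2: with U an orthonormal basis for col(X) (here obtained from NumPy's QR factorization rather than Gram-Schmidt), $UU^T y$ matches $X(X^T X)^{-1} X^T y$. The data is synthetic.

import numpy as np

# Verify that projecting with an orthonormal basis gives the LS prediction.
rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p))          # full rank with probability 1
y = rng.standard_normal(n)

U, _ = np.linalg.qr(X)                   # U in R^{n x p} with orthonormal columns
y_hat_proj = U @ (U.T @ y)
y_hat_ls = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat_proj, y_hat_ls))   # True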
In words, this says the label for observation i is a linear combination of the features for example i.
The steps to complete least squares classification in this environment are as follows:
1. Build a data matrix (or feature matrix) and label vector
$$X = \begin{bmatrix} x_1^T & 1 \\ x_2^T & 1 \\ \vdots & \vdots \\ x_n^T & 1 \end{bmatrix} \in \mathbb{R}^{n \times p}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad (35)$$
(this last equality makes it clear that we are minimizing the sum of squared residuals). If the columns of X are linearly independent, then $X^T X$ is positive definite, and therefore invertible. In sum, if $X^T X$ is positive definite, then there exists a unique LS solution
$$\hat{w} = (X^T X)^{-1} X^T y \quad (37)$$
The predicted labels are
$$\hat{y} = X\hat{w} = X(X^T X)^{-1} X^T y$$
1. If X is full rank (i.e. the columns of X are linearly independent), then $\hat{w}_{LS}$ is unique and
$$\hat{w}_{LS} = (X^T X)^{-1} X^T y \quad (39)$$
2. If X is not full rank, then $X^T X$ is not invertible. $\hat{w}_{LS}$ is not unique; there are infinitely many solutions.
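A NumPy sketch of the second case. The X below is an arbitrary example with a repeated column, so $X^T X$ is singular and the LS solution is not unique; the pseudoinverse returns the minimum-norm solution, and adding any null-space vector gives another solution with the same predictions.

import numpy as np

# Rank-deficient least squares: infinitely many solutions with identical fit.
X = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 1.0],
              [3.0, 3.0, 0.0],
              [4.0, 4.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

w_min_norm = np.linalg.pinv(X) @ y
w_other = w_min_norm + np.array([1.0, -1.0, 0.0])   # a null-space direction of X
print(np.allclose(X @ w_min_norm, X @ w_other))     # True: same predictions
print(np.linalg.matrix_rank(X.T @ X))               # 2 < 3, so X^T X is not invertible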
6.1 Tikhonov Regularization Derivation
In this second case (and it can also be useful in the first), we can define a new objective
where ky − Xwk22 measures the fit to the data, λ > 0 is a regularization parameter or
tuning parameter, and kwk22 is a regularizer. kwk22 measures the energy in w.
Observations about this problem:
2. Even when X is full rank, $X^T X$ can be ill-conditioned, and regularization adjusts for this.
$$\begin{aligned}
f(w) &= y^T y - 2w^T X^T y + w^T X^T X w + \lambda w^T w \\
&= y^T y - 2w^T X^T y + w^T (X^T X + \lambda I) w
\end{aligned}$$
Then
$$\nabla_w f(w) = -2X^T y + 2(X^T X + \lambda I) w \quad (41)$$
If $(X^T X + \lambda I)$ is invertible, then $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$. BUT, $(X^T X + \lambda I)$ is always invertible. Recall that if a matrix is positive definite, then it is invertible. We can show that $(X^T X + \lambda I)$ is indeed positive definite and hence invertible. To see this, fix $0 \neq a \in \mathbb{R}^p$; then
$$\begin{aligned}
a^T (X^T X + \lambda I) a &= a^T X^T X a + \lambda a^T a \\
&= \|Xa\|_2^2 + \lambda \|a\|_2^2
\end{aligned}$$
Now note that $\|Xa\|_2^2 \ge 0$ (it could be 0 if X is not full rank and a is in the null space of X; this is what causes trouble with LS), but $\lambda \|a\|_2^2 > 0$. Therefore, $(X^T X + \lambda I)$ is positive definite.
Therefore,
$$\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y} \quad (43)$$
where
$$\tilde{X}^T \tilde{X} = X^T X + \lambda I \quad (44)$$
and
$$\tilde{X}^T \tilde{y} = X^T y \quad (45)$$
Thus this is equivalent to the derivation above.
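A NumPy sketch of the ridge solution and of this equivalence, taking $\tilde{X} = [X;\, \sqrt{\lambda}\, I]$ and $\tilde{y} = [y;\, 0]$ (one construction consistent with (44)-(45)); the data is synthetic.

import numpy as np

# Tikhonov/ridge: closed form vs. ordinary LS on an augmented system.
rng = np.random.default_rng(6)
n, p, lam = 20, 4, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(p)])   # X_tilde^T X_tilde = X^T X + lam I
y_tilde = np.concatenate([y, np.zeros(p)])           # X_tilde^T y_tilde = X^T y
w_aug, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
print(np.allclose(w_ridge, w_aug))   # True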
$$X = U \Sigma V^T \quad (46)$$
where
Figure 3: SVD
Example 9. (Netflix) Let $X \in \mathbb{R}^{n \times p}$ be a (full rank) matrix whose columns are taste profiles of customers and whose rows each contain a single movie's ratings across customers.
1. The ith column of U is a basis vector in Rn and is the ith representative customer
taste profile (i.e. vector of normalized movie ratings).
2. The jth column of V T (the jth row of V) is the relative importance of each represen-
tative taste profile to predicting customer j’s preferences.
3. The ith row of V T (the ith column of V) is the vector of users’ affinities to the ith
representative profile.
8 Power Iteration and Page Rank
8.1 SVD: Connection to Eigenvalues/vectors
Suppose $X = U \Sigma V^T \in \mathbb{R}^{p \times n}$. Then
$$A := X^T X = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T = V \Lambda V^T$$
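A minimal power iteration sketch on $A = X^T X$ (a generic version, since the notes' exact formulation is not shown here): repeatedly multiplying by A and normalizing converges, given an eigengap, to the top eigenvector of A, which is the top right singular vector of X. The sizes and iteration count are arbitrary choices.

import numpy as np

# Power iteration on A = X^T X to recover the top right singular vector of X.
rng = np.random.default_rng(7)
X = rng.standard_normal((30, 5))
A = X.T @ X

v = rng.standard_normal(5)
for _ in range(200):
    v = A @ v
    v /= np.linalg.norm(v)

_, _, Vt = np.linalg.svd(X)
v1 = Vt[0]                                            # top right singular vector
print(np.allclose(np.abs(v @ v1), 1.0, atol=1e-6))    # aligned up to sign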
9 Matrix Completion
9.1 Iterative Singular Value Thresholding
10 Iterative Solvers
Let $\tau > 0$ be the step size.
Algorithm 2 Landweber Iteration
Require: $w^{(0)}$
for $k \leftarrow 0, 1, \dots$ do
    $w^{(k+1)} \leftarrow w^{(k)} - \tau X^T (Xw^{(k)} - y)$
end for
$$\begin{aligned}
f(w) &= \|Xw - y\|_2^2 \\
\nabla_w f(w) &= \nabla_w (Xw - y)^T (Xw - y) \\
&= \nabla_w \left( w^T X^T X w - 2 w^T X^T y + y^T y \right) \\
&= 2 X^T X w - 2 X^T y \\
&= 2 X^T (Xw - y)
\end{aligned}$$
Thus the new iterate equals the old iterate plus a step in direction of negative gradient.
Recall that the iteration is given by $w^{(k+1)} = w^{(k)} - \tau X^T (Xw^{(k)} - y)$. Then
$$\begin{aligned}
\left\|Xw^{(k+1)} - y\right\|_2^2 &= \left\|X\left(w^{(k)} - \tau X^T (Xw^{(k)} - y)\right) - y\right\|_2^2 \\
&= \left\|Xw^{(k)} - y - \tau X X^T (Xw^{(k)} - y)\right\|_2^2 \\
&= \left\|Xw^{(k)} - y\right\|_2^2 + \tau^2 \left\|X X^T (Xw^{(k)} - y)\right\|_2^2 - 2\tau\, (Xw^{(k)} - y)^T X X^T (Xw^{(k)} - y)
\end{aligned}$$
Therefore
$$\begin{aligned}
\left\|Xw^{(k+1)} - y\right\|_2^2 &\le \left\|Xw^{(k)} - y\right\|_2^2 + \tau \left( \tau \|X\|_{op}^2 \left\|X^T (Xw^{(k)} - y)\right\|_2^2 - 2 \left\|X^T (Xw^{(k)} - y)\right\|_2^2 \right) \\
&= \left\|Xw^{(k)} - y\right\|_2^2 + \tau \left\|X^T (Xw^{(k)} - y)\right\|_2^2 \left( \tau \|X\|_{op}^2 - 2 \right)
\end{aligned}$$
2
2
Thus, if(τ k X k2op
− 2) < 0, then
Xw ( k + 1 ) ( k )
− y
≤
Xw − y
. Therefore, for conver-
2 2
gence we require that
2
0<τ< (51)
k X k2op
in order to ensure convergence. Under this condition, then
w ( k ) → ( X T X ) −1 X T y (52)
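A NumPy sketch of the Landweber iteration (Algorithm 2) with a step size satisfying (51); on synthetic full-rank data the iterates approach the LS solution in (52). The data, seed, and iteration count are arbitrary choices.

import numpy as np

# Landweber iteration with 0 < tau < 2 / ||X||_op^2.
rng = np.random.default_rng(8)
n, p = 40, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

tau = 1.0 / np.linalg.norm(X, ord=2) ** 2     # ||X||_op is the largest singular value
w = np.zeros(p)
for _ in range(5000):
    w = w - tau * (X.T @ (X @ w - y))

w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_ls, atol=1e-6))        # True: iterates approach the LS solution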
Claim 3. (Bound on the 2-norm of a matrix-vector product) Let X be a matrix and w a conformable vector. Then
$$\|Xw\|_2 \le \|X\|_{op} \|w\|_2 \quad (53)$$
Proof. Recall: the 2-norm (here, the operator norm) of a matrix is its largest singular value. Writing $X = U \Sigma V^T$ and using the fact that U and V have orthonormal columns,
$$\|Xw\|_2 = \|\Sigma V^T w\|_2 = \left( \sum_i \sigma_i^2 (V^T w)_i^2 \right)^{1/2} \le \sigma_{\max} \|V^T w\|_2 = \|X\|_{op} \|w\|_2$$
11 Regularized Regression
11.1 Proximal Gradient Algorithm
which is equivalent to
$$\hat{w}_L = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1 \quad (56)$$
In the figure below, the rhombuses and circles show the loci of points for which the weight vector has a particular norm (in the L1 and L2 norms, respectively). More precisely, they are $\{w : \|w\|_1 = \tau_1\}$ and $\{w : \|w\|_2 = \tau_2\}$. The red line is $\{w : y = Xw\}$.
Example 10. ($r(w) = \|w\|_1 = \sum_i |w_i|$) where $\lambda, \tau > 0$.
(a) If $z_i > \frac{\lambda\tau}{2}$, then $\hat{w}_i = z_i - \frac{\lambda\tau}{2}$.
(b) If $z_i < \frac{\lambda\tau}{2}$, then $\hat{w}_i = 0$.
In sum,
$$\hat{w}_i = \left( z_i - \frac{\lambda\tau}{2} \right)_+ \quad (58)$$
In the other case,
$$w_i = z_i + \frac{\lambda\tau}{2} \quad (59)$$
Thus
(a) If $z_i < -\frac{\lambda\tau}{2}$, then $\hat{w}_i = z_i + \frac{\lambda\tau}{2}$.
(b) If $z_i > -\frac{\lambda\tau}{2}$, then $\hat{w}_i = 0$.
In sum,
$$\hat{w}_i = -\left( |z_i| - \frac{\lambda\tau}{2} \right)_+ \quad (60)$$
We can combine these two cases to get
$$\hat{w}_i = \left( |z_i| - \frac{\lambda\tau}{2} \right)_+ \operatorname{sign}(z_i) \quad (61)$$
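A NumPy sketch of the soft-thresholding rule (61). As a check, it is compared against a brute-force grid minimization of the per-coordinate objective $(w - z)^2 + \lambda\tau|w|$, an assumed form consistent with the threshold $\frac{\lambda\tau}{2}$ above.

import numpy as np

# Soft thresholding: w_hat_i = (|z_i| - lam*tau/2)_+ * sign(z_i).
def soft_threshold(z, lam, tau):
    return np.sign(z) * np.maximum(np.abs(z) - lam * tau / 2.0, 0.0)

lam, tau = 1.0, 0.8                      # threshold lam*tau/2 = 0.4
z = np.array([-2.0, -0.3, 0.1, 1.5])
w_hat = soft_threshold(z, lam, tau)
print(w_hat)                             # entries with |z_i| <= 0.4 are set to 0

# Brute-force check of the (assumed) per-coordinate objective (w - z)^2 + lam*tau*|w|.
grid = np.linspace(-3, 3, 600001)
brute = np.array([grid[np.argmin((grid - zi) ** 2 + lam * tau * np.abs(grid))] for zi in z])
print(np.allclose(w_hat, brute, atol=1e-3))   # True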
Figure 6: A convex function lies entirely above tangent lines
Figure 7: Subgradients
written as minimizing the following sum of indicator variables:
$$\sum_{i=1}^{n} \mathbb{I}_{y_i \neq \operatorname{sign}(w^T x_i)} \quad (64)$$
LS actually minimizes
$$\sum_{i=1}^{n} (y_i - w^T x_i)^2 \quad (65)$$
We want to choose a convex function that mimics the ideal loss. We will use the hinge loss, which is defined by
$$\ell(w) = \sum_{i=1}^{n} (1 - y_i x_i^T w)_+ \quad (66)$$
where
$$(a)_+ = \begin{cases} a & a > 0 \\ 0 & \text{otherwise} \end{cases} \quad (67)$$
Definition 10. (Support Vector Machine) If we minimize
$$\sum_{i=1}^{n} (1 - y_i x_i^T w)_+ + \lambda \|w\|_2^2 \quad (68)$$
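A NumPy sketch comparing the ideal 0-1 loss (64), the hinge loss (66), and the SVM objective (68) for a fixed weight vector, on synthetic labels $y_i \in \{-1, +1\}$; the data-generating direction and candidate weight vector are arbitrary examples.

import numpy as np

# Evaluate the 0-1 loss, hinge loss, and SVM objective for one weight vector.
rng = np.random.default_rng(9)
n, p, lam = 200, 2, 0.1
X = rng.standard_normal((n, p))
y = np.where(X @ np.array([2.0, -1.0]) > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.array([1.5, -0.75])                       # an arbitrary candidate weight vector
zero_one = np.sum(y != np.sign(X @ w))           # ideal loss (64)
hinge = np.sum(np.maximum(1 - y * (X @ w), 0))   # hinge loss (66)
svm_obj = hinge + lam * (w @ w)                  # SVM objective (68)
print(zero_one, round(hinge, 2), round(svm_obj, 2))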