
Machine Learning Srihari

Linear Algebra for Machine Learning
Sargur N. Srihari
srihari@cedar.buffalo.edu

1
Machine Learning Srihari

What is linear algebra?


•  Linear algebra is the branch of mathematics
concerning linear equations such as
a1x1+…..+anxn=b
–  In vector notation we say aTx=b
–  Called a linear transformation of x
•  Linear algebra is fundamental to geometry, for
defining objects such as lines, planes, rotations
Linear equation a1x1+…..+anxn=b
defines a plane in (x1,..,xn) space
Straight lines define common solutions
to equations

2
Machine Learning Srihari

Why do we need to know it?


•  Linear Algebra is used throughout engineering
–  Because it is based on continuous math rather than
discrete math
•  Computer scientists have little experience with it
•  Essential for understanding ML algorithms
–  E.g., We convert input vectors (x1,..,xn) into outputs
by a series of linear transformations
•  Here we discuss:
–  Concepts of linear algebra needed for ML
–  Omit other aspects of linear algebra
3
Linear Algebra Topics
Machine Learning Srihari

–  Scalars, Vectors, Matrices and Tensors


–  Multiplying Matrices and Vectors
–  Identity and Inverse Matrices
–  Linear Dependence and Span
–  Norms
–  Special kinds of matrices and vectors
–  Eigendecomposition
–  Singular value decomposition
–  The Moore-Penrose pseudoinverse
–  The trace operator
–  The determinant
–  Ex: principal components analysis
Machine Learning Srihari

Scalar
•  Single number
–  In contrast to other objects in linear algebra,
which are usually arrays of numbers
•  Represented in lower-case italic x
–  They can be real-valued or be integers
•  E.g., let x ∈ ℝ be the slope of the line
–  Defining a real-valued scalar
•  E.g., let n ∈ ℕ be the number of units
–  Defining a natural number scalar

5
Machine Learning Srihari

Vector
•  An array of numbers arranged in order
•  Each no. identified by an index
•  Written in lower-case bold such as x
–  its elements are in italics lower case, subscripted
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

•  If each element is in ℝ then x is in ℝ^n
•  We can think of vectors as points in space
–  Each element gives coordinate along an axis
6
Machine Learning Srihari

Matrices
•  2-D array of numbers
–  So each element identified by two indices
•  Denoted by bold typeface A
–  Elements indicated by name in italic but not bold
•  A1,1 is the top left entry and Am,n is the bottom right entry
•  We can identify the numbers in column j by writing ":" for the
row coordinate, as in A:,j
•  E.g.,

A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}

•  Ai,: is the ith row of A, A:,j is the jth column of A

•  If A has shape of height m and width n with
real values then A ∈ ℝ^{m×n}
7
Machine Learning Srihari

Tensor
•  Sometimes need an array with more than two
axes
–  E.g., an RGB color image has three axes
•  A tensor is an array of numbers arranged on a
regular grid with variable number of axes
–  See figure next
•  Denote a tensor with this bold typeface: A
•  Element (i,j,k) of tensor denoted by Ai,j,k
8
Machine Learning Srihari

Shapes of Tensors

9
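To make the four object types concrete, here is a minimal NumPy sketch (NumPy and the specific values are illustrative additions, not part of the slides):

```python
import numpy as np

x = 3.5                                   # scalar: a single number
v = np.array([1.0, 2.0, 3.0])             # vector: shape (3,)
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])                # matrix: shape (2, 2)
T = np.zeros((2, 3, 4))                   # 3-D tensor: three axes

print(v.shape, A.shape, T.shape)          # (3,) (2, 2) (2, 3, 4)
print(A[0, 0], A[-1, -1])                 # top-left and bottom-right entries
print(A[:, 1])                            # column j=1, ":" selects all rows
```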
Machine Learning Srihari

Transpose of a Matrix
•  An important operation on matrices
•  The transpose of a matrix A is denoted as AT
•  Defined as
(AT)i,j=Aj,i
–  The mirror image across a diagonal line
•  Called the main diagonal, running down and to the right
starting from the upper left corner

A = \begin{bmatrix} A_{1,1} & A_{1,2} & A_{1,3} \\ A_{2,1} & A_{2,2} & A_{2,3} \\ A_{3,1} & A_{3,2} & A_{3,3} \end{bmatrix}
\;\Rightarrow\;
A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \\ A_{1,3} & A_{2,3} & A_{3,3} \end{bmatrix}
\qquad
A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix}
\;\Rightarrow\;
A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}
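As a quick check of the definition (A^T)i,j = Aj,i, a small NumPy example with an arbitrary 3×2 matrix:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])       # shape (3, 2)
At = A.T                     # transpose, shape (2, 3)

# (A^T)[i, j] equals A[j, i] for every index pair
assert all(At[i, j] == A[j, i] for i in range(2) for j in range(3))
print(At)
```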
10
Machine Learning Srihari

Vectors as special case of matrix


•  Vectors are matrices with a single column
•  Often written in-line using transpose
x = [x1,..,xn]T
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
\;\Rightarrow\;
x = [x_1, x_2, .., x_n]^T

•  A scalar is a matrix with one element


a=aT
11
Machine Learning Srihari

Matrix Addition
•  We can add matrices to each other if they have
the same shape, by adding corresponding
elements
–  If A and B have same shape (height m, width n)
C = A + B ⇒ Ci,j = Ai,j + Bi,j

•  A scalar can be added to a matrix, and a matrix can be
multiplied by a scalar: D = aB + c ⇒ Di,j = aBi,j + c

•  Less conventional notation used in ML:
–  Vector added to matrix: C = A + b ⇒ Ci,j = Ai,j + bj
•  Called broadcasting since vector b is added to each row of A
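A short NumPy sketch of these operations, including the row-wise broadcasting convention; the matrices and vectors here are arbitrary examples:

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.ones((2, 2))

C = A + B                  # element-wise addition: C[i, j] = A[i, j] + B[i, j]
D = 2.0 * B + 0.5          # multiply a matrix by a scalar and add a scalar

b = np.array([10., 20.])
E = A + b                  # broadcasting: b is added to each row of A
print(C, D, E, sep="\n")
```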
12
Machine Learning Srihari

Multiplying Matrices
•  For product C=AB to be defined, A has to have
the same no. of columns as the no. of rows of B
–  If A is of shape m×n and B is of shape n×p then
matrix product C is of shape m×p

C = AB ⇒ Ci,j = Σk Ai,k Bk,j

–  Note that the standard product of two matrices is
not just the product of two individual elements
•  Such a product does exist and is called the element-wise
product or the Hadamard product A⊙B
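The following NumPy sketch contrasts the standard matrix product with the element-wise (Hadamard) product; the matrices are arbitrary, chosen so the shapes are compatible:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])     # shape (2, 3)
B = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])         # shape (3, 2)

C = A @ B                        # matrix product: C[i, j] = sum_k A[i, k] * B[k, j], shape (2, 2)
H = A * A                        # Hadamard (element-wise) product, same shape as A
print(C)
print(H)
```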
13
Machine Learning Srihari

Multiplying Vectors
•  Dot product between two vectors x and y of
same dimensionality is the matrix product xTy
•  We can think of matrix product C=AB as
computing Cij the dot product of row i of A and
column j of B

14
Machine Learning Srihari

Matrix Product Properties


•  Distributivity over addition: A(B+C)=AB+AC
•  Associativity: A(BC)=(AB)C
•  Not commutative: AB=BA is not always true
•  Dot product between vectors is commutative:
xTy=yTx
•  Transpose of a matrix product has a simple
form: (AB)T=BTAT
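These properties are easy to verify numerically; an illustrative NumPy check with random 3×3 matrices (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

print(np.allclose(A @ (B + C), A @ B + A @ C))   # distributivity: True
print(np.allclose(A @ (B @ C), (A @ B) @ C))     # associativity: True
print(np.allclose(A @ B, B @ A))                 # commutativity fails in general: False
print(np.allclose((A @ B).T, B.T @ A.T))         # (AB)^T = B^T A^T: True
```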

15
Machine Learning Srihari

Example flow of tensors in ML


A linear classifier: y = Wxᵀ + b

Vector x is converted
into vector y by
multiplying x by a matrix W

A linear classifier with the bias eliminated: y = Wxᵀ


Machine Learning Srihari

Linear Transformation
•  Ax=b
–  where A ∈ ℝ^{n×n} and b ∈ ℝ^n
–  More explicitly, n equations in n unknowns:

A_{1,1} x_1 + A_{1,2} x_2 + .... + A_{1,n} x_n = b_1
A_{2,1} x_1 + A_{2,2} x_2 + .... + A_{2,n} x_n = b_2
      ⋮
A_{n,1} x_1 + A_{n,2} x_2 + .... + A_{n,n} x_n = b_n

A = \begin{bmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{n,1} & \cdots & A_{n,n} \end{bmatrix}_{n \times n}
\quad
x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}_{n \times 1}
\quad
b = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}_{n \times 1}

Can view A as a linear transformation of vector x to vector b

•  Sometimes we wish to solve for the unknowns


x ={x1,..,xn} when A and b provide constraints
17
Machine Learning Srihari

Identity and Inverse Matrices


•  Matrix inversion is a powerful tool to analytically
solve Ax=b
•  Needs concept of Identity matrix
•  Identity matrix does not change value of vector
when we multiply the vector by identity matrix
–  Denote identity matrix that preserves n-dimensional
vectors as In
–  Formally In ∈ ℝ^{n×n} and ∀x ∈ ℝ^n, In x = x

–  Example of I3:

I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
Machine Learning Srihari

Matrix Inverse
•  Inverse of square matrix A defined as
A⁻¹A = In
•  We can now solve Ax = b as follows:
Ax = b
A⁻¹Ax = A⁻¹b
In x = A⁻¹b
x = A⁻¹b

•  This depends on being able to find A-1


•  If A-1 exists there are several methods for
finding it
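As an illustration (not from the slides), a small system solved both via the explicit inverse and via np.linalg.solve, which is what is normally used in practice:

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

x_inv = np.linalg.inv(A) @ b        # textbook route: x = A^{-1} b
x_solve = np.linalg.solve(A, b)     # preferred: factorizes A instead of forming A^{-1}

print(x_solve)                      # [2. 3.]
print(np.allclose(x_inv, x_solve), np.allclose(A @ x_solve, b))
```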
19
Machine Learning Srihari

Solving Simultaneous equations


•  Ax = b
where A is (M+1) × (M+1)
x is (M+1) × 1: the set of weights to be determined
b is (M+1) × 1

20
Example: System of Linear
Machine Learning Srihari

Equations in Linear Regression


•  Instead of Ax=b
•  We have Φw = t
–  where Φ is the n × m design matrix of m features for n
samples xj, j = 1,..,n
–  w is a weight vector of m values
–  t is the vector of target values, t = [t1,..,tn]
–  We need the weights w to be used with the m features to
determine the output

y(x,w) = \sum_{i=1}^{m} w_i x_i
Machine Learning Srihari

Closed-form solutions
•  Two closed-form solutions
1. Matrix inversion x=A-1b
2. Gaussian elimination

22
Machine Learning Srihari

Linear Equations: Closed-Form Solutions

1. Matrix Formulation: Ax=b


Solution: x=A-1b

2. Gaussian Elimination
followed by back-substitution

Row operations (from the worked figure): L2 − 3L1 → L2,  L3 − 2L1 → L3,  −L2/4 → L2


Machine Learning Srihari

Disadvantage of closed-form solutions


•  If A-1 exists, the same A-1 can be used for any
given b
–  But A-1 cannot be represented with sufficient
precision
–  It is not used in practice
•  Gaussian elimination also has disadvantages
–  numerical instability (division by small no.)
–  O(n3) for n x n matrix
•  Software solutions use value of b in finding x
–  E.g., the difference (derivative) between b and the output is
used iteratively
Machine Learning Srihari

How many solutions for Ax=b exist?


•  A system of equations with n variables and m equations is:

A_{1,1} x_1 + A_{1,2} x_2 + .... + A_{1,n} x_n = b_1
A_{2,1} x_1 + A_{2,2} x_2 + .... + A_{2,n} x_n = b_2
      ⋮
A_{m,1} x_1 + A_{m,2} x_2 + .... + A_{m,n} x_n = b_m

•  Solution is x=A-1b
•  In order for A-1 to exist Ax=b must have
exactly one solution for every value of b
–  It is also possible for the system of equations to
have no solutions or an infinite no. of solutions for
some values of b
•  It is not possible to have more than one but fewer than
infinitely many solutions
–  If x and y are solutions then z = αx + (1−α)y is a
solution for any real α
Machine Learning Srihari

Span of a set of vectors


•  Span of a set of vectors: set of points obtained
by a linear combination of those vectors
–  A linear combination of vectors {v⁽¹⁾,.., v⁽ⁿ⁾} with
coefficients ci is Σi ci v⁽ⁱ⁾
–  System of equations is Ax = b
•  A column of A, i.e., A:,i specifies travel in direction i
•  How much we need to travel is given by xi
•  This is a linear combination of vectors: Ax = Σi xi A:,i

–  Thus determining whether Ax=b has a solution is


equivalent to determining whether b is in the span of
columns of A
•  This span is referred to as column space or range of A
Machine Learning Srihari

Conditions for a solution to Ax=b


•  Matrix must be square, i.e., m=n and all
columns must be linearly independent
–  Necessary condition is n ≥ m
•  For a solution to exist when A ∈ ℝ^{m×n} we require the
column space to be all of ℝ^m
–  Sufficient Condition
•  If columns are linear combinations of other columns,
the column space is less than ℝ^m
–  Columns are linearly dependent, or matrix is singular
•  For the column space to encompass ℝ^m, we need at least one
set of m linearly independent columns
•  For non-square and singular matrices
–  Methods other than matrix inversion are used
Machine Learning Srihari

Use of a Vector in Regression


•  A design matrix
–  N samples, D features

•  Feature vector has three dimensions


•  This is a regression problem
28
Machine Learning Srihari

Norms
•  Used for measuring the size of a vector
•  Norms map vectors to non-negative values
•  Norm of vector x = [x1,..,xn]T is distance from
origin to x
–  It is any function f that satisfies:

f(x) = 0 ⇒ x = 0
f(x + y) ≤ f(x) + f(y)        (Triangle Inequality)
∀α ∈ ℝ, f(αx) = |α| f(x)

29
LP Norm
Machine Learning Srihari

•  Definition:

||x||_p = \left( \sum_i |x_i|^p \right)^{1/p}

–  L2 Norm
•  Called the Euclidean norm
–  Simply the Euclidean distance
between the origin and the point x
–  Written simply as ||x||
–  Squared Euclidean norm is the same as xTx
–  (Figure: for the point (2,2), \sqrt{2^2 + 2^2} = \sqrt{8} = 2\sqrt{2})

–  L1 Norm
•  Useful when 0 and non-zero elements have to be distinguished
–  Note that L2 increases slowly near the origin, e.g., 0.1² = 0.01

–  L∞ Norm:  ||x||_∞ = \max_i |x_i|
•  Called the max norm
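A small NumPy illustration of the three norms on an arbitrary vector:

```python
import numpy as np

x = np.array([2.0, 2.0, -1.0])

l2 = np.linalg.norm(x)                  # L2 (Euclidean) norm
l1 = np.linalg.norm(x, ord=1)           # L1 norm: sum of absolute values
linf = np.linalg.norm(x, ord=np.inf)    # L-infinity (max) norm

print(l2, l1, linf)                     # 3.0 5.0 2.0
print(np.isclose(l2 ** 2, x @ x))       # squared L2 norm equals x^T x
```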


Machine Learning Srihari

Use of norm in Regression


•  Linear Regression
x: a vector, w: weight vector
y(x,w) = w0+w1x1+..+wd xd = wTx

With nonlinear basis functions ϕj


y(x,w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

•  Loss Function

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} ||w||^2

Second term is a weighted norm
called a regularizer (to prevent overfitting)
Machine Learning Srihari

LP Norm and Distance


•  Norm is the length of a vector

•  We can use it to draw a unit circle from origin


–  Different P values yield different shapes
•  Euclidean norm yields a circle

•  Distance between two vectors (v,w)


–  dist(v,w) = ||v − w||
            = \sqrt{(v_1 - w_1)^2 + .. + (v_n - w_n)^2}

Distance to the origin is just the square root of the sum of squares
Machine Learning Srihari

Size of a Matrix: Frobenius Norm


•  Similar to the L2 norm:

||A||_F = \left( \sum_{i,j} A_{i,j}^2 \right)^{1/2}

A = \begin{bmatrix} 2 & -1 & 5 \\ 0 & 2 & 1 \\ 3 & 1 & 1 \end{bmatrix}
\quad\Rightarrow\quad
||A||_F = \sqrt{4 + 1 + 25 + .. + 1} = \sqrt{46}

•  Frobenius norm in ML
–  Layers of a neural network involve matrix multiplication
(figure: layer sizes I, J, K with weight matrices V and W):
net_J = x_{1×(I+1)} V_{(I+1)×J},   h_j = f(net_j),   f(x) = 1/(1+e^{-x})
–  Regularization:
•  minimize the Frobenius norm of the weight
matrices ||W(i)||_F over the L layers
33
Machine Learning Srihari

Angle between Vectors


•  Dot product of two vectors can be written in
terms of their L2 norms and the angle θ between
them:
xTy = ||x||2 ||y||2 cos θ

•  Cosine between two vectors is a measure of


their similarity
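A brief NumPy sketch of cosine similarity between two arbitrary vectors:

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])
y = np.array([2.0, 1.0, 0.0])

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)        # ≈ 0.632; values near 1 indicate similar directions
```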

34
Machine Learning Srihari

Special kind of Matrix: Diagonal


•  Diagonal Matrix has mostly zeros, with non-
zero entries only in diagonal
–  E.g., identity matrix, where all diagonal entries are 1

–  E.g., covariance matrix with independent features

N(x | μ, Σ) = \frac{1}{(2π)^{D/2} |Σ|^{1/2}} \exp\left\{ -\frac{1}{2} (x - μ)^T Σ^{-1} (x - μ) \right\}
If Cov(X,Y)=0 then E(XY)=E(X)E(Y)
Machine Learning Srihari

Efficiency of Diagonal Matrix


•  diag (v) denotes a square diagonal matrix with
diagonal elements given by entries of vector v
•  Multiplying vector x by a diagonal matrix is
efficient
–  To compute diag(v)x we only need to scale each xi
by vi
diag(v) x = v ⊙ x
•  Inverting a square diagonal matrix is efficient
–  Inverse exists iff every diagonal entry is nonzero, in
which case diag(v)⁻¹ = diag([1/v1,..,1/vn]T)
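An illustrative NumPy check of both facts, with an arbitrary diagonal (all entries nonzero):

```python
import numpy as np

v = np.array([2.0, 4.0, 5.0])
x = np.array([1.0, 1.0, 2.0])

# Multiplying by diag(v) only rescales each component: diag(v) x = v ⊙ x
assert np.allclose(np.diag(v) @ x, v * x)

# Inverting diag(v) only needs element-wise reciprocals of the diagonal
assert np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v))
print(v * x, 1.0 / v)
```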
Machine Learning Srihari

Special kind of Matrix: Symmetric


•  A symmetric matrix equals its transpose: A=AT
–  E.g., a distance matrix is symmetric with Aij=Aji

–  E.g., covariance matrices are symmetric


Machine Learning Srihari

Special Kinds of Vectors


•  Unit Vector
–  A vector with unit norm: ||x||2 = 1

•  Orthogonal Vectors
–  A vector x and a vector y are
orthogonal to each other if xTy=0
•  If vectors have nonzero norm, vectors at
90 degrees to each other
–  Orthonormal Vectors
•  Vectors are orthogonal & have unit norm
•  Orthogonal Matrix
–  A square matrix whose rows are mutually
orthonormal: AᵀA = AAᵀ = I
–  A⁻¹ = Aᵀ
•  Orthogonal matrices are of interest because their
inverse is very cheap to compute
Machine Learning Srihari

Matrix decomposition
•  Matrices can be decomposed into factors to
learn universal properties, just like integers:
–  Properties not discernible from their representation
1. Decomposition of integer into prime factors
•  From 12=2×2×3 we can discern that
–  12 is not divisible by 5 or
–  any multiple of 12 is divisible by 3
–  But representations of 12 in binary or decimal are different
2. Decomposition of Matrix A as A=Vdiag(λ)V-1
•  where V is formed of eigenvectors and λ are eigenvalues,
e.g,
A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} has eigenvalues λ=1 and λ=3 and eigenvectors
v_{λ=1} = \begin{bmatrix} 1 \\ -1 \end{bmatrix},  v_{λ=3} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
Machine Learning Srihari

Eigenvector
•  An eigenvector of a square matrix
A is a non-zero vector v such that
multiplication by A only changes
the scale of v
Av=λv
–  The scalar λ is known as eigenvalue
•  If v is an eigenvector of A, so is
any rescaled vector sv. Moreover,
sv still has the same eigenvalue.
Thus we look for a unit eigenvector
(figure: Wikipedia)
40
Machine Learning Srihari

Eigenvalue and Characteristic Polynomial


•  Consider Av = w with

A = \begin{bmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{n,1} & \cdots & A_{n,n} \end{bmatrix}
\quad
v = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}
\quad
w = \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}

•  If v and w are scalar multiples, i.e., if Av=λv


•  then v is an eigenvector of the linear transformation A
and the scale factor λ is the eigenvalue corresponding
to the eigenvector
•  This is the eigenvalue equation of matrix A
–  Stated equivalently as (A − λI)v = 0
–  This has a non-zero solution iff |A − λI| = 0
•  The polynomial of degree n can be factored as
|A − λI| = (λ1 − λ)(λ2 − λ)…(λn − λ)
•  The λ1, λ2…λn are roots of the polynomial and are
eigenvalues of A
Machine Learning Srihari

Example of Eigenvalue/Eigenvector
•  Consider the matrix

A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}

•  Taking the determinant of (A − λI), the characteristic polynomial is

|A − λI| = \begin{vmatrix} 2-λ & 1 \\ 1 & 2-λ \end{vmatrix} = 3 − 4λ + λ²

•  It has roots λ=1 and λ=3, which are the two
eigenvalues of A
•  The eigenvectors are found by solving for v in
Av = λv, which gives

v_{λ=1} = \begin{bmatrix} 1 \\ -1 \end{bmatrix},  v_{λ=3} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
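The same example checked with NumPy; np.linalg.eig returns unit-norm eigenvectors, so they are rescaled versions of (1, −1) and (1, 1):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)        # [3. 1.] (order may vary)
print(eigvecs)        # columns are unit eigenvectors proportional to (1, 1) and (1, -1)

# Rebuild A from its eigendecomposition A = V diag(lambda) V^{-1}
V = eigvecs
print(np.allclose(A, V @ np.diag(eigvals) @ np.linalg.inv(V)))   # True
```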
Machine Learning Srihari

Eigendecomposition
•  Suppose that matrix A has n linearly
independent eigenvectors {v(1),..,v(n)} with
eigenvalues {λ1,..,λn}
•  Concatenate eigenvectors to form matrix V
•  Concatenate eigenvalues to form vector
λ=[λ1,..,λn]
•  Eigendecomposition of A is given by
A=Vdiag(λ)V-1
43
Machine Learning Srihari

Decomposition of Symmetric Matrix


•  Every real symmetric matrix A can be
decomposed into real-valued eigenvectors and
eigenvalues
A=QΛQT
where Q is an orthogonal matrix composed of
eigenvectors of A: {v(1),..,v(n)}
orthogonal matrix: the eigenvectors are mutually orthogonal, i.e., v(i)ᵀv(j) = 0 for i ≠ j
Λ is a diagonal matrix of eigenvalues {λ1,..,λn}
•  We can think of A as scaling space by λi in
direction v(i)
–  See figure on next slide 44
Machine Learning Srihari

Effect of Eigenvectors and Eigenvalues


•  Example of 2×2 matrix
•  Matrix A with two orthonormal eigenvectors
–  v(1) with eigenvalue λ1, v(2) with eigenvalue λ2
(Figure) Plot of unit vectors u ∈ ℝ² (circle) and of the vectors Au (ellipse),
with two variables x1 and x2

45
Machine Learning Srihari

Eigendecomposition is not unique


•  Eigendecomposition is A=QΛQT
–  where Q is an orthogonal matrix composed of
eigenvectors of A
•  Decomposition is not unique when two
eigenvalues are the same
•  By convention order entries of Λ in descending
order:
–  Under this convention, eigendecomposition is
unique if all eigenvalues are unique
46
Machine Learning Srihari

What does eigendecomposition tell us?


•  Tells us useful facts about the matrix:
1.  Matrix is singular if & only if any eigenvalue is zero
2.  Useful to optimize quadratic expressions of form
f(x)=xTAx subject to ||x||2=1
Whenever x is equal to an eigenvector, f is equal to the
corresponding eigenvalue
Maximum value of f is the max eigenvalue, minimum value is
the min eigenvalue
Example of such a quadratic form appears in the multivariate
Gaussian:

N(x | μ, Σ) = \frac{1}{(2π)^{D/2} |Σ|^{1/2}} \exp\left\{ -\frac{1}{2} (x - μ)^T Σ^{-1} (x - μ) \right\}


47
Machine Learning Srihari

Positive Definite Matrix


•  A matrix whose eigenvalues are all positive is
called positive definite
–  Positive or zero is called positive semidefinite
•  If eigen values are all negative it is negative
definite
–  Positive semidefinite matrices guarantee that xTAx ≥ 0;
positive definite matrices additionally guarantee that xTAx > 0 for x ≠ 0

48
Machine Learning Srihari

Singular Value Decomposition (SVD)


•  Eigendecomposition has form: A=Vdiag(λ)V-1
–  If A is not square, eigendecomposition is undefined
•  SVD is a decomposition of the form A=UDVT
•  SVD is more general than eigendecomposition
–  Used with any matrix rather than symmetric ones
–  Every real matrix has a SVD
•  Same is not true of eigen decomposition
Machine Learning Srihari

SVD Definition
•  Write A as a product of 3 matrices: A=UDVT
–  If A is m×n, then U is m×m, D is m×n, V is n×n
•  Each of these matrices has a special structure
•  U and V are orthogonal matrices
•  D is a diagonal matrix not necessarily square
–  Elements of Diagonal of D are called singular values of A
–  Columns of U are called left singular vectors
–  Columns of V are called right singular vectors

•  SVD interpreted in terms of eigendecomposition


•  Left singular vectors of A are eigenvectors of AAT
•  Right singular vectors of A are eigenvectors of ATA
•  Nonzero singular values of A are square roots of the
eigenvalues of ATA (which are the same as those of AAT)
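An illustrative NumPy sketch with an arbitrary non-square matrix, checking the factorization and the link to the eigenvalues of AᵀA:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])           # 2×3: no eigendecomposition, but an SVD exists

U, s, Vt = np.linalg.svd(A)            # s holds the singular values (descending)
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ D @ Vt))      # A = U D V^T
# Singular values are square roots of the nonzero eigenvalues of A^T A
print(np.allclose(np.sort(s ** 2), np.sort(np.linalg.eigvalsh(A.T @ A))[-2:]))
```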
Machine Learning Srihari

Use of SVD in ML
1.  SVD is used in generalizing matrix inversion
–  Moore-Penrose inverse (discussed next)
2.  Used in Recommendation systems
–  Collaborative filtering (CF)
•  Method to predict a rating for a user-item pair based on the
history of ratings given by the user and given to the item
•  Most CF algorithms are based on user-item rating matrix
where each row represents a user, each column an item
–  Entries of this matrix are ratings given by users to items
•  SVD reduces the no. of features of a data set by reducing space
dimensions from N to K where K < N
51
Machine Learning Srihari

SVD in Collaborative Filtering

•  X is the utility matrix


–  Xij denotes how user i likes item j
–  CF fills blank (cell) in utility matrix that has no entry
•  Scalability and sparsity is handled using SVD
–  SVD decreases dimension of utility matrix by
extracting its latent factors
•  Map each user and item into latent space of dimension r
52
Machine Learning Srihari

Moore-Penrose Pseudoinverse
•  Most useful feature of SVD is that it can be
used to generalize matrix inversion to non-
square matrices
•  Practical algorithms for computing the
pseudoinverse of A are based on SVD
A⁺ = VD⁺Uᵀ
–  where U, D, V come from the SVD of A (A = UDVᵀ)
•  Pseudoinverse D⁺ of D is obtained by taking the
reciprocal of its nonzero elements and then transposing
the resulting matrix
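A hedged sketch (arbitrary data) using the pseudoinverse to solve an overdetermined least-squares problem, and rebuilding A⁺ from the SVD as described above:

```python
import numpy as np

# Overdetermined system: 4 equations, 2 unknowns, no exact solution
A = np.array([[1., 1.],
              [1., 2.],
              [1., 3.],
              [1., 4.]])
b = np.array([6., 5., 7., 10.])

x = np.linalg.pinv(A) @ b                       # least-squares solution via A+

# Same A+ built from the SVD: A+ = V D+ U^T (reciprocals of the nonzero singular values)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1.0 / s) @ U.T @ b
print(x, np.allclose(x, x_svd))                 # intercept and slope of the best-fit line
```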
53
Machine Learning Srihari

Trace of a Matrix
•  Trace operator gives the sum of the elements
along the diagonal
Tr(A) = \sum_i A_{i,i}

•  The Frobenius norm of a matrix can be represented as

||A||_F = \sqrt{Tr(A Aᵀ)}

54
Machine Learning Srihari

Determinant of a Matrix
•  Determinant of a square matrix det(A) is a
mapping to a scalar
•  It is equal to the product of all eigenvalues of
the matrix
•  Measures how much multiplication by the
matrix expands or contracts space

55
Machine Learning Srihari

Example: PCA
•  A simple ML algorithm is Principal Components
Analysis
•  It can be derived using only knowledge of basic
linear algebra

56
Machine Learning Srihari

PCA Problem Statement


•  Given a collection of m points {x(1),..,x(m)} in
Rn represent them in a lower dimension.
–  For each point x(i) find a code vector c(i) in Rl
–  If l is smaller than n it will take less memory to
store the points
–  This is lossy compression
–  Find encoding function f (x) = c and a decoding
function x ≈ g ( f (x) )

57
Machine Learning Srihari

PCA using Matrix multiplication


•  One choice of decoding function is to use
matrix multiplication: g(c) = Dc where D ∈ ℝ^{n×l}

–  D is a matrix with l columns


•  To keep encoding easy, we require columns of
D to be orthogonal to each other
–  To constrain solutions we require columns of D to
have unit norm
•  We need to find optimal code c* given D
•  Then we need optimal D
58
Machine Learning Srihari

Finding optimal code given D


•  To generate optimal code point c* given input
x, minimize the distance between input point x
and its reconstruction g(c*)
c* = \arg\min_c ||x − g(c)||_2

–  Using the squared L2 norm instead of the L2 norm, the function being
minimized is equivalent to

(x − g(c))ᵀ (x − g(c))

•  Using g(c) = Dc, the optimal code can be shown to
be equivalent to

c* = \arg\min_c \; −2xᵀDc + cᵀc
Machine Learning Srihari

Optimal Encoding for PCA


•  Using vector calculus:
∇_c(−2xᵀDc + cᵀc) = 0
−2Dᵀx + 2c = 0
c = Dᵀx

•  Thus we can encode x using a matrix-vector


operation
–  To encode we use f(x)=DTx
–  For PCA reconstruction, since g(c)=Dc we use
r(x)=g(f(x))=DDTx
–  Next we need to choose the encoding matrix D

60
Machine Learning Srihari

Method for finding optimal D


•  Revisit idea of minimizing L2 distance between
inputs and reconstructions
–  But cannot consider points in isolation
–  So minimize error over all points: Frobenius norm
D* = \arg\min_D \left( \sum_{i,j} \left( x_j^{(i)} − r(x^{(i)})_j \right)^2 \right)^{1/2}

•  subject to DᵀD = I_l
•  Use the design matrix X ∈ ℝ^{m×n}
–  Given by stacking all vectors describing the points


•  To derive an algorithm for finding D*, start by
considering the case l = 1
–  In this case D is just a single vector d
Machine Learning Srihari

Final Solution to PCA


•  For l =1, the optimization problem is solved
using eigendecomposition
–  Specifically the optimal d is given by the
eigenvector of XTX corresponding to the largest
eigenvalue
•  More generally, matrix D is given by the l
eigenvectors of XᵀX corresponding to the largest
eigenvalues (proof by induction)
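A minimal PCA sketch following these steps; the data are random, and centering the points is an assumption added here (the slides do not discuss preprocessing):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))   # m=100 points in R^5
X = X - X.mean(axis=0)                       # center the data (assumed preprocessing)

l = 2
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # symmetric matrix: ascending eigenvalues, orthonormal vectors
D = eigvecs[:, -l:]                          # columns = eigenvectors with the l largest eigenvalues

codes = X @ D                                # encode each row: f(x) = D^T x
X_rec = codes @ D.T                          # reconstruct: r(x) = D D^T x
print(D.shape, codes.shape)                  # (5, 2) (100, 2)
print(np.mean((X - X_rec) ** 2))             # reconstruction error minimized by this choice of D
```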

62
