
IFT 6760A

Matrix and Tensor Factorization for


Machine Learning

Winter 2019

Instructor: Guillaume Rabusseau


email: [email protected]
Today

I What is this class about? Linear and multilinear algebra for ML...
I Class goals: refresh linear algebra, discover and have fun with
tensors and do research!
I Class logistics.
I Content overview.
I In-class quiz (not graded): for me to assess your background and
adjust material accordingly.
I Questions.
Material and discussion

I All material will be posted on my website


I We will use a Google discussion group for announcements and
discussions; don't forget to sign up (link on the course webpage).
I We will mainly use Studium for you to upload project proposals /
reports...
Class high-level overview

I Objectives
I End goal: for you to know what tensors and tensor decomposition
techniques are and how they can be used in ML.
I Along the way: learn useful theoretical tools from linear algebra and
matrix decomposition techniques.
I We will start back from linear algebra and build up from there:
1. Linear algebra refresher with focus on matrix decomposition
2. Applications of Linear algebra in ML
3. Introduction to tensors and tensor decomposition techniques
4. Tensors in ML
5. Seminar part of the class → your turn! (more on that later)
(rough timeline: (1,2) in Jan., (3,4) in Feb, (5) in Mar.)
I I am very open to feedback and suggestions; if there is a topic
you would like to see covered in class, come see me or send me an email!
About this class

I Not a math course but still a lot of maths


I I will assume a reasonable background in
I probability / statistics
I machine learning (methodology, bias-variance, regularization, etc.)
I basic theoretical CS (big O notation, what is a graph, etc.)
I linear algebra (even though we will review main results)

I We will do a lot of proofs...


→ getting comfortable with writing / reading formal proofs can be
seen as a side-objective of the class.
Who is this class for?

I Advanced grad students


I If you’re a first-year MSc student, the class may or may not be a good
fit depending on your background: come discuss with me or send me an
email with your background and the list of relevant courses you have taken.
I You should have taken at least one ML course (e.g.
IFT3395/6390, COMP-551, COMP-652).
Class goals

I Get a good grasp on linear algebra and matrix decomposition


techniques in the context of ML
I Learn theoretical tools potentially relevant to many aspects of ML
I Learn about tensors and how they can be used in ML

I Read research papers, get familiar with the literature


I Engage in research, connect the tools we’ll learn with your own
research
I Work on your presentation and writing skills
I Work on a semester research project
Class logistics
I Language: everything will be English (many foreign students,
papers are in English...), come see me or send me a mail if this is a
concern.
I Grading:
I participation (10%): questions/comments during class, project
updates, feedback, etc.
I scribing (10%): most lectures will be on the board; each lecture, a
group of students is responsible for taking notes and writing them up
(in LaTeX) for the following week.
Each student has to scribe at least once (likely twice) during the
semester.
I paper presentation (30%)...
I research project (50%)...
→ no midterm, but a project proposal is due in the middle of the semester.
→ no assignments per se, but readings and some proofs left as
exercises.
Paper presentation

I A few classes (likely starting late Feb./early March) will be devoted
to paper presentations by students.
I The goal is for you to work on your presentation skills and potentially
start shaping your project.
I read a research paper from the literature (I will post references, but
you can come up with your own suggestion, to be validated);
I present the work in class in a talk, either with slides or on the
blackboard;
I graded on the quality of the presentation.
I Specifics to be set up later (size of groups, length of presentations,
etc.).
Research project

I Groups of 2-3 students


I Proposal due middle of semester
I One lecture will be devoted to project presentations / progress
reports
I Project final presentation (either talks or poster session, TBD) at
the end of the semester (date TBD)
I Project final written report due end of semester (date TBD)
Research project

I Topic chosen based on


I your own research interests (aligned with the class)
I a bibliography I will make available as we progress in the class.
I Each project should be grounded on at least one research paper.
I Only requirement: the project should be related to the class
content.
Research project

I Types of projects:
I Literature review: choose a topic/problem and present the existing
approaches to handle it, comparing them and analyzing their
drawbacks / advantages / limits, perform some experimental
comparison.
I Application: apply ideas/algorithms seen in class to a new problem
you find interesting or related to your research.
I Theory: focus on theoretical analysis of an algorithm or a problem
(e.g. robustness analysis, complexity analysis), propose a new
method to tackle a problem (existing or a new problem), ideally still
perform experiments (e.g. on synthetic data).
I Best case scenario: project ⇒ research paper
Times and dates

I Class on Tuesdays 12:30-14:15 and Thursdays 11:30-13:15 here at


the Mila auditorium.
I Schedule conflicts (COMP-551/767)?
I Office hours: Tuesdays after class in my office (D15 @ 6666) or
one of the meeting rooms at Mila
I Reading week: March 3-8
I If I understood correctly:
I Deadline to switch/drop the course without fees: January 22nd
I Deadline to drop the course with fees: March 15th
Auditing the class

I Everyone’s welcome (as long as there is enough room, but this
should be ok!)
I Participating in research projects may be doable for auditing
students as well; come see me.
I Sign-up sheet
Questions?
Class high-level overview

I Overall objective: for you to know what tensors and tensor


decomposition techniques are and how they can be used in ML.
I We will start back from linear algebra and build up from there:
1. Linear algebra refresher with focus on matrix decomposition
2. Linear algebra and ML / linear algebra and discrete maths
3. Introduction to tensors and tensor decomposition techniques
4. Tensors in ML
5. Seminar part of the class → your turn! (more on that later)
(rough timeline: (1,2) in Jan., (3,4) in Feb, (5) in Mar.)
I I am very open to feedback and suggestions; if there is a topic you
would like to see covered in class, come see me or send me an email!
Linear Algebra road map
I Start by briefly recalling the basics: span, linear independence,
dimension, rank of a matrix, rank nullity theorem, orthogonality,
subspaces, projection, etc.
I Give some special care to eigenvalues/eigenvectors,
diagonalizability, etc., and present common factorizations (low-rank
factorization, QR, eigendecomposition, SVD) and a bit of optimization
(mainly the power method).
I Present a few fundamental theorems and tools (and prove some of
them):
I Spectral theorem
I Jordan canonical form
I Eckart-Young-Mirsky theorem
I Min-max theorem (Courant-Fischer-Weyl theorem)
I Generalized eigenvalue problems
I Perturbation bounds on eigenvectors/eigenvalues
I Illustrate with ML applications.
Linear Algebra and ML: Teasers

Let’s look at a few examples of connections between algebra and


ML / CS...
Linear Algebra and ML: Linear Regression

I We want to find a linear f : R^d → R (i.e. f : x ↦ w^T x) minimizing
the squared error loss on the training data L = Σ_{i=1}^N (f(x_i) − y_i)^2.
I Solution: w* = (X^T X)^{-1} X^T y where X ∈ R^{N×d} and y ∈ R^N.
I The vector of predictions is given by ŷ = Xw* = X(X^T X)^{-1} X^T y,
where X(X^T X)^{-1} X^T is an orthogonal projection.

ŷ is the orthogonal projection of y onto the subspace of R^N
spanned by the columns of X.
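A minimal numpy sketch of the closed-form solution (synthetic random data; in practice np.linalg.lstsq or a pseudo-inverse is preferred to forming (X^T X)^{-1} explicitly):

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.normal(size=(N, d))                # design matrix (N samples, d features)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)  # noisy linear targets

# closed-form least-squares solution: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# predictions are the orthogonal projection of y onto the column space of X
y_hat = X @ w_star
residual = y - y_hat
print(abs(residual @ y_hat))               # ~0: the residual is orthogonal to the predictions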
Linear Algebra and ML: Principal Component Analysis

I Given a set of points x_1, ..., x_N ∈ R^d (assumed centered), we want
to find the k-dimensional subspace of R^d such that the projections
of the points onto this subspace
I have maximal variance,
I stay as close as possible to the original points (in ℓ2 distance).

The solution is given by the subspace spanned by the top k
eigenvectors of the covariance matrix X^T X ∈ R^{d×d}.
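A minimal numpy sketch of PCA via the eigendecomposition of the covariance matrix (synthetic data, centered as above):

import numpy as np

rng = np.random.default_rng(0)
N, d, k = 500, 10, 3
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))  # correlated synthetic data
X = X - X.mean(axis=0)                                 # center the points

cov = X.T @ X
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
U = eigvecs[:, -k:]                      # top-k eigenvectors span the PCA subspace

Z = X @ U          # k-dimensional representation
X_proj = Z @ U.T   # projection of the points back into R^d
print(np.linalg.norm(X - X_proj))  # reconstruction error, minimal over all k-dim subspaces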
Linear Algebra and ML: Spectral Graph Clustering

I The Laplacian of a graph is the difference between its degree
matrix and its adjacency matrix: L = D − A

[Example graph on vertices v1, v2, v3, v4 with edges v1–v2, v1–v3, v2–v3, v3–v4]

        [2 0 0 0]   [0 1 1 0]
    L = [0 2 0 0] − [1 0 1 0]
        [0 0 3 0]   [1 1 0 1]
        [0 0 0 1]   [0 0 1 0]

Zero is an eigenvalue of the Laplacian, and its multiplicity is
equal to the number of connected components of G.
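A quick numerical check of this claim on the example graph above (numpy only):

import numpy as np

# adjacency matrix of the example graph on v1..v4 (edges v1-v2, v1-v3, v2-v3, v3-v4)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # graph Laplacian

eigvals = np.linalg.eigvalsh(L)  # L is symmetric positive semi-definite
print(eigvals)                            # the smallest eigenvalue is 0
print(np.sum(np.isclose(eigvals, 0.0)))   # multiplicity of 0 = number of connected components (here 1)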
Linear Algebra and ML: Spectral Learning of HMMs/WFAs

I Let Σ be a finite alphabet (e.g. Σ = {a, b}).
I Let Σ* be the set of all finite sequences of symbols in Σ (e.g.
a, b, ab, aab, bbba, ...).
I Given a real-valued function f : Σ* → R, its Hankel matrix
H ∈ R^{Σ* × Σ*} is a bi-infinite matrix whose entries are given by
H_{u,v} = f(uv).

The rank of H is finite if and only if f can be computed by
a weighted automaton.

(if you don’t know what a weighted automaton is, think of some kind
of RNN with linear activation functions)
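A small illustration (the 2-state automaton and the prefix/suffix sets below are arbitrary choices for the example): build a finite sub-block of the Hankel matrix of a function computed by a weighted automaton and check that its rank does not exceed the number of states.

import numpy as np

# toy weighted automaton with 2 states over Sigma = {a, b}:
# f(x1...xn) = alpha^T A_{x1} ... A_{xn} omega
alpha = np.array([1.0, 0.0])
omega = np.array([1.0, 1.0])
A = {"a": np.array([[0.5, 0.2], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.4], [0.2, 0.1]])}

def f(word):
    v = alpha
    for symbol in word:
        v = v @ A[symbol]
    return v @ omega

# finite sub-block of the Hankel matrix, H[u, v] = f(uv)
words = ["", "a", "b", "aa", "ab", "ba", "bb"]
H = np.array([[f(u + v) for v in words] for u in words])
print(np.linalg.matrix_rank(H))   # at most 2 = number of states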
Linear Algebra and ML: Method of Moments

I Consider a Gaussian mixture model with k components, where the
ith Gaussian has mean µ_i ∈ R^d and all Gaussians have the same
diagonal covariance σ^2 I, i.e. the pdf of x is

    f(x) = Σ_{i=1}^k p_i N(µ_i, σ^2 I)

The rank of the (modified) second-order moment

    M = E[xx^T] − σ^2 I

is at most k.

(we actually have M = Σ_i p_i µ_i µ_i^T)
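A quick numerical check of the rank claim (arbitrary weights and means, numpy only; here M is formed exactly rather than estimated from samples):

import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3
mus = rng.normal(size=(k, d))    # k means in R^d (generically linearly independent)
p = np.array([0.2, 0.3, 0.5])    # mixing weights

# M = sum_i p_i mu_i mu_i^T, i.e. the modified second-order moment E[xx^T] - sigma^2 I
M = sum(p[i] * np.outer(mus[i], mus[i]) for i in range(k))
print(np.linalg.matrix_rank(M))  # at most k (here exactly 3)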
Spectral Methods (high-level view)

I Spectral methods usually achieve learning by extracting structure


from observable quantities through eigen-decompositions/tensor
decompositions.
I Spectral methods often constitute an alternative to EM to learn
latent variable models (e.g. HMMs, single-topic/Gaussian mixtures
models).
I Advantages of spectral methods:
I computationally efficient,
I consistent,
I no local optima.
Tensors

What about tensors?


Tensors vs Matrices

M ∈ R^{d1×d2}:        M_{ij} ∈ R for i ∈ [d1], j ∈ [d2]
T ∈ R^{d1×d2×d3}:     T_{ijk} ∈ R for i ∈ [d1], j ∈ [d2], k ∈ [d3]
Tensors and Machine Learning

(i) Data has a tensor structure: color image, video, multivariate time
series...

(ii) Tensors as parameters of a model: neural networks, polynomial
regression, weighted tree automata... e.g. (see the sketch after this list)

    f(x) = Σ_{i,j,k} T_{ijk} x_i x_j x_k

(iii) Tensors as tools: tensor method of moments, systems of polynomial
equations, layer compression in neural networks...
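For item (ii), a minimal sketch of evaluating f(x) = Σ_{i,j,k} T_{ijk} x_i x_j x_k with numpy (random T and x, only to show the contraction):

import numpy as np

rng = np.random.default_rng(0)
d = 4
T = rng.normal(size=(d, d, d))   # parameter tensor of a degree-3 polynomial model
x = rng.normal(size=d)

# f(x) = sum_{i,j,k} T_{ijk} x_i x_j x_k
f_x = np.einsum("ijk,i,j,k->", T, x, x, x)
print(f_x)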
Tensors

M ∈ R^{d1×d2}:        M_{ij} ∈ R for i ∈ [d1], j ∈ [d2]
T ∈ R^{d1×d2×d3}:     T_{ijk} ∈ R for i ∈ [d1], j ∈ [d2], k ∈ [d3]

I Outer product. If u ∈ R^{d1}, v ∈ R^{d2}, w ∈ R^{d3}:

    u ⊗ v = uv^T ∈ R^{d1×d2},             (u ⊗ v)_{i,j} = u_i v_j
    u ⊗ v ⊗ w ∈ R^{d1×d2×d3},             (u ⊗ v ⊗ w)_{i,j,k} = u_i v_j w_k
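A sketch of these outer products in numpy (arbitrary dimensions):

import numpy as np

rng = np.random.default_rng(0)
u, v, w = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)

uv = np.outer(u, v)                     # u ⊗ v, shape (3, 4)
uvw = np.einsum("i,j,k->ijk", u, v, w)  # u ⊗ v ⊗ w, shape (3, 4, 5)

print(np.allclose(uv[1, 2], u[1] * v[2]))              # (u ⊗ v)_{ij} = u_i v_j
print(np.allclose(uvw[1, 2, 3], u[1] * v[2] * w[3]))   # (u ⊗ v ⊗ w)_{ijk} = u_i v_j w_k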


Tensors: mode-n fibers

I Matrices have rows and columns, tensors have fibers:

[Figure: fibers of a 3rd-order tensor — (a) mode-1 (column) fibers x_{:jk},
(b) mode-2 (row) fibers x_{i:k}, (c) mode-3 (tube) fibers x_{ij:};
from Kolda and Bader, Tensor decompositions and applications, 2009.]
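In numpy, fibers are simply slices along one mode (a small sketch with an arbitrary 2×3×4 tensor):

import numpy as np

T = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # a 2 x 3 x 4 tensor

col_fiber  = T[:, 1, 2]   # mode-1 (column) fiber x_{:jk}
row_fiber  = T[0, :, 2]   # mode-2 (row) fiber    x_{i:k}
tube_fiber = T[0, 1, :]   # mode-3 (tube) fiber   x_{ij:}
print(col_fiber.shape, row_fiber.shape, tube_fiber.shape)   # (2,) (3,) (4,)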
Tensors: Matricizations

I T ∈ R^{d1×d2×d3} can be reshaped into a matrix as

    T_(1) ∈ R^{d1 × d2·d3}
    T_(2) ∈ R^{d2 × d1·d3}
    T_(3) ∈ R^{d3 × d1·d2}

[Figure: a tensor T and its mode-1 matricization T_(1)]
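A sketch of the mode-n matricizations with numpy (this uses one particular ordering of the remaining modes; other column orderings appear in the literature, and the rank is the same either way):

import numpy as np

d1, d2, d3 = 2, 3, 4
T = np.arange(d1 * d2 * d3).reshape(d1, d2, d3).astype(float)

def matricize(T, mode):
    # move the chosen mode to the front, then flatten the remaining modes
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T1 = matricize(T, 0)   # shape (d1, d2*d3)
T2 = matricize(T, 1)   # shape (d2, d1*d3)
T3 = matricize(T, 2)   # shape (d3, d1*d2)
print(T1.shape, T2.shape, T3.shape)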
Tensors: Multiplication with Matrices

    Matrix case: AMB^T ∈ R^{m1×m2}        Tensor case: T ×_1 A ×_2 B ×_3 C ∈ R^{m1×m2×m3}

ex: If T ∈ R^{d1×d2×d3} and B ∈ R^{m2×d2}, then T ×_2 B ∈ R^{d1×m2×d3} and

    (T ×_2 B)_{i1,i2,i3} = Σ_{k=1}^{d2} T_{i1,k,i3} B_{i2,k}   for all i1 ∈ [d1], i2 ∈ [m2], i3 ∈ [d3].
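A sketch of the mode-2 product with numpy's einsum (arbitrary shapes):

import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3, m2 = 2, 3, 4, 5
T = rng.normal(size=(d1, d2, d3))
B = rng.normal(size=(m2, d2))

# (T x_2 B)_{i1,i2,i3} = sum_k T_{i1,k,i3} B_{i2,k}
T_x2_B = np.einsum("ikl,jk->ijl", T, B)
print(T_x2_B.shape)   # (d1, m2, d3) = (2, 5, 4)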
Tensors are not easy...

MOST TENSOR PROBLEMS ARE NP-HARD

CHRISTOPHER J. HILLAR AND LEK-HENG LIM

Abstract. The idea that one might extend numerical linear algebra, the collection of matrix
computational methods that form the workhorse of scientific and engineering computing, to
numerical multilinear algebra, an analogous collection of tools involving hypermatrices/tensors,
appears very promising and has attracted a lot of attention recently. We examine here the
computational tractability of some core problems in numerical multilinear algebra. We show that
tensor analogues of several standard problems that are readily computable in the matrix (i.e.
2-tensor) case are NP-hard. Our list here includes: determining the feasibility of a system of
bilinear equations, determining an eigenvalue, a singular value, or the spectral norm of a 3-tensor,
determining a best rank-1 approximation to a 3-tensor, determining the rank of a 3-tensor over
R or C. Hence making tensor computations feasible is likely to be a challenge.

[Hillar and Lim, Most tensor problems are NP-hard, Journal of the ACM, 2013.]

... but training a neural network with 3 nodes is also NP-hard
[Blum and Rivest, NIPS ’89]
Tensors vs. Matrices: Rank

I The rank of a matrix M is:
I the number of linearly independent columns of M
I the number of linearly independent rows of M
I the smallest integer R such that M can be written as a sum of R
rank-one matrices:

    M = Σ_{i=1}^R u_i v_i^T.

I The multilinear rank of a tensor T is a tuple of integers
(R1, R2, R3) where R_n is the number of linearly independent
mode-n fibers of T:

    R_n = rank(T_(n))

I The CP rank of T is the smallest integer R such that T can be
written as a sum of R rank-one tensors:

    T = Σ_{i=1}^R u_i ⊗ v_i ⊗ w_i.
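A sketch contrasting the two notions numerically: build a tensor of CP rank at most R from random factors and look at the ranks of its matricizations, i.e. its multilinear rank (factor dimensions below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3, R = 6, 7, 8, 3
U = rng.normal(size=(d1, R))
V = rng.normal(size=(d2, R))
W = rng.normal(size=(d3, R))

# T = sum_{i=1}^R u_i ⊗ v_i ⊗ w_i, hence CP rank at most R
T = np.einsum("ir,jr,kr->ijk", U, V, W)

def matricize(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# multilinear rank = ranks of the three matricizations
print([np.linalg.matrix_rank(matricize(T, n)) for n in range(3)])   # generically (3, 3, 3)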
CP and Tucker decomposition

I CP decomposition: [figure omitted]

I Tucker decomposition: [figure omitted]

(figures from [Kolda and Bader, Tensor decompositions and applications, 2009])
Hardness results

I These problems are all NP-hard for tensors of order ≥ 3 in general:


I Compute the CP rank of a given tensor
I Find the best approximation with CP rank R of a given tensor
I Find the best approximation with multilinear rank (R1 , · · · , Rp ) of a
given tensor (*)
I ...
I On the positive side:
I Computing the multilinear rank is easy and efficient algorithms exist
for (*).
I Under mild conditions, the CP decomposition is unique (modulo
scaling and permutations).
⇒ Very relevant for model identifiability...
Back to the Method of Moments

I Consider a Gaussian mixture model with k components, where the
ith Gaussian has mean µ_i ∈ R^d and all Gaussians have the same
diagonal covariance σ^2 I.

The (modified) second-order moment M = E[xx^T] − σ^2 I is
such that

    M = Σ_{i=1}^k p_i µ_i µ_i^T

I Can we recover the mixing weights p_i and centers µ_i from M?
I No, except if the µ_i are orthonormal, in which case they are the
eigenvectors of M and the p_i are the corresponding eigenvalues.
I But we will see that if we know both the matrix M and the 3rd-order
tensor T = Σ_{i=1}^k p_i µ_i ⊗ µ_i ⊗ µ_i, then we can recover the
weights and centers if the µ_i are linearly independent.
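A sketch of the orthonormal case: if the µ_i are orthonormal, an eigendecomposition of M recovers the weights as eigenvalues and the centers as eigenvectors, up to sign and ordering (the weights and dimensions below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal centers mu_1..mu_k as columns
p = np.array([0.2, 0.3, 0.5])                  # mixing weights

M = Q @ np.diag(p) @ Q.T                       # M = sum_i p_i mu_i mu_i^T

eigvals, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
print(eigvals[-k:])                            # top-k eigenvalues ≈ the mixing weights (sorted)
print(np.abs(eigvecs[:, -k:].T @ Q))           # ≈ a permutation matrix: eigenvectors match the mu_i up to sign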
Quiz

Quiz Time
