
CS 514 Advanced Topics in Network Science

Lecture 3. Matrix and Tensor


Hanghang Tong, Computer Science, Univ. Illinois at Urbana-Champaign, 2024
Network Science: An Overview
We are here
• network level (e.g., patterns, laws, connectivity, etc.)
• subgraph level (e.g., clusters, communities, dense subgraphs, etc.)
• node/link level (e.g., ranking, link prediction, embedding, etc.)

• Level 1: diameter, connectivity, graph-level classification, graph-level embedding, graph kernel, graph structure learning, graph generator,…
• Level 2: frequent subgraphs, clustering, community detection, motif, teams, dense subgraphs, subgraph matching, NetFair, …
• Level 3: node proximity, node classification, link prediction, anomaly detection, node embedding, network alignment, NetFair, …
• Beyond: network of X, …

2
Matrix & Tensor Tools
• Matrix Tools
– Proximity (covered in Lecture 2)
– Low-rank approximation
– Co-clustering
• Tensor Tools

3
Motivation
• Q: How to find patterns?
– e.g., communities, anomalies, etc.
• A (Common Approach): Low-Rank
Approximation (LRA) for Adjacency Matrix.
A ≈ L × M × R

4
Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos: Colibri: fast
mining of large static and dynamic graphs. KDD 2008: 686-694
LRA for Graph Mining
Author–conference adjacency matrix A (rows: authors; columns: conferences):

                ICDM  KDD  ISMB  RECOMB
John              1    1    0     0
Tom               1    1    0     0
Bob               1    1    0     0
Carl              0    1    1     1
Van               0    0    1     1
Roy               0    0    1     1
5
LRA for Graph Mining: Communities
Adj. matrix A ≈ L × M × R
• L: author–group matrix
• M: group–group interaction matrix
• R: conference–group matrix

6
LRA for Graph Mining: Anomalies
Adj. matrix A ≈ L × M × R (authors × conferences, as before)
Recon. error is high → ‘Carl’ is abnormal
7
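To make the anomaly-detection use of LRA concrete, here is a minimal sketch (assuming NumPy) on the toy author–conference matrix above, using a truncated SVD as one particular choice of the low-rank factors; the CUR/Colibri factorizations discussed later can be plugged in the same way.

```python
import numpy as np

# Toy author x conference adjacency matrix
# (rows: John, Tom, Bob, Carl, Van, Roy; columns: ICDM, KDD, ISMB, RECOMB).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Rank-2 approximation via truncated SVD (one possible choice of L, M, R).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Per-row reconstruction error: rows that fit neither community are anomalous.
row_err = np.linalg.norm(A - A_k, axis=1)
print(row_err)
print(int(np.argmax(row_err)))   # -> 3, i.e. Carl, who bridges both communities
```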
Challenges – Problem 1
• Prob. 1: Given a static graph A,
  + (C1) How to get (L, M, R) efficiently?
    - in both time and space
  + (C2) What is the interpretation of (L, M, R)?

8
Challenges – Problem 2
• Prob. 2: Given a dynamic graph A_t (t = 1, 2, …),
  + (C3) How to get (L_t, M_t, R_t) incrementally?
    - track patterns over time

9
Roadmap - LRA
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
10
Overview

A ≈ L × M × R
Step 1: find L (this is where the methods differ)
Step 2: project A onto the column space of L to get the projection of A (this step is the same for the different methods)
11
Matrix & Vector
• Matrix B (rows: authors; columns: conferences):

                 SIGMOD  ICML
Philip Yu           3      1
William Cohen       1      1
John Smith          0      0

• Column vectors: SIGMOD = [3, 1, 0]’, ICML = [1, 1, 0]’

12
Column Space
• Matrix B as before (rows: Philip Yu, William Cohen, John Smith; columns: SIGMOD, ICML)
• Column space of B: all linear combinations of its columns, e.g.,
  VLDB = SIGMOD – ICML = [2, 0, 0]’

13
Projection & Projection Matrix
ṽ = B (B^T B)^+ B^T v

• v: an arbitrary vector (e.g., KDD)
• ṽ: the projection of v onto the column space of B (e.g., spanned by ICML and SIGMOD)
• B (B^T B)^+ B^T: the projection matrix of B
• (B^T B)^+: the core matrix
14


Projection of a Matrix

Ã = B (B^T B)^+ B^T A = L × M × R

• L = B
• M = (B^T B)^+ : the core matrix
• R = B^T A
• Ã: the projection of A onto the column space of B

15
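A small NumPy sketch of this projection, using a couple of sampled columns of A in the role of B (the choice of columns is arbitrary here); the final assert checks that the projection matrix B (B^T B)^+ B^T is idempotent.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))
B = A[:, [0, 2]]                   # two columns of A play the role of B

L = B
M = np.linalg.pinv(B.T @ B)        # core matrix (B^T B)^+
R = B.T @ A

A_tilde = L @ M @ R                # projection of A onto the column space of B

# Projecting twice changes nothing: the projection matrix is idempotent.
P = B @ M @ B.T
assert np.allclose(P @ P, P)
```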
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
16
Singular-Value-Decomposition (SVD)
A ≈ U Σ V^T

• A: n × m
• U = [u_1, …, u_k]: left singular vectors
• Σ = diag(σ_1, …, σ_k)
• V = [v_1, …, v_k]: right singular vectors
17


SVD: definitions
• #1: Find the left matrix U, where
    u_i = A v_i / σ_i = (a_1 v_{i,1} + a_2 v_{i,2} + … + a_m v_{i,m}) / σ_i
• #2: Project A into the column space of U:
    A = U (U^T U)^+ U^T A = … = U Σ V^T

18
SVD: advantages
• Optimal Low-Rank Approximation
  – in both L2 and LF (the Frobenius norm)
  – for any rank-k matrix A_k:
      || A – Ã ||_{2,F} ≤ || A – A_k ||_{2,F}
    where Ã is the rank-k SVD approximation

19
SVD: drawbacks
• (C1) Efficiency A U  V
2 2
– Time O (min( n m, nm ))
[footnote: or O( E • Iter ) ] =
– Space (U, V) are dense

• (C2) Interpretation

20
SVD: drawbacks
• (C3) Dynamic: not easy
At Ut t Vt At+1 Ut+1 t+1 Vt+1

21
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
22
CUR (CX) decomposition
[Drineas+ 2005]

A ≈ C (C^T C)^+ C^T A

• Sample columns from A
• Project A onto the column space of C
  – Left matrix: C
  – Middle matrix: (C^T C)^+
  – Right matrix: C^T A
(A: n × m; C: the sampled columns)
23
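A minimal NumPy sketch of this sample-then-project scheme. Sampling columns with probability proportional to their squared norms is one common choice; the exact probabilities and rescaling used in [Drineas+ 2005] differ in details, so treat this as an illustration rather than that algorithm verbatim.

```python
import numpy as np

def cx_decomposition(A, c, rng=np.random.default_rng(0)):
    """Sample c columns of A (with replacement, probability ~ squared norm),
    then project A onto their span: A ~ C (C^T C)^+ C^T A."""
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx]                    # left matrix (may contain duplicates)
    M = np.linalg.pinv(C.T @ C)      # middle matrix
    R = C.T @ A                      # right matrix
    return C, M, R

A = np.random.default_rng(1).random((100, 50))
C, M, R = cx_decomposition(A, c=10)
print(np.linalg.norm(A - C @ M @ R) / np.linalg.norm(A))
```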
CUR (CX): advantages
• (C0) Quality: Near-Optimal
• (C1) Efficiency (better than SVD)
– Time: O(c²n) or O(c³ + cm)
  • (c is the # of sampled columns)
– Space: (C, R) are sparse

• (C2) Interpretation

24
CUR (CX): drawbacks
• (C1) Redundancy in C

• 3 copies of green,
• 2 copies of red,
• 2 copies of purple
• purple=0.5*green + red…

25
Redundant Columns Do Not Help
[Figure: projecting KDD onto span{ICML, SIGMOD} vs. span{ICML, SIGMOD, VLDB} gives the same result, since VLDB is a linear combination of SIGMOD and ICML.]
Observations:
#1: The redundant column does not help (the projection of KDD is unchanged)
#2: It wastes time & space

26
CUR (CX): drawbacks
• (C3) Dynamic: not easy

[Figure: the sampled columns C at time t vs. time t+1; how should the decomposition be updated?]

27
Roadmap
• Motivation
• Survey: Existing Methods
– SVD
– CUR/CX
– CMD
• Proposed Methods: Colibri
• Experimental Results
• Conclusion
28
CMD [Sun+ 2007]
CUR (CX) vs. CMD on the same original matrix:
• CUR (CX): left matrix C, middle matrix (C^T C)^+, right matrix C^T A; C may contain duplicate sampled columns (3 copies of green, 2 copies of red, 2 copies of purple; purple = 0.5·green + red)
• CMD: the duplicate columns are deleted from C!


29
Challenges
• Can we do even better than CMD by removing the other types of redundancy (e.g., linearly dependent columns)?
• Can we efficiently track the LRA for time-evolving graphs?

30
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
– Colibri-S for static graphs (Problem 1)
– Colibri-D for dynamic graphs (Problem 2)
• Experimental Results
• Conclusion

31
Colibri-S: Basic Idea
CUR (CX) vs. Colibri-S on the same original matrix:
• CUR (CX): the sampled columns are redundant (3 copies of green, 2 copies of red, 2 copies of purple; purple = 0.5·green + red)
• Colibri-S: left matrix L, middle matrix (L^T L)^{-1}, right matrix L^T A
We want the columns in L to be linearly independent!
32
Q: How to find L & M from C efficiently?

33
A: Find L & M incrementally!
Start from the initially sampled matrix C and the current (L, M). For each column v in C:
• project v onto the column space of L
• if v is redundant (its projection reproduces it): discard v
• otherwise: expand L & M
34
Step 1: How to test if KDD is redundant ?

Project the candidate column KDD onto the span of the current columns L = [ICML, SIGMOD]:

  projection of KDD = L × M_old × (L^T × KDD)
  residual = KDD − projection of KDD

If the residual is (numerically) zero, KDD is redundant.
35
Step 2: How to update core matrix ?

M_old = ( [ICML SIGMOD]^T [ICML SIGMOD] )^{-1}

After adding the new column KDD:

M_new = ( [ICML SIGMOD KDD]^T [ICML SIGMOD KDD] )^{-1} = ?
36
Q: How to update core matrix?
A: Incrementally.
Theorem 1 [Tong et al., KDD 2008]
Let y = M_old × L^T × KDD (the coefficients of the projection of KDD onto the current columns L) and let δ be the squared norm of the residual, δ = KDD^T KDD − (L^T KDD)^T y. Then

  M_new = [ M_old + y y^T / δ     −y / δ ]
          [ −y^T / δ               1 / δ ]

We only need to know the projection of KDD (equivalently y) and δ!
37
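A compact NumPy sketch of the Colibri-S selection loop described on the last few slides: it walks over the initially sampled columns C, keeps only those that are not redundant, and maintains the core matrix M = (L^T L)^{-1} incrementally with the block-inverse update in the spirit of Theorem 1. Variable names and the tolerance are mine; the sampling that produces C is assumed to have happened already.

```python
import numpy as np

def colibri_s(C, tol=1e-10):
    """Keep only linearly independent columns of C; maintain M = (L^T L)^{-1}."""
    n = C.shape[0]
    L = np.empty((n, 0))
    M = np.empty((0, 0))
    for v in C.T:                              # each sampled column in turn
        if L.shape[1] == 0:
            d = v @ v
            if d > tol:
                L, M = v[:, None].copy(), np.array([[1.0 / d]])
            continue
        y = M @ (L.T @ v)                      # projection coefficients on span(L)
        delta = v @ v - (L.T @ v) @ y          # squared norm of the residual
        if delta <= tol:                       # redundant column: discard
            continue
        # Expand L and update the core matrix via the block-inverse formula.
        M = np.block([[M + np.outer(y, y) / delta, -y[:, None] / delta],
                      [-y[None, :] / delta,        np.array([[1.0 / delta]])]])
        L = np.hstack([L, v[:, None]])
    return L, M

A = np.random.default_rng(0).random((20, 8))
C = A[:, [0, 1, 0, 2, 1]]                      # sampled columns, with duplicates
L, M = colibri_s(C)
print(L.shape[1])                              # 3 independent columns kept
```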
Colibri-S vs. CUR (CMD)
• (C0) Quality: Colibri-S = CUR (CMD)
• (C1) Time: O(c̃³ + c̃m̃) vs. O(c³ + cm̃), where c̃ ≤ c and m̃ ≤ m
  – Example: if c̃ = 200 and c = 1000, Colibri-S is 125x faster!
  – Colibri-S is better than or equal to CUR (CMD)
• (C1) Space: Colibri-S is better than or equal to CUR (CMD)
• (C2) Interpretation: Colibri-S = CUR (CMD)
38
A Pictorial Comparison
[Scatter plot: each dot is a conference, with x-coordinate = its count for Philip Yu and y-coordinate = its count for William Cohen (the rows of the earlier matrix B).]


39
A Pictorial Comparison: SVD
[Scatter plot as before (x: Philip Yu, y: William Cohen; each dot is a conference); SVD picks the 1st and 2nd singular vectors as the new axes.]


40
A Pictorial Comparison: CUR
[Drineas+ 2005]

[Scatter plot as before (x: Philip Yu, y: William Cohen; each dot is a conference); CUR samples actual conference points, with duplicates (the 1x–4x labels show how many times each point was drawn).]


41
A Pictorial Comparison: CMD
[Sun+ 2007]

[Scatter plot as before (x: Philip Yu, y: William Cohen; each dot is a conference); CMD keeps the sampled conference points with the duplicates removed.]


42
A Pictorial Comparison: Colibri-S
[Tong+ 2008]

[Scatter plot as before (x: Philip Yu, y: William Cohen; each dot is a conference); Colibri-S keeps only a linearly independent subset of the sampled conference points.]


43
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
– Colibri-S for static graphs (Problem 1)
– Colibri-D for dynamic graphs (Problem 2)
• Experimental Results optional

• Conclusion

44
Problem Definition
• Given (e.g., Author-Conference Graphs)

A_1, A_2, A_3, …

• Find Incrementally

(L_1, M_1, R_1), (L_2, M_2, R_2), (L_3, M_3, R_3), …
45
Colibri-D for dynamic graphs

At time t: A_t ≈ L_t × M_t × R_t (from the initially sampled matrix)
At time t+1: A_{t+1} ≈ L_{t+1} × M_{t+1} × R_{t+1} = ?
Q: How to update L and M efficiently?
46


Colibri-D: How-To
At time t: the sampled columns split into selected ones (forming L_t) and redundant ones; (L_t, M_t, R_t) come from the initially sampled matrix.
At time t+1: the selected/redundant split may change. How do we obtain (L_{t+1}, M_{t+1}, R_{t+1})?

47
Colibri-D: How-To
Among the columns selected at time t, identify those that are unchanged at t+1. The subspace spanned by these unchanged columns gives L̃ and M̃ directly from (L_t, M_t); only the changed columns need to be re-processed to obtain L_{t+1} and M_{t+1}.
48
How to Get Core Matrix
for Un-changed Col.s ?
M^t = [ (L^t)^T L^t ]^{-1}     (known from time t)

M̃^t = [ (L̃^t)^T L̃^t ]^{-1} = ?     (L̃^t: the unchanged columns of L^t)
49
How to Get Core Matrix
for Un-changed Col.s ?
Let s be the # of changed columns in L^t, and partition M^t so that block 1 corresponds to the unchanged columns and block 2 to the changed columns.

Theorem 2 [Tong et al., KDD 2008]:

  M̃^t = M^t_{1,1} − M^t_{1,2} (M^t_{2,2})^{-1} M^t_{2,1}

We only need an s × s matrix inverse!


50
How to Get Core Matrix
for Un-changed Col.s
Let t be the # of unchanged columns in L^t and s the # of changed columns in L^t.

We only need a matrix inverse of size
- s × s, instead of t × t
- if s << t (a.k.a. “smooth” changes), we are faster
- example:
  + if s = 10 and t = 100, we are 1000x faster!

51
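A small NumPy sketch of this core-matrix update for the unchanged columns, using the block form of Theorem 2 (only an s × s inverse is needed); the final assert compares it against direct recomputation.

```python
import numpy as np

def core_for_unchanged(M, changed):
    """Given M = (L^T L)^{-1} and the indices of the changed columns of L,
    return (L_unchanged^T L_unchanged)^{-1} via an s x s inverse only."""
    keep = np.setdiff1d(np.arange(M.shape[0]), changed)
    M11 = M[np.ix_(keep, keep)]
    M12 = M[np.ix_(keep, changed)]
    M21 = M[np.ix_(changed, keep)]
    M22 = M[np.ix_(changed, changed)]
    return M11 - M12 @ np.linalg.inv(M22) @ M21

# Sanity check against recomputing the inverse from scratch.
rng = np.random.default_rng(0)
L = rng.random((30, 6))
M = np.linalg.inv(L.T @ L)
changed = [1, 4]
keep = [0, 2, 3, 5]
direct = np.linalg.inv(L[:, keep].T @ L[:, keep])
assert np.allclose(core_for_unchanged(M, changed), direct)
```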
Comparison of SVD [Golub+ 1989], CUR/CMD [Drineas+ 2005, Sun+ 2007], and Colibri [Tong+ 2008] along the wish list:
• (C0) Quality
• (C1) Efficiency
• (C2) Interpretation
• (C3) Dynamics

52
Roadmap
• Motivation
• Survey: Existing Methods
• Proposed Methods: Colibri
• Experimental Results
• Conclusion

53
Experimental Setup
• Data set
• Network traffic
• 21,837 sources/destinations
• 1,222 consecutive hours (~ 2 months)
• 22,800 edges per hour
• Accuracy
• Space cost

54
Performance of Colibri-S
[Bar charts: running time and space of SVD, CUR, CMD, and ours]
• Accuracy: same, 91%+
• Time: 12x faster than CMD, 28x faster than CUR
• Space: ~1/3 of CMD, ~10% of CUR
55
Performance of Colibri-D
[Plot: running time vs. # of changed columns, for CMD (the prior best method), Colibri-S, and Colibri-D]
• Network traffic: 21,837 nodes, 1,220 hours, 22,800 edges/hr
• Accuracy: same, 93%+
Colibri-D achieves up to 112x speedups
56
Conclusion: Colibri
• Colibri-S (for static graphs)
– Idea: remove redundancy
– Up to 52x speedup; 2/3 space saving
– No quality loss (w.r.t., CUR/CMD)
• Colibri-D (for dynamic graphs)
– Idea: leverage “smoothness”
– Up to 112x speedup over CMD

57
optional

• More on Matrix Low Rank Approximations

58
Graph Mining by Low-Rank Approximation

Q: How to get the low-rank matrix approximations?


59
optional

More on LRA
• Q0: SVD + example-based LRA
• Q1: Nonnegative Matrix Factorization
• Q2: Non-negative Residual Matrix Factorization
• Q3: Nuclear norm related technologies

60
Low Rank Approximation
• Nonnegative Matrix Factorization (NMF)

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (21 October 1999)
61
Nonnegative Matrix Factorization (NMF)

• Factorizing a nonnegative matrix into the product of two low-rank nonnegative matrices, X ≈ F G^T: F (the entire matrix) holds the r basis vectors, and each data point is represented by one row of G (its r coefficients).
62
NMF Solutions: Multiplicative Updates
• Multiplicative update method

Daniel D. Lee and H. Sebastian Seung (2001). Algorithms for Non-negative Matrix Factorization. NIPS 2001.
H. Zhou, K. Lange, and M. Suchard (2010). Graphical processing units and high-dimensional optimization. Statistical Science, 25:311-324.
63
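A minimal NumPy sketch of the Lee–Seung multiplicative updates for X ≈ F G^T under the Frobenius loss; the random initialization, iteration count, and small epsilon are arbitrary choices, not part of the published algorithm.

```python
import numpy as np

def nmf_mu(X, r, n_iter=200, eps=1e-9, rng=np.random.default_rng(0)):
    """Multiplicative updates: both factors stay nonnegative by construction."""
    m, n = X.shape
    F = rng.random((m, r))
    G = rng.random((n, r))
    for _ in range(n_iter):
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
        F *= (X @ G) / (F @ (G.T @ G) + eps)
    return F, G

X = np.random.default_rng(1).random((50, 30))       # nonnegative data
F, G = nmf_mu(X, r=5)
print(np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))
```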
NMF Solutions: Alternating Nonnegative
Least Squares
• Initialize F and G with nonnegative values
• Iterate the following procedure:
  – Fixing F, solve for G: min_{G ≥ 0} ||X − F G^T||_F
  – Fixing G, solve for F: min_{F ≥ 0} ||X − F G^T||_F

(1) Projected Gradient: http://www.csie.ntu.edu.tw/~cjlin/nmf/


(2) Newton-type methods:
http://www.cs.utexas.edu/users/dmkim/Source/software/nnma/index.html
(3) Block Principal Pivoting: https://sites.google.com/site/jingukim/nmf_bpas.zip?attredirects=0

P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error
estimates of data values. Environmetrics, 5(1):111–126, 1994
C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation,19(2007), 2756-2779.
D. Kim, S. Sra, I. S. Dhillon. Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation Problem. SDM 2007.
J. Kim and H. Park. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. ICDM 2008.
64
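A simple, unoptimized sketch of the alternating scheme above, solving each half-step exactly with SciPy's nnls solver one column at a time; the packages listed above implement far faster variants (projected gradient, Newton-type, block principal pivoting).

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(X, r, n_iter=20, rng=np.random.default_rng(0)):
    """Alternating nonnegative least squares for X ~ F G^T."""
    m, n = X.shape
    F = rng.random((m, r))
    G = rng.random((n, r))
    for _ in range(n_iter):
        # Fix F, solve min_{G >= 0} ||X - F G^T||_F, one column of X at a time.
        G = np.array([nnls(F, X[:, j])[0] for j in range(n)])
        # Fix G, solve min_{F >= 0} ||X^T - G F^T||_F, one row of X at a time.
        F = np.array([nnls(G, X[i, :])[0] for i in range(m)])
    return F, G

X = np.random.default_rng(1).random((40, 25))
F, G = nmf_anls(X, r=4)
print(np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))
```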
Application of NMF: Privacy-Aware On-line User
Role Tracking [AAAI11]
• Problem Definitions
– Given: the user-activity log that changes over time
– Monitor: (1) the user role/cluster; and (2) the role/cluster description.
• Design Objective
– (1) Privacy-aware; and (2) Efficiency (in both time and space).

65
Key Ideas
• Minimize an upper bound of the original/exact objective function:

  || X + ΔX − (F + ΔF)(G + ΔG)^T ||_F
    = || X + ΔX − F G^T − ΔF G^T − F ΔG^T − ΔF ΔG^T ||_F
    ≤ || X − F G^T ||_F                                    (depends on X, but fixed)
      + || ΔX − ΔF G^T − F ΔG^T − ΔF ΔG^T ||_F             (independent of X)

  so it suffices to solve

  min || ΔX − ΔF G^T − F ΔG^T − ΔF ΔG^T ||_F
  subject to: ΔF + F ≥ 0, ΔG + G ≥ 0

  which is independent of X and can be solved by the projected gradient descent method.


Fei Wang, Hanghang Tong, Ching-Yung Lin: Towards Evolutionary Nonnegative Matrix Factorization. AAAI 2011
66
Experimental Results

[Plots: running time vs. time stamp on several data sets. Red: our method; blue: the off-line method.]

Fei Wang, Hanghang Tong, Ching-Yung Lin: Towards Evolutionary Nonnegative Matrix Factorization. AAAI 2011
67
NMF: Extensions
• General loss
– Bregman Divergence
• Different constraints
– Semi-NMF, Convex NMF, Symmetric NMF
• Incorporating supervisions
– Pairwise constraints, label
• Multiple factorized matrices
– Tri-factorization
I. S. Dhillon and S. Sra. Generalized Nonnegative Matrix Approximations with Bregman Divergences. NIPS 2005.
Chris H. Q. Ding, Tao Li, Michael I. Jordan: Convex and Semi-Nonnegative Matrix Factorizations. IEEE Trans. Pattern Anal.
Mach. Intell. 32(1): 45-55 (2010)
Chris H. Q. Ding, Tao Li, Wei Peng, Haesun Park: Orthogonal nonnegative matrix t-factorizations for clustering. KDD 2006.
Fei Wang, Tao Li, Changshui Zhang: Semi-Supervised Clustering via Matrix Factorization. SDM 2008: 1-12
Yuheng Hu, Fei Wang, Subbarao Kambhampati. Listen to the Crowd: Automated Analysis of Live Events via Aggregated Twitter Sentiment. IJCAI 2013.
68
Graph Mining by Low-Rank Approximation

Q: How to get the low-rank matrix approximations?


69
A2: Non-negative Residual MF
• Observations: anomalies → actual activities
• Examples: popularity contest, port scanner, etc
• NrMF formulation

  – a weighted Frobenius-norm objective (the weighted Frobenius form and the weight are common to any MF)
  – the non-negative residual constraint is unique to NrMF

H. Tong, C. Lin: Non-Negative Residual Matrix Factorization with Application to Graph Anomaly Detection. SDM 2011
70
Visual Comparisons
[Figures: side-by-side comparison of the original graph, NrMF, and SVD on two examples.]

71
Low Rank Approximation
• Nonnegative Matrix Factorization
• Non-negative Residual Matrix Factorization
• Nuclear norm related technologies

72


Rank Minimization and Nuclear Norm
• Matrix completion with rank minimization: NP-hard

• Convex relaxation: replace the rank with the nuclear norm (the sum of the singular values)

M. Fazel, H. Hindi, S. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. Proceedings of the American Control Conference, 6:4734-4739, June 2001.
73
Nuclear Norm Minimization
• Singular Value Thresholding
– http://svt.stanford.edu/
• Accelerated gradient
– http://www.public.asu.edu/~jye02/Software/SLEP
/index.htm
• Interior point methods
  – http://abel.ee.ucla.edu/cvxopt/applications/nucnrm/
J-F. Cai, E.J. Candès and Z. Shen. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on
Optimization. Volume 20 Issue 4, January 2010 Pages 1956-1982.
Shuiwang Ji and Jieping Ye. An Accelerated Gradient Method for Trace Norm Minimization. The Twenty-Sixth International
Conference on Machine Learning (ICML 2009)
Z. Liu, Lieven Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications (2009)
74
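A minimal sketch of the singular value thresholding operator (the proximal operator of the nuclear norm) inside a plain proximal-gradient completion loop; this is a simplified stand-in for the SVT / accelerated-gradient / interior-point solvers listed above, and tau and the iteration count are arbitrary.

```python
import numpy as np

def svt(X, tau):
    """Shrink the singular values of X by tau (prox of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(M, mask, tau=2.0, n_iter=300):
    """Proximal gradient for 0.5*||mask*(X - M)||_F^2 + tau*||X||_* (step 1)."""
    X = np.zeros_like(M)
    for _ in range(n_iter):
        X = svt(X - mask * (X - M), tau)
    return X

# Tiny demo: recover a rank-2 matrix from ~60% of its entries.
rng = np.random.default_rng(0)
M = rng.random((30, 2)) @ rng.random((2, 30))
mask = rng.random(M.shape) < 0.6
X_hat = complete(M, mask)
print(np.linalg.norm((X_hat - M)[~mask]) / np.linalg.norm(M[~mask]))
```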
◼ From LRA to Co-clustering
Co-clustering
• Let X and Y be discrete random variables
– X and Y take values in {1, 2, …, m} and {1, 2, …, n}
– p(X, Y) denotes the joint probability distribution—if
not known, it is often estimated based on co-occurrence
data
– Application areas: text mining, market-basket analysis,
analysis of browsing behavior, etc.
• Key Obstacles in Clustering Contingency Tables
– High Dimensionality, Sparsity, Noise
– Need for robust and scalable algorithms

Reference:
1. Dhillon et al. Information-Theoretic Co-clustering, KDD’03
76
Example (e.g., terms × documents), from Dhillon et al.:

p(X, Y) =
  .05 .05 .05  0   0   0
  .05 .05 .05  0   0   0
   0   0   0  .05 .05 .05
   0   0   0  .05 .05 .05
  .04 .04  0  .04 .04 .04
  .04 .04 .04  0  .04 .04

The co-clustered approximation is p̂(X, Y) = p(X | X̂) p(X̂, Ŷ) p(Y | Ŷ):

  .5  0  0
  .5  0  0
   0 .5  0        .3  0        .36 .36 .28  0   0   0
   0 .5  0    ×    0 .3    ×    0   0   0  .28 .36 .36
   0  0 .5        .2 .2
   0  0 .5

=
  .054 .054 .042  0    0    0
  .054 .054 .042  0    0    0
   0    0    0   .042 .054 .054
   0    0    0   .042 .054 .054
  .036 .036 .028 .028 .036 .036
  .036 .036 .028 .028 .036 .036
77
The same example, annotated:
• rows 1–2: medical terms; rows 3–4: CS terms; rows 5–6: common terms
• columns 1–3: medical documents; columns 4–6: CS documents
• the three factors are the term × term-group matrix, the term-group × doc-group matrix, and the doc × doc-group matrix, and their product gives the approximation on the right.
78
Co-clustering
Observations
• uses KL divergence, instead of L2 or LF
• the middle matrix is not diagonal
– we’ll see that again in the Tucker tensor
decomposition

79
Matrix & Tensor Tools
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC

80
Tensor Basics
Reminder: SVD

A ≈ U Σ V^T     (A: m × n)

– Best rank-k approximation in L2 or LF

82
Reminder: SVD

A ≈ σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + …

– Best rank-k approximation in L2
(See also PARAFAC)
83


Goal: extension to >=3 modes

X (I × J × K) ≈ A (I × R), B (J × R), C (K × R) combined through an R × R × R core = a sum of rank-1 terms

84
Main points:
• 2 major types of tensor decompositions:
PARAFAC and Tucker
• both can be solved with ``alternating least
squares’’ (ALS)
• Details follow – we start with terminology:

85
[T. Kolda,’07]
A tensor is a multidimensional array
An I × J × K (3rd-order) tensor has entries x_{ijk}; mode 1 has dimension I, mode 2 has dimension J, mode 3 has dimension K.
• Fibers: column (mode-1), row (mode-2), tube (mode-3)
• Slices: horizontal, lateral, frontal
Note: the focus is on 3rd-order tensors, but everything can be extended to higher orders.

86
details [T. Kolda,’07]
Matricization: Converting a Tensor to a Matrix
X_(n): the mode-n fibers are rearranged to be the columns of a matrix (matricize/unfold maps (i, j, k) → (i′, j′); the reverse operation folds the matrix back into a tensor).
Example: a 2 × 2 × 2 tensor with frontal slices
  1 3        5 7
  2 4        6 8

87
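A one-function NumPy version of mode-n matricization in the Kolda/Bader ordering, applied to the 2 × 2 × 2 example above.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: mode-n fibers become columns (Fortran-order reshape)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order='F')

# The 2 x 2 x 2 example above, entered frontal slice by frontal slice.
X = np.zeros((2, 2, 2))
X[:, :, 0] = [[1, 3], [2, 4]]
X[:, :, 1] = [[5, 7], [6, 8]]
print(unfold(X, 0))   # [[1. 3. 5. 7.]
                      #  [2. 4. 6. 8.]]
```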
details

Tensor Mode-n Multiplication
• Tensor times matrix: multiply each fiber (e.g., each mode-1 / column fiber) by B
• Tensor times vector: compute the dot product of a with each fiber (e.g., each mode-2 / row fiber)
[T. Kolda,’07]
88
details

Mode-n product Example
• Tensor times a matrix
[Figure: a tensor with location and time modes, multiplied along the location mode by a clusters × location matrix, replaces the location mode by clusters.]
[T. Kolda,’07]
89
details

Mode-n product Example
• Tensor times a vector
[Figure: the same tensor multiplied along the location mode by a vector removes that mode, leaving a result indexed by time.]
[T. Kolda,’07]
90
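A NumPy sketch of the mode-n product (tensor times matrix), plus the tensor-times-vector contraction, on arrays with made-up dimensions.

```python
import numpy as np

def mode_n_product(X, M, mode):
    """Multiply every mode-n fiber of X by the matrix M, so X.shape[mode]
    becomes M.shape[0]."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

X = np.random.default_rng(0).random((4, 5, 6))
M = np.random.default_rng(1).random((3, 5))
print(mode_n_product(X, M, mode=1).shape)        # (4, 3, 6)

# Tensor times a vector: the contraction removes the mode entirely.
a = np.random.default_rng(2).random(5)
print(np.tensordot(X, a, axes=(1, 0)).shape)     # (4, 6)
```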
details
Outer, Kronecker, & Khatri-Rao Products
• 3-way outer product: three vectors form a rank-1 tensor
• Matrix Kronecker product: an M × N matrix and a P × Q matrix give an MP × NQ matrix
• Matrix Khatri-Rao product: column-wise Kronecker product; an M × R matrix and an N × R matrix give an MN × R matrix
[T. Kolda,’07]
91
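A small NumPy sketch of the Khatri-Rao (column-wise Kronecker) product, checked against per-column Kronecker products.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (M x R) and (N x R) -> (MN x R)."""
    assert A.shape[1] == B.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)
KR = khatri_rao(A, B)
print(KR.shape)                                            # (12, 2)
ref = np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(2)])
assert np.allclose(KR, ref)
```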
Specially Structured Tensors
• Tucker tensor: an I × J × K tensor written as an R × S × T “core” multiplied in each mode by a factor matrix (U: I × R, V: J × S, …)
• Kruskal tensor: an I × J × K tensor written as a weighted sum of R rank-1 terms w_r u_r ∘ v_r ∘ …, i.e., a Tucker tensor with a superdiagonal R × R × R core
[T. Kolda,’07]
93
details

Specially Structured Tensors
• Tucker tensor, in matrix form: X_(1) = A G_(1) (C ⊗ B)^T
• Kruskal tensor, in matrix form: X_(1) = A diag(λ) (C ⊙ B)^T
[T. Kolda,’07]
94
Outline: Part 2
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC

95
Tensor Decompositions
Tucker Decomposition - intuition

X (I × J × K) ≈ G (R × S × T) ×_1 A (I × R) ×_2 B (J × S) ×_3 C (K × T)

• author x keyword x conference


• A: author x author-group
• B: keyword x keyword-group
• C: conf. x conf-group
• G: how groups relate to each other
97
Reminder
[The co-clustering example from the earlier slides: the terms × documents distribution factors into (term × term-group) × (term-group × doc-group) × (doc × doc-group)^T.]
98
Tucker Decomposition

X (I × J × K) ≈ G (R × S × T) ×_1 A (I × R) ×_2 B (J × S) ×_3 C (K × T)
Given A, B, C, the optimal core is G = X ×_1 A^T ×_2 B^T ×_3 C^T (recall the equations for converting a tensor to a matrix).
• Proposed by Tucker (1966)
• AKA: three-mode factor analysis, three-mode PCA, orthogonal array decomposition
• A, B, and C are generally assumed to be orthonormal (or at least to have full column rank)
• The core G is not diagonal
• Not unique

99
details

Tucker Variations
See Kroonenberg & De Leeuw, Psychometrika, 1980 for discussion.
• Tucker2: the mode-3 factor is the identity matrix:
    X (I × J × K) ≈ G (R × S × K) ×_1 A (I × R) ×_2 B (J × S)
• Tucker1: only one mode is compressed:
    X (I × J × K) ≈ G (R × J × K) ×_1 A (I × R)
  Finding principal components in only mode 1 can be solved via a rank-R matrix SVD.

100
details
Solving for Tucker
• Given A, B, C orthonormal, the optimal core is G = X ×_1 A^T ×_2 B^T ×_3 C^T. (The tensor norm is the square root of the sum of all the elements squared.)
• Eliminating the core: minimizing ||X − G ×_1 A ×_2 B ×_3 C|| subject to A, B, C orthonormal is equivalent to maximizing ||X ×_1 A^T ×_2 B^T ×_3 C^T|| (the fixed part of the objective drops out).
• If B and C are fixed, we can solve for A: the optimal A is the R leading left singular vectors of X_(1)(C ⊗ B).


101
details

Higher Order SVD (HO-SVD)
Compute each factor directly from the data: the mode-n factor is given by the leading left singular vectors of the mode-n unfolding X_(n) (observe the connection to Tucker1). Not optimal, but often used to initialize the Tucker-ALS algorithm.
De Lathauwer, De Moor, & Vandewalle, SIMAX, 2000


102
Tucker-Alternating Least Squares (ALS)
Successively solve for each component (A, B, C).
• Initialize
  – Choose R, S, T
  – Calculate A, B, C via HO-SVD
• Until converged do…
  – A = R leading left singular vectors of X_(1)(C ⊗ B)
  – B = S leading left singular vectors of X_(2)(C ⊗ A)
  – C = T leading left singular vectors of X_(3)(B ⊗ A)
• Solve for the core: G = X ×_1 A^T ×_2 B^T ×_3 C^T

Kroonenberg & De Leeuw, Psychometrika, 1980


103
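A compact NumPy sketch of Tucker-ALS as outlined above: HO-SVD initialization, then each factor is refit as the leading left singular vectors of the unfolded tensor times the Kronecker product of the other two factors; a fixed iteration count stands in for a proper convergence test.

```python
import numpy as np

def unfold(X, mode):
    # Mode-n matricization in the Kolda/Bader ordering.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order='F')

def tucker_als(X, ranks, n_iter=25):
    # HO-SVD initialization: leading left singular vectors of each unfolding.
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
               for n, r in enumerate(ranks)]
    for _ in range(n_iter):
        for n in range(3):
            others = [factors[m] for m in range(3) if m != n]
            # e.g. A = R leading left singular vectors of X_(1) (C kron B).
            W = unfold(X, n) @ np.kron(others[1], others[0])
            factors[n] = np.linalg.svd(W, full_matrices=False)[0][:, :ranks[n]]
    # Core: G = X x1 A^T x2 B^T x3 C^T.
    core = X
    for n, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, n)), 0, n)
    return core, factors

X = np.random.default_rng(0).random((8, 9, 10))
core, (A, B, C) = tucker_als(X, ranks=(3, 4, 5))
print(core.shape, A.shape, B.shape, C.shape)   # (3, 4, 5) (8, 3) (9, 4) (10, 5)
```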
details
Tucker is Not Unique
X (I × J × K) ≈ G (R × S × T) ×_1 A (I × R) ×_2 B (J × S) ×_3 C (K × T)
The Tucker decomposition is not unique. Let Y be an R × R orthonormal matrix; then replacing A by A·Y and G by G ×_1 Y^T gives exactly the same fit.
[T. Kolda,’07]
104
Outline: Part 2
• Matrix Tools
• Tensor Tools
– Tensor Basics
– Tucker
• Tucker 1
• Tucker 2
• Tucker 3
– PARAFAC

105
CANDECOMP/PARAFAC
Decomposition

X (I × J × K) ≈ Σ_r λ_r a_r ∘ b_r ∘ c_r, with factor matrices A (I × R), B (J × R), C (K × R) and a diagonal R × R × R core

• CANDECOMP = Canonical Decomposition (Carroll & Chang, 1970)
• PARAFAC = Parallel Factors (Harshman, 1970)
• Core is diagonal (specified by the vector λ)
• Columns of A, B, and C are not orthonormal
• If R is minimal, then R is called the rank of the tensor (Kruskal 1977)
• Can have rank(X) > min{I, J, K}
106
details

PARAFAC – Alternating Least Squares (ALS)
Successively solve for each component (A, B, C), i.e., find all the vectors in one mode at a time:
  X (I × J × K) ≈ Σ_r λ_r a_r ∘ b_r ∘ c_r
If C, B, and λ are fixed, the optimal A is given by
  A = X_(1) (C ⊙ B) (C^T C ∗ B^T B)^+
where ⊙ is the Khatri-Rao product (column-wise Kronecker product) and ∗ is the Hadamard (element-wise) product. Repeat for B, C, etc.
[T. Kolda,’07]
107
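A compact NumPy sketch of CP/PARAFAC-ALS using the factor update above; the column norms are pulled out into λ only once at the end, and a fixed iteration count again stands in for a convergence test.

```python
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order='F')

def khatri_rao(A, B):
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(X, R, n_iter=100, rng=np.random.default_rng(0)):
    factors = [rng.random((dim, R)) for dim in X.shape]
    for _ in range(n_iter):
        for n in range(3):
            others = [factors[m] for m in range(3) if m != n]
            kr = khatri_rao(others[1], others[0])        # e.g. C khatri-rao B for mode 1
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[n] = unfold(X, n) @ kr @ np.linalg.pinv(gram)
    # Normalize the factor columns and collect the scales in lambda.
    lam = np.ones(R)
    for n in range(3):
        norms = np.linalg.norm(factors[n], axis=0)
        factors[n] = factors[n] / norms
        lam *= norms
    return lam, factors

X = np.random.default_rng(1).random((6, 7, 8))
lam, (A, B, C) = cp_als(X, R=3)
print(lam.shape, A.shape, B.shape, C.shape)    # (3,) (6, 3) (7, 3) (8, 3)
```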
details

PARAFAC is often unique
Assume the PARAFAC decomposition X = Σ_r a_r ∘ b_r ∘ c_r is exact.
Sufficient condition for uniqueness (Kruskal, 1977):
  k_A + k_B + k_C ≥ 2R + 2
where k_A = k-rank of A = the maximum number k such that every set of k columns of A is linearly independent.
108
Tucker vs. PARAFAC Decompositions
• Tucker
  – Variable transformation in each mode
  – Core G may be dense
  – A, B, C generally orthonormal
  – Not unique
• PARAFAC
  – Sum of rank-1 components
  – No core, i.e., superdiagonal core
  – A, B, C may have linearly dependent columns
  – Generally unique

[Figures: Tucker X ≈ G ×_1 A ×_2 B ×_3 C vs. PARAFAC X ≈ Σ_r a_r ∘ b_r ∘ c_r]

109
Tensor tools - summary
• Two main tools
– PARAFAC
– Tucker
• Both find row-, column-, tube-groups
– but in PARAFAC the three groups are identical
• To solve: Alternating Least Squares

110
Tensor tools - resources
• Toolbox from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
• T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review 2008
• csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf

111
Key Papers
Core Papers
• Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos: Colibri: fast mining
of large static and dynamic graphs. KDD 2008: 686-694
• Dhillon et al. Information-Theoretic Co-clustering, KDD’03
• T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review 2008

Further Reading
• Chih-Jen Lin: Projected Gradient Methods for Non-negative Matrix Factorization.
https://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf
• Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization."
Foundations of Computational mathematics 9, no. 6 (2009): 717.
• Rendle, S. (2010, December). Factorization machines. In 2010 IEEE International Conference on Data
Mining (pp. 995-1000). IEEE.
• Tamara G. Kolda, Brett W. Bader, Joseph P. Kenny: Higher-Order Web Link Analysis Using Multilinear
Algebra. ICDM 2005: 242-249
• U Kang, Evangelos E. Papalexakis, Abhay Harpale, Christos Faloutsos: GigaTensor: scaling tensor analysis
up by 100 times - algorithms and discoveries. KDD 2012: 316-324
• Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha, Christos Faloutsos: Fully automatic
cross-associations. KDD 2004: 79-88
• Trigeorgis, G., Bousmalis, K., Zafeiriou, S., & Schuller, B. (2014, January). A deep semi-nmf model for
learning hidden representations. In International Conference on Machine Learning (pp. 1692-1700).
• Risi Kondor, Nedelina Teneva, and Vikas Garg. 2014. Multiresolution matrix factorization. In International
Conference on Machine Learning. 1620–1628

112
