Part 1 Low-Rank Tensor Decompositions
Andrzej Cichocki
Namgil Lee
Ivan Oseledets
Anh-Huy Phan
Qibin Zhao
Danilo P. Mandic
Boston — Delft
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers.

Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The ‘services’ for users can be found on the internet at: www.copyright.com

For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1 781 871 0245; www.nowpublishers.com; [email protected]

now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: [email protected]
Editor-in-Chief
Michael Jordan
University of California, Berkeley
United States
Editors
Editorial Scope
Topics
• Nonparametric methods
• Bayesian learning
• Online learning
• Classification and prediction
• Optimization
• Clustering
• Reinforcement learning
• Data mining
• Relational learning
• Dimensionality reduction
• Robustness
• Evaluation
• Spectral methods
• Game theoretic learning
• Statistical learning theory
• Graphical models
• Variational inference
• Independent component analysis
• Visualization
Andrzej Cichocki
RIKEN Brain Science Institute (BSI), Japan and
Skolkovo Institute of Science and Technology (SKOLTECH)
[email protected]
Namgil Lee
RIKEN BSI, [email protected]
Ivan Oseledets
Skolkovo Institute of Science and Technology (SKOLTECH) and
Institute of Numerical Mathematics of Russian Academy of Sciences
[email protected]
Anh-Huy Phan
RIKEN BSI, [email protected]
Qibin Zhao
RIKEN BSI, [email protected]
Danilo P. Mandic
Department of Electrical and Electronic Engineering
Imperial College London
[email protected]
Contents
Acknowledgements
References
Abstract

1 Introduction and Motivation
[Figure 1.1(a) residue: the 4V axes span Volume (megabytes to petabytes), Velocity (batch, micro-batch, near real-time, streams), Veracity (noise, outliers, missing data, inconsistency, probabilistic data) and Variety (binary data, time series, images, 3D and multiview data). Figure 1.1(b) residue: tensor decomposition models (CPD/NTF, Tucker/NTD, Hierarchical Tucker, Tensor Train/MPS/MPO, PEPS, MERA) address the challenges of storage management and scale, robustness to noise, outliers and missing values, high-speed distributed and parallel computing, integration of a variety of data, and optimization with sparseness, independence, correlation, smoothness and non-negativity constraints, enabling applications such as feature extraction, classification, clustering, anomaly detection, matrix/tensor completion, inpainting, imputation, regression, prediction and forecasting.]
Figure 1.1: A framework for extremely large-scale data analysis. (a) The 4V
challenges for big data. (b) A unified framework for the 4V challenges and the
potential applications based on tensor decomposition approaches.
[Figure 1.2 residue: the three modes, mode-1 ($i = 1, 2, \ldots, I$), mode-2 ($j = 1, 2, \ldots, J$) and mode-3 ($k = 1, 2, \ldots, K$), an entry $x_{6,5,1}$, a slice $\mathbf{X}(:, :, k)$ and a fiber $\mathbf{X}(:, 3, 1)$.]
Figure 1.2: A 3rd-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$, with entries $x_{i,j,k} = \mathbf{X}(i, j, k)$, and its subtensors: slices (middle) and fibers (bottom). All fibers are treated as column vectors.
Figure 1.3: A block matrix and its representation as a 4th-order tensor, created
by reshaping (or a projection) of blocks in the rows into lateral slices of 3rd-order
tensors.
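As a concrete illustration of this reshaping, the following minimal numpy sketch (the block sizes, the ordering of the resulting modes and all variable names are assumptions made only for illustration) rearranges a block matrix with $M \times K$ blocks $\mathbf{G}_{mk} \in \mathbb{R}^{I \times J}$ into a 4th-order tensor whose last two modes index the blocks.

\begin{verbatim}
import numpy as np

# Assumed illustrative sizes: M x K blocks, each of size I x J.
M, K, I, J = 3, 4, 2, 5

# A block matrix X of size (M*I) x (K*J), whose (m, k) block is G_mk.
X = np.random.randn(M * I, K * J)

# Reshape so that X[m*I + i, k*J + j] maps to T[i, j, m, k]:
# the block index pair (m, k) becomes the last two modes, and each
# block G_mk appears as the slice T[:, :, m, k].
T = X.reshape(M, I, K, J).transpose(1, 3, 0, 2)

# Check: the (m, k) block of X equals the corresponding slice of T.
m, k = 1, 2
G_mk = X[m * I:(m + 1) * I, k * J:(k + 1) * J]
assert np.allclose(T[:, :, m, k], G_mk)
print(T.shape)  # (I, J, M, K) = (2, 5, 3, 4)
\end{verbatim}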
[Figure residue: a sample set of data arrays, followed by the basic graphical (tensor network) notation: (a) a scalar $a$, a vector $\mathbf{a} \in \mathbb{R}^I$ and a matrix $\mathbf{A} \in \mathbb{R}^{I \times J}$; (b) elementary operations such as the matrix-vector product $\mathbf{b} = \mathbf{A}\mathbf{x}$, the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B}$ and a contraction of two higher-order tensors.]
mode. For example, the tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is of order $N$ and of size $I_n$ in all modes-$n$ $(n = 1, 2, \ldots, N)$. Lower-case letters, e.g., $i, j$, are used for the subscripts of running indices, and capital letters $I, J$ denote the upper bound of an index, i.e., $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$. For a positive integer $n$, the shorthand notation $\langle n \rangle$ denotes the set of indices $\{1, 2, \ldots, n\}$.
$\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N}$: mode-$n$ matricization of $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$

$\mathbf{X}(:, i_2, i_3, \ldots, i_N) \in \mathbb{R}^{I_1}$: mode-1 fiber of a tensor $\mathbf{X}$, obtained by fixing all indices but one (a vector)

$\mathbf{X}(:, :, i_3, \ldots, i_N) \in \mathbb{R}^{I_1 \times I_2}$: slice (matrix) of a tensor $\mathbf{X}$, obtained by fixing all indices but two

$\mathbf{X}(:, :, :, i_4, \ldots, i_N)$: subtensor of $\mathbf{X}$, obtained by fixing several indices

$(R, R_1, \ldots, R_N)$: tensor rank $R$ and multilinear rank
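The fibers, slices and mode-$n$ matricization listed above can be reproduced in a few lines of numpy. This is only a sketch; in particular, the column ordering of the matricization below is one possible convention and not necessarily the one adopted later in the monograph.

\begin{verbatim}
import numpy as np

I1, I2, I3 = 4, 5, 6
X = np.random.randn(I1, I2, I3)      # a 3rd-order tensor X in R^{I1 x I2 x I3}

fiber = X[:, 1, 2]                   # mode-1 fiber X(:, i2, i3), a vector in R^{I1}
slice_ = X[:, :, 2]                  # slice X(:, :, i3), a matrix in R^{I1 x I2}

def unfold(T, n):
    """Mode-n matricization: mode n becomes the rows, and the remaining
    modes are flattened into the columns (one possible ordering,
    assumed here for illustration)."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

X2 = unfold(X, 1)                    # X_(2) in R^{I2 x (I1*I3)}
print(fiber.shape, slice_.shape, X2.shape)  # (4,) (4, 5) (5, 24)
\end{verbatim}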
Table 1.2: Terminology used for tensor networks across the machine learning/scientific computing and quantum physics/chemistry communities.
[Figure 1.6 residue: panels depicting a 4th-order tensor, 5th-order tensors and a 6th-order tensor, each built from blocks of lower-order tensors.]
Figure 1.6: Graphical representations and symbols for higher-order block tensors.
Each block represents either a 3rd-order tensor or a 2nd-order tensor. The outer
circle indicates a global structure of the block tensor (e.g. a vector, a matrix, a
3rd-order block tensor), while the inner circle reflects the structure of each element
within the block tensor. For example, in the top diagram a vector of 3rd order
tensors is represented by an outer circle with one edge (a vector) which surrounds
an inner circle with three edges (a 3rd order tensor), so that the whole structure
designates a 4th-order tensor.
[Figure residue: panels (a) and (c) illustrating the reshaping of a matrix of size $R_1 I_1 \times R_2 I_2$ into a block matrix with blocks of size $I_1 \times I_2$.]
Tensor Toolbox (Bader and Kolda, 2015) and in the sparse grid approach (Garcke et al., 2001; Bungartz and Griebel, 2004; Hackbusch, 2012).

As already mentioned, the problem of huge dimensionality can be alleviated through various distributed and compressed tensor network formats, achieved by low-rank tensor network approximations. The underpinning idea is that by employing tensor network formats, both computational costs and storage requirements may be dramatically reduced through distributed storage and computing resources. It is important to note that, except for very special data structures, a tensor cannot be compressed without incurring some compression error, since a low-rank tensor representation is only an approximation of the original tensor.
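As a minimal numerical illustration of this trade-off between compression and accuracy, the sketch below (with arbitrarily chosen sizes and rank) compresses a nearly low-rank matrix (a 2nd-order tensor) by a truncated SVD and reports the storage saving together with the small but nonzero approximation error.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
I, J, R = 500, 400, 10

# A matrix that is approximately low-rank: a rank-R part plus small noise.
A = rng.standard_normal((I, R)) @ rng.standard_normal((R, J))
A += 0.01 * rng.standard_normal((I, J))

# Truncated SVD: keep only the R leading singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :R] * s[:R]) @ Vt[:R, :]

storage_full = I * J                 # entries of the full matrix
storage_lowrank = R * (I + J + 1)    # entries of the truncated factors
rel_err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
print(storage_full, storage_lowrank, rel_err)  # large saving, small nonzero error
\end{verbatim}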
The concept of compression of multidimensional large-scale data by tensor network decompositions can be intuitively explained as follows. Consider the approximation of an $N$-variate function $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_N)$ by a finite sum of products of individual functions, each depending on only one or a very few variables (Bebendorf, 2011; Dolgov, 2014; Cho et al., 2016; Trefethen, 2017). In the simplest scenario, the function $f(\mathbf{x})$ can be (approximately) represented in the following separable form
$$ f(x_1, x_2, \ldots, x_N) \cong f^{(1)}(x_1) \, f^{(2)}(x_2) \cdots f^{(N)}(x_N). \qquad (1.1) $$
In practice, when an $N$-variate function $f(\mathbf{x})$ is discretized into an $N$th-order array, or a tensor, the approximation in (1.1) then corresponds to the representation by rank-1 tensors, also called elementary tensors (see Section 2). Observe that with $I_n$, $n = 1, 2, \ldots, N$, denoting the size of each mode and $I = \max_n \{I_n\}$, the memory requirement to store such a full tensor is $\prod_{n=1}^{N} I_n \leq I^N$, which grows exponentially with $N$. On the other hand, an approximation by a finite sum of rank-1 (separable) terms takes, in its discretized form,
$$ \mathbf{X} \cong \sum_{r=1}^{R} \mathbf{f}_r^{(1)} \circ \mathbf{f}_r^{(2)} \circ \cdots \circ \mathbf{f}_r^{(N)}, $$
where $\mathbf{f}_r^{(n)} \in \mathbb{R}^{I_n}$ denotes a discretized version of the univariate function $f_r^{(n)}(x_n)$, the symbol $\circ$ denotes the outer product, and $R$ is the tensor rank; its storage cost, $R \sum_{n=1}^{N} I_n$, scales only linearly in $N$.
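The following numpy sketch (the particular separable function, grid sizes and variable names are chosen only for illustration) discretizes a product of univariate functions and confirms that the resulting tensor is the outer product of the discretized factors, so that only $\sum_n I_n$ values need to be stored instead of $\prod_n I_n$.

\begin{verbatim}
import numpy as np

# Discretization grids for the three variables (sizes chosen arbitrarily).
x1 = np.linspace(0.0, 1.0, 30)
x2 = np.linspace(0.0, 2.0, 40)
x3 = np.linspace(0.0, 3.0, 50)

# A separable trivariate function f(x1, x2, x3) = f1(x1) f2(x2) f3(x3).
f1, f2, f3 = np.sin(x1), np.exp(-x2), np.cos(x3)

# Full tensor of function values on the grid: prod_n I_n = 60000 entries.
F_full = f1[:, None, None] * f2[None, :, None] * f3[None, None, :]

# Rank-1 (outer-product) representation: only sum_n I_n = 120 entries.
F_rank1 = np.einsum('i,j,k->ijk', f1, f2, f3)

assert np.allclose(F_full, F_rank1)
print(F_full.size, f1.size + f2.size + f3.size)  # 60000 vs 120
\end{verbatim}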
Common low-rank tensor formats of this kind include the following (a short numerical sketch of these formats is given after the example below):

• The Tucker format, in the form
$$ f(x_1, x_2, \ldots, x_N) \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_N=1}^{R_N} g_{r_1, r_2, \ldots, r_N} \; f_{r_1}^{(1)}(x_1) \, f_{r_2}^{(2)}(x_2) \cdots f_{r_N}^{(N)}(x_N), \qquad (1.4) $$
and its distributed tensor network variants (see Section 3.3),
• The Tensor Train (TT) format (see Section 4.1), in the form
$$ f(x_1, x_2, \ldots, x_N) \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} f_{r_1}^{(1)}(x_1) \, f_{r_1, r_2}^{(2)}(x_2) \cdots f_{r_{N-2}, r_{N-1}}^{(N-1)}(x_{N-1}) \, f_{r_{N-1}}^{(N)}(x_N), \qquad (1.5) $$
• The Hierarchical Tucker (HT) format, in which the functions associated with a subset $t \subset \{1, 2, \ldots, N\}$ of the variables are built up recursively via
$$ f_{r_t}^{(t)}(x_t) \cong \sum_{r_u=1}^{R_u} \sum_{r_v=1}^{R_v} g_{r_u, r_v, r_t}^{(t)} \; f_{r_u}^{(u)}(x_u) \, f_{r_v}^{(v)}(x_v), $$
where $x_t = \{x_i : i \in t\}$. See Section 2.2.1 for more detail.
Example. In a particular case for $N = 4$, the HT format can be expressed by
$$ f(x_1, x_2, x_3, x_4) \cong \sum_{r_{12}=1}^{R_{12}} \sum_{r_{34}=1}^{R_{34}} g_{r_{12}, r_{34}}^{(1234)} \; f_{r_{12}}^{(12)}(x_1, x_2) \, f_{r_{34}}^{(34)}(x_3, x_4), $$
where the functions $f_{r_{12}}^{(12)}(x_1, x_2)$ and $f_{r_{34}}^{(34)}(x_3, x_4)$ are, in turn, expanded through the same recursion.
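To make the formats (1.4) and (1.5) concrete, here is a small numpy sketch (all ranks, mode sizes and factor values are arbitrary) that reconstructs a full 3rd-order tensor from randomly generated Tucker and TT factors by evaluating the corresponding multi-indexed sums with einsum; representing the first and last TT factors as cores with a dummy rank-1 mode is a common equivalent convention.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3 = 6, 7, 8           # mode sizes (arbitrary)
R1, R2, R3 = 2, 3, 4           # Tucker multilinear ranks (arbitrary)

# Tucker format (1.4): core g and factor vectors f^{(n)}_{r_n} stacked as columns.
G = rng.standard_normal((R1, R2, R3))
F1 = rng.standard_normal((I1, R1))
F2 = rng.standard_normal((I2, R2))
F3 = rng.standard_normal((I3, R3))
X_tucker = np.einsum('abc,ia,jb,kc->ijk', G, F1, F2, F3)

# TT format (1.5) for N = 3: cores of sizes (1 x I1 x r1), (r1 x I2 x r2), (r2 x I3 x 1).
r1, r2 = 3, 3                  # TT ranks (arbitrary)
G1 = rng.standard_normal((1, I1, r1))
G2 = rng.standard_normal((r1, I2, r2))
G3 = rng.standard_normal((r2, I3, 1))
X_tt = np.einsum('aib,bjc,ckd->ijk', G1, G2, G3)  # sums over the TT ranks

print(X_tucker.shape, X_tt.shape)  # (6, 7, 8) (6, 7, 8)
\end{verbatim}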
1. Although similar approaches have been known in quantum physics for a long time, their rigorous mathematical analysis is still a work in progress (see Oseledets, 2011; Orús, 2014, and references therein).
Review and tutorial papers (Kolda and Bader, 2009; Lu et al., 2011; Grasedyck et al., 2013; Cichocki et al., 2015b; de Almeida et al., 2015; Sidiropoulos et al., 2016; Papalexakis et al., 2016; Bachmayr et al., 2016) and books (Smilde et al., 2004; Kroonenberg, 2008; Cichocki et al., 2009; Hackbusch, 2012) dealing with TDs and TNs already exist; however, they typically focus on standard models, with no explicit links to very large-scale data processing topics or connections to a wide class of optimization problems. The aim of this monograph is therefore to
2. Usually, we assume that huge-scale problems operate on at least $10^7$ parameters.
References