TensorSOM
TensorSOM
Abstract
In this paper, we propose nonlinear tensor analysis methods: the tensor self-
organizing map (TSOM) and the tensor generative topographic mapping (TGTM).
TSOM is a straightforward extension of the self-organizing map from high-dimensional
data to tensorial data, and TGTM is an extension of the generative topographic
map, which provides a theoretical background for TSOM using a probabilistic
generative model. These methods are useful tools for analyzing and visualizing
tensorial data, especially multimodal relational data. For given n-mode relational
data, TSOM and TGTM can simultaneously organize a set of n topographic maps.
Furthermore, they can be used to explore the tensorial data space by interactively
visualizing the relationships between modes. We present the TSOM algorithm and
a theoretical description from the viewpoint of TGTM. Various TSOM variations
and visualization techniques are also described, along with some applications to
real relational datasets. Additionally, we attempt to build a comprehensive de-
scription of the TSOM family by adapting various data structures.
Keywords:
self-organizing map, generative topographic map, tensor decomposition, relational data
∗ This is the accepted manuscript version of the journal article in Neural Networks 77 (2016) 107–125,
which is made available for scholarly purposes only, in accordance with the journal’s author permissions.
The final publication is available at http://www.sciencedirect.com/science/article/pii/S0893608016000149.
DOI: http://dx.doi.org/10.1016/j.neunet.2016.01.013
† Corresponding author. E-mail: [email protected]
1 Introduction
Topographic mappings are a class of dimension reduction methods that project high-
dimensional data into a lower-dimensional space while preserving the topological struc-
ture. Representatives of this class include the self-organizing map (SOM) [1, 2] and
the generative topographic mapping (GTM) [3, 4]. The aim of this work is to extend
these methods to tensorial data, namely, to the tensor SOM (TSOM) and the tensor
GTM (TGTM), which can be used to visualize multimodal relational data.
A typical application of TSOM/TGTM is the interpretation of user–item rating
data, which contain product ratings for an online shop as evaluated by the users [5,
6, 7]. If the dataset consists of D-dimensional rating scores for J items evaluated by
I users, then the data can be represented as an I × J × D-dimensional tensor. In such
a case, we are interested in analyzing the customers, the items, and the relationships
between them. These kinds of tensorial datasets are common in many fields. An-
other typical example is SNS or e-mail message analysis, which can be modeled by
a tensor of (user)×(keyword)×(time) [8, 9, 10, 11]. When recording a multi-channel
electroencephalogram, the power of wavelet filters can be represented by a tensor of
(channel)×(frequency)×(time) [12, 13]. Another possible application is image recog-
nition for faces, postures, or gaits, because these images can be decomposed into
(person)×(posture)×(emotional expression) [14, 15, 16, 17].
Representative methods for tensorial data analysis constitute a group of algorithms
called tensor factorizations or tensor decompositions [18, 19, 20]. Tensor decompo-
sitions such as Tucker and PARAFAC involve generalizations of matrix decomposi-
tions, which decompose the given tensor into a product of lower-order matrices or ten-
sors. Singular value decomposition (SVD) [21], principal component analysis (PCA)
[22, 23], independent component analysis (ICA) [24, 25], and nonnegative matrix fac-
torization (NMF) [26, 27] have all been extended to cope with tensorial data. The aim
of this work is to add nonlinear tensor analysis tools to this group by extending the
SOM and the GTM.
To analyze a multimodal relational dataset, we need to make two different types
of analyses, that is, intra-mode and inter-mode analyses. For the intra-mode analysis,
a low-dimensional representation is required for every mode. In the case of user–
item ratings, we need two topographic maps, one for users and one for items. The
user map indicates how similar or different each user's preferences are to those of other
users, whereas the item map visualizes how similarly or differently each item is preferred
by users compared with other items. To make the inter-mode analysis,
we need to visualize the relations between different modes. At times, for example,
we need to analyze user preferences by focusing on an item group of interest, and at
other times need to analyze preferred items by focusing on a user group of interest.
For these purposes, we assume that the inherent property of each user and each item
is individually represented by a low-dimensional latent variable, and that the rating
score is determined by a nonlinear function of the latent variables. This latent variable
model meets the framework of the GTM, a Bayesian extension of the SOM. This is the
motivation for developing tensorial extensions of topographic mappings.
Another motivation of this work is to develop a simple, fast and flexible analysis
tool for tensorial data. We intend for people to be able to use it easily in practical
tasks, and for this reason we chose the SOM. The SOM is a popular neural network
architecture with a simple algorithm. By a straightforward extension, the advantages
of SOMs are inherited by TSOMs. In fact, the TSOM algorithm can be programmed
with matrix multiplications only, and does not require either inverses or eigenvalues
to be found. This has the advantage not only of simplicity, but also of computational
cost, which is usually expensive in practical tensor data analysis. In addition, tensor
data sometimes has a complex structure, and our TSOM can easily be adapted to such
cases by systematically combining SOMs like building-blocks (Secs. 6 and 7). This is
another advantage of using the SOM.
The remainder of this paper is organized as follows. Sec. 2 presents the theoretical
background of this work and introduces some related research. Sec. 3 describes the ba-
sic TSOM and TGTM algorithms. Sec. 4 presents simulation results using artificial and
real datasets, and Sec. 5 introduces the visualization techniques. Sec. 6 describes some
extensions of the TSOM. In Sec. 7, we discuss the TSOM family and the relationship
between the original SOM and TSOM. Sec. 8 presents our conclusions.
2 Theoretical Preparations
2.1 Notation of vectors, matrices and tensors
An M-dimensional array of scalars is referred to as a tensor of order M, each dimension
of which is referred to as a mode [18]. In the case of user–item rating data, the 1st mode
corresponds to the user, and the 2nd corresponds to the item.
Scalars are all in R and are indicated by italics, e.g., x, y. The exceptions are
d, i, j, k, l, m, n and their upper cases, which are all in N. A lower case font is used for
indices of vector or tensor components, and capital letters are used to represent their
upper bounds. Thus, i ∈ {1, …, I}. Vectors are denoted in bold, lower case font (e.g.,
x, y), and are column vectors unless otherwise specified. Matrices are denoted in bold,
upper case font (e.g., X, Y), and higher-order tensors are denoted by an underscore
(e.g., X, Y). The (i, j) component of a matrix A is denoted by A_{ij}. Similarly, A_{ijk} is
the (i, j, k) component of a tensor A of order 3. For a tensor of order M, A_{k1 k2 … kM} is also
denoted by A_k, where k = (k_1, …, k_M).
A subarray is denoted by a colon expression. x_{i:} and x_{:j} are the vectors consisting
of the ith row and jth column components of matrix X, respectively. Similarly, x_{ij:} is a
vector cut out of tensor X along the 3rd mode. For a tensor of order M, subtensors of
order (M − 1) are called slices. For example, A_{::i::} is the ith slice of mode 3. To describe
general cases, a slice of A is also denoted by A_{i(m)}, which means the ith slice of mode
m. Thus, A_{::i::} ≡ A_{i(3)}.
The m-mode tensor–matrix product is denoted by ×_m. For example, the product of
X ∈ R^{K_1×···×K_M} and A ∈ R^{J×K_m} becomes Y = X ×_m A, each component of which is given by
Y_{k_1 … k_{m−1} j k_{m+1} … k_M} = Σ_{k_m=1}^{K_m} X_{k_1 … k_m … k_M} A_{j k_m}. (1)
Note that Y is a tensor of order M, the size of which is K_1 × ··· × K_{m−1} × J × K_{m+1} × ··· × K_M.
When M matrices {A} = {A^{(1)}, …, A^{(M)}} are given, the M-multiple product of X and
{A} is
Y = X ×_1 A^{(1)} ×_2 ··· ×_M A^{(M)} = X × {A}. (2)
X ×_{−m} {A} denotes the (M − 1)-multiple product of X and {A^{(1)}, …, A^{(M)}} except A^{(m)}.
Similarly, the multiple product of X and a subset of {A} is denoted by X ×_{M′} {A}, where
M′ is a set of the indices of the modes. For example, X ×_2 A^{(2)} ×_4 A^{(4)} ≡ X ×_{{2,4}} {A}.
The outer product of X and Y is denoted by X ◦ Y. For X ∈ R^{K_1×···×K_M} and Y ∈
R^{J_1×···×J_N}, Z = X ◦ Y becomes Z ∈ R^{K_1×···×K_M×J_1×···×J_N}, where Z_{kj} = X_k Y_j. The inner
product is denoted by ⟨X, Y⟩, defined by the product sum of all components. Thus
⟨X, Y⟩ ≜ Σ_k X_k Y_k. ‖X‖ is the Euclidean norm of X, defined by ‖X‖ ≜ √(Σ_k X_k²) = √⟨X, X⟩.
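As a rough illustration of the m-mode product (1) and the multiple product (2), the following NumPy sketch may help; the helper names mode_product and multi_product are ours, and modes are 0-indexed here, unlike the 1-indexed convention above.

```python
import numpy as np

# Sketch of the m-mode tensor-matrix product Y = X x_m A (Eq. (1)).
# X is a tensor of order M and A has shape (J, K_m); m is 0-indexed.
def mode_product(X, A, m):
    Y = np.tensordot(A, X, axes=(1, m))   # contract A's columns with mode m of X
    return np.moveaxis(Y, 0, m)           # move the new J-sized axis back to position m

# Sketch of the multiple product Y = X x_1 A^(1) x_2 ... x_M A^(M) (Eq. (2)).
def multi_product(X, As):
    for m, A in enumerate(As):
        X = mode_product(X, A, m)
    return X
```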
2.3 Self-organizing map: SOM
We first briefly summarize SOM as an introduction to TSOM. Suppose that Ω is the
universe of the objects of interest, and a data vector x(ω) ∈ X is observed from a
sample point ω ∈ Ω. Here, X is the observation space, which is assumed to be X = R^D
in this paper. Let Ω_S = {ω_1, …, ω_N} be a sample set from the universe Ω, and x_{n:}
be the data vector observed from ω_n. Thus, X = (x_{n:}) becomes an N × D data matrix.
From the viewpoint of a generative model, x_{n:} is assumed to be generated from the
latent variable z_{n:} ∈ Z and a nonlinear smooth map f(z), so that x_{n:} = f(z_{n:}) + ε.
Here, ε is isotropic Gaussian noise. The task of the SOM is to estimate the latent
variables (z_{n:}) and the nonlinear map f(z), so that they represent the observed data by
x_{n:} ≃ f(z_{n:}). As a result, we obtain a topographic map of the objects in the latent
space Z. Usually Z is assumed to be a closed area in a low-dimensional space R^L,
typically the 2-dimensional square space. In this paper, the prior of z is the uniform
distribution in Z = [−1, 1]^L ⊂ R^L (L = 1 or 2), but it is easily generalized to other
cases.
To represent the nonlinear map f(z), we discretize the latent space Z into K regular
nodes. Let ζ_{k:} be the positional vector of the kth node in Z, and y_{k:} ≜ f(ζ_{k:}). Then
the matrix Y = (y_{k:}) represents the entire map. Because y_{k:} indicates the corresponding
point in X for the kth node in Z, it is often referred to as the ‘reference vector’.
The SOM algorithm is an expectation-maximization (EM) algorithm in a broad
sense, in which we iterate over the expectation (E) and maximization (M) steps while
reducing the neighborhood area until the map converges [35, 36]. Considering the
extension to tensor cases, let us summarize the SOM algorithm using matrix notation.
In the E step, the maximum a posteriori node is the winning node for each data
vector. That is,
k_n^* = arg min_k ‖x_{n:} − y_{k:}‖², (6)
B_{kn} = δ(k, k_n^*), (7)
where B = (B_{kn}) ∈ R^{K×N} is the winning matrix, and δ is Kronecker's delta. Let
R = (R_{kn}) ∈ R^{K×N} be the responsibility matrix, calculated from the neighborhood matrix H ∈ R^{K×K} as
R = HB, (8)
H_{kk′} = exp[−d²(k, k′)/(2σ²)], (9)
where d(k, k′) is the distance between the kth and k′th nodes in the latent space and σ is the neighborhood radius. Further, let
G = diag(Σ_{n=1}^{N} R_{kn}), (10)
R̃ = G^{−1} R, (11)
where G ∈ R^{K×K} is the diagonal matrix, which represents the sum of the responsibility
of each node, and R̃ ∈ R^{K×N} is the responsibility normalized by G.
In the M step, the map Y is updated so that the expectation of the square error is
minimized, using
y_{k:} = (1/G_k) Σ_{n=1}^{N} R_{kn} x_{n:}. (12)
In matrix notation, this is written as
Y = R̃X. (13)
The batch SOM algorithm can be regarded as maximizing the objective function
F = −(1/(2N)) Σ_{n=1}^{N} Σ_{k=1}^{K} H_{k k_n^*} ‖x_{n:} − y_{k:}‖² (15)
[35, 36, 37, 38, 39]. We can easily derive the M step (13) using this objective function,
and slightly modify the E step (6) to
k_n^* = arg min_{k′} Σ_{k=1}^{K} H_{k k′} ‖x_{n:} − y_{k:}‖². (16)
The original SOM algorithm (6) can be regarded as an approximation of (16), because
they are almost equal when the neighborhood is sufficiently small. Note that F is a
function of σ, which gradually changes throughout the learning process.
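For reference, a single batch iteration of this algorithm can be sketched in NumPy as follows; the array names, the node coordinates zeta, and the Gaussian neighborhood are our assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of one batch-SOM iteration (Eqs. (6)-(13)):
# X (N x D) data, Y (K x D) reference vectors, zeta (K x L) node coordinates,
# sigma the current neighborhood radius.
def som_step(X, Y, zeta, sigma):
    # E step: winning node for each data vector (Eq. (6))
    dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)        # N x K
    winners = dist.argmin(axis=1)                                    # k_n^*
    # Neighborhood matrix and responsibilities (Eqs. (8)-(9))
    d2 = ((zeta[:, None, :] - zeta[None, :, :]) ** 2).sum(axis=2)    # K x K
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    R = H[:, winners]                                                # K x N
    # M step: neighborhood-weighted average of the data (Eqs. (10)-(13))
    G = R.sum(axis=1, keepdims=True)
    return (R @ X) / G, winners                                      # Y = R~ X
```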
so that it minimizes the expectation error. Here, Φ ≜ (φ_j(ζ_{k:})) ∈ R^{K×J} and Φ_G^# is the
generalized inverse of Φ with the weight G. Thus, Φ_G^# ≜ (Φ^T GΦ)^{−1} Φ^T G. Defining
the J × N matrix Q̃ by Q̃ ≜ Φ_G^# R̃ (19), (18) becomes
W = Q̃X. (20)
The difference between (13) and (20) is that Y and R̃ are replaced by W and Q̃.
To update W using (18), Φ_G^# should be calculated at every iteration. Because G_k is
the sum of the responsibilities, G_k is more or less equal to NP_k, where P_k is the prior
of the kth node. Therefore, Φ_G^# can be approximated by Φ_P^#, which we can calculate
in advance. In the case of the uniform prior, Φ_P^# becomes the ordinary Moore–Penrose
inverse (Φ^T Φ)^{−1} Φ^T.
Another advantage of using basis functions is that gradient methods can be used for
the E step. In the case of the gradient descent method, the latent variable z_{n:} = (z_{nl}) ∈ R^L
is updated by
z_{nl} := z_{nl} − η ⟨W^T φ(z_{n:}) − x_{n:}, W^T φ′_l(z_{n:})⟩, (21)
where η is the learning rate.
Note that a single step of the gradient descent method is enough for every E step,
because the convergence speed is limited by the neighborhood size scheduling. Thus
there is no need to use other faster methods for this purpose.
Unlike SOM, GTM does not have a neighborhood relationship in the E step. Instead,
GTM introduces other assumptions in the M step, to obtain a smooth continuous map.
Thus, the original GTM uses Gaussian radial basis functions [3], and another type of
GTM represents the map using a Gaussian stochastic process [4, 40]. In the former
case, the M step is
W = (Φ^T GΦ + (λ/β) I)^{−1} Φ^T RX. (24)
Here, Φ is the matrix of the radial bases and the term (λ/β)I represents the regularization.
Letting Q̃ ≜ (Φ^T GΦ + (λ/β) I)^{−1} Φ^T R, we can represent the M step by W = Q̃X, as in
(20). In the latter case, the M step is
Y = (G + (λ/β) H^{−1})^{−1} RX, (25)
where the neighborhood matrix H is used as the correlation matrix for the Gaussian
process. Note that both algorithms are equivalent if the radial bases satisfy ΦΦ^T = H.
3 Basic TSOM
3.1 Definition of the task
To clarify the task of the TSOM, let us revisit the example of the user–item rating. In
this case, the objects of interest are the set of users Ω(user) , and the set of items for
sale Ω(item) . Because each rating score x ∈ RD is determined by a pair of instances
(ω(user) , ω(item) ), we can call this type of data 2-mode relational data. The data tensor
X is of order 3, and its modes correspond to the users, items, and rating components.
Suppose that the last mode is not the target of the tensor analysis, similar to Tucker2.
(If necessary, the data can be regarded as 3-mode relational data, which can be also
dealt with by a TSOM.)
The aim of the TSOM is to visualize the relationships between objects within each
mode (i.e., between users and between items), and simultaneously visualize the rela-
tionships between two modes (i.e., between users and items). In this paper, the TSOM
for 2-mode relational data is abbreviated to TSOM–R2 .
This situation can be generalized as follows. Suppose that the set of object universes
to be analyzed is {Ω^{(1)}, …, Ω^{(M)}}, and Ω_S^{(m)} = {ω_1^{(m)}, …, ω_{N_m}^{(m)}} is the sample set
of Ω^{(m)}, consisting of N_m instances. We observe the D-dimensional data x_{n1…nM:} =
x(ω_{n1}^{(1)}, …, ω_{nM}^{(M)}) for all members of the direct product set Ω_S = ∏_{m=1}^{M} Ω_S^{(m)}. Then, we
obtain M-mode relational data X, which is a tensor of order (M + 1).
Given such tensorial data, TSOM–R^M must organize M topographic maps (each of
which visualizes the relationships of objects within each mode) and visualize the rela-
tionships between modes. TSOM–R^M assumes that there are M latent spaces {Z^{(1)}, …, Z^{(M)}},
and that the observed data can be represented by a nonlinear map f of M latent vari-
ables. That is,
x_{n1…nM:} ≃ f(z_{n1:}^{(1)}, …, z_{nM:}^{(M)}). (26)
Figure 1: (a) The architecture of TSOM–R2. The map from the latent spaces to the
observation space is represented by the set of reference vectors (y_{k1 k2:}), each of which
corresponds to a pair of nodes (ζ^{(1)}_{k1}, ζ^{(2)}_{k2}) in the latent spaces. The reference vectors
represent a product manifold in the observation space. (Note that the product manifold
is 4-dimensional in this case, and is depicted as a 3-dimensional nonlinear cube in
this figure.) (b) Estimates of the instance manifold {U^{(m)}_n} for each instance ω^{(m)}_n. The
winner is determined according to the best matching slice manifold of the instance
manifold.
Thus, the actual TSOM task is to simultaneously estimate the nonlinear map and the
latent variables.
For ordinary SOM, the dimensions of the observation space (D) and of the latent
space (L) should satisfy D ≥ L, so that the topology is preserved. Note that this
restriction is not necessary for TSOM. In fact, we can even consider a scalar case (i.e.,
D = 1). This is because we expect topological preservation between {X_{n(m)}} and {z^{(m)}_{n:}},
and the dimension of X_{n(m)} is usually much greater than L.
The aim of TSOM–R2 is to model the 2-mode relational data using a nonlinear map. That is,
f : Z^{(1)} × Z^{(2)} → Y ⊂ X = R^D,
(z^{(1)}, z^{(2)}) ↦ y(z^{(1)}, z^{(2)}). (27)
To represent f, TSOM–R2 has two latent spaces, Z^{(1)} and Z^{(2)}, each of which is dis-
cretized to K_1 and K_2 nodes, respectively. Letting ζ^{(m)}_{km} be the coordinate of the k_mth
node of the mth latent space, we assign a reference vector to every pair of nodes (k_1, k_2),
i.e., y_{k1 k2:} = f(ζ^{(1)}_{k1}, ζ^{(2)}_{k2}). Consequently, the entire map is represented by the tensor
Y = (Y_{k1 k2 d}), as shown in Fig. 1 (a).
In the TSOM–R2 algorithm, slice manifolds and instance manifolds are essential
for determining the winning nodes. The slice manifolds are the submanifolds of Y,
which are represented by the slices of Y. For example, the k_1th slice manifold of mode
1 is defined by Y^{(1)}_{k1} = f(ζ^{(1)}_{k1}, Z^{(2)}), which is represented by Y_{k1::}. The slice manifold
characterizes the data distribution for a specified latent variable. In contrast, the in-
stance manifold characterizes the data distribution when an instance is specified. Now,
suppose that an instance ω^{(1)}_{n1} is specified, and the data distribution {x_{n1 1:}, …, x_{n1 N2:}} is
approximated by x_{n1 n2:} ≃ f(z^{(2)}_{n2:} | ω^{(1)}_{n1}). Then, the instance manifold U^{(1)}_{n1} is given by
U^{(1)}_{n1} = f(Z^{(2)} | ω^{(1)}_{n1}). Suppose that this manifold is represented by a matrix U^{(1)}_{n1::},
defined by u^{(1)}_{n1 k2:} ≜ f(ζ^{(2)}_{k2} | ω^{(1)}_{n1}). Thus, the instance manifold U^{(1)}_{n1} is the n_1th slice
of U^{(1)}. The instance manifolds of mode 2 are also defined in the same way. In the
TSOM algorithm, the instance manifolds are estimated for all instances, and they are
regarded as the feature vectors for determining the winners. This situation is shown in
Fig. 1 (b).
Like the ordinary SOM, the TSOM algorithm consists of an E step and an M step.
For readers who are unfamiliar with the tensor notation, we describe the algorithm in two
ways, with and without the tensor convention.
E step
In the E step, the winner of each mode is determined by
k^{*(1)}_{n1} = arg min_{k1} ‖Y_{k1::} − U^{(1)}_{n1::}‖² (28)
          = arg min_{k1} Σ_{k2=1}^{K2} Σ_{d=1}^{D} (Y_{k1 k2 d} − U^{(1)}_{n1 k2 d})² (29)
k^{*(2)}_{n2} = arg min_{k2} ‖Y_{:k2:} − U^{(2)}_{:n2:}‖² (30)
          = arg min_{k2} Σ_{k1=1}^{K1} Σ_{d=1}^{D} (Y_{k1 k2 d} − U^{(2)}_{k1 n2 d})² (31)
Thus, the winner of ω^{(1)}_{n1} is determined as the best matching slice manifold for the
instance manifold U^{(1)}_{n1::} (Fig. 1 (b)). (28) also means that U^{(1)}_{n1::} and Y_{k1::} behave as if
they are the data and the reference vectors with respect to mode 1.
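The winner search and the responsibility computation for mode 1 can be sketched in NumPy as follows (a minimal illustration of (28)–(41); the array names and shapes are our assumptions). The same function is applied to mode 2 after permuting Y to shape (K2, K1, D) and U(2) to shape (N2, K1, D).

```python
import numpy as np

# Minimal sketch of the TSOM-R^2 E step for mode 1 (Eqs. (28)-(41)):
# Y (K1 x K2 x D) map, U1 (N1 x K2 x D) instance manifolds,
# zeta1 (K1 x L) node coordinates, sigma1 the neighborhood radius.
def e_step_mode1(Y, U1, zeta1, sigma1):
    # Distance between each slice manifold Y_{k1::} and each instance manifold U1_{n1::}
    dist = ((Y[:, None, :, :] - U1[None, :, :, :]) ** 2).sum(axis=(2, 3))   # K1 x N1
    winners = dist.argmin(axis=0)                                           # k*_{n1}  (Eq. (28))
    d2 = ((zeta1[:, None, :] - zeta1[None, :, :]) ** 2).sum(axis=2)         # K1 x K1
    H = np.exp(-d2 / (2.0 * sigma1 ** 2))                                   # neighborhood matrix
    R = H[:, winners]                                                       # Eqs. (34), (36)
    G = R.sum(axis=1, keepdims=True)                                        # Eq. (38)
    return winners, R / G                                                   # R~  (Eq. (40))
```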
After determining all the winners, the winning matrices B^{(1)} and B^{(2)}, and the re-
sponsibility matrices R^{(1)} and R^{(2)} are obtained using
B^{(1)}_{k1 n1} = δ(k1, k^{*(1)}_{n1}) (32)
B^{(2)}_{k2 n2} = δ(k2, k^{*(2)}_{n2}) (33)
R^{(1)} = H^{(1)} B^{(1)} (34)
R^{(2)} = H^{(2)} B^{(2)}. (35)
Here, H^{(m)} is the neighborhood matrix of mode m. Elementwise, these equations can
be represented by
R^{(1)}_{k1 n1} = exp[−(1/(2σ^{(1)2})) d²(k1, k^{*(1)}_{n1})] (36)
R^{(2)}_{k2 n2} = exp[−(1/(2σ^{(2)2})) d²(k2, k^{*(2)}_{n2})], (37)
where d(k, k′) is the distance between two nodes in the latent space. Similar to (10)
and (11), the sum of the responsibility G^{(m)} and normalized responsibility R̃^{(m)} are also
calculated for each mode using
G^{(1)} = diag(Σ_{n1=1}^{N1} R^{(1)}_{k1 n1}) (38)
G^{(2)} = diag(Σ_{n2=1}^{N2} R^{(2)}_{k2 n2}) (39)
R̃^{(1)} = G^{(1)−1} R^{(1)} (40)
R̃^{(2)} = G^{(2)−1} R^{(2)}. (41)
M step
In the M step, the instance manifolds U^{(1)}, U^{(2)} and the map Y are updated using
U^{(1)} = X ×_2 R̃^{(2)} (42)
U^{(2)} = X ×_1 R̃^{(1)} (43)
Y = X ×_1 R̃^{(1)} ×_2 R̃^{(2)}. (44)
If the components of the tensors are denoted explicitly, the above equations can be
rewritten as
U^{(1)}_{n1 k2 d} = (1/G^{(2)}_{k2}) Σ_{n2=1}^{N2} R^{(2)}_{k2 n2} X_{n1 n2 d} (45)
U^{(2)}_{k1 n2 d} = (1/G^{(1)}_{k1}) Σ_{n1=1}^{N1} R^{(1)}_{k1 n1} X_{n1 n2 d} (46)
Y_{k1 k2 d} = (1/(G^{(1)}_{k1} G^{(2)}_{k2})) Σ_{n1=1}^{N1} Σ_{n2=1}^{N2} R^{(1)}_{k1 n1} R^{(2)}_{k2 n2} X_{n1 n2 d}. (47)
E step
As in TSOM–R2, the winners are determined by the distance between the instance
manifolds U^{(m)}_{n(m)} and the slice manifolds Y_{k(m)}. That is,
k^{*(m)}_n = arg min_k ‖Y_{k(m)} − U^{(m)}_{n(m)}‖² (50)
B^{(m)}_{kn} = δ(k, k^{*(m)}_n). (51)
Then, the matrices R^{(m)}, G^{(m)}, and R̃^{(m)} are calculated for each mode using
R^{(m)} = H^{(m)} B^{(m)} (52)
G^{(m)} = diag(Σ_{n=1}^{N_m} R^{(m)}_{kn}) (53)
R̃^{(m)} = G^{(m)−1} R^{(m)}. (54)
M step
In the M step, the map and the instance manifolds are updated using
Y = X × {R̃} (55)
U^{(m)} = X ×_{−m} {R̃}, (56)
where {R̃} = {R̃^{(1)}, …, R̃^{(M)}}. These multiple products can be computed efficiently by sharing
partial products. Let P(M′) denote the multiple product of X and the subset of {R̃} indexed by
M′, i.e., P(M′) = X ×_{M′} {R̃}. Note that P(M′ + {m}) = P(M′) ×_m R̃^{(m)}, if m ∉ M′. Considering Y = P(M) and
U^{(m)} = P(M \ {m}), (57)–(61) can be calculated using
P({1, 2}) = X ×_1 R̃^{(1)} ×_2 R̃^{(2)} (63)
P({3, 4}) = X ×_3 R̃^{(3)} ×_4 R̃^{(4)} (64)
U^{(1)} = P({3, 4}) ×_2 R̃^{(2)} (65)
U^{(2)} = P({3, 4}) ×_1 R̃^{(1)} (66)
U^{(3)} = P({1, 2}) ×_4 R̃^{(4)} (67)
U^{(4)} = P({1, 2}) ×_3 R̃^{(3)} (68)
Y = U^{(1)} ×_1 R̃^{(1)}. (69)
Thus, these tensors can be calculated in parallel like a binary-tree algorithm. In this
case, the number of tensor–matrix multiplications is reduced from 16 to 9 (= 4 log2 4 +
1).
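For M = 4, the shared-partial-product computation (63)–(69) can be sketched as follows, assuming the mode_product helper from the sketch in Sec. 2.1; it uses 9 tensor–matrix products instead of 16.

```python
# Minimal sketch of Eqs. (63)-(69): X is the data tensor of order 5 (the last
# mode holds the D components) and Rt[0..3] are the normalized responsibility
# matrices, Rt[m] of shape (K_m, N_m); mode_product is the helper sketched earlier.
def binary_tree_m_step(X, Rt):
    P12 = mode_product(mode_product(X, Rt[0], 0), Rt[1], 1)   # P({1,2})
    P34 = mode_product(mode_product(X, Rt[2], 2), Rt[3], 3)   # P({3,4})
    U1 = mode_product(P34, Rt[1], 1)                          # U(1)
    U2 = mode_product(P34, Rt[0], 0)                          # U(2)
    U3 = mode_product(P12, Rt[3], 3)                          # U(3)
    U4 = mode_product(P12, Rt[2], 2)                          # U(4)
    Y = mode_product(U1, Rt[0], 0)                            # Y
    return Y, [U1, U2, U3, U4]
```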
Suppose that X and Y are tensors of order (M + 1), and {R} = {R^{(1)}, …, R^{(M)}} is a set
of responsibility matrices that give the weights. The weighted error between X and Y
is defined as
E(X, Y; {R}) ≜ Σ_{n1,…,nM} Σ_{k1,…,kM} ∏_{m=1}^{M} R^{(m)}_{km nm} ‖x_{n1…nM:} − y_{k1…kM:}‖². (70)
If each R^{(m)} is normalized so that Σ_k R^{(m)}_{kn} = 1, (70) gives the expectation error. Let the
square errors of two tensors except mode m be defined as
E^{(−m)}_{km nm}(X, Y; {R}) ≜ Σ_{n1,…,nM (except nm)} Σ_{k1,…,kM (except km)} ∏_{m′=1, m′≠m}^{M} R^{(m′)}_{km′ nm′} ‖x_{n1…nM:} − y_{k1…kM:}‖². (71)
because the second term of (73) is independent of k^{*(m)}_n. Similar to the approximations
from (16) to (6), (74) and (75) become
k^{*(m)}_n = arg min_k E^{(−m)}_{kn}(X, Y; {R}) (76)
          = arg min_k ‖Y_{k(m)} − U^{(m)}_{n(m)}‖². (77)
Thus, we obtain (50). Both (76) and (77) yield the same result, but (77) can be calcu-
lated quicker than (76).
In the M step, (72) is maximized with respect to Y. It is easy to show that the
solution is given by (55).
3.5 TSOM with basis functions
If the order of the tensor increases, the computational costs increase drastically. The
size of the task is roughly proportional to K^M. To reduce the costs, we need to reduce
the task size in addition to improving the algorithm. The easiest way is to reduce K, i.e.,
the number of nodes. However, this approach sacrifices the spatial resolution.
Another approach is to use a set of basis functions. Because we assume that the
map is smooth and continuous, we should be able to sufficiently represent it using a
relatively small number of bases. If J basis functions for each mode are enough to
represent the map, then the map can be represented by the tensor W ∈ R^{J^M×D}, instead
of Y ∈ R^{K^M×D}. Using an orthonormal basis set provides further advantages. Because distances are
preserved by the orthonormal transformation, the distance between two maps is equal
to the distance between the two corresponding coefficient vectors. Applying this to (50),
the distance between K^{M−1} × D dimensional vectors can be evaluated by the distance
between J^{M−1} × D dimensional coefficient vectors. Accordingly, this reduces the calcu-
lation time in the competitive learning process. In this work, we used the normalized
Legendre polynomials as a typical orthonormal system in the square latent space.
Another advantage of using a continuous basis set is that it provides the derivative
of the objective function with respect to the latent variable z. Thus we can find the
best matching point using more efficient algorithms such as the gradient method. This
means that we can specify the winners more precisely while using less computational
time than in the ordinary all-play-all competition.
Suppose that the orthonormal system for mode m is {φ^{(m)}_1(z), …, φ^{(m)}_J(z)}, and let
the matrix representation be Φ^{(m)} = (φ_j(ζ_{k:})) ∈ R^{K×J}. Then Y is obtained by decom-
pressing W by
Y = W × {Φ}. (78)
In this paper, W is called the core tensor of Y. To determine the winners, we also need
the core tensors of the instance manifolds. Letting V^{(m)} be the core tensor of U^{(m)}, the
instance manifolds are obtained by
U^{(m)} = V^{(m)} ×_{−m} {Φ}. (79)
The core tensors of the slice manifolds T^{(m)} are easily calculated by only decompress-
ing the W with respect to mode m. Thus,
T^{(m)} = W ×_m Φ^{(m)}. (80)
Because T^{(m)} and V^{(m)} are compressed representations for all modes except m, they are
(J/K)^{M−1} times smaller than the original Y_{k(m)} and U^{(m)}_{n(m)}. Nevertheless, the distance
between the two maps is preserved by the orthonormal system. Thus,
‖Y_{k(m)} − U^{(m)}_{n(m)}‖² = ‖T^{(m)}_{k(m)} − V^{(m)}_{n(m)}‖². (81)
To obtain W and V, we need to calculate the generalized inverse Φ_G^# for every
iteration and mode. This computational cost could outweigh the benefits of using the
basis functions. However, as described in Sec. 2, the inverse can be approximated by
Φ_P^#, which is just the transpose Φ^T in an orthonormal system. Thus, we do not need to
calculate the inverse matrix at all.
Based on the above discussions, the TSOM algorithm with orthonormal bases is
described below.
E step
Applying (81), the winner is determined without decompressing the core tensors. That
is,
k^{*(m)}_n = arg min_k ‖T^{(m)}_{k(m)} − V^{(m)}_{n(m)}‖². (82)
We can replace the above all-play-all competition with a gradient method if preferred.
In the case of the gradient descent method, it becomes
z^{(m)}_{nl} := z^{(m)}_{nl} − η ⟨W ×_m φ(z^{(m)}_n) − V^{(m)}_{n(m)}, W ×_m φ′_l(z^{(m)}_n)⟩. (83)
After determining the winners, we calculate R^{(m)}, G^{(m)}, and R̃^{(m)} using (52)–(54).
Then, Q̃^{(m)} (the normalized responsibility of the bases) is calculated using
Q̃^{(m)} = Φ^{(m)T} R̃^{(m)}. (84)
Here Φ^{(m)T} is used as the generalized inverse.
M step
In the M step, the core tensors W and {V^{(m)}} are updated according to
W = X × {Q̃} (85)
V^{(m)} = X ×_{−m} {Q̃}. (86)
The only difference between this and the discrete version in (55) and (56) is that R̃^{(m)} is
replaced by Q̃^{(m)}. The binary tree calculation is also effective. The simulation results
(Tables 1 and 2) show that the basis functions are effective in reducing the calculation time.
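As a sketch of the orthonormal basis construction used here, the normalized Legendre polynomials on a 1-dimensional latent space can be generated with NumPy as follows; the function name and the node layout are our assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

# Phi[k, j] = sqrt((2j+1)/2) * P_j(zeta_k): Legendre polynomials normalized to be
# orthonormal on [-1, 1], evaluated at the K node coordinates zeta.
def legendre_basis(zeta, J):
    cols = [np.sqrt((2 * j + 1) / 2.0) * legendre.legval(zeta, [0] * j + [1])
            for j in range(J)]
    return np.stack(cols, axis=1)                  # K x J

zeta = np.linspace(-1.0, 1.0, 20)                  # node coordinates of one mode
Phi = legendre_basis(zeta, 4)
# Q~(m) = Phi.T @ R_tilde_m  (Eq. (84)); the core tensors then follow from Eqs. (85)-(86).
```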
Figure 2: (a) The generative model of the relational data dealt with by TGTM–R2 . Note
that the two 2-dimensional latent spaces are mapped onto the 4-dimensional manifold
in the observation space, which is depicted as a 3-dimensional nonlinear cube. (b) The
graphical model of the generative model.
In real applications, initialization using PCA is also recommended. For relational data,
PCA should be used for each mode, by vectorizing the slices of the data tensor. For
example, a PCA for mode m regards x̂^{(m)}_n = vec(X_{n(m)}) as the nth data vector of mode
m. When the PCA finishes, the initial winning nodes are determined using a principal
axes transformation.
The time scheduling of the neighborhood size is implemented in a similar way. For
example, σ(t) = max[σ_0 (1 − t/τ), σ_∞] or σ(t) = (σ_0 − σ_∞) e^{−t/τ} + σ_∞, where t is the
calculation time, σ0 and σ∞ are the initial and last neighborhood sizes, and τ is the
time constant.
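A minimal sketch of the two schedules above (parameter names are ours):

```python
import numpy as np

def sigma_linear(t, sigma0, sigma_inf, tau):
    # sigma(t) = max[sigma0 (1 - t/tau), sigma_inf]
    return max(sigma0 * (1.0 - t / tau), sigma_inf)

def sigma_exponential(t, sigma0, sigma_inf, tau):
    # sigma(t) = (sigma0 - sigma_inf) exp(-t/tau) + sigma_inf
    return (sigma0 - sigma_inf) * np.exp(-t / tau) + sigma_inf
```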
3.7 TGTM
By applying the probabilistic generative model, it is possible to derive the algorithm
using a fully Bayesian approach, namely, TGTM. In this paper, we have abbreviated
TGTM for M-mode relational data to TGTM–R^M.
We first define the generative model for the relational data dealt with by TGTM–
R^M. Suppose that we have M latent spaces {Z^{(m)}} (m = 1, …, M), each of which is dis-
cretized to K_m nodes with the positional vectors {ζ^{(m)}_k}. The priors of the latent variables
are given by Prob(z^{(m)} = ζ^{(m)}_k) = p^{(m)}_k, which we have assumed to be p^{(m)}_k = 1/K_m. To
obtain the training dataset, the latent variables are generated N_m times, independently
for each mode. The observations are made for all combinations of the latent variables
using
x_{n1…nM:} = y(z^{(1)}_{n1:}, …, z^{(M)}_{nM:}) + ε_{n1…nM:}, (87)
where y is the nonlinear map from the direct product space Z^{(1)} × ··· × Z^{(M)} to the
observation space X = R^D, and ε ∈ R^D is the observation noise generated by p(ε) =
N(ε | 0, β^{−1} I_D). Thus, ∏_m N_m data are observed, and are represented by a tensor
X = (X_{n1…nM d}).
We assume that the map is represented by a set of radial basis functions, such that
Y = W × {Φ}, (88)
where Φ^{(m)} ∈ R^{K_m×J_m} denotes the radial bases of mode m, and W is the core tensor.
The prior of W is
p(W) = N(W | O, L^{−1}) (89)
L = (λ_1 I_{K_1}) ◦ ··· ◦ (λ_M I_{K_M}). (90)
Thus, each component of W obeys the independent Gaussian prior p(W_{j1…jM d}) =
N(W_{j1…jM d} | 0, λ^{−1}). The generative model for M = 2 is shown in Fig. 2 (a) and
(b).
The task of TGTM is to estimate the core tensor W, the latent variables {z^{(m)}_{nm:}}, and
the parameter β from the observed relational data X. The TGTM algorithm can be de-
rived by applying the standard EM algorithm. Here, Y and β are estimated as the MAP
solutions, and the latent variables {z^{(m)}_{nm:}} are estimated as their posteriors. (Note that we
can also easily estimate the posteriors of Y and β by applying the variational Bayesian
method [40].) Applying the variational approximation to this generative model, the
objective function is
F(R, W, β) = −(β/2) E(X, Y(W); {R}) − Σ_k Σ_n R_{kn} ln R_{kn} + ln P(W), (91)
where E(X, Y; {R}) is the expectation error defined by (70), and R^{(m)} is the responsibility
matrix of {z^{(m)}_{nm:}}. In the E step, F is maximized with respect to R^{(m)} using
ln R^{(m)}_{kn} = −(β/2) E^{(−m)}_{kn}(X, Y(W); {R}) + const. (92)
If M = 3, (92) becomes
ln R^{(1)}_{k1 n1} = −(β/2) Σ_{k2,k3} Σ_{n2,n3} R^{(2)}_{k2 n2} R^{(3)}_{k3 n3} ‖y_{k1 k2 k3:} − x_{n1 n2 n3:}‖² + const. (93)
This is the TGTM–R^M algorithm. It is also possible to use the Gaussian process instead
of the RBF bases. If the RBF bases are replaced by the Nadaraya–Watson smoother, and the
latent variables are estimated by maximum likelihood, then this algorithm becomes
the TSOM.
The TGTM algorithm is a Bayesian algorithm, which is its advantage over TSOM.
However, the computational cost of TGTM is much greater than that of TSOM
(Tables 1 and 2). Thus, it is hard to apply TGTM to practical tasks without the development
of a more efficient algorithm. In this paper, the significance of TGTM is to bridge the
gap between neural network-based algorithms and Bayesian algorithms, rather than
to provide a practically useful algorithm. With such a solid theoretical background,
we consider TSOM to be a widely applicable method that can be used with ease and
confidence.
4 Simulation Results
4.1 Artificial datasets
We used some artificial datasets to examine the performance of the proposed algo-
rithms. For the sake of visualization, we synthesized 2-mode relational datasets with
1-dimensional latent spaces using the generative model. The protocol for generating data
points is as follows. First, we generated N_m latent variables {z^{(m)}_1, …, z^{(m)}_{N_m}} using the
uniform distribution in [−1, +1]^L for m = {1, 2}. Then, the observation data x_{n1 n2} were
generated by
x_{n1 n2} = f(z^{(1)}_{n1}, z^{(2)}_{n2}) + ε, (98)
Table 1: Parameters used in the simulation of the artificial dataset, and the calculation results.

                                      Basic TSOM–R2      Legendre TSOM–R2    TGTM–R2 with      TGTM–R2 with
                                      (20 nodes/mode)    with 4 bases/mode   20 bases/mode     4 bases/mode
Number of nodes (K1, K2)              20 × 20            —                   20 × 20           20 × 20
Number of bases (J1, J2)              (20 × 20)          4 × 4               20 × 20           4 × 4
Neighborhood radius (σ0)              2.0                2.0                 2.0               2.0
                    (σ∞)              0.1                0.1                 0.8               0.8
Neighborhood time constant (τ)        50
Number of instances (N1, N2)          100 × 100
Noise amplitude (σnoise)              0.1
RMSE (average of 20 trials)           0.0775±0.0037      0.0867±0.0046       0.0693±0.0027     0.0777±0.0019
Calculation time for 100 loops (s)*   0.45593±0.00083    0.09796±0.00027     15.3294±0.0057    10.9452±0.0038

* Intel Core i7-2600K (3.40 GHz), Visual C++ (single thread)
(Figure 3: 3-D plots of the first artificial dataset and the organized maps; axes x1, x2, x3 and y1, y2, y3.)
Figure 4: Results for the other artificial dataset organized by basic TSOM–R2 . Data
points are indicated by dots. (a) 3-dimensional dataset and the organized map using
TSOM–R2 . The axes represent the three components of the observation space. (b)
1-dimensional dataset and the organized map. Unlike other figures, the two horizontal
axes represent the latent variables, and the observation space is only represented by the
vertical axis.
for N_1 × N_2 combinations of z^{(1)}_{n1} and z^{(2)}_{n2}. Here, ε is Gaussian noise p(ε) = N(ε | 0, σ²_{noise} I).
In the simulations, we used nonlinear maps ranging in [−1, +1], and set the noise am-
plitude to σ_{noise} = 0.1.
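The generation protocol can be sketched as follows; the function name is ours and the example map corresponds to (99).

```python
import numpy as np

# Minimal sketch of the data-generation protocol (Eq. (98)) with 1-dimensional
# latent spaces; f maps (z1, z2) to R^D.
def generate_relational_data(f, N1, N2, D, sigma_noise=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    z1 = rng.uniform(-1.0, 1.0, size=N1)
    z2 = rng.uniform(-1.0, 1.0, size=N2)
    X = np.empty((N1, N2, D))
    for n1 in range(N1):
        for n2 in range(N2):
            X[n1, n2] = f(z1[n1], z2[n2]) + sigma_noise * rng.standard_normal(D)
    return X, z1, z2

# Example: the saddle-shaped map of Eq. (99)
f_saddle = lambda z1, z2: np.array([z1 * np.cos(np.pi / 4) - z2 * np.sin(np.pi / 4),
                                    z1 * np.sin(np.pi / 4) + z2 * np.cos(np.pi / 4),
                                    z1 ** 2 - z2 ** 2])
X, z1, z2 = generate_relational_data(f_saddle, 100, 100, 3)
```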
We examined four variations of TSOM and TGTM: (1) basic TSOM–R2 with 20
discrete nodes per mode, (2) TSOM–R2 with 4 Legendre bases per mode (Legendre
TSOM–R2 ), (3) TGTM–R2 with 20 radial basis function (RBF) bases per mode, and
(4) TGTM–R2 with 4 RBF bases per mode. The parameters used in the simulation are
shown in Table 1. TSOMs were initialized by randomly assigning the winners, and
TGTM was initialized by randomly constructing the map. Empirically, GTM is more
likely to fall into local minima than SOM, especially when it is initialized randomly. In
this work, the radius of the RBF was gradually reduced to avoid local minima, in the
same manner as the neighborhood size of the SOMs. With this modification, TGTM
did not fall into local minima in any of the simulations. In the basic TSOM–R2 , the
latent variables are estimated by the all-play-all competition, whereas they are updated
by the gradient descent method in the Legendre TSOM–R2 .
We iterated over the E and M steps, and reduced the neighborhood size exponen-
tially with the time constant τ = 50. In this experiment, the calculations were moni-
tored until t = 600, but the map converged after a smaller number of iterations. Typi-
cally, at most one hundred iterations are enough for random initialization. The iteration
cycles can be further reduced by using PCA initialization.
In the first artificial dataset, the data points were generated using
f(z_1, z_2) = ( z_1 cos(π/4) − z_2 sin(π/4),  z_1 sin(π/4) + z_2 cos(π/4),  z_1² − z_2² )ᵀ. (99)
The results are shown in Fig. 3 and Table 1. All the algorithms succeeded in capturing
the data distribution, as expected. We repeated the simulations 20 times for each algo-
rithm, randomly changing the initial state and the noise. All the results were consistent
and stable. For this dataset, the RMSEs were more or less equal for all algorithms.
Among the examined algorithms, the fastest was Legendre TSOM–R2. It was ap-
proximately 5 times faster than basic TSOM–R2. In contrast, TGTM–R2 was more
than 100 times slower than the Legendre TSOM–R2. The speed differences increased
when the tasks became larger (Table 2). Therefore, Legendre TSOM–R2 is the best
solution for large-scale relational data analysis.

Table 3: Summary of the sushi and beverage datasets.
                        Sushi dataset           Beverage dataset
Number of modes         2 (user, item)          3 (user, item, context)
Number of instances     5,000 × 10              604 × 14 × 11
Data type               scalar (integer)        scalar (integer)
Data scale              ranking from 1 to 10    grades from 1 to 5
Fig. 4 (a) shows the results using another artificial dataset, generated by
f(z_1, z_2) = ( cos(0.5πz_1 + πz_2),  sin(0.5πz_1 + πz_2),  z_2 )ᵀ. (100)
Ordinary SOM cannot generally learn this ‘Swiss-roll’ type of dataset. Nevertheless,
TSOM succeeded. This is because the relational data contain more information about
the latent variables than the ordinary dataset. This issue is discussed in Sec. 7.
TSOM can estimate the map even if the observed data are scalar. Fig. 4 (b) is an
example of a scalar dataset. Here, the scalar data were generated by
f(z_1, z_2) = sin( (3π/4)(z_1 + z_2) ). (101)
In this case, the two latent spaces degenerate into the one dimensional observation
space. Thus, different latent variable pairs can often generate the same output. Never-
theless, TSOM estimated both the nonlinear map and the latent variables, as shown in
Fig. 4 (b).
2 http://www.kamishima.net/sushi/
3 http://www.brain.kyutech.ac.jp/ furukawa/beverage-e/
Figure 5: Maps for the sushi dataset: (a) the map of users (respondents) and (b) the
map of items (sushi toppings), colored by the marginal U-matrix, where red regions
represent the cluster borders and A–E are the conditioning points for Fig. 6; (c) the map
of items (sushi toppings) colored by the marginal component plane. The red/blue colors
indicate the higher/lower average scores. Fatty tuna is the most preferred topping, and
cucumber roll is the least favorite.
Figure 6: The user maps for the sushi dataset, colored by the conditional component
plane. The conditioning points are labeled A–E in Fig. 5 (b). The red/blue regions in A
correspond to the respondents who like/dislike cucumber rolls, while those in E show
users who prefer the shrimp topping. When the conditioning point continuously moves
from A to E in the sushi map, the user map changes color gradually from A to E.
Figure 7: Maps for the beverage dataset, colored by the marginal U-matrix: (a) users
(respondents); (b) items (beverages); and (c) contexts (situations).
Figure 8: Maps of the beverage dataset colored by the conditional component plane.
Red/blue regions represent the higher/lower scores. The markup balloons are the corre-
spondence overlay. Note that the scores are marginalized with respect to the user mode.
(a) The conditional component map of beverages colored for ‘exercise time’ (‘+’ in the
context map). The beverages in the red region (mineral water and isotonic drink) are
preferred during exercise. The overlaid label ‘exercise time’ is also located between
these beverages. (b) The conditional component map of contexts colored for ‘isotonic
drink’ (‘+’ in the beverage map). The map shows that isotonic drink is preferred in the
situations indicated by red (exercise time, outdoor work time, and outdoor playtime).
The overlaid label ‘isotonic drink’ is located near these situations.
5 Visualization and Exploration in Map Spaces
In this section, we introduce some visualization techniques for TSOM. We only de-
scribe the methods for TSOM, but they are also all available for TGTM.
where d is the component of interest. Fig. 5 (c) shows the marginal component plane
of the sushi map, which visualizes how much each topping is preferred by the respon-
dents.
Similarly, for the marginal U-matrix, each node is colored using the average dis-
tance between two neighboring units, defined as
D_m(k, k′) = (1/K_{−m}) ‖Y_{k(m)} − Y_{k′(m)}‖², (103)
where k and k′ are the indices of two neighboring nodes. Fig. 5 (a) (b) and Fig. 7 are
examples using the marginal U-matrix.
4 http://www.brain.kyutech.ac.jp/ furukawa/tsom-e/
Figure 9: The results of the missing data estimation: (a) estimated manifolds under
different missing ratios; and (b) RMSE of the missing data estimates. The RMSEs
were evaluated for two different original data sizes, 100 × 100 and 1,000 × 1,000. Missing
ratios are also indicated.
27
Pulp Fiction
The Terminator
Aliens
Fargo
The Godfather
Independence Day
Star Trek
Jurassic Park
Men In Black
(a) (b)
Figure 10: The results for the MovieLens dataset: (a) user map (dominant classes of
generations and genders are indicated by the correspondence overlay, 20s M and 20s F
respectively represent men and women in their 20’s); and (b) movie map (movie genres
are indicated by the correspondence overlay, as well as some popular movie titles).
6 TSOM Variations
In this section, we introduce several variations of the TSOM, and example applications.
We do not present the TGTM variations, but they are also possible.
6.1 Estimation of missing data
When some entries of the data tensor are missing, the M step (47) is modified so that only
the observed entries contribute:
Y_{k1 k2 d} = (1/G_{k1 k2 d}) Σ_{n1=1}^{N1} Σ_{n2=1}^{N2} R^{(1)}_{k1 n1} R^{(2)}_{k2 n2} Γ_{n1 n2 d} X_{n1 n2 d}. (104)
Here, G_{k1 k2 d} = Σ_{n1=1}^{N1} Σ_{n2=1}^{N2} R^{(1)}_{k1 n1} R^{(2)}_{k2 n2} Γ_{n1 n2 d}, and Γ_{n1 n2 d} indicates whether X_{n1 n2 d} is miss-
ing. Thus, Γ_{n1 n2 d} = 0 when X_{n1 n2 d} is missing, otherwise Γ_{n1 n2 d} = 1. The instance
manifolds are also calculated in a similar way.
In the E step, we can determine winners in two ways. One way is to use (77), as in
the ordinary case. This method results in a quick winning decision. When the missing
ratio increases, there may only be a few data points for each instance. In such cases,
it is better to directly evaluate (76), because the distance between the instance and the
slice manifolds may be affected by missing data. After the training process, the missing
values are estimated using X̃_{n1 n2 d} = Y_{k^{*(1)}_{n1} k^{*(2)}_{n2} d}.
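The masked M step (104) can be sketched with einsum as follows; the mask convention follows Γ above and the names are ours.

```python
import numpy as np

# Minimal sketch of the missing-data M step (Eq. (104)):
# X (N1 x N2 x D) data, Gamma (N1 x N2 x D) binary mask (1 = observed, 0 = missing),
# R1 (K1 x N1) and R2 (K2 x N2) responsibilities.
def m_step_missing(X, Gamma, R1, R2):
    num = np.einsum('kn,jm,nmd->kjd', R1, R2, Gamma * X)
    den = np.einsum('kn,jm,nmd->kjd', R1, R2, Gamma)     # G_{k1 k2 d}
    return num / np.maximum(den, 1e-12)                  # avoid division by zero
```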
Figure 11: The results for the Enron e-mail dataset. The center map is colored by the
marginal U-matrix, where the cluster borders are indicated by red regions. The num-
bers represent user IDs. The surrounding small maps are the conditional component
planes, representing the sending/receiving e-mail traffic from/by the conditioned user.

We investigated this algorithm using the same artificial dataset as in Fig. 3. We
randomly removed part of the dataset, and then passed it to the TSOM. (At least 2
data points were left for each instance.) In this simulation, we estimated the winning
positions using (76). We repeated this protocol, changing the various missing ratios.
The results are shown in Fig. 9. The TSOM captured the manifold shape, even when
the missing ratio was higher than 90% (Fig. 9 (a)). The RMSE result also shows that
TSOM was robust to missing data (Fig. 9 (b)).
We also applied this algorithm to the MovieLens dataset5 , which contains user–item
rating data for movie titles. The dataset used here is a subset called MovieLens 100k,
which contains 100,000 rating scores on a 5-point scale. The scores were evaluated
by 943 people for 1,682 movie titles, and approximately 94% of the data are missing.
The dataset also contains the information for the respondent attributes and the movie
title genres, but they were not used in this experiment. To evaluate the missing data
estimation, we used the training and test dataset provided by MLcomp6 . The root
mean square error (RMSE) was 0.9533 ± 0.0008. This result is more or less equal to
those of the other methods within the top 10 in MLcomp.
Here, we introduced a simple algorithm for estimating missing data. By applying
the EM algorithm, this algorithm may be integrated into the TGTM, so that the missing
values can be estimated at the same time as the nonlinear map. Such a probabilistic
approach is an important issue for further investigation.
5 http://grouplens.org/datasets/movielens/
6 http://www.mlcomp.org/
6.2 Using side information
In some situations, additional data called side information are also provided. For
example, the MovieLens dataset contains side information on the age, gender, and
occupation as user attributes. In such cases, it is possible to consider the side information
together with the relational data.
Suppose that x^{s.i.(m)}_n is the side information of mode m. Then, the winner of mode
m is determined using both the instance manifold and the side information. Let y^{s.i.(m)}_k
be the kth reference vector of the side information of mode m, and let D^{s.i.(m)} be the
dimension of x^{s.i.(m)}_n. Then, the winner is determined by
k^{*(m)}_n = arg min_k { (1/(K_{−m} D)) ‖Y_{k(m)} − U^{(m)}_{n(m)}‖² + (α^{(m)}/D^{s.i.(m)}) ‖y^{s.i.(m)}_k − x^{s.i.(m)}_n‖² }. (105)
Here, α^{(m)} is a weight parameter that determines how much the side information affects
the winning decision. A typical value for α^{(m)} is 1, so that the relational data and the side
information equally affect the winning decision. In the M step, y^{s.i.(m)}_k is also updated
according to
Y^{s.i.(m)} = R̃^{(m)} X^{s.i.(m)}, (106)
where the rows of X^{s.i.(m)} and Y^{s.i.(m)} are x^{s.i.(m)}_n and y^{s.i.(m)}_k, respectively.
Thus, the TSOM works as an ordinary SOM with respect to the side information.
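The combined winner selection (105) for one instance of mode 1 can be sketched as follows; the array names and shapes are our assumptions.

```python
import numpy as np

# Minimal sketch of Eq. (105) for mode 1: Y (K1 x K2 x D) map, U1n (K2 x D) the
# instance manifold of one instance, Ys (K1 x Ds) side-information reference
# vectors, xs (Ds,) the instance's side information, alpha the weight.
def winner_with_side_info(Y, U1n, Ys, xs, alpha=1.0):
    K_minus, D = Y.shape[1], Y.shape[2]
    rel = ((Y - U1n[None, :, :]) ** 2).sum(axis=(1, 2)) / (K_minus * D)
    side = ((Ys - xs[None, :]) ** 2).sum(axis=1) / Ys.shape[1]
    return int(np.argmin(rel + alpha * side))
```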
We applied this algorithm to the MovieLens dataset. The side information for the
respondents (age, gender, and occupation) and the movie titles (genres such as drama,
comedy, SF, etc.) were considered in the winning decision with α^{(m)} = 1. The obtained
maps are shown in Fig. 10. The user map showed distinct cluster borders, which coin-
cided with users’ generations and genders. Some clusters of genres were also present
in the movie map.
should be determined with respect to both modes. Thus, the winner of each member is
defined by
k^*_n = arg min_k ( ‖Y_{k::} − U^{(1)}_{n::}‖² + ‖Y_{:k:} − U^{(2)}_{:n:}‖² ), (107)
7 https://www.cs.cmu.edu/˜./enron/
Figure 12: Graphical representation of the generative model for SOM2 (TSOM–H2 ).
and ‘children’ in SOM2 , whereas there is no such hierarchy in TSOM. The difference
originates in the type of data structures used by these algorithms. Thus, these two al-
gorithms share the same goal (estimating the nonlinear map from the multiple latent
spaces to the observation space), but use different start points (the data structure of the
given dataset). In this sense, it is better to call SOM2 a ‘tensor SOM for hierarchical
data’ (TSOM-H), and call the proposed algorithm a ‘tensor SOM for relational data’
(TSOM-R). Clarifying the similarities and differences between the two algorithms is
important when comprehensively constructing the family of TSOMs. Hereafter, SOM2
is referred to as TSOM-H and is regarded as a member of the TSOM family.
Figure 13: Comparison of three possible data structures and their corresponding algo-
rithms: (a) relational data with TSOM–R2 ; (b) hierarchical data with TSOM–H2 ; and
(c) coupled data with TSOM–C2 (which becomes ordinary SOM). The 1st row contains
the data structure diagrams, and the 2nd shows examples of the sampled dataset. The
relational data in (a) are aligned in two directions, while the hierarchical data in (b) are
aligned in one direction. No data alignment was applied to the coupled data case in (c).
The 3rd row contains the instance manifolds calculated by TSOM–R2 and TSOM–H2 .
No instance manifold is estimated in TSOM–C2 . The 4th row contains the estimated
nonlinear maps.
Considering the above discussion, the SOM2 algorithm can be rewritten as TSOM–H2,
a variation of TSOM. Although the algorithm described below looks quite different
from the original, they are equivalent. In the E step, the winner is determined according
to
k^{*(1)}_{n1|n2} = arg min_{k1} ‖y_{k1 k^{*(2)}_{n2}:} − x_{n1|n2}‖² (108)
k^{*(2)}_{n2} = arg min_{k2} ‖Y_{:k2:} − U^{(2)}_{:n2:}‖². (109)
(108) and (109) mean that each winner of the parent mode is determined by the dis-
tance between U^{(2)}_{:n2:} (the instance manifold) and Y_{:k2:} (the slice manifolds), whereas
the winner of the child mode is determined by the best matching unit within the best
matching slice manifold of its parent.
(Figure 14: data structure diagrams for (a) TSOM–R3, (b) TSOM–HR2, (c) TSOM–R2H, (d) TSOM–H2R, and (e) TSOM–H3.)
Figure 15: Three examples of data structures with two or more observation spaces.
Note that there are more possible data structures. The data structure in (c) corresponds
to relational data with side information.
After determining the winners, the responsibility matrices and the map are updated using
R^{(1)}_{n2} = H^{(1)} B^{(1)}_{n2} (110)
R^{(2)} = H^{(2)} B^{(2)} (111)
U^{(2)}_{:n2:} = X_{:n2:} ×_1 R̃^{(1)}_{n2} (112)
Y = U^{(2)} ×_2 R̃^{(2)}. (113)
7.3 Data structure diagram
To illustrate the data structures, let us introduce the diagram in Fig. 13 (top row). In this
diagram, each sample set is represented by a circle node and its mode number. When
two or more nodes are integrated into a set of sample combinations, they are packaged
into an oval node. Nodes with no links mean that their samples were
selected independently of other nodes. The symbol ‘×’ represents the direct product
of independent nodes, which produces all possible combinations of two sample sets.
The black thin arrow denotes the hierarchical sampling of the connected nodes, and
the nodes connected by double lines are coupled in the data sampling. Thus, a pair
of instances is always sampled together. The thick white arrow represents the data
observation, which is made for every member of the root node.
The top row of Fig. 13 (a) contains a diagram of a 2-mode relational dataset, dealt
with by TSOM–R2. In this case, the data vectors {x_{n1 n2}} are observed for all members
of the sample set Ω_S = {(ω^{(1)}_{n1}, ω^{(2)}_{n2})}.
Fig. 13 (b) depicts the 2-mode hierarchical dataset case, dealt with by TSOM–
H2. In this case, the data vectors {x_{n1|n2}} are observed for all members of Ω_S =
{(ω^{(1)}_{n1|n2}, ω^{(2)}_{n2})}. This situation appears in transfer or multisystem learning tasks, where
the parent mode corresponds to the system parameter, and the child mode represents
the state variable under the given parameter [46, 47].
In the last situation depicted in Fig. 13 (c), two nodes are tightly coupled in the data
observation. In this case, a pair of instances (ω^{(1)}_n, ω^{(2)}_n) is sampled simultaneously,
and then a data vector x_n is observed from the instance pair. Here, we refer to the
TSOM for this type of dataset as TSOM–C2 (tensor SOM for coupled data). Interestingly,
the TSOM–C2 algorithm is almost the same as the conventional SOM, because we
do not choose the winning position but take the ordinary best matching point in the
entire nonlinear map. This situation corresponds to nonlinear ICA, and several previous
studies have shown that SOM can be applied to this task [48, 49]. It would be worth
investigating TSOM–C from the viewpoint of nonlinear ICA. This is an important issue
that we will consider in the future.
The difficulties of these tasks are also different for the three cases. In case (a),
TSOM–R2 must estimate (N1 + N2 )L latent variables from N1 N2 D observed data,
whereas TSOM–H2 in case (b) must estimate (N1 N2 + N2 )L unknowns from N1 N2 D
known values. Thus, the TSOM–H2 task is more difficult than the TSOM–R2 task. The most dif-
ficult is case (c), in which TSOM–C2 must estimate 2NL latent variables from ND
observed data. This is why TSOM–R2 captured the data distribution better than ordi-
nary SOM (i.e., TSOM–C2 ).
Fig. 14 shows five possible data structures for 3-mode data and the corresponding algorithms:
(a) TSOM–R3, (b) TSOM–HR2, (c) TSOM–R2H, (d) TSOM–H2R, and (e) TSOM–H3 (SOM3). All these algorithms
can be derived by combining TSOM–R and TSOM–H. For example, the TSOM–HR2
algorithm (Fig. 14 (b)) can be described as follows.
E step
k^{*(1)}_{n1|n3} = arg min_{k1} ‖Y_{k1 : k^{*(3)}_{n3} :} − U^{(1)}_{n1::|n3}‖² (114)
k^{*(2)}_{n2|n3} = arg min_{k2} ‖Y_{: k2 k^{*(3)}_{n3} :} − U^{(2)}_{:n2:|n3}‖² (115)
k^{*(3)}_{n3} = arg min_{k3} ‖Y_{::k3:} − U^{(3)}_{::n3:}‖² (116)
M step
U^{(1)}_{n1::|n3} = X_{::n3:} ×_2 R̃^{(2)}_{n3} (117)
U^{(2)}_{:n2:|n3} = X_{::n3:} ×_1 R̃^{(1)}_{n3} (118)
U^{(3)}_{n3} = X_{::n3:} ×_1 R̃^{(1)}_{n3} ×_2 R̃^{(2)}_{n3} (119)
Y = U^{(3)} ×_3 R̃^{(3)} (120)
Thus, two types of E and M steps are combined with respect to the data structure. The
algorithms for the other cases can be easily obtained in a similar way.
In some cases, there are two or more observation spaces. Some examples are pre-
sented in Fig. 15. In this figure, the data structure (a) is the case in which two different
surveys are made for the same group of respondents. The data structure of the Movie-
Lens with side information is represented by (c). When the number of modes increases,
the number of possible variations increases exponentially. Nevertheless, the TSOM
(and the TGTM) family can adapt to all data structures. Therefore, these algorithms
can be unified using a comprehensive theoretical system of topographic mapping fam-
ilies for tensorial data.
more than others. For example, if a teacher tries to analyze students’ abilities using data
consisting of many physical indices and few academic indices, then academic ability is
almost ignored as noise. This metric problem is generally unavoidable in unsupervised
learning methods. Even though TSOM is also affected by this problem, TSOM can
reduce these undesired effects. By regarding the dataset as a 2-mode relational dataset,
TSOM generates two maps, namely a map of target objects and a map of data compo-
nents. If two components are identical, then those components are located in the same
position in the component map. Consequently, their influence in determining winners
is weakened, and thus TSOM generates a more moderate result. Therefore, TSOM is
recommended for use in conventional tasks where the original SOM has been applied.
The second perspective is relevant to the origins of the SOM. The SOM was origi-
nally a neural network model of a brain map of a visual field, and turned out to be useful
as a dimension reduction tool for practical applications [2]. Early studies investigated
both the self-organization of topology preserving neural projections in visual fields
(e.g., [50]), and the self-organization of visual features (e.g., [51]). Though those stud-
ies appear contiguous, there is a gap between them. The former is the self-organization
of the order of visual signal lines, whereas the latter orders the features of visual sig-
nals. Kohonen referred to these as type 1 and type 2 self-organization respectively, and
he stated that his SOM is categorized as type 2 [52]. From the viewpoint of the TSOM,
type 1 corresponds to a map of components, while type 2 is a map of objects. In this
sense, TSOM unifies both types of self-organization.
The third perspective is relevant to the axes of the topographic map. Usually the
meaning of the axes is of little concern. However, by regarding the conventional 2-
dimensional square map as a product space of two 1-dimensional latent spaces, the
SOM becomes a TSOM for coupled data, that is, TSOM–C2 . As has already been
discussed above, this is related to nonlinear ICA. This is an important issue for TSOMs
that should be investigated further.
8 Conclusions
In this paper, we presented two algorithms for topographic mappings of tensorial data,
TSOM and TGTM. TGTM provides the theoretical background of the algorithms, and
TSOM is useful for practical applications. Among the variations, the TSOM with
orthonormal bases was computationally fast and had a high resolution. Therefore, it is
the best choice for large-scale tasks. We also presented various TSOM variations and
visualization methods, which further increase the applicability of TSOM to real data
analysis.
Theoretically, TSOM and TGTM can be derived from the generative model by
applying the EM algorithm. SOM2 (SOM of SOMs) is a sibling of TSOM that is
adapted to a hierarchical data structure. We have shown that SOM2 can be unified
into the TSOM family. Therefore, this work presented some nonlinear tensor analysis
methods, and attempted to establish a comprehensive theoretical system for the family
of algorithms.
Acknowledgement
This work was supported by JSPS KAKENHI Grant Numbers 23500280, 22120510.
References
[1] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin Heidelberg, 2001.
[12] F. Miwakeichi, E. Martínez-Montes, P. A. Valdés-Sosa, N. Nishiyama,
H. Mizuhara, Y. Yamaguchi, Decomposing EEG data into space-time-frequency
components using parallel factor analysis, NeuroImage 22 (3) (2004) 1035–1045.
[13] J. Li, L. Zhang, D. Tao, H. Sun, Q. Zhao, A prior neurophysiologic knowledge
free Tensor-Based scheme for single trial EEG classification, Neural Systems and
Rehabilitation Engineering, IEEE Transactions on 17 (2) (2009) 107–115.
[14] M. A. O. Vasilescu, D. Terzopoulos, Multilinear image analysis for facial recog-
nition, in: ICPR (2), 2002, pp. 511–514. doi:10.1109/ICPR.2002.1048350.
[15] J. Yang, D. Zhang, A. F. Frangi, J.-y. Yang, Two-dimensional PCA:
A new approach to appearance-based face representation and recogni-
tion, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
doi:10.1109/TPAMI.2004.1261097.
[16] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, H. Zhang, Multilinear discriminant
analysis for face recognition, IEEE Trans. on Image Processing 16 (1) (2007)
212–220.
[17] N. E. Helwig, S. Hong, J. D. Polk, Parallel factor analysis of gait waveform data:
A multimode extension of principal component analysis, Human Movement Sci-
ence 31 (3) (2012) 630 – 648.
[18] T. G. Kolda, B. W. Bader, Tensor decompositions and applications, SIAM RE-
VIEW 51 (3) (2009) 455–500.
[19] E. Acar, B. Yener, Unsupervised multiway data analysis: A literature survey,
IEEE Transactions on Knowledge and Data Engineering 21 (2008) 6–20.
[20] A. Cichocki, R. Zdunek, A.-H. Phan, S. Amari, Nonnegative Matrix and Tensor
Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind
Source Separation, John Wiley & Sons, Ltd, 2009.
[21] L. D. Lathauwer, B. D. Moor, J. Vandewalle, A multilinear singular value decom-
position, SIAM J. Matrix Anal. Appl 21 (2000) 1253–1278.
[22] H. Lu, K. N. Plataniotis, A. N. Venetsanopoulos, MPCA: Multilinear principal
component analysis of tensor objects., IEEE Transactions on Neural Networks
19 (1) (2008) 18–39.
[23] H. Lu, K. N. Plataniotis, A. N. Venetsanopoulos, A survey of multilinear subspace
learning for tensor data, Pattern Recognition 44 (7) (2011) 1540 – 1551.
[24] C. F. Beckmann, S. M. Smith, Tensorial extensions of independent component
analysis for multisubject FMRI analysis, Neuroimage 25 (1) (2005) 294–311.
[25] M. A. O. Vasilescu, D. Terzopoulos, Multilinear independent components analy-
sis, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, IEEE Computer Society, 2005, pp. 547–553.
[26] M. Welling, M. Weber, Positive tensor factorization., Pattern Recognition Letters
22 (12) (2001) 1255–1261.
[31] C.-S. Lee, A. Elgammal, Modeling view and posture manifolds for tracking, IEEE
International Conference on Computer Vision (2007) 1–8.
[32] X. Gao, C. Tian, Multi-view face recognition based on tensor subspace analysis
and view manifold modeling, Neurocomput. 72 (16-18) (2009) 3742–3750.
[33] T. Furukawa, SOM of SOMs: Self-organizing map which maps a group of self-
organizing maps, in: W. Duch, J. Kacprzyk, E. Oja, S. Zadrozny (Eds.), Artificial
Neural Networks: Biological Inspirations, Vol. 3696 of Lecture Notes in Com-
puter Science, Springer, 2005, pp. 391–396.
[38] Y. Cheng, Convergence and ordering of Kohonen’s batch map, Neural Comput.
9 (8) (1997) 1667–1676.
[41] T. Kamishima, H. Kazawa, S. Akaho, A survey and empirical comparison of ob-
ject ranking methods, in: J. Fürnkranz, E. Hüllermeier (Eds.), Preference Learn-
ing, Springer, 2010, pp. 181–201.
[42] A. Ultsch, H. P. Siemon, Kohonen’s self organizing feature maps for exploratory
data analysis., in: Proc. INNC’90, Int. Neural Network Conf., 1990, pp. 305–308.
[43] P. Stefanovic, O. Kurasova, Visual analysis of self-organzing maps, Nonlinear
Analysis: Modelling and Control 16 (4) (2011) 488–504.
[44] B. Klimt, Y. Yang, The Enron corpus: A new dataset for email classification
research, in: Machine Learning: ECML 2004, 15th European Conference on
Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, 2004, pp.
217–226.
[45] A. McCallum, X. Wang, A. Corrada-Emmanuel, Topic and role discovery in so-
cial networks with experiments on enron and academic email, J. Artif. Int. Res.
30 (1) (2007) 249–272.
[51] C. von der Malsburg, Self-organization of orientation sensitive cells in the striate
cortex, Kybernetik 14 (2) (1973) 85–100. doi:10.1007/BF00288907.