

Tensor SOM and tensor GTM:
Nonlinear tensor analysis by topographic
mappings ∗
Tohru Iwasaki Tetsuo Furukawa†
Department of Human Intelligence Systems, Kyushu Institute of Technology, Japan

Abstract
In this paper, we propose nonlinear tensor analysis methods: the tensor self-
organizing map (TSOM) and the tensor generative topographic mapping (TGTM).
TSOM is a straightforward extension of the self-organizing map from high-dimensional
data to tensorial data, and TGTM is an extension of the generative topographic
map, which provides a theoretical background for TSOM using a probabilistic
generative model. These methods are useful tools for analyzing and visualizing
tensorial data, especially multimodal relational data. For given n-mode relational
data, TSOM and TGTM can simultaneously organize a set of n topographic maps.
Furthermore, they can be used to explore the tensorial data space by interactively
visualizing the relationships between modes. We present the TSOM algorithm and
a theoretical description from the viewpoint of TGTM. Various TSOM variations
and visualization techniques are also described, along with some applications to
real relational datasets. Additionally, we attempt to build a comprehensive de-
scription of the TSOM family by adapting various data structures.
Keywords:
self-organizing map, generative topographic map, tensor decomposition, relational data

∗ This is the accepted manuscript version of the journal article in Neural Networks 77 (2016) 107–125,

which is made available for scholarly purposes only, in accordance with the journal’s author permissions.
The final publication is available at http://www.sciencedirect.com/science/article/pii/S0893608016000149.
DOI: http://dx.doi.org/10.1016/j.neunet.2016.01.013
† Corresponding author. E-mail: [email protected]

1 Introduction
Topographic mappings are a class of dimension reduction methods that project high-
dimensional data into a lower-dimensional space while preserving the topological struc-
ture. Representatives of this class include the self-organizing map (SOM) [1, 2] and
the generative topographic mapping (GTM) [3, 4]. The aim of this work is to extend
these methods for tensorial data, namely, to the tensor SOM (TSOM) and the tensor
GTM (TGTM), which can be used to visualize multimodal relational data.
A typical application of TSOM/TGTM is the interpretation of user–item rating
data, which contain product ratings for an online shop as evaluated by the users [5,
6, 7]. If the dataset consists of D-dimensional rating scores for J items evaluated by
I users, then the data can be represented as an I × J × D-dimensional tensor. In such
a case, we are interested in analyzing the customers, the items, and the relationships
between them. These kinds of tensorial datasets are common in many fields. An-
other typical example is SNS or e-mail message analysis, which can be modeled by
a tensor of (user)×(keyword)×(time) [8, 9, 10, 11]. When recording a multi-channel
electroencephalogram, the power of wavelet filters can be represented by a tensor of
(channel)×(frequency)×(time) [12, 13]. Another possible application is image recog-
nition for faces, postures, or gaits, because these images can be decomposed into
(person)×(posture)×(emotional expression) [14, 15, 16, 17].
Representative methods for tensorial data analysis constitute a group of algorithms
called tensor factorizations or tensor decompositions [18, 19, 20]. Tensor decompo-
sitions such as Tucker and PARAFAC involve generalizations of matrix decomposi-
tions, which decompose the given tensor into a product of lower-order matrices or ten-
sors. Singular value decomposition (SVD) [21], principal component analysis (PCA)
[22, 23], independent component analysis (ICA) [24, 25], and nonnegative matrix fac-
torization (NMF) [26, 27] have all been extended to cope with tensorial data. The aim
of this work is to add nonlinear tensor analysis tools to this group by extending the
SOM and the GTM.
To analyze a multimodal relational dataset, we need to perform two different types
of analysis, namely intra-mode and inter-mode analyses. For the intra-mode analysis,
a low-dimensional representation is required for every mode. In the case of user–
item ratings, we need two topographic maps, one for users and one for items. The
user map indicates how similar or different each user's preferences are to those of other
users, whereas the item map visualizes how similarly or differently each item is preferred
by users compared with other items. For the inter-mode analysis,
we need to visualize the relations between different modes. At times, for example,
we need to analyze user preferences by focusing on an item group of interest, and at
other times we need to analyze preferred items by focusing on a user group of interest.
For these purposes, we assume that the inherent property of each user and each item
is individually represented by a low-dimensional latent variable, and that the rating
score is determined by a nonlinear function of the latent variables. This latent variable
model meets the framework of the GTM, a Bayesian extension of the SOM. This is the
motivation for developing tensorial extensions of topographic mappings.
Another motivation of this work is to develop a simple, fast and flexible analysis
tool for tensorial data. We intend for people to be able to use it easily in practical

tasks, and for this reason we chose the SOM. The SOM is a popular neural network
architecture with a simple algorithm. By a straightforward extension, the advantages
of SOMs are inherited by TSOMs. In fact, the TSOM algorithm can be programmed
with matrix multiplications only, and does not require either inverses or eigenvalues
to be found. This has the advantage not only of simplicity, but also of computational
cost, which is usually expensive in practical tensor data analysis. In addition, tensor
data sometimes has a complex structure, and our TSOM can easily be adapted to such
cases by systematically combining SOMs like building-blocks (Secs. 6 and 7). This is
another advantage of using the SOM.
The remainder of this paper is organized as follows. Sec. 2 presents the theoretical
background of this work and introduces some related research. Sec. 3 describes the ba-
sic TSOM and TGTM algorithms. Sec. 4 presents simulation results using artificial and
real datasets, and Sec. 5 introduces the visualization techniques. Sec. 6 describes some
extensions of the TSOM. In Sec. 7, we discuss the TSOM family and the relationship
between the original SOM and TSOM. Sec. 8 presents our conclusions.

2 Theoretical Preparations
2.1 Notation of vectors, matrices and tensors
An M-dimensional array of scalars is referred to as a tensor of order M, each dimension
of which is referred to as a mode [18]. In the case of user–item rating data, the 1st mode
corresponds to the user, and the 2nd corresponds to the item.
Scalars are all in R and are indicated by italics, e.g., x, y. The exceptions are
d, i, j, k, l, m, n and their upper cases, which are all in N. A lower case font is used for
indices of vector or tensor components, and capital letters are used to represent their
upper bound. Thus, i ∈ {1, . . . , I}. Vectors are denoted in bold, lower case font (e.g.,
x, y), and are column vectors unless otherwise specified. Matrices are denoted in bold,
upper case font (e.g., X, Y), and higher-order tensors are denoted by an underscore
(e.g., X, Y). The (i, j) component of a matrix A is denoted by A_{ij}. Similarly, A_{ijk} is
the (i, j, k) component of a tensor A of order 3. For a tensor of order M, A_{k_1 k_2 ... k_M} is also
denoted by A_k, where k = (k_1, . . . , k_M).
A subarray is denoted by a colon expression. x_{i:} and x_{:j} are the vectors consisting
of the ith row and jth column components of matrix X, respectively. Similarly, x_{ij:} is a
vector cut out of tensor X along the 3rd mode. For a tensor of order M, subtensors of
order (M − 1) are called slices. For example, A_{::i::} is the ith slice of mode 3. To describe
general cases, a slice of A is also denoted by A_{i(m)}, which means the ith slice of mode
m. Thus, A_{::i::} ≡ A_{i(3)}.
The m-mode tensor–matrix product is denoted by ×_m. For example, the product of
X ∈ R^{K_1 ×···× K_M} and A ∈ R^{J × K_m} becomes Y = X ×_m A, each component of which is given
by

Y_{k_1 ... k_{m-1} j k_{m+1} ... k_M} = \sum_{k_m=1}^{K_m} X_{k_1 ... k_m ... k_M} A_{j k_m}.    (1)

Note that Y is a tensor of order M, the size of which is K_1 ×···× K_{m-1} × J × K_{m+1} ×···× K_M.
When M matrices {A} = {A^{(1)}, . . . , A^{(M)}} are given, the M-multiple product of X and
{A} is

Y = X ×_1 A^{(1)} ×_2 · · · ×_M A^{(M)} = X × {A}.    (2)

X ×_{−m} {A} denotes the (M − 1)-multiple product of X and {A^{(1)}, . . . , A^{(M)}} except A^{(m)}.
Similarly, the multiple product of X and a subset of {A} is denoted by X ×_{M′} {A}, where
M′ is a set of the indices of the modes. For example, X ×_2 A^{(2)} ×_4 A^{(4)} ≡ X ×_{{2,4}} {A}.¹
The outer product of X and Y is denoted by X ◦ Y. For X ∈ R^{K_1 ×···× K_M} and Y ∈
R^{J_1 ×···× J_N}, Z = X ◦ Y becomes Z ∈ R^{K_1 ×···× K_M × J_1 ×···× J_N}, where Z_{kj} = X_k Y_j. The inner
product is denoted by ⟨X, Y⟩, defined by the product sum of all components. Thus
⟨X, Y⟩ ≜ \sum_k X_k Y_k. ∥X∥ is the Euclidean norm of X, defined by ∥X∥ ≜ \sqrt{\sum_k X_k^2} =
\sqrt{⟨X, X⟩}.
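For readers who prefer code, a minimal NumPy sketch of the m-mode tensor–matrix product of Eq. (1) might look as follows; the function name and example sizes are ours, not the paper's.

```python
# A minimal sketch of the m-mode tensor-matrix product Y = X x_m A (Eq. (1)).
import numpy as np

def mode_product(X, A, m):
    """Contract mode m of tensor X with a matrix A of shape (J, K_m)."""
    # tensordot contracts A's second axis with X's m-th axis; the new J-sized
    # axis is then moved back to position m so the mode order is preserved.
    Y = np.tensordot(A, X, axes=(1, m))
    return np.moveaxis(Y, 0, m)

# Example: X is K1 x K2 x K3, A maps mode 2 (size K2 = 5) to size J = 3.
X = np.random.randn(4, 5, 6)
A = np.random.randn(3, 5)
Y = mode_product(X, A, 1)        # resulting shape: (4, 3, 6)
```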

2.2 Tensor analysis methods


A representative tensor decomposition was proposed by Tucker [28]. It decomposes a
tensor into the products of a core tensor and a set of matrices [18, 19]. For a tensor of
order 3, the Tucker decomposition of X is

X ≃ A ×_1 U^{(1)} ×_2 U^{(2)} ×_3 U^{(3)}    (3)
  = A × {U}.    (4)

Here, A is called the core tensor. This equation means that

X_{k_1 k_2 k_3} ≃ \sum_{j_1=1}^{J_1} \sum_{j_2=1}^{J_2} \sum_{j_3=1}^{J_3} A_{j_1 j_2 j_3} U^{(1)}_{k_1 j_1} U^{(2)}_{k_2 j_2} U^{(3)}_{k_3 j_3}.    (5)

This type of tensor decomposition is also called Tucker3. If X is decomposed into a
core tensor and two matrices, as in X ≃ A ×_1 U^{(1)} ×_2 U^{(2)}, it is called Tucker2.
TSOM also provides a Tucker2-like decomposition. Another representative method is
PARAFAC. It decomposes a tensor into a sum of rank-1 tensors, as in X_{k_1 k_2 k_3} ≃
\sum_{j=1}^{J} λ_j R^{(1)}_{k_1 j} R^{(2)}_{k_2 j} R^{(3)}_{k_3 j}.
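As an illustration only (not the authors' implementation), the Tucker3 reconstruction of Eq. (5) and the PARAFAC form can be written compactly with einsum; all sizes below are hypothetical.

```python
# Illustrative reconstruction of a third-order tensor from a Tucker3 core and
# factor matrices (Eq. (5)), and the PARAFAC special case with a diagonal core.
import numpy as np

K1, K2, K3 = 6, 5, 4        # data sizes (hypothetical)
J1, J2, J3 = 3, 3, 2        # core sizes (hypothetical)

A_core = np.random.randn(J1, J2, J3)
U1, U2, U3 = (np.random.randn(K1, J1), np.random.randn(K2, J2), np.random.randn(K3, J3))

# Tucker3: X_{k1 k2 k3} ~= sum_j A_{j1 j2 j3} U1_{k1 j1} U2_{k2 j2} U3_{k3 j3}
X_tucker = np.einsum('abc,ia,jb,kc->ijk', A_core, U1, U2, U3)

# PARAFAC: X_{k1 k2 k3} ~= sum_j lambda_j R1_{k1 j} R2_{k2 j} R3_{k3 j}
lam = np.random.randn(J1)
R1, R2, R3 = (np.random.randn(K1, J1), np.random.randn(K2, J1), np.random.randn(K3, J1))
X_parafac = np.einsum('a,ia,ja,ka->ijk', lam, R1, R2, R3)
```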
Orthodox approaches for nonlinear tensorial data use the kernel method [29, 30].
An alternative approach is to use manifold learning methods [31, 32]. The method
most related to our study is SOM2 , which is also referred to as ‘SOM of SOMs’ [33, 34].
SOM2 visualizes the relationships between a set of datasets. To achieve this, SOM2
has a hierarchical architecture; the lower level generates a set of models for the given
datasets, and the higher level generates a map of lower models. Theoretically, SOM2
models a set of datasets using a geometrical structure called a fiber bundle, which is
also represented by a tensor. The essential difference between TSOM and SOM2 is in
their data structures. From our viewpoint, both algorithms can be unified into the same
family of algorithms, which can be adapted to various data structures. We discuss this
issue in Sec. 7.
1 In this paper, X and Y often have an extra (M + 1)-th mode that is not multiplied by a matrix, but these

notations are used in the same manner.

2.3 Self-organizing map: SOM
We first briefly summarize SOM as an introduction to TSOM. Suppose that Ω is the
universe of the objects of interest, and a data vector x(ω) ∈ X is observed from a
sample point ω ∈ Ω. Here, X is the observation space, which is assumed to be X = RD
in this paper. Let ΩS = {ω1 , . . . , ωN } be a sample set from the universe Ω, and xn:
be the data vector observed from ωn . Thus, X = (xn: ) becomes an N × D data matrix.
From the viewpoint of a generative model, xn: is assumed to be generated from the
latent variable zn: ∈ Z and a nonlinear smooth map f (z), so that xn: = f (zn: ) + ε.
Here, ε is isotropic Gaussian noise. The task of the SOM is to estimate the latent
variables (zn: ) and the nonlinear map f (z), so that they represent the observed data by
xn: ≃ f (zn: ). As a result, we obtain a topographic map of the objects in the latent
space Z. Usually Z is assumed to be a closed area in a low dimensional space RL ,
typically the 2-dimensional square space. In this paper, the prior of z is the uniform
distribution in Z = [−1, 1]L ⊂ RL (L = 1 or 2), but it is easily generalized to other
cases.
To represent the nonlinear map f (z), we discretize the latent space Z into K regular
nodes. Let ζ k: be the positional vector of the kth node in Z, and yk: ≜ f (ζ k: ). Then
the matrix Y = (yk: ) represents the entire map. Because yk: indicates the corresponding
point in X for the kth node in Z, it is often referred to as the ‘reference vector’.
The SOM algorithm is an expectation-maximization (EM) algorithm in a broad
sense, in which we iterate over the expectation (E) and maximization (M) steps while
reducing the neighborhood area until the map converges [35, 36]. Considering the
extension to tensor cases, let us summarize the SOM algorithm using matrix notation.
In the E step, the maximum a posteriori node is the winning node for each data
vector. That is,

k_n^* = \arg\min_k ∥x_{n:} − y_{k:}∥^2,    (6)
B_{kn} = δ(k, k_n^*),    (7)

where B = (B_{kn}) ∈ R^{K×N} is the winning matrix, and δ is Kronecker’s delta. Let
H ∈ R^{K×K} be the neighborhood matrix given by

H_{kk′} = \exp\left[ −\frac{1}{2σ^2(t)} ∥ζ_{k:} − ζ_{k′:}∥^2 \right].    (8)

σ(t) is the neighborhood size, which is gradually reduced with learning time t. This
neighborhood shrinkage is necessary for avoiding local minima, as in simulated an-
nealing. Using the winning matrix B and the neighborhood matrix H, the responsibility
matrix R ∈ R^{K×N} is determined by

R = HB.    (9)
The following two matrices are also calculated for the M step:

G = \mathrm{diag}\left( \sum_{n=1}^{N} R_{kn} \right),    (10)
\tilde{R} = G^{-1} R,    (11)
where G ∈ RK×K is the diagonal matrix, which represents the sum of the responsibility
of each node, and R̃ ∈ RK×N is the responsibility normalized by G.
In the M step, the map Y is updated so that the expectation of the square error is
minimized using

y_{k:} = \frac{1}{G_k} \sum_{n=1}^{N} R_{kn} x_{n:}.    (12)

In matrix notation, (12) is

Y = \tilde{R} X.    (13)

Using the tensor–matrix product notation, (13) can be written as

Y = X ×_1 \tilde{R},    (14)
where the data matrix X and the map matrix Y are regarded as tensors X and Y of order
2. We iterate over these steps, reducing the neighborhood size until the map converges.
Many previous studies have shown that the objective function of the SOM is

F = −\frac{1}{2N} \sum_{n=1}^{N} \sum_{k=1}^{K} H_{k k_n^*} ∥x_{n:} − y_{k:}∥^2    (15)

[35, 36, 37, 38, 39]. We can easily derive the M step (13) using this objective function,
and slightly modify the E step (6) to

k_n^* = \arg\min_{k′} \sum_{k=1}^{K} H_{k k′} ∥x_{n:} − y_{k:}∥^2.    (16)

The original SOM algorithm (6) can be regarded as an approximation of (16), because
they are almost equal when the neighborhood is sufficiently small. Note that F is a
function of σ, which gradually changes throughout the learning process.
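A rough NumPy sketch of one batch-SOM iteration, Eqs. (6)–(14), is given below; it assumes a 1-dimensional latent space, uses the approximate winner rule (6), and all sizes and the schedule values are placeholders.

```python
# A rough sketch of one batch-SOM iteration (Eqs. (6)-(14)); hypothetical sizes.
import numpy as np

N, D, K = 200, 3, 20
X = np.random.randn(N, D)                        # data matrix (N x D)
zeta = np.linspace(-1, 1, K).reshape(K, 1)       # node positions in a 1-D latent space
Y = np.random.randn(K, D)                        # reference vectors

def som_step(X, Y, zeta, sigma):
    # E step: winners and winning matrix B (Eqs. (6), (7))
    dist2 = ((X[None, :, :] - Y[:, None, :]) ** 2).sum(axis=2)   # K x N
    winners = dist2.argmin(axis=0)
    B = np.zeros((len(Y), len(X)))
    B[winners, np.arange(len(X))] = 1.0
    # Neighborhood matrix H and responsibilities R = HB (Eqs. (8), (9))
    H = np.exp(-((zeta - zeta.T) ** 2) / (2 * sigma ** 2))
    R = H @ B
    # Normalization G and R~, then the M step Y = R~ X (Eqs. (10)-(13))
    G = R.sum(axis=1, keepdims=True) + 1e-12
    return (R / G) @ X

for t in range(100):
    sigma = max(2.0 * (1 - t / 50), 0.1)         # shrinking neighborhood (placeholder schedule)
    Y = som_step(X, Y, zeta, sigma)
```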

2.4 SOM with basis functions


It is also possible to represent the nonlinear map f (z) by a linear combination of basis
functions, instead of a discrete number of reference vectors. Basis functions have the
advantage of allowing us to deal with continuous, differentiable maps with arbitrary
resolutions. Now, suppose that {φ1 (z), . . . , φ J (z)} is an appropriate basis set defined on
Z, and the nonlinear map f (z) can be represented by

J
f (z) ≃ w j φ j (z) = WT φ(z) (17)
j=1

with sufficient accuracy. In this case, W is updated using


W = (ΦT GΦ)−1 ΦT RX
= ΦG# R̃X, (18)

so that it minimizes the expectation error. Here, Φ ≜ (φ_j(ζ_{k:})) ∈ R^{K×J} and Φ_G^{\#} is the
generalized inverse of Φ with the weight G. Thus, Φ_G^{\#} ≜ (Φ^T G Φ)^{-1} Φ^T G. Defining
the J × N matrix \tilde{Q} by

\tilde{Q} ≜ Φ_G^{\#} \tilde{R} = (Φ^T G Φ)^{-1} Φ^T R,    (19)

(18) becomes

W = \tilde{Q} X.    (20)

The difference between (13) and (20) is that Y and R̃ are replaced by W and Q̃.
To update W using (18), Φ_G^{\#} should be calculated at every iteration. Because G_k is
the sum of the responsibilities, G_k is more or less equal to N P_k, where P_k is the prior
of the kth node. Therefore, Φ_G^{\#} can be approximated by Φ_P^{\#}, which we can calculate
in advance. In the case of the uniform prior, Φ_P^{\#} becomes the ordinary Moore-Penrose
inverse (Φ^T Φ)^{-1} Φ^T.
Another advantage of using basis functions is that gradient methods can be used for
the E step. In the case of the gradient descent method, the latent variable z_{n:} = (z_{nl})^T ∈
R^L is updated by

z_{nl} := z_{nl} − η ⟨ W^T φ(z_{n:}) − x_{n:}, W^T φ′_l(z_{n:}) ⟩,    (21)

where φ′_l(z) ≜ ∂φ(z)/∂z_l. In this case, R is calculated by

R_{kn} = \exp\left[ −\frac{1}{2σ^2(t)} ∥ζ_{k:} − z_{n:}∥^2 \right].    (22)

Note that a single step of the gradient descent method is enough for every E step,
because the convergence speed is limited by the neighborhood size scheduling. Thus
there is no need to use other faster methods for this purpose.

2.5 Generative topographic mapping: GTM


The generative topographic mapping (GTM) is a SOM-like algorithm, which is derived
from a probabilistic generative model [3, 4, 40]. Thus GTM can be regarded as a
theoretical model for SOM. The GTM algorithm can be described by the EM algorithm
(or the variational Bayesian method). In the E step, the responsibility is calculated
using

R_{kn} = \frac{\exp\left[ −\frac{β}{2} ∥x_{n:} − y_{k:}∥^2 \right]}{\sum_{k′=1}^{K} \exp\left[ −\frac{β}{2} ∥x_{n:} − y_{k′:}∥^2 \right]}.    (23)

Unlike SOM, GTM does not have a neighborhood relationship in the E step. Instead,
GTM introduces other assumptions in the M step, to obtain a smooth continuous map.
Thus, the original GTM uses Gaussian radial basis functions [3], and another type of

GTM represents the map using a Gaussian stochastic process [4, 40]. In the former
case, the M step is

W = \left( Φ^T G Φ + \frac{λ}{β} I \right)^{-1} Φ^T R X.    (24)

Here, Φ is the matrix of the radial bases and the term \frac{λ}{β} I represents the regularization.
Letting \tilde{Q} ≜ \left( Φ^T G Φ + \frac{λ}{β} I \right)^{-1} Φ^T R, we can represent the M step by W = \tilde{Q} X, as in
(20). In the latter case, the M step is

Y = \left( G + \frac{λ}{β} H^{-1} \right)^{-1} R X,    (25)

where the neighborhood matrix H is used as the correlation matrix for the Gaussian
process. Note that both algorithms are equivalent if the radial bases satisfy ΦΦT = H.

3 Basic TSOM
3.1 Definition of the task
To clarify the task of the TSOM, let us revisit the example of the user–item rating. In
this case, the objects of interest are the set of users Ω(user) , and the set of items for
sale Ω(item) . Because each rating score x ∈ RD is determined by a pair of instances
(ω(user) , ω(item) ), we can call this type of data 2-mode relational data. The data tensor
X is of order 3, and its modes correspond to the users, items, and rating components.
Suppose that the last mode is not the target of the tensor analysis, similar to Tucker2.
(If necessary, the data can be regarded as 3-mode relational data, which can be also
dealt with by a TSOM.)
The aim of the TSOM is to visualize the relationships between objects within each
mode (i.e., between users and between items), and simultaneously visualize the rela-
tionships between two modes (i.e., between users and items). In this paper, the TSOM
for 2-mode relational data is abbreviated to TSOM–R2 .
This situation can be generalized as follows. Suppose that the set of object uni-
verses to be analyzed is {Ω^{(1)}, . . . , Ω^{(M)}}, and Ω_S^{(m)} = {ω_1^{(m)}, . . . , ω_{N_m}^{(m)}} is the sample set
of Ω^{(m)}, consisting of N_m instances. We observe the D-dimensional data x_{n_1 ... n_M :} =
x(ω_{n_1}^{(1)}, . . . , ω_{n_M}^{(M)}) for all members of the direct product set Ω_S = \prod_{m=1}^{M} Ω_S^{(m)}. Then, we
obtain M-mode relational data X, which is a tensor of order (M + 1).
Given such tensorial data, TSOM–R^M must organize M topographic maps (each of
which visualizes the relationships of objects within each mode) and visualize the rela-
tionships between modes. TSOM–R^M assumes that there are M latent spaces {Z^{(1)}, . . . , Z^{(M)}},
and that the observed data can be represented by a nonlinear map f of M latent vari-
ables. That is,

x_{n_1 ... n_M :} ≃ f\left( z^{(1)}_{n_1 :}, . . . , z^{(M)}_{n_M :} \right).    (26)

Figure 1: (a) The architecture of TSOM–R^2. The map from the latent spaces to the
observation space is represented by the set of reference vectors (y_{k_1 k_2 :}), each of which
corresponds to a pair of nodes (ζ^{(1)}_{k_1}, ζ^{(2)}_{k_2}) in the latent spaces. The reference vectors
represent a product manifold in the observation space. (Note that the product manifold
is 4-dimensional in this case, and is depicted as a 3-dimensional nonlinear cube in
this figure.) (b) Estimates of the instance manifold {U^{(m)}_n} for each instance ω^{(m)}_n. The
winner is determined according to the best matching slice manifold of the instance
manifold.

Thus, the actual TSOM task is to simultaneously estimate the nonlinear map and the
latent variables.
For ordinary SOM, the dimensions of the observation space (D) and of the latent
space (L) should satisfy D ≥ L, so that the topology is preserved. Note that this
restriction is not necessary for TSOM. In fact, we can even consider a scalar case (i.e.,
D = 1). This is because we expect topological preservation between {X_{n(m)}} and {z^{(m)}_{n:}},
and the dimension of X_{n(m)} is usually much greater than L.

3.2 TSOM–R2 algorithm


We first present the TSOM algorithm for 2-mode relational data, namely, TSOM–R2 .
In this case, the relational dataset is given as a tensor of order 3, X = (x_{n_1 n_2 d}). The aim
of TSOM–R^2 is to model the 2-mode relational data using a nonlinear map. That is,

f : Z^{(1)} × Z^{(2)} −→ Y ⊂ X = R^D
    (z^{(1)}, z^{(2)}) \mapsto y(z^{(1)}, z^{(2)}).    (27)

To represent f, TSOM–R^2 has two latent spaces, Z^{(1)} and Z^{(2)}, each of which is dis-
cretized to K_1 and K_2 nodes, respectively. Letting ζ^{(m)}_{k_m} be the coordinate of the k_m th
node of the mth latent space, we assign a reference vector to every pair of nodes (k_1, k_2),
i.e., y_{k_1 k_2 :} = f(ζ^{(1)}_{k_1}, ζ^{(2)}_{k_2}). Consequently, the entire map is represented by the tensor
Y = (Y_{k_1 k_2 d}), as shown in Fig. 1 (a).
In the TSOM–R^2 algorithm, slice manifolds and instance manifolds are essential
for determining the winning nodes. The slice manifolds are the submanifolds of Y,
which are represented by the slices of Y. For example, the k_1 th slice manifold of mode
1 is defined by Y^{(1)}_{k_1} = f(ζ^{(1)}_{k_1}, Z^{(2)}), which is represented by Y_{k_1 ::}. The slice manifold
characterizes the data distribution for a specified latent variable. In contrast, the in-
stance manifold characterizes the data distribution when an instance is specified. Now,
suppose that an instance ω^{(1)}_{n_1} is specified, and the data distribution {x_{n_1 1:}, . . . , x_{n_1 N_2 :}} is
approximated by x_{n_1 n_2 :} ≃ f^{(1)}(z^{(2)}_{n_2 :} | ω^{(1)}_{n_1}). Then, the instance manifold U^{(1)}_{n_1} is given by
U^{(1)}_{n_1} = f^{(1)}(Z^{(2)} | ω^{(1)}_{n_1}). Suppose that this manifold is represented by a matrix U^{(1)}_{n_1 ::},
defined by u^{(1)}_{n_1 k_2 :} := f^{(1)}(ζ^{(2)}_{k_2} | ω^{(1)}_{n_1}). Thus, the instance manifold U^{(1)}_{n_1} is the n_1 th slice
of U^{(1)}. The instance manifolds of mode 2 are also defined in the same way. In the
TSOM algorithm, the instance manifolds are estimated for all instances, and they are
regarded as the feature vectors for determining the winners. This situation is shown in
Fig. 1 (b).
Like the ordinary SOM, the TSOM algorithm consists of an E step and an M step.
For unfamiliar readers, we denote the algorithm in two ways, with and without the
tensor convention.

E step
In the E step, the winner of each mode is determined by

k^{*(1)}_{n_1} = \arg\min_{k_1} ∥Y_{k_1 ::} − U^{(1)}_{n_1 ::}∥^2    (28)
          = \arg\min_{k_1} \sum_{k_2=1}^{K_2} \sum_{d=1}^{D} \left( Y_{k_1 k_2 d} − U^{(1)}_{n_1 k_2 d} \right)^2    (29)
k^{*(2)}_{n_2} = \arg\min_{k_2} ∥Y_{: k_2 :} − U^{(2)}_{: n_2 :}∥^2    (30)
          = \arg\min_{k_2} \sum_{k_1=1}^{K_1} \sum_{d=1}^{D} \left( Y_{k_1 k_2 d} − U^{(2)}_{k_1 n_2 d} \right)^2.    (31)

Thus, the winner of ω^{(1)}_{n_1} is determined as the best matching slice manifold for the
instance manifold U^{(1)}_{n_1 ::} (Fig. 1 (b)). (28) also means that U^{(1)}_{n_1 ::} and Y_{k_1 ::} behave as if
they are the data and the reference vectors with respect to mode 1.

After determining all the winners, the winning matrices B^{(1)} and B^{(2)}, and the re-
sponsibility matrices R^{(1)} and R^{(2)} are obtained using

B^{(1)}_{k_1 n_1} = δ(k_1, k^{*(1)}_{n_1})    (32)
B^{(2)}_{k_2 n_2} = δ(k_2, k^{*(2)}_{n_2})    (33)
R^{(1)} = H^{(1)} B^{(1)}    (34)
R^{(2)} = H^{(2)} B^{(2)}.    (35)

Here, H^{(m)} is the neighborhood matrix of mode m. Elementwise, these equations can
be represented by

R^{(1)}_{k_1 n_1} = \exp\left[ −\frac{1}{σ^{(1)2}} d^2(k_1, k^{*(1)}_{n_1}) \right]    (36)
R^{(2)}_{k_2 n_2} = \exp\left[ −\frac{1}{σ^{(2)2}} d^2(k_2, k^{*(2)}_{n_2}) \right],    (37)

where d(k, k′) is the distance between two nodes in the latent space. Similar to (10)
and (11), the sum of the responsibility G^{(m)} and the normalized responsibility \tilde{R}^{(m)} are also
calculated for each mode using

G^{(1)} = \mathrm{diag}\left( \sum_{n_1=1}^{N_1} R^{(1)}_{k_1 n_1} \right)    (38)
G^{(2)} = \mathrm{diag}\left( \sum_{n_2=1}^{N_2} R^{(2)}_{k_2 n_2} \right)    (39)
\tilde{R}^{(1)} = (G^{(1)})^{-1} R^{(1)}    (40)
\tilde{R}^{(2)} = (G^{(2)})^{-1} R^{(2)}.    (41)

M step

In the M step, tensors Y, U^{(1)}, and U^{(2)} are updated using

U^{(1)} = X ×_2 \tilde{R}^{(2)}    (42)
U^{(2)} = X ×_1 \tilde{R}^{(1)}    (43)
Y = X ×_1 \tilde{R}^{(1)} ×_2 \tilde{R}^{(2)}.    (44)

If the components of the tensors are denoted explicitly, the above equations can be
rewritten as

U^{(1)}_{n_1 k_2 d} = \frac{1}{G^{(2)}_{k_2}} \sum_{n_2=1}^{N_2} R^{(2)}_{k_2 n_2} X_{n_1 n_2 d}    (45)
U^{(2)}_{k_1 n_2 d} = \frac{1}{G^{(1)}_{k_1}} \sum_{n_1=1}^{N_1} R^{(1)}_{k_1 n_1} X_{n_1 n_2 d}    (46)
Y_{k_1 k_2 d} = \frac{1}{G^{(1)}_{k_1} G^{(2)}_{k_2}} \sum_{n_1=1}^{N_1} \sum_{n_2=1}^{N_2} R^{(1)}_{k_1 n_1} R^{(2)}_{k_2 n_2} X_{n_1 n_2 d}.    (47)

Note that (44) is also denoted by

Y = U^{(1)} ×_1 \tilde{R}^{(1)}    (48)
Y = U^{(2)} ×_2 \tilde{R}^{(2)}.    (49)

Computationally, it is quicker to evaluate Y by (48) or (49) than (44).
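To make the flow concrete, the following is a minimal NumPy sketch of a single TSOM–R^2 iteration following Eqs. (28)–(49). It is not the authors' code: 1-dimensional latent grids are used for brevity, and all sizes and the neighborhood widths are placeholders.

```python
# A minimal sketch of one TSOM-R^2 iteration (Eqs. (28)-(49)); hypothetical sizes.
import numpy as np

N1, N2, D = 100, 80, 3
K1, K2 = 15, 15
X = np.random.randn(N1, N2, D)                       # data tensor
zeta1 = np.linspace(-1, 1, K1)                       # 1-D latent grids (for brevity)
zeta2 = np.linspace(-1, 1, K2)
Y = np.random.randn(K1, K2, D)                       # map tensor
U1 = np.random.randn(N1, K2, D)                      # instance manifolds of mode 1
U2 = np.random.randn(K1, N2, D)                      # instance manifolds of mode 2

def responsibilities(winners, zeta, sigma):
    """R_{kn} = exp(-d^2(k, k_n*)/sigma^2) (Eqs. (36), (37)), normalized as in (38)-(41)."""
    d2 = (zeta[:, None] - zeta[winners][None, :]) ** 2
    R = np.exp(-d2 / sigma ** 2)
    return R / R.sum(axis=1, keepdims=True)          # rows: nodes, columns: instances

sigma1 = sigma2 = 1.0
# E step: best matching slice manifold for each instance (Eqs. (28), (30))
w1 = ((U1[None] - Y[:, None]) ** 2).sum(axis=(2, 3)).argmin(axis=0)
w2 = ((U2[:, :, None] - Y[:, None]) ** 2).sum(axis=(0, 3)).argmin(axis=1)
R1t = responsibilities(w1, zeta1, sigma1)            # K1 x N1
R2t = responsibilities(w2, zeta2, sigma2)            # K2 x N2
# M step: Eqs. (42)-(44), with Y evaluated via Eq. (48)
U1 = np.einsum('km,nmd->nkd', R2t, X)                # U(1) = X x_2 R~(2)
U2 = np.einsum('kn,nmd->kmd', R1t, X)                # U(2) = X x_1 R~(1)
Y  = np.einsum('kn,njd->kjd', R1t, U1)               # Y = U(1) x_1 R~(1)
```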

3.3 TSOM–R M algorithm


We now present the generalized TSOM algorithm for the arbitrary M-mode relational
data, namely, TSOM–R M . In this case, the data tensor X and the map tensor Y become
tensors of order (M + 1), which are (N1 × · · · × N M × D) and (K1 × · · · × K M × D) in
size, respectively.

E step

As in TSOM–R^2, the winners are determined by the distance between the instance
manifolds U^{(m)}_{n(m)} and the slice manifolds Y_{k(m)}. That is,

k^{*(m)}_n = \arg\min_k ∥Y_{k(m)} − U^{(m)}_{n(m)}∥^2    (50)
B^{(m)}_{kn} = δ\left( k, k^{*(m)}_n \right).    (51)

Then, the matrices R^{(m)}, G^{(m)}, and \tilde{R}^{(m)} are calculated for each mode using

R^{(m)} = H^{(m)} B^{(m)}    (52)
G^{(m)} = \mathrm{diag}\left( \sum_{n=1}^{N_m} R^{(m)}_{kn} \right)    (53)
\tilde{R}^{(m)} = (G^{(m)})^{-1} R^{(m)}.    (54)

M step

In the M step, Y and {U^{(m)}} are updated according to

Y = X × {\tilde{R}}    (55)
U^{(m)} = X ×_{−m} {\tilde{R}}.    (56)
We iterate over these two steps, reducing the neighborhood size until the map con-
verges.
Consider the algorithm for calculating (55) and (56). If these tensors are calculated
independently, we require M^2 tensor–matrix multiplications. For example, if M = 4,
(55) and (56) become

Y = X ×_1 \tilde{R}^{(1)} ×_2 \tilde{R}^{(2)} ×_3 \tilde{R}^{(3)} ×_4 \tilde{R}^{(4)}    (57)
U^{(1)} = X ×_2 \tilde{R}^{(2)} ×_3 \tilde{R}^{(3)} ×_4 \tilde{R}^{(4)}    (58)
U^{(2)} = X ×_1 \tilde{R}^{(1)} ×_3 \tilde{R}^{(3)} ×_4 \tilde{R}^{(4)}    (59)
U^{(3)} = X ×_1 \tilde{R}^{(1)} ×_2 \tilde{R}^{(2)} ×_4 \tilde{R}^{(4)}    (60)
U^{(4)} = X ×_1 \tilde{R}^{(1)} ×_2 \tilde{R}^{(2)} ×_3 \tilde{R}^{(3)},    (61)

in which there are 16 tensor–matrix multiplications. However, we can reduce this to
⌈M log_2 M + 1⌉ multiplications. Suppose that M = {1, . . . , M} is the set of modes, and
M′ is a subset of M. Suppose further that the intermediate product tensor P(M′) is
defined by

P(M′) ≜ X ×_{M′} {\tilde{R}}.    (62)

Note that P(M′ + {m}) = P(M′) ×_m \tilde{R}^{(m)} if m ∉ M′. Considering Y = P(M) and
U^{(m)} = P(M \ {m}), (57)–(61) can be calculated using

P({1, 2}) = X ×_1 \tilde{R}^{(1)} ×_2 \tilde{R}^{(2)}    (63)
P({3, 4}) = X ×_3 \tilde{R}^{(3)} ×_4 \tilde{R}^{(4)}    (64)
U^{(1)} = P({3, 4}) ×_2 \tilde{R}^{(2)}    (65)
U^{(2)} = P({3, 4}) ×_1 \tilde{R}^{(1)}    (66)
U^{(3)} = P({1, 2}) ×_4 \tilde{R}^{(4)}    (67)
U^{(4)} = P({1, 2}) ×_3 \tilde{R}^{(3)}    (68)
Y = U^{(1)} ×_1 \tilde{R}^{(1)}.    (69)
Thus, these tensors can be calculated in parallel like a binary-tree algorithm. In this
case, the number of tensor–matrix multiplications is reduced from 16 to 9 (= 4 log2 4 +
1).
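A small sketch of this binary-tree reuse for M = 4 (assumed sizes; the R̃ matrices are placeholders) is shown below; nine mode products are used in total.

```python
# Sketch of the binary-tree reuse of intermediate products, Eqs. (63)-(69), for M = 4.
import numpy as np

def mode_product(X, A, m):
    """Y = X x_m A with A of shape (new size, old size of mode m)."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, m)), 0, m)

N, K, D = (8, 7, 6, 5), (4, 4, 4, 4), 3
X = np.random.randn(*N, D)
R = [np.random.rand(K[m], N[m]) for m in range(4)]       # placeholder R~(m) matrices

P12 = mode_product(mode_product(X, R[0], 0), R[1], 1)    # P({1,2}) = X x_1 R~(1) x_2 R~(2)
P34 = mode_product(mode_product(X, R[2], 2), R[3], 3)    # P({3,4}) = X x_3 R~(3) x_4 R~(4)

U1 = mode_product(P34, R[1], 1)                          # U(1) = P({3,4}) x_2 R~(2)
U2 = mode_product(P34, R[0], 0)                          # U(2) = P({3,4}) x_1 R~(1)
U3 = mode_product(P12, R[3], 3)                          # U(3) = P({1,2}) x_4 R~(4)
U4 = mode_product(P12, R[2], 2)                          # U(4) = P({1,2}) x_3 R~(3)
Y  = mode_product(U1, R[0], 0)                           # Y = U(1) x_1 R~(1)
```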

3.4 Derivation of the TSOM algorithm


In this subsection, we present a theoretical derivation of the TSOM algorithm. To begin
with, let us start from the notation of the weighted square error between two tensors.

Suppose that X and Y are tensors of order (M + 1), and {R} = {R^{(1)}, . . . , R^{(M)}} is a set
of responsibility matrices that give the weights. The weighted error between X and Y
is defined as

E\left( X, Y; \{R\} \right) ≜ \sum_{n_1, ..., n_M} \sum_{k_1, ..., k_M} \left( \prod_{m=1}^{M} R^{(m)}_{k_m n_m} \right) ∥x_{n_1 ... n_M :} − y_{k_1 ... k_M :}∥^2.    (70)

If each R^{(m)} is normalized so that \sum_k R^{(m)}_{kn} = 1, (70) gives the expectation error. Let the
square errors of two tensors except mode m be defined as

E^{(−m)}_{k_m n_m}\left( X, Y; \{R\} \right) ≜ \sum_{n_1, ..., n_M (\text{except } n_m)} \; \sum_{k_1, ..., k_M (\text{except } k_m)} \left( \prod_{m′=1, m′ \neq m}^{M} R^{(m′)}_{k_{m′} n_{m′}} \right) ∥x_{n_1 ... n_M :} − y_{k_1 ... k_M :}∥^2.    (71)

Note that E^{(−m)}(X, Y; {R}) becomes a K_m × N_m matrix.


Using this weighted square error, the objective function of TSOM can be defined
as

F ≜ −\frac{1}{2N} E\left( X, Y; \{R\} \right).    (72)

Here, N = \prod_m N_m and R^{(m)}_{kn} = H^{(m)}_{k k^{*(m)}_n}. This is a straightforward extension of (15).
Note that (72) can be decomposed with respect to mode m, by applying Pythagoras’
theorem. That is,

F ≃ −\frac{1}{2N_m} \sum_{n=1}^{N_m} \left\{ \sum_{k=1}^{K_m} R^{(m)}_{kn} \frac{1}{K_{−m}} ∥Y_{k(m)} − U^{(m)}_{n(m)}∥^2 + \frac{1}{N_{−m}} E^{(−m)}_{kn}\left( X, U^{(m)}; \{R\} \right) \right\},    (73)

where N_{−m} = \prod_{m′ \neq m} N_{m′}, K_{−m} = \prod_{m′ \neq m} K_{m′}, and U^{(m)} is the instance manifold given
by (56).
In the E step, (72) is maximized with respect to {k^{*(m)}_n} according to

k^{*(m)}_n = \arg\min_k \sum_{k′} H^{(m)}_{kk′} E^{(−m)}_{k′ n}\left( X, Y; \{R\} \right).    (74)

Applying (73), (74) becomes

k^{*(m)}_n = \arg\min_k \sum_{k′} H^{(m)}_{kk′} ∥Y_{k′(m)} − U^{(m)}_{n(m)}∥^2,    (75)

because the second term of (73) is independent of k^{*(m)}_n. Similar to the approximations
from (16) to (6), (74) and (75) become

k^{*(m)}_n = \arg\min_k E^{(−m)}_{kn}\left( X, Y; \{R\} \right)    (76)
          = \arg\min_k ∥Y_{k(m)} − U^{(m)}_{n(m)}∥^2.    (77)

Thus, we obtain (50). Both (76) and (77) yield the same result, but (77) can be calcu-
lated quicker than (76).
In the M step, (72) is maximized with respect to Y. It is easy to show that the
solution is given by (55).

3.5 TSOM with basis functions
If the order of the tensor increases, the computational costs drastically increase. The
size of the task is roughly proportional to K^M. To reduce the costs, we need to reduce
the task size as well as make algorithmic improvements. The easiest way is to reduce K, i.e.,
the number of nodes. However, this approach sacrifices the spatial resolution.
Another approach is to use a set of basis functions. Because we assume that the
map is smooth and continuous, we should be able to sufficiently represent it using a
relatively small number of bases. If J basis functions for each mode are enough to
represent the map, then the map can be represented by the tensor W ∈ R^{J^M × D}, in-
stead of Y ∈ R^{K^M × D}. Thus, Y is compressed to W by (J/K)^M. This approach also
provides further advantages by using an orthonormal basis set. Because distances are
preserved by the orthonormal transformation, the distance between two maps is equal
to the distance between two corresponding coefficient vectors. Applying this to (50),
the distance between K^{M−1} × D dimensional vectors can be evaluated by the distance
between J^{M−1} × D dimensional coefficient vectors. Accordingly, this reduces the calcu-
lation times in the competitive learning process. In this work, we used the normalized
Legendre polynomials as a typical orthonormal system in the square latent space.
Another advantage of using a continuous basis set is that it provides the derivative
of the objective function with respect to the latent variable z. Thus we can find the
best matching point using more efficient algorithms such as the gradient method. This
means that we can specify the winners more precisely while using less computational
time than in the ordinary all-play-all competition.
Suppose that the orthonormal system for mode m is {φ^{(m)}_1(z), . . . , φ^{(m)}_J(z)}, and let
the matrix representation be Φ^{(m)} = (φ_j(ζ_{k:})) ∈ R^{K×J}. Then Y is obtained by decom-
pressing W by

Y = W ×_1 Φ^{(1)} ×_2 · · · ×_M Φ^{(M)} = W × {Φ}.    (78)

In this paper, W is called the core tensor of Y. To determine the winners, we also need
the core tensors of the instance manifolds. Letting V^{(m)} be the core tensor of U^{(m)}, the
instance manifolds are obtained by

U^{(m)} = V^{(m)} ×_{−m} {Φ}.    (79)

The core tensors of the slice manifolds T^{(m)} are easily calculated by only decompress-
ing W with respect to mode m. Thus,

T^{(m)} ≜ W ×_m Φ^{(m)}.    (80)

Because T^{(m)} and V^{(m)} are compressed representations for all modes except m, they are
(J/K)^{M−1} times smaller than the original Y_{k(m)} and U^{(m)}. Nevertheless, the distance
between the two maps is preserved by the orthonormal system. Thus,

∥Y_{k(m)} − U^{(m)}_{n(m)}∥^2 = ∥T^{(m)}_{k(m)} − V^{(m)}_{n(m)}∥^2.    (81)

Applying (81) to (50), we can determine the winners without decompressing W or V.

To obtain W and V, we need to calculate the generalized inverse Φ_G^{\#} for every
iteration and mode. This computational cost could outweigh the benefits of using the
basis functions. However, as described in Sec. 2, the inverse can be approximated by
Φ_P^{\#}, which is just the transpose Φ^T in an orthonormal system. Thus, we do not need to
calculate the inverse matrix at all.
Based on the above discussions, the TSOM algorithm with orthonormal bases is
described below.

E step
Applying (81), the winner is determined without decompressing the core tensors. That
is,

k^{*(m)}_n = \arg\min_k ∥T^{(m)}_{k(m)} − V^{(m)}_{n(m)}∥^2.    (82)

We can replace the above all-play-all competition with a gradient method if preferred.
In the case of the gradient descent method, it becomes

z^{(m)}_{nl} := z^{(m)}_{nl} − η ⟨ W ×_m φ(z^{(m)}_n) − V^{(m)}_{n(m)}, W ×_m φ′_l(z^{(m)}_n) ⟩.    (83)

After determining the winners, we calculate R^{(m)}, G^{(m)}, and \tilde{R}^{(m)} using (52)–(54).
Then, \tilde{Q}^{(m)} (the normalized responsibility of the bases) is calculated using

\tilde{Q}^{(m)} = Φ^{(m)T} \tilde{R}^{(m)}.    (84)

Here Φ^{(m)T} is used as the generalized inverse.

M step

In the M step, the core tensors W and {V^{(m)}} are updated according to

W = X × {\tilde{Q}}    (85)
V^{(m)} = X ×_{−m} {\tilde{Q}}.    (86)

The only difference between this and the discrete version in (55) and (56) is that \tilde{R}^{(m)} is
replaced by \tilde{Q}^{(m)}. The binary-tree calculation is also effective. The simulation results
(Tables 1 and 2) show that the basis functions are effective in reducing the calculation time.
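As a hedged illustration (not the authors' implementation), the basis matrix Φ for one mode and the compressed responsibility Q̃ of Eq. (84) might be built as follows; the scaling of the Legendre polynomials is chosen so that the columns of Φ are approximately orthonormal on the node grid.

```python
# Illustrative construction of a normalized Legendre basis matrix Phi for one mode,
# and the compressed responsibility Q~ = Phi^T R~ (Eq. (84)). Sizes are hypothetical.
import numpy as np
from numpy.polynomial import legendre

K, J, N = 20, 4, 100
zeta = np.linspace(-1, 1, K)                          # node positions of one mode

# P_j evaluated at the nodes, scaled by sqrt((2j+1)/K) so that Phi^T Phi ~= I.
Phi = np.stack([np.sqrt((2 * j + 1) / K) * legendre.legval(zeta, np.eye(J)[j])
                for j in range(J)], axis=1)           # K x J

R_tilde = np.random.rand(K, N)                        # normalized responsibilities (placeholder)
R_tilde /= R_tilde.sum(axis=1, keepdims=True)
Q_tilde = Phi.T @ R_tilde                             # J x N, Eq. (84)
```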

3.6 Initialization and scheduling


TSOM is initialized in a similar way to the conventional batch SOM. There are two random
initialization methods: randomly initialize the reference vectors, or randomly initialize
the winners. In the first case, the loop starts in the E step, whereas it starts in the M step
in the second case. Although there are no significant differences, we recommend using
the second method for TSOM, because it is not affected by biases in the given dataset.

Figure 2: (a) The generative model of the relational data dealt with by TGTM–R^2. Note
that the two 2-dimensional latent spaces are mapped onto the 4-dimensional manifold
in the observation space, which is depicted as a 3-dimensional nonlinear cube. (b) The
graphical model of the generative model.

In real applications, initialization using PCA is also recommended. For relational data,
PCA should be used for each mode, by vectorizing the slices of the data tensor. For
example, a PCA for mode m regards \hat{x}^{(m)}_n = \mathrm{vec}(X_{n(m)}) as the nth data vector of mode
m. When the PCA finishes, the initial winning nodes are determined using a principal
axes transformation.
The time scheduling of the neighborhood size is implemented in a similar way. For
example, σ(t) = \max[σ_0 (1 − t/τ), σ_∞] or σ(t) = (σ_0 − σ_∞) e^{−t/τ} + σ_∞, where t is the
calculation time, σ_0 and σ_∞ are the initial and final neighborhood sizes, and τ is the
time constant.
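For reference, the two schedules above can be written as small helper functions; the default parameter values are only examples, not the paper's recommendations.

```python
# Two neighborhood-size schedules: linear decay with a floor, and exponential decay.
import numpy as np

def sigma_linear(t, sigma0=2.0, sigma_inf=0.1, tau=50.0):
    return max(sigma0 * (1.0 - t / tau), sigma_inf)

def sigma_exponential(t, sigma0=2.0, sigma_inf=0.1, tau=50.0):
    return (sigma0 - sigma_inf) * np.exp(-t / tau) + sigma_inf
```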

3.7 TGTM
By applying the probabilistic generative model, it is possible to derive the algorithm
using a fully Bayesian approach, namely, TGTM. In this paper, we have abbreviated
TGTM for M-mode relational data to TGTM–R M .
We first define the generative model for the relational data dealt with by TGTM–
R^M. Suppose that we have M latent spaces {Z^{(m)}} (m = 1, . . . , M), each of which is dis-
cretized to K_m nodes with the positional vectors {ζ^{(m)}_k}. The priors of the latent variables
are given by \mathrm{Prob}\left( z^{(m)} = ζ^{(m)}_k \right) = p^{(m)}_k, which we have assumed to be p^{(m)}_k = 1/K_m. To
obtain the training dataset, the latent variables are generated N_m times, independently
for each mode. The observations are made for all combinations of the latent variables
using

x_{n_1 ... n_M :} = y\left( z^{(1)}_{n_1 :}, . . . , z^{(M)}_{n_M :} \right) + ε_{n_1 ... n_M :},    (87)

where y is the nonlinear map from the direct product space Z^{(1)} × · · · × Z^{(M)} to the
observation space X = R^D, and ε ∈ R^D is the observation noise generated by p(ε) =
N(ε | 0, β^{-1} I_D). Thus, \prod_m N_m data are observed, and are represented by a tensor
X = (X_{n_1 ... n_M d}).
We assume that the map is represented by a set of radial basis functions, such that

Y = W × {Φ},    (88)

where Φ^{(m)} ∈ R^{K_m × J_m} denotes the radial bases of mode m, and W is the core tensor.
The prior of W is

p(W) = N(W | O, L^{-1})    (89)
L = (λ_1 I_{K_1}) ◦ · · · ◦ (λ_M I_{K_M}).    (90)

Thus, each component of W obeys the independent Gaussian prior p(W_{j_1 ... j_M d}) =
N(W_{j_1 ... j_M d} | 0, λ^{-1}). The generative model for M = 2 is shown in Fig. 2 (a) and
(b).
The task of TGTM is to estimate the core tensor W, the latent variables {z^{(m)}_{n_m :}}, and
the parameter β from the observed relational data X. The TGTM algorithm can be de-
rived by applying the standard EM algorithm. Here, Y and β are estimated as the MAP
solutions, and the latent variables {z^{(m)}_{n_m :}} are estimated as their posteriors. (Note that we
can also easily estimate the posteriors of Y and β by applying the variational Bayesian
method [40].) Applying the variational approximation to this generative model, the
objective function is

F(R, W, β) = −\frac{β}{2} E\left( X, Y(W); \{R\} \right) − \sum_k \sum_n R_{kn} \ln R_{kn} + \ln P(W),    (91)

where E(X, Y; R) is the expectation error defined by (70), and R^{(m)} is the responsibility
matrix of {z^{(m)}_{n_m :}}. In the E step, F is maximized with respect to R^{(m)} using

\ln R^{(m)}_{kn} = −\frac{β}{2} E^{(−m)}_{kn}\left( X, Y(W); \{R\} \right) + \mathrm{const}.    (92)
If M = 3, (92) becomes

\ln R^{(1)}_{k_1 n_1} = −\frac{β}{2} \sum_{k_2, k_3} \sum_{n_2, n_3} R^{(2)}_{k_2 n_2} R^{(3)}_{k_3 n_3} ∥y_{k_1 k_2 k_3 :} − x_{n_1 n_2 n_3 :}∥^2 + \mathrm{const}.    (93)

In the M step, F is maximized with respect to W and then β using

W = X × {\tilde{Q}}    (94)
β^{-1} = \frac{1}{\prod_m N_m} E\left( X, Y(W); \{R\} \right).    (95)

Here, \tilde{Q}^{(m)} is given by

G^{(m)} = \mathrm{diag}\left( \sum_{n_m} R^{(m)}_{k_m n_m} \right)    (96)
\tilde{Q}^{(m)} = \left( Φ^{(m)T} G^{(m)} Φ^{(m)} + \frac{λ_m}{\sqrt[M]{β}} I \right)^{-1} Φ^{(m)T} R^{(m)}.    (97)

This is the TGTM–R^M algorithm. It is also possible to use the Gaussian process instead
of the RBF bases. If the RBF bases are replaced by the Nadaraya-Watson smoother, and the
latent variables are estimated by maximum likelihood, then this algorithm becomes
the TSOM.
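As an illustration under our reading of Eq. (97) (regularizer λ_m divided by the M-th root of β), the TGTM M-step quantities for one mode could be computed as follows; Φ, R, and all sizes are placeholders.

```python
# Illustrative computation of G(m) and Q~(m) for one mode (Eqs. (96), (97)).
# The regularizer lambda_m / beta**(1/M) follows our reading of Eq. (97).
import numpy as np

K_m, J_m, N_m, M = 20, 5, 100, 2
lam_m, beta = 1e-3, 10.0

Phi = np.random.randn(K_m, J_m)          # radial basis values at the nodes (placeholder)
R = np.random.rand(K_m, N_m)             # responsibilities of mode m (placeholder)

G = np.diag(R.sum(axis=1))                                     # Eq. (96)
reg = (lam_m / beta ** (1.0 / M)) * np.eye(J_m)                # regularization term
Q_tilde = np.linalg.solve(Phi.T @ G @ Phi + reg, Phi.T @ R)    # Eq. (97), J_m x N_m
```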
The TGTM algorithm is a Bayesian algorithm, which is its advantage compared
with TSOM. However, the computational cost of TGTM is much greater than that of TSOM
(Tables 1 and 2). Thus, it is hard to apply TGTM to practical tasks without the development
of a more efficient algorithm. In this paper, the significance of TGTM is to bridge the
gap between neural network-based algorithms and Bayesian algorithms, rather than
to provide a practically useful algorithm. With such a solid theoretical background,
we consider TSOM to be a widely applicable method that can be used with ease and
confidence.

4 Simulation Results
4.1 Artificial datasets
We used some artificial datasets to examine the performances of the proposed algo-
rithms. For the sake of visualization, we synthesized 2-mode relational datasets with 1-
dimensional latent spaces using the generative model. The protocol for generating data
points is as follows. First, we generated N_m latent variables {z^{(m)}_1, . . . , z^{(m)}_{N_m}} using the
uniform distribution in [−1, +1]^L for m = {1, 2}. Then, the observation data x_{n_1 n_2} were
generated by

x_{n_1 n_2} = f\left( z^{(1)}_{n_1}, z^{(2)}_{n_2} \right) + ε,    (98)
Table 1: Parameters used in the simulation of the artificial dataset, and the calculation results.

                                        Basic TSOM–R^2      Legendre TSOM–R^2     TGTM–R^2 with     TGTM–R^2 with
                                        (20 nodes/mode)     with 4 bases/mode     20 bases/mode     4 bases/mode
Number of nodes (K1, K2)                20 × 20             —                     20 × 20           20 × 20
Number of bases (J1, J2)                (20 × 20)           4 × 4                 20 × 20           4 × 4
Neighborhood radius (σ0)                2.0                 2.0                   2.0               2.0
Neighborhood radius (σ∞)                0.1                 0.1                   0.8               0.8
Neighborhood time constant (τ)          50
Number of instances (N1, N2)            100 × 100
Noise amplitude (σnoise)                0.1
RMSE (average of 20 trials)             0.0775±0.0037       0.0867±0.0046         0.0693±0.0027     0.0777±0.0019
Calculation time for 100 loops (sec)*   0.45593±0.00083     0.09796±0.00027       15.3294±0.0057    10.9452±0.0038
* Intel Core i7-2600K (3.40 GHz), Visual C++ (single thread)

Figure 3: Results using the artificial dataset: (a) original map (desired result); (b) data
points with Gaussian noise; (c) basic TSOM–R^2; (d) Legendre TSOM–R^2; (e) TGTM–
R^2 with 20 RBFs; and (f) TGTM–R^2 with 4 RBFs.

Table 2: Calculation time comparison for a large dataset.

                                        Basic TSOM–R^2    Legendre TSOM–R^2    TGTM–R^2
Number of nodes (K1, K2)                400 × 400         —                    400 × 400
Number of bases (J1, J2)                (400 × 400)       16 × 16              16 × 16
Number of instances (N1, N2)            1,000 × 1,000
Calculation time for 100 loops (sec)*   2.65 × 10^3       3.38 × 10^1          3.91 × 10^5
* Intel Core i7-2600K (3.40 GHz), Visual C++ (single thread)
Figure 4: Results for the other artificial dataset organized by basic TSOM–R^2. Data
points are indicated by dots. (a) 3-dimensional dataset and the organized map using
TSOM–R^2. The axes represent the three components of the observation space. (b)
1-dimensional dataset and the organized map. Unlike other figures, the two horizontal
axes represent the latent variables, and the observation space is only represented by the
vertical axis.

for N_1 × N_2 combinations of z^{(1)}_{n_1} and z^{(2)}_{n_2}. Here, ε is Gaussian noise p(ε) = N(ε | 0, σ_{noise}^2 I).
In the simulations, we used nonlinear maps ranging in [−1, +1], and set the noise am-
plitude to σ_{noise} = 0.1.
We examined four variations of TSOM and TGTM: (1) basic TSOM–R2 with 20
discrete nodes per mode, (2) TSOM–R2 with 4 Legendre bases per mode (Legendre
TSOM–R2 ), (3) TGTM–R2 with 20 radial basis function (RBF) bases per mode, and
(4) TGTM–R2 with 4 RBF bases per mode. The parameters used in the simulation are
shown in Table 1. TSOMs were initialized by randomly assigning the winners, and
TGTM was initialized by randomly constructing the map. Empirically, GTM is more
likely to fall into local minima than SOM, especially when it is initialized randomly. In
this work, the radius of the RBF was gradually reduced to avoid local minima, in the
same manner as the neighborhood size of the SOMs. With this modification, TGTM
did not fall into local minima in any of the simulations. In the basic TSOM–R2 , the
latent variables are estimated by the all-play-all competition, whereas they are updated
by the gradient descent method in the Legendre TSOM–R2 .
We iterated over the E and M steps, and reduced the neighborhood size exponen-
tially with the time constant τ = 50. In this experiment, the calculations were moni-
tored until t = 600, but the map converged after a smaller number of iterations. Typi-
cally, at most one hundred iterations are enough for random initialization. The iteration
cycles can be further reduced by using PCA initialization.
In the first artificial dataset, the data points were generated using

f(z_1, z_2) = \begin{pmatrix} z_1 \cos(π/4) − z_2 \sin(π/4) \\ z_1 \sin(π/4) + z_2 \cos(π/4) \\ z_1^2 − z_2^2 \end{pmatrix}.    (99)
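A sketch of this data-generation protocol (Eqs. (98) and (99)) in NumPy, with placeholder sizes and seed, is given below.

```python
# Generating the first artificial dataset: uniform latent variables, the nonlinear
# map of Eq. (99), and additive Gaussian noise as in Eq. (98).
import numpy as np

N1, N2, sigma_noise = 100, 100, 0.1
rng = np.random.default_rng(0)
z1 = rng.uniform(-1, 1, N1)                     # latent variables of mode 1
z2 = rng.uniform(-1, 1, N2)                     # latent variables of mode 2

def f(z1, z2):
    """Evaluate the map of Eq. (99) on the grid of latent-variable pairs."""
    a, b = np.meshgrid(z1, z2, indexing='ij')
    c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
    return np.stack([a * c - b * s, a * s + b * c, a ** 2 - b ** 2], axis=-1)

X = f(z1, z2) + sigma_noise * rng.standard_normal((N1, N2, 3))   # Eq. (98)
```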

The results are shown in Fig. 3 and Table 1. All the algorithms succeeded in capturing
the data distribution, as expected. We repeated the simulations 20 times for each algo-
rithm, randomly changing the initial state and the noise. All the results were consistent
and stable. For this dataset, the RMSEs were more or less equal for all algorithms.
Among the examined algorithms, the fastest was Legendre TSOM–R2 . It was ap-

Table 3: Summary of the sushi and beverage datasets.

                        Sushi dataset           Beverage dataset
Number of modes         2 (user, item)          3 (user, item, context)
Number of instances     5,000 × 10              604 × 14 × 11
Data type               scalar (integer)        scalar (integer)
Data scale              ranking from 1 to 10    grades from 1 to 5

proximately 5 times faster than basic TSOM–R2 . In contrast, TGTM–R2 was more
than 100 times slower than the Legendre TSOM–R2 . The speed differences increased
when the tasks became larger (Table 2). Therefore, Legendre TSOM–R2 is the best
solution for a large-scale relational data analysis.
Fig. 4 (a) shows the results using another artificial dataset, generated by

f(z_1, z_2) = \begin{pmatrix} \cos(0.5πz_1 + πz_2) \\ \sin(0.5πz_1 + πz_2) \\ z_2 \end{pmatrix}.    (100)

Ordinary SOM cannot generally learn this ‘Swiss-roll’ type of dataset. Nevertheless,
TSOM succeeded. This is because the relational data contain more information about
the latent variables than the ordinary dataset. This issue is discussed in Sec. 7.
TSOM can estimate the map even if the observed data are scalar. Fig. 4 (b) is an
example of a scalar dataset. Here, the scalar data were generated by

f(z_1, z_2) = \sin\left( \frac{\,\cdot\,}{4} (z_1 + z_2) \right).    (101)

In this case, the two latent spaces degenerate into the one dimensional observation
space. Thus, different latent variable pairs can often generate the same output. Never-
theless, TSOM estimated both the nonlinear map and the latent variables, as shown in
Fig. 4 (b).

4.2 User–item rating datasets


As an example of a real application, we applied TSOM to two sets of user–item rating
data: a sushi dataset with 2-modes, and a beverage dataset with 3-modes. The sushi
dataset2 consists of the ranking scores of 10 sushi toppings evaluated by 5,000 respon-
dents [41]. Thus, 10 toppings are ranked from 1 to 10 by each respondent. In this work,
they are regarded as scalar scores. The beverage dataset3 consists of scores evaluated
by 604 respondents, who were asked to evaluate their preferences regarding 14 bever-
ages (e.g., orange juice or coke) in 11 different contexts (e.g., at lunch or during sports).
Thus, the beverage dataset is 3-mode relational data of (user)×(item)×(context). A
summary of these datasets is shown in Table 3.
The results for the sushi dataset are shown in Fig. 5. Using TSOM–R2 , we ob-
tained the topographic maps of users (respondents) and items (sushi toppings). These

2 http://www.kamishima.net/sushi/
3 http://www.brain.kyutech.ac.jp/~furukawa/beverage-e/

Figure 5: Maps for the sushi dataset: (a) the map of users (respondents) and (b) the
map of items (sushi toppings), colored by the marginal U-matrix, where red regions
represent the cluster borders and A–E are the conditioning points for Fig. 6; (c) the map
of items (sushi toppings) colored by the marginal component plane. The red/blue colors
indicate the higher/lower average scores. Fatty tuna is the most preferred topping, and
cucumber roll is the least favorite.

Figure 6: The user maps for the sushi dataset, colored by the conditional component
plane. The conditioning points are labeled A–E in Fig. 5 (b). The red/blue regions in A
correspond to the respondents who like/dislike cucumber rolls, while those in E show
users who prefer the shrimp topping. When the conditioning point continuously moves
from A to E in the sushi map, the user map changes color gradually from A to E.

topographic maps can be interpreted as in a conventional SOM. Therefore, two users or
two items located closer to each other in the maps are more similar than ones that are
further away. For example, the shrimp sushi and tuna roll are located close to each other in the map space.
This implies that the people who prefer shrimp are also likely to prefer tuna rolls, and
vice versa.
Fig. 7 presents the results for the beverage dataset obtained by TSOM–R3 . That is,
the maps of users (respondents), items (beverages), and contexts (situations). Similar
types of beverages are closer in the item map. For example, carbonated drinks are
located in the same map region.
One may think that the difference between a TSOM and a conventional SOM is
only due to the number of organized maps. However, it is worth noting that TSOM
organizes a single model that preserves all the information, including the relationships
between different modes. TSOM provides several visualization methods for the inter-
mode relationships, as introduced in the next section.

Figure 7: Maps for the beverage dataset, colored by the marginal U-matrix: (a) users
(respondents); (b) items (beverages); and (c) contexts (situations).

Figure 8: Maps of the beverage dataset colored by the conditional component plane.
Red/blue regions represent the higher/lower scores. The markup balloons are the corre-
spondence overlay. Note that the scores are marginalized with respect to the user mode.
(a) The conditional component map of beverages colored for ‘exercise time’ (‘+’ in the
context map). The beverages in the red region (mineral water and isotonic drink) are
preferred during exercise. The overlaid label ‘exercise time’ is also located between
these beverages. (b) The conditional component map of contexts colored for ‘isotonic
drink’ (‘+’ in the beverage map). The map shows that isotonic drink is preferred in the
situations indicated by red (exercise time, outdoor work time, and outdoor playtime).
The overlaid label ‘isotonic drink’ is located near these situations.

5 Visualization and Exploration in Map Spaces
In this section, we introduce some visualization techniques for TSOM. We only de-
scribe the methods for TSOM, but they are also all available for TGTM.

5.1 The joint map


In conventional SOM, the organized nonlinear map is represented by a square topo-
graphic map. It is usually colored according to either the component plane or the U-
matrix [42, 43]. The component plane visualizes the nonlinear map by focusing on
a component of the data vectors. Thus, nodes that have higher values with respect to
the focused component are highlighted in the topographic map. Alternatively, when
using the U-matrix, each node is colored depending on the distance in the data space
from the node to its neighbors. As a result, the cluster borders are highlighted in the
topographic map. These two methods are also available for a TSOM.
For the TSOM, the purpose of visualizations is to outline the nonlinear map y =
f (z(1) , . . . , z(M) ). The best way to capture the whole image is to visualize it for all
combinations of the latent variables. Thus, we visualize f using a single topographic
map, the dimension of which is ML. We refer to this as the joint map. Fig. 3 and Fig. 4
are examples of joint maps, which can also be represented by the component plane or
the U-matrix. Unfortunately, limitations of the visualization mean that the joint map is
only available when M = 2 and L = 1. Thus, we cannot visualize the whole image in
most cases. Instead, we can color each topographic map to show the intra- and inter-
mode relationships by focusing on an item of interest, as described in the following
methods.

5.2 The marginal map


If our purpose is to analyze the relationships between instances within a mode, then the
marginal map is appropriate. Using this method, we obtain M topographic maps, each
of which is colored like a conventional SOM, taking the average over the unconcerned
modes.
In the case of the marginal component plane, the kth node of the mth map is colored
using a grayscale or heat map, which is dependent on the averaged value defined as

\bar{Y}^{(m)}_{kd} = \frac{1}{K_{−m}} \sum_{k: k_m = k} Y_{kd},    (102)

where d is the component of interest. Fig. 5 (c) shows the marginal component plane
of the sushi map, which visualizes how much each topping is preferred by the respon-
dents.
Similarly, for the marginal U-matrix, each node is colored using the average dis-
tance between two neighboring units, defined as

D_m^2(k, k′) = \frac{1}{K_{−m}} ∥Y_{k(m)} − Y_{k′(m)}∥^2,    (103)
where k and k′ are the indices of two neighboring nodes. Fig. 5 (a) (b) and Fig. 7 are
examples using the marginal U-matrix.
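A small sketch of these marginal colourings for a 2-mode map with 1-dimensional latent grids (placeholder sizes, not the authors' code) follows.

```python
# Marginal component plane (Eq. (102)) and marginal U-matrix (Eq. (103)) of mode 1
# for a 2-mode map Y of shape (K1, K2, D); a 1-D latent grid is assumed for brevity.
import numpy as np

K1, K2, D, d = 15, 15, 3, 0
Y = np.random.randn(K1, K2, D)

# Marginal component plane of mode 1: average component d over the unconcerned mode 2.
marginal_plane_mode1 = Y[:, :, d].mean(axis=1)                    # length K1

# Marginal U-matrix of mode 1: averaged squared distance between neighboring nodes k, k+1.
diff = Y[1:] - Y[:-1]                                             # (K1-1) x K2 x D
marginal_umatrix_mode1 = (diff ** 2).sum(axis=2).mean(axis=1)     # length K1-1
```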

5.3 The conditional map


The conditional map is useful if we want to analyze the relationships between two
(or more) modes. Here, the term ‘conditional’ means that the topographic map of the
focused mode is colored under the condition specified by the other modes.
When the nonlinear map y = f (z(1) , z(2) ) is visualized by the conditional map with
respect to the 1st mode, f is regarded as a function of z(1) under the conditions given by
z(2) . Then, the topographic map of the 1st mode is colored by the component plane or
U-matrix with respect to z(1) , while z(2) is fixed to a specified position. Fig. 6 contains
examples using the conditional component plane.
An interactive user interface on a PC can be used to take full advantage of the con-
ditional map. We have developed prototype software, in which the user can specify the
condition in one of the maps using a pointing device4 . Then, the other maps are colored
as conditional maps. When the user moves the mouse cursor in the conditioning map,
then the other map changes accordingly, as if the user moves around in the space of the
relational data (Fig. 6).
If M > 2, we can combine the marginal and conditional visualizations. Thus, we
can color a focused map under the condition of some other modes, and then marginalize
the remaining unconcerned modes. Fig. 8 (a) (b) contains some examples in which the
component (rating score) is marginalized with respect to the user mode, while the other
mode is used for conditioning.

5.4 Correspondence overlay


The interactive visualization allows for an intuitive analysis, but it is difficult to print.
The correspondence overlay produces a printable static visualization, which summa-
rizes the conditional component plane under different conditions. We can also use
this technique to see the relationships between the instances belonging to two different
modes.
Fig. 8 (a) shows an example of the correspondence overlay, in which the context
labels (markup balloons) are overlaid on the beverage map. These labels indicate the
maximum point of the conditional component plane. For example, the label ‘Exercise
time’ signifies the beverage with the highest score in this context. Thus, this method
is similar to the correspondence analysis. Note that it is also possible to label the
minimum points (i.e., the beverages with the worst scores in this context).
Conversely, Fig. 8 (b) is the correspondence overlay of beverage labels on the con-
text map. In this case, each overlaid beverage label indicates the point in the context
map where that beverage receives its highest score. Simply speaking, Fig. 8 (a) shows the most preferred
beverages for the labeled situations, whereas Fig. 8 (b) represents the most preferred
situations for the labeled beverages. Thus, these two maps have different meanings.
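In terms of the reference tensor, the overlay positions are argmax operations on the conditional (and marginalized) component planes. The sketch below assumes a 3-mode beverage example with map order (user, beverage, context) and a rating component; these shapes and names are our own illustrative choices.

```python
import numpy as np

def correspondence_overlay(Y, d=0):
    """Label positions for the correspondence overlay (Fig. 8 style)."""
    # Y has shape (K_user, K_bev, K_ctx, D); marginalize the user map first.
    score = Y[..., d].mean(axis=0)          # (K_bev, K_ctx)
    ctx_on_bev = score.argmax(axis=0)       # beverage-map node for each context node
    bev_on_ctx = score.argmax(axis=1)       # context-map node for each beverage node
    return ctx_on_bev, bev_on_ctx
```

Replacing argmax with argmin would instead label the worst-scoring positions mentioned above.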

4 http://www.brain.kyutech.ac.jp/~furukawa/tsom-e/

[Figure 9 graphics: panels in (a) show the estimated manifolds for missing ratios of 60%, 80%, and 90% (100 × 100 data) and 90%, 98%, and 99% (1,000 × 1,000 data); panel (b) plots the RMSE against the number of given data points for both data sizes.]

Figure 9: The results of the missing data estimation: (a) estimated manifolds under
different missing ratios; and (b) RMSE of the missing data estimates. The RMSEs
were evaluated for two different original data sizes, 100 × 100 and 1,000 × 1,000. Missing
ratios are also indicated.

[Figure 10 graphics: panel (a) user map; panel (b) movie map with overlaid titles such as Pulp Fiction, The Terminator, Aliens, Fargo, The Godfather, Independence Day, Star Trek, Jurassic Park, Men In Black, The Lion King, Toy Story, Aladdin, and Pinocchio.]

Figure 10: The results for the MovieLens dataset: (a) user map (dominant classes of
generations and genders are indicated by the correspondence overlay, 20s M and 20s F
respectively represent men and women in their 20’s); and (b) movie map (movie genres
are indicated by the correspondence overlay, as well as some popular movie titles).

6 TSOM Variations
In this section, we introduce several variations of the TSOM, along with example applications.
We do not present the corresponding TGTM variations, but they are also possible.

6.1 Missing data estimation


As previously discussed, user–item rating data is a typical application of TSOM. In
real applications, customers usually purchase a small subset of the items, so most of
the score table is missing. Missing value estimates are necessary when analyzing real
data, and are also important when developing recommendation systems [5, 6].
Assume that there are two modes (M = 2), to simplify the explanation. When the
given dataset contains missing values, the M step (47) becomes

$$Y_{k_1 k_2 d} = \frac{1}{G_{k_1 k_2 d}} \sum_{n_1=1}^{N_1} \sum_{n_2=1}^{N_2} R^{(1)}_{k_1 n_1} R^{(2)}_{k_2 n_2} \Gamma_{n_1 n_2 d} X_{n_1 n_2 d}. \qquad (104)$$

Here, $G_{k_1 k_2 d} = \sum_{n_1=1}^{N_1} \sum_{n_2=1}^{N_2} R^{(1)}_{k_1 n_1} R^{(2)}_{k_2 n_2} \Gamma_{n_1 n_2 d}$, and $\Gamma_{n_1 n_2 d}$ indicates whether $X_{n_1 n_2 d}$ is miss-
ing. Thus, $\Gamma_{n_1 n_2 d} = 0$ when $X_{n_1 n_2 d}$ is missing, otherwise $\Gamma_{n_1 n_2 d} = 1$. The instance
manifolds are also calculated in a similar way.
In the E step, we can determine winners in two ways. One way is to use (77), as in
the ordinary case. This method results in a quick winning decision. When the missing
ratio increases, there may only be a few data points for each instance. In such cases,
it is better to directly evaluate (76), because the distance between the instance and the
slice manifolds may be affected by missing data. After the training process, the missing
values are estimated using $\tilde{X}_{n_1 n_2 d} = Y_{k^{*(1)}_{n_1} k^{*(2)}_{n_2} d}$.
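A minimal numpy sketch of the masked M step (104) and the final reconstruction follows; the einsum formulation and the variable names are our own, and the responsibility matrices R1 and R2 are assumed to have been computed as in the standard TSOM E step.

```python
import numpy as np

def m_step_missing(X, Gamma, R1, R2, eps=1e-12):
    """Eq. (104): weighted M step that ignores missing entries.

    X     : (N1, N2, D) data tensor (missing entries may hold any value)
    Gamma : (N1, N2, D) mask, 1 where observed, 0 where missing
    R1    : (K1, N1), R2 : (K2, N2) responsibility matrices
    """
    num = np.einsum('ka,lb,abd,abd->kld', R1, R2, Gamma, X)
    G = np.einsum('ka,lb,abd->kld', R1, R2, Gamma)
    return num / np.maximum(G, eps)          # (K1, K2, D)

def reconstruct(Y, k1_star, k2_star):
    """Fill the missing entries with Y at the winners' positions."""
    # k1_star : (N1,) winners of mode 1, k2_star : (N2,) winners of mode 2
    return Y[np.ix_(k1_star, k2_star)]       # (N1, N2, D)
```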
We investigated this algorithm using the same artificial dataset as in Fig. 3. We
randomly removed part of the dataset, and then passed it to the TSOM. (At least 2

[Figure 11 graphics: a central member map surrounded by small ‘Sending to’ and ‘Receiving from’ conditional maps for selected members (CEO, President, Trader).]

Figure 11: The results for the Enron e-mail dataset. The center map is colored by the
marginal U-matrix, where the cluster borders are indicated by red regions. The num-
bers represent user IDs. The surrounding small maps are the conditional component
planes, representing the sending/receiving e-mail traffic from/by the conditioned user.

data points were left for each instance.) In this simulation, we estimated the winning
positions using (76). We repeated this protocol, changing the various missing ratios.
The results are shown in Fig. 9. The TSOM captured the manifold shape, even when
the missing ratio was higher than 90% (Fig. 9 (a)). The RMSE result also shows that
TSOM was robust to missing data (Fig. 9 (b)).
We also applied this algorithm to the MovieLens dataset5 , which contains user–item
rating data for movie titles. The dataset used here is a subset called MovieLens 100k,
which contains 100,000 rating scores on a 5-point scale. The scores were evaluated
by 943 people for 1,682 movie titles, and approximately 94% of the data are missing.
The dataset also contains the information for the respondent attributes and the movie
title genres, but they were not used in this experiment. To evaluate the missing data
estimation, we used the training and test dataset provided by MLcomp6 . The root
mean square error (RMSE) was 0.9533 ± 0.0008. This result is roughly equal to that of
the other methods ranked in the top 10 on MLcomp.
Here, we introduced a simple algorithm for estimating missing data. By applying
the EM algorithm, this algorithm may be integrated into the TGTM, so that the missing
values can be estimated at the same time as the nonlinear map. Such a probabilistic
approach is an important issue for further investigation.

5 http://grouplens.org/datasets/movielens/
6 http://www.mlcomp.org/

6.2 Using side information
In some situations, additional data, called side information, are also provided. For
example, the MovieLens dataset contains side information on age, gender, and
occupation as user attributes. In such cases, it is possible to consider the side information
together with the relational data.
Suppose that $\mathbf{x}^{s.i.(m)}_n$ is the side information of mode $m$. Then, the winner of mode
$m$ is determined using both the instance manifold and the side information. Let $\mathbf{y}^{s.i.(m)}_k$
be the $k$th reference vector of the side information of mode $m$, and let $D^{s.i.(m)}$ be the
dimension of $\mathbf{x}^{s.i.(m)}_n$. Then, the winner is determined by

$$k^{*(m)}_n = \arg\min_k \left\{ \frac{1}{K_{-m} D} \left\| Y^{(m)}_k - U^{(m)}_n \right\|^2 + \frac{\alpha^{(m)}}{D^{s.i.(m)}} \left\| \mathbf{y}^{s.i.(m)}_k - \mathbf{x}^{s.i.(m)}_n \right\|^2 \right\}. \qquad (105)$$

Here, $\alpha^{(m)}$ is a weight parameter that determines how much the side information affects
the winning decision. A typical value for $\alpha^{(m)}$ is 1, so the relational data and the side
information affect the winning decision equally. In the M step, $\mathbf{y}^{s.i.(m)}_k$ is also updated
according to

$$Y^{s.i.(m)} = \tilde{R}^{(m)} X^{s.i.(m)}. \qquad (106)$$

Thus, the TSOM works as an ordinary SOM with respect to the side information.
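A sketch of the winner rule (105) and the update (106) in numpy is given below; the flattened layout of the slice and instance manifolds, and all names, are assumptions made for this example.

```python
import numpy as np

def winner_with_side_info(Y_m, U_m, Y_si, X_si, alpha=1.0):
    """Eq. (105): winners of mode m from relational data plus side information.

    Y_m  : (K, K_rest, D) slice manifolds of mode m (other map axes flattened)
    U_m  : (N, K_rest, D) instance manifolds of mode m
    Y_si : (K, D_si) side-information reference vectors
    X_si : (N, D_si) side information of the instances
    """
    K_rest, D = Y_m.shape[1], Y_m.shape[2]
    rel = ((U_m[:, None] - Y_m[None]) ** 2).sum(axis=(2, 3)) / (K_rest * D)  # (N, K)
    side = ((X_si[:, None] - Y_si[None]) ** 2).sum(axis=2) / X_si.shape[1]   # (N, K)
    return np.argmin(rel + alpha * side, axis=1)

def update_side_info(R_tilde, X_si):
    """Eq. (106): M-step update of the side-information reference vectors."""
    return R_tilde @ X_si                    # (K, N) @ (N, D_si) -> (K, D_si)
```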
We applied this algorithm to the MovieLens dataset. The side information for the
respondents (age, gender, and occupation) and the movie titles (genres such as drama,
comedy, SF, etc.) were considered in the winning decision with $\alpha^{(m)} = 1$. The obtained
maps are shown in Fig. 10. The user map showed distinct cluster borders, which coin-
cided with users’ generations and genders. Some clusters of genres were also present
in the movie map.

6.3 TSOM for square relational data


For M-mode relational data, each mode corresponds to an object set, which is the item
of interest for our analysis. Thus, there are typically M object sets to be analyzed.
However, some modes may share the same object set. For example, there may be only one
object set to be analyzed, with each observed data point determined by a combination
of two members of the set. In this case, the observed data are modeled by a nonlinear
function, $\mathbf{x}_{n_1 n_2} \simeq f(\mathbf{z}_{n_1}, \mathbf{z}_{n_2})$, where both $\mathbf{z}_{n_1}$ and $\mathbf{z}_{n_2}$ belong to the same latent space, $\mathcal{Z}$. We
refer to this type of relational data as square, because the data tensor becomes a square
with respect to the two modes. This situation can be generalized to cases in which three
or more modes share an identical object set.
A representative example application is message traffic data between members of
the same community (e.g., e-mail traffic within a company). In this case, the sender
set is identical to the receiver set. When applied to this problem, basic TSOM–R2
organizes two different maps (i.e., the sender and receiver maps), which contain differ-
ent arrangements of the community members. However, if we wish to analyze mutual
interactions between members, we need a single topographic map representing the
member relationships within the community. To obtain such a unified map, the winner

should be determined with respect to both modes. Thus, the winner of each member is
defined by
$$k^*_n = \arg\min_k \left( \left\| Y_{k::} - U^{(1)}_{n::} \right\|^2 + \left\| Y_{:k:} - U^{(2)}_{:n:} \right\|^2 \right), \qquad (107)$$

instead of (28) – (30).
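The unified winner (107) can be written compactly in numpy. The sketch below assumes a (K, K, D) reference tensor and per-member instance manifolds stored as the corresponding row and column slices; this arrangement and the names are ours, not taken from the paper.

```python
import numpy as np

def winner_square(Y, U1, U2):
    """Eq. (107): one common winner per member for square relational data.

    Y  : (K, K, D) reference tensor (both modes share the same map)
    U1 : (N, K, D) mode-1 instance manifolds (members as senders)
    U2 : (K, N, D) mode-2 instance manifolds (members as receivers)
    """
    d1 = ((U1[:, None] - Y[None]) ** 2).sum(axis=(2, 3))              # (N, K)
    U2t = np.moveaxis(U2, 1, 0)                                        # (N, K, D)
    Yt = np.moveaxis(Y, 1, 0)                                          # Yt[k] = Y[:, k]
    d2 = ((U2t[:, None] - Yt[None]) ** 2).sum(axis=(2, 3))             # (N, K)
    return np.argmin(d1 + d2, axis=1)                                  # (N,)
```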


We applied this algorithm to the e-mail data of the Enron dataset7 [44, 45]. This
dataset consists of e-mails from 148 Enron employees. The procedure for the experi-
ment is summarized as follows. The relational data used here is the number of e-mails
between members, determined using the ‘To:’ field. Because the distribution has a
long tail, we transformed the traffic number using the exponential function so that the
data were approximately uniformly distributed in [0, 1]. E-mail traffic to oneself was
ignored and treated as a missing value.
The results are shown in Fig. 11. We can recognize some clusters using the marginal
U-matrix (center map). The most distinct cluster represents the executives, whose e-
mail patterns are quite different from those of the other members. The traffic between two nodes
of the map can be visualized using the conditional component planes (small maps).
These results show that e-mails were mainly exchanged within each cluster, suggesting
that the clusters correspond to company departments.
In this section, we presented two practical applications to demonstrate the potential
of the TSOM. Needless to say, further investigations are necessary to examine the
validity of the obtained maps. What we wish to emphasize here is that TSOM can
be extensively applied to many fields with some modifications, and can extract rich
information by combining visualization techniques.

7 Discussion: The Tensor SOM Family


In this section, we compare TSOM with another SOM variation that is relevant to a
tensorial representation. This discussion is also an attempt at constructing a theoret-
ical background for the TSOM family. Furthermore, the significance of TSOM as a
generalization of the conventional SOM is also discussed.

7.1 Comparison with SOM of SOMs (SOM2 )


Among SOM variations, SOM2 (or ‘SOM of SOMs’) is the most closely related to our method
[33, 34]. TSOM and SOM2 are very similar, and there are many parallels in their
tasks and algorithms. First, the common aim of both SOMs is to estimate the nonlin-
ear map from two or more latent spaces to the observation space, which is represented
by a tensor. Second, the winner for each instance is determined by the distance from
the instance manifold to the slice manifolds. (In the original paper, these are called
‘sections’, and the other orthogonal manifolds are ‘fibers’, following fiber bundle
terminology [34].) However, the most essential difference relates to their hierarchical struc-
tures. There is a distinct hierarchy between modes, which are referred to as ‘parent’

7 https://www.cs.cmu.edu/~./enron/


Figure 12: Graphical representation of the generative model for SOM2 (TSOM–H2 ).

and ‘children’ in SOM2 , whereas there is no such hierarchy in TSOM. The difference
originates in the type of data structures used by these algorithms. Thus, these two al-
gorithms share the same goal (estimating the nonlinear map from the multiple latent
spaces to the observation space), but start from different points (the data structure of the
given dataset). In this sense, it is better to call SOM2 a ‘tensor SOM for hierarchical
data’ (TSOM-H), and to call the proposed algorithm a ‘tensor SOM for relational data’
(TSOM-R). Clarifying the similarities and differences between the two algorithms is
important when comprehensively constructing the family of TSOMs. Hereafter, SOM2
is referred to as TSOM-H and is regarded as a member of the TSOM family.

7.2 TSOM–H2 (SOM2 ) algorithm


The essential difference resides in the data structure, so let us start from the generative
model of SOM2 , i.e., TSOM–H2 . Suppose that we have two object sets, Ω(1) and Ω(2) ,
and suppose that Ω(2) is the parent of Ω(1) in the data observation. Here ‘parent’ means
that it behaves like a parameter rather than a variable, and a series of observations
is made for various child objects under the same parent object. For example, if we
independently conduct consumer surveys for every new product, then the product set
becomes the parent of the respondent set. Note that the obtained dataset is no longer
relational, because different respondents are sampled for every survey. Therefore, we
do not know how the same respondent evaluates other products.
Under this situation, the observed dataset is obtained in the following way. First,
we sample a set of instances of the parent mode, $\Omega^{(2)}_S = \{\omega^{(2)}_{n_2}\}$, and then independently
sample the instances of the child mode for each $\omega^{(2)}_{n_2}$. Thus, we have $N_2$ sample sets
with respect to mode 1, $\Omega^{(1)}_{S|\omega^{(2)}_{n_2}} = \{\omega^{(1)}_{n_1|n_2}\}$. Note that $\omega^{(1)}_{n_1|n_2} \neq \omega^{(1)}_{n_1|n_2'}$ if $n_2 \neq n_2'$. As
a result, we obtain $N_1 \times N_2$ data $\{\mathbf{x}_{n_1|n_2}\}$ from the combinations $(\omega^{(1)}_{n_1|n_2}, \omega^{(2)}_{n_2})$. This
generative model is shown in Fig. 12. (Strictly speaking, the sample number $N_1$ is also
determined independently for each $\omega^{(2)}_{n_2}$, but we assume that they are equal for ease
of explanation.)

[Figure 13 graphics: rows show the data structure diagram, an example sampled dataset, the instance manifolds, and the organized map for (a) TSOM–R2, (b) TSOM–H2, and (c) TSOM–C2 (SOM).]

Figure 13: Comparison of three possible data structures and their corresponding algo-
rithms: (a) relational data with TSOM–R2 ; (b) hierarchical data with TSOM–H2 ; and
(c) coupled data with TSOM–C2 (which becomes ordinary SOM). The 1st row contains
the data structure diagrams, and the 2nd shows examples of the sampled dataset. The
relational data in (a) are aligned in two directions, while the hierarchical data in (b) are
aligned in one direction. No data alignment was applied to the coupled data case in (c).
The 3rd row contains the instance manifolds calculated by TSOM–R2 and TSOM–H2 .
No instance manifold is estimated in TSOM–C2 . The 4th row contains the estimated
nonlinear maps.

Considering the above discussion, the SOM2 algorithm can be rewritten as TSOM–H2 ,
a variation of TSOM. Although the algorithm described below looks quite different
from the original, they are equivalent. In the E step, the winner is determined according
to

$$k^{*(1)}_{n_1|n_2} = \arg\min_{k_1} \left\| \mathbf{y}_{k_1 k^{*(2)}_{n_2}:} - \mathbf{x}_{n_1|n_2} \right\|^2 \qquad (108)$$

$$k^{*(2)}_{n_2} = \arg\min_{k_2} \left\| Y_{:k_2:} - U^{(2)}_{:n_2:} \right\|^2. \qquad (109)$$

(108) and (109) mean that the winner of the parent mode is determined by the dis-
tance between the instance manifold $U^{(2)}_{:n_2:}$ and the slice manifolds $Y_{:k_2:}$, whereas
the winner of the child mode is determined by the best matching unit within the best
matching slice manifold of its parent.

[Figure 14 graphics: data structure diagrams (a) TSOM–R3, (b) TSOM–HR2, (c) TSOM–R2H, (d) TSOM–H2R, and (e) TSOM–H3.]

Figure 14: Diagrams of the possible 3rd order data structures.

[Figure 15 graphics: three example data structure diagrams, (a)–(c).]

Figure 15: Three examples of data structures with two or more observation spaces.
Note that there are more possible data structures. The data structure in (c) corresponds
to relational data with side information.

After determining the winners, the responsibility matrices $R^{(1)}_{n_2}$ and $R^{(2)}$ are calcu-
lated using the neighborhood matrices $H^{(1)}$ and $H^{(2)}$, and the winning matrices $B^{(1)}_{n_2}$ and
$B^{(2)}$. That is,

$$R^{(1)}_{n_2} = H^{(1)} B^{(1)}_{n_2} \qquad (110)$$

$$R^{(2)} = H^{(2)} B^{(2)}. \qquad (111)$$

Note that the responsibility of the child mode $R^{(1)}_{n_2}$ depends on its parent $\omega^{(2)}_{n_2}$, whereas
the responsibility of the parent $R^{(2)}$ does not depend on its child mode.
In the M step, the map is updated according to

$$U^{(2)}_{:n_2:} = X_{:n_2:} \times_1 \tilde{R}^{(1)}_{n_2} \qquad (112)$$

$$Y = U^{(2)} \times_2 \tilde{R}^{(2)}. \qquad (113)$$

We iterate over these two steps until the map converges.
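To make the hierarchical structure concrete, a numpy sketch of one batch E/M iteration of TSOM–H2 follows. Carrying the parent instance manifolds U2 between iterations, normalizing the responsibilities row-wise, and all variable names are choices made for this sketch rather than details taken from the original paper.

```python
import numpy as np

def tsom_h2_iteration(X, Y, U2, H1, H2):
    """One batch iteration of TSOM-H2 (Eqs. 108-113).

    X  : (N1, N2, D) data; X[:, n2] are the child observations under parent n2
    Y  : (K1, K2, D) reference tensor
    U2 : (K1, N2, D) parent instance manifolds from the previous M step
    H1 : (K1, K1), H2 : (K2, K2) neighborhood matrices of the child/parent maps
    """
    # E step: parent winners (109), then child winners in the parent's slice (108)
    d2 = ((U2[:, :, None] - Y[:, None]) ** 2).sum(axis=(0, 3))      # (N2, K2)
    k2_star = d2.argmin(axis=1)
    Yslice = Y[:, k2_star]                                          # (K1, N2, D)
    d1 = ((X[:, None] - Yslice[None]) ** 2).sum(axis=3)             # (N1, K1, N2)
    k1_star = d1.argmin(axis=1)                                     # (N1, N2)

    # responsibilities (110)-(111), normalized so that they average the data
    R1 = H1[:, k1_star]                                             # (K1, N1, N2)
    R1 = R1 / R1.sum(axis=1, keepdims=True)
    R2 = H2[:, k2_star]                                             # (K2, N2)
    R2 = R2 / R2.sum(axis=1, keepdims=True)

    # M step (112)-(113)
    U2_new = np.einsum('kab,abd->kbd', R1, X)                       # (K1, N2, D)
    Y_new = np.einsum('lb,kbd->kld', R2, U2_new)                    # (K1, K2, D)
    return Y_new, U2_new, k1_star, k2_star
```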


Fig. 13 shows how TSOM–R2 and TSOM–H2 estimate the nonlinear map from the
given datasets. In TSOM–R2 , the instance manifolds are estimated for every mode,
whereas in TSOM–H2 , they are only estimated for the parent mode.

7.3 Data structure diagram
To illustrate the data structures, let us introduce the diagram in Fig. 13 (top row). In this
diagram, each sample set is represented by a circle node and its mode number. When
two or more nodes are integrated into a set of sample combinations, they are packaged
into an oval node. Nodes with no links mean that their samples were selected
independently of the other nodes. The symbol ‘×’ represents the direct product
of independent nodes, which produces all possible combinations of two sample sets.
The black thin arrow denotes the hierarchical sampling of the connected nodes, and
the nodes connected by double lines are coupled in the data sampling. Thus, a pair
of instances is always sampled together. The thick white arrow represents the data
observation, which is made for every member of the root node.
The top row of Fig. 13 (a) contains a diagram of a 2-mode relational dataset, dealt
with by TSOM–R2 . In this case, the data vectors $\{\mathbf{x}_{n_1 n_2}\}$ are observed for all members
of the sample set $\Omega_S = \{(\omega^{(1)}_{n_1}, \omega^{(2)}_{n_2})\}$.
Fig. 13 (b) depicts the 2-mode hierarchical dataset case, dealt with by TSOM–H2 . In
this case, the data vectors $\{\mathbf{x}_{n_1|n_2}\}$ are observed for all members of $\Omega_S = \{(\omega^{(1)}_{n_1|n_2}, \omega^{(2)}_{n_2})\}$.
This situation appears in transfer or multisystem learning tasks, where
the parent mode corresponds to the system parameter, and the child mode represents
the state variable under the given parameter [46, 47].
In the last situation, depicted in Fig. 13 (c), two nodes are tightly coupled in the data
observation. In this case, a pair of instances $(\omega^{(1)}_{n}, \omega^{(2)}_{n})$ is sampled simultaneously,
and then a data vector $\mathbf{x}_n$ is observed from the instance pair. Here, we refer to the
TSOM for this type of dataset as TSOM–C2 (tensor SOM for coupled data). Interestingly,
the TSOM–C2 algorithm is almost the same as the conventional SOM, because we
do not choose the winning position, but take the ordinary best matching point in the
entire nonlinear map. This situation corresponds to nonlinear ICA, and several previous
studies have shown that SOM can be applied to this task [48, 49]. It would be worth
investigating TSOM–C from the viewpoint of nonlinear ICA. This is an important issue
that we will consider in the future.
The difficulties of these tasks also differ across the three cases. In case (a),
TSOM–R2 must estimate $(N_1 + N_2)L$ latent variables from $N_1 N_2 D$ observed values,
whereas TSOM–H2 in case (b) must estimate $(N_1 N_2 + N_2)L$ unknowns from $N_1 N_2 D$
known values. Thus, the TSOM–H2 task is more difficult than that of TSOM–R2 . The most dif-
ficult is case (c), in which TSOM–C2 must estimate $2NL$ latent variables from $ND$
observed data. This is why TSOM–R2 captured the data distribution better than the ordi-
nary SOM (i.e., TSOM–C2 ).

7.4 TSOM family of order 3


If there are 3 or more modes, new situations arise in which the relational and the hier-
archical data structures are mixed. For example, we first obtain the sample set of mode
1, and then observe the relational data of modes 2 and 3 for each sample of mode 1. In
this case, the data vectors are denoted by $\mathbf{x}_{(n_2 n_3)|n_1}$. For 3-mode data, there are 5 pos-
sible data structures, as shown in Fig. 14. Here, we have excluded the data structures
with coupled modes. The corresponding TSOMs are (a) TSOM–R3 , (b) TSOM–HR2 ,

(c) TSOM–R2 H, (d) TSOM–H2 R, and (e) TSOM–H3 (SOM3 ). All these algorithms
can be derived by combining TSOM–R and TSOM–H. For example, the TSOM–HR2
algorithm (Fig. 14 (b)) can be described as follows.

E step

$$k^{*(1)}_{n_1|n_3} = \arg\min_{k_1} \left\| Y_{k_1 : k^{*(3)}_{n_3} :} - U^{(1)}_{n_1::|n_3} \right\|^2 \qquad (114)$$

$$k^{*(2)}_{n_2|n_3} = \arg\min_{k_2} \left\| Y_{: k_2 k^{*(3)}_{n_3} :} - U^{(2)}_{:n_2:|n_3} \right\|^2 \qquad (115)$$

$$k^{*(3)}_{n_3} = \arg\min_{k_3} \left\| Y_{::k_3:} - U^{(3)}_{::n_3:} \right\|^2 \qquad (116)$$

M step

$$U^{(1)}_{n_1::|n_3} = X_{::n_3:} \times_2 \tilde{R}^{(2)}_{n_3} \qquad (117)$$

$$U^{(2)}_{:n_2:|n_3} = X_{::n_3:} \times_1 \tilde{R}^{(1)}_{n_3} \qquad (118)$$

$$U^{(3)}_{n_3} = X_{::n_3:} \times_1 \tilde{R}^{(1)}_{n_3} \times_2 \tilde{R}^{(2)}_{n_3} \qquad (119)$$

$$Y = U^{(3)} \times_3 \tilde{R}^{(3)} \qquad (120)$$

Thus, two types of E and M steps are combined with respect to the data structure. The
algorithms for the other cases can be easily obtained in a similar way.
In some cases, there are two or more observation spaces. Some examples are pre-
sented in Fig. 15. In this figure, the data structure (a) is the case in which two different
surveys are made for the same group of respondents. The data structure of the Movie-
Lens with side information is represented by (c). When the number of modes increases,
the number of possible variations increases exponentially. Nevertheless, the TSOM
(and the TGTM) family can adapt to all data structures. Therefore, these algorithms
can be unified using a comprehensive theoretical system of topographic mapping fam-
ilies for tensorial data.

7.5 The relationship between the original SOM and TSOM


Finally, let us discuss the original SOM from the viewpoint of TSOM. Because the
original method can be regarded as a special case of TSOM, it is expected that TSOM
provides several new perspectives on the original SOM.
In the original SOM, the winning node is determined to be the node in the data
space nearest to a given data vector. Therefore the learning result depends on the met-
ric that determines distances in the data space. It is usual for every component of
the data vectors to be normalized so that their mean and variance are zero and one
respectively, and the Euclidean distance is applied. This protocol seems to treat all
components equally, but the metric problem still remains. If some components are al-
most identical to each other, then those duplicated components may affect the metric

more than others. For example, if a teacher tries to analyze students’ abilities using data
consisting of many physical indices and few academic indices, then academic ability is
almost ignored as noise. This metric problem is generally unavoidable in unsupervised
learning methods. Even though TSOM is also affected by this problem, TSOM can
reduce these undesired effects. By regarding the dataset as a 2-mode relational dataset,
TSOM generates two maps, namely a map of target objects and a map of data compo-
nents. If two components are identical, then those components are located in the same
position in the component map. Consequently, their influence in determining winners
is weakened, and thus TSOM generates a more moderate result. Therefore, TSOM is
recommended for use in conventional tasks where the original SOM has been applied.
The second perspective is relevant to the origins of the SOM. The SOM was origi-
nally a neural network model of a brain map of a visual field, and turned out to be useful
as a dimension reduction tool for practical applications [2]. Early studies investigated
both the self-organization of topology preserving neural projections in visual fields
(e.g., [50]), and the self-organization of visual features (e.g., [51]). Though those stud-
ies appear contiguous, there is a gap between them. The former is the self-organization
of the order of visual signal lines, whereas the latter orders the features of visual sig-
nals. Kohonen referred to these as type 1 and type 2 self-organization, respectively, and
he stated that his SOM is categorized as type 2 [52]. From the viewpoint of the TSOM,
type 1 corresponds to a map of components, while type 2 is a map of objects. In this
sense, TSOM unifies both types of self-organization.
The third perspective is relevant to the axes of the topographic map. Usually the
meaning of the axes is of little concern. However, by regarding the conventional 2-
dimensional square map as a product space of two 1-dimensional latent spaces, the
SOM becomes a TSOM for coupled data, that is, TSOM–C2 . As has already been
discussed above, this is related to nonlinear ICA. This is an important issue for TSOMs
that should be investigated further.

8 Conclusions
In this paper, we presented two algorithms for topographic mappings of tensorial data,
TSOM and TGTM. TGTM provides the theoretical background of the algorithms, and
TSOM is useful for practical applications. Among the variations, the TSOM with
orthonormal bases was computationally fast and had a high resolution. Therefore, it is
the best choice for large-scale tasks. We also presented various TSOM variations and
visualization methods, which further increase the applicability of TSOM to real data
analysis.
Theoretically, TSOM and TGTM can be derived from the generative model by
applying the EM algorithm. SOM2 (SOM of SOMs) is a sibling of TSOM that is
adapted to a hierarchical data structure. We have shown that SOM2 can be unified
to the TSOM family. Therefore, this work presented some nonlinear tensor analysis
methods, and attempted to establish a comprehensive theoretical system for the family
of algorithms.

Acknowledgement
This work was supported by JSPS KAKENHI Grant Numbers 23500280, 22120510.

References
[1] T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin Heidelberg, 2001.

[2] T. Kohonen, Self-organized formation of topologically correct feature maps, Bi-


ological Cybernetics 43 (1) (1982) 59–69.

[3] C. M. Bishop, M. Svensen, C. K. I. Williams, GTM: The generative topographic


mapping, Neural Computation 10 (1998) 215–234.

[4] C. M. Bishop, M. Svensen, C. K. I. Williams, Developments of the generative


topographic mapping., Neurocomputing 21 (1-3) (1998) 203–224.

[5] G. Ricci, M. de Gemmis, G. Semeraro, Matrix and tensor factorization techniques


applied to recommender systems: a survey, International Journal of Computer and
Information Technology 1 (2012) 94–98.
[6] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender
systems, Computer 42 (8) (2009) 30–37. doi:10.1109/MC.2009.263.
[7] A. Karatzoglou, X. Amatriain, L. Baltrunas, N. Oliver, Multiverse recom-
mendation: N-dimensional tensor factorization for context-aware collabora-
tive filtering, in: Proceedings of the Fourth ACM Conference on Recom-
mender Systems, RecSys ’10, ACM, New York, NY, USA, 2010, pp. 79–86.
doi:10.1145/1864708.1864727.

[8] E. Acar, S. A. Çamtepe, M. S. Krishnamoorthy, B. Yener, Modeling and multi-


way analysis of chatroom tensors, in: Proceedings of the 2005 IEEE International
Conference on Intelligence and Security Informatics, ISI’05, Springer-Verlag,
Berlin, Heidelberg, 2005, pp. 256–268.
[9] E. Acar, S. A. Camtepe, B. Yener, Collective sampling and analysis of high order
tensors for chatroom communications, in: in ISI 2006: IEEE International Con-
ference on Intelligence and Security Informatics, Springer, 2006, pp. 213–224.

[10] M. W. Berry, M. Browne, Email surveillance using non-negative matrix factor-


ization., Computational & Mathematical Organization Theory 11 (3) (2005) 249–
264.
[11] B. W. Bader, M. W. Berry, M. Browne, Discussion tracking in Enron email using
PARAFAC, in: M. W. Berry, M. Castellanos (Eds.), Survey of Text Mining II:
Clustering, Classification, and Retrieval, Springer, 2008, Ch. 8, pp. 147–163.

[12] F. Miwakeichi, E. Martínez-Montes, P. A. Valdés-Sosa, N. Nishiyama,
H. Mizuhara, Y. Yamaguchi, Decomposing EEG data into space-time-frequency
components using parallel factor analysis, NeuroImage 22 (3) (2004) 1035 –
1045.
[13] J. Li, L. Zhang, D. Tao, H. Sun, Q. Zhao, A prior neurophysiologic knowledge
free Tensor-Based scheme for single trial EEG classification, Neural Systems and
Rehabilitation Engineering, IEEE Transactions on 17 (2) (2009) 107–115.
[14] M. A. O. Vasilescu, D. Terzopoulos, Multilinear image analysis for facial recog-
nition, in: ICPR (2), 2002, pp. 511–514. doi:10.1109/ICPR.2002.1048350.
[15] J. Yang, D. Zhang, A. F. Frangi, J.-y. Yang, Two-dimensional PCA:
A new approach to appearance-based face representation and recogni-
tion, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
doi:10.1109/TPAMI.2004.1261097.
[16] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, H. Zhang, Multilinear discriminant
analysis for face recognition, IEEE Trans. on Image Processing 16 (1) (2007)
212–220.
[17] N. E. Helwig, S. Hong, J. D. Polk, Parallel factor analysis of gait waveform data:
A multimode extension of principal component analysis, Human Movement Sci-
ence 31 (3) (2012) 630 – 648.
[18] T. G. Kolda, B. W. Bader, Tensor decompositions and applications, SIAM RE-
VIEW 51 (3) (2009) 455–500.
[19] E. Acar, B. Yener, Unsupervised multiway data analysis: A literature survey,
IEEE Transactions on Knowledge and Data Engineering 21 (2008) 6–20.
[20] A. Cichocki, R. Zdunek, A.-H. Phan, S. Amari, Nonnegative Matrix and Tensor
Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind
Source Separation, John Wiley & Sons, Ltd, 2009.
[21] L. D. Lathauwer, B. D. Moor, J. Vandewalle, A multilinear singular value decom-
position, SIAM J. Matrix Anal. Appl 21 (2000) 1253–1278.
[22] H. Lu, K. N. Plataniotis, A. N. Venetsanopoulos, MPCA: Multilinear principal
component analysis of tensor objects., IEEE Transactions on Neural Networks
19 (1) (2008) 18–39.
[23] H. Lu, K. N. Plataniotis, A. N. Venetsanopoulos, A survey of multilinear subspace
learning for tensor data, Pattern Recognition 44 (7) (2011) 1540 – 1551.
[24] C. F. Beckmann, S. M. Smith, Tensorial extensions of independent component
analysis for multisubject FMRI analysis, Neuroimage 25 (1) (2005) 294–311.
[25] M. A. O. Vasilescu, D. Terzopoulos, Multilinear independent components analy-
sis, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, IEEE Computer Society, 2005, pp. 547–553.

[26] M. Welling, M. Weber, Positive tensor factorization., Pattern Recognition Letters
22 (12) (2001) 1255–1261.

[27] A. Shashua, T. Hazan, Non-negative tensor factorization with applications to


statistics and computer vision, in: Proceedings of the International Conference
on Machine Learning, 2005, pp. 792–799.
[28] L. R. Tucker, Implications of factor analysis of three-way matrices for measure-
ment of change, in: C. W. Harris (Ed.), Problems in Measuring Change, Univer-
sity of Wisconsin Press, 1963, pp. 122–137.

[29] M. Signoretto, L. D. Lathauwer, J. A. Suykens, A kernel-based framework to


tensorial data analysis, Neural Networks 24 (8) (2011) 861 – 874, artificial Neural
Networks: Selected Papers from ICANN 2010.
[30] Q. Zhao, G. Zhou, T. Adali, L. Zhang, A. Cichocki, Kernel-based tensor partial
least squares for reconstruction of limb movements, in: ICASSP’13, 2013, pp.
3577–3581.

[31] C.-S. Lee, A. Elgammal, Modeling view and posture manifolds for tracking, IEEE
International Conference on Computer Vision (2007) 1–8.

[32] X. Gao, C. Tian, Multi-view face recognition based on tensor subspace analysis
and view manifold modeling, Neurocomput. 72 (16-18) (2009) 3742–3750.

[33] T. Furukawa, SOM of SOMs: Self-organizing map which maps a group of self-
organizing maps, in: W. Duch, J. Kacprzyk, E. Oja, S. Zadrozny (Eds.), Artificial
Neural Networks: Biological Inspirations, Vol. 3696 of Lecture Notes in Com-
puter Science, Springer, 2005, pp. 391–396.

[34] T. Furukawa, SOM of SOMs, Neural Networks 22 (4) (2009) 463–478.


[35] T. Heskes, J.-J. Spanjers, W. Wiegerinck, EM algorithms for self-organizing
maps., in: IJCNN (6), 2000, pp. 9–14.
[36] J. Verbeek, N. Vlassis, B. Krose, Self-organizing mixture models, Neurocomput-
ing 63 (2005) 99–123. doi:10.1016/j.neucom.2004.04.008.

[37] S. P. Luttrell, Derivation of a class of training algorithms, IEEE Transactions on


Neural Networks 1 (2) (1990) 229–232. doi:10.1109/72.80234.

[38] Y. Cheng, Convergence and ordering of Kohonen’s batch map, Neural Comput.
9 (8) (1997) 1667–1676.

[39] T. Graepel, M. Burger, K. Obermayer, Self-organizing maps: Generalizations and


new optimization techniques, Neurocomputing 21 (1998) 173–190.

[40] I. Olier, A. Vellido, Variational bayesian generative topographic mapping., J.


Math. Model. Algorithms 7 (4) (2008) 371–387.

[41] T. Kamishima, H. Kazawa, S. Akaho, A survey and empirical comparison of ob-
ject ranking methods, in: J. Fürnkranz, E. Hüllermeier (Eds.), Preference Learn-
ing, Springer, 2010, pp. 181–201.

[42] A. Ultsch, H. P. Siemon, Kohonen’s self organizing feature maps for exploratory
data analysis., in: Proc. INNC’90, Int. Neural Network Conf., 1990, pp. 305–308.
[43] P. Stefanovic, O. Kurasova, Visual analysis of self-organizing maps, Nonlinear
Analysis: Modelling and Control 16 (4) (2011) 488–504.
[44] B. Klimt, Y. Yang, The Enron corpus: A new dataset for email classification
research, in: Machine Learning: ECML 2004, 15th European Conference on
Machine Learning, Pisa, Italy, September 20-24, 2004, Proceedings, 2004, pp.
217–226.
[45] A. McCallum, X. Wang, A. Corrada-Emmanuel, Topic and role discovery in so-
cial networks with experiments on Enron and academic email, J. Artif. Int. Res.
30 (1) (2007) 249–272.

[46] T. Ohkubo, T. Furukawa, K. Tokunaga, Requirements for the learning of multi-


ple dynamics, in: J. Laaksonen, T. Honkela (Eds.), Advances in Self-Organizing
Maps, Vol. 6731 of Lecture Notes in Computer Science, Springer, 2011, pp. 101–
110.

[47] T. Furukawa, K. Natsume, T. Ohkubo, Research on multi-system learning theory:


A case study of brain-inspired system research, in: Proc. of SCIS-ISIS 2012,
2012, pp. 311–314.
[48] P. Pajunen, A. Hyvarinen, J. Karhunen, Nonlinear blind source separation by self-
organizing maps, in: In Proc. Int. Conf. on Neural Information Processing, 1996,
pp. 1207–1210.

[49] M. Haritopoulos, H. Yin, N. M. Allinson, Image denoising using self-organizing


map-based nonlinear independent component analysis., Neural Networks 15 (8-9)
(2002) 1085–1098.

[50] S.-I. Amari, Topographic organization of nerve fields, Bulletin of Mathematical


Biology 42 (3) (1980) 339–364. doi:10.1007/BF02460791.

[51] C. von der Malsburg, Self-organization of orientation sensitive cells in the striate
cortex, Kybernetik 14 (2) (1973) 85–100. doi:10.1007/BF00288907.

[52] T. Kohonen, Self-organizing neural projections, Neural Networks 19 (6–7) (2006)


723 – 733.
