
Chapter 3

Constant-Time Algorithms for Continuous Optimization Problems

Yuichi Yoshida

Abstract In this chapter, we consider constant-time algorithms for continuous optimization problems. Specifically, we consider quadratic function minimization and tensor decomposition, both of which have numerous applications in machine learning and data mining. The key component in our analysis is graph limit theory, which was originally developed to study graphs analytically.

3.1 Introduction

In this chapter, we turn our attention to constant-time algorithms for continuous optimization problems. Specifically, we consider quadratic function minimization
and tensor decomposition, both of which have numerous applications in machine
learning and data mining. The key component in our analysis is graph limit theory,
which was originally developed to study graphs analytically.
We introduce graph limit theory in Sect. 3.2, and then discuss quadratic function
minimization and tensor decomposition in Sects. 3.3 and 3.4, respectively. Through-
out this chapter, we assume the real RAM model, in which we can perform basic
algebraic operations on real numbers in one step. For a positive integer n, let [n]
denote the set {1, 2, . . . , n}. For real values a, b, c ∈ R, a = b ± c is used as short-
hand for b − c ≤ a ≤ b + c. The algorithms and analysis presented in this chapter
are based on [5, 6].

3.2 Graph Limit Theory

This section reviews the basic concepts of graph limit theory. For further details,
refer to the book by Lovász [7].


We call a (measurable) function $\mathcal{W} : [0,1]^K \to \mathbb{R}$ a dikernel of order $K$. We define

$$|\mathcal{W}|_F = \sqrt{\int_{[0,1]^K} \mathcal{W}(x)^2 \, dx}, \qquad \text{(Frobenius norm)}$$

$$|\mathcal{W}|_{\max} = \max_{x \in [0,1]^K} |\mathcal{W}(x)|, \qquad \text{(max norm)}$$

$$|\mathcal{W}|_\square = \sup_{S_1,\ldots,S_K \subseteq [0,1]} \left| \int_{S_1 \times \cdots \times S_K} \mathcal{W}(x) \, dx \right|. \qquad \text{(cut norm)}$$

We note that these norms satisfy the triangle inequality. For two dikernels $\mathcal{W}$ and $\mathcal{W}'$, we define their inner product as $\langle \mathcal{W}, \mathcal{W}' \rangle = \int_{[0,1]^K} \mathcal{W}(x)\mathcal{W}'(x)\,dx$. For a dikernel $\mathcal{W} : [0,1]^2 \to \mathbb{R}$ and a function $f : [0,1] \to \mathbb{R}$, we define a function $\mathcal{W}f : [0,1] \to \mathbb{R}$ as $(\mathcal{W}f)(x) = \langle \mathcal{W}(x,\cdot), f \rangle$.
Let $\lambda$ be the Lebesgue measure on $[0,1]$. A map $\pi : [0,1] \to [0,1]$ is said to be measure-preserving if the pre-image $\pi^{-1}(X)$ is measurable for every measurable set $X$, and $\lambda(\pi^{-1}(X)) = \lambda(X)$. A measure-preserving bijection is a measure-preserving map whose inverse map exists and is also measurable (and, in turn, also measure-preserving). For a measure-preserving bijection $\pi : [0,1] \to [0,1]$ and a dikernel $\mathcal{W} : [0,1]^K \to \mathbb{R}$, we define a dikernel $\pi(\mathcal{W}) : [0,1]^K \to \mathbb{R}$ as $\pi(\mathcal{W})(x_1,\ldots,x_K) = \mathcal{W}(\pi(x_1),\ldots,\pi(x_K))$.
A partition $\mathcal{P} = (V_1,\ldots,V_p)$ of the interval $[0,1]$ is called an equipartition if $\lambda(V_i) = 1/p$ for every $i \in [p]$. For a dikernel $\mathcal{W} : [0,1]^K \to \mathbb{R}$ and an equipartition $\mathcal{P} = (V_1,\ldots,V_p)$ of $[0,1]$, we define $\mathcal{W}_{\mathcal{P}} : [0,1]^K \to \mathbb{R}$ as the dikernel obtained by averaging over each cell $V_{i_1} \times \cdots \times V_{i_K}$ for $i_1,\ldots,i_K \in [p]$. More formally, we define

$$\mathcal{W}_{\mathcal{P}}(x) = \frac{1}{\prod_{k \in [K]} \lambda(V_{i_k})} \int_{V_{i_1} \times \cdots \times V_{i_K}} \mathcal{W}(x')\,dx' = p^K \int_{V_{i_1} \times \cdots \times V_{i_K}} \mathcal{W}(x')\,dx',$$

where $i_k$ is the unique index such that $x_k \in V_{i_k}$ for each $k \in [K]$.
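As a concrete illustration, the following numpy sketch computes the discrete analogue of $\mathcal{W}_{\mathcal{P}}$ for an order-2 dikernel given as a matrix: it averages the matrix over the cells of a $p$-part equipartition. The function name and the divisibility assumption ($n$ divisible by $p$) are ours, chosen only to keep the sketch short.

import numpy as np

def block_average(W: np.ndarray, p: int) -> np.ndarray:
    """Discrete analogue of W_P for an order-2 dikernel given as an n x n
    matrix: average W over each cell of a p-part equipartition of the rows
    and columns (n is assumed to be divisible by p for simplicity)."""
    n = W.shape[0]
    assert W.shape == (n, n) and n % p == 0
    b = n // p  # block side length
    # Average each b x b block; the result is piecewise constant on the grid.
    blocks = W.reshape(p, b, p, b).mean(axis=(1, 3))
    # Expand back to n x n so that W_P has the same domain as W.
    return np.kron(blocks, np.ones((b, b)))

# Example: the coarsened matrix stays close to W when W has block structure,
# even if p is much smaller than n.
W = np.random.rand(12, 12)
WP = block_average(W, p=3)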
The following lemma states that any dikernel $\mathcal{W} : [0,1]^K \to \mathbb{R}$ can be well approximated by $\mathcal{W}_{\mathcal{P}}$ for some equipartition $\mathcal{P}$ into a small number of parts.

Lemma 3.1 (Weak regularity lemma for dikernels [4]) Let $\mathcal{W}^1,\ldots,\mathcal{W}^T : [0,1]^K \to \mathbb{R}$ be dikernels. Then, for any $\epsilon > 0$, there exists an equipartition $\mathcal{P}$ into $|\mathcal{P}| \le 2^{O(T/\epsilon^{2K})}$ parts such that, for every $t \in [T]$,

$$|\mathcal{W}^t - \mathcal{W}^t_{\mathcal{P}}|_\square \le \epsilon \, |\mathcal{W}^t|_F.$$

We can construct a dikernel $\mathcal{X} : [0,1]^K \to \mathbb{R}$ from a tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ as follows. For an integer $n \in \mathbb{N}$, let $I^n_1 = [0, \frac{1}{n}]$, $I^n_2 = (\frac{1}{n}, \frac{2}{n}]$, $\ldots$, $I^n_n = (\frac{n-1}{n}, 1]$. For $x \in [0,1]$, we define $i_n(x) \in [n]$ as the unique integer such that $x \in I^n_{i_n(x)}$. We then define $\mathcal{X}(x_1,\ldots,x_K) = X_{i_{N_1}(x_1) \cdots i_{N_K}(x_K)}$. The main motivation for creating a dikernel from a tensor is that, in doing so, we can define the distance between two tensors $X$ and $Y$ of different sizes via the cut norm, that is, $|\mathcal{X} - \mathcal{Y}|_\square$, where $\mathcal{X}$ and $\mathcal{Y}$ are the dikernels corresponding to $X$ and $Y$, respectively.
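To make the construction concrete, the following numpy sketch (our own helper, not part of the analysis in [5, 6]) evaluates the step-function dikernel of a tensor at a point of $[0,1]^K$ and uses it to compare two matrices of different sizes on a common grid; computing the cut norm exactly is a separate problem, so the grid comparison is only a rough proxy.

import numpy as np

def dikernel_value(X: np.ndarray, xs) -> float:
    """Evaluate the step-function dikernel of a tensor X at a point xs in
    [0,1]^K: each coordinate x is mapped to the interval index i_N(x)."""
    idx = tuple(min(int(np.ceil(x * N)) - 1 if x > 0 else 0, N - 1)
                for x, N in zip(xs, X.shape))
    return float(X[idx])

# Two matrices of different sizes can be compared by evaluating their
# dikernels on a common grid.
X = np.random.rand(4, 4)
Y = np.random.rand(6, 6)
grid = (np.arange(12) + 0.5) / 12  # midpoints of a common refinement
D = np.array([[dikernel_value(X, (u, v)) - dikernel_value(Y, (u, v))
               for v in grid] for u in grid])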
Let $\mathcal{W} : [0,1]^K \to \mathbb{R}$ be a dikernel and let $S_k = (x^k_1,\ldots,x^k_s)$ for $k \in [K]$ be sequences of elements in $[0,1]$. Then, we define a dikernel $\mathcal{W}|_{S_1,\ldots,S_K} : [0,1]^K \to \mathbb{R}$ as follows: We first extract a tensor $W \in \mathbb{R}^{s \times \cdots \times s}$ by setting $W_{i_1 \cdots i_K} = \mathcal{W}(x^1_{i_1},\ldots,x^K_{i_K})$. Next, we define $\mathcal{W}|_{S_1,\ldots,S_K}$ as the dikernel corresponding to the tensor $W$. The following is the key technical lemma in the analysis of the algorithms given in the subsequent sections.
Lemma 3.2 Let $\mathcal{W}^1,\ldots,\mathcal{W}^T : [0,1]^K \to [-L, L]$ be dikernels. Let $S_1,\ldots,S_K$ be sequences of $s$ elements uniformly and independently sampled from $[0,1]$. Then, with probability at least $1 - \exp(-\Omega_K(s^2 (T/\log s)^{1/K}))$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that, for every $t \in [T]$, we have

$$|\mathcal{W}^t - \pi(\mathcal{W}^t|_{S_1,\ldots,S_K})|_\square = L \cdot O_K\!\left(\left(\frac{T}{\log s}\right)^{1/2K}\right),$$

where $O_K(\cdot)$ and $\Omega_K(\cdot)$ hide factors depending only on $K$.

3.3 Quadratic Function Minimization

Background
Quadratic functions are one of the most important function classes in machine learn-
ing, statistics, and data mining. Many fundamental problems such as linear regression,
k-means clustering, principal component analysis, support vector machines, and ker-
nel methods can be formulated as a minimization problem of a quadratic function.
See, e.g., [8] for more details.
In some applications, it is sufficient to compute the minimum value of a quadratic
function rather than its solution. For example, Yamada et al. [13] proposed an efficient
method for estimating the Pearson divergence, which provides useful information
about data, such as the density ratio [10]. They formulated the estimation problem
as the minimization of a squared loss and showed that the Pearson divergence can be
estimated from the minimum value. Least-squares mutual information [9] is another
example that can be computed in a similar manner.
Despite its importance, minimization of quadratic functions suffers from the issue
of scalability. Let n ∈ N be the number of variables. In general, this kind of min-
imization problem can be solved by quadratic programming (QP), which requires
poly(n) time. If the problem is convex and there are no constraints, then the prob-
lem is reduced to solving a system of linear equations, which requires $O(n^3)$ time.
Both methods easily become infeasible, even for medium-scale problems of, say,
n > 10000.
Although several techniques have been proposed to accelerate quadratic function minimization, they require at least linear time in n.

Algorithm 1
Input: $n \in \mathbb{N}$, query access to a matrix $A \in \mathbb{R}^{n \times n}$ and to vectors $d, b \in \mathbb{R}^n$, and $\epsilon, \delta \in (0,1)$.
1: $S \leftarrow$ a sequence of $s = s(\epsilon, \delta)$ indices independently and uniformly sampled from $[n]$.
2: return $\frac{n^2}{s^2} \min_{v \in \mathbb{R}^s} p_{s, A|_S, d|_S, b|_S}(v)$.

This is problematic when handling large-scale problems, where even linear time is slow or prohibitive. For example,
stochastic gradient descent (SGD) is an optimization method that is widely used
for large-scale problems. A nice property of this method is that, if the objective
function is strongly convex, it outputs a point that is sufficiently close to an optimal
solution after a constant number of iterations [1]. Nevertheless, each iteration needs
at least $\Omega(n)$ time to access the variables. Another popular technique is low-rank approximation such as Nyström's method [12]. The underlying idea is to approximate the input matrix by a low-rank matrix, which drastically reduces the time complexity. However, we still need to compute a matrix-vector product of size $n$, which requires $\Omega(n)$ time. Clarkson et al. [2] proposed sublinear-time algorithms for special cases of quadratic function minimization. However, these are "sublinear" with respect to the number of pairwise interactions of the variables, which is $\Theta(n^2)$, and the algorithms require $O(n \log^c n)$ time for some $c \ge 1$.
Constant-time algorithm for quadratic function minimization
Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, we consider the following quadratic problem:

$$\min_{v \in \mathbb{R}^n} \; p_{n,A,d,b}(v), \quad \text{where} \quad p_{n,A,d,b}(v) = \langle v, Av \rangle + n \langle v, \mathrm{diag}(d)\, v \rangle + n \langle b, v \rangle, \tag{3.1}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product and $\mathrm{diag}(d)$ denotes the diagonal matrix whose diagonal entries are specified by $d$. Although a constant term could be included in (3.1), it is omitted here because it is irrelevant when optimizing (3.1).
Let $z^* \in \mathbb{R}$ be the optimal value of (3.1) and let $\epsilon, \delta \in (0,1)$ be parameters. Our goal is to compute $z$ with $|z - z^*| = O(\epsilon n^2)$ with probability at least $1 - \delta$ in constant time. We further assume that we have query access to $A$, $b$, and $d$, with which we can obtain any entry by specifying its index. We note that $z^*$ is typically $\Theta(n^2)$ because $\langle v, Av \rangle$ consists of $\Theta(n^2)$ terms, and $\langle v, \mathrm{diag}(d)\, v \rangle$ and $\langle b, v \rangle$ consist of $\Theta(n)$ terms. Hence, we can regard an error of $O(\epsilon n^2)$ as an error of $O(\epsilon)$ for each term, which is reasonably small in typical situations.
Let $\cdot|_S$ be the operator that extracts the submatrix (or subvector) specified by an index sequence $S$. Our algorithm is then given by Algorithm 1, where the parameter $s := s(\epsilon, \delta)$ is determined later. In other words, we sample a constant number of indices from the set $[n]$, and then solve the problem (3.1) restricted to these indices.

Note that the number of queries and the time complexity are $O(s^2)$ and $\mathrm{poly}(s)$, respectively.
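As a concrete sketch of Algorithm 1, the following numpy code samples $s$ indices, solves the restricted problem, and rescales by $n^2/s^2$. It assumes the sampled problem is strictly convex, that is, $A|_S + s\,\mathrm{diag}(d|_S)$ is positive-definite, so that the restricted minimizer is obtained from a linear system; the function name is ours.

import numpy as np

def quadratic_min_constant_time(A, d, b, s, rng=None):
    """Sketch of Algorithm 1: estimate min_v p_{n,A,d,b}(v) from an s x s
    sample. Assumes the sampled problem is strictly convex, so the
    restricted minimizer solves a linear system."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    S = rng.integers(0, n, size=s)            # indices sampled with replacement
    A_S, d_S, b_S = A[np.ix_(S, S)], d[S], b[S]
    # Minimize p_{s,A|S,d|S,b|S}(v) = <v, A_S v> + s<v, diag(d_S) v> + s<b_S, v>.
    H = (A_S + A_S.T) / 2 + s * np.diag(d_S)  # half of the Hessian
    v = np.linalg.solve(2 * H, -s * b_S)
    z_tilde = v @ A_S @ v + s * v @ (d_S * v) + s * (b_S @ v)
    return (n ** 2 / s ** 2) * z_tilde

On a random positive-definite instance, the returned value approaches $z^*$ as $s$ grows, in line with Theorem 3.1 below.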
The goal of the rest of this section is to show the following approximation guar-
antee of Algorithm 1.
Theorem 3.1 Let $v^*$ and $z^*$ be an optimal solution and the optimal value, respectively, of problem (3.1). By choosing $s(\epsilon, \delta) = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta} \log\log\frac{1}{\delta})$, with probability at least $1 - \delta$, a sequence $S$ of $s$ indices independently and uniformly sampled from $[n]$ satisfies the following: Let $\tilde{v}^*$ and $\tilde{z}^*$ be an optimal solution and the optimal value, respectively, of the problem $\min_{v \in \mathbb{R}^s} p_{s, A|_S, d|_S, b|_S}(v)$. Then, we have

$$\left| \frac{n^2}{s^2} \tilde{z}^* - z^* \right| \le \epsilon L M^2 n^2,$$

where

$$L = \max\Bigl\{ \max_{i,j} |A_{ij}|,\; \max_i |d_i|,\; \max_i |b_i| \Bigr\} \quad \text{and} \quad M = \max\Bigl\{ \max_{i \in [n]} |v^*_i|,\; \max_{i \in [s]} |\tilde{v}^*_i| \Bigr\}.$$

We can show that $M$ is bounded when $A$ is symmetric and full rank. To see this, first note that we can assume $A + n\,\mathrm{diag}(d)$ is positive-definite, as otherwise $p_{n,A,d,b}$ is not bounded below and the problem is uninteresting. Then, for any sequence $S$ of $s$ indices, $(A + n\,\mathrm{diag}(d))|_S$ is again positive-definite because it is a principal submatrix. Hence, we have $v^* = -(A + n\,\mathrm{diag}(d))^{-1} n b / 2$ and $\tilde{v}^* = -(A|_S + s\,\mathrm{diag}(d|_S))^{-1} s\, b|_S / 2$, which means that $M$ is bounded.

3.3.1 Proof of Theorem 3.1

To use dikernels in our analysis, we first introduce a continuous version of $p_{n,A,d,b}$. The real-valued functional $P_{n,A,d,b}$ on functions $f : [0,1] \to \mathbb{R}$ is defined as

$$P_{n,A,d,b}(f) = \langle f, \mathcal{A} f \rangle + \langle f^2, \mathcal{D} \mathbf{1} \rangle + \langle f, \mathcal{B} \mathbf{1} \rangle,$$

where $\mathcal{A}$ is the dikernel corresponding to $A$, and $\mathcal{D}$ and $\mathcal{B}$ are the dikernels corresponding to $d\mathbf{1}^\top$ and $b\mathbf{1}^\top$, respectively, $f^2 : [0,1] \to \mathbb{R}$ is the function such that $f^2(x) = f(x)^2$ for every $x \in [0,1]$, and $\mathbf{1} : [0,1] \to \mathbb{R}$ is the constant function with value 1 everywhere. The following lemma states that the minimizations of $p_{n,A,d,b}$ and $P_{n,A,d,b}$ are equivalent:

Lemma 3.3 Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, for any $M > 0$, we have

$$\min_{v \in [-M, M]^n} p_{n,A,d,b}(v) = n^2 \cdot \inf_{f : [0,1] \to [-M, M]} P_{n,A,d,b}(f).$$



Proof First, we show that $n^2 \cdot \inf_{f : [0,1] \to [-M,M]} P_{n,A,d,b}(f) \le \min_{v \in [-M,M]^n} p_{n,A,d,b}(v)$. Given a vector $v \in [-M, M]^n$, we define $f : [0,1] \to [-M, M]$ as $f(x) = v_{i_n(x)}$. Then,

$$\langle f, \mathcal{A} f \rangle = \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} A_{ij} f(x) f(y) \, dx \, dy = \frac{1}{n^2} \sum_{i,j \in [n]} A_{ij} v_i v_j = \frac{1}{n^2} \langle v, A v \rangle,$$

$$\langle f^2, \mathcal{D} \mathbf{1} \rangle = \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} d_i f(x)^2 \, dx \, dy = \sum_{i \in [n]} \int_{I^n_i} d_i f(x)^2 \, dx = \frac{1}{n} \sum_{i \in [n]} d_i v_i^2 = \frac{1}{n} \langle v, \mathrm{diag}(d)\, v \rangle,$$

$$\langle f, \mathcal{B} \mathbf{1} \rangle = \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} b_i f(x) \, dx \, dy = \sum_{i \in [n]} \int_{I^n_i} b_i f(x) \, dx = \frac{1}{n} \sum_{i \in [n]} b_i v_i = \frac{1}{n} \langle v, b \rangle.$$

Hence, we have $n^2 P_{n,A,d,b}(f) \le p_{n,A,d,b}(v)$.


Next, we show that $\min_{v \in [-M,M]^n} p_{n,A,d,b}(v) \le n^2 \cdot \inf_{f : [0,1] \to [-M,M]} P_{n,A,d,b}(f)$. Let $f : [0,1] \to [-M, M]$ be a measurable function. For $x \in [0,1]$, we then have

$$\frac{\partial P_{n,A,d,b}(f)}{\partial f(x)} = \sum_{i \in [n]} \int_{I^n_i} A_{i\, i_n(x)} f(y) \, dy + \sum_{j \in [n]} \int_{I^n_j} A_{i_n(x)\, j} f(y) \, dy + 2 d_{i_n(x)} f(x) + b_{i_n(x)}.$$

Note that the form of this partial derivative depends only on $i_n(x)$. Hence, for an optimal solution $f^* : [0,1] \to [-M, M]$, we can assume $f^*(x) = f^*(y)$ whenever $i_n(x) = i_n(y)$. In other words, $f^*$ is constant on each of the intervals $I^n_1, \ldots, I^n_n$. For such an $f^*$, we define the vector $v \in \mathbb{R}^n$ as $v_i = f^*(x)$, where $x \in [0,1]$ is any element of $I^n_i$. Then, we have

$$\langle v, A v \rangle = \sum_{i,j \in [n]} A_{ij} v_i v_j = n^2 \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} A_{ij} f^*(x) f^*(y) \, dx \, dy = n^2 \langle f^*, \mathcal{A} f^* \rangle,$$

$$\langle v, \mathrm{diag}(d)\, v \rangle = \sum_{i \in [n]} d_i v_i^2 = n \sum_{i \in [n]} \int_{I^n_i} d_i f^*(x)^2 \, dx = n \langle (f^*)^2, \mathcal{D} \mathbf{1} \rangle,$$

$$\langle v, b \rangle = \sum_{i \in [n]} b_i v_i = n \sum_{i \in [n]} \int_{I^n_i} b_i f^*(x) \, dx = n \langle f^*, \mathcal{B} \mathbf{1} \rangle.$$

Hence, we have $p_{n,A,d,b}(v) \le n^2 P_{n,A,d,b}(f^*)$.

Proof (of Theorem 3.1) We instantiate Lemma 3.2 with $s = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta}\log\log\frac{1}{\delta})$ and the dikernels $\mathcal{A}$, $\mathcal{D}$, and $\mathcal{B}$. Then, with probability at least $1-\delta$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that

$$\max\Bigl\{\,|\langle f, (\mathcal{A} - \pi(\mathcal{A}|_S)) f\rangle|,\; |\langle f^2, (\mathcal{D} - \pi(\mathcal{D}|_S))\mathbf{1}\rangle|,\; |\langle f, (\mathcal{B} - \pi(\mathcal{B}|_S))\mathbf{1}\rangle|\,\Bigr\} \le \frac{\epsilon L M^2}{3}$$

for any function $f : [0,1] \to [-M, M]$. Conditioned on this event, and noting that the infimum below is unchanged when each dikernel is replaced by its image under $\pi$, we have

$$\begin{aligned}
\tilde{z}^* &= \min_{v \in \mathbb{R}^s} p_{s,A|_S,d|_S,b|_S}(v) = \min_{v \in [-M,M]^s} p_{s,A|_S,d|_S,b|_S}(v)\\
&= s^2 \cdot \inf_{f : [0,1] \to [-M,M]} P_{s,A|_S,d|_S,b|_S}(f) \qquad \text{(by Lemma 3.3)}\\
&= s^2 \cdot \inf_{f : [0,1] \to [-M,M]} \Bigl(\langle f, (\pi(\mathcal{A}|_S) - \mathcal{A}) f\rangle + \langle f, \mathcal{A} f\rangle + \langle f^2, (\pi(\mathcal{D}|_S) - \mathcal{D})\mathbf{1}\rangle\\
&\qquad\qquad + \langle f^2, \mathcal{D}\mathbf{1}\rangle + \langle f, (\pi(\mathcal{B}|_S) - \mathcal{B})\mathbf{1}\rangle + \langle f, \mathcal{B}\mathbf{1}\rangle\Bigr)\\
&= s^2 \cdot \inf_{f : [0,1] \to [-M,M]} \Bigl(\langle f, \mathcal{A} f\rangle + \langle f^2, \mathcal{D}\mathbf{1}\rangle + \langle f, \mathcal{B}\mathbf{1}\rangle\Bigr) \pm \epsilon L M^2 s^2\\
&= \frac{s^2}{n^2} \cdot \min_{v \in [-M,M]^n} p_{n,A,d,b}(v) \pm \epsilon L M^2 s^2 \qquad \text{(by Lemma 3.3)}\\
&= \frac{s^2}{n^2} \cdot \min_{v \in \mathbb{R}^n} p_{n,A,d,b}(v) \pm \epsilon L M^2 s^2 = \frac{s^2}{n^2} z^* \pm \epsilon L M^2 s^2.
\end{aligned}$$

Rearranging the inequality, we obtain the desired result.

3.4 Tensor Decomposition

Background
We say that a tensor (or a multidimensional array) is of order K if it is a K -
dimensional array. Each dimension is called a mode in tensor terminology. Tensor
decomposition, which approximates the input tensor by a number of smaller tensors,
is a fundamental tool for dealing with large tensors because it drastically reduces
memory usage.
Among the many existing tensor decomposition methods, Tucker decomposi-
tion [11] is a popular choice. To some extent, Tucker decomposition is analogous to
singular-value decomposition (SVD). Whereas SVD decomposes a matrix into left
and right singular vectors that interact via singular values, Tucker decomposition of
an order-K tensor consists of K factor matrices that interact via the so-called core
tensor. The key difference between SVD and Tucker decomposition is that, in the
latter, the core tensor does not need to be diagonal and its “rank” can differ for each
mode. We refer to the size of the core tensor, which is a K -tuple, as the Tucker rank
of a Tucker decomposition.
We are usually interested in obtaining factor matrices and a core tensor to minimize
the residual error—the error between the input and low-rank approximated tensors.
Sometimes, however, knowing the residual error itself is a task of interest. The
residual error tells us how suitable a low-rank approximation is to approximate the
input tensor in the first place, and is also useful to predetermine the Tucker rank.
In real applications, Tucker ranks are not explicitly given, and we must select them
by considering the tradeoff between space usage and approximation accuracy. For

example, if the selected Tucker rank is too small, we risk losing essential information
in the input tensor, whereas if the selected Tucker rank is too large, the computational
cost of computing the Tucker decomposition (even if we allow for approximation
methods) increases considerably along with space usage. As with the case of the
matrix rank, one might think that a reasonably good Tucker rank can be found using
a grid search. Unfortunately, grid search for an appropriate Tucker rank is challenging
because, for an order-K tensor, the Tucker rank consists of K free parameters and
the search space grows exponentially in K . Hence, we want to evaluate each grid
point as quickly as possible.
Although several practical algorithms have been proposed, such as the higher order
orthogonal iteration (HOOI) [3], they are not sufficiently scalable. For each mode,
HOOI iteratively applies SVD to an unfolded tensor, that is, a matrix reshaped from the input tensor. Given an $N_1 \times \cdots \times N_K$ tensor, the computational cost is hence $O(K \max_k N_k \cdot \prod_k N_k)$, which crucially depends on the input size $N_1, \ldots, N_K$.
Although there are several approximation algorithms, their computational costs are
still intensive.
Constant-time algorithm for the Tucker fitting problem
The problem of computing the residual error is formalized as the following Tucker fitting problem: Given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ and integers $R_k \le N_k$ $(k = 1, \ldots, K)$, we want to compute the following normalized residual error:

$$\ell_{R_1,\ldots,R_K}(X) := \min_{G \in \mathbb{R}^{R_1 \times \cdots \times R_K},\, \{U^{(k)} \in \mathbb{R}^{N_k \times R_k}\}_{k \in [K]}} \frac{\bigl| X - [[G; U^{(1)}, \ldots, U^{(K)}]] \bigr|_F^2}{\prod_{k \in [K]} N_k}, \tag{3.2}$$

where $[[G; U^{(1)}, \ldots, U^{(K)}]] \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ is the order-$K$ tensor defined as

$$[[G; U^{(1)}, \ldots, U^{(K)}]]_{i_1 \cdots i_K} = \sum_{r_1 \in [R_1], \ldots, r_K \in [R_K]} G_{r_1 \cdots r_K} \prod_{k \in [K]} U^{(k)}_{i_k r_k}$$

for every $i_1 \in [N_1], \ldots, i_K \in [N_K]$. Here, $G$ is the core tensor, and $U^{(1)}, \ldots, U^{(K)}$ are the factor matrices. Note that we are not concerned with computing the minimizer; we only want to compute the minimum value. In addition, we do not need the exact minimum. Indeed, a rough estimate still helps to narrow down promising rank candidates. The question here is how quickly we can compute the normalized residual error $\ell_{R_1,\ldots,R_K}(X)$ with moderate accuracy.
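For concreteness, the multilinear product $[[G; U^{(1)}, \ldots, U^{(K)}]]$ and the normalized residual appearing in (3.2) can be evaluated with a few lines of numpy; the helper names below are ours and the code only illustrates the definitions, it is not a solver for (3.2).

import numpy as np
from string import ascii_lowercase as letters

def multilinear_product(G, factors):
    """Compute [[G; U^(1),...,U^(K)]] by contracting each mode of the core G
    (shape R_1 x ... x R_K) with the factor U^(k) (shape N_k x R_k)."""
    K = G.ndim
    core = letters[:K]                  # indices r_1 ... r_K of G
    outs = letters[K:2 * K]             # indices i_1 ... i_K of the output
    terms = [core] + [outs[k] + core[k] for k in range(K)]
    return np.einsum(",".join(terms) + "->" + outs, G, *factors)

def normalized_residual(X, G, factors):
    """Normalized residual ||X - [[G; U's]]||_F^2 / prod_k N_k."""
    T = multilinear_product(G, factors)
    return np.sum((X - T) ** 2) / X.size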
In this section, we consider the following simple sampling algorithm and show that it can be used to approximately solve the Tucker fitting problem. First, given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, a Tucker rank $(R_1, \ldots, R_K)$, and a sample size $s \in \mathbb{N}$, we sample a sequence of indices $S_k = (x^k_1, \ldots, x^k_s)$ uniformly and independently from $[N_k]$ for each mode $k \in [K]$. We then construct a mini-tensor $X|_{S_1,\ldots,S_K} \in \mathbb{R}^{s \times \cdots \times s}$, where $(X|_{S_1,\ldots,S_K})_{i_1 \cdots i_K} = X_{x^1_{i_1} \cdots x^K_{i_K}}$. Finally, we compute $\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K})$ using an arbitrary solver, such as HOOI, and output the obtained value. The details are provided in Algorithm 2.

Algorithm 2 Sampling algorithm for the Tucker fitting problem
Input: $N_1, \ldots, N_K \in \mathbb{N}$, query access to a tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, Tucker rank $(R_1, \ldots, R_K)$, and $\epsilon, \delta \in (0,1)$.
1: for $k = 1$ to $K$ do
2:   $S_k \leftarrow$ a sequence of $s = s(\epsilon, \delta)$ indices uniformly and independently sampled from $[N_k]$.
3: Construct the mini-tensor $X|_{S_1,\ldots,S_K}$.
4: return $\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K})$.

Note that the time complexity for computing $\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K})$ does not depend on the input size $N_1, \ldots, N_K$ but rather on the sample size $s$, meaning that the algorithm runs in constant time, regardless of the input size.
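The following numpy sketch mirrors Algorithm 2; for the inner solver it uses a simple truncated HOSVD instead of HOOI, which suffices to illustrate the sampling step (and assumes $s \ge R_k$ for every mode). The function names are ours, and the code is only a sketch, not the implementation analyzed in [6].

import numpy as np

def hosvd_residual(X, ranks):
    """Approximate the Tucker fitting objective for a (small) tensor X by
    truncated HOSVD: take the top-R_k left singular vectors of each mode-k
    unfolding, project to get the core, and measure the residual."""
    K = X.ndim
    factors = []
    for k, R in enumerate(ranks):
        unfolding = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :R])
    core = X
    for k, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, k, 0), axes=1), 0, k)
    recon = core
    for k, U in enumerate(factors):
        recon = np.moveaxis(np.tensordot(U, np.moveaxis(recon, k, 0), axes=1), 0, k)
    return np.sum((X - recon) ** 2) / X.size

def tucker_residual_constant_time(X, ranks, s, rng=None):
    """Sketch of Algorithm 2: estimate the normalized residual from an
    s x ... x s uniformly sampled mini-tensor."""
    rng = np.random.default_rng(rng)
    S = [rng.integers(0, N, size=s) for N in X.shape]
    mini = X[np.ix_(*S)]
    return hosvd_residual(mini, ranks)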
The goal of the rest of this section is to show the following approximation guar-
antee of Algorithm 2.
Theorem 3.2 Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, $R_1, \ldots, R_K$ be integers, and $\epsilon, \delta \in (0,1)$. For $s(\epsilon, \delta) = 2^{\Theta(1/\epsilon^{2K-2})} + \Theta(\log\frac{1}{\delta} \log\log\frac{1}{\delta})$, we have the following. Let $S_1, \ldots, S_K$ be sequences of indices as defined in Algorithm 2. Let $(G^*, U^*_1, \ldots, U^*_K)$ and $(\tilde{G}^*, \tilde{U}^*_1, \ldots, \tilde{U}^*_K)$ be minimizers of problem (3.2) on $X$ and $X|_{S_1,\ldots,S_K}$, respectively, for which the factor matrices are orthonormal. Then, we have

$$\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(X) \pm O\bigl(\epsilon L^2 (1 + 2MR)\bigr)$$

with probability at least $1 - \delta$, where $L = |X|_{\max}$, $M = \max\{|G^*|_{\max}, |\tilde{G}^*|_{\max}\}$, and $R = \prod_{k \in [K]} R_k$.
We remark that, for the matrix case (i.e., K = 2), |G ∗ |max and |G̃ ∗ |max are equal to
the maximum singular values of the original and sampled matrices, respectively.

3.4.1 Preliminaries

Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor. We define

$$|X|_F = \sqrt{\sum_{i_1,\ldots,i_K} X^2_{i_1 \cdots i_K}}, \qquad \text{(Frobenius norm)}$$

$$|X|_{\max} = \max_{i_1 \in [N_1], \ldots, i_K \in [N_K]} |X_{i_1 \cdots i_K}|, \qquad \text{(max norm)}$$

$$|X|_\square = \max_{S_1 \subseteq [N_1], \ldots, S_K \subseteq [N_K]} \left| \sum_{i_1 \in S_1, \ldots, i_K \in S_K} X_{i_1 \cdots i_K} \right|. \qquad \text{(cut norm)}$$

We note that these norms satisfy the triangle inequality.
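For small tensors, the cut norm can be computed by brute force over all subsets, which is helpful for building intuition even though the enumeration is exponential; the following toy function is our own illustration and is not used in the analysis.

import numpy as np
from itertools import product

def cut_norm(X: np.ndarray) -> float:
    """Brute-force cut norm of a small tensor: maximize |sum of entries over
    S_1 x ... x S_K| over all subsets S_k of [N_k]. Exponential time, so only
    for tiny examples."""
    best = 0.0
    # Enumerate every combination of subsets, one subset per mode.
    subset_lists = [list(product([False, True], repeat=N)) for N in X.shape]
    for masks in product(*subset_lists):
        sub = X
        for k, mask in enumerate(masks):
            idx = [i for i, keep in enumerate(mask) if keep]
            sub = np.take(sub, idx, axis=k)
        best = max(best, abs(float(sub.sum())))
    return best

# Example: for a 2 x 2 matrix this checks all 16 subset pairs.
print(cut_norm(np.array([[1.0, -2.0], [3.0, 0.5]])))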


For a vector $v \in \mathbb{R}^n$ and a sequence $S = (x_1, \ldots, x_s)$ of indices in $[n]$, we define the restriction $v|_S \in \mathbb{R}^s$ of $v$ as $(v|_S)_i = v_{x_i}$ for $i \in [s]$. Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, and let $S_k = (x^k_1, \ldots, x^k_s)$ be a sequence of indices in $[N_k]$ for each mode $k \in [K]$. Then, we define the restriction $X|_{S_1,\ldots,S_K} \in \mathbb{R}^{s \times \cdots \times s}$ of $X$ to $S_1 \times \cdots \times S_K$ as $(X|_{S_1,\ldots,S_K})_{i_1 \cdots i_K} = X_{x^1_{i_1} \cdots x^K_{i_K}}$ for each $i_1, \ldots, i_K \in [s]$.
For a tensor $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ and vector-valued functions $\{F^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we define an order-$K$ dikernel $[[G; F^{(1)}, \ldots, F^{(K)}]] : [0,1]^K \to \mathbb{R}$ as

$$[[G; F^{(1)}, \ldots, F^{(K)}]](x_1, \ldots, x_K) = \sum_{r_1 \in [R_1], \ldots, r_K \in [R_K]} G_{r_1 \cdots r_K} \prod_{k \in [K]} F^{(k)}(x_k)_{r_k}.$$

We note that $[[G; F^{(1)}, \ldots, F^{(K)}]]$ is a continuous analogue of a Tucker decomposition.

3.4.2 Proof of Theorem 3.2

To prove Theorem 3.2, we first consider the dikernel counterpart to the Tucker fitting problem, in which we want to minimize the following:

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}) := \inf_{G \in \mathbb{R}^{R_1 \times \cdots \times R_K},\, \{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}} \bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2. \tag{3.3}$$

The following lemma, which is proved in Sect. 3.4.3, states that the Tucker fitting problem and its dikernel counterpart have the same optimum values.

Lemma 3.4 Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, and let $R_1, \ldots, R_K \in \mathbb{N}$ be integers. Then, we have

$$\ell_{R_1,\ldots,R_K}(X) = \ell_{R_1,\ldots,R_K}(\mathcal{X}).$$

For a set of vector-valued functions $F = \{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we define $|F|_{\max} = \max_{k \in [K],\, r \in [R_k],\, x \in [0,1]} |f^{(k)}_r(x)|$. For a dikernel $\mathcal{X} : [0,1]^K \to \mathbb{R}$, we define a dikernel $\mathcal{X}^2 : [0,1]^K \to \mathbb{R}$ as $\mathcal{X}^2(x) = \mathcal{X}(x)^2$ for every $x \in [0,1]^K$. The following lemma, which is proved in Sect. 3.4.4, states that if $\mathcal{X}$ and $\mathcal{Y}$ are close in the cut norm, then the optimum values of the Tucker fitting problem applied to them are also close.
Lemma 3.5 Let $\mathcal{X}, \mathcal{Y} : [0,1]^K \to \mathbb{R}$ be dikernels with $|\mathcal{X} - \mathcal{Y}|_\square \le \epsilon$ and $|\mathcal{X}^2 - \mathcal{Y}^2|_\square \le \epsilon$. For $R_1, \ldots, R_K \in \mathbb{N}$, we have

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}) = \ell_{R_1,\ldots,R_K}(\mathcal{Y}) \pm 2\epsilon \Bigl( 1 + R \bigl( |G_X|_{\max} |F_X|_{\max}^K + |G_Y|_{\max} |F_Y|_{\max}^K \bigr) \Bigr),$$

where $(G_X, F_X = \{f^{(k)}_X\}_{k \in [K]})$ and $(G_Y, F_Y = \{f^{(k)}_Y\}_{k \in [K]})$ are solutions to problem (3.3) on $\mathcal{X}$ and $\mathcal{Y}$, respectively, whose objective values exceed the respective infima by at most $\epsilon$, and $R = \prod_{k \in [K]} R_k$.

Proof (of Theorem 3.2) We apply Lemma 3.2 to $\mathcal{X}$ and $\mathcal{X}^2$. Thus, with probability at least $1 - \delta$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that

$$|\mathcal{X} - \pi(\mathcal{X}|_{S_1,\ldots,S_K})|_\square \le \epsilon L \quad \text{and} \quad |\mathcal{X}^2 - \pi(\mathcal{X}^2|_{S_1,\ldots,S_K})|_\square \le \epsilon L^2.$$

In the following, we assume that this has happened. By Lemma 3.5 and the fact that $\ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(\pi(\mathcal{X}|_{S_1,\ldots,S_K}))$, we have

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(\mathcal{X}) \pm 2\epsilon L^2 \Bigl( 1 + R \bigl( |G|_{\max} |F|_{\max}^K + |\tilde{G}|_{\max} |\tilde{F}|_{\max}^K \bigr) \Bigr),$$

where $(G, F = \{f^{(k)}\}_{k \in [K]})$ and $(\tilde{G}, \tilde{F} = \{\tilde{f}^{(k)}\}_{k \in [K]})$ are as in the statement of Lemma 3.5. From the proof of Lemma 3.4, we can assume that $|G|_{\max} = |G^*|_{\max}$, $|\tilde{G}|_{\max} = |\tilde{G}^*|_{\max}$, $|F|_{\max} \le 1$, and $|\tilde{F}|_{\max} \le 1$ (owing to the orthonormality of $U^*_1, \ldots, U^*_K$ and $\tilde{U}^*_1, \ldots, \tilde{U}^*_K$). It follows that

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(\mathcal{X}) \pm 2\epsilon L^2 \bigl( 1 + R (|G^*|_{\max} + |\tilde{G}^*|_{\max}) \bigr). \tag{3.4}$$

Then, we have

$$\begin{aligned}
\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K}) &= \ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) \qquad \text{(by Lemma 3.4)}\\
&= \ell_{R_1,\ldots,R_K}(\mathcal{X}) \pm 2\epsilon L^2 \bigl( 1 + R (|G^*|_{\max} + |\tilde{G}^*|_{\max}) \bigr) \qquad \text{(by (3.4))}\\
&= \ell_{R_1,\ldots,R_K}(X) \pm 2\epsilon L^2 \bigl( 1 + R (|G^*|_{\max} + |\tilde{G}^*|_{\max}) \bigr). \qquad \text{(by Lemma 3.4)}
\end{aligned}$$

Hence, the proof is complete.

3.4.3 Proof of Lemma 3.4

We say that a vector-valued function $f : [0,1] \to \mathbb{R}^R$ is orthonormal if $\langle f_r, f_r \rangle = 1$ for every $r \in [R]$ and $\langle f_r, f_{r'} \rangle = 0$ whenever $r \ne r'$. First, we calculate the partial derivatives of the objective function. We omit the proof because it is a straightforward (but tedious) calculation.

Lemma 3.6 Let $\mathcal{X} : [0,1]^K \to \mathbb{R}$ be a dikernel, $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ be a tensor, and $\{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$ be a set of orthonormal vector-valued functions. Then, we have

$$\begin{aligned}
\frac{\partial}{\partial f^{(k_0)}_{r_0}(x_0)} \bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2
&= 2 \sum_{r_1,\ldots,r_K :\, r_{k_0} = r_0} G_{r_1 \cdots r_K} \int_{[0,1]^K :\, x_{k_0} = x_0} \mathcal{X}(x) \prod_{k \in [K] \setminus \{k_0\}} f^{(k)}_{r_k}(x_k) \, dx \\
&\quad - 2 \sum_{r_1,\ldots,r_K} G_{r_1 \cdots r_K}\, G_{r_1 \cdots r_{k_0 - 1}\, r_0\, r_{k_0 + 1} \cdots r_K}\, f^{(k_0)}_{r_{k_0}}(x_0).
\end{aligned}$$

Proof (of Lemma 3.4) First, we show that (LHS) $\le$ (RHS). Consider a sequence of solutions to the continuous problem (3.3) for which the objective values attain the infimum. For Tucker decompositions, it is well known that there exists a minimizer for which the factor matrices $U^{(1)}, \ldots, U^{(K)}$ are orthonormal. By similar reasoning, we can show that the vector-valued functions $f^{(1)}, \ldots, f^{(K)}$ in each solution of the sequence are orthonormal. As the objective function is coercive with respect to the tensor $G$, we can take a subsequence along which $G$ converges; let $G^*$ be the limit. Now, for any $\delta > 0$, we can create a tensor $\tilde{G}$ by perturbing $G^*$ so that (i) fixing $G$ to $\tilde{G}$ in the continuous problem increases the infimum by at most $\delta$, and (ii) the matrix constructed from $\tilde{G}$ in the linear system below is invertible, with smallest singular value at least $\delta' = \delta'(\delta) > 0$.
Now, consider a sequence of solutions to the continuous problem (3.3) with $G$ fixed to $\tilde{G}$ for which the objective values attain the infimum. We can show that the partial derivatives converge to zero almost everywhere. For any $\epsilon > 0$, there then exists a solution $(\tilde{G}, f^{(1)}, \ldots, f^{(K)})$ in the sequence whose partial derivatives are at most $\epsilon$ in absolute value almost everywhere.
Then, by Lemma 3.6, for any $k_0 \in [K]$, $r_0 \in [R_{k_0}]$, and almost all $x_0 \in [0,1]$, we have

$$\sum_{r_1,\ldots,r_K} \tilde{G}_{r_1 \cdots r_K}\, \tilde{G}_{r_1 \cdots r_{k_0 - 1}\, r_0\, r_{k_0 + 1} \cdots r_K}\, f^{(k_0)}_{r_{k_0}}(x_0) = \sum_{r_1,\ldots,r_K :\, r_{k_0} = r_0} \tilde{G}_{r_1 \cdots r_K} \int_{[0,1]^K :\, x_{k_0} = x_0} \mathcal{X}(x) \prod_{k \in [K] \setminus \{k_0\}} f^{(k)}_{r_k}(x_k) \, dx \pm \epsilon(k_0, r_0, x_0), \tag{3.5}$$

where $\epsilon(k_0, r_0, x_0) = O(\epsilon)$. Now, consider the system of linear equations consisting of (3.5) for $r_0 = 1, \ldots, R_{k_0}$, where the variables are $f^{(k_0)}_1(x_0), \ldots, f^{(k_0)}_{R_{k_0}}(x_0)$. By the choice of $\tilde{G}$, the matrix involved in this system is invertible with smallest singular value at least $\delta'$. For any $k \in [K]$, $r \in [R_k]$, and almost every pair $x, x' \in [0,1]$ with $i_{N_k}(x) = i_{N_k}(x')$, we then have $f^{(k)}_r(x) = f^{(k)}_r(x') \pm O(\epsilon/\delta')$. For each $k \in [K]$, we can define a matrix $U^{(k)} \in \mathbb{R}^{N_k \times R_k}$ as $U^{(k)}_{ir} = f^{(k)}_r(x)$, where $x \in [0,1]$ is an arbitrary value with $i_{N_k}(x) = i$. Then, we have

$$\begin{aligned}
\frac{1}{N} \bigl| X - [[\tilde{G}; U^{(1)}, \ldots, U^{(K)}]] \bigr|_F^2
&= \frac{1}{N} \sum_{i_1,\ldots,i_K} \Bigl( X_{i_1 \cdots i_K} - [[\tilde{G}; U^{(1)}, \ldots, U^{(K)}]]_{i_1 \cdots i_K} \Bigr)^2 \\
&= \sum_{i_1,\ldots,i_K} \int_{I^{N_1}_{i_1} \times \cdots \times I^{N_K}_{i_K}} \Bigl( \mathcal{X}(x) - [[\tilde{G}; f^{(1)}, \ldots, f^{(K)}]](x) \pm O(\epsilon/\delta') \Bigr)^2 dx \\
&= \bigl| \mathcal{X} - [[\tilde{G}; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 \pm O(\epsilon^2 N / \delta'^2)
\end{aligned}$$

for $N = \prod_{k \in [K]} N_k$. As the choices of $\epsilon$ and $\delta$ are arbitrary, we obtain (LHS) $\le$ (RHS).
Second, we show that (RHS) $\le$ (LHS). Let $U^{(k)} \in \mathbb{R}^{N_k \times R_k}$ $(k \in [K])$ be matrices. We define a vector-valued function $f^{(k)} : [0,1] \to \mathbb{R}^{R_k}$ as $f^{(k)}_r(x) = U^{(k)}_{i_{N_k}(x)\, r}$ for each $k \in [K]$ and $r \in [R_k]$. Then, we have

$$\begin{aligned}
\bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2
&= \int_{[0,1]^K} \Bigl( \mathcal{X}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx \\
&= \sum_{i_1,\ldots,i_K} \int_{\prod_{k \in [K]} I^{N_k}_{i_k}} \Bigl( \mathcal{X}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx \\
&= \frac{1}{N} \sum_{i_1,\ldots,i_K} \Bigl( X_{i_1 \cdots i_K} - [[G; U^{(1)}, \ldots, U^{(K)}]]_{i_1 \cdots i_K} \Bigr)^2 \\
&= \frac{1}{N} \bigl| X - [[G; U^{(1)}, \ldots, U^{(K)}]] \bigr|_F^2,
\end{aligned}$$

from which the claim follows.

3.4.4 Proof of Lemma 3.5



For a sequence of functions f (1) , . . . , f (K ) , we define their tensor product k∈[K ]
 
f (k) ∈ [0, 1] K → R as k∈[K ] f (k) (x1 , . . . , x K ) = k∈[K ] f (k) (xk ), which is a dik-
ernel of order-K .
The cut norm is useful for bounding the absolute value of the inner product
between a tensor and a tensor product:
Lemma 3.7 Let  ≥ 0 and W : [0, 1] K → R be a dikernel with |W|   ≤ . Then,
for any functions f (1) , . . . , f (K ) : [0, 1] → [−L , L], we have |W, k∈[K ] f (k) | ≤
LK .

Proof For $\tau \in \mathbb{R}$ and a function $h : [0,1] \to \mathbb{R}$, let $L_\tau(h) := \{x \in [0,1] \mid h(x) = \tau\}$ be the level set of $h$ at $\tau$. For $\bar{f}^{(i)} = f^{(i)}/L$, we have

$$\begin{aligned}
\Bigl| \Bigl\langle \mathcal{W}, \bigotimes_{k \in [K]} f^{(k)} \Bigr\rangle \Bigr|
&= L^K \Bigl| \Bigl\langle \mathcal{W}, \bigotimes_{k \in [K]} \bar{f}^{(k)} \Bigr\rangle \Bigr| \\
&= L^K \biggl| \int_{[-1,1]^K} \prod_{k \in [K]} \tau_k \int_{\prod_{k \in [K]} L_{\tau_k}(\bar{f}^{(k)})} \mathcal{W}(x) \, dx \, d\tau \biggr| \\
&\le L^K \int_{[-1,1]^K} \prod_{k \in [K]} |\tau_k| \cdot \biggl| \int_{\prod_{k \in [K]} L_{\tau_k}(\bar{f}^{(k)})} \mathcal{W}(x) \, dx \biggr| \, d\tau \\
&\le \epsilon L^K \int_{[-1,1]^K} \prod_{k \in [K]} |\tau_k| \, d\tau = \epsilon L^K.
\end{aligned}$$

Thus, we have the following:

Lemma 3.8 Let $\mathcal{X}, \mathcal{Y} : [0,1]^K \to \mathbb{R}$ be dikernels with $|\mathcal{X} - \mathcal{Y}|_\square \le \epsilon$ and $|\mathcal{X}^2 - \mathcal{Y}^2|_\square \le \epsilon$, where $\mathcal{X}^2(x) = \mathcal{X}(x)^2$ and $\mathcal{Y}^2(x) = \mathcal{Y}(x)^2$ for every $x \in [0,1]^K$. Then, for any tensor $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ and any set of vector-valued functions $F = \{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we have

$$\bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 = \bigl| \mathcal{Y} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 \pm \epsilon \bigl( 1 + 2R |G|_{\max} |F|_{\max}^K \bigr),$$

where $R = \prod_{k \in [K]} R_k$.
Proof We have

$$\begin{aligned}
\Bigl| \bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 - \bigl| \mathcal{Y} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 \Bigr|
&= \biggl| \int_{[0,1]^K} \Bigl( \mathcal{X}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx - \int_{[0,1]^K} \Bigl( \mathcal{Y}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx \biggr| \\
&= \biggl| \int_{[0,1]^K} \bigl( \mathcal{X}(x)^2 - \mathcal{Y}(x)^2 \bigr) dx - 2 \int_{[0,1]^K} \bigl( \mathcal{X}(x) - \mathcal{Y}(x) \bigr) [[G; f^{(1)}, \ldots, f^{(K)}]](x) \, dx \biggr| \\
&\le |\mathcal{X}^2 - \mathcal{Y}^2|_\square + 2 \sum_{r_1 \in [R_1], \ldots, r_K \in [R_K]} |G_{r_1 \cdots r_K}| \cdot \Bigl| \Bigl\langle \mathcal{X} - \mathcal{Y}, \bigotimes_{k \in [K]} f^{(k)}_{r_k} \Bigr\rangle \Bigr| \\
&\le \epsilon + 2 \epsilon R |G|_{\max} |F|_{\max}^K
\end{aligned}$$

by Lemma 3.7.
Proof (of Lemma 3.5) By Lemma 3.8, we have

$$\begin{aligned}
\bigl| \mathcal{Y} - [[G_Y; f^{(1)}_Y, \ldots, f^{(K)}_Y]] \bigr|_F^2
&\le \bigl| \mathcal{Y} - [[G_X; f^{(1)}_X, \ldots, f^{(K)}_X]] \bigr|_F^2 + \epsilon \\
&\le \bigl| \mathcal{X} - [[G_X; f^{(1)}_X, \ldots, f^{(K)}_X]] \bigr|_F^2 + 2\epsilon + 2\epsilon R |G_X|_{\max} |F_X|_{\max}^K.
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
\bigl| \mathcal{X} - [[G_X; f^{(1)}_X, \ldots, f^{(K)}_X]] \bigr|_F^2
&\le \bigl| \mathcal{X} - [[G_Y; f^{(1)}_Y, \ldots, f^{(K)}_Y]] \bigr|_F^2 + \epsilon \\
&\le \bigl| \mathcal{Y} - [[G_Y; f^{(1)}_Y, \ldots, f^{(K)}_Y]] \bigr|_F^2 + 2\epsilon + 2\epsilon R |G_Y|_{\max} |F_Y|_{\max}^K.
\end{aligned}$$

Hence, the claim follows.

Hence, the claim follows.

References

1. L. Bottou, Stochastic learning, in Advanced Lectures on Machine Learning (2004), pp. 146–168
2. K.L. Clarkson, E. Hazan, D.P. Woodruff, Sublinear optimization for machine learning. J. ACM
59(5), 23:1–23:49 (2012)
3. L. De Lathauwer, B. De Moor, J. Vandewalle, On the best rank-1 and rank-(R1, R2, . . . , RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21(4), 1324–1342 (2000)
4. A. Frieze, R. Kannan, The regularity lemma and approximation schemes for dense problems,
in FOCS (1996), pp. 12–20
5. K. Hayashi, Y. Yoshida, Minimizing quadratic functions in constant time, in NIPS (2016), pp.
2217–2225
6. K. Hayashi, Y. Yoshida, Fitting low-rank tensors in constant time, in NIPS (2017), pp. 2473–
2481
7. L. Lovász, Large Networks and Graph Limits (American Mathematical Society, 2012)
8. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012)
9. T. Suzuki, M. Sugiyama, Least-squares independent component analysis. Neural Comput. 23(1), 284–301 (2011)
10. M. Sugiyama, T. Suzuki, T. Kanamori, Density Ratio Estimation in Machine Learning (Cam-
bridge University Press, 2012)
11. L.R. Tucker, Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), 279–311 (1966)
12. C.K.I. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in NIPS (2001)
13. M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, M. Sugiyama, Relative density-ratio estima-
tion for robust distribution comparison, in NIPS (2011)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
