
Chapter 3

Constant-Time Algorithms for Continuous Optimization Problems

Yuichi Yoshida

Abstract In this chapter, we consider constant-time algorithms for continuous optimization problems. Specifically, we consider quadratic function minimization and tensor decomposition, both of which have numerous applications in machine learning and data mining. The key component in our analysis is graph limit theory, which was originally developed to study graphs analytically.

3.1 Introduction

In this chapter, we turn our attention to constant-time algorithms for continuous optimization problems. Specifically, we consider quadratic function minimization
and tensor decomposition, both of which have numerous applications in machine
learning and data mining. The key component in our analysis is graph limit theory,
which was originally developed to study graphs analytically.
We introduce graph limit theory in Sect. 3.2, and then discuss quadratic function
minimization and tensor decomposition in Sects. 3.3 and 3.4, respectively. Through-
out this chapter, we assume the real RAM model, in which we can perform basic
algebraic operations on real numbers in one step. For a positive integer n, let [n]
denote the set {1, 2, . . . , n}. For real values a, b, c ∈ R, a = b ± c is used as short-
hand for b − c ≤ a ≤ b + c. The algorithms and analysis presented in this chapter
are based on [5, 6].

3.2 Graph Limit Theory

This section reviews the basic concepts of graph limit theory. For further details,
refer to the book by Lovász [7].


We call a (measurable) function $\mathcal{W} : [0,1]^K \to \mathbb{R}$ a dikernel of order $K$. We define

$$|\mathcal{W}|_F = \sqrt{\int_{[0,1]^K} \mathcal{W}(x)^2 \, dx}, \qquad \text{(Frobenius norm)}$$

$$|\mathcal{W}|_{\max} = \max_{x \in [0,1]^K} |\mathcal{W}(x)|, \qquad \text{(max norm)}$$

$$|\mathcal{W}|_\square = \sup_{S_1,\ldots,S_K \subseteq [0,1]} \left| \int_{S_1 \times \cdots \times S_K} \mathcal{W}(x) \, dx \right|. \qquad \text{(cut norm)}$$

We note that these norms satisfy the triangle inequality. For two dikernels $\mathcal{W}$ and $\mathcal{W}'$, we define their inner product as $\langle \mathcal{W}, \mathcal{W}' \rangle = \int_{[0,1]^K} \mathcal{W}(x)\mathcal{W}'(x)\,dx$. For a dikernel $\mathcal{W} : [0,1]^2 \to \mathbb{R}$ and a function $f : [0,1] \to \mathbb{R}$, we define a function $\mathcal{W}f : [0,1] \to \mathbb{R}$ as $(\mathcal{W}f)(x) = \langle \mathcal{W}(x,\cdot), f \rangle$.
Let $\lambda$ be the Lebesgue measure on $[0,1]$. A map $\pi : [0,1] \to [0,1]$ is said to be measure-preserving if the pre-image $\pi^{-1}(X)$ is measurable for every measurable set $X$, and $\lambda(\pi^{-1}(X)) = \lambda(X)$. A measure-preserving bijection is a measure-preserving map whose inverse map exists and is also measurable (and, in turn, also measure-preserving). For a measure-preserving bijection $\pi : [0,1] \to [0,1]$ and a dikernel $\mathcal{W} : [0,1]^K \to \mathbb{R}$, we define a dikernel $\pi(\mathcal{W}) : [0,1]^K \to \mathbb{R}$ as $\pi(\mathcal{W})(x_1,\ldots,x_K) = \mathcal{W}(\pi(x_1),\ldots,\pi(x_K))$.
A partition $\mathcal{P} = (V_1,\ldots,V_p)$ of the interval $[0,1]$ is called an equipartition if $\lambda(V_i) = 1/p$ for every $i \in [p]$. For a dikernel $\mathcal{W} : [0,1]^K \to \mathbb{R}$ and an equipartition $\mathcal{P} = (V_1,\ldots,V_p)$ of $[0,1]$, we define $\mathcal{W}_{\mathcal{P}} : [0,1]^K \to \mathbb{R}$ as the dikernel obtained by averaging over each cell $V_{i_1} \times \cdots \times V_{i_K}$ for $i_1,\ldots,i_K \in [p]$. More formally, we define

$$\mathcal{W}_{\mathcal{P}}(x) = \frac{1}{\prod_{k \in [K]} \lambda(V_{i_k})} \int_{V_{i_1} \times \cdots \times V_{i_K}} \mathcal{W}(x')\,dx' = p^K \int_{V_{i_1} \times \cdots \times V_{i_K}} \mathcal{W}(x')\,dx',$$

where $i_k$ is the unique index such that $x_k \in V_{i_k}$ for each $k \in [K]$.
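As a concrete illustration, the following numpy sketch computes the discrete analogue of $\mathcal{W}_{\mathcal{P}}$ for an order-2 dikernel given as a matrix: it averages the matrix over the cells of a $p$-part equipartition. The function name and the divisibility assumption ($n$ divisible by $p$) are ours, chosen only to keep the sketch short.

import numpy as np

def block_average(W: np.ndarray, p: int) -> np.ndarray:
    """Discrete analogue of W_P for an order-2 dikernel given as an n x n
    matrix: average W over each cell of a p-part equipartition of the rows
    and columns (n is assumed to be divisible by p for simplicity)."""
    n = W.shape[0]
    assert W.shape == (n, n) and n % p == 0
    b = n // p  # block side length
    # Average each b x b block; the result is piecewise constant on the grid.
    blocks = W.reshape(p, b, p, b).mean(axis=(1, 3))
    # Expand back to n x n so that W_P has the same domain as W.
    return np.kron(blocks, np.ones((b, b)))

# Example: the coarsened matrix stays close to W when W has block structure,
# even if p is much smaller than n.
W = np.random.rand(12, 12)
WP = block_average(W, p=3)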
The following lemma states that any dikernel $\mathcal{W} : [0,1]^K \to \mathbb{R}$ can be well approximated by $\mathcal{W}_{\mathcal{P}}$ for some equipartition $\mathcal{P}$ into a small number of parts.

Lemma 3.1 (Weak regularity lemma for dikernels [4]) Let $\mathcal{W}^1,\ldots,\mathcal{W}^T : [0,1]^K \to \mathbb{R}$ be dikernels. Then, for any $\epsilon > 0$, there exists an equipartition $\mathcal{P}$ into $|\mathcal{P}| \le 2^{O(T/\epsilon^{2K})}$ parts such that, for every $t \in [T]$,

$$|\mathcal{W}^t - \mathcal{W}^t_{\mathcal{P}}|_\square \le \epsilon \, |\mathcal{W}^t|_F.$$

We can construct a dikernel $\mathcal{X} : [0,1]^K \to \mathbb{R}$ from a tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ as follows. For an integer $n \in \mathbb{N}$, let $I^n_1 = [0, \frac{1}{n}]$, $I^n_2 = (\frac{1}{n}, \frac{2}{n}]$, $\ldots$, $I^n_n = (\frac{n-1}{n}, 1]$. For $x \in [0,1]$, we define $i_n(x) \in [n]$ as the unique integer such that $x \in I^n_{i_n(x)}$. We then define $\mathcal{X}(x_1,\ldots,x_K) = X_{i_{N_1}(x_1) \cdots i_{N_K}(x_K)}$. The main motivation for creating a dikernel from a tensor is that, in doing so, we can define the distance between two tensors $X$ and $Y$ of different sizes via the cut norm, that is, $|\mathcal{X} - \mathcal{Y}|_\square$, where $\mathcal{X}$ and $\mathcal{Y}$ are the dikernels corresponding to $X$ and $Y$, respectively.
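To make the construction concrete, the following numpy sketch (our own helper, not part of the analysis in [5, 6]) evaluates the step-function dikernel of a tensor at a point of $[0,1]^K$ and uses it to compare two matrices of different sizes on a common grid; computing the cut norm exactly is a separate problem, so the grid comparison is only a rough proxy.

import numpy as np

def dikernel_value(X: np.ndarray, xs) -> float:
    """Evaluate the step-function dikernel of a tensor X at a point xs in
    [0,1]^K: each coordinate x is mapped to the interval index i_N(x)."""
    idx = tuple(min(int(np.ceil(x * N)) - 1 if x > 0 else 0, N - 1)
                for x, N in zip(xs, X.shape))
    return float(X[idx])

# Two matrices of different sizes can be compared by evaluating their
# dikernels on a common grid.
X = np.random.rand(4, 4)
Y = np.random.rand(6, 6)
grid = (np.arange(12) + 0.5) / 12  # midpoints of a common refinement
D = np.array([[dikernel_value(X, (u, v)) - dikernel_value(Y, (u, v))
               for v in grid] for u in grid])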
Let $\mathcal{W} : [0,1]^K \to \mathbb{R}$ be a dikernel and let $S_k = (x^k_1,\ldots,x^k_s)$ for $k \in [K]$ be sequences of elements in $[0,1]$. Then, we define a dikernel $\mathcal{W}|_{S_1,\ldots,S_K} : [0,1]^K \to \mathbb{R}$ as follows: We first extract a tensor $W \in \mathbb{R}^{s \times \cdots \times s}$ by setting $W_{i_1 \cdots i_K} = \mathcal{W}(x^1_{i_1},\ldots,x^K_{i_K})$. Next, we define $\mathcal{W}|_{S_1,\ldots,S_K}$ as the dikernel corresponding to the tensor $W$. The following is the key technical lemma in the analysis of the algorithms given in the subsequent sections.
Lemma 3.2 Let $\mathcal{W}^1,\ldots,\mathcal{W}^T : [0,1]^K \to [-L, L]$ be dikernels. Let $S_1,\ldots,S_K$ be sequences of $s$ elements uniformly and independently sampled from $[0,1]$. Then, with probability at least $1 - \exp(-\Omega_K(s^2 (T/\log s)^{1/K}))$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that, for every $t \in [T]$, we have

$$|\mathcal{W}^t - \pi(\mathcal{W}^t|_{S_1,\ldots,S_K})|_\square = L \cdot O_K\!\left(\left(\frac{T}{\log s}\right)^{1/2K}\right),$$

where $O_K(\cdot)$ and $\Omega_K(\cdot)$ hide factors depending only on $K$.

3.3 Quadratic Function Minimization

Background
Quadratic functions are one of the most important function classes in machine learn-
ing, statistics, and data mining. Many fundamental problems such as linear regression,
k-means clustering, principal component analysis, support vector machines, and ker-
nel methods can be formulated as a minimization problem of a quadratic function.
See, e.g., [8] for more details.
In some applications, it is sufficient to compute the minimum value of a quadratic
function rather than its solution. For example, Yamada et al. [13] proposed an efficient
method for estimating the Pearson divergence, which provides useful information
about data, such as the density ratio [10]. They formulated the estimation problem
as the minimization of a squared loss and showed that the Pearson divergence can be
estimated from the minimum value. Least-squares mutual information [9] is another
example that can be computed in a similar manner.
Despite its importance, minimization of quadratic functions suffers from the issue
of scalability. Let n ∈ N be the number of variables. In general, this kind of min-
imization problem can be solved by quadratic programming (QP), which requires
poly(n) time. If the problem is convex and there are no constraints, then the prob-
lem is reduced to solving a system of linear equations, which requires $O(n^3)$ time.
Both methods easily become infeasible, even for medium-scale problems of, say,
n > 10000.
Although several techniques have been proposed to accelerate quadratic function minimization, they require at least linear time in n.

Algorithm 1
Input: $n \in \mathbb{N}$, query access to a matrix $A \in \mathbb{R}^{n \times n}$ and to vectors $d, b \in \mathbb{R}^n$, and $\epsilon, \delta \in (0,1)$.
1: $S \leftarrow$ a sequence of $s = s(\epsilon, \delta)$ indices independently and uniformly sampled from $[n]$.
2: return $\frac{n^2}{s^2} \min_{v \in \mathbb{R}^s} p_{s, A|_S, d|_S, b|_S}(v)$.

This is problematic when handling large-scale problems, where even linear time is slow or prohibitive. For example,
stochastic gradient descent (SGD) is an optimization method that is widely used
for large-scale problems. A nice property of this method is that, if the objective
function is strongly convex, it outputs a point that is sufficiently close to an optimal
solution after a constant number of iterations [1]. Nevertheless, each iteration needs
at least $\Omega(n)$ time to access the variables. Another popular technique is low-rank approximation such as Nyström's method [12]. The underlying idea is to approximate the input matrix by a low-rank matrix, which drastically reduces the time complexity. However, we still need to compute a matrix-vector product of size $n$, which requires $\Omega(n)$ time. Clarkson et al. [2] proposed sublinear-time algorithms for special cases of quadratic function minimization. However, these are "sublinear" with respect to the number of pairwise interactions of the variables, which is $\Theta(n^2)$, and the algorithms require $O(n \log^c n)$ time for some $c \ge 1$.
Constant-time algorithm for quadratic function minimization
Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, we consider the following quadratic problem:

$$\min_{v \in \mathbb{R}^n} \; p_{n,A,d,b}(v), \quad \text{where} \quad p_{n,A,d,b}(v) = \langle v, Av \rangle + n \langle v, \mathrm{diag}(d)\, v \rangle + n \langle b, v \rangle, \tag{3.1}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product and $\mathrm{diag}(d)$ denotes the diagonal matrix whose diagonal entries are specified by $d$. Although a constant term could be included in (3.1), it is omitted here because it is irrelevant when optimizing (3.1).
Let $z^* \in \mathbb{R}$ be the optimal value of (3.1) and let $\epsilon, \delta \in (0,1)$ be parameters. Our goal is to compute $z$ with $|z - z^*| = O(\epsilon n^2)$ with probability at least $1 - \delta$ in constant time. We further assume that we have query access to $A$, $b$, and $d$, with which we can obtain any entry by specifying its index. We note that $z^*$ is typically $\Theta(n^2)$ because $\langle v, Av \rangle$ consists of $\Theta(n^2)$ terms, and $\langle v, \mathrm{diag}(d)\, v \rangle$ and $\langle b, v \rangle$ consist of $\Theta(n)$ terms. Hence, we can regard an error of $O(\epsilon n^2)$ as an error of $O(\epsilon)$ for each term, which is reasonably small in typical situations.
Let $\cdot|_S$ be the operator that extracts the submatrix (or subvector) specified by an index sequence $S$. Our algorithm is then given by Algorithm 1, where the parameter $s := s(\epsilon, \delta)$ is determined later. In other words, we sample a constant number of indices from the set $[n]$, and then solve the problem (3.1) restricted to these indices.

Note that the number of queries and the time complexity are $O(s^2)$ and $\mathrm{poly}(s)$, respectively.
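As a concrete sketch of Algorithm 1, the following numpy code samples $s$ indices, solves the restricted problem, and rescales by $n^2/s^2$. It assumes the sampled problem is strictly convex, that is, $A|_S + s\,\mathrm{diag}(d|_S)$ is positive-definite, so that the restricted minimizer is obtained from a linear system; the function name is ours.

import numpy as np

def quadratic_min_constant_time(A, d, b, s, rng=None):
    """Sketch of Algorithm 1: estimate min_v p_{n,A,d,b}(v) from an s x s
    sample. Assumes the sampled problem is strictly convex, so the
    restricted minimizer solves a linear system."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    S = rng.integers(0, n, size=s)            # indices sampled with replacement
    A_S, d_S, b_S = A[np.ix_(S, S)], d[S], b[S]
    # Minimize p_{s,A|S,d|S,b|S}(v) = <v, A_S v> + s<v, diag(d_S) v> + s<b_S, v>.
    H = (A_S + A_S.T) / 2 + s * np.diag(d_S)  # half of the Hessian
    v = np.linalg.solve(2 * H, -s * b_S)
    z_tilde = v @ A_S @ v + s * v @ (d_S * v) + s * (b_S @ v)
    return (n ** 2 / s ** 2) * z_tilde

On a random positive-definite instance, the returned value approaches $z^*$ as $s$ grows, in line with Theorem 3.1 below.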
The goal of the rest of this section is to show the following approximation guar-
antee of Algorithm 1.
Theorem 3.1 Let $v^*$ and $z^*$ be an optimal solution and the optimal value, respectively, of problem (3.1). By choosing $s(\epsilon, \delta) = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta} \log\log\frac{1}{\delta})$, with probability at least $1 - \delta$, a sequence $S$ of $s$ indices independently and uniformly sampled from $[n]$ satisfies the following: Let $\tilde{v}^*$ and $\tilde{z}^*$ be an optimal solution and the optimal value, respectively, of the problem $\min_{v \in \mathbb{R}^s} p_{s, A|_S, d|_S, b|_S}(v)$. Then, we have

$$\left| \frac{n^2}{s^2} \tilde{z}^* - z^* \right| \le \epsilon L M^2 n^2,$$

where

$$L = \max\Bigl\{ \max_{i,j} |A_{ij}|,\; \max_i |d_i|,\; \max_i |b_i| \Bigr\} \quad \text{and} \quad M = \max\Bigl\{ \max_{i \in [n]} |v^*_i|,\; \max_{i \in [s]} |\tilde{v}^*_i| \Bigr\}.$$

We can show that $M$ is bounded when $A$ is symmetric and full rank. To see this, first note that we can assume $A + n\,\mathrm{diag}(d)$ is positive-definite, as otherwise $p_{n,A,d,b}$ is not bounded below and the problem is uninteresting. Then, for any sequence $S$ of $s$ indices, $(A + n\,\mathrm{diag}(d))|_S$ is again positive-definite because it is a principal submatrix. Hence, we have $v^* = -(A + n\,\mathrm{diag}(d))^{-1} n b / 2$ and $\tilde{v}^* = -(A|_S + s\,\mathrm{diag}(d|_S))^{-1} s\, b|_S / 2$, which means that $M$ is bounded.

3.3.1 Proof of Theorem 3.1

To use dikernels in our analysis, we first introduce a continuous version of $p_{n,A,d,b}$. The real-valued functional $P_{n,A,d,b}$ on functions $f : [0,1] \to \mathbb{R}$ is defined as

$$P_{n,A,d,b}(f) = \langle f, \mathcal{A} f \rangle + \langle f^2, \mathcal{D} \mathbf{1} \rangle + \langle f, \mathcal{B} \mathbf{1} \rangle,$$

where $\mathcal{A}$ is the dikernel corresponding to $A$, and $\mathcal{D}$ and $\mathcal{B}$ are the dikernels corresponding to $d\mathbf{1}^\top$ and $b\mathbf{1}^\top$, respectively, $f^2 : [0,1] \to \mathbb{R}$ is the function such that $f^2(x) = f(x)^2$ for every $x \in [0,1]$, and $\mathbf{1} : [0,1] \to \mathbb{R}$ is the constant function with value 1 everywhere. The following lemma states that the minimizations of $p_{n,A,d,b}$ and $P_{n,A,d,b}$ are equivalent:

Lemma 3.3 Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, for any $M > 0$, we have

$$\min_{v \in [-M, M]^n} p_{n,A,d,b}(v) = n^2 \cdot \inf_{f : [0,1] \to [-M, M]} P_{n,A,d,b}(f).$$



Proof First, we show that $n^2 \cdot \inf_{f : [0,1] \to [-M,M]} P_{n,A,d,b}(f) \le \min_{v \in [-M,M]^n} p_{n,A,d,b}(v)$. Given a vector $v \in [-M, M]^n$, we define $f : [0,1] \to [-M, M]$ as $f(x) = v_{i_n(x)}$. Then,

$$\langle f, \mathcal{A} f \rangle = \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} A_{ij} f(x) f(y) \, dx \, dy = \frac{1}{n^2} \sum_{i,j \in [n]} A_{ij} v_i v_j = \frac{1}{n^2} \langle v, A v \rangle,$$

$$\langle f^2, \mathcal{D} \mathbf{1} \rangle = \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} d_i f(x)^2 \, dx \, dy = \sum_{i \in [n]} \int_{I^n_i} d_i f(x)^2 \, dx = \frac{1}{n} \sum_{i \in [n]} d_i v_i^2 = \frac{1}{n} \langle v, \mathrm{diag}(d)\, v \rangle,$$

$$\langle f, \mathcal{B} \mathbf{1} \rangle = \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} b_i f(x) \, dx \, dy = \sum_{i \in [n]} \int_{I^n_i} b_i f(x) \, dx = \frac{1}{n} \sum_{i \in [n]} b_i v_i = \frac{1}{n} \langle v, b \rangle.$$

Hence, we have $n^2 P_{n,A,d,b}(f) \le p_{n,A,d,b}(v)$.


Next, we show that $\min_{v \in [-M,M]^n} p_{n,A,d,b}(v) \le n^2 \cdot \inf_{f : [0,1] \to [-M,M]} P_{n,A,d,b}(f)$. Let $f : [0,1] \to [-M, M]$ be a measurable function. For $x \in [0,1]$, we then have

$$\frac{\partial P_{n,A,d,b}(f)}{\partial f(x)} = \sum_{i \in [n]} \int_{I^n_i} A_{i\, i_n(x)} f(y) \, dy + \sum_{j \in [n]} \int_{I^n_j} A_{i_n(x)\, j} f(y) \, dy + 2 d_{i_n(x)} f(x) + b_{i_n(x)}.$$

Note that the form of this partial derivative depends only on $i_n(x)$. Hence, for an optimal solution $f^* : [0,1] \to [-M, M]$, we can assume $f^*(x) = f^*(y)$ whenever $i_n(x) = i_n(y)$. In other words, $f^*$ is constant on each of the intervals $I^n_1, \ldots, I^n_n$. For such an $f^*$, we define the vector $v \in \mathbb{R}^n$ as $v_i = f^*(x)$, where $x \in [0,1]$ is any element of $I^n_i$. Then, we have

$$\langle v, A v \rangle = \sum_{i,j \in [n]} A_{ij} v_i v_j = n^2 \sum_{i,j \in [n]} \int_{I^n_i} \int_{I^n_j} A_{ij} f^*(x) f^*(y) \, dx \, dy = n^2 \langle f^*, \mathcal{A} f^* \rangle,$$

$$\langle v, \mathrm{diag}(d)\, v \rangle = \sum_{i \in [n]} d_i v_i^2 = n \sum_{i \in [n]} \int_{I^n_i} d_i f^*(x)^2 \, dx = n \langle (f^*)^2, \mathcal{D} \mathbf{1} \rangle,$$

$$\langle v, b \rangle = \sum_{i \in [n]} b_i v_i = n \sum_{i \in [n]} \int_{I^n_i} b_i f^*(x) \, dx = n \langle f^*, \mathcal{B} \mathbf{1} \rangle.$$

Hence, we have $p_{n,A,d,b}(v) \le n^2 P_{n,A,d,b}(f^*)$.

Proof (of Theorem 3.1) We instantiate Lemma 3.2 with $s = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta}\log\log\frac{1}{\delta})$ and the dikernels $\mathcal{A}$, $\mathcal{D}$, and $\mathcal{B}$. Then, with probability at least $1-\delta$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that

$$\max\Bigl\{\,|\langle f, (\mathcal{A} - \pi(\mathcal{A}|_S)) f\rangle|,\; |\langle f^2, (\mathcal{D} - \pi(\mathcal{D}|_S))\mathbf{1}\rangle|,\; |\langle f, (\mathcal{B} - \pi(\mathcal{B}|_S))\mathbf{1}\rangle|\,\Bigr\} \le \frac{\epsilon L M^2}{3}$$

for any function $f : [0,1] \to [-M, M]$. Conditioned on this event, and noting that the infimum below is unchanged when each dikernel is replaced by its image under $\pi$, we have

$$\begin{aligned}
\tilde{z}^* &= \min_{v \in \mathbb{R}^s} p_{s,A|_S,d|_S,b|_S}(v) = \min_{v \in [-M,M]^s} p_{s,A|_S,d|_S,b|_S}(v)\\
&= s^2 \cdot \inf_{f : [0,1] \to [-M,M]} P_{s,A|_S,d|_S,b|_S}(f) \qquad \text{(by Lemma 3.3)}\\
&= s^2 \cdot \inf_{f : [0,1] \to [-M,M]} \Bigl(\langle f, (\pi(\mathcal{A}|_S) - \mathcal{A}) f\rangle + \langle f, \mathcal{A} f\rangle + \langle f^2, (\pi(\mathcal{D}|_S) - \mathcal{D})\mathbf{1}\rangle\\
&\qquad\qquad + \langle f^2, \mathcal{D}\mathbf{1}\rangle + \langle f, (\pi(\mathcal{B}|_S) - \mathcal{B})\mathbf{1}\rangle + \langle f, \mathcal{B}\mathbf{1}\rangle\Bigr)\\
&= s^2 \cdot \inf_{f : [0,1] \to [-M,M]} \Bigl(\langle f, \mathcal{A} f\rangle + \langle f^2, \mathcal{D}\mathbf{1}\rangle + \langle f, \mathcal{B}\mathbf{1}\rangle\Bigr) \pm \epsilon L M^2 s^2\\
&= \frac{s^2}{n^2} \cdot \min_{v \in [-M,M]^n} p_{n,A,d,b}(v) \pm \epsilon L M^2 s^2 \qquad \text{(by Lemma 3.3)}\\
&= \frac{s^2}{n^2} \cdot \min_{v \in \mathbb{R}^n} p_{n,A,d,b}(v) \pm \epsilon L M^2 s^2 = \frac{s^2}{n^2} z^* \pm \epsilon L M^2 s^2.
\end{aligned}$$

Rearranging the inequality, we obtain the desired result.

3.4 Tensor Decomposition

Background
We say that a tensor (or a multidimensional array) is of order K if it is a K -
dimensional array. Each dimension is called a mode in tensor terminology. Tensor
decomposition, which approximates the input tensor by a number of smaller tensors,
is a fundamental tool for dealing with large tensors because it drastically reduces
memory usage.
Among the many existing tensor decomposition methods, Tucker decomposi-
tion [11] is a popular choice. To some extent, Tucker decomposition is analogous to
singular-value decomposition (SVD). Whereas SVD decomposes a matrix into left
and right singular vectors that interact via singular values, Tucker decomposition of
an order-K tensor consists of K factor matrices that interact via the so-called core
tensor. The key difference between SVD and Tucker decomposition is that, in the
latter, the core tensor does not need to be diagonal and its “rank” can differ for each
mode. We refer to the size of the core tensor, which is a K -tuple, as the Tucker rank
of a Tucker decomposition.
We are usually interested in obtaining factor matrices and a core tensor to minimize
the residual error—the error between the input and low-rank approximated tensors.
Sometimes, however, knowing the residual error itself is a task of interest. The
residual error tells us how suitable a low-rank approximation is to approximate the
input tensor in the first place, and is also useful to predetermine the Tucker rank.
In real applications, Tucker ranks are not explicitly given, and we must select them
by considering the tradeoff between space usage and approximation accuracy. For

example, if the selected Tucker rank is too small, we risk losing essential information
in the input tensor, whereas if the selected Tucker rank is too large, the computational
cost of computing the Tucker decomposition (even if we allow for approximation
methods) increases considerably along with space usage. As with the case of the
matrix rank, one might think that a reasonably good Tucker rank can be found using
a grid search. Unfortunately, grid search for an appropriate Tucker rank is challenging
because, for an order-K tensor, the Tucker rank consists of K free parameters and
the search space grows exponentially in K . Hence, we want to evaluate each grid
point as quickly as possible.
Although several practical algorithms have been proposed, such as the higher order
orthogonal iteration (HOOI) [3], they are not sufficiently scalable. For each mode,
HOOI iteratively applies SVD to an unfolded tensor, that is, a matrix reshaped from the input tensor. Given an $N_1 \times \cdots \times N_K$ tensor, the computational cost is hence $O(K \max_k N_k \cdot \prod_k N_k)$, which crucially depends on the input size $N_1, \ldots, N_K$.
Although there are several approximation algorithms, their computational costs are
still intensive.
Constant-time algorithm for the Tucker fitting problem
The problem of computing the residual error is formalized as the following Tucker fitting problem: Given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ and integers $R_k \le N_k$ $(k = 1, \ldots, K)$, we want to compute the following normalized residual error:

$$\ell_{R_1,\ldots,R_K}(X) := \min_{G \in \mathbb{R}^{R_1 \times \cdots \times R_K},\, \{U^{(k)} \in \mathbb{R}^{N_k \times R_k}\}_{k \in [K]}} \frac{\bigl| X - [[G; U^{(1)}, \ldots, U^{(K)}]] \bigr|_F^2}{\prod_{k \in [K]} N_k}, \tag{3.2}$$

where $[[G; U^{(1)}, \ldots, U^{(K)}]] \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ is the order-$K$ tensor defined as

$$[[G; U^{(1)}, \ldots, U^{(K)}]]_{i_1 \cdots i_K} = \sum_{r_1 \in [R_1], \ldots, r_K \in [R_K]} G_{r_1 \cdots r_K} \prod_{k \in [K]} U^{(k)}_{i_k r_k}$$

for every $i_1 \in [N_1], \ldots, i_K \in [N_K]$. Here, $G$ is the core tensor, and $U^{(1)}, \ldots, U^{(K)}$ are the factor matrices. Note that we are not concerned with computing the minimizer; we only want to compute the minimum value. In addition, we do not need the exact minimum. Indeed, a rough estimate still helps to narrow down promising rank candidates. The question here is how quickly we can compute the normalized residual error $\ell_{R_1,\ldots,R_K}(X)$ with moderate accuracy.
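For concreteness, the multilinear product $[[G; U^{(1)}, \ldots, U^{(K)}]]$ and the normalized residual appearing in (3.2) can be evaluated with a few lines of numpy; the helper names below are ours and the code only illustrates the definitions, it is not a solver for (3.2).

import numpy as np
from string import ascii_lowercase as letters

def multilinear_product(G, factors):
    """Compute [[G; U^(1),...,U^(K)]] by contracting each mode of the core G
    (shape R_1 x ... x R_K) with the factor U^(k) (shape N_k x R_k)."""
    K = G.ndim
    core = letters[:K]                  # indices r_1 ... r_K of G
    outs = letters[K:2 * K]             # indices i_1 ... i_K of the output
    terms = [core] + [outs[k] + core[k] for k in range(K)]
    return np.einsum(",".join(terms) + "->" + outs, G, *factors)

def normalized_residual(X, G, factors):
    """Normalized residual ||X - [[G; U's]]||_F^2 / prod_k N_k."""
    T = multilinear_product(G, factors)
    return np.sum((X - T) ** 2) / X.size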
In this section, we consider the following simple sampling algorithm and show that it can be used to approximately solve the Tucker fitting problem. First, given an order-$K$ tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, a Tucker rank $(R_1, \ldots, R_K)$, and a sample size $s \in \mathbb{N}$, we sample a sequence of indices $S_k = (x^k_1, \ldots, x^k_s)$ uniformly and independently from $[N_k]$ for each mode $k \in [K]$. We then construct a mini-tensor $X|_{S_1,\ldots,S_K} \in \mathbb{R}^{s \times \cdots \times s}$, where $(X|_{S_1,\ldots,S_K})_{i_1 \cdots i_K} = X_{x^1_{i_1} \cdots x^K_{i_K}}$. Finally, we compute $\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K})$ using an arbitrary solver, such as HOOI, and output the obtained value. The details are provided in Algorithm 2.

Algorithm 2 Sampling algorithm for the Tucker fitting problem
Input: $N_1, \ldots, N_K \in \mathbb{N}$, query access to a tensor $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$, Tucker rank $(R_1, \ldots, R_K)$, and $\epsilon, \delta \in (0,1)$.
1: for $k = 1$ to $K$ do
2:   $S_k \leftarrow$ a sequence of $s = s(\epsilon, \delta)$ indices uniformly and independently sampled from $[N_k]$.
3: Construct the mini-tensor $X|_{S_1,\ldots,S_K}$.
4: return $\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K})$.

Note that the time complexity for computing $\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K})$ does not depend on the input size $N_1, \ldots, N_K$ but rather on the sample size $s$, meaning that the algorithm runs in constant time, regardless of the input size.
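The following numpy sketch mirrors Algorithm 2; for the inner solver it uses a simple truncated HOSVD instead of HOOI, which suffices to illustrate the sampling step (and assumes $s \ge R_k$ for every mode). The function names are ours, and the code is only a sketch, not the implementation analyzed in [6].

import numpy as np

def hosvd_residual(X, ranks):
    """Approximate the Tucker fitting objective for a (small) tensor X by
    truncated HOSVD: take the top-R_k left singular vectors of each mode-k
    unfolding, project to get the core, and measure the residual."""
    K = X.ndim
    factors = []
    for k, R in enumerate(ranks):
        unfolding = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :R])
    core = X
    for k, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, k, 0), axes=1), 0, k)
    recon = core
    for k, U in enumerate(factors):
        recon = np.moveaxis(np.tensordot(U, np.moveaxis(recon, k, 0), axes=1), 0, k)
    return np.sum((X - recon) ** 2) / X.size

def tucker_residual_constant_time(X, ranks, s, rng=None):
    """Sketch of Algorithm 2: estimate the normalized residual from an
    s x ... x s uniformly sampled mini-tensor."""
    rng = np.random.default_rng(rng)
    S = [rng.integers(0, N, size=s) for N in X.shape]
    mini = X[np.ix_(*S)]
    return hosvd_residual(mini, ranks)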
The goal of the rest of this section is to show the following approximation guar-
antee of Algorithm 2.
Theorem 3.2 Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, $R_1, \ldots, R_K$ be integers, and $\epsilon, \delta \in (0,1)$. For $s(\epsilon, \delta) = 2^{\Theta(1/\epsilon^{2K-2})} + \Theta(\log\frac{1}{\delta} \log\log\frac{1}{\delta})$, we have the following. Let $S_1, \ldots, S_K$ be sequences of indices as defined in Algorithm 2. Let $(G^*, U^*_1, \ldots, U^*_K)$ and $(\tilde{G}^*, \tilde{U}^*_1, \ldots, \tilde{U}^*_K)$ be minimizers of problem (3.2) on $X$ and $X|_{S_1,\ldots,S_K}$, respectively, for which the factor matrices are orthonormal. Then, we have

$$\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(X) \pm O\bigl(\epsilon L^2 (1 + 2MR)\bigr)$$

with probability at least $1 - \delta$, where $L = |X|_{\max}$, $M = \max\{|G^*|_{\max}, |\tilde{G}^*|_{\max}\}$, and $R = \prod_{k \in [K]} R_k$.
We remark that, for the matrix case (i.e., K = 2), |G ∗ |max and |G̃ ∗ |max are equal to
the maximum singular values of the original and sampled matrices, respectively.

3.4.1 Preliminaries

Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor. We define

$$|X|_F = \sqrt{\sum_{i_1,\ldots,i_K} X^2_{i_1 \cdots i_K}}, \qquad \text{(Frobenius norm)}$$

$$|X|_{\max} = \max_{i_1 \in [N_1], \ldots, i_K \in [N_K]} |X_{i_1 \cdots i_K}|, \qquad \text{(max norm)}$$

$$|X|_\square = \max_{S_1 \subseteq [N_1], \ldots, S_K \subseteq [N_K]} \left| \sum_{i_1 \in S_1, \ldots, i_K \in S_K} X_{i_1 \cdots i_K} \right|. \qquad \text{(cut norm)}$$

We note that these norms satisfy the triangle inequality.
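For small tensors, the cut norm can be computed by brute force over all subsets, which is helpful for building intuition even though the enumeration is exponential; the following toy function is our own illustration and is not used in the analysis.

import numpy as np
from itertools import product

def cut_norm(X: np.ndarray) -> float:
    """Brute-force cut norm of a small tensor: maximize |sum of entries over
    S_1 x ... x S_K| over all subsets S_k of [N_k]. Exponential time, so only
    for tiny examples."""
    best = 0.0
    # Enumerate every combination of subsets, one subset per mode.
    subset_lists = [list(product([False, True], repeat=N)) for N in X.shape]
    for masks in product(*subset_lists):
        sub = X
        for k, mask in enumerate(masks):
            idx = [i for i, keep in enumerate(mask) if keep]
            sub = np.take(sub, idx, axis=k)
        best = max(best, abs(float(sub.sum())))
    return best

# Example: for a 2 x 2 matrix this checks all 16 subset pairs.
print(cut_norm(np.array([[1.0, -2.0], [3.0, 0.5]])))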


For a vector $v \in \mathbb{R}^n$ and a sequence $S = (x_1, \ldots, x_s)$ of indices in $[n]$, we define the restriction $v|_S \in \mathbb{R}^s$ of $v$ as $(v|_S)_i = v_{x_i}$ for $i \in [s]$. Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, and let $S_k = (x^k_1, \ldots, x^k_s)$ be a sequence of indices in $[N_k]$ for each mode $k \in [K]$. Then, we define the restriction $X|_{S_1,\ldots,S_K} \in \mathbb{R}^{s \times \cdots \times s}$ of $X$ to $S_1 \times \cdots \times S_K$ as $(X|_{S_1,\ldots,S_K})_{i_1 \cdots i_K} = X_{x^1_{i_1} \cdots x^K_{i_K}}$ for each $i_1, \ldots, i_K \in [s]$.
For a tensor $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ and vector-valued functions $\{F^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we define an order-$K$ dikernel $[[G; F^{(1)}, \ldots, F^{(K)}]] : [0,1]^K \to \mathbb{R}$ as

$$[[G; F^{(1)}, \ldots, F^{(K)}]](x_1, \ldots, x_K) = \sum_{r_1 \in [R_1], \ldots, r_K \in [R_K]} G_{r_1 \cdots r_K} \prod_{k \in [K]} F^{(k)}(x_k)_{r_k}.$$

We note that $[[G; F^{(1)}, \ldots, F^{(K)}]]$ is a continuous analogue of a Tucker decomposition.

3.4.2 Proof of Theorem 3.2

To prove Theorem 3.2, we first consider the dikernel counterpart to the Tucker fitting problem, in which we want to minimize the following:

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}) := \inf_{G \in \mathbb{R}^{R_1 \times \cdots \times R_K},\, \{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}} \bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2. \tag{3.3}$$

The following lemma, which is proved in Sect. 3.4.3, states that the Tucker fitting problem and its dikernel counterpart have the same optimum values.

Lemma 3.4 Let $X \in \mathbb{R}^{N_1 \times \cdots \times N_K}$ be a tensor, and let $R_1, \ldots, R_K \in \mathbb{N}$ be integers. Then, we have

$$\ell_{R_1,\ldots,R_K}(X) = \ell_{R_1,\ldots,R_K}(\mathcal{X}).$$

For a set of vector-valued functions $F = \{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we define $|F|_{\max} = \max_{k \in [K],\, r \in [R_k],\, x \in [0,1]} |f^{(k)}_r(x)|$. For a dikernel $\mathcal{X} : [0,1]^K \to \mathbb{R}$, we define a dikernel $\mathcal{X}^2 : [0,1]^K \to \mathbb{R}$ as $\mathcal{X}^2(x) = \mathcal{X}(x)^2$ for every $x \in [0,1]^K$. The following lemma, which is proved in Sect. 3.4.4, states that if $\mathcal{X}$ and $\mathcal{Y}$ are close in the cut norm, then the optimum values of the Tucker fitting problem applied to them are also close.
Lemma 3.5 Let $\mathcal{X}, \mathcal{Y} : [0,1]^K \to \mathbb{R}$ be dikernels with $|\mathcal{X} - \mathcal{Y}|_\square \le \epsilon$ and $|\mathcal{X}^2 - \mathcal{Y}^2|_\square \le \epsilon$. For $R_1, \ldots, R_K \in \mathbb{N}$, we have

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}) = \ell_{R_1,\ldots,R_K}(\mathcal{Y}) \pm 2\epsilon \Bigl( 1 + R \bigl( |G_X|_{\max} |F_X|_{\max}^K + |G_Y|_{\max} |F_Y|_{\max}^K \bigr) \Bigr),$$

where $(G_X, F_X = \{f^{(k)}_X\}_{k \in [K]})$ and $(G_Y, F_Y = \{f^{(k)}_Y\}_{k \in [K]})$ are solutions to problem (3.3) on $\mathcal{X}$ and $\mathcal{Y}$, respectively, whose objective values exceed the respective infima by at most $\epsilon$, and $R = \prod_{k \in [K]} R_k$.

Proof (of Theorem 3.2) We apply Lemma 3.2 to $\mathcal{X}$ and $\mathcal{X}^2$. Thus, with probability at least $1 - \delta$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that

$$|\mathcal{X} - \pi(\mathcal{X}|_{S_1,\ldots,S_K})|_\square \le \epsilon L \quad \text{and} \quad |\mathcal{X}^2 - \pi(\mathcal{X}^2|_{S_1,\ldots,S_K})|_\square \le \epsilon L^2.$$

In the following, we assume that this has happened. By Lemma 3.5 and the fact that $\ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(\pi(\mathcal{X}|_{S_1,\ldots,S_K}))$, we have

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(\mathcal{X}) \pm 2\epsilon L^2 \Bigl( 1 + R \bigl( |G|_{\max} |F|_{\max}^K + |\tilde{G}|_{\max} |\tilde{F}|_{\max}^K \bigr) \Bigr),$$

where $(G, F = \{f^{(k)}\}_{k \in [K]})$ and $(\tilde{G}, \tilde{F} = \{\tilde{f}^{(k)}\}_{k \in [K]})$ are as in the statement of Lemma 3.5. From the proof of Lemma 3.4, we can assume that $|G|_{\max} = |G^*|_{\max}$, $|\tilde{G}|_{\max} = |\tilde{G}^*|_{\max}$, $|F|_{\max} \le 1$, and $|\tilde{F}|_{\max} \le 1$ (owing to the orthonormality of $U^*_1, \ldots, U^*_K$ and $\tilde{U}^*_1, \ldots, \tilde{U}^*_K$). It follows that

$$\ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) = \ell_{R_1,\ldots,R_K}(\mathcal{X}) \pm 2\epsilon L^2 \bigl( 1 + R (|G^*|_{\max} + |\tilde{G}^*|_{\max}) \bigr). \tag{3.4}$$

Then, we have

$$\begin{aligned}
\ell_{R_1,\ldots,R_K}(X|_{S_1,\ldots,S_K}) &= \ell_{R_1,\ldots,R_K}(\mathcal{X}|_{S_1,\ldots,S_K}) \qquad \text{(by Lemma 3.4)}\\
&= \ell_{R_1,\ldots,R_K}(\mathcal{X}) \pm 2\epsilon L^2 \bigl( 1 + R (|G^*|_{\max} + |\tilde{G}^*|_{\max}) \bigr) \qquad \text{(by (3.4))}\\
&= \ell_{R_1,\ldots,R_K}(X) \pm 2\epsilon L^2 \bigl( 1 + R (|G^*|_{\max} + |\tilde{G}^*|_{\max}) \bigr). \qquad \text{(by Lemma 3.4)}
\end{aligned}$$

Hence, the proof is complete.

3.4.3 Proof of Lemma 3.4

We say that a vector-valued function $f : [0,1] \to \mathbb{R}^R$ is orthonormal if $\langle f_r, f_r \rangle = 1$ for every $r \in [R]$ and $\langle f_r, f_{r'} \rangle = 0$ whenever $r \ne r'$. First, we calculate the partial derivatives of the objective function. We omit the proof because it is a straightforward (but tedious) calculation.

Lemma 3.6 Let $\mathcal{X} : [0,1]^K \to \mathbb{R}$ be a dikernel, $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ be a tensor, and $\{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$ be a set of orthonormal vector-valued functions. Then, we have

$$\begin{aligned}
\frac{\partial}{\partial f^{(k_0)}_{r_0}(x_0)} \bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2
&= 2 \sum_{r_1,\ldots,r_K :\, r_{k_0} = r_0} G_{r_1 \cdots r_K} \int_{[0,1]^K :\, x_{k_0} = x_0} \mathcal{X}(x) \prod_{k \in [K] \setminus \{k_0\}} f^{(k)}_{r_k}(x_k) \, dx \\
&\quad - 2 \sum_{r_1,\ldots,r_K} G_{r_1 \cdots r_K}\, G_{r_1 \cdots r_{k_0 - 1}\, r_0\, r_{k_0 + 1} \cdots r_K}\, f^{(k_0)}_{r_{k_0}}(x_0).
\end{aligned}$$

Proof (of Lemma 3.4) First, we show that (LHS) $\le$ (RHS). Consider a sequence of solutions to the continuous problem (3.3) for which the objective values attain the infimum. For Tucker decompositions, it is well known that there exists a minimizer for which the factor matrices $U^{(1)}, \ldots, U^{(K)}$ are orthonormal. By similar reasoning, we can show that the vector-valued functions $f^{(1)}, \ldots, f^{(K)}$ in each solution of the sequence are orthonormal. As the objective function is coercive with respect to the tensor $G$, we can take a subsequence along which $G$ converges; let $G^*$ be the limit. Now, for any $\delta > 0$, we can create a tensor $\tilde{G}$ by perturbing $G^*$ so that (i) fixing $G$ to $\tilde{G}$ in the continuous problem increases the infimum by at most $\delta$, and (ii) the matrix constructed from $\tilde{G}$ in the linear system below is invertible, with smallest singular value at least $\delta' = \delta'(\delta) > 0$.
Now, consider a sequence of solutions to the continuous problem (3.3) with $G$ fixed to $\tilde{G}$ for which the objective values attain the infimum. We can show that the partial derivatives converge to zero almost everywhere. For any $\epsilon > 0$, there then exists a solution $(\tilde{G}, f^{(1)}, \ldots, f^{(K)})$ in the sequence whose partial derivatives are at most $\epsilon$ in absolute value almost everywhere.
Then, by Lemma 3.6, for any $k_0 \in [K]$, $r_0 \in [R_{k_0}]$, and almost all $x_0 \in [0,1]$, we have

$$\sum_{r_1,\ldots,r_K} \tilde{G}_{r_1 \cdots r_K}\, \tilde{G}_{r_1 \cdots r_{k_0 - 1}\, r_0\, r_{k_0 + 1} \cdots r_K}\, f^{(k_0)}_{r_{k_0}}(x_0) = \sum_{r_1,\ldots,r_K :\, r_{k_0} = r_0} \tilde{G}_{r_1 \cdots r_K} \int_{[0,1]^K :\, x_{k_0} = x_0} \mathcal{X}(x) \prod_{k \in [K] \setminus \{k_0\}} f^{(k)}_{r_k}(x_k) \, dx \pm \epsilon(k_0, r_0, x_0), \tag{3.5}$$

where $\epsilon(k_0, r_0, x_0) = O(\epsilon)$. Now, consider the system of linear equations consisting of (3.5) for $r_0 = 1, \ldots, R_{k_0}$, where the variables are $f^{(k_0)}_1(x_0), \ldots, f^{(k_0)}_{R_{k_0}}(x_0)$. By the choice of $\tilde{G}$, the matrix involved in this system is invertible with smallest singular value at least $\delta'$. For any $k \in [K]$, $r \in [R_k]$, and almost every pair $x, x' \in [0,1]$ with $i_{N_k}(x) = i_{N_k}(x')$, we then have $f^{(k)}_r(x) = f^{(k)}_r(x') \pm O(\epsilon/\delta')$. For each $k \in [K]$, we can define a matrix $U^{(k)} \in \mathbb{R}^{N_k \times R_k}$ as $U^{(k)}_{ir} = f^{(k)}_r(x)$, where $x \in [0,1]$ is an arbitrary value with $i_{N_k}(x) = i$. Then, we have

$$\begin{aligned}
\frac{1}{N} \bigl| X - [[\tilde{G}; U^{(1)}, \ldots, U^{(K)}]] \bigr|_F^2
&= \frac{1}{N} \sum_{i_1,\ldots,i_K} \Bigl( X_{i_1 \cdots i_K} - [[\tilde{G}; U^{(1)}, \ldots, U^{(K)}]]_{i_1 \cdots i_K} \Bigr)^2 \\
&= \sum_{i_1,\ldots,i_K} \int_{I^{N_1}_{i_1} \times \cdots \times I^{N_K}_{i_K}} \Bigl( \mathcal{X}(x) - [[\tilde{G}; f^{(1)}, \ldots, f^{(K)}]](x) \pm O(\epsilon/\delta') \Bigr)^2 dx \\
&= \bigl| \mathcal{X} - [[\tilde{G}; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 \pm O(\epsilon^2 N / \delta'^2)
\end{aligned}$$

for $N = \prod_{k \in [K]} N_k$. As the choices of $\epsilon$ and $\delta$ are arbitrary, we obtain (LHS) $\le$ (RHS).
Second, we show that (RHS) $\le$ (LHS). Let $U^{(k)} \in \mathbb{R}^{N_k \times R_k}$ $(k \in [K])$ be matrices. We define a vector-valued function $f^{(k)} : [0,1] \to \mathbb{R}^{R_k}$ as $f^{(k)}_r(x) = U^{(k)}_{i_{N_k}(x)\, r}$ for each $k \in [K]$ and $r \in [R_k]$. Then, we have

$$\begin{aligned}
\bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2
&= \int_{[0,1]^K} \Bigl( \mathcal{X}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx \\
&= \sum_{i_1,\ldots,i_K} \int_{\prod_{k \in [K]} I^{N_k}_{i_k}} \Bigl( \mathcal{X}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx \\
&= \frac{1}{N} \sum_{i_1,\ldots,i_K} \Bigl( X_{i_1 \cdots i_K} - [[G; U^{(1)}, \ldots, U^{(K)}]]_{i_1 \cdots i_K} \Bigr)^2 \\
&= \frac{1}{N} \bigl| X - [[G; U^{(1)}, \ldots, U^{(K)}]] \bigr|_F^2,
\end{aligned}$$

from which the claim follows.

3.4.4 Proof of Lemma 3.5



For a sequence of functions f (1) , . . . , f (K ) , we define their tensor product k∈[K ]
 
f (k) ∈ [0, 1] K → R as k∈[K ] f (k) (x1 , . . . , x K ) = k∈[K ] f (k) (xk ), which is a dik-
ernel of order-K .
The cut norm is useful for bounding the absolute value of the inner product
between a tensor and a tensor product:
Lemma 3.7 Let  ≥ 0 and W : [0, 1] K → R be a dikernel with |W|   ≤ . Then,
for any functions f (1) , . . . , f (K ) : [0, 1] → [−L , L], we have |W, k∈[K ] f (k) | ≤
LK .

Proof For $\tau \in \mathbb{R}$ and a function $h : [0,1] \to \mathbb{R}$, let $L_\tau(h) := \{x \in [0,1] \mid h(x) = \tau\}$ be the level set of $h$ at $\tau$. For $\bar{f}^{(i)} = f^{(i)}/L$, we have

$$\begin{aligned}
\Bigl| \Bigl\langle \mathcal{W}, \bigotimes_{k \in [K]} f^{(k)} \Bigr\rangle \Bigr|
&= L^K \Bigl| \Bigl\langle \mathcal{W}, \bigotimes_{k \in [K]} \bar{f}^{(k)} \Bigr\rangle \Bigr| \\
&= L^K \biggl| \int_{[-1,1]^K} \prod_{k \in [K]} \tau_k \int_{\prod_{k \in [K]} L_{\tau_k}(\bar{f}^{(k)})} \mathcal{W}(x) \, dx \, d\tau \biggr| \\
&\le L^K \int_{[-1,1]^K} \prod_{k \in [K]} |\tau_k| \cdot \biggl| \int_{\prod_{k \in [K]} L_{\tau_k}(\bar{f}^{(k)})} \mathcal{W}(x) \, dx \biggr| \, d\tau \\
&\le \epsilon L^K \int_{[-1,1]^K} \prod_{k \in [K]} |\tau_k| \, d\tau = \epsilon L^K.
\end{aligned}$$

Thus, we have the following:

Lemma 3.8 Let $\mathcal{X}, \mathcal{Y} : [0,1]^K \to \mathbb{R}$ be dikernels with $|\mathcal{X} - \mathcal{Y}|_\square \le \epsilon$ and $|\mathcal{X}^2 - \mathcal{Y}^2|_\square \le \epsilon$, where $\mathcal{X}^2(x) = \mathcal{X}(x)^2$ and $\mathcal{Y}^2(x) = \mathcal{Y}(x)^2$ for every $x \in [0,1]^K$. Then, for any tensor $G \in \mathbb{R}^{R_1 \times \cdots \times R_K}$ and any set of vector-valued functions $F = \{f^{(k)} : [0,1] \to \mathbb{R}^{R_k}\}_{k \in [K]}$, we have

$$\bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 = \bigl| \mathcal{Y} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 \pm \epsilon \bigl( 1 + 2R |G|_{\max} |F|_{\max}^K \bigr),$$

where $R = \prod_{k \in [K]} R_k$.
Proof We have

$$\begin{aligned}
\Bigl| \bigl| \mathcal{X} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 - \bigl| \mathcal{Y} - [[G; f^{(1)}, \ldots, f^{(K)}]] \bigr|_F^2 \Bigr|
&= \biggl| \int_{[0,1]^K} \Bigl( \mathcal{X}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx - \int_{[0,1]^K} \Bigl( \mathcal{Y}(x) - [[G; f^{(1)}, \ldots, f^{(K)}]](x) \Bigr)^2 dx \biggr| \\
&= \biggl| \int_{[0,1]^K} \bigl( \mathcal{X}(x)^2 - \mathcal{Y}(x)^2 \bigr) dx - 2 \int_{[0,1]^K} \bigl( \mathcal{X}(x) - \mathcal{Y}(x) \bigr) [[G; f^{(1)}, \ldots, f^{(K)}]](x) \, dx \biggr| \\
&\le |\mathcal{X}^2 - \mathcal{Y}^2|_\square + 2 \sum_{r_1 \in [R_1], \ldots, r_K \in [R_K]} |G_{r_1 \cdots r_K}| \cdot \Bigl| \Bigl\langle \mathcal{X} - \mathcal{Y}, \bigotimes_{k \in [K]} f^{(k)}_{r_k} \Bigr\rangle \Bigr| \\
&\le \epsilon + 2 \epsilon R |G|_{\max} |F|_{\max}^K
\end{aligned}$$

by Lemma 3.7.
Proof (of Lemma 3.5) By Lemma 3.8, we have

$$\begin{aligned}
\bigl| \mathcal{Y} - [[G_Y; f^{(1)}_Y, \ldots, f^{(K)}_Y]] \bigr|_F^2
&\le \bigl| \mathcal{Y} - [[G_X; f^{(1)}_X, \ldots, f^{(K)}_X]] \bigr|_F^2 + \epsilon \\
&\le \bigl| \mathcal{X} - [[G_X; f^{(1)}_X, \ldots, f^{(K)}_X]] \bigr|_F^2 + 2\epsilon + 2\epsilon R |G_X|_{\max} |F_X|_{\max}^K.
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
\bigl| \mathcal{X} - [[G_X; f^{(1)}_X, \ldots, f^{(K)}_X]] \bigr|_F^2
&\le \bigl| \mathcal{X} - [[G_Y; f^{(1)}_Y, \ldots, f^{(K)}_Y]] \bigr|_F^2 + \epsilon \\
&\le \bigl| \mathcal{Y} - [[G_Y; f^{(1)}_Y, \ldots, f^{(K)}_Y]] \bigr|_F^2 + 2\epsilon + 2\epsilon R |G_Y|_{\max} |F_Y|_{\max}^K.
\end{aligned}$$

Hence, the claim follows.

Hence, the claim follows.

References

1. L. Bottou, Stochastic learning, in Advanced Lectures on Machine Learning (2004), pp. 146–168
2. K.L. Clarkson, E. Hazan, D.P. Woodruff, Sublinear optimization for machine learning. J. ACM
59(5), 23:1–23:49 (2012)
3. L. De Lathauwer, B. De Moor, J. Vandewalle, On the best rank-1 and rank-(R1, R2, . . . , RN) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21(4), 1324–1342 (2000)
4. A. Frieze, R. Kannan, The regularity lemma and approximation schemes for dense problems,
in FOCS (1996), pp. 12–20
5. K. Hayashi, Y. Yoshida, Minimizing quadratic functions in constant time, in NIPS (2016), pp.
2217–2225
6. K. Hayashi, Y. Yoshida, Fitting low-rank tensors in constant time, in NIPS (2017), pp. 2473–
2481
7. L. Lovász, Large Networks and Graph Limits (American Mathematical Society, 2012)
8. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012)
9. T. Suzuki, M. Sugiyama, Least-squares independent component analysis. Neural Comput. 23(1), 284–301 (2011)
10. M. Sugiyama, T. Suzuki, T. Kanamori, Density Ratio Estimation in Machine Learning (Cam-
bridge University Press, 2012)
11. L.R. Tucker, Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), 279–311 (1966)
12. C.K.I. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in NIPS (2001)
13. M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, M. Sugiyama, Relative density-ratio estima-
tion for robust distribution comparison, in NIPS (2011)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
