Constant-Time Algorithms for Continuous Optimization Problems
Yuichi Yoshida
3.2 Preliminaries
This section reviews the basic concepts of graph limit theory. For further details,
refer to the book by Lovász [7].
A dikernel of order K is a bounded measurable function W : [0,1]^K → R. For a dikernel W, we define its L_2 norm as ‖W‖_2 = (∫_{[0,1]^K} W(x)² dx)^{1/2} and its cut norm as ‖W‖_□ = sup_{S_1,...,S_K ⊆ [0,1]} |∫_{S_1×···×S_K} W(x) dx|, where the supremum is taken over measurable sets. We note that these norms satisfy the triangle inequality. For two dikernels W and W′, we define their inner product as ⟨W, W′⟩ = ∫_{[0,1]^K} W(x) W′(x) dx. For a dikernel W : [0,1]² → R and a function f : [0,1] → R, we define a function W f : [0,1] → R as (W f)(x) = ⟨W(x, ·), f⟩.
Let λ denote the Lebesgue measure. A map π : [0,1] → [0,1] is said to be measure-preserving if the pre-image π^{-1}(X) is measurable for every measurable set X, and λ(π^{-1}(X)) = λ(X). A measure-preserving bijection is a measure-preserving map whose inverse map exists and is also measurable (and, in turn, also measure-preserving). For a measure-preserving bijection π : [0,1] → [0,1] and a dikernel W : [0,1]^K → R, we define a dikernel π(W) : [0,1]^K → R as π(W)(x_1, ..., x_K) = W(π(x_1), ..., π(x_K)).
A partition P = (V_1, ..., V_p) of the interval [0,1] is called an equipartition if λ(V_i) = 1/p for every i ∈ [p]. For a dikernel W : [0,1]^K → R and an equipartition P = (V_1, ..., V_p) of [0,1], we define W_P : [0,1]^K → R as the dikernel obtained by averaging over each V_{i_1} × ··· × V_{i_K} for i_1, ..., i_K ∈ [p]. More formally, we define

W_P(x) = (1 / ∏_{k∈[K]} λ(V_{i_k})) ∫_{V_{i_1}×···×V_{i_K}} W(x′) dx′ = p^K ∫_{V_{i_1}×···×V_{i_K}} W(x′) dx′,

where i_k is the unique index such that x_k ∈ V_{i_k} for each k ∈ [K]. The following lemma states that any dikernel W : [0,1]^K → R can be well approximated by W_P for some equipartition P into a small number of parts.
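To illustrate the averaging operator, the following is a minimal NumPy sketch of W_P for the matrix case (K = 2), where the equipartition consists of p contiguous blocks of equal size; the function name and the assumption that p divides the matrix size are ours, for illustration only.

import numpy as np

def block_average(W, p):
    """Matrix (K = 2) analogue of W_P: replace every block V_i x V_j of a
    p x p equipartition by its mean value. Assumes p divides n for simplicity."""
    n = W.shape[0]
    b = n // p
    # Group rows and columns into p blocks of size b and average within each block.
    means = W.reshape(p, b, p, b).mean(axis=(1, 3))
    # Expand back so that the result is constant on each block.
    return np.kron(means, np.ones((b, b)))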
Lemma 3.1 (Weak regularity lemma for dikernels [4]) Let W^1, ..., W^T : [0,1]^K → R be dikernels. Then, for any ε > 0, there exists an equipartition P into |P| ≤ 2^{O(T/ε^{2K})} parts such that, for every t ∈ [T],

‖W^t − (W^t)_P‖_□ ≤ ε ‖W^t‖_2.
For a tensor X ∈ R^{N_1×···×N_K}, we define the corresponding dikernel X : [0,1]^K → R as X(x_1, ..., x_K) = X_{i_{N_1}(x_1) ··· i_{N_K}(x_K)}, where i_n(x) denotes the unique index i ∈ [n] such that x lies in the interval I_i^n := [(i−1)/n, i/n) (with I_n^n := [(n−1)/n, 1]). This construction allows us to compare two tensors X and Y of different sizes via the cut norm, that is, via ‖X − Y‖_□, where X and Y here denote the dikernels corresponding to X and Y, respectively.
Let W : [0,1]^K → R be a dikernel and let S_k = (x_1^k, ..., x_s^k) for k ∈ [K] be sequences of elements in [0,1]. Then, we define a dikernel W|_{S_1,...,S_K} : [0,1]^K → R as follows: We first extract a tensor W′ ∈ R^{s×···×s} by setting W′_{i_1···i_K} = W(x^1_{i_1}, ..., x^K_{i_K}), and then define W|_{S_1,...,S_K} as the dikernel corresponding to W′. The following is the key technical lemma in the analysis of the algorithms given in the subsequent sections.
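As a concrete illustration of the restriction W|_{S_1,...,S_K}, the following NumPy sketch evaluates a dikernel (modeled, as an assumption of ours, by a Python callable that accepts NumPy arrays) at all combinations of the sampled coordinates, producing the tensor W′ from the definition above.

import numpy as np

def sample_dikernel(W, samples):
    # `samples` is a list of K arrays of points in [0, 1]; the result is the
    # s x ... x s tensor W' with W'_{i_1...i_K} = W(x^1_{i_1}, ..., x^K_{i_K}).
    grids = np.meshgrid(*samples, indexing='ij')
    return W(*grids)

# Example with K = 2, W(x, y) = x * y, and s = 4 sampled points per coordinate.
rng = np.random.default_rng(0)
S = [rng.uniform(0.0, 1.0, size=4) for _ in range(2)]
W_prime = sample_dikernel(lambda x, y: x * y, S)   # shape (4, 4)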
Lemma 3.2 Let W^1, ..., W^T : [0,1]^K → [−L, L] be dikernels. Let S_1, ..., S_K be sequences of s elements uniformly and independently sampled from [0,1]. Then, with probability at least 1 − exp(−Ω_K(s²(T/log s)^{1/K})), there exists a measure-preserving bijection π : [0,1] → [0,1] such that, for every t ∈ [T], we have

‖W^t − π(W^t|_{S_1,...,S_K})‖_□ = L · O_K((T / log s)^{1/(2K)}).
Background
Quadratic functions are one of the most important function classes in machine learn-
ing, statistics, and data mining. Many fundamental problems such as linear regression,
k-means clustering, principal component analysis, support vector machines, and ker-
nel methods can be formulated as a minimization problem of a quadratic function.
See, e.g., [8] for more details.
In some applications, it is sufficient to compute the minimum value of a quadratic
function rather than its solution. For example, Yamada et al. [13] proposed an efficient
method for estimating the Pearson divergence, which provides useful information
about data, such as the density ratio [10]. They formulated the estimation problem
as the minimization of a squared loss and showed that the Pearson divergence can be
estimated from the minimum value. Least-squares mutual information [9] is another
example that can be computed in a similar manner.
Despite its importance, minimization of quadratic functions suffers from the issue
of scalability. Let n ∈ N be the number of variables. In general, this kind of min-
imization problem can be solved by quadratic programming (QP), which requires
poly(n) time. If the problem is convex and there are no constraints, then the prob-
lem is reduced to solving a system of linear equations, which requires O(n³) time.
Both methods easily become infeasible, even for medium-scale problems of, say,
n > 10000.
Although several techniques have been proposed to accelerate quadratic function minimization, they require at least linear time in n. This is problematic when handling large-scale problems, where even linear time is slow or prohibitive. For example,
stochastic gradient descent (SGD) is an optimization method that is widely used
for large-scale problems. A nice property of this method is that, if the objective
function is strongly convex, it outputs a point that is sufficiently close to an optimal
solution after a constant number of iterations [1]. Nevertheless, each iteration needs at least Ω(n) time to access the variables. Another popular technique is low-rank approximation, such as Nyström's method [12]. The underlying idea is to approximate the input matrix by a low-rank matrix, which drastically reduces the time complexity. However, we still need to compute a matrix-vector product of size n, which requires Ω(n) time. Clarkson et al. [2] proposed sublinear-time algorithms for special cases of quadratic function minimization. However, these are "sublinear" with respect to the number of pairwise interactions of the variables, which is Θ(n²), and the algorithms require O(n log^c n) time for some c ≥ 1.
Constant-time algorithm for quadratic function minimization
Let A ∈ R^{n×n} be a matrix and d, b ∈ R^n be vectors. Then, we consider the following quadratic problem:

minimize_{v ∈ R^n}  p_{n,A,d,b}(v),  where  p_{n,A,d,b}(v) = ⟨v, Av⟩ + n⟨v, diag(d)v⟩ + n⟨b, v⟩,   (3.1)

where ⟨·, ·⟩ denotes the inner product and diag(d) denotes the diagonal matrix whose diagonal entries are specified by d. Although a constant term could be included in (3.1), it is irrelevant when optimizing (3.1), and hence we omit it.
Let z∗ ∈ R be the optimal value of (3.1), and let ε, δ ∈ (0, 1) be parameters. Our goal is then to compute z with |z − z∗| = O(εn²) with probability at least 1 − δ in constant time. We further assume that we have query access to A, b, and d, with which we can obtain any of their entries by specifying an index. We note that z∗ is typically Ω(n²) because ⟨v, Av⟩ consists of Θ(n²) terms, while ⟨v, diag(d)v⟩ and ⟨b, v⟩ consist of Θ(n) terms. Hence, we can regard an error of O(εn²) as an error of O(ε) per term, which is reasonably small in typical situations.
Let ·|_S be an operator that extracts a submatrix (or subvector) specified by an index set S ⊂ N. Our algorithm is then given by Algorithm 1, where the parameter s := s(ε, δ) is determined later. In other words, we sample a constant number of indices from the set [n] and then solve problem (3.1) restricted to these indices.

Algorithm 1
Input: n ∈ N, query access to a matrix A ∈ R^{n×n} and to vectors d, b ∈ R^n, and ε, δ ∈ (0, 1).
1: S ← a sequence of s := s(ε, δ) indices independently and uniformly sampled from [n].
2: return (n²/s²) · min_{v∈R^s} p_{s, A|_S, d|_S, b|_S}(v).
Note that the number of queries and the time complexity are O(s²) and poly(s),
respectively.
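The following is a minimal NumPy sketch of Algorithm 1, under the additional assumptions that A is symmetric and that A|_S + s·diag(d|_S) is positive-definite, so the restricted problem can be solved in closed form; the function name is ours, and query access is modeled by plain array indexing.

import numpy as np

def approx_min_quadratic(A, d, b, s, rng=None):
    """Estimate min_v p_{n,A,d,b}(v) from an s x s sample (sketch of Algorithm 1)."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    S = rng.integers(0, n, size=s)            # s indices, uniform with replacement
    A_S = A[np.ix_(S, S)]                     # s x s principal submatrix
    d_S, b_S = d[S], b[S]

    # Restricted objective: p_{s,A|S,d|S,b|S}(v) = <v, A_S v> + s<v, diag(d_S)v> + s<b_S, v>.
    H = A_S + s * np.diag(d_S)                # half of the Hessian (A_S symmetric)
    v = np.linalg.solve(H, -s * b_S / 2.0)    # stationary point of the restricted problem
    z_tilde = v @ A_S @ v + s * (d_S * v * v).sum() + s * (b_S @ v)
    return (n ** 2 / s ** 2) * z_tilde        # rescale to the original problem size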
The goal of the rest of this section is to show the following approximation guar-
antee of Algorithm 1.
Theorem 3.1 Let v∗ and z∗ be an optimal solution and the optimal value, respectively, of problem (3.1). By choosing s(ε, δ) = 2^{Θ(1/ε²)} + Θ(log(1/δ) · log log(1/δ)), the output z of Algorithm 1 satisfies |z − z∗| = O(εLM²n²) with probability at least 1 − δ, where

L = max{max_{i,j} |A_{ij}|, max_i |d_i|, max_i |b_i|}  and  M = max{max_{i∈[n]} |v∗_i|, max_{i∈[n]} |ṽ∗_i|},

and ṽ∗ is an optimal solution to the sampled problem min_{v∈R^s} p_{s,A|_S,d|_S,b|_S}(v).
We can show that M is bounded when A is symmetric and full rank. To see this, we first note that we can assume A + n·diag(d) is positive-definite, as otherwise p_{n,A,d,b} is unbounded from below and the problem is uninteresting. Then, for any set S ⊆ [n] of s indices, (A + n·diag(d))|_S is again positive-definite because it is a principal submatrix. Hence, we have v∗ = −(A + n·diag(d))^{-1} n b/2 and ṽ∗ = −(A|_S + s·diag(d|_S))^{-1} s b|_S/2, which means that M is bounded.
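Under the same positive-definiteness assumption, the estimate can be checked against the exact optimum computed from the closed form above; the snippet below is a hypothetical sanity check that reuses the approx_min_quadratic sketch from the previous block.

import numpy as np

n, s = 2000, 60
rng = np.random.default_rng(1)
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                       # symmetric with O(1) entries
d = rng.uniform(0.5, 1.0, size=n)       # makes A + n*diag(d) comfortably positive-definite
b = rng.normal(size=n)

v_star = np.linalg.solve(A + n * np.diag(d), -n * b / 2.0)       # exact minimizer
z_star = v_star @ A @ v_star + n * (d * v_star ** 2).sum() + n * (b @ v_star)
z_hat = approx_min_quadratic(A, d, b, s)
print(abs(z_hat - z_star) / n ** 2)     # error per n^2, cf. the O(eps * n^2) guarantee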
Lemma 3.3 Let A ∈ R^{n×n} be a matrix and d, b ∈ R^n be vectors. Then, we have

∂P_{n,A,d,b}(f)/∂f(x) = Σ_{i∈[n]} A_{i, i_n(x)} ∫_{I_i^n} f(y) dy + Σ_{j∈[n]} A_{i_n(x), j} ∫_{I_j^n} f(y) dy + 2 d_{i_n(x)} f(x) + b_{i_n(x)}.
Note that the form of this partial derivative depends only on i_n(x). Hence, in the optimal solution f∗ : [0,1] → [−M, M], we can assume f∗(x) = f∗(y) whenever i_n(x) = i_n(y). In other words, f∗ is constant on each of the intervals I_1^n, ..., I_n^n. For such an f∗, we define the vector v ∈ R^n by v_i = f∗(x), where x ∈ [0,1] is any element of I_i^n. Then, we have
⟨v, Av⟩ = Σ_{i,j∈[n]} A_{ij} v_i v_j = n² Σ_{i,j∈[n]} A_{ij} ∫_{I_i^n} ∫_{I_j^n} f∗(x) f∗(y) dx dy = n² ⟨f∗, A f∗⟩,

⟨v, diag(d)v⟩ = Σ_{i∈[n]} d_i v_i² = n Σ_{i∈[n]} d_i ∫_{I_i^n} f∗(x)² dx = n ⟨(f∗)², D1⟩,

⟨v, b⟩ = Σ_{i∈[n]} b_i v_i = n Σ_{i∈[n]} b_i ∫_{I_i^n} f∗(x) dx = n ⟨f∗, B1⟩.
Proof (of Theorem 3.1) We instantiate Lemma 3.2 with s = 2^{Θ(1/ε²)} + Θ(log(1/δ) · log log(1/δ)) and the dikernels A, D, and B (given by A(x, y) = A_{i_n(x) i_n(y)}, D(x, y) = d_{i_n(x)}, and B(x, y) = b_{i_n(x)}). Then, with probability at least 1 − δ, there exists a measure-preserving bijection π : [0,1] → [0,1] such that

max{ |⟨f, (A − π(A|_S)) f⟩|, |⟨f², (D − π(D|_S)) 1⟩|, |⟨f, (B − π(B|_S)) 1⟩| } ≤ εLM²/3
for any function f : [0, 1] → [−M, M]. Conditioned on this event, we have
z̃∗ = min_{v∈R^s} p_{s,A|_S,d|_S,b|_S}(v) = min_{v∈[−M,M]^s} p_{s,A|_S,d|_S,b|_S}(v)
Background
We say that a tensor (or a multidimensional array) is of order K if it is a K -
dimensional array. Each dimension is called a mode in tensor terminology. Tensor
decomposition, which approximates the input tensor by a number of smaller tensors,
is a fundamental tool for dealing with large tensors because it drastically reduces
memory usage.
Among the many existing tensor decomposition methods, Tucker decomposi-
tion [11] is a popular choice. To some extent, Tucker decomposition is analogous to
singular-value decomposition (SVD). Whereas SVD decomposes a matrix into left
and right singular vectors that interact via singular values, Tucker decomposition of
an order-K tensor consists of K factor matrices that interact via the so-called core
tensor. The key difference between SVD and Tucker decomposition is that, in the
latter, the core tensor does not need to be diagonal and its “rank” can differ for each
mode. We refer to the size of the core tensor, which is a K -tuple, as the Tucker rank
of a Tucker decomposition.
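To make this structure concrete, the following small NumPy sketch builds an order-3 Tucker decomposition with Tucker rank (R_1, R_2, R_3) = (3, 4, 5); the dimensions and names are illustrative only. Note that the core tensor need not be diagonal and its size can differ across modes.

import numpy as np

N1, N2, N3 = 30, 40, 50                   # input tensor dimensions
R1, R2, R3 = 3, 4, 5                      # Tucker rank (one value per mode)

rng = np.random.default_rng(0)
G = rng.normal(size=(R1, R2, R3))         # core tensor
U1 = rng.normal(size=(N1, R1))            # factor matrix for mode 1
U2 = rng.normal(size=(N2, R2))            # factor matrix for mode 2
U3 = rng.normal(size=(N3, R3))            # factor matrix for mode 3

# [[G; U1, U2, U3]]: the factor matrices interact only through the core tensor.
X_approx = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)   # shape (N1, N2, N3)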
We are usually interested in obtaining factor matrices and a core tensor to minimize
the residual error—the error between the input and low-rank approximated tensors.
Sometimes, however, the residual error itself is the quantity of interest. The residual error tells us how well the input tensor can be approximated by a low-rank tensor in the first place, and it is also useful for predetermining the Tucker rank.
In real applications, Tucker ranks are not explicitly given, and we must select them
by considering the tradeoff between space usage and approximation accuracy. For
example, if the selected Tucker rank is too small, we risk losing essential information
in the input tensor, whereas if the selected Tucker rank is too large, the computational
cost of computing the Tucker decomposition (even if we allow for approximation
methods) increases considerably along with space usage. As with the case of the
matrix rank, one might think that a reasonably good Tucker rank can be found using
a grid search. Unfortunately, grid search for an appropriate Tucker rank is challenging
because, for an order-K tensor, the Tucker rank consists of K free parameters and
the search space grows exponentially in K . Hence, we want to evaluate each grid
point as quickly as possible.
Although several practical algorithms have been proposed, such as the higher-order orthogonal iteration (HOOI) [3], they are not sufficiently scalable. For each mode, HOOI iteratively applies SVD to an unfolded tensor, a matrix reshaped from the input tensor. Given an N_1 × ··· × N_K tensor, the computational cost is hence O(K max_k N_k · ∏_k N_k), which crucially depends on the input size N_1, ..., N_K.
Although there are several approximation algorithms, their computational costs are
still intensive.
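For reference, the following is a compact NumPy sketch of the HOOI loop described above (unfold one mode, take a truncated SVD, repeat); it is a simplified illustration rather than the exact procedure of [3], assumes the tensor fits in memory, and uses a fixed iteration count. It also returns the normalized residual error, which is reused in the sampling sketch below.

import numpy as np

def unfold(T, mode):
    # Mode-k unfolding: move mode k to the front and flatten the remaining modes.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_residual(T, ranks, n_iter=20):
    """Normalized residual of a rank-(R_1,...,R_K) Tucker approximation via plain HOOI."""
    K = T.ndim
    # HOSVD initialization: leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(T, k), full_matrices=False)[0][:, :ranks[k]]
         for k in range(K)]
    for _ in range(n_iter):
        for k in range(K):
            # Project every mode except k onto its current factor, then update U[k].
            Y = T
            for m in range(K):
                if m != k:
                    Y = np.moveaxis(np.tensordot(U[m].T, Y, axes=(1, m)), 0, m)
            U[k] = np.linalg.svd(unfold(Y, k), full_matrices=False)[0][:, :ranks[k]]
    # Core tensor: G = T x_1 U1^T x_2 ... x_K UK^T.
    G = T
    for m in range(K):
        G = np.moveaxis(np.tensordot(U[m].T, G, axes=(1, m)), 0, m)
    # With orthonormal factors, ||T - [[G; U]]||_F^2 = ||T||_F^2 - ||G||_F^2.
    return (np.sum(T ** 2) - np.sum(G ** 2)) / T.size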
Constant-time algorithm for the Tucker fitting problem
The problem of computing the residual error is formalized as the following Tucker fitting problem: Given an order-K tensor X ∈ R^{N_1×···×N_K} and integers R_k ≤ N_k (k = 1, ..., K), we want to compute the following normalized residual error:

ℓ_{R_1,...,R_K}(X) := min_{G ∈ R^{R_1×···×R_K}, {U^{(k)} ∈ R^{N_k×R_k}}_{k∈[K]}} ‖X − [[G; U^{(1)}, ..., U^{(K)}]]‖_F² / ∏_{k∈[K]} N_k,   (3.2)

where [[G; U^{(1)}, ..., U^{(K)}]] ∈ R^{N_1×···×N_K} denotes the tensor whose (i_1, ..., i_K)-th entry is Σ_{r_1∈[R_1],...,r_K∈[R_K]} G_{r_1···r_K} ∏_{k∈[K]} U^{(k)}_{i_k r_k}.
The time complexity for computing ℓ_{R_1,...,R_K}(X|_{S_1,...,S_K}) does not depend on the input size N_1, ..., N_K but rather on the sample size s, meaning that the algorithm runs in constant time, regardless of the input size.
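The constant-time estimator just described then amounts to restricting the input to s uniformly sampled indices per mode and fitting only that sub-tensor; the sketch below reuses the illustrative tucker_residual function from the previous block and is a hedged approximation of the sampling procedure, not a verbatim transcription of Algorithm 2.

import numpy as np

def sampled_tucker_residual(T, ranks, s, rng=None):
    """Estimate the normalized residual of T from a random s x ... x s sub-tensor."""
    rng = np.random.default_rng(rng)
    idx = [rng.integers(0, n, size=s) for n in T.shape]   # S_1, ..., S_K (with replacement)
    sub = T[np.ix_(*idx)]                                 # X|_{S_1,...,S_K}
    return tucker_residual(sub, ranks)                    # cost depends on s, not on the N_k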
The goal of the rest of this section is to show the following approximation guar-
antee of Algorithm 2.
Theorem 3.2 Let X ∈ R^{N_1×···×N_K} be a tensor, R_1, ..., R_K be integers, and ε, δ ∈ (0, 1). For s(ε, δ) = 2^{Θ(1/ε^{2K−2})} + Θ(log(1/δ) · log log(1/δ)), we have the following. Let S_1, ..., S_K be sequences of indices as defined in Algorithm 2. Let (G∗, U_1∗, ..., U_K∗) and (G̃∗, Ũ_1∗, ..., Ũ_K∗) be minimizers of problem (3.2) on X and X|_{S_1,...,S_K}, respectively, for which the factor matrices are orthonormal. Then, we have

ℓ_{R_1,...,R_K}(X|_{S_1,...,S_K}) = ℓ_{R_1,...,R_K}(X) ± εL²(1 + 4MR)

with probability at least 1 − δ, where L = |X|_max, M = max{|G∗|_max, |G̃∗|_max}, and R = ∏_{k∈[K]} R_k.
We remark that, for the matrix case (i.e., K = 2), |G ∗ |max and |G̃ ∗ |max are equal to
the maximum singular values of the original and sampled matrices, respectively.
3.4.1 Preliminaries
Let X ∈ R^{N_1×···×N_K} be a tensor, and let S_k = (x_1^k, ..., x_s^k) be a sequence of indices in [N_k] for each mode k ∈ [K]. Then, we define the restriction X|_{S_1,...,S_K} ∈ R^{s×···×s} of X to S_1 × ··· × S_K as (X|_{S_1,...,S_K})_{i_1···i_K} = X_{x^1_{i_1}, ..., x^K_{i_K}} for each i_1, ..., i_K ∈ [s].
For a tensor G ∈ R^{R_1×···×R_K} and vector-valued functions {F^{(k)} : [0,1] → R^{R_k}}_{k∈[K]}, we define an order-K dikernel [[G; F^{(1)}, ..., F^{(K)}]] : [0,1]^K → R as

[[G; F^{(1)}, ..., F^{(K)}]](x_1, ..., x_K) = Σ_{r_1∈[R_1],...,r_K∈[R_K]} G_{r_1,...,r_K} ∏_{k∈[K]} F^{(k)}(x_k)_{r_k}.
To prove Theorem 3.2, we first consider the dikernel counterpart to the Tucker fitting problem, in which we want to minimize the following:

ℓ_{R_1,...,R_K}(X) := inf_{G ∈ R^{R_1×···×R_K}, {f^{(k)} : [0,1]→R^{R_k}}_{k∈[K]}} ‖X − [[G; f^{(1)}, ..., f^{(K)}]]‖_F².   (3.3)
The following lemma, which is proved in Sect. 3.4.3, states that the Tucker fitting
problem and its dikernel counterpart have the same optimum values.
Lemma 3.4 Let X ∈ R^{N_1×···×N_K} be a tensor, and let R_1, ..., R_K ∈ N be integers. Then, the optimal value of the Tucker fitting problem (3.2) on X equals the optimal value of its dikernel counterpart (3.3) on the dikernel corresponding to X.
Proof (of Theorem 3.2) We apply Lemma 3.2 to X and X². Thus, with probability at least 1 − δ, there exists a measure-preserving bijection π : [0,1] → [0,1] such that

‖X − π(X|_{S_1,...,S_K})‖_□ ≤ εL  and  ‖X² − π(X²|_{S_1,...,S_K})‖_□ ≤ εL².

In the following, we assume that this has happened. By Lemma 3.5 and the fact that ℓ_{R_1,...,R_K}(X|_{S_1,...,S_K}) = ℓ_{R_1,...,R_K}(π(X|_{S_1,...,S_K})), we have

ℓ_{R_1,...,R_K}(X|_{S_1,...,S_K}) = ℓ_{R_1,...,R_K}(X) ± εL²(1 + 2R(|G|_max |F|_max^K + |G̃|_max |F̃|_max^K)),

where (G, F = {f^{(k)}}_{k∈[K]}) and (G̃, F̃ = {f̃^{(k)}}_{k∈[K]}) are as in the statement of Lemma 3.5. From the proof of Lemma 3.4, we can assume that |G|_max = |G∗|_max, |G̃|_max = |G̃∗|_max, |F|_max ≤ 1, and |F̃|_max ≤ 1 (owing to the orthonormality of U_1∗, ..., U_K∗ and Ũ_1∗, ..., Ũ_K∗). It follows that

ℓ_{R_1,...,R_K}(X|_{S_1,...,S_K}) = ℓ_{R_1,...,R_K}(X) ± εL²(1 + 2R(|G∗|_max + |G̃∗|_max)).   (3.4)
Then, we have

∂/∂f^{(k_0)}_{r_0}(x_0) ‖X − [[G; f^{(1)}, ..., f^{(K)}]]‖_F²
 = −2 Σ_{r_1,...,r_K : r_{k_0} = r_0} G_{r_1···r_K} ∫_{[0,1]^K : x_{k_0} = x_0} (X(x) − [[G; f^{(1)}, ..., f^{(K)}]](x)) ∏_{k∈[K]\{k_0}} f^{(k)}_{r_k}(x_k) dx.
Proof (of Lemma 3.4) First, we show that (LHS) ≤ (RHS). Consider a sequence of solutions for the continuous problem (3.3) for which the objective values attain the infimum. For Tucker decompositions, it is well known that there exists a minimizer for which the factor matrices U^{(1)}, ..., U^{(K)} are orthonormal. By similar reasoning, we can show that the vector-valued functions f^{(1)}, ..., f^{(K)} in each solution of the sequence are orthonormal. As the objective function is coercive with respect to the tensor G, we can take a subsequence for which G converges; let G∗ be the limit. Now, for any δ > 0, we can create a tensor G̃ by perturbing G∗ so that (i) by fixing G to G̃ in the continuous problem, the infimum increases only by δ, and (ii) a matrix constructed from G̃ is invertible and has a condition number at least δ′ = δ′(δ) > 0.
Now, consider a sequence of solutions for the continuous problem (3.3) with G fixed to G̃ for which the objective values attain the infimum. We can show that the partial derivatives converge to zero almost everywhere. For any ε > 0, there then exists a solution (G̃, f^{(1)}, ..., f^{(K)}) in the sequence such that the partial derivatives are at most ε almost everywhere.
Then, by Lemma 3.6, for any k_0 ∈ [K], r_0 ∈ [R_{k_0}], and almost all x ∈ [0,1], the value f^{(k_0)}_{r_0}(x) is determined, up to an additive error of O(ε/δ′), by the interval among I_1^{N_{k_0}}, ..., I_{N_{k_0}}^{N_{k_0}} that contains x. Defining U^{(k)} ∈ R^{N_k×R_k} by averaging f^{(k)} over each of these intervals, we have

(1/N) ‖X − [[G̃; U^{(1)}, ..., U^{(K)}]]‖_F²
 = (1/N) Σ_{i_1,...,i_K} (X_{i_1···i_K} − [[G̃; U^{(1)}, ..., U^{(K)}]]_{i_1···i_K})²
 = Σ_{i_1,...,i_K} ∫_{I_{i_1}^{N_1}×···×I_{i_K}^{N_K}} (X(x) − [[G̃; f^{(1)}, ..., f^{(K)}]](x) ± O(ε/δ′))² dx
 = ‖X − [[G̃; f^{(1)}, ..., f^{(K)}]]‖_F² ± O(ε²N/(δ′)²)

for N = ∏_{k∈[K]} N_k. As the choices of ε and δ are arbitrary, we obtain (LHS) ≤ (RHS).
Second, we show that (RHS) ≤ (LHS). Let U^{(k)} ∈ R^{N_k×R_k} (k ∈ [K]) be matrices. We define a vector-valued function f^{(k)} : [0,1] → R^{R_k} as f^{(k)}_r(x) = U^{(k)}_{i_{N_k}(x), r} for each k ∈ [K] and r ∈ [R_k]. Then, we have

‖X − [[G; f^{(1)}, ..., f^{(K)}]]‖_F² = ∫_{[0,1]^K} (X(x) − [[G; f^{(1)}, ..., f^{(K)}]](x))² dx
 = Σ_{i_1,...,i_K} ∫_{∏_{k∈[K]} I_{i_k}^{N_k}} (X(x) − [[G; f^{(1)}, ..., f^{(K)}]](x))² dx
 = (1/N) Σ_{i_1,...,i_K} (X_{i_1···i_K} − [[G; U^{(1)}, ..., U^{(K)}]]_{i_1···i_K})²
 = (1/N) ‖X − [[G; U^{(1)}, ..., U^{(K)}]]‖_F².

Taking the minimum over G and U^{(1)}, ..., U^{(K)} yields (RHS) ≤ (LHS).
Proof For τ ∈ R and a function h : [0,1] → R, let L_τ(h) := {x ∈ [0,1] | h(x) = τ} be the level set of h at τ. For f̄^{(i)} := f^{(i)}/L, we have

⟨W, ∏_{k∈[K]} f^{(k)}⟩ = L^K ⟨W, ∏_{k∈[K]} f̄^{(k)}⟩
 = L^K ∫_{[−1,1]^K} ∏_{k∈[K]} τ_k ∫_{∏_{k∈[K]} L_{τ_k}(f̄^{(k)})} W(x) dx dτ
 ≤ L^K ∫_{[−1,1]^K} ∏_{k∈[K]} |τ_k| · |∫_{∏_{k∈[K]} L_{τ_k}(f̄^{(k)})} W(x) dx| dτ
 ≤ L^K ‖W‖_□ ∫_{[−1,1]^K} ∏_{k∈[K]} |τ_k| dτ = L^K ‖W‖_□.
where R = ∏_{k∈[K]} R_k.
Proof We have

‖X − [[G; f^{(1)}, ..., f^{(K)}]]‖_F² − ‖Y − [[G; f^{(1)}, ..., f^{(K)}]]‖_F²
 = ∫_{[0,1]^K} (X(x) − [[G; f^{(1)}, ..., f^{(K)}]](x))² dx − ∫_{[0,1]^K} (Y(x) − [[G; f^{(1)}, ..., f^{(K)}]](x))² dx
 = ∫_{[0,1]^K} (X(x)² − Y(x)²) dx − 2 ∫_{[0,1]^K} (X(x) − Y(x)) [[G; f^{(1)}, ..., f^{(K)}]](x) dx
 ≤ ‖X² − Y²‖_□ + 2 Σ_{r_1,...,r_K} |G_{r_1···r_K}| · |⟨X − Y, ∏_{k∈[K]} f^{(k)}_{r_k}⟩|,

and the claim follows by applying Lemma 3.7 to each inner product.
Proof (of Lemma 3.5) By Lemma 3.8, we have

‖Y − [[G_Y; f_Y^{(1)}, ..., f_Y^{(K)}]]‖_F² ≤ ‖Y − [[G_X; f_X^{(1)}, ..., f_X^{(K)}]]‖_F² + ε
 ≤ ‖X − [[G_X; f_X^{(1)}, ..., f_X^{(K)}]]‖_F² + 2ε + 2εR |G_X|_max |F_X|_max^K.

Similarly, we have

‖X − [[G_X; f_X^{(1)}, ..., f_X^{(K)}]]‖_F² ≤ ‖X − [[G_Y; f_Y^{(1)}, ..., f_Y^{(K)}]]‖_F² + ε
 ≤ ‖Y − [[G_Y; f_Y^{(1)}, ..., f_Y^{(K)}]]‖_F² + 2ε + 2εR |G_Y|_max |F_Y|_max^K.
References
1. L. Bottou, Stochastic learning, in Advanced Lectures on Machine Learning (2004), pp. 146–168
2. K.L. Clarkson, E. Hazan, D.P. Woodruff, Sublinear optimization for machine learning. J. ACM
59(5), 23:1–23:49 (2012)
3. L. De Lathauwer, B. De Moor, J. Vandewalle, On the best rank-1 and rank-(r_1, r_2, ..., r_n) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21(4), 1324–1342 (2000)
4. A. Frieze, R. Kannan, The regularity lemma and approximation schemes for dense problems,
in FOCS (1996), pp. 12–20
5. K. Hayashi, Y. Yoshida, Minimizing quadratic functions in constant time, in NIPS (2016), pp.
2217–2225
6. K. Hayashi, Y. Yoshida, Fitting low-rank tensors in constant time, in NIPS (2017), pp. 2473–
2481
7. L. Lovász, Large Networks and Graph Limits (American Mathematical Society, 2012)
8. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012)
9. T. Suzuki, M. Sugiyama, Least-squares independent component analysis. Neural Comput. 23(1), 284–301 (2011)
10. M. Sugiyama, T. Suzuki, T. Kanamori, Density Ratio Estimation in Machine Learning (Cam-
bridge University Press, 2012)
11. L.R. Tucker, Some mathematical notes on three-mode factor analysis. Psychometrika
31(3), 279–311 (1966)
12. C.K.I. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in NIPS (2001)
13. M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, M. Sugiyama, Relative density-ratio estima-
tion for robust distribution comparison, in NIPS (2011)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.