
A LINEAR-COMPLEXITY TENSOR BUTTERFLY ALGORITHM FOR COMPRESSING HIGH-DIMENSIONAL OSCILLATORY INTEGRAL OPERATORS

P. MICHAEL KIELSTRA∗, TIANYI SHI†, HENGRUI LUO‡, JIANLIANG QIAN§, AND YANG LIU†
Abstract. This paper presents a multilevel tensor compression algorithm, called the tensor butterfly algorithm, for efficiently representing large-scale and high-dimensional oscillatory integral operators, including Green's functions for wave equations and integral transforms such as Radon transforms and Fourier transforms. The proposed algorithm leverages a tensor extension of the so-called complementary low-rank property of existing matrix butterfly algorithms. The algorithm partitions the discretized integral operator tensor into subtensors of multiple levels, and factorizes each subtensor at the middle level as a Tucker-like interpolative decomposition, whose factor matrices are formed in a multilevel fashion. For a d-dimensional integral operator discretized into a 2d-mode tensor with n^{2d} entries, the overall CPU time and memory requirement scale as O(n^d), in stark contrast to the O(n^d log n) requirement of existing matrix algorithms such as the matrix butterfly algorithm and fast Fourier transforms (FFT), where n is the number of points per direction. Compared with other tensor algorithms such as the quantized tensor train (QTT), the proposed algorithm also shows superior CPU and memory performance for tensor contraction. Remarkably, the tensor butterfly algorithm can efficiently model high-frequency Green's function interactions between two unit cubes, each spanning 512 wavelengths per direction, which represents over 512× larger problem sizes than existing algorithms can handle. On the other hand, for a problem representing 64 wavelengths per direction, the largest size existing algorithms can handle, our tensor butterfly algorithm exhibits 200× speedups and 30× memory reduction compared with existing ones. Moreover, the tensor butterfly algorithm also permits O(n^d)-complexity FFTs and Radon transforms up to d = 6 dimensions.

Key words. butterfly algorithm, tensor algorithm, Tucker decomposition, interpolative decom-
position, quantized tensor train (QTT), fast Fourier transforms (FFT), fast algorithm, high-frequency
wave equations, integral transforms, Radon transform, low-rank compression, Fourier integral oper-
ator, non-uniform FFT (NUFFT)

AMS subject classifications. 15A23, 65F50, 65R10, 65R20

1. Introduction. Oscillatory integral operators (OIOs), such as Fourier transforms and Fourier integral operators [32, 7], are critical computational and theoretical tools for many scientific and engineering applications, such as signal and image processing, inverse problems and imaging, computer vision, quantum mechanics, and analyzing and solving partial differential equations (PDEs). The development of accurate and efficient algorithms for computing OIOs has had profound impacts on the evolution of the pertinent research areas including, perhaps most remarkably, the invention of the fast Fourier transform (FFT) by Cooley and Tukey in 1965 and the invention of the fast multipole method (FMM) by Greengard and Rokhlin in 1987, both of which were listed among the ten most significant algorithms discovered in the 20th century. Among existing analytical and algebraic methods for OIOs, butterfly algorithms [52, 46, 37, 36, 56] represent an emerging class of multilevel matrix decomposition algorithms that have been proposed for Fourier transforms and Fourier
∗ Department of Mathematics, University of California, Berkeley, CA, USA. Email: [email protected]
† Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. Email: {tianyishi,liuyangzhuan}@lbl.gov
‡ Department of Statistics, Rice University, Houston, TX, USA. Email: [email protected]
§ Department of Mathematics and Department of CMSE, Michigan State University, East Lansing, MI, USA. Email: [email protected]
integral operators [8, 68, 67], special function transforms [63, 4, 54], fast iterative
[53, 52, 47] and direct [24, 43, 25, 26, 59, 44, 60] solution of surface and volume in-
tegral equations for wave equations, high-frequency Green’s function ansatz for inho-
mogeneous wave equations [45, 41, 48], direct solution of PDE-induced sparse systems
[42, 13], and machine learning for inverse problems [33, 35]. The (matrix) butterfly al-
gorithms leverage the so-called complementary low-rank (CLR) property of the matrix
representation of OIOs after proper row/column permutation. The CLR states that
judiciously selected submatrices exhibit numerical low ranks, known as the butterfly
ranks, which stay constant irrespective of the matrix sizes. This permits a multilevel
sparse matrix decomposition requiring O(n log n) factorization time, application time,
and storage units with n being the matrix size.
Despite their low asymptotic complexity, the matrix butterfly algorithms often-
times exhibit relatively large prefactors, i.e., constant but high butterfly ranks, par-
ticularly for higher-dimensional OIOs. Examples include Green’s functions for 3D
high-frequency wave equations [59, 45], 3D Radon transforms for linear inverse prob-
lems [17], 6D Fourier–Bros–Iagolnitzer transforms for Wigner equations [15, 66], 6D
Fourier transforms in diffusion magnetic resonance imaging [11] and plasma physics
[18], 4D space-time transforms in quantum field theories [57, 49], and multi-particle
Green’s functions in quantum chemistry [21]. For these high-dimensional OIOs, the
computational advantage of the matrix butterfly algorithms over other existing algo-
rithms becomes significant only for very large matrices.
More broadly speaking, for large-scale multi-dimensional scientific data and op-
erators, tensor algorithms are typically more efficient than matrix algorithms. Popu-
lar low-rank tensor compression algorithms include CANDECOMP/PARAFAC [30],
Tucker [16], hierarchical Tucker [28], tensor train (TT) [55], and tensor network [12]
decomposition algorithms. See references [34, 23] for a more complete review of
available tensor formats and their applications. When applied to the representa-
tion of high-dimensional integral operators, tensor algorithms often leverage addi-
tional translational- or scaling-invariance property to achieve superior compression
performance, including solution of quasi-static wave equations [65, 64, 22, 14], elliptic
PDEs [3, 27], many-body Schrödinger equations [31], and quantum Fourier trans-
forms (QFTs) [9]. That being said, most existing tensor decomposition algorithms break down for OIOs due to their inability to exploit the oscillatory structure of these operators; therefore, new tensor algorithms are called for.
In this paper, we propose a linear-complexity, low-prefactor tensor decomposi-
tion algorithm for large-scale and high-dimensional OIOs. This new tensor algorithm,
henceforth dubbed the tensor butterfly algorithm, leverages the intrinsic CLR prop-
erty of high-dimensional OIOs more effectively than the matrix butterfly algorithm,
which is enabled by additional tensor properties such as translational invariance of
free-space Green’s functions and dimensional separability of Fourier transforms. The
algorithm partitions the OIO tensor into subtensors of multiple levels, and factorizes
each subtensor at the middle level as a Tucker-like interpolative decomposition, whose
factor matrices are further constructed in a nested fashion. For a d-dimensional OIO (assuming d constant) discretized as a 2d-mode tensor with n being the size per mode, the factorization time, application time, and storage cost scale as O(n^d), and the resulting tensor factors have small multilinear ranks. This is in stark contrast both to the O(n^d log n) scaling of existing matrix algorithms such as matrix butterfly algorithms and FFTs, and to the super-linear scaling of existing tensor algorithms. We mention that the linear complexity of the factorization time in our proposed algorithm is achieved via a simple random entry-evaluation scheme, assuming that any arbitrary
entry can be computed in O(1) time. We remark that, for 3D high-frequency wave
equations, the proposed tensor butterfly algorithm can handle discretized Green’s
function tensors 512× larger than existing algorithms; on the other hand, for the
largest sized tensor that can be handled by existing algorithms, our tensor butter-
fly algorithm is 200× faster than existing ones. Moreover, we claim that the tensor
butterfly algorithm instantiates the first linear-complexity implementation of high-
dimensional FFTs for arbitrary input data.
1.1. Related Work. Multi-dimensional butterfly algorithms represent a version
of matrix butterfly algorithms designed for high-dimensional OIOs [38, 10]. Instead
of the traditional binary tree partitioning of the matrix rows/columns [52], these
algorithms can be viewed as a modern version of [53] that permits quadtree and octree
partitioning of the matrix rows/columns, which have been demonstrated on 2D and
3D OIOs. For a general d-dimensional OIO, the d-dimensional tree partitioning leads
to a butterfly factorization with a d-fold reduction in the number of levels compared
to the binary tree partitioning. However, we note that both the multi-dimensional and binary tree-based butterfly algorithms are still matrix-based algorithms that scale as O(n^d log n), as opposed to the proposed tensor algorithm, which scales as O(n^d).
Quantized tensor train (QTT) algorithms, or simply TT algorithms, are tensor
algorithms well-suited for very high-dimensional integral operators. They have been
proposed to compress volume integral operators [14] arising from quasi-static wave
equations and static PDEs with O(log n) memory and CPU complexities. However,
for high-frequency wave equations, the QTT rank scales proportionally to the wavenumber [14], leading to deteriorated CPU and memory complexities (see our numerical results in section 4). Moreover, QTT has been proposed for computing FFTs and QFTs with O(log n) memory and CPU complexities [9]. However, after obtaining the QTT-compressed formats of both the volume-integral operator and the Fourier transform, the CPU complexity for contracting such a QTT-compressed operator with arbitrary (i.e., non-QTT-compressed) input data scales super-linearly. In contrast, our algorithm yields linear CPU and memory complexity for the contraction operation.
1.2. Contents. In what follows, we first review the matrix low-rank decompo-
sition and butterfly decomposition algorithms in section 2. In subsection 3.1, we
introduce the Tucker-like interpolative decomposition algorithm as the building block
for the proposed tensor butterfly algorithm detailed in subsection 3.2. The multi-
linear butterfly ranks for a few special cases are analyzed in subsection 3.2.1 and the
complete complexity analysis is given in subsection 3.2.2. Section 4 shows a variety
of numerical examples, including Green’s functions for wave equations, Radon trans-
forms, and uniform and non-uniform discrete Fourier transforms, to demonstrate the
performance of matrix butterfly, tensor butterfly, Tucker and QTT algorithms.
1.3. Notations. Given a scalar-valued function f(x), its integral transform is defined as

(1.1)    g(x) = ∫_y K(x, y) f(y) dy

with an integral kernel K(x, y). The indexing of a matrix K is denoted by K(i, j) or K(t, s), where i, j are indices and t, s are index sets. We use K^T to denote the transpose of matrix K. For a sequence of matrices K_1, . . . , K_n, the matrix product is

(1.2)    ∏_{i=1}^{n} K_i = K_1 K_2 · · · K_n,
the vertical stacking (assuming the same column dimension) is

(1.3) [Ki ]i = [K1 ; K2 ; . . . ; Kn ],

and

(1.4) diagi (Ki ) = diag(K1 , K2 , . . . , Kn )

is a block diagonal matrix with K_i being the diagonal blocks. Given an L-level binary-tree partitioning T_t of an index set t = {1, 2, . . . , n}, any node τ at each level is a subset of t. The parent and children of τ are denoted by pτ and τ^c (c = 1, 2), respectively, and τ = τ^1 ∪ τ^2.
A multi-index i = (i_1, . . . , i_d) is a tuple of indices, and similarly a multi-set τ = (τ_1, τ_2, . . . , τ_d) is a tuple of index sets. We define

(1.5)    τ_{k←t} = (τ_1, τ_2, . . . , τ_{k−1}, t, τ_{k+1}, τ_{k+2}, . . . , τ_d).

Given a tuple of nodes (i.e., a multi-set) τ = (τ_1, τ_2, . . . , τ_d) and a multi-index c = (c_1, c_2, . . . , c_d) with c_i ∈ {1, 2}, the children of τ are denoted τ^c = (τ_1^{c_1}, τ_2^{c_2}, . . . , τ_d^{c_d}), and the parents of τ_i, i = 1, 2, . . . , d can simply be written as pτ = (pτ_1, pτ_2, . . . , pτ_d). Similar to the above-described notations, we can replace the index i in [K_i]_i and diag_i(K_i) with an index set τ, a multi-index c, or a multi-set τ, assuming certain predefined index ordering.
Given complex-valued (or real-valued) functions f(x) of d variables and integral operators K(x, y), the tensor representations of their discretizations are respectively denoted by F ∈ C^{n_1×n_2×···×n_d} and K ∈ C^{m_1×m_2×···×m_d×n_1×n_2×···×n_d}, where n_1, . . . , n_d and m_1, . . . , m_d are sizes of discretizations for the corresponding variables. In this paper, we use matricization to denote the reshaping of K into a (∏_k m_k)×(∏_k n_k) matrix, and the reshaping of F into a (∏_k n_k)×1 matrix. The entries of F and K are denoted by F(i) (or equivalently F(i_1, i_2, . . . , i_d)) and K(i, j), respectively. Similarly, the subtensors are denoted by F(τ) (or equivalently F(τ_1, τ_2, . . . , τ_d)) and K(τ, ν).
Given a d-mode tensor F ∈ C^{n_1×n_2×···×n_d}, the mode-j unfolding is denoted by F^{(j)} ∈ C^{(∏_{k≠j} n_k)×n_j}, and the mode-j tensor-matrix product of F with a matrix X ∈ C^{m×n_j} is denoted by Y = F ×_j X, or equivalently Y^{(j)} = F^{(j)} X^T.
2. Review of Matrix Algorithms. We consider a d-dimensional OIO kernel K(x, y) with x, y ∈ R^d discretized on point pairs x_i and y_j, i = 1, 2, . . . , (m_1 m_2 · · · m_d), j = 1, 2, . . . , (n_1 n_2 · · · n_d), where i (and similarly j) is the flattening of the corresponding multi-index i. Such a discretization can be represented as a matrix K ∈ C^{(m_1 m_2 ··· m_d)×(n_1 n_2 ··· n_d)}. When it is clear in the context, we assume that m_k = n_k = n for k = 1, . . . , d. Throughout this paper, we assume that K (and its tensor representation) is never fully formed; instead, a function is provided to evaluate any matrix (or tensor) entry in O(1) time. Next we review matrix compression algorithms for K, including low-rank and butterfly algorithms.
2.1. Interpolative Decomposition. The interpolative decomposition (ID) algorithm [29] is a matrix compression technique that constructs a low-rank decomposition whose factors contain original entries of the matrix. More specifically, consider the matrix K(τ, ν) ∈ C^{m×n} with m ≈ n, τ = {1, 2, . . . , m}, ν = {1, 2, . . . , n}; the column ID of K (the index sets τ and ν are omitted when clear in context) is

(2.1)    K ≈ K(:, ν̄)V,

where the skeleton matrix K(:, ν̄) contains r skeleton columns indexed by ν̄ ⊆ ν and the interpolation matrix V has bounded entries. Here the numerical rank r is chosen such that

(2.2)    ∥K − K(:, ν̄)V∥²_F ⩽ O(ϵ²)∥K∥²_F

for a prescribed relative tolerance ϵ. In practice, the column ID can be computed via a rank-revealing QR decomposition with relative tolerance ϵ [39]. Similarly, the row ID of the matrix K reads

(2.3)    K ≈ UK(τ̄, :),

where the skeleton matrix K(τ̄, :) contains r skeleton rows indexed by τ̄ ⊆ τ and the interpolation matrix U has bounded entries. The row ID can simply be computed by the column ID of K^T. Combining the column and row IDs in (2.1) and (2.3) gives

(2.4)    K ≈ UK(τ̄, ν̄)V.

It is straightforward to note that the memory and CPU complexities of ID scale as O(nr) and O(n²r), respectively. The CPU complexity can be reduced to O(nr²) when properly selected proxy rows in (2.1) and columns in (2.3) are used in the rank-revealing QR. Common strategies for choosing proxy rows/columns (henceforth called proxy index strategies) for integral operators include evenly spaced or uniform random samples, and more generally the use of Chebyshev nodes and proxy surfaces (where new rows K(x, y_j) other than original rows of K are used, with x denoting the proxies). However, for large OIOs, the rank r depends on the size n of the matrix; consequently, ID is not an efficient compression algorithm. Next, we review the matrix butterfly algorithm, which is capable of achieving quasi-linear memory and CPU complexities for OIOs.
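To make the construction above concrete, the following is a minimal Python sketch of the column ID via rank-revealing QR, along the lines of (2.1)-(2.2). The truncation rule based on the diagonal of R is a common heuristic stand-in for the Frobenius-norm criterion (2.2), and the function name column_id is ours, not from any particular library.

import numpy as np
from scipy.linalg import qr, solve_triangular

def column_id(K, eps):
    # Rank-revealing (column-pivoted) QR: K[:, perm] = Q @ R.
    Q, R, perm = qr(K, mode="economic", pivoting=True)
    d = np.abs(np.diag(R))
    # Heuristic truncation: keep columns while |R[k,k]| > eps * |R[0,0]|,
    # a standard proxy for the relative criterion (2.2).
    r = max(1, int(np.sum(d > eps * d[0])))
    # Interpolation coefficients T = R11^{-1} R12; with pivoting the entries
    # of T are modest in practice, giving a bounded interpolation matrix V.
    T = solve_triangular(R[:r, :r], R[:r, r:])
    V = np.zeros((r, K.shape[1]), dtype=R.dtype)
    V[:, perm[:r]] = np.eye(r)
    V[:, perm[r:]] = T
    return perm[:r], V  # skeleton columns and V, so K ~= K[:, perm[:r]] @ V

The row ID (2.3) then follows by applying column_id to K^T, and combining the two gives (2.4).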
2.2. Matrix Butterfly Algorithm. Letting t^0 = {1, 2, . . . , m}, s^0 = {1, 2, . . . , n}, and m = n, the L-level butterfly representation of the discretized OIO K(t^0, s^0) is based on two binary trees, T_{t^0} and T_{s^0}, and the CLR property of the OIO takes the following form: at any level 0 ≤ l ≤ L, for any node τ at level l of T_{t^0} and any node ν at level L − l of T_{s^0}, the subblock K(τ, ν) is numerically low-rank with rank r_{τ,ν} bounded by a small number r called the butterfly rank [46, 36, 37, 56].
For any subblock K(τ, ν), CLR permits a low-rank representation using the ID in (2.4) as

(2.5)    K(τ, ν) ≈ U_{τ,ν} K(τ̄, ν̄) V_{τ,ν},

where the skeleton rows and columns are indexed by τ̄ and ν̄, respectively. It is worth noting that, given a node ν, the selection of skeleton columns ν̄ depends on the node τ. However, the notation ·̄ does not reflect this dependency when it is clear in the context.
Without loss of generality, we assume that L is an even number so that L^c = L/2 denotes the middle level. At levels l = 0, . . . , L^c, the interpolation matrices V_{τ,ν} are computed as follows.
At level l = 0, V_{τ,ν} are explicitly formed, while at levels 0 < l ≤ L^c they are represented in a nested fashion. To see this, consider a node pair (τ, ν) at level l > 0 and let ν^1, ν^2 be the children of ν and pτ the parent of τ. Let s be the ancestor of ν at level L^c of T_{s^0} and let T_s denote the subtree rooted at s.
By CLR, we have

K(τ, ν) = [K(τ, ν^1)  K(τ, ν^2)]
(2.6)    ≈ [K(τ, ν̄^1)  K(τ, ν̄^2)] diag(V^s_{pτ,ν^1}, V^s_{pτ,ν^2})
(2.7)    ≈ K(τ, ν̄) W^s_{τ,ν} diag(V^s_{pτ,ν^1}, V^s_{pτ,ν^2}).

Here W^s_{τ,ν} and ν̄ are the interpolation matrix and skeleton columns from the ID of K(τ, ν̄^1 ∪ ν̄^2), respectively. W^s_{τ,ν} is henceforth referred to as the transfer matrix for ν in the rest of this paper. Note that we have added an additional superscript s to V^s_{pτ,ν^c} and W^s_{τ,ν} for notational convenience in the later context. From (2.6), it is clear that the interpolation matrix V^s_{τ,ν} can be expressed in terms of its parent pτ's and children ν^1, ν^2's interpolation matrices as

(2.8)    V^s_{τ,ν} = W^s_{τ,ν} diag(V^s_{pτ,ν^1}, V^s_{pτ,ν^2}).

Note that the interpolation matrices V^s_{τ,ν} at level l = 0 and transfer matrices W^s_{τ,ν} at levels 0 < l ≤ L^c do not require the column ID on the full subblocks K(τ, ν) and K(τ, ν̄^1 ∪ ν̄^2), which would lead to at least an O(mn) computational complexity. In practice, one can select O(r_{τ,ν}) proxy rows τ̂ ⊂ τ to compute V^s_{τ,ν} and W^s_{τ,ν} via ID as:

(2.9)     K(τ̂, ν) ≈ K(τ̂, ν̄)V^s_{τ,ν},    l = 0,
(2.10)    K(τ̂, ν̄^1 ∪ ν̄^2) ≈ K(τ̂, ν̄)W^s_{τ,ν},    0 < l ≤ L^c.

The viable choices for proxy rows have been discussed in several existing papers [45, 56, 59, 8].
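As an illustration of the proxy-row construction (2.10), the following hedched Python sketch computes a transfer matrix from an O(r) × O(r) block only. It reuses column_id from the sketch in subsection 2.1; the helper eval_block and all argument names are our own conventions, and the proxy rows tau_proxy are assumed to be chosen by one of the strategies cited above.

import numpy as np

def eval_block(kernel, rows, cols):
    # Materialize K(rows, cols) entry by entry from the O(1) entry evaluator.
    return np.array([[kernel(i, j) for j in cols] for i in rows])

def transfer_matrix(kernel, tau_proxy, nu1_skel, nu2_skel, eps):
    # (2.10): column ID of K(tau_hat, nu1_bar U nu2_bar) over proxy rows only,
    # so the ID never touches more than an O(r) x O(r) block.
    cols = np.concatenate([nu1_skel, nu2_skel])
    keep, W = column_id(eval_block(kernel, tau_proxy, cols), eps)
    return cols[keep], W  # new skeleton columns nu_bar and transfer matrix W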
At levels l = L^c, . . . , L, the interpolation matrices U_{τ,ν} are computed by performing similar operations on K^T. We only provide their expressions here and omit the redundant explanation. Let t be the ancestor of τ at level L^c of T_{t^0} and let T_t denote the subtree rooted at t. At level l = L, U^t_{τ,ν} are explicitly formed. At levels L^c ≤ l < L, only the transfer matrices P^t_{τ,ν} are computed, from the column ID of K^T(ν, τ̄^1 ∪ τ̄^2), satisfying

(2.11)    U^t_{τ,ν} = diag(U^t_{τ^1,pν}, U^t_{τ^2,pν}) P^t_{τ,ν}.
Combining (2.5), (2.8) and (2.11), the matrix butterfly decomposition can be expressed, for each node pair (t, s) at level L^c of T_{t^0} and T_{s^0}, as

(2.12)    K(t, s) ≈ U^t (∏_{l=1}^{L^c} P^{t,s}_l) K(t̄, s̄) (∏_{l=L^c}^{1} W^{t,s}_l) V^s.

Here, t̄ and s̄ represent the skeleton rows and columns of the ID of K(t, s). The interpolation factors U^t and V^s in (2.12) are

(2.13)    U^t = diag_τ(U^t_{τ,s^0}),    τ at level L^c of T_t,
(2.14)    V^s = diag_ν(V^s_{t^0,ν}),    ν at level L^c of T_s,
and the transfer factors P^{t,s}_l and W^{t,s}_l for l = 1, . . . , L^c consist of transfer matrices W^s_{τ,ν} and P^t_{τ,ν}:

(2.15)    W^{t,s}_l = diag_τ([diag_ν(W^s_{τ^c,ν})]_c),    τ at level l − 1 of T_{t^0} with t ⊆ τ, ν at level L^c − l of T_s;
(2.16)    (P^{t,s}_l)^T = diag_ν([diag_τ((P^t_{τ,ν^c})^T)]_c),    τ at level L^c − l of T_t, ν at level l − 1 of T_{s^0} with s ⊆ ν.

Here τ^c and ν^c with c = 1, 2 are children of τ and ν, respectively.
The CPU and memory requirements for computing the matrix butterfly decomposition can be briefly analyzed as follows. Note that we only need to analyze the costs for V^s_{τ,ν}, W^s_{τ,ν} and K(t̄, s̄), as those for U^t_{τ,ν} and P^t_{τ,ν} are similar. By the CLR assumption, we assume that r_{τ,ν} ≤ r, ∀τ, ν, for some constant r. Thanks to the use of the proxy rows and columns, the computation of one individual V^s_{τ,ν} or W^s_{τ,ν} by ID only operates on O(r) × O(r) matrices, hence its memory and CPU requirements are O(r²) and O(r³), respectively. In total, there are O(2^{L^c}) middle-level nodes s, each having O(2^{L^c}) matrices V^s_{τ,ν} and O(L^c 2^{L^c}) matrices W^s_{τ,ν}. Similarly, each K(t̄, s̄) requires O(r²) CPU and memory costs, and there are in total O(2^L) middle-level node pairs (t, s). These numbers sum up to the overall O(nr² log n) memory and O(nr³ log n) CPU complexities for matrix butterfly algorithms.
For d-dimensional discretized OIOs K ∈ C^{(m_1 m_2 ··· m_d)×(n_1 n_2 ··· n_d)} with m_k = n_k = n, we can assume that n = C_b 2^L with some constant C_b. For the above-described binary-tree-based butterfly algorithm, the leaf nodes of the trees are of size C_b^d, and this leads to a dL-level butterfly factorization. The memory and CPU complexities of this algorithm become O(dn^d r² log n) and O(dn^d r³ log n), respectively. On the other hand, the multi-dimensional tree-based butterfly algorithm [38, 10] leads to an L-level factorization with O(2^d n^d r² log n) memory and O(2^d n^d r³ log n) CPU complexities. Despite their quasi-linear complexity for high-dimensional OIOs, the butterfly rank r is constant but high, leading to very large prefactors for these binary and multi-dimensional tree-based algorithms. In the following, we turn to tensor decomposition algorithms to reduce both the prefactor and the asymptotic scaling of matrix butterfly algorithms.
3. Proposed Tensor Algorithms. In this section, we assume that the d-dimensional discretized OIO in section 2 is directly represented as a 2d-mode tensor K ∈ C^{m_1×m_2×···×m_d×n_1×n_2×···×n_d}. We first extend the matrix ID algorithm in subsection 2.1 to its tensor variant, which serves as the building block for the proposed tensor butterfly algorithm.
3.1. Tucker-like Interpolative Decomposition. Given the 2d-mode tensor K(τ, ν) with τ_k = {1, 2, . . . , m_k} and ν_k = {1, 2, . . . , n_k} for k = 1, . . . , d, the proposed tensor ID decomposition compresses each dimension independently via the column ID of the unfolding of K along the k-th dimension,

(3.1)    K^{(k)} = K^{(k)}(:, τ̄_k)U_k,    K^{(d+k)} = K^{(d+k)}(:, ν̄_k)V_k,    k = 1, . . . , d,

where K^{(k)} ∈ C^{(∏_{j≠k} n_j)×n_k} is the mode-k unfolding, or equivalently

(3.2)    K = K(τ_{k←τ̄_k}, ν) ×_k U_k,    K = K(τ, ν_{k←ν̄_k}) ×_{d+k} V_k,    k = 1, . . . , d.

Here, τ̄_k and ν̄_k denote the skeleton indices along modes k and d + k of K, respectively, while τ_{k←τ̄_k} and ν_{k←ν̄_k} denote multi-sets that replace τ_k and ν_k, respectively, with
Fig. 3.1: Tensor diagrams for (a) the Tucker-ID decomposition of a 4-mode tensor, and (b) the matrix partitioner corresponding to a 2^d × 2 partitioning with d = 2 used in the tensor butterfly decomposition of a 2d-mode tensor, such as [W^{s,k}_{τ^c,ν}]_c in (3.12) for fixed s, τ, k and ν, or [P^{t,k}_{τ,ν^c}]_c in (3.11) for fixed t, ν, k and τ. (c) The tensor diagram involving blocks V^{s,k}_{t^0,ν} (in green) and blocks W^{s,k}_{τ^c,ν} (in blue) for fixed s and k for the tensor butterfly decomposition of a 2d-mode tensor.

τ̄_k and ν̄_k. Combining (3.2) for all dimensions yields the following proposed tensor interpolative decomposition,

(3.3)    K = K(τ̄, ν̄) (∏_{k=1}^{d} ×_k U_k) (∏_{k=1}^{d} ×_{d+k} V_k),

where τ̄ = (τ̄_1, τ̄_2, . . . , τ̄_d), ν̄ = (ν̄_1, ν̄_2, . . . , ν̄_d), the core tensor K(τ̄, ν̄) is a subtensor of K, and U_k and V_k are the factor matrices for modes k and d + k, respectively.
Note that the tensor diagram of (3.3) is exactly the same as that of Tucker decompositions or higher-order singular value decompositions (HOSVD) [16]. Both decompositions provide a canonical "core and factor product" form of tensor approximation. See
Figure 3.1(a) for the tensor diagram of (3.3) for a 4-mode tensor. Unlike the Tucker
decomposition that leads to orthonormal factor matrices, the proposed decomposition
leads to factor matrices with bounded entries and the core tensor with the original ten-
sor entries. Therefore, the proposed decomposition is named Tucker-like interpolative
decomposition (Tucker ID). It is worth noting that there exist several interpolative
tensor decomposition algorithms [6, 50, 51, 58]. However they either use original
tensor entries in the factor matrices (instead of the core tensor) [50, 58, 6] or rely
on a different tensor diagram [51]. As will be seen in subsection 3.2, the Tucker ID
algorithm is a unique and essential building block of the tensor butterfly algorithm.
The memory and CPU complexities of Tucker-ID can be briefly analyzed as follows. Assuming that m_k = n_k = n and max_k |τ̄_k| = max_k |ν̄_k| = r is a constant (we will discuss the case of non-constant r in subsection 3.2.3), the memory requirement is simply O(drn + r^{2d}), where the first and second terms represent the storage units for the factor matrices and the core tensor, respectively. The CPU cost for naive computation of Tucker-ID is O(drn^{2d} + r^{2d}), where the first term represents the cost of rank-revealing QR on the unfolding matrices in (3.1), and the second term represents the cost of forming the core tensor K(τ̄, ν̄). In practice, however, the unfolding matrices do not need to be fully formed, and one can leverage the idea of proxy rows in subsection 2.2 to reduce the cost of computing the factor matrices to O(dnr^{2d}). We will explain this in more detail in the context of the proposed tensor butterfly decomposition algorithm.
Just like the matrix ID algorithm, Tucker-ID is also not suitable for representing
large-sized OIOs as the rank r depends on the size n. That said, the Tucker-ID rank
is typically significantly smaller than the matrix ID rank, as it exploits more com-
pressibility properties across dimensions by leveraging e.g. translational-invariance
or dimensional-separability properties of OIOs; see subsection 3.2.1 for a few of such
examples. In what follows, we use Tucker-ID as the building block for constructing a
linear-complexity tensor butterfly decomposition algorithm for large-sized OIOs.
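For illustration, a minimal Python sketch of the Tucker-ID of (3.1)-(3.3) is given below, reusing column_id from the sketch in subsection 2.1. It forms the unfoldings in full (the naive O(drn^{2d}) variant), so it is meant only to make the mode-by-mode skeleton selection and the core-plus-factors reconstruction explicit, not to be efficient; the function names are ours.

import numpy as np

def mode_unfold(T, j):
    # Mode-j unfolding of shape (product of other mode sizes) x n_j.
    return np.moveaxis(T, j, -1).reshape(-1, T.shape[j])

def tucker_id(K, eps):
    # Column ID of every mode unfolding gives skeleton indices and an
    # interpolation matrix; the core is the subtensor at the skeletons.
    skel, interp = [], []
    for j in range(K.ndim):
        cols, V = column_id(mode_unfold(K, j), eps)
        skel.append(cols)
        interp.append(V)
    core = K[np.ix_(*skel)]
    return core, interp

def tucker_id_reconstruct(core, interp):
    # K ~= core x_1 V_1 ... x_{2d} V_{2d}; the approximation error compounds
    # mildly across modes when all modes are skeletonized jointly.
    T = core
    for j, V in enumerate(interp):
        T = np.moveaxis(np.tensordot(T, V, axes=([j], [0])), -1, j)
    return T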
3.2. Tensor Butterfly Algorithm. Consider a 2d-mode OIO tensor K(t^0, s^0) with t^0 = (t^0_1, t^0_2, . . . , t^0_d), s^0 = (s^0_1, s^0_2, . . . , s^0_d), t^0_k = {1, 2, . . . , m_k}, s^0_k = {1, 2, . . . , n_k}, k = 1, 2, . . . , d. Without loss of generality, we assume that m_k = n_k = n. We further assume that each t^0_k (and s^0_k) is binary partitioned with a tree T_{t^0_k} (and T_{s^0_k}) of L levels for k = 1, 2, . . . , d.
To start with, we first define the tensor CLR property as follows:
• For any level 0 ≤ l ≤ L^c, any multi-set τ = (τ_1, τ_2, . . . , τ_d) with τ_i, i ≤ d at level l of T_{t^0_i}, any multi-set s = (s_1, s_2, . . . , s_d) with s_i, i ≤ d at level L^c of T_{s^0_i}, any mode 1 ≤ k ≤ d, and any node ν at level L^c − l of T_{s_k}, the mode-(d + k) unfolding of the subtensor K(τ, s_{k←ν}) is numerically low-rank (with rank bounded by r), permitting an ID via (3.2):

(3.4)    K(τ, s_{k←ν}) ≈ K(τ, s_{k←ν̄}) ×_{d+k} V^{s,k}_{τ,ν}.

• For any level 0 ≤ l ≤ L^c, any multi-set ν = (ν_1, ν_2, . . . , ν_d) with ν_i, i ≤ d at level l of T_{s^0_i}, any multi-set t = (t_1, t_2, . . . , t_d) with t_i, i ≤ d at level L^c of T_{t^0_i}, any mode 1 ≤ k ≤ d, and any node τ at level L^c − l of T_{t_k}, the mode-k unfolding of the subtensor K(t_{k←τ}, ν) is numerically low-rank (with rank bounded by r), permitting an ID via (3.2):

(3.5)    K(t_{k←τ}, ν) ≈ K(t_{k←τ̄}, ν) ×_k U^{t,k}_{τ,ν}.

In essence, the tensor CLR in (3.4) and (3.5) investigates the unfolding of judiciously selected subtensors rather than the matricization used in the matrix CLR. Moreover, the tensor CLR requires fixing d − 1 modes of the 2d-mode subtensors to be of size O(√n) while changing the remaining d + 1 modes with respect to l. Therefore each ID computation can operate on larger subtensors compared to the matrix CLR. In subsection 3.2.1 we provide two examples, namely a free-space Green's function tensor and a high-dimensional Fourier transform, to explain why the tensor CLR is valid, and in subsection 3.2.2 we will see that the tensor CLR essentially reduces the quasi-linear complexity of the matrix butterfly algorithm to linear complexity. Here, assuming that the tensor CLR holds true, we describe the tensor butterfly algorithm. We note that there may be alternative ways to define the tensor CLR different from (3.4) and (3.5), and we leave that as future work.
In what follows, we focus on the computation of V^{s,k}_{τ,ν} (corresponding to the mid-level multi-set s), as U^{t,k}_{τ,ν} (corresponding to the mid-level multi-set t) can be computed in a similar fashion. At level l = 0, V^{s,k}_{τ,ν} are explicitly formed. At levels 0 < l ≤ L^c, they are represented in a nested fashion. Let pτ = (pτ_1, pτ_2, . . . , pτ_d) consist of the parents of τ = (τ_1, τ_2, . . . , τ_d) in (3.4).
By the tensor CLR property, we have
" #
Vps,k
τ ,ν
1
K(τ , sk←ν ) ≈ K(τ , sk←ν 1 ∪ν 2 ) ×d+k
Vps,k
τ ,ν
2

9

Fig. 3.2: (a) Tensor diagram for the tensor butterfly decomposition with L = 2 levels of a 4-mode OIO tensor representing (b) high-frequency Green's function interactions between parallel facing 2D unit squares. Only the full connectivity for three middle-level node pairs is shown (the two green circles and one orange circle in (a)). The orange circle in (a) represents the core tensor K(t̄, s̄) for a mid-level pair (t, s) with t = (t_1, t_2), s = (s_1, s_2), highlighted in orange in (b).

" #!
Vps,k 1
(3.6) ≈ K(τ , sk←ν ) ×d+k Wτs,k τ ,ν
.

Vps,k
τ ,ν
2

Comparing (3.6) and (3.4), one realizes that the interpolation matrix V^{s,k}_{τ,ν} is represented as the product of the transfer matrix W^{s,k}_{τ,ν} and diag_c(V^{s,k}_{pτ,ν^c}). Here, the transfer matrix W^{s,k}_{τ,ν} is computed as the interpolation matrix of the column ID of the mode-(d + k) unfolding of K(τ, s_{k←ν̄^1∪ν̄^2}). As mentioned in section 3, in practice one never forms the unfolding matrix in full, but instead considers the unfolding of K(τ̂, ŝ_{k←ν̄^1∪ν̄^2}), where τ̂ = (τ̂_1, τ̂_2, . . . , τ̂_d) and ŝ = (ŝ_1, ŝ_2, . . . , ŝ_d); here τ̂_i and ŝ_i consist of O(r) judiciously selected indices along modes i and d + i, respectively. Note that ŝ_k is never used, as it is replaced by ν̄^1 ∪ ν̄^2 in (3.6). The same proxy index strategy can be used to obtain V^{s,k}_{τ,ν} at level l = 0. For each W^{s,k}_{τ,ν} or V^{s,k}_{τ,ν}, its computation requires O(r^{2d+1}) CPU time.
Similarly in (3.5), U^{t,k}_{τ,ν} is explicitly formed at l = 0 and constructed via the transfer matrix P^{t,k}_{τ,ν} at levels 0 < l ≤ L^c:

K(t_{k←τ}, ν) ≈ K(t_{k←τ̄^1∪τ̄^2}, ν) ×_k diag(U^{t,k}_{τ^1,pν}, U^{t,k}_{τ^2,pν})

(3.7)    ≈ K(t_{k←τ̄}, ν) ×_k (P^{t,k}_{τ,ν} diag(U^{t,k}_{τ^1,pν}, U^{t,k}_{τ^2,pν})).

Putting together (3.4), (3.5), (3.6) and (3.7), the proposed tensor butterfly decomposition can be expressed, for any multi-set t = (t_1, t_2, . . . , t_d) with t_i at level L^c of T_{t^0_i} and any multi-set s = (s_1, s_2, . . . , s_d) with s_i at level L^c of T_{s^0_i}, by forming a Tucker-ID for the (t, s) pair:

(3.8)    K(t, s) ≈ K(t̄, s̄) ∏_{k=1}^{d} ×_k ((∏_{l=L^c}^{1} P^{t,s,k}_l) U^{t,k}) ∏_{k=1}^{d} ×_{d+k} ((∏_{l=L^c}^{1} W^{t,s,k}_l) V^{s,k}).

Here, t̄ and s̄ represent the skeleton indices of the Tucker-ID of K(t, s). The interpolation factors U^{t,k} and V^{s,k} in (3.8) are:

(3.9)     U^{t,k} = diag_τ(U^{t,k}_{τ,s^0}),    τ at level L^c of T_{t_k},
(3.10)    V^{s,k} = diag_ν(V^{s,k}_{t^0,ν}),    ν at level L^c of T_{s_k},

and the transfer factors P^{t,s,k}_l and W^{t,s,k}_l for l = 1, . . . , L^c are:

(3.11)    P^{t,s,k}_l = diag_ν([diag_τ(P^{t,k}_{τ,ν^c})]_c),    τ at level L^c − l of T_{t_k}; ν_i at level l − 1 of T_{s^0_i}, s_i ⊆ ν_i, i ≤ d;
(3.12)    W^{t,s,k}_l = diag_τ([diag_ν(W^{s,k}_{τ^c,ν})]_c),    τ_i at level l − 1 of T_{t^0_i}, t_i ⊆ τ_i, i ≤ d; ν at level L^c − l of T_{s_k}.
One can verify that when d = 1, the tensor butterfly algorithm (3.8) reduces
to the matrix butterfly algorithm (2.12). But when d > 1, the tensor butterfly
algorithm has a distinct algorithmic structure and the computational complexity can
be significantly reduced compared with the matrix butterfly algorithm. Detailed
computational complexity analysis is provided in subsection 3.2.2.
To better understand the structure of the tensor butterfly in (3.8), (3.9), (3.10), (3.11), and (3.12), we describe its tensor diagram here. We first create the tensor diagram for a matrix partitioner as shown in Figure 3.1(b), which represents a 2^d × 2 block partitioning of a matrix such as [W^{s,k}_{τ^c,ν}]_c in (3.12) for fixed s, τ, k and ν, or [P^{t,k}_{τ,ν^c}]_c in (3.11) for fixed t, ν, k and τ. In other words, there are 2 legs on the column dimension and 2^d legs on the row dimension. The diagram in Figure 3.1(c) shows the connectivity for all V^{s,k}_{t^0,ν} (the green circles) and W^{s,k}_{τ^c,ν} (the blue circles) for fixed s and k. The multiplication or contraction of all matrices in Figure 3.1(c) results in V^{s,k}_{t,s_k} for all mid-level multi-sets t, which are of course not explicitly formed.
As an example, consider an OIO representing the free-space Green's function interaction between two parallel facing unit square plates in Figure 3.2. The tensor is K(i, j) = K(x_i, y_j) = exp(−iωρ)/ρ, where x_i = (i_1/n, i_2/n, 0), y_j = (j_1/n, j_2/n, 1), ρ = |x_i − y_j|, and ω is the wavenumber. Here 1 represents the distance between the two plates. Consider an L = 2-level tensor butterfly decomposition, with a total of 16 middle-level multi-set pairs. Let (t, s) denote one middle-level multi-set pair with t = (t_1, t_2) and s = (s_1, s_2), as highlighted in orange in Figure 3.2(b). Their children are t_1^1, t_1^2, t_2^1, t_2^2 and s_1^1, s_1^2, s_2^1, s_2^2. Leveraging the representations in Figure 3.1(b)-(c), the full diagram for K(t, s) consists of one 4-mode tensor K(t̄, s̄) (highlighted in orange in Figure 3.2(a)), one transfer matrix per mode, and two factor matrices per mode. In addition, we plot the full connectivity for two other multi-set pairs (highlighted in green in Figure 3.2(a)). It is important to note that the factor matrices and transfer matrices are shared among the multi-set pairs.
The proposed tensor butterfly algorithm is fully described in Algorithm 3.1 for a 2d-mode tensor K ∈ C^{m_1×m_2×···×m_d×n_1×n_2×···×n_d}, and consists of three steps: (1) computation of V^{s,k}_{τ,ν} and W^{s,k}_{τ,ν} starting at Line 1, (2) computation of U^{t,k}_{τ,ν} and P^{t,k}_{τ,ν} starting at Line 17, and (3) computation of K(t̄, s̄) starting at Line 33. We note that, after each K(t̄, s̄) is formed, we leverage floating-point compression tools such as the ZFP software [40] to further compress it.
Once K is compressed, any input tensor F ∈ C^{n_1×n_2×···×n_d×n_v} can contract with it to compute G = K ×_{d+1,d+2,...,2d} F. It is clear to see that the contraction is equivalent to the matrix-matrix multiplication G = KF, where G ∈ C^{(∏_k m_k)×n_v}, K ∈ C^{(∏_k m_k)×(∏_k n_k)}, and F ∈ C^{(∏_k n_k)×n_v} are matricizations of G, K and F, respectively, and n_v is the number of columns of F (a small numerical check of this equivalence is sketched after the step list below). The contraction algorithm is described in Algorithm 3.2, which consists of three steps:
(1) Contraction with V^{s,k}_{τ,ν} and W^{s,k}_{τ,ν}. For each level l = 0, 1, . . . , L^c, one notices that, since the contraction operation for each multi-set τ with τ_i at level l of T_{t^0_i} and the middle-level multi-set s is independent of the others, one needs a separate tensor F^{τ,s} to store the contraction result for each multi-set pair (τ, s). F^{τ,s} can be computed by mode-by-mode contraction with the factor matrices V^{s,k} for l = 0 (Line 6) and the transfer matrices diag_ν(W^{s,k}_{τ,ν}) for l > 0 (Line 8).
(2) Contraction with K(t̄, s̄) at the middle level. Tensors at the middle level F^{t,s} are contracted with each subtensor K(t̄, s̄) separately, resulting in tensors G^{t,s} = K(t̄, s̄) ×_{d+1,d+2,...,2d} F^{t,s}.
(3) Contraction with U^{t,k}_{τ,ν} and P^{t,k}_{τ,ν}. As in Step (1), for each level l = L^c, L^c−1, . . . , 0, the contraction operation for each multi-set ν with ν_i at level l of T_{s^0_i} and middle-level multi-set t is independent. At levels l > 0, the contribution of tensors G^{t,ν} is accumulated into G^{t,pν} (Line 26); at level l = 0, the contraction results are stored in the final output tensor G(t, 1 : n_v) (Line 24).
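The following small self-contained Python sketch verifies the matricization equivalence G = KF stated above on a dense random example (d = 2, with illustrative sizes only):

import numpy as np

rng = np.random.default_rng(1)
d, m, n, nv = 2, 4, 5, 3
K = rng.standard_normal((m, m, n, n))          # dense 2d-mode operator tensor
F = rng.standard_normal((n, n, nv))            # (d+1)-mode input tensor

# Contract the last d operator modes against the first d input modes.
G = np.tensordot(K, F, axes=([2, 3], [0, 1]))  # shape (m, m, nv)

# The same result via the matricized multiply G = K F.
Gmat = K.reshape(m * m, n * n) @ F.reshape(n * n, nv)
assert np.allclose(G.reshape(m * m, nv), Gmat)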

3.2.1. Rank Estimate. In this subsection, we use two specific high-dimensional


examples, namely high-frequency free-space Green’s functions for wave equations and
uniform discrete Fourier transforms (DFTs) to investigate the matrix and tensor CLR
properties, and compare the matrix and tensor butterfly ranks rm and rt , respectively.
For the Green’s function example, the tensor CLR property is a result of matrix CLR
and translational invariance, and rt is much smaller than rm ; for the DFT example, the
tensor CLR property is a result of matrix CLR and dimensionality separability, and
rt is exactly the same as rm of 1D DFTs. For more-general OIOs, such as analytical
and numerical Green’s functions for inhomogeneous media, Radon transforms, non-
uniform DFTs, and general Fourier integral operators, rigorous rank analysis is non-
trivial and we rely on numerical experiments in section 4 to demonstrate the efficacy
of the tensor butterfly algorithm.
High-frequency Green’s functions. We use an example similar to the one used
in subsection 3.2. Consider an OIO representing the free-space Green’s function in-
teraction between two parallel-facing unit-square plates. The n × n × n × n tensor
Algorithm 3.1 Construction algorithm for the tensor butterfly decomposition of a 2d-mode tensor K ∈ C^{m_1×m_2×···×m_d×n_1×n_2×···×n_d}
Input: A function to evaluate a 2d-mode tensor K(i, j) for arbitrary multi-indices (i, j), binary partitioning trees of L levels T_{t^0_k} and T_{s^0_k} with roots t^0_k = {1, 2, . . . , m_k} and s^0_k = {1, 2, . . . , n_k}, and a relative compression tolerance ϵ.
Output: Tensor butterfly decomposition of K: (1) V^{s,k}_{τ,ν} at l = 0 and W^{s,k}_{τ,ν} at 1 ≤ l ≤ L^c for k ≤ d, multi-set τ with node τ_i at level l of T_{t^0_i}, multi-set s with node s_i at level L^c of T_{s^0_i}, and node ν at level L^c − l of subtree T_{s_k}; (2) U^{t,k}_{τ,ν} at l = 0 and P^{t,k}_{τ,ν} at 1 ≤ l ≤ L^c for k ≤ d, multi-set ν with node ν_i at level l of T_{s^0_i}, multi-set t with node t_i at level L^c of T_{t^0_i}, and node τ at level L^c − l of subtree T_{t_k}; and (3) subtensors K(t̄, s̄) at l = L^c.
1: (1) Compute V^{s,k}_{τ,ν} and W^{s,k}_{τ,ν}:
2: for level l = 0, . . . , L^c do
3:   for multi-set s = (s_1, . . . , s_d) with s_i at level L^c of T_{s^0_i} do
4:     for multi-set τ = (τ_1, τ_2, . . . , τ_d) with τ_i at level l of T_{t^0_i} do
5:       for mode index k = 1, . . . , d do
6:         for node ν at level L^c − l of T_{s_k} do
7:           if l = 0 then ▷ Use (3.4) with proxies τ̂, ŝ and tolerance ϵ
8:             Compute V^{s,k}_{τ,ν} and ν̄ via the mode-(d + k) unfolding of K(τ̂, ŝ_{k←ν})
9:           else ▷ Use (3.6) with proxies τ̂, ŝ and tolerance ϵ
10:            Compute W^{s,k}_{τ,ν} and ν̄ via the mode-(d + k) unfolding of K(τ̂, ŝ_{k←ν̄^1∪ν̄^2})
11:          end if
12:        end for
13:      end for
14:    end for
15:  end for
16: end for
17: (2) Compute U^{t,k}_{τ,ν} and P^{t,k}_{τ,ν}:
18: for level l = 0, . . . , L^c do
19:   for multi-set t = (t_1, . . . , t_d) with t_i at level L^c of T_{t^0_i} do
20:     for multi-set ν = (ν_1, ν_2, . . . , ν_d) with ν_i at level l of T_{s^0_i} do
21:       for mode index k = 1, . . . , d do
22:         for node τ at level L^c − l of T_{t_k} do
23:           if l = 0 then ▷ Use (3.5) with proxies t̂, ν̂ and tolerance ϵ
24:             Compute U^{t,k}_{τ,ν} and τ̄ via the mode-k unfolding of K(t̂_{k←τ}, ν̂)
25:           else ▷ Use (3.7) with proxies t̂, ν̂ and tolerance ϵ
26:             Compute P^{t,k}_{τ,ν} and τ̄ via the mode-k unfolding of K(t̂_{k←τ̄^1∪τ̄^2}, ν̂)
27:           end if
28:         end for
29:       end for
30:     end for
31:   end for
32: end for
33: (3) Compute K(t̄, s̄):
34: for multi-set s = (s_1, . . . , s_d) with s_i at level L^c of T_{s^0_i} do
35:   for multi-set t = (t_1, . . . , t_d) with t_i at level L^c of T_{t^0_i} do
36:     Compute K(t̄, s̄) and ZFP compress it
37:   end for
38: end for
Algorithm 3.2 Contraction algorithm for a tensor butterfly decomposition with an input tensor
Input: The tensor butterfly decomposition of a 2d-mode tensor K ∈ C^{m_1×m_2×···×m_d×n_1×n_2×···×n_d}, and a (full) (d + 1)-mode input tensor F ∈ C^{n_1×n_2×···×n_d×n_v}, where n_v denotes the number of columns of F^{(d+1)}.
Output: The (d + 1)-mode output tensor G = K ×_{d+1,d+2,...,2d} F, where G ∈ C^{m_1×m_2×···×m_d×n_v}.
1: (1) Multiply with V^{s,k}_{τ,ν} and W^{s,k}_{τ,ν}:
2: for level l = 0, . . . , L^c do
3:   for multi-set s = (s_1, s_2, . . . , s_d) with s_i at level L^c of T_{s^0_i} do
4:     for multi-set τ = (τ_1, τ_2, . . . , τ_d) with τ_i at level l of T_{t^0_i} do
5:       if l = 0 then
6:         F^{τ,s} = F(s, 1 : n_v) ∏_{k=1}^{d} ×_k V^{s,k}
7:       else
8:         F^{τ,s} = F^{pτ,s} ∏_{k=1}^{d} ×_k diag_ν(W^{s,k}_{τ,ν}) ▷ ν at level L^c − l of T_{s_k}
9:       end if
10:    end for
11:  end for
12: end for
13: (2) Contract with K(t̄, s̄):
14: for multi-set t = (t_1, t_2, . . . , t_d) with t_i at level L^c of T_{t^0_i} do
15:   for multi-set s = (s_1, s_2, . . . , s_d) with s_i at level L^c of T_{s^0_i} do
16:     ZFP decompress K(t̄, s̄) and compute G^{t,s} = K(t̄, s̄) ×_{d+1,d+2,...,2d} F^{t,s}
17:   end for
18: end for
19: (3) Multiply with U^{t,k}_{τ,ν} and P^{t,k}_{τ,ν}:
20: for level l = L^c, . . . , 0 do
21:   for multi-set t = (t_1, t_2, . . . , t_d) with t_i at level L^c of T_{t^0_i} do
22:     for multi-set ν = (ν_1, ν_2, . . . , ν_d) with ν_i at level l of T_{s^0_i} do
23:       if l = 0 then ▷ Compute and return G
24:         G(t, 1 : n_v) = G^{t,ν} ∏_{k=1}^{d} ×_k U^{t,k}
25:       else
26:         G^{t,pν} += G^{t,ν} ∏_{k=1}^{d} ×_k diag_τ(P^{t,k}_{τ,ν}) ▷ τ at level L^c − l of T_{t_k}
27:       end if
28:     end for
29:   end for
30: end for

is

(3.13)    K(i, j) = K(x_i, y_j) = exp(−iωρ)/ρ,

where x_i = (i_1/n, i_2/n, 0), y_j = (j_1/n, j_2/n, ρ_min), ω is the wavenumber, and ρ = |x_i − y_j|. Here ρ_min represents the distance between the two plates, assumed to be sufficiently large. In the high-frequency setting, n = C_p ω with a constant C_p independent of n and ω, and the grid size is δx = δy = 1/n per dimension. It has been well studied [52, 53, 20, 5] that for any multi-set pair (τ, ν) leading to a subtensor K(τ, ν) of sizes m_1 × m_2 × n_1 × n_2 with m_i, n_i ≤ n, the numerical rank of its matricization K ∈ C^{m_1 m_2 × n_1 n_2} can be estimated as
K ∈ Cm1 m2 ×n1 n2 can be estimated as
ω 2 a 2 n1 n2
(3.14) rm ≈ ω 2 a2 θϕ ≈ .
n2 ρ2min
Here a is the radius of the sphere enclosing the target domain of physical sizes m1 δx ×
m2 δy . θ ≈ nρnmin
1
, ϕ ≈ nρnmin
2
, and the product θϕ represents the solid angle covered
by the source domain as seen from the center of the target domain. Note that ρωa min
approximately represents the Nyquist sampling rate per direction needed in the source
domain. The matrix and tensor butterfly ranks can be estimated as follows:
• Matrix butterfly rank: Consider a matrix butterfly factorization of K. By design,
for any node pair at each level, m1 n1 = m2 n2 = Cb n, where Cb2 represents the size
of the leaf nodes. Therefore, the matrix butterfly rank can be estimated from (3.14)
as
Cb2
(3.15) rm ≈
2Cp2 ρ2min
Here we have assumed a = √m2n 1
. Note that rm is a constant independent of n, and
therefore the matrix CLR property holds true.
• Tensor butterfly rank: Consider an L-level tensor butterfly factorization of K. We
just need to check the tensor rank, e.g., the rank of the mode-4 unfolding of the
corresponding subtensors at Step (1) of Algorithm 3.1, as the unfolding for the
other modes can be investigated in a similar fashion. Figure 3.3(a) shows an exam-
ple of L = 2, where the target and source domains are partitioned at l = 0 (top)
and l = Lc = 1 (bottom) at Step (1) of Algorithm 3.1. Consider a multi-set pair
(τ , sk←ν ) with k = 4 required by the tensor CLR property in (3.4). Figure 3.3(a)
highlights in orange one multi-set pair at l = 0 (top) and one multi-set pair at
l = Lc (bottom). Mode 4 is highlighted in red, which needs to be skeletonized by
ID. By (3.14), the rank of the matricization of K(τ , sk←ν ) is no longer a constant as
c
the tensor butterfly algorithm needs to keep n1 = |s1 | = n/2L (see Figure 3.3(b)).
However, due to translational invariance of the free-space Green’s function, i.e.,
K(xi , y j ) = K(x̃, ỹ), where x̃ = (0, in2 , 0), ỹ = ( j1 −i1 j2
n , n , ρmin ), the mode-4 un-
folding of K(τ , sk←ν ) is the matrix representing the Green’s function interaction
between an enlarged target domain of sizes (m1 + n1 )δx × m2 δy and a source line
segment of length n2 δy . Therefore its rank (hence the tensor rank) can be estimated
as

ωa′ n2 2Cb
(3.16) rt ≈ ωa′ ϕ ≈ ≤ ,
nρmin Cp ρmin

where a′ is the radius of the sphere enclosing the enlarged target domain and ρωa min
approximately represents the Nyquist sampling√rate on the source line segment.
The last inequality is a result of a′ ≈ m√1 +n
2n
1
≤ 2m n
1
and m2 n2 = Cb n. Here, the
critical condition n1 ≤ m1 is a direct result of the setup of the tensor CLR in (3.4):
c
l ≤ Lc and n1 = |s1 | = n/2L (i.e., s1 is fixed as the center level set as l changes).
One can clearly see from (3.16) that rt is independent of n, and thus the tensor
CLR property holds true.
We remark that the tensor butterfly rank rt in (3.16) is significantly smaller

than the matrix butterfly rank rm in (3.15) with rt ≈ 2 rm . One can perform similar
analysis of rm and rt for different geometrical settings, such as a pair of well-separated
3D unit cubes, or a pair of co-planar 2D unit-square plates. We leave these exercises
to the readers.
15
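The rank gap between the matricization and the unfolding can also be checked numerically. The short Python sketch below builds a small 4-mode Green's tensor of the form (3.13) (with n = 16 and ω chosen so that C_p = 2/π, i.e., four points per wavelength; parameters are purely illustrative) and compares the numerical rank of its matricization with that of its mode-4 unfolding:

import numpy as np

n, rho_min = 16, 1.0
omega = np.pi * n / 2                      # four points per wavelength
g = np.arange(1, n + 1) / n
X1, X2, Y1, Y2 = np.meshgrid(g, g, g, g, indexing="ij")
rho = np.sqrt((X1 - Y1) ** 2 + (X2 - Y2) ** 2 + rho_min ** 2)
K = np.exp(-1j * omega * rho) / rho        # 4-mode tensor of (3.13)

def numrank(A, eps=1e-6):
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

print("matricization rank r_m:", numrank(K.reshape(n * n, n * n)))
print("mode-4 unfolding rank r_t:", numrank(K.reshape(-1, n)))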
Discrete Fourier Transform. Our second example is the high-dimensional discrete Fourier transform (DFT) defined by

(3.17)    K(i, j) = exp(2πi x_i · y_j)

with x_i = (i_1 − 1, i_2 − 1, . . . , i_d − 1) and y_j = ((j_1 − 1)/n, (j_2 − 1)/n, . . . , (j_d − 1)/n). We first notice that, since

(3.18)    exp(2πi x_i · y_j) = ∏_{k=1}^{d} exp(2πi (i_k − 1)(j_k − 1)/n),

to carry out arbitrary high-dimensional DFTs one can simply perform 1D DFTs one dimension at a time (while fixing the indices of the other dimensions) by either 1D FFTs or 1D matrix butterfly algorithms; a small numerical check of this separability is sketched after this paragraph. We choose the 1D butterfly approach as our reference algorithm. For each node pair at dimension k discretized into an m_k × n_k matrix, we assume that m_k n_k = C_b n. It has been proved in [8, 68] that this leads to the matrix CLR property and each 1D DFT (fixing indices in other dimensions) can be computed by the matrix butterfly algorithm in O(n log n) time with a constant butterfly rank r_m. Overall this approach requires O(dn^d log n) operations.
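A minimal self-contained Python check of this dimensional separability, applying 1D DFTs one mode at a time (here with 1D FFTs standing in for the 1D butterfly) and comparing against a full d-dimensional FFT:

import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 8))   # d = 3, n = 8, arbitrary input data

# d-dimensional DFT as d sweeps of 1D DFTs, one mode at a time, per (3.18).
G = F.astype(complex)
for axis in range(F.ndim):
    G = np.fft.fft(G, axis=axis)

assert np.allclose(G, np.fft.fftn(F))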
In contrast, the tensor butterfly algorithm relies on direct compression of, e.g., the mode-k unfolding of subtensors K(τ, s_{k←ν}). Consider any submatrix K_sub ∈ C^{m_k×n_k} of this unfolding matrix K^{(k)}; by fixing i_p and j_p with p ≠ k, its entry is simply exp(2πi (i_k − 1)(j_k − 1)/n) scaled by a constant factor

(3.19)    ∏_{p≠k} exp(2πi (i_p − 1)(j_p − 1)/n)

of modulus 1. Therefore the tensor butterfly rank is

r_t = rank(K_sub) = rank(K^{(k)}) = r_m.

The tensor CLR property thus holds true, and the tensor rank is exactly the same as that of the 1D butterfly algorithm per dimension. However, as we will see in subsection 3.2.2, our tensor butterfly algorithm yields a linear instead of quasi-linear CPU complexity for high-dimensional DFTs.
3.2.2. Complexity Analysis. Here we provide an analysis of the computational complexity and memory requirement of the proposed construction algorithm (Algorithm 3.1) and contraction algorithm (Algorithm 3.2), assuming that the tensor butterfly rank r_t is a small constant and d > 1.
At Step (1) of Algorithm 3.1, each level 1 ≤ l ≤ L^c has #s = O(√n^d), #τ = 2^{dl}, and #ν = O(√n/2^l) for each mode k ≤ d. Each W^{s,k}_{τ,ν} requires O(r_t²) storage and O(r_t^{2d+1}) computational time when proxy indices τ̂, ŝ are used. The storage requirement and computational cost for the W^{s,k}_{τ,ν} are:

(3.20)    mem_W = ∑_{l=1}^{L^c} d O(√n^d) 2^{dl} O(√n/2^l) O(r_t²) = O(d n^d r_t²),
Fig. 3.3: Illustration of the tensor CLR property with L = 2 for a 4-mode tensor representing free-space Green's function interactions between parallel facing unit square plates. (a) The target and source domains are partitioned at l = 0 (top) and l = L^c = 1 (bottom), with a multi-set pair (τ, s_{k←ν}) highlighted in orange for the skeletonization along mode 4. The sizes of the nodes are |τ_1| = m_1, |τ_2| = m_2, |s_1| = n_1 and |ν| = n_2. (b) Illustration of the rank of the matricization of K(τ, s_{k←ν}). (c) Illustration of the rank of the mode-4 unfolding of K(τ, s_{k←ν}).

(3.21)    time_W = ∑_{l=1}^{L^c} d O(√n^d) 2^{dl} O(√n/2^l) O(r_t^{2d+1}) = O(d n^d r_t^{2d+1}).

One can easily verify that the computation and storage of V^{s,k}_{τ,ν} at l = 0 is less dominant than that of W^{s,k}_{τ,ν} at l > 0, and we skip its analysis.
At Step (2) of Algorithm 3.1, we have #s = O(√n^d) and #t = O(√n^d), and each K(t̄, s̄) requires O(r_t^{2d}) computation time and storage units (even if it is further ZFP compressed to reduce the storage requirement), which adds up to

(3.22)    mem_K = O(√n^d) O(√n^d) O(r_t^{2d}) = O(n^d r_t^{2d}),
(3.23)    time_K = O(√n^d) O(√n^d) O(r_t^{2d}) = O(n^d r_t^{2d}).

Step (3) of Algorithm 3.1 has computational cost and memory requirement similar to Step (1) when contracting with the intermediate matrices P^{t,k}_{τ,ν}, with mem_P ∼ mem_W and time_P ∼ time_W.
Overall, Algorithm 3.1 requires

(3.24)    mem = mem_W + mem_K + mem_P = O(n^d r_t^{2d}),
(3.25)    time = time_W + time_K + time_P = O(d n^d r_t^{2d+1}).

Following a similar analysis, one can estimate the computational cost of Algorithm 3.2 as O(n^d r_t^{2d} n_v), which is essentially of the same order as mem of Algorithm 3.1, except for an extra factor n_v representing the size of the last dimension of the input tensor.
One critical observation is that the time and storage complexity of the tensor butterfly algorithm is linear in n^d with smaller ranks r_t, while that of the matrix butterfly algorithm is quasi-linear in n^d with much larger ranks r_m. This leads to a
                      Factor time              Apply time               r
Algorithm             d = 2       d = 3        d = 2       d = 3        d = 2   d = 3
Tensor butterfly      n^2         n^3          n^2         n^3          1       1
Matrix butterfly      n^2 log n   n^3 log n    n^2 log n   n^3 log n    1       1
Tucker ID             n^4         n^4 – n^6*   n^4         n^4 – n^6*   n       n
QTT (Green&Radon)     n^3 log n   n^3 log n    n^4 log n   n^5 log n    n       n
QTT (DFT)             log n       log n        n^2 log n   n^3 log n    1       1

Table 3.1: CPU complexity of the tensor butterfly algorithm, matrix butterfly algorithm, Tucker ID and QTT when applied to high-frequency Green's functions (d = 2 represents two parallel facing unit square plates and d = 3 represents two separated unit cubes), DFTs, and Radon transforms. Here we assume that the tensor butterfly, matrix butterfly and Tucker ID algorithms use proxy indices, and the QTT algorithm uses TT-cross. The big-O notation is assumed. *: for d = 3, the complexity of Tucker ID is n^6 for the Radon transform and DFT, and n^4 for the Green's function.

significantly superior algorithm, as will be demonstrated with the numerical results


in section 4. That being said, one can verify that there is no difference between the
two algorithms when d = 1.
3.2.3. Comparison with Tucker-ID and QTT. Here we compare the computational complexities of the matrix butterfly algorithm, tensor butterfly algorithm, Tucker-ID and QTT for several frequently encountered OIOs with d = 2, 3, namely Green's functions for high-frequency wave equations (where d = 2 represents two parallel facing unit square plates and d = 3 represents two separated unit cubes), Radon transforms (a type of Fourier integral operator), and DFTs. We first summarize the computational complexities of the factorization and application of the matrix and tensor butterfly algorithms in Table 3.1. Here we use r to denote the maximum rank of the submatrices or (unfoldings and matricizations of) subtensors associated with each algorithm; in other words, we drop the subscripts of r_m and r_t in this subsection. We note that r = O(1) for butterfly algorithms, and the computational complexity for the matrix and tensor butterfly algorithms is, respectively, O(dn^d log n) and O(dn^d) for all OIOs considered here.
The Tucker ID algorithm in subsection 3.1 (even with the use of proxy indices to accelerate the factorization) always leads to r = O(n) for OIOs, and hence almost always O(n^{2d}) factorization and application complexities (see Table 3.1). One exception is perhaps the Green's function for d = 3, where one can easily show that 4 out of the 6 unfolding matrices have a rank of O(n) and the remaining 2 have a rank of O(1), leading to the O(n^4) computational complexity. Overall, we remark that Tucker-like decomposition algorithms are typically the least efficient tensor algorithms for OIOs.
The QTT algorithm, on the other hand, is a more subtle algorithm to compare with. Assuming that the maximum rank among all steps in QTT is r, we first summarize the computational complexities of the factorization and application of QTT. For factorization, we only consider the TT-cross type of algorithms, which yields the best known computational complexity among all TT-based algorithms. The computational complexity of TT-cross is O(dr³ log n) [14, 55]. Once factorized, the application cost of the QTT factorization with a full input tensor is O(dr² n^d log n) [14]. This complexity can be reduced to O(dr² r_i² log n) when the input tensor is also in the QTT format with TT rank r_i. However, an arbitrary input tensor can have a TT rank up to r_i = O(n^{d/2}) (which leads to the same application cost as contraction with a full input tensor). Therefore, in our comparative study we stick with the O(dr² n^d log n) application complexity.
For high-frequency Green's functions and general-form Fourier integral operators (e.g., Radon transforms), the TT rank in general behaves as r = O(n) [14], leading to a factorization cost of O(dn³ log n) and an application cost of O(dn^{2+d} log n), as detailed in Table 3.1. It is worth mentioning that, treating DFTs as a special type of Fourier integral operators, QTT can achieve r = O(1) when a proper bit-reversal ordering is used [9], leading to a factorization cost of O(d log n) and an application cost of O(dn^d log n), as shown in Table 3.1. In contrast, the proposed tensor butterfly algorithm always yields O(dn^d) factorization and O(n^d) application costs.
4. Numerical Results. This section provides several numerical examples to
demonstrate the accuracy and efficiency of the proposed tensor butterfly algorithm
when applied to large-scale and high-dimensional OIOs including Green’s function
tensors for high-frequency Helmholtz equations (subsection 4.1), Radon transform
tensors (subsection 4.2), and high-dimensional DFTs (subsection 4.3). We compare
our algorithm with a few existing matrix and tensor algorithms including the matrix
butterfly algorithm in subsection 2.2, the Tucker ID algorithm in subsection 3.1, the
QTT algorithm [55], the FFT algorithm implemented in the heFFTe package [1], and
the non-uniform FFT (NUFFT) algorithm implemented in the FINUFFT package [2].
All of these algorithms except for Tucker ID and FINUFFT are tested in distributed-
memory parallelism. It is worth noting that currently there is no single package
that can both compute and apply the QTT decomposition in distributed-memory
parallelism. In our tests, we perform the factorization using a distributed-memory TT
code [61] that parallelizes a cross interpolation algorithm [19], and then we implement
the distributed-memory QTT contraction via the CTF package [62]. All experiments
are performed using 4 CPU nodes of the Perlmutter machine at NERSC in Berkeley,
where each node has two 64-core AMD EPYC 7763 processors and 128GB of 2133MHz
DDR4 memory.
4.1. Green's functions for high-frequency Helmholtz equations. In this subsection, we consider the tensor discretized from the 3D free-space Green's function for high-frequency Helmholtz equations. Specifically, the tensor entry is

$$
(4.1)\qquad K(i, j) = \frac{\exp(-i\omega\rho)}{\rho}, \qquad \rho = |x_i - y_j|,
$$

where $\omega$ represents the wave number. Two tests are performed: (1) a 4-way tensor representing the Green's function interaction between two parallel facing unit plates at distance 1, i.e., $x_i = (i_1/n, i_2/n, 0)$, $y_j = (j_1/n, j_2/n, 1)$, and $d = 2$; (2) a 6-way tensor representing the Green's function interaction between two unit cubes whose centers are a distance 2 apart, i.e., $x_i = (i_1/n, i_2/n, i_3/n)$, $y_j = (j_1/n, j_2/n, j_3/n + 2)$, and $d = 3$. For both tests, the wave number is chosen such that the number of points per wavelength is 4, i.e., $2\pi n/\omega = 4$ or $C_p = 2/\pi$. We first perform compression using the tensor butterfly, Tucker ID and QTT algorithms, and then perform application/contraction using a random input tensor $F$. We also add results for the matrix butterfly algorithm using the corresponding matricizations of $K$ and $F$.
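For concreteness, the following is a minimal sketch (in Python with NumPy) of a single tensor entry of (4.1) for the parallel-plate test; the function name and the sampled indices are illustrative, not part of the benchmark code.

```python
import numpy as np

def green_entry(i, j, n, omega):
    """Entry K(i, j) of the 4-way tensor for two parallel unit plates,
    with i = (i1, i2), j = (j1, j2) and plate separation 1."""
    x = np.array([i[0] / n, i[1] / n, 0.0])
    y = np.array([j[0] / n, j[1] / n, 1.0])
    rho = np.linalg.norm(x - y)        # rho = |x_i - y_j|, bounded away from 0
    return np.exp(-1j * omega * rho) / rho

n = 64
omega = np.pi * n / 2                  # 4 points per wavelength: 2*pi*n/omega = 4
print(green_entry((3, 5), (7, 2), n, omega))
```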
Figure 4.1 (left) shows the factorization time, application time and memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-6}$ for the parallel-plate case. For QTT, we show the memory of the factorization (labeled "QTT(Factor)") and of the application (labeled "QTT(Apply)") separately. Note that although QTT factorization requires sub-linear memory usage, QTT contraction becomes super-linear due to the full QTT rank of the input tensor. Overall, we observe the expected complexities listed in Table 3.1 for the butterfly and Tucker ID algorithms. For QTT, however, instead of an $O(n)$ rank scaling we observe an $O(n^{3/4})$ rank scaling, leading to slightly better complexities than those in Table 3.1. We leave this as a future investigation. That said, the tensor butterfly algorithm achieves linear CPU and memory complexities for both factorization and application with a much smaller prefactor than all the other algorithms. Remarkably, the tensor butterfly algorithm achieves a 30× memory reduction and a 15× speedup, and handles 64× larger tensors than the matrix butterfly algorithm.
Figure 4.1 (right) shows the factorization time, application time and memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-2}$ for the cube case. Overall, we observe the expected complexities listed in Table 3.1 for all four algorithms. The tensor butterfly algorithm achieves linear CPU and memory complexities for both factorization and application with a much smaller prefactor than all the other algorithms. Remarkably, the tensor butterfly algorithm achieves a 30× memory reduction and a 200× speedup, and handles 512× larger tensors than the matrix butterfly algorithm. The largest data point, $n = 2048$, corresponds to 512 wavelengths per physical dimension. The results in Figure 4.1 suggest the superiority of the tensor butterfly algorithm for solving high-frequency wave equations in 3D volumes and on 3D surfaces.
Next, we demonstrate the effect of the compression tolerance $\epsilon$ for both test cases in Table 4.1. Here the error is measured by

$$
(4.2)\qquad \mathrm{error} = \frac{\|K \times_{d+1,d+2,\ldots,2d} F_e - K_{BF} \times_{d+1,d+2,\ldots,2d} F_e\|_F}{\|K \times_{d+1,d+2,\ldots,2d} F_e\|_F},
$$

where $K_{BF}$ is the tensor butterfly representation of $K$, and $F_e(j) = 1$ for a small set of random indices $j$ and $F_e(j) = 0$ elsewhere. This way, $K$ never needs to be fully formed to compute the error. Table 4.1 shows the minimum rank ($r_{min}$) and maximum rank ($r$), error, factorization time, application time and memory usage for varying $\epsilon$, for $n = 16384$, $d = 2$ and $n = 512$, $d = 3$. Overall, the errors are close to the prescribed tolerances and the costs increase for smaller $\epsilon$, as expected. We also note that keeping $r$ as low as possible is critical to maintaining the small prefactors of the tensor butterfly algorithm, particularly in higher dimensions.
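A minimal sketch (assuming NumPy) of the randomized error measure in (4.2): the last $d$ modes of $K$ are contracted with a sparse indicator tensor $F_e$, so only a few slices of $K$ are ever touched. The helper `apply_butterfly` stands in for the compressed operator and is hypothetical.

```python
import numpy as np

def sampled_error(K, apply_butterfly, d, num_samples=10, seed=0):
    """Relative error of (4.2), with F_e(j) = 1 on `num_samples` random
    multi-indices j of the last d modes of K and F_e(j) = 0 elsewhere."""
    rng = np.random.default_rng(seed)
    shape_in = K.shape[d:]                     # modes d+1, ..., 2d
    Fe = np.zeros(shape_in)
    for _ in range(num_samples):
        Fe[tuple(rng.integers(0, s) for s in shape_in)] = 1.0
    axes = (tuple(range(d, 2 * d)), tuple(range(d)))
    exact = np.tensordot(K, Fe, axes=axes)     # K x_{d+1,...,2d} F_e
    approx = apply_butterfly(Fe)               # K_BF x_{d+1,...,2d} F_e
    return np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```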
4.2. Radon transforms. In this subsection, we consider 2D and 3D discretized Radon transforms similar to those presented in [8]. Specifically, the tensor entry is

$$
(4.3)\qquad K(i, j) = \exp(2\pi i \phi(x_i, y_j))
$$

with $x_i = (i_1/n, i_2/n, \ldots, i_d/n)$ and $y_j = (j_1 - n/2, j_2 - n/2, \ldots, j_d - n/2)$. For $d = 2$, we consider

$$
(4.4)\qquad \phi(x, y) = x \cdot y + \sqrt{c_1^2 y_1^2 + c_2^2 y_2^2}, \qquad
c_1 = (2 + \sin(2\pi x_1)\sin(2\pi x_2))/16, \quad
c_2 = (2 + \cos(2\pi x_1)\cos(2\pi x_2))/16.
$$

For $d = 3$, we consider

$$
(4.5)\qquad \phi(x, y) = x \cdot y + c|y|, \qquad
c = (3 + \sin(2\pi x_1)\sin(2\pi x_2)\sin(2\pi x_3))/100.
$$
Fig. 4.1: Helmholtz equation: computational complexity comparison among the matrix butterfly, tensor butterfly, Tucker ID and QTT algorithms for compressing (left) a 4-way Green's function tensor for interactions between two parallel 2D plates and (right) a 6-way Green's function tensor for interactions between two 3D cubes. The geometries are discretized with 4 points per wavelength. (Top): factor time. (Middle): factor and apply memory. (Bottom): apply time. The largest data points correspond to 8192 wavelengths per direction for the 2D tests (left) and 512 wavelengths per direction for the 3D tests (right).
We first perform compression using the matrix butterfly, tensor butterfly, and QTT algorithms, and then perform application/contraction using a random input tensor $F$.
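The phase functions (4.4) and (4.5) are simple to evaluate pointwise; the following minimal Python sketch (assuming NumPy, with illustrative function names) spells them out together with the tensor entry (4.3).

```python
import numpy as np

def phi_2d(x, y):
    """Phase (4.4): x . y + sqrt(c1^2 y1^2 + c2^2 y2^2)."""
    c1 = (2 + np.sin(2 * np.pi * x[0]) * np.sin(2 * np.pi * x[1])) / 16
    c2 = (2 + np.cos(2 * np.pi * x[0]) * np.cos(2 * np.pi * x[1])) / 16
    return x @ y + np.sqrt(c1**2 * y[0] ** 2 + c2**2 * y[1] ** 2)

def phi_3d(x, y):
    """Phase (4.5): x . y + c |y|, with x and y length-3 arrays."""
    c = (3 + np.prod(np.sin(2 * np.pi * x))) / 100
    return x @ y + c * np.linalg.norm(y)

def radon_entry(x, y, phi):
    """Tensor entry (4.3): K(i, j) = exp(2*pi*i*phi(x_i, y_j))."""
    return np.exp(2j * np.pi * phi(x, y))

x = np.array([0.25, 0.5]); y = np.array([3.0, -2.0])
print(radon_entry(x, y, phi_2d))
```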
Figure 4.2 shows the factorization time, application time and memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-3}$ for the 2D transform (left) and the 3D transform (right). Overall, we observe the expected complexities listed in Table 3.1 for all three algorithms. The QTT algorithm can only produce the first 2 or 3 data points due to its high memory usage and large QTT ranks. In comparison, the tensor butterfly algorithm achieves linear CPU and memory complexities for both factorization and application with a much smaller prefactor than all the other algorithms. Note that the Radon transform kernels in (4.4) and (4.5) are not translation-invariant, yet the tensor butterfly algorithm still attains small ranks. As a result, the tensor butterfly algorithm can handle 64× larger Radon transforms than the matrix butterfly algorithm, demonstrating its superiority for solving linear inverse problems in tomography and seismic imaging.
$n^d$       $\epsilon$   $r_{min}$   $r$    error      $T_f$ (sec)   $T_a$ (sec)   Mem (MB)
$16384^2$   1E-02        5           8      1.49E-02   6.83E+01      1.16E+00      2.40E+04
$16384^2$   1E-03        6           10     2.19E-03   1.17E+02      1.89E+00      4.69E+04
$16384^2$   1E-04        7           11     1.84E-04   1.57E+02      2.80E+00      7.49E+04
$16384^2$   1E-05        8           12     3.46E-05   2.29E+02      4.03E+00      1.21E+05
$16384^2$   1E-06        9           13     9.26E-06   3.18E+02      5.92E+00      1.96E+05
$512^3$     1E-02        2           5      2.01E-02   1.18E+02      1.42E+00      1.19E+04
$512^3$     1E-03        2           6      1.18E-03   3.46E+02      4.08E+00      4.87E+04
$512^3$     1E-04        2           7      8.39E-05   6.26E+02      9.85E+00      1.49E+05
$512^3$     1E-05        3           8      9.21E-06   1.25E+03      2.40E+01      4.07E+05

Table 4.1: Technical data for a 4-way Green's function tensor with $n = 16384$ and a 6-way Green's function tensor with $n = 512$ for the Helmholtz equation, using the proposed tensor butterfly algorithm with varying compression tolerance $\epsilon$. The table shows the minimum rank $r_{min}$ and maximum rank $r$ across all ID operations, the relative error in (4.2), factor time $T_f$, apply time $T_a$, and memory usage Mem.

$n^d$      $\epsilon$   $r_{min}$   $r$    error      $T_f$ (sec)   $T_a$ (sec)   Mem (MB)
$2048^2$   1E-02        4           18     2.04E-02   9.32E+01      7.20E-01      1.25E+04
$2048^2$   1E-03        4           20     1.51E-03   1.61E+02      1.28E+00      2.40E+04
$2048^2$   1E-04        4           22     1.49E-04   2.55E+02      2.05E+00      4.26E+04
$2048^2$   1E-05        4           23     2.45E-05   3.73E+02      3.12E+00      6.95E+04
$128^3$    1E-02        2           6      4.31E-02   3.89E+01      8.57E-01      1.59E+04
$128^3$    1E-03        2           8      1.00E-02   1.31E+02      3.74E+00      9.44E+04
$128^3$    1E-04        2           9      1.68E-03   2.42E+02      8.28E+00      2.38E+05
$128^3$    1E-05        2           11     1.48E-04   4.30E+02      2.05E+01      6.06E+05

Table 4.2: Technical data for a 4-way Radon transform tensor with $n = 2048$ in (4.4) and a 6-way Radon transform tensor with $n = 128$ in (4.5), using the proposed tensor butterfly algorithm with varying compression tolerance $\epsilon$. The table shows the minimum rank $r_{min}$ and maximum rank $r$ across all ID operations, the relative error in (4.2), factor time $T_f$, apply time $T_a$, and memory usage Mem.

Next, we demonstrate the effect of the compression tolerance $\epsilon$ for both test cases in Table 4.2, with the error defined by (4.2). Table 4.2 shows the minimum and maximum ranks, error, factorization time, application time and memory usage for varying $\epsilon$, for $n = 2048$ with $d = 2$ and $n = 128$ with $d = 3$. Overall, the errors are close to the prescribed tolerances and the costs increase for smaller $\epsilon$, as expected. Just as in the Green's function example, keeping $r$ a small constant is critical, particularly in higher dimensions.

Fig. 4.2: Radon transforms: computational complexity comparison among the matrix butterfly, tensor butterfly and QTT algorithms for compressing (left) a 2D Radon transform tensor and (right) a 3D Radon transform tensor. (Top): factor time. (Middle): factor and apply memory. (Bottom): apply time.
4.3. High-dimensional discrete Fourier transform. Finally, we consider high-dimensional DFTs defined as

$$
(4.6)\qquad K(i, j) = \exp(2\pi i\, x_i \cdot y_j),
$$

where we choose $x_i = (i_1 - 1, i_2 - 1, \ldots, i_d - 1)$ and $y_j = (\frac{j_1 - 1}{n}, \frac{j_2 - 1}{n}, \ldots, \frac{j_d - 1}{n})$ for uniform DFTs, and we choose $x_i$ to be random (in the sense that each $x_{i_k} \in [0, n - 1]$ for $k \le d$ is a random number) and $y_j = (\frac{j_1 - 1}{n}, \frac{j_2 - 1}{n}, \ldots, \frac{j_d - 1}{n})$ for type-2 non-uniform DFTs. For high-dimensional DFTs with $d = 3, 4, 5, 6$, we perform compression using the tensor butterfly algorithm (with the bit-reversal ordering for each dimension), and perform application/contraction using a random input tensor $F$. For comparison, for $d = 3$ we perform FFTs via the heFFTe package for the uniform DFT example and NUFFTs via the FINUFFT package for the type-2 non-uniform DFT example.
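As a concrete illustration of the per-dimension bit-reversal ordering mentioned above, the following minimal Python sketch builds the bit-reversed index permutation for a single dimension, assuming $n$ is a power of two; the function name is illustrative.

```python
def bit_reversal_permutation(n):
    """Return the bit-reversed permutation of range(n), for n a power of 2."""
    bits = n.bit_length() - 1
    # Write each index with `bits` binary digits, reverse the digits, reread.
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]

print(bit_reversal_permutation(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```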
Figure 4.3 shows the factorization time of the butterfly algorithm (equivalently, the plan-creation time of heFFTe/FINUFFT), the application time, and the memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-3}$ (for butterfly and FINUFFT) for the uniform (left) and non-uniform (right) transforms. Overall, the tensor butterfly algorithm attains $O(n^d)$ CPU and memory complexities, compared with the $O(n^d \log n)$ complexities of FFT and NUFFT. It is also worth mentioning that QTT can attain logarithmic-complexity uniform DFTs [9] when the input tensor $F$ is also in QTT form with low TT ranks; for a general input tensor, however, the complexity of QTT falls back to $O(n^d \log n)$. Although the proposed tensor butterfly algorithm attains the best computational complexity among all existing algorithms, we observe that for the $d = 3$ case FFT and NUFFT show memory usage similar to the tensor butterfly algorithm but much smaller prefactors for plan creation and application time. That said, the tensor butterfly algorithm provides a unique capability to perform higher-dimensional DFTs (i.e., $d \ge 4$) with optimal asymptotic complexities.
Fig. 4.3: Fourier transforms: computational complexity of (left) the tensor butterfly algorithm and heFFTe for compressing the high-dimensional DFT tensor and (right) the tensor butterfly algorithm and FINUFFT for compressing the high-dimensional NUFFT tensor. (Top): factor time of the tensor butterfly algorithm and plan-creation time of heFFTe/FINUFFT. (Middle): factor memory. (Bottom): apply time.

5. Conclusion. We present a new tensor butterfly algorithm for efficiently compressing and applying large-scale and high-dimensional OIOs, such as Green's functions for wave equations and integral transforms, including Radon transforms and Fourier transforms. The tensor butterfly algorithm leverages an essential tensor CLR property to achieve both improved asymptotic computational complexities and smaller leading constants. For the contraction of high-dimensional OIOs with arbitrary input tensors, the tensor butterfly algorithm achieves optimal linear CPU and memory complexities, in stark contrast to existing matrix algorithms such as the matrix butterfly algorithm, fast transform algorithms such as FFT and NUFFT, and other tensor algorithms such as Tucker-like decompositions and QTT, all of which exhibit higher asymptotic complexities and larger leading constants. As a result, the tensor butterfly algorithm can efficiently model high-frequency 3D Green's function interactions with over 512× larger problem sizes than existing algorithms; for the largest tensor that existing algorithms can handle, it requires 200× less CPU time and 30× less memory. Moreover, it can perform linear-complexity Radon transforms and DFTs with up to $d = 6$ dimensions. These OIOs are frequently encountered in the solution of high-frequency wave equations, X-ray and MRI-based inverse problems, seismic imaging, and signal processing; we therefore expect the tensor butterfly algorithm developed here to be both theoretically attractive and practically useful for many applications.
The main limitation of the tensor butterfly algorithm is its requirement of a tensor grid; extending it to unstructured meshes is future work. In addition, the mid-level subtensors represent a memory bottleneck and need to be compressed with more efficient algorithms.

Acknowledgements. This research has been supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Mathematical Multifaceted Integrated Capability Centers (MMICCs) program, under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory. This work was also supported in part by the NSF Mathematical Sciences Graduate Internship (NSF MSGI) program. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award ASCR-ERCAP0024170. Qian's research was partially supported by NSF grants 2152011 and 2309534 and an MSU SPG grant.

REFERENCES

[1] Alan Ayala, Stanimire Tomov, Piotr Luszczek, Sébastien Cayrols, Gerald Ragghianti, and Jack Dongarra. Analysis of the communication and computation cost of FFT libraries towards exascale. Technical Report ICL-UT-22-07, https://icl.utk.edu/files/publications/2022…, 2022.
[2] Alexander H Barnett, Jeremy Magland, and Ludvig af Klinteberg. A parallel nonuniform fast Fourier transform library based on an "exponential of semicircle" kernel. SIAM Journal on Scientific Computing, 41(5):C479–C504, 2019.
[3] David Joseph Biagioni. Numerical construction of Green’s functions in high dimensional elliptic
problems with variable coefficients and analysis of renewable energy data via sparse and
separable approximations. PhD thesis, University of Colorado at Boulder, 2012.
[4] James Bremer, Ze Chen, and Haizhao Yang. Rapid Application of the Spherical Har-
monic Transform via Interpolative Decomposition Butterfly Factorization. arXiv preprint
arXiv:2004.11346, 2020.
[5] Ovidio M Bucci and Giorgio Franceschetti. On the degrees of freedom of scattered fields. IEEE
transactions on Antennas and Propagation, 37(7):918–926, 1989.
[6] HanQin Cai, Keaton Hamm, Longxiu Huang, and Deanna Needell. Mode-wise tensor decom-
positions: Multi-dimensional generalizations of cur decompositions. Journal of machine
learning research, 22(185):1–36, 2021.
[7] Emmanuel Candes, Laurent Demanet, and Lexing Ying. Fast computation of Fourier integral operators. SIAM Journal on Scientific Computing, 29(6):2464–2493, 2007.
[8] Emmanuel Candès, Laurent Demanet, and Lexing Ying. A fast butterfly algorithm for the
computation of Fourier integral operators. Multiscale Model. Sim., 7(4):1727–1750, 2009.
[9] Jielun Chen and Michael Lindsey. Direct interpolative construction of the discrete Fourier transform as a matrix product operator. arXiv preprint arXiv:2404.03182, 2024.
[10] Ze Chen, Juan Zhang, Kenneth L Ho, and Haizhao Yang. Multidimensional phase recovery
and interpolative decomposition butterfly factorization. Journal of Computational Physics,
412:109427, 2020.
[11] Jian Cheng, Dinggang Shen, Peter J Basser, and Pew-Thian Yap. Joint 6D kq space compressed
sensing for accelerated high angular resolution diffusion MRI. In Information Processing
in Medical Imaging: 24th International Conference, IPMI 2015, Sabhal Mor Ostaig, Isle
of Skye, UK, June 28-July 3, 2015, Proceedings, pages 782–793. Springer, 2015.
[12] Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, Danilo P Mandic,
et al. Tensor networks for dimensionality reduction and large-scale optimization: Part 1
low-rank tensor decompositions. Foundations and Trends® in Machine Learning, 9(4-
5):249–429, 2016.
[13] Lisa Claus, Pieter Ghysels, Yang Liu, Thái Anh Nhan, Ramakrishnan Thirumalaisamy, Amneet
Pal Singh Bhalla, and Sherry Li. Sparse approximate multifrontal factorization with com-
posite compression methods. ACM Transactions on Mathematical Software, 49(3):1–28,
2023.
[14] Eduardo Corona, Abtin Rahimian, and Denis Zorin. A tensor-train accelerated solver for
integral equations in complex geometries. Journal of Computational Physics, 334:145–169,
2017.
[15] Maurice A De Gosson. The Wigner Transform. World Scientific Publishing Company, 2017.
[16] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value
decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
[17] Stanley R Deans. The Radon transform and some of its applications. Courier Corporation,
2007.
[18] Gian Luca Delzanno. Multi-dimensional, fully-implicit, spectral method for the vlasov–maxwell
equations with exact conservation laws in discrete form. Journal of Computational Physics,
301:338–356, 2015.
[19] Sergey Dolgov and Dmitry Savostyanov. Parallel cross interpolation for high-precision cal-
culation of high-dimensional integrals. Computer Physics Communications, 246:106869,
2020.
[20] Björn Engquist and Lexing Ying. Fast directional multilevel algorithms for oscillatory kernels.
SIAM Journal on Scientific Computing, 29(4):1710–1737, 2007.
[21] Alexander L Fetter and John Dirk Walecka. Quantum theory of many-particle systems. Courier
Corporation, 2012.
[22] Ilias I Giannakopoulos, Mikhail S Litsarev, and Athanasios G Polimeridis. Memory footprint
reduction for the fft-based volume integral equation method via tensor decompositions.
IEEE Transactions on Antennas and Propagation, 67(12):7476–7486, 2019.
[23] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low-rank tensor
approximation techniques. GAMM-Mitteilungen, 36(1):53–78, 2013.
[24] Han Guo, Jun Hu, and Eric Michielssen. On MLMDA/butterfly compressibility of inverse
integral operators. IEEE Antennas Wirel. Propag. Lett., 12:31–34, 2013.
[25] Han Guo, Yang Liu, Jun Hu, and Eric Michielssen. A butterfly-based direct integral-equation
solver using hierarchical LU factorization for analyzing scattering from electrically large
conducting objects. IEEE Trans. Antennas Propag., 65(9):4742–4750, 2017.
[26] Han Guo, Yang Liu, Jun Hu, and Eric Michielssen. A butterfly-based direct solver using hier-
archical LU factorization for Poggio-Miller-Chang-Harrington-Wu-Tsai equations. Microw
Opt Technol Lett., 60:1381–1387, 2018.
[27] Wolfgang Hackbusch and Boris N Khoromskij. Tensor-product approximation to multidimensional integral operators and Green's functions. SIAM Journal on Matrix Analysis and Applications, 30(3):1233–1253, 2008.
[28] Wolfgang Hackbusch and Stefan Kühn. A new scheme for the tensor representation. Journal
of Fourier analysis and applications, 15(5):706–722, 2009.
[29] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with random-
ness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM
Review, 53(2):217–288, January 2011.
[30] Richard A Harshman et al. Foundations of the parafac procedure: Models and conditions for an
“explanatory” multi-modal factor analysis. UCLA working papers in phonetics, 16(1):84,
1970.
[31] Rui Hong, Ya-Xuan Xiao, Jie Hu, An-Chun Ji, and Shi-Ju Ran. Functional tensor network
solving many-body Schrödinger equation. Phys. Rev. B, 105:165116, Apr 2022.
[32] L Hörmander. Fourier integral operators. i. In Mathematics Past and Present Fourier Integral
Operators, pages 23–127. Springer, 1994.
[33] Yuehaw Khoo and Lexing Ying. Switchnet: a neural network model for forward and inverse
scattering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
[34] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review,
51(3):455–500, 2009.
[35] Matthew Li, Laurent Demanet, and Leonardo Zepeda-Núñez. Wide-band butterfly network:
stable and efficient inversion via multi-frequency neural networks. Multiscale Modeling &
Simulation, 20(4):1191–1227, 2022.

[36] Yingzhou Li and Haizhao Yang. Interpolative butterfly factorization. SIAM J. Sci. Comput.,
39(2):A503–A531, 2017.
[37] Yingzhou Li, Haizhao Yang, Eileen R Martin, Kenneth L Ho, and Lexing Ying. Butterfly
factorization. Multiscale Model. Sim., 13(2):714–732, 2015.
[38] Yingzhou Li, Haizhao Yang, and Lexing Ying. Multidimensional butterfly factorization. Applied
and Computational Harmonic Analysis, 44(3):737–758, 2018.
[39] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms
for the low-rank approximation of matrices. Proc. Natl. Acad. Sci. USA, 104:20167–20172,
2007.
[40] Peter Lindstrom. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visual-
ization and Computer Graphics, 20(12):2674–2683, 2014.
[41] Yang Liu. A comparative study of butterfly-enhanced direct integral and differential equation
solvers for high-frequency electromagnetic analysis involving inhomogeneous dielectrics. In
2022 3rd URSI Atlantic and Asia Pacific Radio Science Meeting (AT-AP-RASC), pages
1–4. IEEE, 2022.
[42] Yang Liu, Pieter Ghysels, Lisa Claus, and Xiaoye Sherry Li. Sparse approximate multifrontal
factorization with butterfly compression for high-frequency wave equations. SIAM Journal
on Scientific Computing, 0(0):S367–S391, 2021.
[43] Yang Liu, Han Guo, and Eric Michielssen. An HSS matrix-inspired butterfly-based direct solver
for analyzing scattering from two-dimensional objects. IEEE Antennas Wirel. Propag.
Lett., 16:1179–1183, 2017.
[44] Yang Liu, Tianhuan Luo, Aman Rani, Hengrui Luo, and Xiaoye Sherry Li. Detecting reso-
nance of radio-frequency cavities using fast direct integral equation solvers and augmented
bayesian optimization. IEEE Journal on Multiscale and Multiphysics Computational Tech-
niques, 2023.
[45] Yang Liu, Jian Song, Robert Burridge, and Jianliang Qian. A fast butterfly-compressed
Hadamard-Babich integrator for high-frequency Helmholtz equations in inhomogeneous
media with arbitrary sources. Multiscale Modeling & Simulation, 21(1):269–308, 2023.
[46] Yang Liu, Xin Xing, Han Guo, Eric Michielssen, Pieter Ghysels, and Xiaoye Sherry Li. Butterfly
factorization via randomized matrix-vector multiplications. SIAM Journal on Scientific
Computing, 43(2):A883–A907, 2021.
[47] Yang Liu and Haizhao Yang. A hierarchical butterfly LU preconditioner for two-dimensional
electromagnetic scattering problems involving open surfaces. J. Comput. Phys.,
401:109014, 2020.
[48] W. Lu, J. Qian, and R. Burridge. Babich’s expansion and the fast Huygens sweeping method
for the Helmholtz wave equation at high frequencies. J. Comput. Phys., 313:478–510, 2016.
[49] Axel Maas. Two and three-point Green’s functions in two-dimensional Landau-gauge Yang-
Mills theory. Phys. Rev. D, 75:116004, 2007.
[50] Michael W Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-cur decompositions for
tensor-based data. In Proceedings of the 12th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 327–336, 2006.
[51] Osman Asif Malik and Stephen Becker. Fast randomized matrix and tensor interpolative de-
composition using countsketch. Advances in Computational Mathematics, 46(6):76, 2020.
[52] Eric Michielssen and Amir Boag. Multilevel evaluation of electromagnetic fields for the rapid
solution of scattering problems. Microw Opt Technol Lett., 7(17):790–795, 1994.
[53] Eric Michielssen and Amir Boag. A multilevel matrix decomposition algorithm for analyzing
scattering from large structures. IEEE Trans. Antennas Propag., 44(8):1086–1093, 1996.
[54] Michael O’Neil, Franco Woolfe, and Vladimir Rokhlin. An algorithm for the rapid evaluation
of special function transforms. Appl. Comput. Harmon. A., 28(2):203 – 226, 2010. Special
Issue on Continuous Wavelet Transform in Memory of Jean Morlet, Part I.
[55] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing,
33(5):2295–2317, 2011.
[56] Qiyuan Pang, Kenneth L. Ho, and Haizhao Yang. Interpolative decomposition butterfly fac-
torization. SIAM J. Sci. Comput., 42(2):A1097–A1115, 2020.
[57] Michael E Peskin. An introduction to quantum field theory. CRC press, 2018.
[58] Arvind K Saibaba. Hoid: higher order interpolatory decomposition for tensors based on tucker
representation. SIAM Journal on Matrix Analysis and Applications, 37(3):1223–1249,
2016.
[59] Sadeed Bin Sayed, Yang Liu, Luis J. Gomez, and Abdulkadir C. Yucel. A butterfly-accelerated
volume integral equation solver for broad permittivity and large-scale electromagnetic
analysis. IEEE Transactions on Antennas and Propagation, 70(5):3549–3559, 2022.
[60] Weitian Sheng, Abdulkadir C Yucel, Yang Liu, Han Guo, and Eric Michielssen. A domain

decomposition based surface integral equation simulator for characterizing EM wave prop-
agation in mine environments. IEEE Transactions on Antennas and Propagation, 2023.
[61] Tianyi Shi, Daniel Hayes, and Jing-Mei Qiu. Distributed memory parallel adaptive tensor-train
cross approximation. arXiv preprint arXiv:2407.11290, 2024.
[62] Edgar Solomonik, Devin Matthews, Jeff Hammond, and James Demmel. Cyclops tensor frame-
work: Reducing communication and eliminating load imbalance in massively parallel con-
tractions. In 2013 IEEE 27th International Symposium on Parallel and Distributed Pro-
cessing, pages 813–824. IEEE, 2013.
[63] Mark Tygert. Fast algorithms for spherical harmonic expansions, III. J. Comput. Phys.,
229(18):6181 – 6192, 2010.
[64] Mingyu Wang, Cheng Qian, Enrico Di Lorenzo, Luis J Gomez, Vladimir Okhmatovski, and
Abdulkadir C Yucel. Supervoxhenry: Tucker-enhanced and fft-accelerated inductance ex-
traction for voxelized superconducting structures. IEEE Transactions on Applied Super-
conductivity, 31(7):1–11, 2021.
[65] Mingyu Wang, Cheng Qian, Jacob K White, and Abdulkadir C Yucel. Voxcap: Fft-accelerated
and tucker-enhanced capacitance extraction simulator for voxelized structures. IEEE
Transactions on Microwave Theory and Techniques, 68(12):5154–5168, 2020.
[66] E. Wigner. On the quantum correction for thermodynamic equilibrium. Phys. Rev., 40:749–759,
Jun 1932.
[67] Haizhao Yang. A unified framework for oscillatory integral transforms: When to use NUFFT
or butterfly factorization? J. Comput. Phys., 388:103 – 122, 2019.
[68] Lexing Ying. Sparse Fourier Transform via Butterfly Algorithm. SIAM J. Sci. Comput.,
31(3):1678–1694, 2009.
