A Linear-Complexity Tensor Butterfly Algorithm For Compressing High-Dimensional Oscillatory Integral Operators
YANG LIU†
Abstract. This paper presents a multilevel tensor compression algorithm, called the tensor butterfly algorithm, for efficiently representing large-scale and high-dimensional oscillatory integral operators, including Green's functions for wave equations and integral transforms such as Radon transforms and Fourier transforms. The proposed algorithm leverages a tensor extension of the so-called complementary low-rank property of existing matrix butterfly algorithms. The algorithm partitions the discretized integral operator tensor into subtensors of multiple levels, and factorizes each subtensor at the middle level as a Tucker-like interpolative decomposition, whose factor matrices are formed in a multilevel fashion. For a d-dimensional integral operator discretized into a 2d-mode tensor with $n^{2d}$ entries, the overall CPU time and memory requirement scale as $O(n^d)$, in stark contrast to the $O(n^d \log n)$ requirement of existing matrix algorithms such as the matrix butterfly algorithm and fast Fourier transforms (FFT), where n is the number of points per direction. Compared with other tensor algorithms such as quantized tensor train (QTT), the proposed algorithm also shows superior CPU and memory performance for tensor contraction. Remarkably, the tensor butterfly algorithm can efficiently model high-frequency Green's function interactions between two unit cubes, each spanning 512 wavelengths per direction, which represents over 512× larger problem sizes than existing algorithms can handle. On the other hand, for a problem spanning 64 wavelengths per direction, the largest size existing algorithms can handle, our tensor butterfly algorithm exhibits 200× speedups and 30× memory reduction compared with existing algorithms. Moreover, the tensor butterfly algorithm also permits $O(n^d)$-complexity FFTs and Radon transforms up to d = 6 dimensions.
Key words. butterfly algorithm, tensor algorithm, Tucker decomposition, interpolative decomposition, quantized tensor train (QTT), fast Fourier transforms (FFT), fast algorithm, high-frequency wave equations, integral transforms, Radon transform, low-rank compression, Fourier integral operator, non-uniform FFT (NUFFT)
Email: [email protected]
† Applied Mathematics and Computational Research Division, Lawrence Berkeley National Lab-
with an integral kernel K(x, y). The indexing of a matrix K is denoted by K(i, j) or K(t, s), where i, j are indices and t, s are index sets. We use $K^T$ to denote the
transpose of matrix K. For a sequence of matrices $K_1, \dots, K_n$, the matrix product is
$$(1.2)\quad \prod_{i=1}^{n} K_i = K_1 K_2 \cdots K_n,$$
the vertical stacking (assuming the same column dimension) is $[K_i]_i = [K_1; K_2; \dots; K_n]$, and $\mathrm{diag}_i(K_i)$ is a block diagonal matrix with $K_i$ being the diagonal blocks. Given an L-level binary-tree partitioning $T_t$ of an index set $t = \{1, 2, \dots, n\}$, any node τ at each level is a subset of t. The parent and children of τ are denoted by $p_\tau$ and $\tau^c$ (c = 1, 2), respectively, and $\tau = \tau^1 \cup \tau^2$.
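As a point of reference, here is a minimal Python sketch (an illustration, not the paper's code) of such an L-level binary-tree partitioning over contiguous indices:

```python
import numpy as np

def tree_levels(n, L):
    """L-level binary-tree partitioning T_t of t = {0, ..., n-1}.
    Level l holds 2^l nodes; node tau at position q of level l has
    parent levels[l-1][q // 2] and children levels[l+1][2q], levels[l+1][2q+1]."""
    idx = np.arange(n)
    return [np.array_split(idx, 2 ** l) for l in range(L + 1)]

levels = tree_levels(16, 3)
tau = levels[2][1]                         # a node at level 2
tau1, tau2 = levels[3][2], levels[3][3]    # its children: tau = tau1 U tau2
assert np.array_equal(tau, np.concatenate([tau1, tau2]))
```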
A multi-index $i = (i_1, \dots, i_d)$ is a tuple of indices, and similarly a multi-set $\tau = (\tau_1, \tau_2, \dots, \tau_d)$ is a tuple of index sets. We define
$$(2.6)\quad K(\tau,\nu) \approx \begin{bmatrix} K(\tau,\bar\nu^1) & K(\tau,\bar\nu^2) \end{bmatrix} \begin{bmatrix} V^s_{p_\tau,\nu^1} & \\ & V^s_{p_\tau,\nu^2} \end{bmatrix}$$
$$(2.7)\quad \approx K(\tau,\bar\nu)\, W^s_{\tau,\nu} \begin{bmatrix} V^s_{p_\tau,\nu^1} & \\ & V^s_{p_\tau,\nu^2} \end{bmatrix}.$$
Here $W^s_{\tau,\nu}$ and $\bar\nu$ are the interpolation matrix and skeleton columns from the ID of $K(\tau, \bar\nu^1 \cup \bar\nu^2)$, respectively. $W^s_{\tau,\nu}$ is henceforth referred to as the transfer matrix for ν in the rest of this paper. Note that we have added an additional superscript s to $V^s_{p_\tau,\nu^c}$ and $W^s_{\tau,\nu}$ for notational convenience in the later context. From (2.6), it is clear that the interpolation matrix $V^s_{\tau,\nu}$ can be expressed in terms of its parent $p_\tau$'s and children $\nu^1, \nu^2$'s interpolation matrices as
$$(2.8)\quad V^s_{\tau,\nu} = W^s_{\tau,\nu} \begin{bmatrix} V^s_{p_\tau,\nu^1} & \\ & V^s_{p_\tau,\nu^2} \end{bmatrix}.$$
Note that the interpolation matrices $V^s_{\tau,\nu}$ at level l = 0 and transfer matrices $W^s_{\tau,\nu}$ at level $0 < l \le L^c$ do not require the column ID on the full subblocks $K(\tau, \nu)$ and $K(\tau, \bar\nu^1 \cup \bar\nu^2)$, which would lead to at least an O(mn) computational complexity. In practice, one can select $O(r_{\tau,\nu})$ proxy rows $\hat\tau \subset \tau$ to compute $V^s_{\tau,\nu}$ and $W^s_{\tau,\nu}$ via ID as:
$$(2.9)\quad K(\hat\tau, \nu) \approx K(\hat\tau, \bar\nu)\,V^s_{\tau,\nu}, \quad l = 0,$$
$$(2.10)\quad K(\hat\tau, \bar\nu^1 \cup \bar\nu^2) \approx K(\hat\tau, \bar\nu)\,W^s_{\tau,\nu}, \quad 0 < l \le L^c.$$
The viable choices for proxy rows have been discussed in several existing papers [45,
56, 59, 8].
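For concreteness, the following Python sketch (an illustration under the stated conventions, not the paper's implementation) computes a column ID via rank-revealing pivoted QR; applied to a proxy block $K(\hat\tau, \nu)$, it returns the skeleton columns $\bar\nu$ and the interpolation matrix:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def column_id(A, tol=1e-6):
    """Column ID: A ~= A[:, skel] @ V, with skeleton columns `skel` and an
    r x n interpolation matrix V; r is the numerical rank at tolerance tol."""
    Q, R, perm = qr(A, mode='economic', pivoting=True)
    diag = np.abs(np.diag(R))
    r = max(1, int(np.sum(diag > tol * diag[0])))
    # A[:, perm] ~= A[:, perm[:r]] @ [I | T], where T = R11^{-1} R12
    T = solve_triangular(R[:r, :r], R[:r, r:])
    V = np.zeros((r, A.shape[1]), dtype=A.dtype)
    V[:, perm[:r]] = np.eye(r, dtype=A.dtype)
    V[:, perm[r:]] = T
    return perm[:r], V
```

In the butterfly setting A is never the full block: it is the $O(r) \times O(r)$ proxy block $K(\hat\tau, \nu)$ or $K(\hat\tau, \bar\nu^1 \cup \bar\nu^2)$, which is what keeps each individual ID at $O(r^3)$ cost.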
At levels $l = L^c, \dots, L$, the interpolation matrices $U^t_{\tau,\nu}$ are computed by performing similar operations on $K^T$. We only provide their expressions here and omit the redundant explanation. Let t be the ancestor of τ at level $L^c$ of $T_{t_0}$ and let $T_t$ be the subtree rooted at t. At level l = L, $U^t_{\tau,\nu}$ are explicitly formed. At level $L^c \le l < L$, only the transfer matrices $P^t_{\tau,\nu}$ are computed from the column ID of $K^T(\bar\nu, \bar\tau^1 \cup \bar\tau^2)$ satisfying
$$(2.11)\quad U^t_{\tau,\nu} = P^t_{\tau,\nu} \begin{bmatrix} U^t_{\tau^1,p_\nu} & \\ & U^t_{\tau^2,p_\nu} \end{bmatrix}.$$
Combining (2.5), (2.8) and (2.11), the matrix butterfly decomposition can be expressed for each node pair (t, s) at level $L^c$ of $T_{t_0}$ and $T_{s_0}$ as
$$(2.12)\quad K(t, s) \approx U^t \prod_{l=1}^{L^c} P_l^{t,s}\; K(\bar t, \bar s) \prod_{l=L^c}^{1} W_l^{t,s}\; V^s.$$
Here, $\bar t$ and $\bar s$ represent the skeleton rows and columns of the ID of K(t, s). The interpolation factors $U^t$ and $V^s$ in (2.12) are
$$(2.13)\quad U^t = \mathrm{diag}_\tau(U^t_{\tau,s_0}), \quad \tau \text{ at level } L^c \text{ of } T_t,$$
$$(2.14)\quad V^s = \mathrm{diag}_\nu(V^s_{t_0,\nu}), \quad \nu \text{ at level } L^c \text{ of } T_s,$$
and the transfer factors $P_l^{t,s}$ and $W_l^{t,s}$ for $l = 1, \dots, L^c$ consist of transfer matrices $W^s_{\tau,\nu}$ and $P^t_{\tau,\nu}$:
$$(2.15)\quad W_l^{t,s} = \mathrm{diag}_\tau\big[\mathrm{diag}_\nu(W^s_{\tau^c,\nu})\big]_c, \quad \tau \text{ at level } l-1 \text{ of } T_{t_0} \text{ and } t \subseteq \tau;\ \nu \text{ at level } L^c - l \text{ of } T_s;$$
$$(2.16)\quad (P_l^{t,s})^T = \mathrm{diag}_\nu\big[\mathrm{diag}_\tau(P^t_{\tau,\nu^c})^T\big]_c, \quad \tau \text{ at level } L^c - l \text{ of } T_t;\ \nu \text{ at level } l-1 \text{ of } T_{s_0} \text{ and } s \subseteq \nu.$$
Here $\tau^c$ and $\nu^c$ with c = 1, 2 are children of τ and ν, respectively.
The CPU and memory requirements for computing the matrix butterfly decomposition can be briefly analyzed as follows. Note that we only need to analyze the costs for $V^s_{\tau,\nu}$, $W^s_{\tau,\nu}$ and $K(\bar t, \bar s)$, as those for $U^t_{\tau,\nu}$ and $P^t_{\tau,\nu}$ are similar. By the CLR assumption, we assume that $r_{\tau,\nu} \le r$, $\forall \tau, \nu$ for some constant r. Thanks to the use of the proxy rows and columns, the computation of one individual $V^s_{\tau,\nu}$ or $W^s_{\tau,\nu}$ by ID only operates on $O(r) \times O(r)$ matrices, hence its memory and CPU requirements are $O(r^2)$ and $O(r^3)$, respectively. In total, there are $O(2^{L^c})$ middle-level nodes s, each having $O(2^{L^c})$ $V^s_{\tau,\nu}$'s and $O(L^c 2^{L^c})$ $W^s_{\tau,\nu}$'s. Similarly, each $K(\bar t, \bar s)$ requires $O(r^2)$ CPU and memory costs, and there are in total $O(2^{2L^c})$ middle-level node pairs (t, s). These numbers sum up to the overall $O(nr^2 \log n)$ memory and $O(nr^3 \log n)$ CPU complexities of matrix butterfly algorithms.
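In compact form (a tally consistent with the counts above, using $2^{L^c} = O(\sqrt{n})$ and $L^c = O(\log n)$):
$$\mathrm{mem} = \underbrace{O(2^{L^c})\cdot O(L^c 2^{L^c})\cdot O(r^2)}_{\text{all } V^s_{\tau,\nu},\, W^s_{\tau,\nu} \text{ (and } U^t_{\tau,\nu},\, P^t_{\tau,\nu})} + \underbrace{O(2^{2L^c})\cdot O(r^2)}_{\text{all } K(\bar t,\bar s)} = O(nr^2\log n),$$
and replacing the per-ID $O(r^2)$ storage with the per-ID $O(r^3)$ CPU cost gives $O(nr^3\log n)$.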
For d-dimensional discretized OIOs $K \in \mathbb{C}^{(m_1 m_2 \cdots m_d) \times (n_1 n_2 \cdots n_d)}$ with $m_k = n_k = n$, we can assume that $n = C_b 2^L$ with some constant $C_b$. For the above-described binary-tree-based butterfly algorithm, the leaf nodes of the trees are of size $C_b^d$, and this leads to a dL-level butterfly factorization. The memory and CPU complexities for this algorithm become $O(dn^d r^2 \log n)$ and $O(dn^d r^3 \log n)$, respectively. On the other hand, the multi-dimensional tree-based butterfly algorithm [38, 10] leads to an L-level factorization with $O(2^d n^d r^2 \log n)$ memory and $O(2^d n^d r^3 \log n)$ CPU complexities. Despite the quasi-linear complexity of these binary and multi-dimensional tree-based algorithms for high-dimensional OIOs, the butterfly rank r, while constant, is large, leading to very large prefactors. In the following, we turn to tensor decomposition algorithms to reduce both the prefactor and the asymptotic scaling of matrix butterfly algorithms.
3. Proposed Tensor Algorithms. In this section, we assume that the d-dimensional discretized OIO in section 2 is directly represented as a 2d-mode tensor $K \in \mathbb{C}^{m_1 \times m_2 \times \cdots \times m_d \times n_1 \times n_2 \times \cdots \times n_d}$. We first extend the matrix ID algorithm in subsection 2.1 to its tensor variant, which serves as the building block for the proposed tensor butterfly algorithm.
3.1. Tucker-like Interpolative Decomposition. Given the 2d-mode tensor K(τ, ν) with $\tau_k = \{1, 2, \dots, m_k\}$ and $\nu_k = \{1, 2, \dots, n_k\}$ for $k = 1, \dots, d$, the proposed tensor ID decomposition compresses each dimension independently via the column ID of the unfolding of K along the k-th dimension,
$$(3.1)\quad K_{(k)} = K_{(k)}(:, \bar\tau_k)\,U_k, \qquad K_{(d+k)} = K_{(d+k)}(:, \bar\nu_k)\,V_k, \qquad k = 1, \dots, d,$$
where $K_{(k)} \in \mathbb{C}^{(\prod_{j \ne k} n_j) \times n_k}$ is the mode-k unfolding, or equivalently
$$(3.2)\quad K = K(\tau_{k\leftarrow\bar\tau_k}, \nu) \times_k U_k, \qquad K = K(\tau, \nu_{k\leftarrow\bar\nu_k}) \times_{d+k} V_k, \qquad k = 1, \dots, d.$$
Here, $\bar\tau_k$ and $\bar\nu_k$ denote the skeleton indices along modes k and d + k of K, respectively, while $\tau_{k\leftarrow\bar\tau_k}$ and $\nu_{k\leftarrow\bar\nu_k}$ denote multi-sets that replace $\tau_k$ and $\nu_k$, respectively, with
Fig. 3.1: Tensor diagrams for (a) the Tucker-ID decomposition of a 4-mode tensor, and (b) the matrix partitioner corresponding to a $2^d \times 2$ partitioning with d = 2 used in the tensor butterfly decomposition of a 2d-mode tensor, such as $[W^{s,k}_{\tau^c,\nu}]_c$ in (3.12) for fixed s, τ, k and ν, or $[P^{t,k}_{\tau,\nu^c}]_c$ in (3.11) for fixed t, ν, k and τ. (c) The tensor diagram involving blocks $V^{s,k}_{t_0,\nu}$ (in green) and blocks $W^{s,k}_{\tau^c,\nu}$ (in blue) for fixed s and k for the tensor butterfly decomposition of a 2d-mode tensor.
$\bar\tau_k$ and $\bar\nu_k$. Combining (3.2) for all dimensions yields the following proposed tensor interpolative decomposition,
$$(3.3)\quad K = K(\bar\tau, \bar\nu) \prod_{k=1}^{d} \times_k U_k \prod_{k=1}^{d} \times_{d+k} V_k.$$
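As an illustration of (3.1)-(3.3), here is a minimal Python sketch of the Tucker-like ID (assuming the `column_id` helper from the earlier sketch; the unfolding follows the paper's mode-as-columns convention):

```python
import numpy as np

def unfold(T, k):
    """Unfolding with mode k as columns, matching K_(k) in (3.1)."""
    return np.moveaxis(T, k, -1).reshape(-1, T.shape[k])

def tucker_id(K, tol=1e-6):
    """Tucker-like ID of a 2d-mode tensor: one column ID per mode (3.1)-(3.2).
    Returns the skeleton core K(tau_bar, nu_bar) and one factor per mode."""
    skels, factors = [], []
    for k in range(K.ndim):
        skel, F = column_id(unfold(K, k), tol)
        skels.append(skel)
        factors.append(F)
    return K[np.ix_(*skels)], factors

def tucker_apply(core, factors):
    """Evaluate core x_1 F_1 ... x_{2d} F_{2d} as in (3.3); each F is r_k x n_k."""
    T = core
    for k, F in enumerate(factors):
        T = np.moveaxis(np.tensordot(T, F, axes=(k, 0)), -1, k)
    return T

# The relative error norm(tucker_apply(core, factors) - K) / norm(K)
# checks the quality of the approximation.
```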
In essence, the tensor CLR in (3.4) and (3.5) investigates the unfolding of judiciously selected subtensors rather than the matricization used in the matrix CLR. Moreover, the tensor CLR requires fixing d − 1 modes of the 2d-mode subtensors to be of size $O(\sqrt{n})$ while changing the remaining d + 1 modes with respect to l. Therefore each ID computation can operate on larger subtensors compared to the matrix CLR. In subsection 3.2.1 we provide two examples, namely a free-space Green's function tensor and a high-dimensional Fourier transform, to explain why the tensor CLR is valid, and in subsection 3.2.2 we will see that the tensor CLR essentially reduces the quasi-linear complexity of the matrix butterfly algorithm to linear complexity. Here, assuming that the tensor CLR holds true, we describe the tensor butterfly algorithm. We note that there may be alternative ways to define the tensor CLR different from (3.4) and (3.5), and we leave that as future work.
In what follows, we focus on the computation of $V^{s,k}_{\tau,\nu}$ (corresponding to the mid-level multi-set s), as $U^{t,k}_{\tau,\nu}$ (corresponding to the mid-level multi-set t) can be computed in a similar fashion. At level l = 0, $V^{s,k}_{\tau,\nu}$ are explicitly formed. At level $0 < l \le L^c$, they are represented in a nested fashion. Let $p_\tau = (p_{\tau_1}, p_{\tau_2}, \dots, p_{\tau_d})$ consist of parents of $\tau = (\tau_1, \tau_2, \dots, \tau_d)$ in (3.4). By the tensor CLR property, we have
$$K(\tau, s_{k\leftarrow\nu}) \approx K(\tau, s_{k\leftarrow\bar\nu^1\cup\bar\nu^2}) \times_{d+k} \begin{bmatrix} V^{s,k}_{p_\tau,\nu^1} & \\ & V^{s,k}_{p_\tau,\nu^2} \end{bmatrix}$$
Fig. 3.2: (a) Tensor diagram for the tensor butterfly decomposition of L = 2 levels
of a 4-mode OIO tensor representing (b) high-frequency Green’s function interactions
between parallel facing 2D unit squares. Only the full connectivity regarding three
middle-level node pairs is shown (the two green circles and one orange circle in (a)).
The orange circle in (a) represents the core tensor $K(\bar t, \bar s)$ for a mid-level pair (t, s)
with t = (t1 , t2 ), s = (s1 , s2 ) highlighted in orange in (b).
" #!
Vps,k 1
(3.6) ≈ K(τ , sk←ν ) ×d+k Wτs,k τ ,ν
.
,ν
Vps,k
τ ,ν
2
Comparing (3.6) and (3.4), one realizes that the interpolation matrix $V^{s,k}_{\tau,\nu}$ is represented as the product of the transfer matrix $W^{s,k}_{\tau,\nu}$ and $\mathrm{diag}_c(V^{s,k}_{p_\tau,\nu^c})$. Here, the transfer matrix $W^{s,k}_{\tau,\nu}$ is computed as the interpolation matrix of the column ID of the mode-(d + k) unfolding of $K(\tau, s_{k\leftarrow\bar\nu^1\cup\bar\nu^2})$. As mentioned in section 3, in practice one never forms the unfolding matrix in full, but instead considers the unfolding of $K(\hat\tau, \hat{s}_{k\leftarrow\bar\nu^1\cup\bar\nu^2})$, where $\hat\tau = (\hat\tau_1, \hat\tau_2, \dots, \hat\tau_d)$ and $\hat{s} = (\hat s_1, \hat s_2, \dots, \hat s_d)$; here $\hat\tau_i$ and $\hat s_i$ consist of O(r) judiciously selected indices along modes i and d + i, respectively. Note that $\hat s_k$ is never used as it is replaced by $\bar\nu^1 \cup \bar\nu^2$ in (3.6). The same proxy index strategy can be used to obtain $V^{s,k}_{\tau,\nu}$ at the level l = 0. For each $W^{s,k}_{\tau,\nu}$ or $V^{s,k}_{\tau,\nu}$, its computation requires $O(r^{2d+1})$ CPU time.
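A sketch of this proxy-based computation of one transfer matrix follows (the `kernel` callback and index arguments are illustrative assumptions; `column_id` is the helper sketched in section 2):

```python
import numpy as np

def transfer_matrix(kernel, tau_hat, s_hat, nu_bar12, k, d, tol=1e-6):
    """W from the column ID of the mode-(d+k) unfolding of
    K(tau_hat, s_hat with mode k replaced by nu1_bar U nu2_bar).
    Here k is 0-based; kernel evaluates entries on a cross-product grid."""
    axes = list(tau_hat) + list(s_hat)
    axes[d + k] = nu_bar12                   # s_hat_k is never used
    sub = kernel(*np.ix_(*axes))             # proxy subtensor, all modes O(r)-sized
    unf = np.moveaxis(sub, d + k, -1).reshape(-1, len(nu_bar12))
    skel, W = column_id(unf, tol)
    return np.asarray(nu_bar12)[skel], W     # new skeleton nu_bar and transfer W
```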
Similarly in (3.5), $U^{t,k}_{\tau,\nu}$ is explicitly formed at l = 0 and constructed via the transfer matrix $P^{t,k}_{\tau,\nu}$ at level $0 < l \le L^c$:
$$K(t_{k\leftarrow\tau}, \nu) \approx K(t_{k\leftarrow\bar\tau^1\cup\bar\tau^2}, \nu) \times_k \begin{bmatrix} U^{t,k}_{\tau^1,p_\nu} & \\ & U^{t,k}_{\tau^2,p_\nu} \end{bmatrix}$$
$$(3.7)\quad \approx K(t_{k\leftarrow\bar\tau}, \nu) \times_k \left( P^{t,k}_{\tau,\nu} \begin{bmatrix} U^{t,k}_{\tau^1,p_\nu} & \\ & U^{t,k}_{\tau^2,p_\nu} \end{bmatrix} \right).$$
Putting together (3.4), (3.5), (3.6) and (3.7), the proposed tensor butterfly decomposition can be expressed, for any multi-set $t = (t_1, t_2, \dots, t_d)$ with $t_i$ at level $L^c$ of $T_{t_i^0}$ and any multi-set $s = (s_1, s_2, \dots, s_d)$ with $s_i$ at level $L^c$ of $T_{s_i^0}$, by forming a Tucker-ID for the (t, s) pair:
$$(3.8)\quad K(t, s) \approx K(\bar t, \bar s) \prod_{k=1}^{d} \times_k \left( \prod_{l=L^c}^{1} P_l^{t,s,k}\; U^{t,k} \right) \prod_{k=1}^{d} \times_{d+k} \left( \prod_{l=L^c}^{1} W_l^{t,s,k}\; V^{s,k} \right).$$
Here, $\bar t$ and $\bar s$ represent the skeleton indices of the Tucker-ID of K(t, s). The interpolation factors $U^{t,k}$ and $V^{s,k}$ in (3.8) are:
$$(3.9)\quad U^{t,k} = \mathrm{diag}_\tau(U^{t,k}_{\tau,s_0}), \quad \tau \text{ at level } L^c \text{ of } T_{t_k},$$
$$(3.10)\quad V^{s,k} = \mathrm{diag}_\nu(V^{s,k}_{t_0,\nu}), \quad \nu \text{ at level } L^c \text{ of } T_{s_k},$$
and the transfer factors $P_l^{t,s,k}$ and $W_l^{t,s,k}$ for $l = 1, \dots, L^c$ are:
$$(3.11)\quad P_l^{t,s,k} = \mathrm{diag}_\nu\big[\mathrm{diag}_\tau(P^{t,k}_{\tau,\nu^c})\big]_c, \quad \tau \text{ at level } L^c - l \text{ of } T_{t_k};\ \nu_i \text{ at level } l-1 \text{ of } T_{s_i^0},\ s_i \subseteq \nu_i,\ i \le d;$$
$$(3.12)\quad W_l^{t,s,k} = \mathrm{diag}_\tau\big[\mathrm{diag}_\nu(W^{s,k}_{\tau^c,\nu})\big]_c, \quad \tau_i \text{ at level } l-1 \text{ of } T_{t_i^0},\ t_i \subseteq \tau_i,\ i \le d;\ \nu \text{ at level } L^c - l \text{ of } T_{s_k}.$$
One can verify that when d = 1, the tensor butterfly algorithm (3.8) reduces
to the matrix butterfly algorithm (2.12). But when d > 1, the tensor butterfly
algorithm has a distinct algorithmic structure and the computational complexity can
be significantly reduced compared with the matrix butterfly algorithm. Detailed
computational complexity analysis is provided in subsection 3.2.2.
To better understand the structure of the tensor butterfly in (3.8), (3.9), (3.10), (3.11), and (3.12), we describe its tensor diagram here. We first create the tensor diagram for a matrix partitioner as shown in Figure 3.1(b), which represents a $2^d \times 2$ block partitioning of a matrix such as $[W^{s,k}_{\tau^c,\nu}]_c$ in (3.12) for fixed s, τ, k and ν, or $[P^{t,k}_{\tau,\nu^c}]_c$ in (3.11) for fixed t, ν, k and τ. In other words, there are 2 legs on the column dimension and $2^d$ legs on the row dimension. The diagram in Figure 3.1(c) shows the connectivity for all $V^{s,k}_{t_0,\nu}$ (the green circles) and $W^{s,k}_{\tau^c,\nu}$ (the blue circles) for fixed s and k. The multiplication or contraction of all matrices in Figure 3.1(c) results in $V^{s,k}_{t,s_k}$ for all mid-level multi-sets t, which are of course not explicitly formed.
As an example, consider an OIO representing the free-space Green's function interaction between two parallel facing unit square plates in Figure 3.2. The tensor is $K(i, j) = K(x_i, y_j) = \frac{\exp(-\mathrm{i}\omega\rho)}{\rho}$, where $x_i = (\frac{i_1}{n}, \frac{i_2}{n}, 0)$, $y_j = (\frac{j_1}{n}, \frac{j_2}{n}, 1)$, $\rho = |x_i - y_j|$ and ω is the wavenumber. Here 1 represents the distance between the two plates. Consider an L = 2-level tensor butterfly decomposition, with a total of 16 middle-level multi-set pairs. Let (t, s) denote one middle-level multi-set pair with $t = (t_1, t_2)$ and $s = (s_1, s_2)$, as highlighted in orange in Figure 3.2(b). Their children are $t_1^1, t_1^2, t_2^1, t_2^2$ and $s_1^1, s_1^2, s_2^1, s_2^2$. Leveraging the representations in Figure 3.1(b)-(c), the full diagram for K(t, s) consists of one 4-mode tensor $K(\bar t, \bar s)$ (highlighted in orange in Figure 3.2(a)), one transfer matrix per mode, and two factor matrices per mode. In addition, we plot the full connectivity for two other multi-set pairs (highlighted in green in Figure 3.2(a)). It is important to note that the factor matrices and transfer matrices are shared among the multi-set pairs.
The proposed tensor butterfly algorithm is fully described in Algorithm 3.1 for a 2d-mode tensor $K \in \mathbb{C}^{m_1 \times m_2 \times \cdots \times m_d \times n_1 \times n_2 \times \cdots \times n_d}$, which consists of three steps: (1) computation of $V^{s,k}_{\tau,\nu}$ and $W^{s,k}_{\tau,\nu}$ starting at Line 1, (2) computation of $U^{t,k}_{\tau,\nu}$ and $P^{t,k}_{\tau,\nu}$ starting at Line 17, and (3) computation of $K(\bar t, \bar s)$ starting at Line 33. We note that, after each $K(\bar t, \bar s)$ is formed, we leverage floating-point compression tools such as the ZFP software [40] to further compress it.
Once K is compressed, any input tensor $F \in \mathbb{C}^{n_1 \times n_2 \times \cdots \times n_d \times n_v}$ can contract with it to compute $G = K \times_{d+1,d+2,\dots,2d} F$. It is clear to see that the contraction is equivalent to the matrix-matrix multiplication $G = KF$, where $G \in \mathbb{C}^{(\prod_k m_k) \times n_v}$, $K \in \mathbb{C}^{(\prod_k m_k) \times (\prod_k n_k)}$, and $F \in \mathbb{C}^{(\prod_k n_k) \times n_v}$ are matricizations of G, K and F, respectively, and $n_v$ is the number of columns of F. The contraction algorithm is described in Algorithm 3.2, which consists of three steps:
(1) Contraction with $V^{s,k}_{\tau,\nu}$ and $W^{s,k}_{\tau,\nu}$. For each level $l = 0, 1, \dots, L^c$, one notices that, since the contraction operation for each multi-set τ with $\tau_i$ at level l of $T_{t_i^0}$ and the middle-level multi-set s is independent of each other, one needs a separate tensor $F^{\tau,s}$ to store the contraction result for each multi-set pair (τ, s). $F^{\tau,s}$ can be computed by mode-by-mode contraction with the factor matrices $V^{s,k}$ for l = 0 (Line 6) and the transfer matrices $\mathrm{diag}_\nu(W^{s,k}_{\tau,\nu})$ for l > 0 (Line 8).
(2) Contraction with $K(\bar t, \bar s)$ at the middle level. Tensors at the middle level $F^{t,s}$ are contracted with each subtensor $K(\bar t, \bar s)$ separately, resulting in tensors $G^{t,s} = K(\bar t, \bar s) \times_{d+1,d+2,\dots,2d} F^{t,s}$.
(3) Contraction with $U^{t,k}_{\tau,\nu}$ and $P^{t,k}_{\tau,\nu}$. As in Step (1), for each level $l = L^c, L^c - 1, \dots, 0$, the contraction operation for each multi-set ν with $\nu_i$ at level l of $T_{s_i^0}$ and middle-level multi-set t is independent. At level l > 0, the contribution of tensors $G^{t,\nu}$ is accumulated into $G^{t,p_\nu}$ (Line 26); at level l = 0, the contraction results are stored in the final output tensor $G(t, 1:n_v)$ (Line 24). A sketch of this contraction pattern for a single middle-level pair is given after this list.
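The sketch below (illustrative Python, not the paper's code) shows the three steps for a single middle-level pair (t, s), with each transfer chain collapsed into one factor matrix per mode; the shapes follow the Tucker-ID form (3.8):

```python
import numpy as np

def contract_pair(core, U, V, F):
    """G^{t,s} for one pair: K(t,s) ~= core x_k U[k] x_{d+k} V[k], with
    core of 2d modes (all of size r), U[k] of shape (m_k, r) collapsing the
    P-chain, V[k] of shape (r, n_k) collapsing the W-chain, and
    F of shape (n_1, ..., n_d, n_v)."""
    d = len(U)
    T = F
    for k in range(d):                   # Step (1): compress the input modes
        T = np.moveaxis(np.tensordot(V[k], T, axes=(1, k)), 0, k)
    # Step (2): contract with the middle-level core over its column modes
    G = np.tensordot(core, T, axes=(list(range(d, 2 * d)), list(range(d))))
    for k in range(d):                   # Step (3): expand the output modes
        G = np.moveaxis(np.tensordot(U[k], G, axes=(1, k)), 0, k)
    return G                             # shape (m_1, ..., m_d, n_v)
```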
Algorithm 3.2 Contraction algorithm for a tensor butterfly decomposition with an input tensor
Input: The tensor butterfly decomposition of a 2d-mode tensor $K \in \mathbb{C}^{m_1 \times m_2 \times \cdots \times m_d \times n_1 \times n_2 \times \cdots \times n_d}$, and a (full) (d + 1)-mode input tensor $F \in \mathbb{C}^{n_1 \times n_2 \times \cdots \times n_d \times n_v}$, where $n_v$ denotes the number of columns of $F_{(d+1)}$.
Output: The (d + 1)-mode output tensor $G = K \times_{d+1,d+2,\dots,2d} F$, where $G \in \mathbb{C}^{m_1 \times m_2 \times \cdots \times m_d \times n_v}$.
1: (1) Multiply with $V^{s,k}_{\tau,\nu}$ and $W^{s,k}_{\tau,\nu}$:
2: for level $l = 0, \dots, L^c$ do
3:   for multi-set $s = (s_1, s_2, \dots, s_d)$ with $s_i$ at level $L^c$ of $T_{s_i^0}$ do
4:     for multi-set $\tau = (\tau_1, \tau_2, \dots, \tau_d)$ with $\tau_i$ at level l of $T_{t_i^0}$ do
5:       if l = 0 then
6:         $F^{\tau,s} = F(s, 1:n_v) \prod_{k=1}^{d} \times_k V^{s,k}$
7:       else
8:         $F^{\tau,s} = F^{p_\tau,s} \prod_{k=1}^{d} \times_k \mathrm{diag}_\nu(W^{s,k}_{\tau,\nu})$  ▷ ν at level $L^c - l$ of $T_{s_k}$
9:       end if
10:    end for
11:  end for
12: end for
13: (2) Contract with $K(\bar t, \bar s)$:
14: for multi-set $t = (t_1, t_2, \dots, t_d)$ with $t_i$ at level $L^c$ of $T_{t_i^0}$ do
15:   for multi-set $s = (s_1, s_2, \dots, s_d)$ with $s_i$ at level $L^c$ of $T_{s_i^0}$ do
16:     ZFP-decompress $K(\bar t, \bar s)$ and compute $G^{t,s} = K(\bar t, \bar s) \times_{d+1,d+2,\dots,2d} F^{t,s}$
17:   end for
18: end for
19: (3) Multiply with $U^{t,k}_{\tau,\nu}$ and $P^{t,k}_{\tau,\nu}$:
20: for level $l = L^c, \dots, 0$ do
21:   for multi-set $t = (t_1, t_2, \dots, t_d)$ with $t_i$ at level $L^c$ of $T_{t_i^0}$ do
22:     for multi-set $\nu = (\nu_1, \nu_2, \dots, \nu_d)$ with $\nu_i$ at level l of $T_{s_i^0}$ do
23:       if l = 0 then  ▷ Compute and return G
24:         $G(t, 1:n_v) = G^{t,\nu} \prod_{k=1}^{d} \times_k U^{t,k}$
25:       else
26:         $G^{t,p_\nu} \mathrel{+}= G^{t,\nu} \prod_{k=1}^{d} \times_k \mathrm{diag}_\tau(P^{t,k}_{\tau,\nu})$  ▷ τ at level $L^c - l$ of $T_{t_k}$
27:       end if
28:     end for
29:   end for
30: end for
is
$$(3.13)\quad K(i, j) = K(x_i, y_j) = \frac{\exp(-\mathrm{i}\omega\rho)}{\rho},$$
where $x_i = (\frac{i_1}{n}, \frac{i_2}{n}, 0)$, $y_j = (\frac{j_1}{n}, \frac{j_2}{n}, \rho_{\min})$, ω is the wavenumber, and $\rho = |x_i - y_j|$. Here $\rho_{\min}$ represents the distance between the two plates, assumed to be sufficiently large. In the high-frequency setting, $n = C_p \omega$ with a constant $C_p$ independent of n and ω, and the grid size is $\delta x = \delta y = \frac{1}{n}$ per dimension. It has been well studied [52, 53, 20, 5] that for any multi-set pair (τ, ν) leading to a subtensor K(τ, ν) of sizes $m_1 \times m_2 \times n_1 \times n_2$ with $m_i, n_i \le n$, the numerical rank of its matricization $K \in \mathbb{C}^{m_1 m_2 \times n_1 n_2}$ can be estimated as
$$(3.14)\quad r_m \approx \omega^2 a^2 \theta\phi \approx \frac{\omega^2 a^2 n_1 n_2}{n^2 \rho_{\min}^2}.$$
Here a is the radius of the sphere enclosing the target domain of physical sizes $m_1\delta x \times m_2\delta y$, $\theta \approx \frac{n_1}{n\rho_{\min}}$, $\phi \approx \frac{n_2}{n\rho_{\min}}$, and the product θφ represents the solid angle covered by the source domain as seen from the center of the target domain. Note that $\frac{\omega a}{\rho_{\min}}$ approximately represents the Nyquist sampling rate per direction needed in the source domain. The matrix and tensor butterfly ranks can be estimated as follows:
• Matrix butterfly rank: Consider a matrix butterfly factorization of K. By design, for any node pair at each level, $m_1 n_1 = m_2 n_2 = C_b n$, where $C_b^2$ represents the size of the leaf nodes. Therefore, the matrix butterfly rank can be estimated from (3.14) as
$$(3.15)\quad r_m \approx \frac{C_b^2}{2C_p^2\rho_{\min}^2}.$$
Here we have assumed $a = \frac{m_1}{\sqrt{2}\,n}$. Note that $r_m$ is a constant independent of n, and therefore the matrix CLR property holds true.
• Tensor butterfly rank: Consider an L-level tensor butterfly factorization of K. We just need to check the tensor rank, e.g., the rank of the mode-4 unfolding of the corresponding subtensors at Step (1) of Algorithm 3.1, as the unfoldings for the other modes can be investigated in a similar fashion. Figure 3.3(a) shows an example of L = 2, where the target and source domains are partitioned at l = 0 (top) and $l = L^c = 1$ (bottom) at Step (1) of Algorithm 3.1. Consider a multi-set pair $(\tau, s_{k\leftarrow\nu})$ with k = 4 required by the tensor CLR property in (3.4). Figure 3.3(a) highlights in orange one multi-set pair at l = 0 (top) and one multi-set pair at $l = L^c$ (bottom). Mode 4 is highlighted in red, which needs to be skeletonized by ID. By (3.14), the rank of the matricization of $K(\tau, s_{k\leftarrow\nu})$ is no longer a constant, as the tensor butterfly algorithm needs to keep $n_1 = |s_1| = n/2^{L^c}$ (see Figure 3.3(b)). However, due to translational invariance of the free-space Green's function, i.e., $K(x_i, y_j) = K(\tilde x, \tilde y)$, where $\tilde x = (0, \frac{i_2}{n}, 0)$, $\tilde y = (\frac{j_1 - i_1}{n}, \frac{j_2}{n}, \rho_{\min})$, the mode-4 unfolding of $K(\tau, s_{k\leftarrow\nu})$ is the matrix representing the Green's function interaction between an enlarged target domain of sizes $(m_1 + n_1)\delta x \times m_2 \delta y$ and a source line segment of length $n_2 \delta y$. Therefore its rank (hence the tensor rank) can be estimated as
$$(3.16)\quad r_t \approx \omega a' \phi \approx \frac{\omega a' n_2}{n\rho_{\min}} \le \frac{\sqrt{2}\,C_b}{C_p\rho_{\min}},$$
where $a'$ is the radius of the sphere enclosing the enlarged target domain and $\frac{\omega a'}{\rho_{\min}}$ approximately represents the Nyquist sampling rate on the source line segment. The last inequality is a result of $a' \approx \frac{m_1 + n_1}{\sqrt{2}\,n} \le \frac{\sqrt{2}\,m_1}{n}$ and $m_2 n_2 = C_b n$. Here, the critical condition $n_1 \le m_1$ is a direct result of the setup of the tensor CLR in (3.4): $l \le L^c$ and $n_1 = |s_1| = n/2^{L^c}$ (i.e., $s_1$ is fixed as the center-level set as l changes). One can clearly see from (3.16) that $r_t$ is independent of n, and thus the tensor CLR property holds true.
We remark that the tensor butterfly rank $r_t$ in (3.16) is significantly smaller than the matrix butterfly rank $r_m$ in (3.15), with $r_t \approx 2\sqrt{r_m}$. One can perform similar analysis of $r_m$ and $r_t$ for different geometrical settings, such as a pair of well-separated 3D unit cubes, or a pair of co-planar 2D unit-square plates. We leave these exercises to the readers.
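These estimates are easy to probe numerically. The sketch below (illustrative sizes and tolerance, not the paper's experiments) builds a small parallel-plate Green's tensor and compares the ε-rank of its matricization, which tracks $r_m$, with that of its mode-4 unfolding, which tracks the smaller $r_t$:

```python
import numpy as np

n = 32
omega = 2 * np.pi * n / 4                  # 4 points per wavelength, C_p = 2/pi
x1 = np.arange(n) / n; x2 = np.arange(n) / n
y1 = np.arange(n) / n; y2 = np.arange(n) / n
rho = np.sqrt((x1[:, None, None, None] - y1[None, None, :, None]) ** 2
              + (x2[None, :, None, None] - y2[None, None, None, :]) ** 2
              + 1.0)                       # plate distance rho_min = 1
K = np.exp(-1j * omega * rho) / rho        # 4-mode tensor, modes (i1, i2, j1, j2)

def eps_rank(A, tol=1e-6):
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

print(eps_rank(K.reshape(n * n, n * n)))   # matricization: rank tracks r_m
print(eps_rank(K.reshape(n * n * n, n)))   # mode-4 unfolding: rank tracks r_t
```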
Discrete Fourier Transform. Our second example is the high-dimensional discrete Fourier transform (DFT) defined by
To carry out arbitrary high-dimensional DFTs, one can simply perform 1D DFTs one dimension at a time (while fixing the indices of the other dimensions) using either 1D FFTs or 1D matrix butterfly algorithms. We choose the 1D butterfly approach as our reference algorithm. For each node pair at dimension k discretized into an $m_k \times n_k$ matrix, we assume that $m_k n_k = C_b n$. It has been proved in [8, 68] that this leads to the matrix CLR property and each 1D DFT (fixing indices in other dimensions) can be computed by the matrix butterfly algorithm in O(n log n) time with a constant butterfly rank $r_m$. Overall this approach requires $O(dn^d \log n)$ operations.
In contrast, the tensor butterfly algorithm relies on direct compression of, e.g., the mode-k unfolding of subtensors $K(\tau, s_{k\leftarrow\nu})$. Consider any submatrix $K_{sub} \in \mathbb{C}^{m_k \times n_k}$ of this unfolding matrix $K_{(k)}$; by fixing $i_p$ and $j_p$ with $p \ne k$, its entry is simply
$$\exp\left(\frac{2\pi\mathrm{i}(i_k - 1)(j_k - 1)}{n}\right)$$
scaled by a constant factor
$$(3.19)\quad \prod_{p \ne k} \exp\left(\frac{2\pi\mathrm{i}(i_p - 1)(j_p - 1)}{n}\right).$$
Since every submatrix of $K_{(k)}$ is therefore a scaled copy of the same $m_k \times n_k$ DFT block, we have $r_t = \mathrm{rank}(K_{sub}) = \mathrm{rank}(K_{(k)}) = r_m$. The tensor CLR property thus holds true, and the tensor rank is exactly the same as that of the 1D butterfly algorithm per dimension. However, as we will see in subsection 3.2.2, our tensor butterfly algorithm yields a linear instead of quasi-linear CPU complexity for high-dimensional DFTs.
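This argument can be checked numerically as well; in the sketch below (arbitrary block sizes), each row of the unfolding is a scaled copy of a row of the corresponding 1D DFT block, so the two ε-ranks coincide:

```python
import numpy as np

def eps_rank(A, tol=1e-10):
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

n = 256
i1, j1 = np.arange(8), np.arange(8)          # fixed-size ranges for modes p != k
i2, j2 = np.arange(16), np.arange(64, 80)    # the m_k x n_k block for mode k
phase = (np.multiply.outer(i1, j1)[:, None, :, None]
         + np.multiply.outer(i2, j2)[None, :, None, :]) / n
K = np.exp(2j * np.pi * phase)               # 4-mode DFT subtensor (i1, i2, j1, j2)
unf = K.reshape(-1, j2.size)                 # unfolding with j2 as columns
D1 = np.exp(2j * np.pi * np.outer(i2, j2) / n)   # the 1D DFT block
print(eps_rank(unf), eps_rank(D1))           # equal ranks
```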
3.2.2. Complexity Analysis. Here we provide an analysis of the computational complexity and memory requirement of the proposed construction algorithm (Algorithm 3.1) and contraction algorithm (Algorithm 3.2), assuming that the tensor butterfly rank $r_t$ is a small constant and d > 1.
At Step (1) of Algorithm 3.1, each level $1 \le l \le L^c$ has $\#s = O((\sqrt{n})^d)$, $\#\tau = 2^{dl}$, and $\#\nu = O(\sqrt{n}/2^l)$ for each mode $k \le d$. Each $W^{s,k}_{\tau,\nu}$ requires $O(r_t^2)$ storage, and $O(r_t^{2d+1})$ computational time when proxy indices $\hat\tau$, $\hat s$ are being used. The storage requirement and computational cost for $W^{s,k}_{\tau,\nu}$ are:
$$(3.20)\quad \mathrm{mem}_W = \sum_{l=1}^{L^c} d\,O\big((\sqrt{n})^d\big)\,2^{dl}\,O(\sqrt{n}/2^l)\,O(r_t^2) = O(dn^d r_t^2),$$
Fig. 3.3: Illustration of the tensor CLR property with L = 2 for a 4-mode tensor representing free-space Green's function interactions between parallel facing unit square plates. (a) The target and source domains are partitioned at l = 0 (top) and $l = L^c = 1$ (bottom) with a multi-set pair $(\tau, s_{k\leftarrow\nu})$ highlighted in orange for the skeletonization along mode 4. The sizes of the nodes are $|\tau_1| = m_1$, $|\tau_2| = m_2$, $|s_1| = n_1$ and $|\nu| = n_2$. (b) Illustration of the rank of the matricization of $K(\tau, s_{k\leftarrow\nu})$. (c) Illustration of the rank of the mode-4 unfolding of $K(\tau, s_{k\leftarrow\nu})$.
$$(3.21)\quad \mathrm{time}_W = \sum_{l=1}^{L^c} d\,O\big((\sqrt{n})^d\big)\,2^{dl}\,O(\sqrt{n}/2^l)\,O(r_t^{2d+1}) = O(dn^d r_t^{2d+1}).$$
One can easily verify that the computation and storage of $V^{s,k}_{\tau,\nu}$ at l = 0 is less dominant than that of $W^{s,k}_{\tau,\nu}$ at l > 0, so we skip its analysis.
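To see why these sums stay linear in $n^d$, note that for d > 1 the geometric sum is dominated by its last term $l = L^c$, with $2^{L^c} = O(\sqrt{n})$:
$$\sum_{l=1}^{L^c} (\sqrt{n})^d\, 2^{dl}\, \frac{\sqrt{n}}{2^l} = n^{(d+1)/2} \sum_{l=1}^{L^c} 2^{(d-1)l} = O\!\left(n^{(d+1)/2}\,(\sqrt{n})^{d-1}\right) = O(n^d).$$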
At Step (2) of Algorithm 3.1, we have $\#s = O((\sqrt{n})^d)$ and $\#t = O((\sqrt{n})^d)$, and each $K(\bar t, \bar s)$ requires $O(r_t^{2d})$ computation time and storage units (even if it is further ZFP-compressed to reduce the storage requirement), which adds up to
$$(3.22)\quad \mathrm{mem}_K = O\big((\sqrt{n})^d\big)\,O\big((\sqrt{n})^d\big)\,O(r_t^{2d}) = O(n^d r_t^{2d}),$$
$$(3.23)\quad \mathrm{time}_K = O\big((\sqrt{n})^d\big)\,O\big((\sqrt{n})^d\big)\,O(r_t^{2d}) = O(n^d r_t^{2d}).$$
Step (3) of Algorithm 3.1 has similar computational cost and memory requirement to Step (1) when computing the transfer matrices $P^{t,k}_{\tau,\nu}$, with $\mathrm{mem}_P \sim \mathrm{mem}_W$ and $\mathrm{time}_P \sim \mathrm{time}_W$.
Overall, Algorithm 3.1 requires $O(n^d(dr_t^2 + r_t^{2d}))$ memory and $O(n^d(dr_t^{2d+1} + r_t^{2d}))$ CPU time, both linear in $n^d$.
Following a similar analysis, one can estimate the computational cost of Algorithm 3.2 as $O(n^d r_t^{2d} n_v)$, which is essentially of the same order as the memory cost of Algorithm 3.1, except for an extra factor $n_v$ representing the size of the last dimension of the input tensor.
One critical observation is that the time and storage complexity of the tensor butterfly algorithm is linear in $n^d$ with smaller ranks $r_t$, while that of the matrix butterfly algorithm is quasi-linear in $n^d$ with much larger ranks $r_m$. This leads to a
Algorithm              Factor time (d=2, d=3)       Apply time (d=2, d=3)        r (d=2, d=3)
Tensor butterfly       $n^2$, $n^3$                 $n^2$, $n^3$                 1, 1
Matrix butterfly       $n^2\log n$, $n^3\log n$     $n^2\log n$, $n^3\log n$     1, 1
Tucker ID              $n^4$, $n^4$ to $n^6$*       $n^4$, $n^4$ to $n^6$*       n, n
QTT (Green & Radon)    $n^3\log n$, $n^3\log n$     $n^4\log n$, $n^5\log n$     n, n
QTT (DFT)              $\log n$, $\log n$           $n^2\log n$, $n^3\log n$     1, 1

Table 3.1: CPU complexity of the tensor butterfly algorithm, matrix butterfly algorithm, Tucker ID and QTT when applied to high-frequency Green's functions (d = 2 represents two parallel facing unit square plates and d = 3 represents two separated unit cubes), DFT and Radon transforms. Here we assume that the tensor butterfly, matrix butterfly and Tucker ID algorithms use proxy indices, and the QTT algorithm uses TT-cross. The big-O notation is assumed. *: for d = 3, the complexity of Tucker ID is $n^6$ for the Radon transform and DFT, and $n^4$ for the Green's function.
$$(4.1)\quad K(i, j) = \frac{\exp(-\mathrm{i}\omega\rho)}{\rho}, \qquad \rho = |x_i - y_j|,$$
where ω represents the wavenumber. Two tests are performed: (1) A 4-way tensor representing the Green's function interaction between two parallel facing unit plates with distance 1, i.e., $x_i = (\frac{i_1}{n}, \frac{i_2}{n}, 0)$, $y_j = (\frac{j_1}{n}, \frac{j_2}{n}, 1)$, and d = 2. (2) A 6-way tensor representing the Green's function interaction between two unit cubes with the distance between their centers set to 2, i.e., $x_i = (\frac{i_1}{n}, \frac{i_2}{n}, \frac{i_3}{n})$, $y_j = (\frac{j_1}{n}, \frac{j_2}{n}, \frac{j_3}{n} + 2)$, and d = 3. For both tests, the wavenumber is chosen such that the number of points per wavelength is 4, i.e., $2\pi n/\omega = 4$ or $C_p = 2/\pi$. We first perform compression using the tensor butterfly, Tucker ID and QTT algorithms, and then perform application/contraction using a random input tensor F. We also add results for the matrix butterfly algorithm using the corresponding matricizations of K and F.
Figure 4.1 (left) shows the factorization time, application time and memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-6}$ for the parallel-plate case. For QTT, we show the memory of the factorization (labeled as "QTT(Factor)") and application (labeled as "QTT(Apply)") separately. Note that although QTT factorization requires sub-linear memory usage, QTT contraction becomes super-linear due to the full QTT rank of the input tensor. Overall, we achieve the expected complexities listed in Table 3.1 for the butterfly and Tucker ID algorithms. For QTT, however, instead of an O(n) rank scaling, we observe an $O(n^{3/4})$ rank scaling, leading to slightly better complexities compared with Table 3.1. We leave this as a future investigation. That said, the tensor butterfly algorithm achieves linear CPU and memory complexities for both factorization and application with a much smaller prefactor compared to all the other algorithms. Remarkably, the tensor butterfly algorithm achieves a 30× memory reduction and 15× speedup, and is capable of handling 64× larger tensors compared with the matrix butterfly algorithm.
Figure 4.1 (right) shows the factorization time, application time and memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-2}$ for the cube case. Overall, we achieve the expected complexities listed in Table 3.1 for all four algorithms. The tensor butterfly algorithm achieves linear CPU and memory complexities for both factorization and application with a much smaller prefactor compared to all the other algorithms. Remarkably, the tensor butterfly algorithm achieves a 30× memory reduction and 200× speedup, and is capable of handling 512× larger tensors compared with the matrix butterfly algorithm. The largest data point n = 2048 corresponds to 512 wavelengths per physical dimension. The results in Figure 4.1 suggest the superiority of the tensor butterfly algorithm for solving high-frequency wave equations in 3D volumes and on 3D surfaces.
Next, we demonstrate the effect of changing the compression tolerance ϵ for both test cases in Table 4.1. Here the error is measured by
$$(4.2)\quad \mathrm{error} = \frac{\|K \times_{d+1,d+2,\dots,2d} F_e - K_{BF} \times_{d+1,d+2,\dots,2d} F_e\|_F}{\|K \times_{d+1,d+2,\dots,2d} F_e\|_F},$$
where $K_{BF}$ is the tensor butterfly representation of K, and $F_e(j) = 1$ for a small set of random entries j and 0 elsewhere. This way, K does not need to be fully formed to compute the error. Table 4.1 shows the minimum rank ($r_{\min}$) and maximum rank (r), error, factorization time, application time and memory usage for varying ϵ, for n = 16384, d = 2 and n = 512, d = 3. Overall, the errors are close to the prescribed tolerances and the costs increase for smaller ϵ, as expected. We also note that keeping r as low as possible is critical in maintaining small prefactors of the tensor butterfly algorithm, particularly for higher dimensions.
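A sketch of how (4.2) can be evaluated without forming K in full follows (the `kernel_column` callback, returning one column of the matricized K, and the `apply_butterfly` handle for the compressed contraction are hypothetical stand-ins):

```python
import numpy as np

def sampled_error(kernel_column, apply_butterfly, col_shape, nsamp=10, seed=0):
    """Relative error (4.2): F_e is 1 at nsamp random column multi-indices and
    0 elsewhere, so the exact result needs only nsamp columns of K."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(np.prod(col_shape), size=nsamp, replace=False)
    exact = sum(kernel_column(j) for j in cols)   # sum of the sampled columns
    Fe = np.zeros(np.prod(col_shape), dtype=complex)
    Fe[cols] = 1.0
    approx = apply_butterfly(Fe.reshape(*col_shape, 1)).ravel()
    return np.linalg.norm(exact - approx) / np.linalg.norm(exact)
```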
4.2. Radon transforms. In this subsection, we consider 2D and 3D discretized
Radon transforms similar to those presented in [8]. Specifically, the tensor entry is
For d = 3, we consider
[Figure 4.1: factor time (top), memory usage (middle), and apply time (bottom) of each algorithm versus problem size, for the parallel-plate case (left) and the cube case (right); axes in Time (Sec) and Memory (GB).]
We first perform compression using the matrix butterfly, tensor butterfly, and QTT algorithms, and then perform application/contraction using a random input tensor F.
Figure 4.2 shows the factorization time, application time and memory usage of each algorithm using a compression tolerance $\epsilon = 10^{-3}$ for the 2D transform (left) and the 3D transform (right). Overall, we achieve the expected complexities listed in Table 3.1 for all three algorithms. The QTT algorithm can only obtain the first 2 or 3 data points due to its high memory usage and large QTT ranks. In comparison, the tensor butterfly algorithm achieves linear CPU and memory complexities for both factorization and application with a much smaller prefactor compared to all the other algorithms. Note that the Radon transform kernels in (4.4) and (4.5) are not translationally invariant, but the tensor butterfly algorithm can still attain small ranks. As a result, the tensor butterfly algorithm can handle 64× larger Radon
$n^d$       ϵ      $r_{\min}$  r   error     $T_f$ (sec)  $T_a$ (sec)  Mem (MB)
$16384^2$   1E-02  5           8   1.49E-02  6.83E+01     1.16E+00     2.40E+04
$16384^2$   1E-03  6           10  2.19E-03  1.17E+02     1.89E+00     4.69E+04
$16384^2$   1E-04  7           11  1.84E-04  1.57E+02     2.80E+00     7.49E+04
$16384^2$   1E-05  8           12  3.46E-05  2.29E+02     4.03E+00     1.21E+05
$16384^2$   1E-06  9           13  9.26E-06  3.18E+02     5.92E+00     1.96E+05
$512^3$     1E-02  2           5   2.01E-02  1.18E+02     1.42E+00     1.19E+04
$512^3$     1E-03  2           6   1.18E-03  3.46E+02     4.08E+00     4.87E+04
$512^3$     1E-04  2           7   8.39E-05  6.26E+02     9.85E+00     1.49E+05
$512^3$     1E-05  3           8   9.21E-06  1.25E+03     2.40E+01     4.07E+05

Table 4.1: The technical data for a 4-way Green's function tensor of n = 16384 and a 6-way Green's function tensor of n = 512 for the Helmholtz equation using the proposed tensor butterfly algorithm with varying compression tolerance ϵ. The table shows the minimum rank $r_{\min}$ and maximum rank r across all ID operations, the relative error in (4.2), factor time $T_f$, apply time $T_a$, and memory usage Mem.
Table 4.2: The technical data for a 4-way Radon transform tensor of n = 2048 in (4.4) and a 6-way Radon transform tensor of n = 128 in (4.5) using the proposed tensor butterfly algorithm with varying compression tolerance ϵ. The table shows the minimum rank $r_{\min}$ and maximum rank r across all ID operations, the relative error in (4.2), factor time $T_f$, apply time $T_a$, and memory usage Mem.
transforms compared with the matrix butterfly algorithm, showing its superiority for solving linear inverse problems in tomography and seismic imaging.
Next, we demonstrate the effect of changing the compression tolerance ϵ for both test cases in Table 4.2, with the error defined by (4.2). Table 4.2 shows the minimum and maximum ranks, error, factorization time, application time and memory usage for varying ϵ, for n = 2048 with d = 2 and n = 128 with d = 3, respectively. Overall, the errors are close to the prescribed tolerances and the costs increase for smaller ϵ, as expected. Just like the Green's function example, it is critical to keep r a low constant, particularly for higher dimensions.
4.3. High-dimensional discrete Fourier transform. Finally, we consider
high-dimensional DFTs defined as
[Figure 4.2: factor time (top), memory usage (middle), and apply time (bottom) of the tensor butterfly, matrix butterfly, and QTT algorithms for the 2D (left) and 3D (right) Radon transforms; the panels of Figure 4.3 are described in its caption below. Axes in Time (Sec) and Memory (GB).]
Fig. 4.3: Fourier transforms: computational complexity of (left) the tensor butterfly algorithm and heFFTe for compressing the high-dimensional DFT tensor and (right) the tensor butterfly algorithm and FINUFFT for compressing the high-dimensional NUFFT tensor. (Top): factor time of the tensor butterfly algorithm and plan-creation time for heFFTe/FINUFFT. (Middle): factor memory. (Bottom): apply time.
REFERENCES
[1] Alan Ayala, Stanimire Tomov, Piotr Luszczek, Sébastien Cayrols, Gerald Ragghianti, and Jack Dongarra. Analysis of the communication and computation cost of FFT libraries towards exascale. Technical Report ICL-UT-22-07, https://icl.utk.edu/files/publications/2022..., 2022.
[2] Alexander H Barnett, Jeremy Magland, and Ludvig af Klinteberg. A parallel nonuniform fast Fourier transform library based on an "exponential of semicircle" kernel. SIAM Journal on Scientific Computing, 41(5):C479–C504, 2019.
[3] David Joseph Biagioni. Numerical construction of Green’s functions in high dimensional elliptic
problems with variable coefficients and analysis of renewable energy data via sparse and
separable approximations. PhD thesis, University of Colorado at Boulder, 2012.
[4] James Bremer, Ze Chen, and Haizhao Yang. Rapid Application of the Spherical Har-
monic Transform via Interpolative Decomposition Butterfly Factorization. arXiv preprint
arXiv:2004.11346, 2020.
[5] Ovidio M Bucci and Giorgio Franceschetti. On the degrees of freedom of scattered fields. IEEE Transactions on Antennas and Propagation, 37(7):918–926, 1989.
[6] HanQin Cai, Keaton Hamm, Longxiu Huang, and Deanna Needell. Mode-wise tensor decompositions: Multi-dimensional generalizations of CUR decompositions. Journal of Machine Learning Research, 22(185):1–36, 2021.
[7] Emmanuel Candès, Laurent Demanet, and Lexing Ying. Fast computation of Fourier integral operators. SIAM Journal on Scientific Computing, 29(6):2464–2493, 2007.
[8] Emmanuel Candès, Laurent Demanet, and Lexing Ying. A fast butterfly algorithm for the
computation of Fourier integral operators. Multiscale Model. Sim., 7(4):1727–1750, 2009.
[9] Jielun Chen and Michael Lindsey. Direct interpolative construction of the discrete Fourier transform as a matrix product operator. arXiv preprint arXiv:2404.03182, 2024.
[10] Ze Chen, Juan Zhang, Kenneth L Ho, and Haizhao Yang. Multidimensional phase recovery
and interpolative decomposition butterfly factorization. Journal of Computational Physics,
412:109427, 2020.
[11] Jian Cheng, Dinggang Shen, Peter J Basser, and Pew-Thian Yap. Joint 6D kq space compressed
sensing for accelerated high angular resolution diffusion MRI. In Information Processing
in Medical Imaging: 24th International Conference, IPMI 2015, Sabhal Mor Ostaig, Isle
of Skye, UK, June 28-July 3, 2015, Proceedings, pages 782–793. Springer, 2015.
[12] Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, Danilo P Mandic,
et al. Tensor networks for dimensionality reduction and large-scale optimization: Part 1
low-rank tensor decompositions. Foundations and Trends® in Machine Learning, 9(4-
5):249–429, 2016.
[13] Lisa Claus, Pieter Ghysels, Yang Liu, Thái Anh Nhan, Ramakrishnan Thirumalaisamy, Amneet
Pal Singh Bhalla, and Sherry Li. Sparse approximate multifrontal factorization with com-
posite compression methods. ACM Transactions on Mathematical Software, 49(3):1–28,
2023.
[14] Eduardo Corona, Abtin Rahimian, and Denis Zorin. A tensor-train accelerated solver for
integral equations in complex geometries. Journal of Computational Physics, 334:145–169,
2017.
[15] Maurice A De Gosson. The Wigner Transform. World Scientific Publishing Company, 2017.
[16] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value
decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
[17] Stanley R Deans. The Radon transform and some of its applications. Courier Corporation,
2007.
[18] Gian Luca Delzanno. Multi-dimensional, fully-implicit, spectral method for the Vlasov–Maxwell equations with exact conservation laws in discrete form. Journal of Computational Physics, 301:338–356, 2015.
[19] Sergey Dolgov and Dmitry Savostyanov. Parallel cross interpolation for high-precision cal-
culation of high-dimensional integrals. Computer Physics Communications, 246:106869,
2020.
[20] Björn Engquist and Lexing Ying. Fast directional multilevel algorithms for oscillatory kernels.
SIAM Journal on Scientific Computing, 29(4):1710–1737, 2007.
[21] Alexander L Fetter and John Dirk Walecka. Quantum theory of many-particle systems. Courier
Corporation, 2012.
[22] Ilias I Giannakopoulos, Mikhail S Litsarev, and Athanasios G Polimeridis. Memory footprint reduction for the FFT-based volume integral equation method via tensor decompositions. IEEE Transactions on Antennas and Propagation, 67(12):7476–7486, 2019.
[23] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low-rank tensor
approximation techniques. GAMM-Mitteilungen, 36(1):53–78, 2013.
[24] Han Guo, Jun Hu, and Eric Michielssen. On MLMDA/butterfly compressibility of inverse
integral operators. IEEE Antennas Wirel. Propag. Lett., 12:31–34, 2013.
[25] Han Guo, Yang Liu, Jun Hu, and Eric Michielssen. A butterfly-based direct integral-equation
solver using hierarchical LU factorization for analyzing scattering from electrically large
conducting objects. IEEE Trans. Antennas Propag., 65(9):4742–4750, 2017.
[26] Han Guo, Yang Liu, Jun Hu, and Eric Michielssen. A butterfly-based direct solver using hier-
archical LU factorization for Poggio-Miller-Chang-Harrington-Wu-Tsai equations. Microw
Opt Technol Lett., 60:1381–1387, 2018.
[27] Wolfgang Hackbusch and Boris N Khoromskij. Tensor-product approximation to multidimensional integral operators and Green's functions. SIAM Journal on Matrix Analysis and Applications, 30(3):1233–1253, 2008.
[28] Wolfgang Hackbusch and Stefan Kühn. A new scheme for the tensor representation. Journal
of Fourier analysis and applications, 15(5):706–722, 2009.
[29] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with random-
ness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM
Review, 53(2):217–288, January 2011.
[30] Richard A Harshman et al. Foundations of the parafac procedure: Models and conditions for an
“explanatory” multi-modal factor analysis. UCLA working papers in phonetics, 16(1):84,
1970.
[31] Rui Hong, Ya-Xuan Xiao, Jie Hu, An-Chun Ji, and Shi-Ju Ran. Functional tensor network
solving many-body Schrödinger equation. Phys. Rev. B, 105:165116, Apr 2022.
[32] L Hörmander. Fourier integral operators. i. In Mathematics Past and Present Fourier Integral
Operators, pages 23–127. Springer, 1994.
[33] Yuehaw Khoo and Lexing Ying. SwitchNet: a neural network model for forward and inverse scattering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
[34] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review,
51(3):455–500, 2009.
[35] Matthew Li, Laurent Demanet, and Leonardo Zepeda-Núñez. Wide-band butterfly network:
stable and efficient inversion via multi-frequency neural networks. Multiscale Modeling &
Simulation, 20(4):1191–1227, 2022.
[36] Yingzhou Li and Haizhao Yang. Interpolative butterfly factorization. SIAM J. Sci. Comput.,
39(2):A503–A531, 2017.
[37] Yingzhou Li, Haizhao Yang, Eileen R Martin, Kenneth L Ho, and Lexing Ying. Butterfly
factorization. Multiscale Model. Sim., 13(2):714–732, 2015.
[38] Yingzhou Li, Haizhao Yang, and Lexing Ying. Multidimensional butterfly factorization. Applied
and Computational Harmonic Analysis, 44(3):737–758, 2018.
[39] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms
for the low-rank approximation of matrices. Proc. Natl. Acad. Sci. USA, 104:20167–20172,
2007.
[40] Peter Lindstrom. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visual-
ization and Computer Graphics, 20(12):2674–2683, 2014.
[41] Yang Liu. A comparative study of butterfly-enhanced direct integral and differential equation
solvers for high-frequency electromagnetic analysis involving inhomogeneous dielectrics. In
2022 3rd URSI Atlantic and Asia Pacific Radio Science Meeting (AT-AP-RASC), pages
1–4. IEEE, 2022.
[42] Yang Liu, Pieter Ghysels, Lisa Claus, and Xiaoye Sherry Li. Sparse approximate multifrontal
factorization with butterfly compression for high-frequency wave equations. SIAM Journal
on Scientific Computing, 0(0):S367–S391, 2021.
[43] Yang Liu, Han Guo, and Eric Michielssen. An HSS matrix-inspired butterfly-based direct solver
for analyzing scattering from two-dimensional objects. IEEE Antennas Wirel. Propag.
Lett., 16:1179–1183, 2017.
[44] Yang Liu, Tianhuan Luo, Aman Rani, Hengrui Luo, and Xiaoye Sherry Li. Detecting reso-
nance of radio-frequency cavities using fast direct integral equation solvers and augmented
bayesian optimization. IEEE Journal on Multiscale and Multiphysics Computational Tech-
niques, 2023.
[45] Yang Liu, Jian Song, Robert Burridge, and Jianliang Qian. A fast butterfly-compressed
Hadamard-Babich integrator for high-frequency Helmholtz equations in inhomogeneous
media with arbitrary sources. Multiscale Modeling & Simulation, 21(1):269–308, 2023.
[46] Yang Liu, Xin Xing, Han Guo, Eric Michielssen, Pieter Ghysels, and Xiaoye Sherry Li. Butterfly
factorization via randomized matrix-vector multiplications. SIAM Journal on Scientific
Computing, 43(2):A883–A907, 2021.
[47] Yang Liu and Haizhao Yang. A hierarchical butterfly LU preconditioner for two-dimensional
electromagnetic scattering problems involving open surfaces. J. Comput. Phys.,
401:109014, 2020.
[48] W. Lu, J. Qian, and R. Burridge. Babich’s expansion and the fast Huygens sweeping method
for the Helmholtz wave equation at high frequencies. J. Comput. Phys., 313:478–510, 2016.
[49] Axel Maas. Two and three-point Green’s functions in two-dimensional Landau-gauge Yang-
Mills theory. Phys. Rev. D, 75:116004, 2007.
[50] Michael W Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-CUR decompositions for tensor-based data. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 327–336, 2006.
[51] Osman Asif Malik and Stephen Becker. Fast randomized matrix and tensor interpolative de-
composition using countsketch. Advances in Computational Mathematics, 46(6):76, 2020.
[52] Eric Michielssen and Amir Boag. Multilevel evaluation of electromagnetic fields for the rapid
solution of scattering problems. Microw Opt Technol Lett., 7(17):790–795, 1994.
[53] Eric Michielssen and Amir Boag. A multilevel matrix decomposition algorithm for analyzing
scattering from large structures. IEEE Trans. Antennas Propag., 44(8):1086–1093, 1996.
[54] Michael O’Neil, Franco Woolfe, and Vladimir Rokhlin. An algorithm for the rapid evaluation
of special function transforms. Appl. Comput. Harmon. A., 28(2):203 – 226, 2010. Special
Issue on Continuous Wavelet Transform in Memory of Jean Morlet, Part I.
[55] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing,
33(5):2295–2317, 2011.
[56] Qiyuan Pang, Kenneth L. Ho, and Haizhao Yang. Interpolative decomposition butterfly fac-
torization. SIAM J. Sci. Comput., 42(2):A1097–A1115, 2020.
[57] Michael E Peskin. An introduction to quantum field theory. CRC press, 2018.
[58] Arvind K Saibaba. HOID: higher order interpolatory decomposition for tensors based on Tucker representation. SIAM Journal on Matrix Analysis and Applications, 37(3):1223–1249, 2016.
[59] Sadeed Bin Sayed, Yang Liu, Luis J. Gomez, and Abdulkadir C. Yucel. A butterfly-accelerated
volume integral equation solver for broad permittivity and large-scale electromagnetic
analysis. IEEE Transactions on Antennas and Propagation, 70(5):3549–3559, 2022.
[60] Weitian Sheng, Abdulkadir C Yucel, Yang Liu, Han Guo, and Eric Michielssen. A domain
decomposition based surface integral equation simulator for characterizing EM wave prop-
agation in mine environments. IEEE Transactions on Antennas and Propagation, 2023.
[61] Tianyi Shi, Daniel Hayes, and Jing-Mei Qiu. Distributed memory parallel adaptive tensor-train
cross approximation. arXiv preprint arXiv:2407.11290, 2024.
[62] Edgar Solomonik, Devin Matthews, Jeff Hammond, and James Demmel. Cyclops tensor frame-
work: Reducing communication and eliminating load imbalance in massively parallel con-
tractions. In 2013 IEEE 27th International Symposium on Parallel and Distributed Pro-
cessing, pages 813–824. IEEE, 2013.
[63] Mark Tygert. Fast algorithms for spherical harmonic expansions, III. J. Comput. Phys.,
229(18):6181 – 6192, 2010.
[64] Mingyu Wang, Cheng Qian, Enrico Di Lorenzo, Luis J Gomez, Vladimir Okhmatovski, and Abdulkadir C Yucel. SuperVoxHenry: Tucker-enhanced and FFT-accelerated inductance extraction for voxelized superconducting structures. IEEE Transactions on Applied Superconductivity, 31(7):1–11, 2021.
[65] Mingyu Wang, Cheng Qian, Jacob K White, and Abdulkadir C Yucel. VoxCap: FFT-accelerated and Tucker-enhanced capacitance extraction simulator for voxelized structures. IEEE Transactions on Microwave Theory and Techniques, 68(12):5154–5168, 2020.
[66] E. Wigner. On the quantum correction for thermodynamic equilibrium. Phys. Rev., 40:749–759,
Jun 1932.
[67] Haizhao Yang. A unified framework for oscillatory integral transforms: When to use NUFFT
or butterfly factorization? J. Comput. Phys., 388:103 – 122, 2019.
[68] Lexing Ying. Sparse Fourier Transform via Butterfly Algorithm. SIAM J. Sci. Comput.,
31(3):1678–1694, 2009.