
Arithmetic Circuits, Structured Matrices and (not so) Deep Learning
ATRI RUDRA

Department of Computer Science and Engineering


University at Buffalo

[email protected]

Abstract

This survey presents a necessarily incomplete (and biased) overview of results at the intersec-
tion of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been
some research activity in replacing unstructured weight matrices in neural networks by structured
ones (with the aim of reducing the size of the corresponding deep learning models). Most of this
work has been experimental and in this survey, we formalize the research question and show how
a recent work that combines arithmetic circuit complexity, structured matrices and deep learning
essentially answers this question.
This survey is targeted at complexity theorists who might enjoy reading about how tools developed
in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of struc-
tured matrices, which in turn seem well-suited for applications in deep learning. However, we hope
that folks primarily interested in deep learning would also appreciate the connections to complexity
theory.

In Memory of Alan Selman

Alan Selman was my colleague at University at Buffalo (UB) from 2007 (when I joined UB)
until 2014 (when Alan retired from UB). I still remember being taken directly from the airport
to Alan’s favorite restaurant, Trattoria Aroma, during my interview at Buffalo. I was a bit
intimidated by Alan during the dinner but we bonded over the fact that we were both married
to epidemiologists. After I joined Buffalo, Alan’s sage advice helped me throughout my tenure
process. Alan was a giant in the department and having him in my corner did not hurt.
More germane to this survey, Alan always turned up for UB theory meetings and I greatly en-
joyed presenting stuff I was working on to Alan during some of these meetings. After Alan
retired in 2014, I started working on some problems at the intersection of arithmetic circuit
complexity, structured matrices and deep learning that I think Alan would have enjoyed hearing
about. Since Alan passed in early 2021, this survey is my way of presenting the material to Alan
in his memory. – ATRI RUDRA

1 Introduction
This survey shows how concepts in arithmetic circuit complexity and structured matrices can be used to
solve a (theoretical) problem motivated by practical applications in machine learning (especially deep
learning). Since each of the areas of arithmetic (circuit) complexity, structured matrices and deep learning has been explored in great depth, and this survey clearly cannot do justice to all the great work in each of these areas, we will spend most of the introduction clarifying what this survey is not about.

Algebraic circuit complexity or more generally algebraic complexity theory [11] studies the power of
algebraic algorithms (as opposed to the Turing machine/RAM model). The arithmetic circuit model
(or straight-line program) is one of the standard models of computation in algebraic complexity
theory [11, Chapter 4]. In this survey we will ignore pretty much everything in this literature except for
results on the arithmetic circuit complexity of the linear map, i.e., functions of the form x ↦ Wx (where
x is a vector over some field F and W is a matrix over the same field) [11, Chapter 13]. We would like to
stress that this survey will only scratch the surface of the literature on the algebraic circuit complexity of
the linear map. Just to give a sense of the breadth of this seemingly ‘specialized’ topic, we remark that the
study of matrix rigidity [22], which has seen a lot of recent research activity [1, 2, 3, 19, 10], is a part of this
topic. We note that originally, the topic of matrix rigidity was proposed by Valiant [42] as a way to prove
super-linear lower bounds, by constructing matrices that are rigid. However, our goal in this survey is
to prove upper bounds– i.e. we are interested in matrices for which the arithmetic circuit complexity
is small. We note that some of the recent work, including the work of Alman and Williams [1], is along
similar lines of showing that explicit matrices are not rigid (which at a very hand-wavy level is showing
that for certain explicit linear maps there indeed exist ‘small’ arithmetic circuits of a restricted kind to compute
the linear map1 – see Section 4.3 for more details).

Structured matrices are (families of) matrices W for which one can have a much smaller representation than the generic n × n (assuming W is square) matrix representation. Typically, these structured repre-
sentations also imply that one can compute Wx for any vector x in o(n 2 ) time (recall that matrix-vector
1 The notion of small here is to show circuits of size o(n^2), but the bounds are still Ω(n^{2−ε}) for any fixed ε > 0, while in our case we are more interested in linear maps that have near-linear sized general arithmetic circuits.

multiplication in the worst-case takes O(n 2 ) time), and in many celebrated examples (e.g. FFT for the
Discrete Fourier matrix [14]), it takes near-linear time/operations over the underlying field. At the risk of
over-simplifying things, structured matrices crop up in applications in two flavors. In the first flavor, the
application essentially determines the (family) of structured matrices. In other words, we do not have
any say in the choice of the structured matrix, and the goal is to design an efficient matrix-vector multiplication algorithm (or an algorithm for some other problem) involving the matrix. We mention two examples. The
first example is the family of orthogonal polynomial transforms (which include among others the Dis-
crete Cosine Transform) [38] that appear in many signal processing applications as well as many basic
mathematical studies including approximation theory. The second example is the family of low displace-
ment rank matrices [25], which have applications in signal processing and numerical linear algebra [26].
We will not cover this flavor of structured matrices in the survey though low displacement rank matrices
will make an appearance in Section 4.5.
The second flavor of structured matrices (which we will focus on in this survey), is where there is
some matrix M in the ‘wild’ and we want to approximate M by a more structured matrix W to e.g. save
on storing the matrix (and/or have more efficient operations on the matrix: e.g. matrix vector multipli-
cation). Perhaps the canonical example of this is the ubiquitous low rank approximation. Udell and Townsend give
a theoretical justification for why low rank approximation is so ubiquitous in machine learning appli-
cations [41]. Now we present a (very incomplete) sampler of other applications of structured matrices
in machine learning– convolutions for image, language, and speech modeling [23], and low-rank and
sparse matrices for efficient storage and inference on edge devices [43]. Forms of structure such as spar-
sity have been at the forefront of recent advances in machine learning [21], and are critical for on-device
and energy-efficient models, two application areas of tremendous recent interest [40, 36].
At a very high level, the main question we consider in this survey is whether there is a similar family of structured matrices that has all the nice properties of low rank approximation but is more expressive than low rank matrices (e.g. many of the transforms, including the Fourier transform, are full rank).

Deep learning is ubiquitous in our daily lives [28] with far reaching consequences– both good and
bad2 . For this survey, we will focus on the mathematical aspects of deep learning since a treatment of
the societal implications of deep learning is out of the scope of this survey. Even a broad theoretical study
of neural networks (which form the basis of deep learning) is beyond the scope of this survey and there
is a lot of excellent literature on this topic [4] that we will side-step.
Instead, we will focus on the issue that deep learning models are getting to be too big (which in paral-
lel raises3 its own ethical issues [7]). This for example, can be an issue when trying to store these models
(and run inference) on mobile platforms like smartphones. In addition, the state-of-the-art language
models have so many parameters that creating such models is not possible outside of large technology
companies. While there are many reasons for this, one typical reason is that these neural networks tend
to learn unstructured matrices as part of the neural network model (see Section 2.3 for why matrices
make an appearance in neural network architectures). A priori, the advantage of learning from the set of
all possible matrices is that it gives the training algorithm the ‘best’ chance to learn the most expressive
matrix. However, given that in many situations there is a budget on how many parameters we can use in
representing the matrices, the high level question we consider in this survey is:
2 This has led to deep intellectual research on societal implications of machine learning even in the theory community [5].
3 OK, we could not resist. This though is the last mention of societal issues in the survey.

Question 1.1. Given a budget on the number of parameters that one can use to represent a matrix, what is the ‘most expressive’ family of matrices?

We remark that a lot of recent innovations in deep learning have come from designing new architectures
of neural networks, which, needless to say, is out of scope for the survey (and the author!). In particular,
in this survey we will consider a toy version of a single layer neural network, which by definition is not so
deep.

Organization of the survey. We present some preliminaries and background before formalizing Ques-
tion 1.1 in Section 2. We also formalize the problem of training a neural network (the Baur-Strassen
theorem [6] plays a starring role) in Section 3. In Section 4, we analyze existing families of structured
matrices and show how they all fall short in answering Question 1.1 (or more precisely its formal ver-
sion Question 2.3). In Section 5, we survey results from Dao et al. [17] who present (to the best of our
knowledge) a new family of structured matrices that indeed answers Question 2.3 in the affirmative. We
conclude with a (biased) list of open questions in Section 6.

2 Preliminaries and Problem Definition


We begin by setting up notation in Section 2.1. We set up the necessary background in Section 2.2 (matrix
vector multiplication), Section 2.3 (neural networks), Section 2.4 (structured matrices) and Section 2.5
(arithmetic circuits). Finally, we formalize Question 1.1 in Section 2.6.

2.1 Notation
We use F to denote a field4 . The set of all length n vectors and m × n matrices over F are denoted by Fn
and Fm×n respectively.
We will denote the entry in W ∈ Fm×n corresponding to the i th row and j th column as W[i , j ]. The
i th row of W will be denoted by W[i , :]. Similarly, the i th entry in the vector x will be denoted as x[i ]. We
will follow the convention that the indices i and j start at 0. The inner product of vectors x and y will be denoted by ⟨x, y⟩. For any x ∈ F^n, we will use diag(x) to denote the diagonal matrix with x being its diagonal.
We will be using asymptotic notation and use Õ(·) to hide poly-log factors in the Big-Oh notation.

2.2 Matrix Vector Multiplication


We now define the matrix-vector multiplication problem that will be central to the survey:

• Input: An m × n matrix W ∈ Fm×n and a vector x ∈ Fn of length n

• Output: Their product, which is denoted by

y = W · x,

4 We will pretty much use F = R (real numbers) or F = C (complex numbers) in the survey. Even though most of the results in

the survey can be made to work for finite fields, we will ignore this aspect of the results.

where y ∈ F^m is a vector of length m and its i-th entry, for 0 ≤ i < m, is defined as follows:

y[i] = Σ_{j=0}^{n−1} W[i, j] · x[j].

One can easily verify that the naive algorithm that basically operationalizes the above definition takes
O(mn) operations in the worst-case. Further, if the matrix W is arbitrary, one would need Ω(mn) time
(this follows from a simple adversarial argument). Assuming that each operation over the field F can be
done in O(1) time, this implies that the worst-case complexity of matrix-vector multiplication is Θ(mn).
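To make the operation count concrete, here is a minimal Python sketch of the naive algorithm (the function name and the use of plain lists are illustrative choices, not from the survey):

```python
def naive_matvec(W, x):
    """Compute y = W x by directly operationalizing the definition above.

    W is an m x n matrix given as a list of m rows and x is a list of n field
    elements; the two nested loops perform Theta(m * n) field operations.
    """
    m = len(W)
    y = [0] * m
    for i in range(m):
        for j in range(len(x)):
            y[i] += W[i][j] * x[j]
    return y

# A 2 x 3 example: y[0] = 1*1 + 2*0 + 3*(-1) = -2, y[1] = 4*1 + 5*0 + 6*(-1) = -2.
print(naive_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, -1]))  # [-2, -2]
```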
If we just cared about worst-case complexity, we would be done. However, since there is a fair bit
of survey left after this spot, it is safe to assume that this is not all we care about. It turns out that in a
large number of practical applications, the matrix W is fixed (or more appropriately has some structure).
Thus, when designing algorithms to compute W · x (for arbitrary x), we can exploit the structure of W to
obtain a complexity that is asymptotically better than O(mn).
Next, we take a brief detour into deep learning and motivate why one would need structured matrices
in that application.

2.3 Neural Networks and (not so) deep learning

WARNING: We do not claim to have any non-trivial knowledge (deep or otherwise) of


deep learning. Thus, we will only consider a very simplified model of neural networks
and our treatment of neural networks should in no way be interpreted as being repre-
sentative of the current state of deep learning.

We consider a toy version of neural networks in use today: we will consider the so-called single layer
neural network:
Definition 2.1. We define a single layer neural network with input x ∈ Fn and output y ∈ Fm where the
output is related to input as follows:
y = g (W · x) ,
where W ∈ Fm×n and g : Fm → Fm is a non-linear function.
Some remarks are in order: (1) In practice, neural networks are defined for F = R or F = C; (2) One of
the common examples of a non-linear function g : R^m → R^m is applying the so-called ReLu function to each entry.5 (3) The entries in the matrix W are typically called the weights in the layer.
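As a concrete illustration of Definition 2.1, here is a minimal numpy sketch of the forward map y = g(W · x) with the ReLu choice of g from remark (2) (the names are illustrative and not from the survey):

```python
import numpy as np

def relu(z):
    # Entrywise ReLu: g(z)[i] = max(0, z[i]).
    return np.maximum(z, 0.0)

def single_layer(W, x):
    # y = g(W . x) as in Definition 2.1; the computational bottleneck is W @ x.
    return relu(W @ x)

W = np.array([[1.0, -2.0], [0.5, 3.0]])   # the weights (here m = n = 2)
x = np.array([1.0, 1.0])
print(single_layer(W, x))                  # [0.  3.5]
```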
Neural networks have two tasks associated with them: the first is the task of learning the network. For the network in Definition 2.1, this implies learning the matrix W given a set of training data (x_0, y_0), (x_1, y_1), · · · where y_i is supposed to be a noisy version of g(W · x_i) – we will come back to this in Section 3.1.
The second task is that once we have learned W, we use it to classify new data points x by computing
g (Wx). In practice, we would like the second step to be as efficient as possible.6 Ideally we should be
able to compute g (Wx) with O(m + n) operations. The computational bottleneck in computing g (Wx)
is computing W · x. Further, it turns out (as we will see later in Section 3) that the complexity of the
first step of learning the network is closely related to the complexity of the corresponding matrix-vector
multiplication problem.
5 More precisely, we have ReLu(x) = max(0, x) for any x ∈ R and for any z ∈ R^m, g(z) = (ReLu(z[0]), · · · , ReLu(z[m − 1])).
6 Ideally, we would also like the first step to be efficient but typically the learning of the network can be done in an offline step

so it can be (relatively) more inefficient.

2.4 Structured Matrices
As mentioned above, in the deep learning setup, we would like to have weight matrices W such that the
matrix-vector multiplication Wx for an arbitrary x ∈ Fn can be done in near-linear time. However, if the
matrix W is represented in the usual m × n matrix format, then we end up spending Ω(mn) time just to read
the entries of W. Thus, to have any hope of near-linear matrix vector multiplication, we need to have a
smarter representation of the structured matrices. We first recall two examples of structured matrices
that have wide applicability in numerical linear algebra and machine learning.
We begin with the notion of low-rank matrices (which are ubiquitous in machine learning [41]):

Definition 2.2 (Low rank matrices). A matrix W ∈ F^{m×n} has rank r (for 0 ≤ r ≤ min(m, n)) if and only if there exist matrices L ∈ F^{m×r} and R ∈ F^{r×n} such that

W = L · R.

It is easy to see that rank r matrices can be represented with r(n + m) elements (by storing L and R) and also admit an O(r(n + m))-operation matrix-vector multiplication by computing Wx as (L · (R · x)). Thus,
constant rank matrices indeed satisfy the linear-time matrix-vector multiplication desiderata.
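As a quick sanity check of the operation count (a numpy sketch with illustrative dimensions), note that the parenthesization L · (R · x) never forms the n × n matrix W:

```python
import numpy as np

n, r = 1000, 5
rng = np.random.default_rng(0)
L = rng.standard_normal((n, r))
R = rng.standard_normal((r, n))
x = rng.standard_normal(n)

W = L @ R                             # the rank-r matrix, formed only for comparison
y_slow = W @ x                        # Theta(n^2) operations
y_fast = L @ (R @ x)                  # O(r * n) operations
print(np.allclose(y_slow, y_fast))    # True
```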
Next, we consider sparse matrices:

Definition 2.3 (Sparse matrices). A matrix W ∈ F^{m×n} is s-sparse (for 0 ≤ s ≤ mn) if at most s entries in W
are non-zero.

The de facto representation of sparse matrices is the listing representation, where one keeps a list of the locations of the s non-zero values along with the actual non-zero values (every entry not in this list has a value of 0). Assuming the listing representation, the obvious modification to the naive matrix-vector multiplication (where we automatically ‘skip’ over entries (i, j) such that W[i, j] = 0) results in an O(s)-operation algorithm. Thus, Õ(n)-sparse matrices indeed satisfy the linear-time matrix-vector
multiplication desiderata.
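A minimal Python sketch of the listing representation and the O(s)-operation multiplication (illustrative code, not from the survey):

```python
def sparse_matvec(triples, x, m):
    """Multiply an s-sparse m x n matrix, given as a list of s triples
    (i, j, value), with the vector x using O(s) field operations."""
    y = [0] * m
    for (i, j, value) in triples:
        y[i] += value * x[j]
    return y

# A 3 x 3 matrix whose only non-zero entries are W[0, 2] = 5 and W[2, 0] = -1.
triples = [(0, 2, 5), (2, 0, -1)]
print(sparse_matvec(triples, [1, 2, 3], m=3))  # [15, 0, -1]
```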
Next, we consider more algebraic families of matrices. Consider the discrete Fourier matrix:

Definition 2.4. The n × n discrete Fourier matrix F_n is defined as follows (for 0 ≤ i, j < n):

F_n[i, j] = ω_n^{ij},

where ω_n = e^{−2πι/n} is the n-th root of unity and ι = √(−1).

We note that even though the discrete Fourier matrix has rank n and sparsity n 2 , it has a very simple
representation: just the number n.
Let us unroll the following matrix-vector multiplication: x̂ = F_n x. In particular, for any 0 ≤ i < n:

x̂[i] = Σ_{j=0}^{n−1} x[j] · e^{−2πι·ij/n}.

In other words, x̂ is the discrete Fourier transform of x. It turns out that the discrete Fourier transform is
incredibly useful in practice (and is used in applications such as image compression). One of the most
celebrated algorithmic results is that the Fourier transform can be computed with O(n log n) operations:

Theorem 2.5 (Fast Fourier Transform (FFT) [14]). For any x ∈ Cn , one can compute Fn · x in O(n log n)
operations.

Thus, the discrete Fourier transform satisfies the near linear-time matrix-vector multiplication desider-
ata.
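Theorem 2.5 is easy to sanity check numerically (a small numpy sketch; numpy.fft.fft happens to use the same sign convention e^{−2πι·ij/n} as Definition 2.4):

```python
import numpy as np

n = 8
omega = np.exp(-2j * np.pi / n)                        # omega_n from Definition 2.4
F = omega ** np.outer(np.arange(n), np.arange(n))      # F_n[i, j] = omega_n^(i*j)

x = np.random.default_rng(0).standard_normal(n)
x_hat_dense = F @ x          # Theta(n^2) operations via the dense matrix
x_hat_fft = np.fft.fft(x)    # O(n log n) operations (Theorem 2.5)
print(np.allclose(x_hat_dense, x_hat_fft))             # True
```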
Consider the following matrix (called a Vandermonde matrix):

Definition 2.6 (Vandermonde Matrix). For any n ≥ 1, any field F of size at least m, and m distinct elements a_0, . . . , a_{m−1} ∈ F, consider the matrix (where 0 ≤ i < m and 0 ≤ j < n)

V_n^{(a)}[i, j] = a_i^j,

where a = (a_0, . . . , a_{m−1}).

We now state some interesting facts about these matrices (which also show that Vandermonde ma-
trices satisfy the near linear-time matrix-vector multiplication desiderata):

1. One can represent a Vandermonde matrix by noting a_0, . . . , a_{m−1} (along with n of course).

2. The discrete Fourier matrix is a special case of a Vandermonde matrix.

3. The Vandermonde matrix has full rank and has sparsity n^2.

4. It turns out that V_n^{(a)} · x for any x ∈ F^n can be computed with O(n log^2 n) operations [11].
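For concreteness, the Vandermonde matrix of Definition 2.6 can be built in one line with numpy (an illustrative sketch; the naive product V · x below costs O(mn) operations, and the O(n log^2 n) algorithm of [11] is beyond this sketch):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0])            # the distinct a_i's, i.e. the parameters
n = 4
V = np.vander(a, N=n, increasing=True)   # V[i, j] = a[i] ** j, an m x n Vandermonde matrix
# Rows: [1, 2, 4, 8], [1, 3, 9, 27], [1, 5, 25, 125].

x = np.ones(n)
print(V @ x)                             # naive O(mn) matrix-vector multiplication
```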

In all four examples of structured matrices that we have seen in this section, their representation
pretty much follows from their definitions. However, in general, whenever we have a family of structured
matrices, we would like a generic way of referring to the representation. To abstract this we will assume
that

Assumption 2.1. We are given a vector θ ∈ F^s for some s = s(m, n) such that the vector θ completely specifies a matrix in our chosen family. We will use W_θ to denote the matrix in the family parameterized by θ.

For example, if s = mn, then we get the set of all matrices in Fm×n . On the other hand, for say the Van-
dermonde matrix (recall Definition 2.6), we have s(m, n) = m and θ = (a 0 , . . . , a m−1 ) for distinct a i ’s.

2.5 Arithmetic Circuits


So far we have tip-toed around how to determine the ‘optimal’ matrix vector multiplication time for a
given W. Now, we pay closer attention to this problem:

Question 2.1. Given an m × n matrix W, what is the optimal complexity of com-


puting W · x (for arbitrary x)?

Note that to even begin to answer the question above, we need to fix our ‘machine model.’ One
natural model is the RAM model on which we analyze most of our beloved algorithms. However, we
do not understand the power of the RAM model (in the sense that we do not have a good handle on what
problems can be solved by say linear-time or quadratic-time algorithms7 ) and answering Question 2.1
in the RAM model seems hopeless.
7 The reader might have noticed that we are ignoring the P vs. NP elephant in the room.

So we need to consider a more restrictive model of computation. Instead of going through a list of
possible models, we will just state the model of computation we will use: arithmetic circuit (also known
as the straight-line program). In the context of an arithmetic circuit that computes y = Wx, there are n
input gates (corresponding to x[0], . . . , x[n − 1]) and m output gates (corresponding to y[0], . . . , y[m − 1]).
All the internal gates correspond to the addition, multiplication, subtraction and division operators over
the underlying field F. The circuit is also allowed to use constants from F for ‘free.’ The complexity of the
circuit will be its size: i.e. the number of addition, multiplication, subtraction and division gates in the
circuit. We will also care about the depth of the circuit, which is the depth of the DAG representing the
circuit. Let us record this choice:
Definition 2.7. For any function f : Fn → Fm , its arithmetic circuit complexity is the minimum number
of addition, multiplication, subtraction and division operations over F needed to compute f (x) for any
x ∈ Fn .
Given the above, we have the following more specific version of Question 2.1:

Question 2.2. Given a matrix W ∈ Fm×n , what is the arithmetic circuit complexity
of computing W · x (for arbitrary x ∈ Fn )?

One drawback of arithmetic circuits (especially for infinite fields e.g. F = R, which is our preferred
choice for deep learning applications) is that they assume operations over F can be performed exactly.
In particular, it ignores precision issues involved with real arithmetic. Nonetheless, this model turns out
to be a very useful model in reasoning about the complexity of doing matrix-vector multiplication for
any family of matrices.
Perhaps the strongest argument in support of arithmetic circuits is that a large (if not an overwhelm-
ing) majority of matrix-vector multiplication algorithm in the RAM model also imply an arithmetic cir-
cuit of size comparable to the runtime of the algorithm (and the depth of the circuit roughly corresponds
to the time taken to compute it by a parallel algorithm). For example, consider the obvious algorithm to compute Wx (i.e. for each i ∈ [m], compute y[i] as the sum Σ_{j=0}^{n−1} W[i, j] x[j]). It is easy to see that this algorithm implies an arithmetic circuit of size O(nm) and depth O(log n).
One reason for the vast majority of existing efficient matrix-vector multiplication algorithms leading to arithmetic circuits is that they generally are divide and conquer algorithms that use polynomial operations such as polynomial multiplication or evaluation (both of which themselves are divide and conquer algorithms that use FFT (Theorem 2.5) as a blackbox) or polynomial addition. Each of these pieces is well known
to have small (depth and size) arithmetic circuits (since FFT has these properties). Finally, the divide and
conquer structure of the algorithms leads to the circuit being of low depth. See the book of Pan [33] for a
more elaborate description of this connection.

2.5.1 Linear circuit complexity

Next, instead of considering the general arithmetic circuit complexity of Wx, let us consider the linear
arithmetic circuit complexity. A linear arithmetic circuit only uses linear operations:
Definition 2.8. A linear arithmetic circuit (over F) only allows operations of the form αX +βY , where α, β ∈
F are constants while X and Y are the inputs to the operation. The linear arithmetic circuit complexity
of Wx is the size of the smallest linear arithmetic circuit that computes Wx (where x are the inputs and
the circuit depends on W). Sometimes we will overload terminology and call the (linear) arithmetic circuit
complexity of computing Wx as the (linear) arithmetic circuit complexity of (just) W.

We first remark that the linear arithmetic circuit complexity seems to be a very natural model to
consider the complexity of computing Wx (recall this defines a linear function over x). In fact one could
plausibly conjecture that going from general arithmetic circuit complexity to linear arithmetic circuit
complexity of computing Wx should be without loss of generality (the intuition being: "What else can
you do?").
It turns out that for infinite fields, the above intuition is correct:

Theorem 2.9 ([11]). Let F be an infinite field and W ∈ Fm×n . Let C (W) and C L (W) be the arithmetic circuit
complexity and linear arithmetic circuit complexity of computing Wx (for arbitrary x). Then C L (W) =
Θ (C (W)).

We make two observations. First, it turns out that Theorem 2.9 can be proved for finite fields
that are exponentially large. Second, it is a natural question to try and prove a version of Theorem 2.9 for
small finite fields (say over F2 ). This question is very much open.

2.6 Problem Definition


Finally, we have all the pieces in place so that we formally define the problem we are interested in.
Mainly for notational simplicity, we make the following assumption for the rest of the survey:

Assumption 2.2. Unless stated otherwise, we will consider square matrices, i.e. m =
n.

As mentioned in Section 2.3, we would like to use a weight matrix W such that computing Wx is effi-
cient. In particular, using our choice of measuring algorithmic efficiency by the arithmetic complexity of
computing Wx, the design problem becomes the following– can we design neural networks with weight
matrices W that are guaranteed to have an arithmetic circuit of size s (for some s that is at most o(n 2 ))?
In the rest of the section, we will successively formalize (and specialize) the above intuitive problem
statement.
Recall from Section 2.3 that for neural networks, the main bottleneck is to be able to ‘learn’ these
weight matrices W from the training data (we will formally state the learning problem in Definition 3.2
but for now we’ll keep the definition of training a bit vague). But even before we talk about the efficiency8
of learning the matrix W, we note that it is important to be more precise about the representation that the
learning algorithm outputs. In particular, even if W has an arithmetic circuit of size s = o(n 2 ), if the
learning algorithm outputs the matrix W in the usual n × n matrix format, then we are still stuck with an
Ω(n 2 ) arithmetic circuit complexity for the learned matrix W.
Thus, we want the learning process to not only learn a matrix W with arithmetic circuit complex-
ity s but also to learn a representation from which one can easily create a matrix-vector multiplication
algorithm with complexity (roughly) s. This implies that we first need to identify a class of structured
matrices that can capture matrices with arithmetic circuit complexity of s. Allowing for the possibility
that we might need more than s parameters to index the class of matrices we are after, here is a more
formal version of the problem we had stated earlier:
• A parameter size s′ ≥ s and a function f : F^{s′} → F^{n×n} such that
8 Recall that the training problem happens ‘offline’ so we do not need the learning to be, say, O(n) time, but we would like the learning algorithm to be at worst polynomial time.

1. For every matrix W with arithmetic circuit complexity at most s, there exists a θ ∈ F^{s′} such that f(θ) = W.

2. Given θ one can efficiently compute f(θ) · x (here by efficiently we mean with roughly Õ(s′) arithmetic operations).
3. We can efficiently learn the parameter θ that defines W.

• The overall goal would be to make s′ as close to s as possible – ideally we want s′ = Õ(s).

There is an ‘obvious’ family that almost gets us what we want– just define the parameter θ to encode
the circuit computing Wx. The problem with this formulation (other than not being an ‘interesting’
definition) is that there is no known efficient way to learn the optimal arithmetic circuit for W (even if we
were given access to the n × n representation of W).
Another candidate for the class of matrices we are looking for is the family of low rank matrices. In particular, given the target s, we would like to figure out the value of the rank r so that we can pick s′ = rn and we use the standard representation of rank r matrices. In this case, it is easy to verify that all three properties above are satisfied. The rank r decomposition of a given matrix W can, e.g., be computed by the Singular Value Decomposition (or SVD).9 Unfortunately, in general s′ can be much larger than s – consider e.g. the DFT (Definition 2.4), which has s = O(n log n), but since the matrix is full rank, we need r = n and hence s′ = n^2, which is not that useful.
We will consider some other choices for families of structured matrices in Section 4 but before we
finalize the problem statement, we use the following observation from practice to make the problem a
bit more tractable– it turns out in practice that the weight matrix W (or its representation θ) is learned
via gradient descent (see Algorithm 1). So we make the following assumption:

Assumption 2.3. We will assume that we can only use gradient descent to learn the
representation θ for our target matrix W.

What the above means is that it is sufficient to be able to compute the gradient of f at any point in F^{s′}
(see Section 3 for details on why this is the case). Under this assumption, we can modify our earlier goal
into our final problem statement:

Question 2.3. Does there exist a family of n × n matrices such that for every parameter n ≤ s ≤ n^2, there exists a parameter s′ and a map f : F^{s′} → F^{n×n} such that for every matrix W with arithmetic circuit complexity of at most s, there exists θ ∈ F^{s′} such that f(θ) = W. Furthermore, we want

• (EXPRESSIVITY PROPERTY) s′ is as close to s as possible (ideally s′ = Õ(s))

• (EFFICIENT MVM PROPERTY) Given θ, we can compute f(θ) · x for any x ∈ F^n in close to s′ arithmetic operations.

• (EFFICIENT GRADIENT PROPERTY) For any a ∈ F^{s′}, one can evaluate the gradient of f at a efficiently (ideally as close to s′ arithmetic operations as possible).

9 In fact the SVD will give the best rank r approximation even if W is not rank r – for now let’s just consider the problem setting

where we are looking for an exact representation.

Before we attack Question 2.3, we will take a bit of a detour to consider the problem of learning W
from training data in more detail.

3 Computing gradients
We will formalize the problem of learning from training data in Section 3.1. Then in Section 3.2, we
identify a specific gradient function that is sufficient to run gradient descent for our purposes. We recall
the Baur-Strassen theorem in Section 3.3, which will show that for our gradient problem, it is enough to
ensure that W has small arithmetic circuit complexity. Finally, in Section 3.4, we take a detour to highlight
a really cool result, which unfortunately does not seem to be as well-known as it should be.
We will not be assuming Assumption 2.2 in this section, i.e. in this section we will consider a general
rectangular matrix W (and we will revert to Assumption 2.2 from the next section onwards).

3.1 Back to (not so) deep learning


We go back to the single layer neural network that we studied earlier in Section 2.3. In particular, recall
we consider a single layer neural network that is defined by

y = g (W · x) , (1)

where W ∈ Fm×n and g : Fm → Fm is a non-linear function. Further,

Assumption 3.1. We will assume that the non-linear function g : F^m → F^m is obtained


by applying the same function g : F → F to each of the m elements.

In other words, equation (1) is equivalently stated as follows: for every 0 ≤ i < m,

y[i ] = g (〈W[i , :], x〉) .

Recall that in Section 2.3, we had claimed (without any argument) that the complexity of learning
the weight matrix W given a few samples is governed by the complexity of matrix-vector multiplication for
W. In this section, we will rigorously argue this claim. To do this, we define the learning problem more
formally:
Definition 3.1. Given L training data (x^{(ℓ)}, y^{(ℓ)}) for ℓ ∈ [L], we want to compute a matrix W ∈ F^{m×n} that minimizes the error

E(W) = Σ_{ℓ=1}^{L} ‖ y^{(ℓ)} − g(W · x^{(ℓ)}) ‖_2^2.

We note that the above is not the only error function that is used in training neural networks but
the above is a common choice and hence, we stick with it. Further, note that in the above the training
searches for the ’best’ weight matrix from the set of all matrices in Fm×n . However, since we are interested
in searching for the best weight matrix within a certain class as in Question 2.3, we generalize Definition 3.1
as follows:
Definition 3.2. Given L training data (x^{(ℓ)}, y^{(ℓ)}) for ℓ ∈ [L], we want to compute the parameters θ ∈ F^{s(m,n)} of an m × n matrix that minimize the error (where we use W_θ = f(θ)):

E(θ) = Σ_{ℓ=1}^{L} ‖ y^{(ℓ)} − g(W_θ · x^{(ℓ)}) ‖_2^2.

3.1.1 Gradients and Gradient Descent

(Partial) Derivatives. It turns out that we will only be concerned with studying derivatives of polyno-
mials. For this, we can define the notion of a formal derivative (over univariate polynomials):

Definition 3.3. The formal derivative ∇_X(·) : F[X] → F[X] is defined as follows. For every integer i,

∇_X(X^i) = i · X^{i−1}.

The above definition can be extended to all polynomials in F[X] by insisting that ∇_X(·) be a linear map. That is, for every α, β ∈ F and f(X), g(X) ∈ F[X] we have

∇_X(α f(X) + β g(X)) = α ∇_X(f(X)) + β ∇_X(g(X)).

We note that over R, the above definition when applied to polynomials over R[X ] gives the same
result as the usual notion of derivatives.
We will actually need to work with derivatives of multi-variate polynomials. We will use F[X 1 , . . . , X m ]
to denote the set of multivariate polynomials with variables X_1, . . . , X_m. For example, 3XY + Y^2 + 1.5X^3Y^4 is in R[X, Y]. We extend the definition of derivatives from Definition 3.3 to the following (which is also called a gradient):

Definition 3.4. Let f(X_1, . . . , X_n) be a polynomial in F[X_1, . . . , X_n]. Then define its gradient as (where we use X = (X_1, . . . , X_n) to denote the vector of variables):

∇_X(f(X)) = (∇_{X_1}(f(X)), . . . , ∇_{X_n}(f(X))),

where in ∇_{X_i}(f(X)), we think of f(X) as being a polynomial in X_i with coefficients in F[X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n].
Finally note that ∇_{X_i}(f(X)) is again a polynomial and we will denote its evaluation at a ∈ F^n as ∇_{X_i}(f(X))|_a. We extend this notation to the gradient by

∇_X(f(X))|_a = (∇_{X_1}(f(X))|_a, . . . , ∇_{X_n}(f(X))|_a).

For example,

∇_{X,Y}(3XY + Y^2 + 1.5X^3Y^4) = (3Y + 4.5X^2Y^4, 3X + 2Y + 6X^3Y^3).
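This kind of formal gradient can be checked mechanically, e.g. with sympy (a small sketch purely to verify the example above):

```python
import sympy as sp

X, Y = sp.symbols('X Y')
f = 3*X*Y + Y**2 + sp.Rational(3, 2)*X**3*Y**4

grad = (sp.diff(f, X), sp.diff(f, Y))
# Agrees with the gradient computed by hand above.
assert sp.simplify(grad[0] - (3*Y + sp.Rational(9, 2)*X**2*Y**4)) == 0
assert sp.simplify(grad[1] - (3*X + 2*Y + 6*X**3*Y**3)) == 0
```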

Gradient Descent. While there exist techniques to solve the above problem theoretically, in practice Gradient Descent is commonly used. In particular, one starts off with an initial state θ = θ_0 ∈ F^s and one keeps changing θ in the direction opposite to ∇_θ(E(θ)) till the error is below a pre-specified threshold (or one goes beyond a pre-specified number of iterations). Algorithm 1 has the details.
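As an illustration, here is a minimal numpy instantiation of Algorithm 1 for the error E(θ) of Definition 3.2 in the simplest setting where θ is the full m × n weight matrix and g is the entrywise ReLu of Section 2.3; this is a sketch under those assumptions (the per-example gradient is the chain-rule expression derived in Section 3.2), not code from the survey:

```python
import numpy as np

def relu(z):       return np.maximum(z, 0.0)
def relu_grad(z):  return (z > 0).astype(float)        # (sub)gradient of ReLu, entrywise

def train(xs, ys, m, n, eta=1e-3, eps=1e-8, max_iters=10_000):
    W = np.zeros((m, n))                               # theta_0
    for _ in range(max_iters):
        grad, err = np.zeros_like(W), 0.0
        for x, y in zip(xs, ys):
            z = W @ x
            r = y - relu(z)
            err += r @ r                               # contribution to E(theta)
            grad += -2.0 * np.outer(r * relu_grad(z), x)   # chain rule (cf. equation (2))
        if err < eps:                                  # |E(theta_i)| < epsilon
            break
        W -= eta * grad                                # theta_{i+1} = theta_i - eta * gradient
    return W
```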

3.2 Computing the gradient


It is clear from Algorithm 1 that the most computationally intensive part is computing the gradient.
We first show that if one can compute a related gradient, then we could implement Algorithm 1. In
Section 3.3 we will show that this latter gradient computation is closely tied to computing Wx. We first
argue:

Algorithm 1 Gradient Descent
I NPUT: η > 0 and ε > 0
O UTPUT: θ

1: i ←0
2: Pick θ 0 . This could be arbitrary or initialized to something more specific
3: WHILE |E (θ i )| ≥ ε DO . One could also terminate based on number of iterations
4: θ i +1 ← θ i − η · (∇θ (E (θ)))|θi . η is the ’learning rate’
5: i ← i +1
6: RETURN θ i

Lemma 3.5. If for every z ∈ F^m and u ∈ F^n, one can compute ∇_θ(z^T W_θ u)|_a for any a ∈ F^s in T_1(m, n) operations and Wu in T_2(m, n) operations, then one can compute (∇_θ(E(θ)))|_{θ_0} for a fixed θ_0 ∈ F^s in O(L(T_1(m, n) + T_2(m, n))) operations.

Proof. For notational simplicity define

W = W_{θ_0}

and

E_ℓ(θ) = ‖ y^{(ℓ)} − g(W_θ · x^{(ℓ)}) ‖_2^2.

Fix ℓ ∈ [L]. We will show that we can compute ∇_θ(E_ℓ(θ))|_{θ_0} with O(T_1(m, n) + T_2(m, n)) operations, which would be enough since ∇_θ(E(θ)) = Σ_{ℓ=1}^{L} ∇_θ(E_ℓ(θ)).

For notational simplicity, we will use y, x and E(θ) to denote y^{(ℓ)}, x^{(ℓ)} and E_ℓ(θ) respectively. Note that

E(θ) = ‖ y − g(W_θ · x) ‖_2^2 = Σ_{i=0}^{m−1} ( y[i] − g( Σ_{j=0}^{n−1} W_θ[i, j] x[j] ) )^2.

Applying the chain rule of the gradient on the above, we get (where g′(x) is the derivative of g(x)):

∇_θ(E(θ)) = −2 Σ_{i=0}^{m−1} ( y[i] − g( Σ_{j=0}^{n−1} W_θ[i, j] x[j] ) ) · g′( Σ_{j=0}^{n−1} W_θ[i, j] x[j] ) · ( Σ_{j=0}^{n−1} ∇_θ(W_θ[i, j]) · x[j] ).   (2)

Define a vector z ∈ F^m such that for any 0 ≤ i < m,

z[i] = −2 ( y[i] − g(⟨W[i, :], x⟩) ) · g′(⟨W[i, :], x⟩).

Note that once we compute Wx (which by assumption we can do in T_2(m, n) operations), we can compute
z with O(T2 (m, n)) operations.10 Further, note that z is independent of θ (recall W = Wθ0 ).
10 Here we have assumed that one can compute g(x) and g′(x) with O(1) operations and assumed that T_2(m, n) ≥ m.

From (2), we get that

∇_θ(E(θ))|_{θ_0} = −2 Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} ( y[i] − g(⟨W[i, :], x⟩) ) · g′(⟨W[i, :], x⟩) · ∇_θ(W_θ[i, j])|_{θ_0} · x[j]
                 = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} z[i] · ∇_θ(W_θ[i, j])|_{θ_0} · x[j]
                 = ∇_θ( Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} z[i] · W_θ[i, j] · x[j] )|_{θ_0}
                 = ∇_θ(z^T W_θ x)|_{θ_0}.

In the above, the first equality follows from our notation that W = W_{θ_0}, the second equality follows from the definition of z and the third equality follows from the fact that z is independent of θ. The proof is complete by noting that we can compute ∇_θ(z^T W_θ x)|_{θ_0} in T_1(m, n) operations.

Thus, to efficiently implement gradient descent, we have to efficiently compute ∇_θ(z^T W_θ x)|_{θ_0} for any fixed z ∈ F^m and x ∈ F^n. Next, we will show that the arithmetic complexity of this operation is the
same (up to constant factors) as the arithmetic complexity of computing z T Wx (which in turn has com-
plexity no worse than that of computing our old friend Wx). In the next section, not only will we show
that this result is true but it is true for any function f : Fs → F. As a bonus, we will present a simple (but
somewhat non-obvious) algorithmic proof.

3.3 Computing gradients very fast


In this section we consider the following general problem:

• Input: An arithmetic circuit C that computes a function f : F^s → F and an evaluation point a ∈ F^s.

• Output: ∇_θ(f(θ))|_a.

Recall that in the previous section, we were interested in solving the above problem for the function f_{z,x}(θ) = z^T W_θ x where W_θ ∈ F^{m×n}, z ∈ F^m and x ∈ F^n.
The way we will tackle the above problem is the following: given the arithmetic circuit C for f(θ), we will try to come up with an arithmetic circuit C′ to compute ∇_θ(f(θ)). We first note that given a fixed 0 ≤ ℓ < s, it is fairly easy to compute a circuit C′_ℓ that on input a ∈ F^s computes ∇_{θ[ℓ]}(f(θ))|_a with essentially the same size. This implies that one can compute ∇_θ(f(θ)) with arithmetic circuit complexity O(s · |C|) (where |C| denotes the size of C).
We will now recall the Baur-Strassen theorem, which states that the gradient can be computed in the
same (up to constant factors) arithmetic circuit complexity as evaluating f .

Theorem 3.6 (Baur-Strassen Theorem [6]). Let f : F^s → F be a function that has an arithmetic circuit C such that given θ ∈ F^s, it computes f(θ). Then there exists another arithmetic circuit C′ that computes, for any given a ∈ F^s, the gradient ∇_θ(f(θ))|_a. Further,

|C′| ≤ O(|C|).

The proof of the Baur-Strassen theorem is actually algorithmic – Algorithm 2 shows how to compute the gradient given the arithmetic circuit for f (it is not too hard to see that the algorithm implicitly defines the claimed arithmetic circuit C′). The proof of correctness of the algorithm follows from the following version of the chain rule for multi-variable functions.

Lemma 3.7. Let f : F^s → F be the composition of a polynomial g ∈ F[H_1, . . . , H_k] and polynomials h_i ∈ F[X_1, . . . , X_s] for every i ∈ [k], i.e.

f(X) = g(h_1(X), . . . , h_k(X)).

Then for every 0 ≤ ℓ < s, we have

∇_{X_ℓ}(f(X)) = Σ_{j=1}^{k} ∇_{H_j}(g(H_1, . . . , H_k)) · ∇_{X_ℓ}(h_j(X)),

where after differentiating g one substitutes H_i = h_i(X) for every i ∈ [k].

We note that over R the above is known as the high-dimensional chain rule (and it holds for more
general classes of functions). It turns out that if g and h i are polynomials, then the high-dimensional
chain rule pretty much follows from Definition 3.3.

Algorithm 2 Back-propagation Algorithm


INPUT: C that computes a function f : F^s → F and an evaluation point a ∈ F^s
OUTPUT: ∇_θ(f(θ))|_a

1: Let σ be an ordering of gates of C in reverse topological sort with output gate first . This is possible
since the graph of C is a DAG
2: WHILE Next gate g in σ has not been considered DO
3: Let the parent gates of g be h 1 , . . . , h k . k = 0 is allowed and implies no parents
4: IF k = 0 THEN
5: d [g ] ← 1
6: ELSE
7: d [g ] ← 0
8: FOR i ∈ [k] DO
9: d [g ] ← d [g ] + ∇g (h i )|a · d [h i ]
10: RETURN (d[θ_i])_{0≤i<s} . θ_0, . . . , θ_{s−1} are input gates

Theorem 3.6 and Lemma 3.5 imply the following connection between the gradient we want to compute and the arithmetic circuit complexity of the corresponding matrix-vector multiplication problem:

Corollary 3.8. If for every θ ∈ Fs , Wθ has arithmetic circuit complexity of m, then we can compute (∇θ (E (θ)))|θ0
for every θ 0 ∈ Fs in O(L(m + n)) operations.

3.3.1 Automatic Differentiation

It turns out that Algorithm 2 can be extended to work beyond arithmetic circuits (at least over R). This
uses the fact that the high-dimensional chain rule (Lemma 3.7) holds for any differentiable functions g, h_1, . . . , h_k. In other words, we can consider circuits that compute f where each gate computes a differentiable function of its input. Thus, given a circuit for f with ‘reasonable’ gates, one can

automatically compile another circuit for its gradient. This idea has led to the creation of the field of automatic differentiation (or auto diff) and is at the heart of much recent progress in machine learning. In
particular, those familiar with neural networks would notice that Algorithm 2 is the well-known back-
propagation algorithm (and hence the title of Algorithm 2). However, for this survey, we will not need the
full power of auto diff (Corollary 3.8 is all we need).
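To make Algorithm 2 concrete, here is a self-contained toy reverse-mode differentiator over + and × gates (an illustrative sketch, not the survey's code; it pushes each value d[·] from a gate to the gate's arguments, which accumulates at every gate exactly the sum over its parent gates from line 9 of Algorithm 2):

```python
class Node:
    """A gate of a straight-line program over the reals: an input/constant leaf,
    or an addition/multiplication gate whose arguments are earlier nodes."""
    def __init__(self, value, args=()):
        self.value = value    # value of this gate at the evaluation point a
        self.args = args      # list of (argument_node, local_derivative) pairs
        self.grad = 0.0       # will hold d[gate] from Algorithm 2

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Node(self.value * other.value, [(self, other.value), (other, self.value)])

def backprop(output, inputs):
    """Process the gates in reverse topological order, starting from d[output] = 1,
    and apply the chain rule (Lemma 3.7) at every gate."""
    order, seen = [], set()
    def topo(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for arg, _ in node.args:
            topo(arg)
        order.append(node)
    topo(output)
    output.grad = 1.0
    for node in reversed(order):          # reverse topological order, output first
        for arg, local in node.args:
            arg.grad += local * node.grad
    return [x.grad for x in inputs]

# f(t0, t1) = t0 * t1 + t0 evaluated at a = (3, 4); its gradient is (t1 + 1, t0) = (5, 3).
t0, t1 = Node(3.0), Node(4.0)
print(backprop(t0 * t1 + t0, [t0, t1]))   # [5.0, 3.0]
```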
Next, we take a (wide) detour and state a result that is not as well-known as it should be.

3.4 Multiplying by the transpose


We first recall the definition of the transpose of a matrix:

Definition 3.9. The transpose of a matrix A ∈ F^{m×n}, denoted by A^T ∈ F^{n×m}, is defined as follows (for any 0 ≤ i < n, 0 ≤ j < m):

A^T[i, j] = A[j, i].

It is natural to ask (since the transpose is so closely related to the original matrix):

Question 3.1. Is the (arithmetic circuit) complexity of computing AT x related to the


(arithmetic circuit) complexity of computing Ax for every matrix A ∈ Fm×n ? E.g. are
they within Õ(1) of each other?

We will address the above question in the rest of this section.

3.4.1 Transposition principle

It turns out that the answer to Question 3.1 is an emphatic yes:

Theorem 3.10 (Transposition Principle [20]). Fix a matrix A ∈ Fn×n such that there exists an arithmetic
circuit of size s that computes Ax for arbitrary x ∈ Fn . Then there exists an arithmetic circuit of size O(s +n)
that computes AT y for arbitrary y ∈ Fn .

The above result was surprising to the author when he first came to know about it. In-
deed, the knowledge of this result would have saved the author more than a year’s worth
of plodding while working on the paper [18]. For whatever reason, this result is not as well-known as it should be.

It is not too hard to show that the additive n term in the bound in the transposition principle is
necessary.
There exist proofs of the transposition principle that are very structural in the sense that they con-
sider the circuit for computing Ax and then directly change it to compute a circuit for AT y.11 For this
survey we will present a much slicker proof that directly uses the Baur-Strassen theorem (to the best of
our knowledge this proof was first explicitly stated in [27]). For this the following alternate view of AT y
will be very useful:
y^T A = (A^T y)^T.   (3)
11 At a very high level this involves ‘reversing’ the direction of the edges in the DAG corresponding to the circuit.

Proof of Theorem 3.10. Thanks to (3), we will consider the computation of y^T A for any y ∈ F^n. We first claim that:

y^T A = ∇_x(y^T Ax).   (4)

Note that the function y^T Ax is exactly the same product we have encountered before in Lemma 3.5.12 Then note that given an arithmetic circuit of size s to compute Ax, one can design an arithmetic circuit that computes y^T Ax of size s + O(n) (by simply additionally computing ⟨y, Ax⟩, which takes O(n) operations).
Now, by Theorem 3.6, there is a circuit that computes ∇_x(y^T Ax) with arithmetic circuit size O(s + n).13 Equation (4) completes the proof.
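A toy numerical illustration of the statement (a numpy sketch with an illustrative choice of A): for the lower-triangular all-ones matrix, Ax is a prefix sum, which has an O(n)-gate linear circuit, and the circuit promised by Theorem 3.10 for A^T y is simply a suffix sum:

```python
import numpy as np

n = 6
A = np.tril(np.ones((n, n)))     # lower-triangular all-ones matrix
x = np.arange(1.0, n + 1)
y = np.random.default_rng(0).standard_normal(n)

# A x is the prefix sum of x: an O(n)-size linear circuit.
print(np.allclose(A @ x, np.cumsum(x)))                   # True
# A^T y is the suffix sum of y: again only O(n) gates, as Theorem 3.10 promises.
print(np.allclose(A.T @ y, np.cumsum(y[::-1])[::-1]))     # True
```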

4 Towards answering Question 2.3


In this section, we walk through some well-studied classes of structured matrices and see how they all
fall short of answering Question 2.3 fully.

4.1 Low rank matrices


We start with low rank matrices: we already addressed why low rank matrices cannot be the answer
for Question 2.3 in Section 2.6 but we’ll walk through the three requirements again. We consider the
standard representation of a rank r matrix W as W = L · R for L ∈ F^{n×r} and R ∈ F^{r×n}. In this case s′ = 2rn and θ is just the listing of all the entries in L and R, and f is defined in the obvious way.

1. (EXPRESSIVITY PROPERTY) We have s′ = 2rn. Consider the case e.g. when W is the discrete Fourier matrix, which has rank r = n (and hence s′ ≥ Ω(n^2)) and, by Theorem 2.5, W has s = O(n log n). Thus, EXPRESSIVITY PROPERTY is not satisfied since the gap between s′ and s is pretty much as large as possible.

2. (EFFICIENT MVM PROPERTY) This property is satisfied since the obvious matrix-vector multiplication algorithm (given L and R) takes O(rn) operations.

3. (EFFICIENT GRADIENT PROPERTY) It is easy to see that each entry in L · R is a degree-two polynomial in the entries of θ and hence is also differentiable.

4.2 Sparse matrices (in listing representation)


Next, we consider m-sparse matrices in the listing representation. In other words, s′ = O(m) and θ is basically a list of triples (x_i, y_i, c_i) for 1 ≤ i ≤ m. The map f is defined as follows:

f(θ)[j, k] = c if (j, k, c) is in θ, and f(θ)[j, k] = 0 otherwise.

It turns out that sparse matrices do not satisfy two of the three requirements in Question 2.3–
12 However, earlier we were taking the gradient with respect to (essentially) A whereas here it is with respect to x.
13 Here we consider A as given and x and y as inputs. This implies that we need to prove the Baur-Strassen theorem when we only take derivatives with respect to part of the inputs – but this follows trivially since one can just read off ∇_x(y^T Ax) from ∇_{x,y}(y^T Ax).

• (EXPRESSIVITY PROPERTY) We have s′ = Θ(m). However, for the discrete Fourier transform we have m = n^2, and as we have already observed, for the discrete Fourier transform we have s = O(n log n). Hence, the gap between s′ and s is as large as possible.

• (EFFICIENT MVM PROPERTY) The obvious algorithm to multiply an m-sparse matrix with an arbitrary vector takes O(m) operations and hence EFFICIENT MVM PROPERTY is satisfied.

• (EFFICIENT GRADIENT PROPERTY) It is easy to check that f as defined above is not differentiable (because the locations of the non-zero values are discrete). E.g. consider the case of m = 1 and let (x, y) be the location of the non-zero value (and let us assume that W[x, y] = 1). In this case f(θ)[j, k] = δ_{x=j, y=k}, where δ is the Kronecker delta function, for which the derivative is not defined at the point (x, y, 1) and hence f is not differentiable.14

As a bit of a spoiler alert, variants of sparse matrices will actually be crucial in answering Question 2.3 in the affirmative. It turns out that to satisfy EXPRESSIVITY PROPERTY one needs to consider products of sparse matrices (see Section 4.6) and to satisfy EFFICIENT GRADIENT PROPERTY one needs to go beyond
the listing representation (see Section 5).

4.3 Sparse+low rank


Next, we consider the combination of sparse and low rank matrices. Not only is this a natural combina-
tion to consider but such matrices have been well-studied in the context of robust PCA [12]. However, for
this survey we are interested in this family of matrices since this is exactly the class of matrices consid-
ered in the matrix rigidity problem introduced by Valiant [42]. In particular, we recall the following result
due to Valiant (where the specific statement is from Paturi and Pudlák [35]):

Theorem 4.1 ([42, 35]). Let r, d, σ be positive integers such that d > 4 log_2 σ. Assume W has a circuit C with size

s ≤ r · (log_2 d) / ( 2 log_2( d / (4 log_2 σ) ) ),

and depth d. Then we can decompose W as

W = S + LR,

where both S ∈ F^{m×n} and R ∈ F^{r×n} are σ-row-sparse (i.e. overall they are mσ- and rσ-sparse respectively) and L ∈ F^{m×r}. In other words, W can be written as a sum of a rank r matrix and a σn-sparse matrix.

The above result has spawned a long line of beautiful work in the area of matrix rigidity, to which we do not have the space to do justice; see the course notes by Golovnev [22] for more details.
Unfortunately, sparse+low-rank matrices cannot answer Question 2.3 positively either. Specifically,
we will use the following lower bound result.
14 In this survey we are dealing with the classical definition of derivatives. If one defines the Kronecker delta function as a limit of a distribution and considers derivatives in the sense of the theory of distributions, then EFFICIENT GRADIENT PROPERTY will be satisfied. Indeed, many practical implementations that use sparse matrices as the weight matrices W, when trying to learn W, use the distributional definition of the Kronecker delta function.

Theorem 4.2 (Thm 2.17 in [30]). Let a ∈ Q^n be such that all entries in a are algebraically independent over Q. Then there exists an ε > 0 such that for every r ≤ ε√n, if one can write

V_n^{(a)} = S + R,   (5)

where R has rank r, then S has overall sparsity at least n^2/4.

Let us consider all three required properties in sequence:

• (EXPRESSIVITY PROPERTY) It turns out that the Vandermonde matrices (with the condition as in Theorem 4.2) still show that this property is not satisfied for sparse+low rank matrices, though the gap between s′ and s is not as dramatic as before. Specifically, we claim that Theorem 4.2 shows15 that s′ ≥ Ω(n^{3/2}) (while s = O(n log^2 n) [11]). Thus, while the gap is not quadratic as it was for the sparse-only or low-rank-only case, the gap is still too large for what we are after.

• (EFFICIENT MVM PROPERTY) Since this property is satisfied for rank r and σn-sparse matrices, this property is also satisfied for their sum.

• (EFFICIENT GRADIENT PROPERTY) Since this property is not satisfied for sparse matrices (with the listing representation), this property is not satisfied for the sum of low rank and sparse matrices either.

We would like to stress that the goal of matrix rigidity is different from ours in that the goal of the program of matrix rigidity is to exhibit an explicit matrix for which any decomposition as R + S, for R having rank O(n / log log n), needs S to have sparsity Ω(n^{1+ε}) for some constant ε > 0. In our context we would have liked to show that matrices with small arithmetic circuits are not rigid.

4.4 Vandermonde matrices


So far we have been able to rule out low rank, sparse and sparse+low rank matrices just based on the
discrete Fourier transform. However, the discrete Fourier transform by itself does not need a lot of pa-
rameters. In particular, it is a special case of Vandermonde matrices (Definition 2.6). It is natural to
consider Vandermonde matrices as a potential answer to Question 2.3. In this case we use the obvious
representation where θ is just the vector (a_0, . . . , a_{n−1}) and f is defined as per Definition 2.6. Unfortunately,
Vandermonde matrices cannot answer Question 2.3 positively either:

1. (EXPRESSIVITY PROPERTY) We have s′ = O(n log^2 n) [11]. However, by a simple counting argument it is easy to see that Vandermonde matrices cannot represent all matrices. Specifically, consider the set of s̄-sparse matrices with sparsity s̄ = ω(n). Since a Vandermonde matrix is represented by n parameters, there will be at least one s̄-sparse matrix that cannot be represented as a Vandermonde matrix. Thus, EXPRESSIVITY PROPERTY is not satisfied.

2. (EFFICIENT MVM PROPERTY) This property is satisfied since one can multiply a Vandermonde matrix with an arbitrary vector in O(n log^2 n) = O(s′) operations [11].

3. (EFFICIENT GRADIENT PROPERTY) By definition, each entry in a Vandermonde matrix is a polynomial (of degree at most n − 1) in the entries of θ and hence is also differentiable.
15 Indeed, consider any sparse+low rank decomposition as in equation (5). If R has rank r at least ε√n, this immediately implies s′ ≥ 2rn ≥ Ω(n^{3/2}). If on the other hand r ≤ ε√n, then by Theorem 4.2, we have s′ ≥ n^2/4.

4.5 Low-displacement rank matrices
We now consider a class of structured matrices that have been used in experiments in deep learning to
address the practical questions that motivated Question 2.3.
We begin with the definition of a matrix having a displacement rank of r :

Definition 4.3. A matrix W ∈ F^{n×n} has displacement rank r with respect to L, R ∈ F^{n×n} if the residual

E = LW − WR

has rank r .

We would like to mention that for the above definition to be meaningful, the displacement operators
(L, R) need to satisfy some non-trivial requirements for W. E.g. if L = R = I, then all matrices have dis-
placement rank 0 with respect to (I, I). However, if we insist that L and R do not share any common
eigenvalues, then in the above definition, every E corresponds to a unique matrix W. For the rest of the
section, we will make this assumption.

4.5.1 Some examples and arithmetic circuit complexity

Consider the following matrix (called a Cauchy matrix):

Definition 4.4 (Cauchy Matrix). Arbitrarily fix s, t ∈ F^n such that for every 0 ≤ i, j < n, s[i] ≠ t[j], and for every i ≠ j, s[i] ≠ s[j] and t[i] ≠ t[j]. Then define

C_n[i, j] = 1 / (s[i] − t[j]).
It can be shown that this matrix has full rank. We next argue that the Cauchy matrix (Definition 4.4)
has displacement rank 1 with respect to L = diag(s) and R = diag(t ), where recall diag(x) is the diagonal
matrix with x on its diagonal. Indeed, note that in this case diag(s) C_n − C_n diag(t) is the all-ones matrix.
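This displacement identity is easy to verify numerically (a small numpy sketch with an arbitrary admissible choice of s and t):

```python
import numpy as np

n = 5
s = np.arange(1.0, n + 1)               # s = (1, ..., n)
t = -np.arange(1.0, n + 1)              # t = (-1, ..., -n); all required entries are distinct
C = 1.0 / (s[:, None] - t[None, :])     # Cauchy matrix C[i, j] = 1 / (s[i] - t[j])

residual = np.diag(s) @ C - C @ np.diag(t)
print(np.allclose(residual, np.ones((n, n))))   # True: the residual has rank 1
```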
Further, it turns out that the Vandermonde matrix (Definition 2.6) Vn(a) for any a ∈ Fn has displace-
ment rank 1 with respect to L = diag(a) and R being the shift matrix as defined next:

Definition 4.5 (Shift Matrix). The shift matrix Z ∈ F^{n×n} is defined by

Z[i, j] = 1 if i = j − 1, and Z[i, j] = 0 otherwise.

(The reason the matrix Z is called the shift matrix is because when applied to the left or right of a
matrix it shifts the rows (or columns respectively) of the matrix.)
It is known how to compute Wx with arithmetic circuit complexity Õ(rn), where W has displacement rank at most r with respect to (L, R) and these ‘operators’ are either shift or diagonal matrices. In fact De Sa et al. [18] show that as long as L and R are O(1)-quasiseparable (i.e. all sub-matrices strictly above or strictly below the main diagonal have O(1) rank) then any matrix W that has displacement rank r with respect to (L, R) has arithmetic circuit complexity Õ(rn).

4.5.2 Low displacement rank matrices in deep learning literature

Low displacement rank (or LDR) matrices have actually been implemented in deep learning systems
with some success in reducing the memory footprint and improving the time efficiency of inference [37, 39]. Here
we give a very quick (and necessarily incomplete) overview of the main results of paper of Zhao et al. [44].
Zhao et al. consider LDR with respect to any fixed displacement operators (L, R) as long as

• Both L and R are non-singular diagonalizable matrices,

• L^q = a · I for some 1 ≤ q ≤ n and some non-zero scalar a,

• I − aR^q is non-singular, and

• The eigenvalues of R are distinct in absolute value.

Zhao et al. fix the displacement operators (L, R) as above and consider one layer neural networks
(as in our case) where the weight matrix W has O(1)-displacement rank with respect to (L, R).16 Note
that such matrices can be represented by just storing the rank-O(1) residual LW − WR (in factored form) and hence need only O(n) parameters overall. For the rest of this subsection, we will refer to these as LDR neural networks.
They show that for three well-studied properties of single layer neural networks, the ones with LDR weight matrices are just as ‘good’ as arbitrary weight matrices. Arguing these results formally is out of scope for this survey, so here we just give very high level informal statements (and refer the reader to
the paper [44] for the formal statements and their proofs):

1. The universal approximation theorem states that an LDR neural network can approximate any
continuous function to within arbitrary precision at any point (i.e. under the ℓ∞ error norm).

2. The paper also shows that for any probability distribution over an n-dimensional ball, an LDR neural network can approximate any function (w.r.t. the probability distribution) with squared error O(1/n²). A similar result was shown for neural networks with arbitrary weight matrices, which as we have already seen need Ω(n²) parameters (while the LDR neural network only needs O(n) parameters, as observed above).

3. Zhao et al. also show that one can compute the required gradients for the gradient descent algo-
rithm (where roughly speaking the complexity of computing the gradients depends on the arith-
metic circuit complexity of L and R).17

One practical drawback in this setup is that one fixes L and R upfront. Thomas et al. have run exper-
iments where one tries to learn the displacement operators L and R along with the residual matrix [39].

4.5.3 Coming back to Question 2.3

Unfortunately, low displacement rank matrices are not enough to answer Question 2.3 in the affirmative
either.
16 This means that during the learning phase, we already know L and R and we only need to learn the residual.
17 At a high level this should not be surprising given the results in Section 3, though Zhao et al. do not utilize the generic

connection we established in Section 3.

• (EXPRESSIVITY PROPERTY) It turns out that the full power of low displacement rank matrices w.r.t. O(1)-quasiseparable displacement operators is not known– in other words, it is not known if these matrices satisfy EXPRESSIVITY PROPERTY. We conjecture that they do not. In somewhat weak support of this conjecture, we note that the traditional displacement operators are either a (non-zero) diagonal matrix D or (simple variants of) the shift matrix Z (the initial experimental results on LDR neural networks are for these displacement operators [37])– and in this case there are even diagonal matrices that have displacement rank Ω(n) with respect to these displacement operators (which means we have s 0 ≥ Ω(n²)).
Indeed, if at least one of L or R is a diagonal matrix, then we note that L − R has all non-zero diagonal entries18 and is lower/upper triangular, and hence W = I has displacement rank Ω(n) with respect to such matrices. If both L and R are shift matrices, then we note that for a diagonal matrix D, we have E = ZD − DZ = D0 · Z, where D0 is the diagonal matrix with D0 [i , i ] = D[i + 1, i + 1] − D[i , i ]. Thus, if we choose D such that all consecutive elements on the diagonal are different, then the rank of E is the same as the rank of Z, and thus D has displacement rank Ω(n) with respect to shift matrices (see the numerical sanity check after this list).

• (EFFICIENT MVM PROPERTY) Results from [18] show that this property is satisfied when L and R are O(1)-quasiseparable matrices.

• (EFFICIENT GRADIENT PROPERTY) As mentioned earlier, [44] shows that this property is satisfied if L and R are fixed. Results in Section 3 imply that EFFICIENT GRADIENT PROPERTY is satisfied as long as W has efficient matrix-vector multiplication.
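As promised in the first bullet above, here is a small numerical sanity check (our illustration, not part of the original survey) of the Ω(n) displacement rank claims for the traditional displacement operators:

```python
import numpy as np

n = 8
Z = np.eye(n, k=1)                          # the shift matrix: ones on the superdiagonal
D = np.diag(np.arange(1.0, n + 1))          # consecutive diagonal entries all differ

E = Z @ D - D @ Z                           # residual of D with respect to (Z, Z)
assert np.linalg.matrix_rank(E) == n - 1    # displacement rank n - 1 = Omega(n)

# Similarly, W = I already has full displacement rank for diagonal operators with
# disjoint spectra, since L I - I R = L - R is then non-singular.
L = np.diag(np.arange(1.0, n + 1))
R = np.diag(np.arange(n + 1.0, 2 * n + 1))
assert np.linalg.matrix_rank(L - R) == n
```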

4.6 Product of sparse matrices (in listing representation)


None of the classes of structured matrices that we have considered so far has been able to satisfy EXPRESSIVITY PROPERTY in Question 2.3. Next we consider the class of products of sparse matrices. De Sa et al. [18] showed that these exactly capture matrices W with small arithmetic circuits:

Theorem 4.6. Let W be an n × n matrix such that matrix-vector multiplication of W times an arbitrary
vector v can be represented as a linear arithmetic circuit C comprised of s gates (including inputs) and
having depth d . Then we can represent W as a product of d + 1 matrices each of which is O(s) sparse.

In fact [18] also proves a ‘converse’ of the above result (which means products of sparse matrices exactly capture the power of (linear) arithmetic circuits for linear maps). Before we present the proof of the above result, we remark that Theorem 4.6 and its converse in [18] are probably known, but we have not been able to find a reference that pre-dates [18]– if you are aware of a reference for the above
theorem, please let the author know.

Proof of Theorem 4.6. We will represent C as a product of d matrices, each of size s 0 × s 0 , where s 0 is the
smallest power of 2 that is greater than or equal to s.
Define w 1 , . . . , w d such that w k represents the number of gates in the k'th layer of C (note that s = n + w 1 + · · · + w d ). Also, define z 1 , . . . , z d such that z 1 = n and z k = w k−1 + z k−1 (z k is the number of gates that have already been used by the time we get to layer k).


18 If WLOG L = Z and R = D, then the diagonal of L − R is the diagonal of D and hence all non-zero by our assumption. If both

L and R are diagonal matrices, i.e. L = D1 and R = D2 , then L − R = D1 − D2 and all these entries are non-zero since we assumed
L and R do not share any eigenvalues.

Let g i denote the i 'th gate (and its output) of C (0 ≤ i < s), defined as follows (where we want to multiply v = (v 0 , . . . , v n−1 ) with W): g i = v i for 0 ≤ i < n, and g i = αi g i 1 + βi g i 2 for n ≤ i < s, where i 1 , i 2 are indices of gates in earlier layers.
For the k'th layer of C , we define the s 0 × s 0 matrix Wk such that it performs the computations of the gates in that layer. Define the i 'th row of Wk to be: the i 'th standard basis row vector e i for 0 ≤ i < z k ; the row vector αi e i 1 + βi e i 2 for z k ≤ i < z k + w k ; and the all-zeros row for i ≥ z k + w k .

For any 0 ≤ k ≤ d , let v k be the vector obtained by applying Wk · · · W2 W1 to v padded with s 0 − n zeros.
We'd like to argue that v d contains the outputs of all gates in C (i.e., the n values that make up Wv ). To do this we argue, by induction on k, that v k is the vector whose first z k+1 entries are g 0 , g 1 , . . . , g z k+1 −1 , and whose remaining entries are 0. The base case k = 0 is trivial. Assume this holds for the case k − 1, and
consider multiplying v k−1 by Wk . The first z k rows of Wk duplicate the first z k entries of v k−1 . The next
w k rows perform the computation of gates g zk , . . . , g (zk+1 −1) . Finally, the remaining rows pad the output
vector with zeros. Therefore, v k is exactly as desired.
The final matrix product will contain all n elements of the output, as desired. By left multiplying by
some permutation matrix P, we can reorder this vector such that the first n entries are exactly Wv (or
more precisely we left multiply by a ‘truncated’ permutation matrix so that the final answer is exactly
Wv ). One can now check that we have a product of d + 1 matrices, each of which is O(s)-sparse, as desired.
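To make the construction in the proof concrete, the following sketch (our illustration; the toy circuit below is an assumption chosen for the example and does not appear in [18]) instantiates it for a depth-2 circuit computing the prefix sums of a length-4 vector:

```python
import numpy as np

# Toy linear circuit computing the prefix sums of v = (v0, v1, v2, v3):
#   layer 1: g4 = v0 + v1,  g5 = v2 + v3
#   layer 2: g6 = g4 + v2,  g7 = g4 + g5
# The outputs, in order, are v0, g4, g6, g7, so W is the lower-triangular all-ones matrix.
n, s = 4, 8                      # 4 inputs + 4 gates; s is already a power of 2 here
W = np.tril(np.ones((n, n)))

# W1 performs layer 1: rows 0-3 copy the inputs, rows 4-5 compute g4 and g5.
W1 = np.zeros((s, s))
W1[:4, :4] = np.eye(4)
W1[4, 0] = W1[4, 1] = 1.0        # g4 = v0 + v1
W1[5, 2] = W1[5, 3] = 1.0        # g5 = v2 + v3

# W2 performs layer 2: rows 0-5 copy everything computed so far, rows 6-7 compute g6 and g7.
W2 = np.zeros((s, s))
W2[:6, :6] = np.eye(6)
W2[6, 4] = W2[6, 2] = 1.0        # g6 = g4 + v2
W2[7, 4] = W2[7, 5] = 1.0        # g7 = g4 + g5

# Truncated permutation matrix picking out the output gates (v0, g4, g6, g7).
P = np.zeros((n, s))
for row, gate in enumerate([0, 4, 6, 7]):
    P[row, gate] = 1.0

v = np.arange(1.0, 5.0)
padded = np.concatenate([v, np.zeros(s - n)])
assert np.allclose(P @ W2 @ W1 @ padded, W @ v)
# Each factor is O(s)-sparse: W1 has 8 nonzeros, W2 has 10, and P has 4.
```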

We are now ready to evaluate whether product of sparse matrices can answer Question 2.3 (spoiler
alert: no!):

• (EXPRESSIVITY PROPERTY) If we assume that we only consider W that have circuits of depth Õ(1) (which captures most of the known efficient matrix-vector multiplication algorithms), then Theorem 4.6 shows that s 0 = O(d s), which by our assumption on d is Õ(s), which means we have satisfied EXPRESSIVITY PROPERTY.

• (EFFICIENT MVM PROPERTY) If one uses the obvious algorithm (i.e. multiply successively by each of the d + 1 matrices, each of which is O(s)-sparse), then one can compute the overall matrix-vector multiplication in O(d s) = O(s 0 ) operations. Thus, we also satisfy EFFICIENT MVM PROPERTY.

• (EFFICIENT GRADIENT PROPERTY) This property is not satisfied if we assume the listing representation for each of the sparse matrices (due to the same reason that a single sparse matrix in listing representation does not satisfy EFFICIENT GRADIENT PROPERTY).

We came close to answering Question 2.3 with products of sparse matrices– the only catch was that the listing representation of sparse matrices does not allow us to satisfy EFFICIENT GRADIENT PROPERTY.
Next, we answer Question 2.3 in the positive by coming up with an alternative representation of sparse
matrices that is differentiable.

5 Butterfly matrices
In this section, we will present a positive answer to Question 2.3. We start by taking a circuit/matrix-product view of the FFT in Section 5.1, which in turn motivates the definition of butterfly matrices in Section 5.2. Finally, we use butterfly matrices to define the final class of matrices in Section 5.3, which we will show answers Question 2.3 in the affirmative.

5.1 Fast Fourier Transform (FFT)


As mentioned earlier, a vast majority of efficient matrix-vector multiplication algorithms are equivalent to small (both in size and depth) linear arithmetic circuits. For example, the FFT can be thought of as an efficient arithmetic circuit to compute the Discrete Fourier Transform (indeed, when one converts the linear arithmetic circuit for the FFT into a matrix decomposition, each matrix in the decomposition is a so-called butterfly matrix, with each block matrix in each factor being the same). For an illustration of this, consider the DFT with n = 4 shown in Figure 1.

1 1 1 1
1 -i -1 i
1 -1 1 -1
1 i -1 -i

Figure 1: DFT of order 4.

Figure 2 represents the arithmetic circuit corresponding to the FFT with n = 4.

[Figure: the linear arithmetic circuit for the 4-point DFT, with inputs v 0 , v 1 , v 2 , v 3 , outputs w 0 , w 1 , w 2 , w 3 , and two layers of gates; each gate takes inputs x, y with edge weights a, b and outputs ax + b y.]

Figure 2: Arithmetic circuit for 4-DFT from Figure 1.

Finally, Figure 3 is a representation of the arithmetic circuit of Figure 2 as a product of a butterfly matrix and the (bit-reversal) permutation.

[Figure: the 4-point DFT of Figure 1 written as the product of a butterfly factor matrix of block size 4 with rows (1, 0, 1, 0), (0, 1, 0, −i ), (1, 0, −1, 0), (0, 1, 0, i ), a butterfly factor matrix of block size 2 with rows (1, 1, 0, 0), (1, −1, 0, 0), (0, 0, 1, 1), (0, 0, 1, −1), and the bit-reversal permutation P.]

Figure 3: Decomposition of DFT of Figure 1 via the arithmetic circuit of Figure 2.
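The factorization summarized in Figure 3 is easy to verify numerically; the sketch below (our illustration, not part of the original survey) reconstructs the two butterfly factor matrices and the bit-reversal permutation and checks that their product equals the DFT matrix of Figure 1:

```python
import numpy as np

# The 4-point DFT matrix of Figure 1 (with omega = -i).
F4 = np.array([[(-1j) ** (j * k) for k in range(4)] for j in range(4)])

# Butterfly factor matrix of block size 4: [[I, D], [I, -D]] with D = diag(1, -i).
D = np.diag([1.0, -1j])
I2 = np.eye(2)
B4 = np.block([[I2, D], [I2, -D]])

# Butterfly factor matrix of block size 2: two copies of the 2-point DFT on the diagonal.
F2 = np.array([[1.0, 1.0], [1.0, -1.0]])
Z2 = np.zeros((2, 2))
B2 = np.block([[F2, Z2], [Z2, F2]])

# Bit-reversal permutation for n = 4: (v0, v1, v2, v3) -> (v0, v2, v1, v3).
P = np.eye(4)[[0, 2, 1, 3]]

assert np.allclose(B4 @ B2 @ P, F4)
```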

5.2 Butterfly matrices


Butterfly matrices, encoding the recursive divide-and-conquer structure of the fast Fourier transform
(FFT) algorithm as illustrated in Figure 3, have long been used in numerical linear algebra [34, 29] and
machine learning [31, 24, 32, 16, 13]. Here we define butterfly matrices, which we use as a building block
for our hierarchy of kaleidoscope matrices.
Definition 5.1. A butterfly factor of size k ≥ 2 (denoted as Bk ) is a 2 × 2 block matrix of the form

Bk = [ D1 D2 ; D3 D4 ]

where each Di is a (k/2) × (k/2) diagonal matrix. We restrict k to be a power of 2.

Definition 5.2. A butterfly factor matrix of size n with block size k (denoted as B_k^(n) ) is a block diagonal matrix of n/k (possibly different) butterfly factors of size k:

B_k^(n) = diag( [Bk ]1 , [Bk ]2 , . . . , [Bk ]n/k ).

Definition 5.3. A butterfly matrix of size n (denoted as B^(n) ) is a matrix that can be expressed as a product of butterfly factor matrices:

B^(n) = B_n^(n) B_{n/2}^(n) · · · B_2^(n) .

Equivalently, we may define B^(n) recursively as a matrix that can be expressed in the following form:

B^(n) = B_n^(n) · diag( [B^(n/2) ]1 , [B^(n/2) ]2 ).

(Note that [B^(n/2) ]1 and [B^(n/2) ]2 may be different.)
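To make these definitions concrete, here is a small numpy sketch (our illustration, not part of the original survey) that samples random butterfly factor matrices, checks that each has at most 2n nonzeros, and verifies that applying the log2 (n) sparse factors one at a time matches multiplication by the dense butterfly matrix; this is the source of the O(n log n) matrix-vector multiplication time:

```python
import numpy as np

rng = np.random.default_rng(0)

def butterfly_factor_matrix(n, k):
    """A random butterfly factor matrix B_k^(n): block diagonal with n/k butterfly
    factors of size k, each of the form [[D1, D2], [D3, D4]] with diagonal blocks."""
    M = np.zeros((n, n))
    for b in range(0, n, k):
        D1, D2, D3, D4 = (np.diag(rng.standard_normal(k // 2)) for _ in range(4))
        M[b:b + k, b:b + k] = np.block([[D1, D2], [D3, D4]])
    return M

n = 16
block_sizes = [n >> i for i in range(n.bit_length() - 1)]     # n, n/2, ..., 2
factors = [butterfly_factor_matrix(n, k) for k in block_sizes]

for F in factors:                     # each of the log2(n) factors is 2n-sparse
    assert np.count_nonzero(F) <= 2 * n

B = np.linalg.multi_dot(factors)      # the (dense) butterfly matrix B^(n)
x = rng.standard_normal(n)
y = x.copy()
for F in reversed(factors):           # multiply by the sparse factors one at a time
    y = F @ y
assert np.allclose(B @ x, y)
```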

5.3 The kaleidoscope hierarchy


Using the building block of butterfly matrices, we formally define the kaleidoscope (BB ∗ ) hierarchy and
prove its expressiveness. This class of matrices serves as a fully differentiable alternative to products of
sparse matrices (Section 4.6), with similar expressivity. This family of matrices was defined by Dao et
al. [17].
The building block for this hierarchy is the product of a butterfly matrix and the (conjugate) transpose
of another butterfly matrix (which is simply a product of butterfly factors taken in the opposite order).
Figure 4 visualizes the sparsity patterns of the butterfly factors in BB ∗ , where the red and blue dots
represent the allowed locations of nonzero entries.

Figure 4: Visualization of the fixed sparsity pattern of the building blocks in BB ∗ , in the case n = 16.
The red and blue dots represent all the possible locations of the nonzero entries.

We would like to note that the sparsity pattern of a matrix in BB ∗ matches exactly the Beneš network [9, 8], which is a multistage circuit switching network. The goal of a Beneš network is to route n input connections to n output connections through a sequence of switches, where the basic building block is a crossbar switch (each such switch can ‘swap’ its two connections). It is known that the Beneš network can route any permutation from the input connections to the output connections by appropriately making each switch swap (or not) its two input connections. In our setup of BB ∗ , we allow each ‘switch’ in the Beneš network to be replaced by an arbitrary 2 × 2 sub-matrix.
Definition 5.4 (Kaleidoscope hierarchy, kaleidoscope matrices).
• Define B as the set of all matrices that can be expressed in the form B(n) (for some n).

• Define BB ∗ as the set of matrices M of the form M = M1 M2∗ for some M1 , M2 ∈ B.

• Define (BB ∗ )w as the set of matrices M that can be expressed as M = Mw . . . M2 M1 , with each Mi ∈
BB ∗ (1 ≤ i ≤ w). (The notation w represents width.)

• Define (BB ∗ )ew as the set of n × n matrices M that can be expressed as M = S E S^T for some en × en matrix E ∈ (BB ∗ )w , where S = [ In 0 . . . 0 ] ∈ Fn×en (i.e. M is the upper-left n × n corner of E). (The notation e represents expansion relative to n.)

• M is a kaleidoscope matrix, abbreviated as K-matrix, if M ∈ (BB ∗ )ew for some w and e.


The kaleidoscope hierarchy, or (BB ∗ ) hierarchy, refers to the families of matrices (BB ∗ )1e ⊆ (BB ∗ )2e ⊆
. . . , for a fixed expansion factor e. Each butterfly matrix can represent the identity matrix, so (BB ∗ )ew ⊆
(BB ∗ )ew+1 . Dao et al. [17] show that the inclusion is proper.

Efficiency in space and speed. Each matrix in (BB ∗ )ew is a product of 2w total butterfly matrices and transposes of butterfly matrices, each of which is in turn a product of log(ne) factors with 2ne nonzeros (NNZ) each. Therefore, each matrix in (BB ∗ )ew has 4wne log(ne) parameters and a matrix-vector multiplication algorithm of complexity O(wne log(ne)) (by multiplying the vector with each sparse factor sequentially).
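As a quick numerical illustration of these counts (our back-of-the-envelope sketch, not part of the original survey), for n = 1024 a single BB ∗ building block (w = e = 1) has about 4n log2 (n) ≈ 41,000 parameters, compared to roughly a million entries for a dense matrix:

```python
import math

def kaleidoscope_params(n, w, e):
    # 2w butterfly matrices (or transposes), each a product of log2(n*e) butterfly
    # factor matrices with 2*n*e nonzeros each: 4*w*n*e*log2(n*e) parameters total.
    return 4 * w * n * e * int(math.log2(n * e))

n = 1024
print(kaleidoscope_params(n, w=1, e=1))   # 40960 parameters for one BB* block
print(n * n)                              # 1048576 entries in a dense n x n matrix
```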

Difference from family of matrices in Section 4.6. We note that K-matrices are similar to the family of matrices considered in Section 4.6 in that they are also products of sparse matrices. The main difference is that each matrix in the product, in addition to being sparse, is also structured– i.e. we know upfront where all the non-zero elements in each factor of a K-matrix will be. This allows us to create a differentiable representation for sparse matrices, which was the missing piece for the family of products of (general) sparse matrices.

5.3.1 Answering Question 2.3

We state the main theoretical result, namely, the ability to capture general transformations, expressed as
low-depth linear arithmetic circuits, in the BB ∗ hierarchy. This result is recorded in Theorem 5.5.

Theorem 5.5. Let M be an n × n matrix such that matrix-vector multiplication of M times an arbitrary vector v can be represented as a linear arithmetic circuit C comprised of s gates (including inputs) and having depth d . Then, M ∈ (BB ∗ )ew with width w = O(d ) and expansion e = O(s/n).

Before we prove Theorem 5.5, we note that it implies that K-matrices answer Question 2.3 in the affirmative:

1. (EXPRESSIVITY PROPERTY) Theorem 5.5 along with the observation on the number of parameters needed to represent a matrix in (BB ∗ )ew implies that we have s 0 = O(d · (s/n) · n · log((s/n) · n)) = O(d s log s). Thus, under the assumption that d = Õ(1), we have that s 0 = Õ(s), as desired.

2. (EFFICIENT MVM PROPERTY) Again, by Theorem 5.5 along with the observation on the number of operations needed to do matrix-vector multiplication for a matrix in (BB ∗ )ew (and using the calculations from the previous bullet), we get that the matrix-vector multiplication takes O(s 0 ) operations, as desired.

3. (EFFICIENT GRADIENT PROPERTY) Finally, since we know the locations of the non-zero elements (which form the parameters for K-matrices), it is not too hard to see that each entry of Wθ is a polynomial in the entries of θ. Since a polynomial in θ is differentiable, this means EFFICIENT GRADIENT PROPERTY is satisfied as well.

Proof of Theorem 5.5. To prove Theorem 5.5, we make use of the following two theorems.

Theorem 5.6. Let P be an n × n permutation matrix (with n a power of 2). Then P ∈ BB ∗ .


Theorem 5.7. Let S be an n × n matrix of s NNZ. Then S ∈ (BB ∗ )ew with width w = 4⌈s/n⌉ and expansion e = 4.

We first give an overview of how the above two results imply Theorem 5.5 and then briefly outline
how one can prove the two results above. First, we note that (the proof of) Theorem 4.6 implies that given any W with an arithmetic circuit of size s and depth d , we can represent W as a product of d many O(s)-sparse matrices and a (truncated) permutation matrix. Thus, we can decompose W as a product of O(d ) K-matrices.
Then Theorem 5.5 follows from the simple observation that membership in the family of K-matrices is
closed under multiplication.
Theorem 5.6 essentially follows from the known fact that a Beneš network can route an arbitrary
permutation (see [17] for a self-contained proof in the language of K-matrices).
Theorem 5.7 follows by showing that any n-sparse matrix is in (BB ∗ )4 and the fact that member-
ship in the family of K-matrices is closed under addition. To show the inclusion of an n-sparse matrix,
Dao et al. [17] show that any n-sparse matrix S can be decomposed as P1 HP2 VP3 , where P1 , P2 , P3 are
permutation matrices (which by Theorem 5.6 are in BB ∗ ). H is a horizontal step matrix, which obeys a
‘Lipschitz-like’ condition. Each column of a horizontal step matrix can have at most one non-zero entry,
and given two non-zero columns k apart, the non-zero entry in the right column must be between 0 and
k rows below the non-zero entry in the left column. Note that to show that a matrix is a horizontal step
matrix, it is sufficient to argue that this condition holds for each pair of neighboring non-zero columns.

The matrix V is such that its transpose is a horizontal step matrix. Dao et al. [17] show that any horizontal
step matrix is in B. Combining all of these, we have that S ∈ (BB ∗ )(B)(BB ∗ )(B ∗ )(BB ∗ ) ⊆ (BB ∗ )5 .
Dao et al. [17] observe that with a bit more careful analysis we can show inclusion in (BB ∗ )4 . We refer the
interested reader to [17] for the proof details.

6 Open Questions
We conclude by presenting two open questions (the first one being a specific technical question and the
other one being a bit more vague):

1. There is one unsatisfactory aspect to the results in Section 5.3, namely the number of parameters needed to specify the family of K-matrices that captures matrices with an arithmetic circuit of size
s and depth d is O(sd log s). In particular, the dependence on d is not ideal, which leads to the
following:

Open Question 6.1. Is it possible to answer Question 2.3 in the affirmative with a family that uses s 0 = Õ(s) many parameters to capture all matrices with arithmetic circuits of size s (irrespective of the depth d )?

2. As mentioned earlier, low rank approximation is ubiquitous in machine learning (and numerical
linear algebra more generally). One intriguing possibility is whether K-matrices can replace low rank matrices in these applications. Currently, the main technical stumbling block is solving the
following:

Open Question 6.2. Does there exist an efficient algorithm that solves the
following problem– given an arbitrary matrix M ∈ Fn×n and parameters w
and e, find the matrix W ∈ (BB ∗ )ew that is closest (or ‘close enough’) to M (say
in Frobenius norm)?

We note that for low rank matrices, the SVD solves the above question. Thus, the question is asking
whether we can design the ‘SVD for K-matrices’. Partial progress on a variant of the above
question was made recently in [15].

Acknowledgments
The material in Sections 2 and 3 is based on notes for AR’s Open lectures for PhD students in computer science at the University of Warsaw, titled (Dense Structured) Matrix Vector Multiplication, in May 2018– we would like to thank the University of Warsaw for its hospitality. The material in Section 5 is based on Dao et
al. [17].
We would like to thank Tri Dao, Albert Gu and Chris Ré for many illuminating discussions during our
collaborations around these topics.
We would like to thank an anonymous reviewer whose comments improved the presentation of the
survey (and for pointing us to Theorem 4.2) and we thank Jessica Grogan for a careful read of an earlier
draft of this survey.
AR is supported in part by NSF grant CCF-1763481.

References
[1] Josh Alman. Kronecker products, low-depth circuits, and matrix rigidity. In Samir Khuller and
Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of
Computing, Virtual Event, Italy, June 21-25, 2021, pages 772–785. ACM, 2021.

[2] Josh Alman and Lijie Chen. Efficient construction of rigid matrices using an NP oracle. In David
Zuckerman, editor, 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019,
Baltimore, Maryland, USA, November 9-12, 2019, pages 1034–1055. IEEE Computer Society, 2019.

[3] Josh Alman and R. Ryan Williams. Probabilistic rank and matrix rigidity. In Hamed Hatami, Pierre
McKenzie, and Valerie King, editors, Proceedings of the 49th Annual ACM SIGACT Symposium on
Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 641–652. ACM,
2017.

[4] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Neural Network
Learning: Theoretical Foundations. Cambridge University Press, 2009.

[5] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019. http://www.fairmlbook.org.

[6] Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Sci-
ence, 22(3):317–330, 1983.

[7] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the
dangers of stochastic parrots: Can language models be too big? In Madeleine Clare Elish, William
Isaac, and Richard S. Zemel, editors, FAccT ’21: 2021 ACM Conference on Fairness, Accountability,
and Transparency, Virtual Event / Toronto, Canada, March 3-10, 2021, pages 610–623. ACM, 2021.

[8] V.E. Beneš. Mathematical Theory of Connecting Networks and Telephone Traffic. ISSN. Elsevier
Science, 1965.

[9] V. E. Beneš. Optimal rearrangeable multistage connecting networks. The Bell System Technical
Journal, 43(4):1641–1656, 1964.

[10] Amey Bhangale, Prahladh Harsha, Orr Paradise, and Avishay Tal. Rigid matrices from rectangular
pcps or: Hard claims have complex proofs. In 61st IEEE Annual Symposium on Foundations of
Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 858–869. IEEE, 2020.

[11] Peter Bürgisser, Michael Clausen, and Mohammad A. Shokrollahi. Algebraic complexity theory, vol-
ume 315. Springer Science & Business Media, 2013.

[12] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis?
J. ACM, 58(3), June 2011.

[13] Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller. Unifying orthogonal
Monte Carlo methods. In International Conference on Machine Learning, pages 1203–1212, 2019.

[14] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier
series. Mathematics of Computation, 19(90):297–301, 1965.

[15] Tri Dao, Beidi Chen, Nimit Sharad Sohoni, Arjun D. Desai, Michael Poli, Jessica Grogan, Alexander
Liu, Aniruddh Rao, Atri Rudra, and Christopher Ré. Monarch: Expressive structured matrices for
efficient and accurate training. CoRR, abs/2204.00595, 2022.

[16] Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms
for linear transforms using butterfly factorizations. In The International Conference on Machine
Learning (ICML), 2019.

[17] Tri Dao, Nimit Sharad Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri
Rudra, and Christopher Ré. Kaleidoscope: An efficient, learnable representation for all structured
linear maps. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020. OpenReview.net, 2020.

[18] Christopher De Sa, Albert Gu, Rohan Puttagunta, Christopher Ré, and Atri Rudra. A two-pronged
progress in structured dense matrix vector multiplication. In Proceedings of the Twenty-Ninth An-
nual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January
7-10, 2018, pages 1060–1079, 2018.

[19] Zeev Dvir and Allen Liu. Fourier and circulant matrices are not rigid. Theory of Computing, 16(20):1–
48, 2020.

[20] Charles M. Fiduccia. On the algebraic complexity of matrix multiplication. PhD thesis, Brown Uni-
versity, 1973. URL: http://cr.yp.to/bib/entries.html#1973/fiduccia-matrix.

[21] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neu-
ral networks. In International Conference on Learning Representations (ICLR), 2019.

[22] Sasha Golovnev. A course on matrix rigidity, 2020. https://golovnev.org/rigidity/. Accessed August 15, 2021.

[23] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu,
Xingxing Wang, Li Wang, Gang Wang, Jianfei Cai, and Tsuhan Chen. Recent advances in convo-
lutional neural networks. Pattern Recognition, 77:354–377, 2018.

[24] Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and
Marin Soljačić. Tunable efficient unitary neural networks (eunn) and their application to rnns. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1733–1741.
JMLR. org, 2017.

[25] Thomas Kailath, Sun-Yuan Kung, and Martin Morf. Displacement ranks of matrices and linear
equations. Journal of Mathematical Analysis and Applications, 68(2):395–407, 1979.

[26] Thomas Kailath and Ali H. Sayed. Displacement structure: Theory and applications. SIAM Review,
37(3):297–386, 1995.

[27] E. Kaltofen. Computational differentiation and algebraic complexity theory. In C. H. Bischof, A. Griewank, and P. M. Khademi, editors, Workshop Report on First Theory Institute on Computational Differentiation, volume ANL/MCS-TM-183 of Tech. Rep., Argonne, Illinois, pages 28–30, New York, NY, USA, 1993. Association for Computing Machinery. http://kaltofen.math.ncsu.edu/bibliography/93/Ka93_diff.pdf.

[28] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436– 444, 2015.

[29] Yingzhou Li, Haizhao Yang, Eileen R. Martin, Kenneth L. Ho, and Lexing Ying. Butterfly factoriza-
tion. Multiscale Modeling & Simulation, 13(2):714–732, 2015.

[30] Satyanarayana V. Lokam. Complexity lower bounds using linear algebra. Found. Trends Theor. Com-
put. Sci., 4(1-2):1–155, 2009.

[31] Michael Mathieu and Yann LeCun. Fast approximation of rotations and Hessians matrices. arXiv
preprint arXiv:1404.7195, 2014.

[32] Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, and Ivan Oseledets. Quadrature-based
features for kernel approximation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9165–
9174. Curran Associates, Inc., 2018.

[33] Victor Y. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer-Verlag
New York, Inc., New York, NY, USA, 2001.

[34] D. Stott Parker. Random butterfly transformations with applications in computational linear alge-
bra. Technical report, UCLA, 1995.

[35] R. Paturi and P. Pudlák. Circuit lower bounds and linear codes. Journal of Mathematical Sciences,
134:2425– 2434, 2006.

[36] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. arXiv preprint
arXiv:1907.10597, 2019.

[37] Vikas Sindhwani, Tara N. Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep
learning. In Advances in Neural Information Processing Systems, pages 3088–3096, 2015.

[38] G. Szegö. Orthogonal Polynomials. Number v. 23 in American Mathematical Society colloquium publications. American Mathematical Society, 1967.

[39] Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, and Christopher Ré. Learning compressed trans-
forms with low displacement rank. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kris-
ten Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information
Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS
2018, December 3-8, 2018, Montréal, Canada, pages 9066–9078, 2018.

[40] Joseph Tsidulko. Google showcases on-device artificial intelligence breakthroughs at I/O. CRN,
2019.

[41] Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM
Journal on Mathematics of Data Science, 1(1):144–160, 2019.

[42] Leslie G. Valiant. Graph-theoretic arguments in low-level complexity. In Jozef Gruska, editor, Math-
ematical Foundations of Computer Science 1977, pages 162–176, Berlin, Heidelberg, 1977. Springer
Berlin Heidelberg.

[43] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank
and sparse decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017.

[44] Liang Zhao, Siyu Liao, Yanzhi Wang, Zhe Li, Jian Tang, and Bo Yuan. Theoretical properties for
neural networks with weight matrices of low displacement rank. In Doina Precup and Yee Whye
Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 4082–4090. PMLR, 06–11 Aug 2017.
