
Multiplying Matrices Without Multiplying

Davis Blalock 1 2 John Guttag 2

1 MosaicML, San Francisco, CA, USA   2 MIT CSAIL, Cambridge, MA, USA. Correspondence to: Davis Blalock <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

arXiv:2106.10860v1 [cs.LG] 21 Jun 2021

Abstract

Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning. Consequently, there has been significant work on efficiently approximating matrix multiplies. We introduce a learning-based algorithm for this task that greatly outperforms existing methods. Experiments using hundreds of matrices from diverse domains show that it often runs 100× faster than exact matrix products and 10× faster than current approximate methods. In the common case that one matrix is known ahead of time, our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling—the core operations of our method—could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.

1. Introduction

Matrix multiplication is among the most fundamental subroutines used in machine learning and scientific computing. As a result, there has been a great deal of work on implementing high-speed matrix multiplication libraries (Paszke et al., 2017; Guennebaud et al., 2010; Abadi et al., 2016), designing custom hardware to accelerate multiplication of certain classes of matrices (Han et al., 2016; Chen et al., 2016; Parashar et al., 2017; Jouppi et al., 2017), speeding up distributed matrix multiplication (Yu et al., 2017; Dutta et al., 2016; Yu et al., 2020; Irony et al., 2004), and designing efficient Approximate Matrix Multiplication (AMM) algorithms under various assumptions.

We focus on the AMM task under the assumptions that the matrices are tall, relatively dense, and resident in a single machine's memory. In this setting, the primary challenge is minimizing the amount of compute time required to approximate linear operations with a given level of fidelity. This setting arises naturally in machine learning and data mining when one has a data matrix A whose rows are samples and a linear operator B one wishes to apply to these samples. B could be a linear classifier, linear regressor, or an embedding matrix, among other possibilities.

As a concrete example, consider the task of approximating a softmax classifier trained to predict image labels given embeddings derived from a neural network. Here, the rows of A are the embeddings for each image, and the columns of B are the weight vectors for each class. Classification is performed by computing the product AB and taking the argmax within each row of the result. In Figure 1, we see the results of approximating AB using our method and its best-performing rivals (Dasgupta et al., 2010; Mairal et al., 2009) on the CIFAR-10 and CIFAR-100 datasets.

[Figure 1: Our method achieves a dramatically better speed-accuracy tradeoff than the best existing methods when approximating two linear classifiers. The two panels plot classification accuracy on CIFAR-10 and CIFAR-100 against speedup over an exact matrix multiply (1× to 100×) for our method, exact multiplication, Mairal et al., and Dasgupta et al.]

Our method represents a significant methodological departure from most traditional approaches to this problem. Traditional AMM methods construct matrices V_A, V_B ∈ R^{D×d}, d ≪ D, such that

    AB ≈ (A V_A)(V_B^T B).    (1)

Often, V_A and V_B are sparse, embody some sort of sampling scheme, or have other structure such that these projection operations are faster than a dense matrix multiply. In short, these methods use linear functions to preprocess A and B and reduce the problem to exact matrix multiplication in a lower-dimensional space.

Our proposed method, MADDNESS (short for Multiply-ADDitioN-lESS), instead employs a nonlinear preprocessing function and reduces the problem to table lookups. Moreover, in the case that B is known ahead of time—which happens when applying a trained linear model to new data, among other situations—MADDNESS does not require any multiply-add operations.

Our method is most closely related to vector quantization methods used for similarity search (e.g., Blalock & Guttag, 2017; André et al., 2017; 2019; Jegou et al., 2011; Ge et al., 2014). However, instead of using an expensive quantization function that requires many multiply-adds, we introduce a family of quantization functions that require no multiply-adds.

Our contributions can be summarized as follows:

• An efficient family of learned vector quantization functions that can encode over 100GB of data per second in a single CPU thread.
• A high-speed summation algorithm for low-bitwidth integers that avoids upcasting, saturation, and overflow.
• An algorithm based on these functions for approximate matrix multiplication. Experiments across hundreds of diverse matrices demonstrate that this algorithm significantly outperforms existing alternatives. It also features theoretical quality guarantees.

1.1. Problem Formulation

Let A ∈ R^{N×D} and B ∈ R^{D×M} be two matrices, with N ≫ D ≥ M. Given a computation time budget τ, our task is to construct three functions g(·), h(·), and f(·), along with constants α and β, such that

    ‖α f(g(A), h(B)) + β − AB‖_F < ε(τ) ‖AB‖_F    (2)

for as small an error ε(τ) as possible. The constants α and β are separated from f(·,·) so that f(·,·) can produce low-bitwidth outputs (e.g., in the range [0, 255]) even when the entries of AB do not fall in this range.

We assume the existence of a training set Ã, whose rows are drawn from the same distribution as the rows of A. This is a natural assumption in the case that rows of A represent examples in training data, or structured subsets thereof (such as patches of images).
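For concreteness, the criterion in equation 2 can be measured as a relative Frobenius error. The helper below is an illustrative sketch; the function name and default α, β are ours, not part of the paper's released code.

```python
import numpy as np

def relative_frobenius_error(A, B, C_hat, alpha=1.0, beta=0.0):
    """Return ||alpha * C_hat + beta - AB||_F / ||AB||_F, i.e. the epsilon(tau)
    of equation 2 achieved by a particular approximation C_hat = f(g(A), h(B))."""
    exact = A @ B
    err = np.linalg.norm(alpha * C_hat + beta - exact, ord="fro")
    return err / np.linalg.norm(exact, ord="fro")
```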
2. Related Work

Because our work draws on ideas from randomized algorithms, approximate matrix multiplication, vector quantization, and other fields, the body of work related to our own is vast. Here, we provide only a high-level overview, and refer the interested reader to (Wang et al., 2016a; 2014a; Desai et al., 2016) for more detailed surveys. We also defer discussion of related vector quantization methods to the following sections.

2.1. Linear Approximation

Most AMM methods work by projecting A and B into lower-dimensional spaces and then performing an exact matrix multiply. One simple option for choosing the projection matrices is to use matrix sketching algorithms. The most prominent deterministic matrix sketching methods are the Frequent Directions algorithm (Liberty, 2012; Ghashami et al., 2016) and its many variations (Teng & Chu, 2019; Francis & Raimond, 2018b; Ye et al., 2016; Huang, 2019; Luo et al., 2019; Francis & Raimond, 2018a). There are also many randomized sketching methods (Sarlos, 2006; Kyrillidis et al., 2014; Pagh, 2013; Dasgupta et al., 2010; Nelson & Nguyên, 2013) and sampling methods (Drineas et al., 2006b;c).

A weakness of matrix sketching methods in the context of matrix multiplication is that they consider each matrix in isolation. To exploit information about both matrices simultaneously, Drineas et al. (2006a) sample columns of A and rows of B according to a sampling distribution dependent upon both matrices. Later work by Manne & Pal (2014) reduces approximation of the matrices to an optimization problem, which is solved by steepest descent. Mroueh et al. (2016), Ye et al. (2016), and Francis & Raimond (2018a) introduce variations of the Frequent Directions algorithm that take into account both matrices.

All of the above methods differ from our own not merely in specifics, but also in problem formulation. These methods all assume that there is no training set à and nearly all focus on large matrices, where provably reduced asymptotic complexity for a given level of error is the goal.

2.2. Hashing to Avoid Linear Operations

In the neural network acceleration literature, there have been several efforts to accelerate dense linear layers using some form of hashing (Spring & Shrivastava, 2017; Chen et al., 2019; Bakhtiary et al., 2015; Dean et al., 2013; Chen et al., 2015). These methods differ from our own in the hash functions chosen, in not exploiting a training set, and in the overall goal of the algorithm. While we seek to approximate the entire output matrix, these methods seek to either sample outputs (Spring & Shrivastava, 2017; Chen et al., 2019), approximate only the largest outputs (Bakhtiary et al., 2015; Dean et al., 2013), or implement a fixed, sparse linear operator (Chen et al., 2015).

3. Background - Product Quantization

To lay the groundwork for our own method, we begin by reviewing Product Quantization (PQ) (Jegou et al., 2011). PQ is a classic vector quantization algorithm for approximating inner products and Euclidean distances and serves as the basis for nearly all vector quantization methods similar to our own.

The basic intuition behind PQ is that aᵀb ≈ âᵀb, where ‖â − a‖ is small but â has special structure allowing the product to be computed quickly. This structure consists of â being formed by concatenating learned prototypes in disjoint subspaces; one obtains a speedup by precomputing the dot products between b and the prototypes once, and then reusing these values across many a vectors. The a vectors here are the (transposed) rows of A and the b vectors are the columns of B.

In somewhat more detail, PQ consists of the following:

1. Prototype Learning - In an initial, offline training phase, cluster the rows of A (or a training set Ã) using K-means to create prototypes. A separate K-means is run in each of C disjoint subspaces to produce C sets of K prototypes.
2. Encoding Function, g(a) - Determine the most similar prototype to a in each subspace. Store these assignments as integer indices using C log2(K) bits.
3. Table Construction, h(B) - Precompute the dot products between b and each prototype in each subspace. Store these partial dot products in C lookup tables of size K.
4. Aggregation, f(·,·) - Use the indices and tables to look up the estimated partial aᵀb in each subspace, then sum the results across all C subspaces.

PQ is depicted for a single pair of vectors a and b in Figure 2. We elaborate upon each of these steps below.

[Figure 2: Product Quantization. The g(·) function returns the index of the most similar prototype to the data vector a in each subspace. The h(·) function computes a lookup table of dot products between the query vector b and each prototype in each subspace. The aggregation function f(·,·) sums the table entries corresponding to each index.]

Prototype Learning: Let à ∈ R^{N×D} be a training set, K be a number of prototypes per subspace, C be a number of subspaces, and {J^(c)}_{c=1}^C be the mutually exclusive and collectively exhaustive sets of indices associated with each subspace. The training-time task of PQ is to learn C sets of prototypes P^(c) ∈ R^{K×|J^(c)|} and assignments z^(c) ∈ R^N such that

    Σ_{i=1}^N Σ_{c=1}^C Σ_{j∈J^(c)} ( Ã_{ij} − P^(c)_{z^(c)_i, j} )²    (3)

is minimized. It does this by running K-means separately in each subspace J^(c) and using the resulting centroids and assignments to populate P^(c) and z^(c).

Encoding Function, g(A): Given the learned prototypes, PQ replaces each row a of A with the concatenation of its C K-means centroid assignments in each of the C subspaces. Formally:

    g^(c)(a) ≜ argmin_k Σ_{j∈J^(c)} ( a_j − P^(c)_{k,j} )².    (4)

We will refer to the resulting sequence of indices as the encoding of a and the set of K centroids as a codebook. For convenience, we will also refer to the vector a^(c) ≜ ⟨a_j⟩, j ∈ J^(c), as the subvector of a in subspace c.

Table Construction, h(B): Using these same prototypes, PQ constructs a lookup table h^(c)(b) ∈ R^K in each of the C subspaces for each column b of B, where

    h^(c)(b)_k ≜ Σ_{j∈J^(c)} b_j P^(c)_{k,j}.    (5)

Existing work has shown that setting K = 16 and quantizing the lookup tables to 8 bits can offer enormous speedups compared to larger K and/or floating-point tables (Blalock & Guttag, 2017; André et al., 2017; 2019). This is because 16 1-byte entries can be stored in a SIMD register, allowing 16 or more table lookups to be performed in parallel using a byte shuffle instruction. Since the table entries naturally occupy more than 8 bits even for 8-bit data, some means of quantizing these entries is necessary. This can easily be done by subtracting off the minimum entry in each table and linearly rescaling such that the maximum entry in any table is at most 255. Ignoring rounding error, this affine transform is invertible, and is reflected by the constants α and β in equation 2. See Appendix A for further details.

Aggregation, f(·,·): Given the encoding of a and the lookup tables for b, the product can be approximated as

    aᵀb = Σ_{c=1}^C a^(c)ᵀ b^(c) ≈ Σ_{c=1}^C h^(c)(b)_k,   k = g^(c)(a).    (6)
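The four steps above can be tied together in a few dozen lines. The following NumPy/scikit-learn sketch is a plain reference implementation of PQ-based approximate multiplication (equations 3-6), not the 4-bit, SIMD-shuffled variant the paper builds on; the equal-width subspaces and helper names are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(A_tilde, C=4, K=16, seed=0):
    """Learn K prototypes per subspace via K-means (eq. 3)."""
    D = A_tilde.shape[1]
    subspaces = np.array_split(np.arange(D), C)
    protos = [KMeans(n_clusters=K, n_init=4, random_state=seed)
              .fit(A_tilde[:, J]).cluster_centers_ for J in subspaces]
    return subspaces, protos

def pq_encode(A, subspaces, protos):
    """g(A): index of the nearest prototype in each subspace (eq. 4)."""
    codes = np.empty((A.shape[0], len(subspaces)), dtype=np.int64)
    for c, (J, P) in enumerate(zip(subspaces, protos)):
        sub = A[:, J]                                          # (N, |J_c|)
        dists = ((sub[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
        codes[:, c] = dists.argmin(axis=1)
    return codes

def pq_tables(B, subspaces, protos):
    """h(B): dot products between each column of B and each prototype (eq. 5)."""
    # tables[m, c, k] = <B[J_c, m], P_c[k]>
    return np.stack([[P @ B[J, m] for J, P in zip(subspaces, protos)]
                     for m in range(B.shape[1])])

def pq_amm(codes, tables):
    """f(g(A), h(B)): sum the looked-up partial dot products (eq. 6)."""
    N, C = codes.shape
    out = np.zeros((N, tables.shape[0]))
    for c in range(C):
        out += tables[:, c, :][:, codes[:, c]].T               # (N, M) partial products
    return out

# tiny end-to-end demo with random data
rng = np.random.default_rng(0)
A_tilde = rng.normal(size=(2000, 64))
A, B = rng.normal(size=(500, 64)), rng.normal(size=(64, 8))
subspaces, protos = pq_train(A_tilde, C=8, K=16)
C_hat = pq_amm(pq_encode(A, subspaces, protos), pq_tables(B, subspaces, protos))
```

With K = 16, each assignment fits in four bits, which is what makes the byte-shuffle table lookups described above possible.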

4. Our Method

Product Quantization and its variants yield a large speedup with N, M ≫ D. However, we require an algorithm that only needs N ≫ M, D, a more relaxed scenario common when using linear models and transforms. In this setting, the preprocessing time g(A) can be significant, since ND may be similar to (or even larger than) NM.

To address this case, we introduce a new g(A) function that yields large speedups even on much smaller matrices. The main idea behind our function is to determine the "most similar" prototype through locality-sensitive hashing (Indyk & Motwani, 1998); i.e., rather than compute the Euclidean distance between a subvector a^(c) and each prototype, we hash a^(c) to one of K buckets where similar subvectors tend to hash to the same bucket. The prototypes are set to the means of the subvectors hashing to each bucket.

4.1. Hash Function Family, g(·)

Because we seek to exploit a training set while also doing far less work than even a single linear transform, we found that existing hash functions did not meet our requirements. Consequently, we designed our own family of trainable hash functions. The family of hash functions we choose is balanced binary regression trees, with each leaf of the tree acting as one hash bucket. The leaf for a vector x is chosen by traversing the tree from the root and moving to the left child if the value x_j at some index j is below a node-specific threshold v, and to the right child otherwise.

To enable the use of SIMD instructions, the tree is limited to 16 leaves and all nodes at a given level of the tree are required to split on the same index j. The number 16 holds across many processor architectures, and we refer the reader to Appendix B for further vectorization details.

Formally, consider a set of four indices j¹, . . . , j⁴ and four arrays of split thresholds v¹, . . . , v⁴, with vᵗ having length 2^{t−1}. A vector x is mapped to an index using Algorithm 1. This function is simple, only depends on a constant number of indices in the input vector, and can easily be vectorized provided that the matrix whose rows are being encoded is stored in column-major order.

Algorithm 1 MADDNESS HASH
1: Input: vector x, split indices j¹, . . . , j⁴, split thresholds v¹, . . . , v⁴
2: i ← 1                        // node index within level of tree
3: for t ← 1 to 4 do
4:   v ← vᵗ_i                   // lookup split threshold for node i at level t
5:   b ← x_{jᵗ} ≥ v ? 1 : 0     // above split threshold?
6:   i ← 2i − 1 + b             // assign to left or right child
7: end for
8: return i
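In Python, Algorithm 1 applied to every row of a matrix at once looks roughly as follows; this is a float, 0-based sketch with our own array layout, whereas the real encoder operates on quantized 8-bit values with SIMD shuffles (see Appendix B).

```python
import numpy as np

def maddness_hash(X, split_idxs, split_vals):
    """Map each row of X to one of 16 leaves of a 4-level balanced tree.

    split_idxs: length-4 sequence, the single feature index used at each level.
    split_vals: list of 4 arrays with 1, 2, 4, 8 thresholds (one per node of
                that level), i.e. split_vals[t][i] plays the role of v^{t+1}_i.
    """
    group_ids = np.zeros(X.shape[0], dtype=np.int64)   # node index, 0-based
    for t in range(4):
        thresholds = split_vals[t][group_ids]          # threshold for each row's node
        above = (X[:, split_idxs[t]] >= thresholds).astype(np.int64)
        group_ids = 2 * group_ids + above              # 0-based version of i <- 2i - 1 + b
    return group_ids                                    # bucket ids in [0, 16)
```

Because every row at a given level reads the same column X[:, split_idxs[t]], the comparison and index update vectorize cleanly, which is what the shared split index and column-major layout mentioned above are designed to enable.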
4.2. Learning the Hash Function Parameters

The split indices j¹, . . . , j⁴ and split thresholds v¹, . . . , v⁴ are optimized on the training matrix à using a greedy tree construction algorithm. To describe this algorithm, we introduce the notion of a bucket B_i^t, which is the set of vectors mapped to node i in level t of the tree. The root of the tree is level 0 and B_1^0 contains all the vectors. It will also be helpful to define the sum of squared errors (SSE) loss associated with a bucket, or a specific (index, bucket) pair:

    L(j, B) ≜ Σ_{x∈B} ( x_j − (1/|B|) Σ_{x′∈B} x′_j )²
    L(B) ≜ Σ_j L(j, B).    (7)

Using this notation, it suffices to characterize the learning algorithm by describing the construction of level t of the tree given the buckets B_1^{t−1}, . . . , B_{2^{t−1}}^{t−1} from the previous level. This procedure is given in Algorithm 2.

Algorithm 2 Adding The Next Level to the Hashing Tree
1: Input: buckets B_1^{t−1}, . . . , B_{2^{t−1}}^{t−1}, training matrix Ã
   // greedily choose next split index and thresholds
2: Ĵ ← heuristic_select_idxs(B_1^{t−1}, . . . , B_{2^{t−1}}^{t−1})
3: l^min, j^min, v^min ← ∞, NaN, NaN
4: for j ∈ Ĵ do
5:   l ← 0                        // initialize loss for this index to 0
6:   v ← [ ]                      // empty list of split thresholds
7:   for i ← 1 to 2^{t−1} do
8:     v_i, l_i ← optimal_split_threshold(j, B_i^{t−1})
9:     append(v, v_i)             // append threshold for bucket i
10:    l ← l + l_i                // accumulate loss from bucket i
11:  end for
12:  if l < l^min then
13:    l^min ← l, j^min ← j, v^min ← v    // new best split
14:  end if
15: end for
   // create new buckets using chosen split
16: B ← [ ]
17: for i ← 1 to 2^{t−1} do
18:   B_below, B_above ← apply_split(v_i^min, B_i^{t−1})
19:   append(B, B_below)
20:   append(B, B_above)
21: end for
22: return B, l^min, j^min, v^min

In line 2, we select a fixed number of indices to evaluate. Several heuristics are possible, including evaluating all indices. We found that simply selecting the top n indices that contributed the most loss summed across all buckets was difficult to beat. In preliminary experiments, we found that using n > 4 indices offered no clear benefit (and even choosing n = 1 was nearly as good), so we fix n = 4.

In lines 4-15, we find the minimal loss obtainable by splitting all buckets along that index, but with bucket-specific cutoffs. This loss is minimal not in the sense that it leads to the globally optimal tree, but in that it minimizes the sum of the losses in the buckets produced in this iteration. To do this, we invoke the subroutine optimal_split_threshold, which takes in a bucket B and an index j and tests all possible thresholds to find one minimizing L(j, B). This can be done in time O(|J^(c)| |B| log(|B|)). The pseudocode for this subroutine is given in Algorithms 3 and 4 in Appendix C.

Once a split index j and an array of split thresholds v are chosen, all that remains is to split the buckets to form the next level of the tree (lines 16-21). This entails forming two child buckets from each current bucket by grouping vectors whose jth entries are above or below the bucket's split threshold.
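The losses in equation 7 are simply per-column sums of squared deviations from the bucket mean; a direct (unoptimized) transcription for reference, with helper names of our own choosing:

```python
import numpy as np

def bucket_col_loss(X_bucket, j):
    """L(j, B): squared deviation of column j from its mean within the bucket."""
    col = X_bucket[:, j]
    return float(((col - col.mean()) ** 2).sum())

def bucket_loss(X_bucket):
    """L(B): sum of L(j, B) over all columns."""
    return float(((X_bucket - X_bucket.mean(axis=0)) ** 2).sum())
```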
4.3. Optimizing the Prototypes

At this point, we have a complete algorithm. We could simply drop our hash-based encoding function into PQ and approximate matrix products. However, we contribute two additional enhancements: a means of optimizing the prototypes with no runtime overhead, and a means of quickly summing low-bitwidth integers.

Several works propose prototype or table optimizations based on knowledge of B (Babenko et al., 2016; Wang et al., 2014b), and others optimize them at the expense of slowing down the function g(·) (Zhang et al., 2014; 2015). In contrast, we introduce a method that does neither of these. The idea is to choose prototypes such that à can be reconstructed from its prototypes with as little squared error as possible—this improves results since less error means that less information about à is being lost.

Let P ∈ R^{KC×D} be a matrix whose diagonal blocks of size K×|J^(c)| consist of the K learned prototypes in each subspace c. The training matrix à can be approximately reconstructed as à ≈ GP, where G serves to select the appropriate prototype in each subspace. Rows of G are formed by concatenating the one-hot encoded representations of each assignment for the corresponding row of Ã. For example, if a row were assigned prototypes ⟨3 1 2⟩ with K = 4, C = 3, its row in G would be ⟨0010 1000 0100⟩ ∈ R^{12}. Our idea is to optimize P conditioned on G and Ã. This is an ordinary least squares problem, and we solve it with ridge regression:

    P ≜ (GᵀG + λI)^{−1} Gᵀ Ã.    (8)

One could obtain better performance by cross-validating to find λ, but for simplicity, we fix λ = 1.

This procedure allows the prototypes to be nonzero outside of their original subspaces. Because of our hashing procedure, we avoid the dramatically increased overhead faced by other methods with non-orthogonal prototypes (c.f. Babenko & Lempitsky, 2015; 2014; Zhang et al., 2014; Liu et al., 2016; Martinez et al., 2016; 2014).
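A sketch of this optimization step: build G from the per-subspace bucket assignments and solve equation 8. The variable names are ours, and as above λ is fixed to 1.

```python
import numpy as np

def optimize_prototypes(A_tilde, codes, C, K, lam=1.0):
    """Return P of shape (K*C, D) from eq. 8, where G one-hot encodes the
    K-way assignment of each training row in each of the C subspaces."""
    N, D = A_tilde.shape
    G = np.zeros((N, K * C))
    for c in range(C):
        G[np.arange(N), c * K + codes[:, c]] = 1.0     # one-hot block for subspace c
    # P = (G^T G + lam*I)^{-1} G^T A_tilde, computed via a linear solve
    gram = G.T @ G + lam * np.eye(K * C)
    return np.linalg.solve(gram, G.T @ A_tilde)
```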
4.4. Fast 8-bit Aggregation, f(·,·)

Let T ∈ R^{M×C×K} be the tensor of lookup tables for all M columns of B. Given the encodings G, the function f(·,·) is defined as

    f(g(A), h(B))_{n,m} ≜ Σ_{c=1}^C T_{m,c,k},   k = g^(c)(a_n).    (9)

Because the entries of T are stored as 8-bit values, exact summation requires immediately upcasting each looked-up entry to 16 bits before performing any addition instructions (Blalock & Guttag, 2017). This not only imposes overhead directly, but also means that one must perform 16-bit additions, which have half the throughput of 8-bit additions.

We propose an alternative that sacrifices a small amount of accuracy for a significant increase in speed. Instead of using addition instructions, we use averaging instructions, such as vpavgb on x86 or vrhadd on ARM. While non-saturating additions compute (a + b) % 256, these instructions compute (a + b + 1) / 2. This means that they lose information about the low bit instead of the high bit of the sum. We estimate the overall mean by averaging pairs of values, then pairs of pairs, and so on. We refer the reader to Appendix D for details.

The challenging part of this approach is computing the bias in the estimated sum in order to correct for it. We prove in Appendix D that this bias has a closed-form solution under the realistic assumption that the low bits are equally likely to be 0 or 1.
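The averaging-based reduction can be emulated in NumPy as below; avg_u8 mimics the (a + b + 1)/2 semantics of vpavgb/vrhadd, and the tree of averages, scaled back up by the number of inputs, estimates the exact sum with the small upward bias analyzed in Appendix D. This is an illustration of the idea, not the SIMD kernel.

```python
import numpy as np

def avg_u8(a, b):
    """Rounded 8-bit average, matching vpavgb/vrhadd: (a + b + 1) // 2."""
    return ((a.astype(np.uint16) + b.astype(np.uint16) + 1) >> 1).astype(np.uint8)

def tree_average(cols):
    """Reduce a list of uint8 arrays (one per codebook) by pairwise averaging.
    Assumes len(cols) is a power of two; multiplying the result by len(cols)
    gives a slightly biased estimate of the exact sum."""
    while len(cols) > 1:
        cols = [avg_u8(cols[i], cols[i + 1]) for i in range(0, len(cols), 2)]
    return cols[0]

rng = np.random.default_rng(0)
lut_vals = [rng.integers(0, 256, size=1000, dtype=np.uint8) for _ in range(16)]
est_sum = 16 * tree_average(lut_vals).astype(np.int32)
true_sum = np.sum([c.astype(np.int32) for c in lut_vals], axis=0)
# est_sum overshoots true_sum by a small, predictable amount (see Appendix D)
```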
4.5. Theoretical Guarantees

Our central theoretical result is a generalization guarantee for the overall approximation error of MADDNESS, stated below. See Appendix F for a proof and additional analysis, including a discussion of algorithmic complexity. Besides this main guarantee, we also inherit all of the guarantees for Bolt (Blalock & Guttag, 2017), modulo a small amount of additional error from averaging integers rather than summing exactly. This follows from Bolt's guarantees depending only on the quantization errors, rather than the method used to obtain them.

Theorem 4.1 (Generalization Error of MADDNESS). Let D be a probability distribution over R^D and suppose that MADDNESS is trained on a matrix à ∈ R^{N×D} whose rows are drawn independently from D and with maximum singular value bounded by σ_A. Let C be the number of codebooks used by MADDNESS and λ > 0 be the regularization parameter used in the ridge regression step. Then for any b ∈ R^D, any a ∼ D, and any 0 < δ < 1, we have with probability at least 1 − δ that

    E_D[L(a, b)] ≤ E_Ã[L(a, b)] + (C σ_A ‖b‖_2 / (2√λ)) · ( 1/256 + (8 + √(ν(C, D, δ))) / √(2n) )    (10)

where L(a, b) ≜ |aᵀb − α f(g(a), h(b)) − β|, α is the scale used for quantizing the lookup tables, β is the constants used in quantizing the lookup tables plus the debiasing constant of Section 4.4, and

    ν(C, D, δ) ≜ C (4 ⌈log2(D)⌉ + 256) log 2 − log δ.    (11)
To assess M ADDNESS’s effectiveness, we implemented • ScalarQuantize. The matrices are not projected, but in-
both it and existing algorithms in C++ and Python. All of stead linearly quantized to eight bits such that the small-
our code and raw numerical results are publicly available est and largest entries map to either 0 and 255 or -128 and
at https://smarturl.it/Maddness. All experi- 127, as appropriate. We use FBGEMM (Khudia et al.,
ments use a single thread on a Macbook Pro with a 2.6GHz 2018) to perform the quantized matrix multiplies. We
Intel Core i7-4960HQ processor. Unless stated otherwise, neglect the time required to convert from other formats
all timing results use five trials, with each trial reporting the to eight bits, reflecting the optimistic scenario in which
fastest among 20 executions. We use the best, rather than the matrices are already of the appropriate types.
the average, since this is standard practice in performance • Bolt (Blalock & Guttag, 2017). Bolt is the most similar
benchmarking and is robust to the purely additive noise in- method to our own, differing only in the encoding func-
troduced by competing CPU tasks. Standard deviations are tion, the use of averaging instead of upcasting, and the
shown for all curves as shaded areas. Since training can optimization of centroids.
be performed offline and all methods except SparsePCA
(Mairal et al., 2009) train in at most a few minutes, we • Exact Multiplication. We simply compute the matrix
omit profiling of training times. We also do not profile the product AB using a modern BLAS implementation.
time to preprocess B, since 1) this time is inconsequential • M ADDNESS-PQ. A handicapped version of
in most cases, and 2) B is fixed and could be processed of- M ADDNESS without the prototype optimization step.
fline in all the problems we consider. In order to avoid im- The gap between M ADDNESS and M ADDNESS-PQ is
plementation bias, we build upon the source code provided the gain from optimizing the prototypes.
by Blalock & Guttag (2017)2 , which includes highly tuned
implementations of many algorithms to which we compare. We also compared to many additional methods (see Ap-
pendix E), but omit their results since they were not com-
2
https://github.com/dblalock/bolt petitive with those listed here.
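For reference, the V matrices for two of these baselines can be constructed in a few lines. The sketch below, with illustrative d and random seed, is ours and not the benchmarked implementations.

```python
import numpy as np

def pca_projection(A_train, d):
    """PCA baseline: V = top-d principal directions of the training matrix."""
    _, _, Vt = np.linalg.svd(A_train - A_train.mean(axis=0), full_matrices=False)
    return Vt[:d].T                              # (D, d)

def hashjl_projection(D, d, seed=0):
    """HashJL baseline: each row of V holds a single +/-1 at a random column."""
    rng = np.random.default_rng(seed)
    V = np.zeros((D, d))
    V[np.arange(D), rng.integers(0, d, size=D)] = rng.choice([-1.0, 1.0], size=D)
    return V

def project_and_multiply(A, B, V):
    return (A @ V) @ (V.T @ B)                   # AB ~= (A V)(V^T B)
```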
5.2. How Fast is MADDNESS?

We begin by profiling the raw speed of our method. In Figure 3, we time the g(A) functions for various vector quantization methods. The A matrices have 2^14 rows and varying numbers of columns D. Following Blalock & Guttag (2017), we also vary the number of codebooks C, profiling 8-, 16-, and 32-byte encodings. We measure in bytes rather than codebooks since PQ and OPQ use eight bits per codebook while Bolt and MADDNESS use four.

MADDNESS is up to two orders of magnitude faster than existing methods, and its throughput increases with row length. This latter property is because its encoding cost per row is O(C) rather than O(D).

[Figure 3: Speed of g() functions for different encoding sizes. MADDNESS encodes the A matrix orders of magnitude more quickly than existing vector quantization methods. Panels plot encoding speed (GB/s) versus the number of columns in matrix A (0-1000) for 8B, 16B, and 32B encodings, comparing MADDNESS, Bolt, PQ, and OPQ.]

We also profile the speed of our aggregation function f(·,·) using the same baselines as Blalock & Guttag (2017). As Figure 4 shows, our average-based, matrix-aware aggregation is significantly faster than the upcasting-based method of Bolt, its nearest rival.

[Figure 4: Speed of f() functions for different encoding sizes. Given the preprocessed matrices, MADDNESS computes the approximate output twice as fast as the fastest existing method. The plot shows billions of dot products per second for MADDNESS, Bolt, Popcount, and PQ/OPQ with 8B, 16B, 32B, and 64B codes.]

5.3. Softmax Classifier

As described in Section 1, we approximated linear classifiers on the widely used CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). The classifiers use as input features the 512-dimensional activations of open-source, VGG-like neural networks trained on each dataset (Geifman, 2018). The matrices A are the 10000 × 512-dimensional floating point activations for the full test sets, and the matrices B are each network's final dense layer. The 50000 × 512-dimensional activations from the training set serve as the training matrices Ã. As shown in Figure 5, MADDNESS significantly outperforms all existing methods, achieving virtually the same accuracy as exact multiplication more than an order of magnitude faster.

[Figure 5: Approximating softmax classifiers. MADDNESS achieves a far better speed-accuracy tradeoff than any existing method when approximating two softmax classifiers. Panels plot classification accuracy on CIFAR-10 and CIFAR-100 against speedup over an exact matrix multiply for MADDNESS, MADDNESS-PQ, Exact, ScalarQuantize, Bolt, FastJL, HashJL, PCA, and SparsePCA.]

Moreover, our method achieves this performance despite having worse support from the hardware. More precisely, it obtains speedups much smaller than the level of compression it provides. For example, the third points from the right in both plots correspond to speedups of roughly 100×. However, they are compressing each 512 × 4B = 2048B row of the input down to a mere 4B, a savings of 512× (sizes not shown in the figure). If the hardware could lookup-accumulate as many bytes per cycle as it can multiply-accumulate, our method could be over 4× faster. Combined with the fact that multiplexers require many fewer transistors than multipliers, this suggests that a hardware implementation of our method might offer large efficiency gains compared to existing accelerators.
5.4. Kernel-Based Classification

To assess the efficacy of our method on a larger and more diverse set of datasets than CIFAR-10 and CIFAR-100, we trained kernel classifiers on the datasets from the UCR Time Series Archive (Dau et al., 2018). To enable meaningful speed comparison across datasets, we resampled the time series in all datasets to the median length and obtained the matrix B for each dataset by running Stochastic Neighbor Compression (Kusner et al., 2014) on the training set with an RBF kernel of bandwidth one. We approximate the Euclidean distances used by the kernel via the identity ‖x − y‖²_2 = ‖x‖²_2 − 2xᵀy + ‖y‖²_2, which consists only of dot products. This is not the state-of-the-art means of classifying time series, but it does yield fixed-sized matrices and is representative of several modern techniques for constructing highly efficient classifiers (Kusner et al., 2014; Wang et al., 2016b; Zhong et al., 2017; Gupta et al., 2017). Because Stochastic Neighbor Compression optimizes the classifiers to avoid redundancy, this task is quite difficult.

As shown in Figure 6, MADDNESS is significantly faster than alternatives at a given level of accuracy. A counterintuitive result, however, is that optimization of the prototypes occasionally reduces accuracy—see the red line dipping below the blue one in the lowest subplot. Since the optimization strictly increases the expressive power, we believe that this is a product of overfitting and could be corrected by not fixing λ = 1 in the ridge regression.

[Figure 6: Approximating an RBF kernel classifier. Each panel plots the fraction of UCR datasets for which each method preserves a given fraction (0.5, 0.75, or 0.95) of the original accuracy, as a function of speedup over an exact matrix multiply. MADDNESS enables much greater speedups for a given level of accuracy degradation.]
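This identity reduces the kernel computation to the dot products that MADDNESS approximates; a direct NumPy transcription (ours, for illustration) follows.

```python
import numpy as np

def squared_dists_from_dots(A, B):
    """Pairwise ||a - b||^2 for rows a of A and columns b of B, using
    ||a - b||^2 = ||a||^2 - 2 a^T b + ||b||^2, so only A @ B needs approximating."""
    dots = A @ B                                   # the only expensive term
    a_norms = (A ** 2).sum(axis=1, keepdims=True)  # (N, 1)
    b_norms = (B ** 2).sum(axis=0, keepdims=True)  # (1, M)
    return a_norms - 2.0 * dots + b_norms
```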
5.5. Image Filtering

To test the extreme limits of MADDNESS, we benchmarked the various techniques' ability to apply small filters to images (after an offline im2row transform to reduce the task to matrix multiplication). This task is extreme in that D and M are tiny, affording almost no opportunity to amortize preprocessing costs. As representative example filters, we chose 3 × 3 Sobel kernels and 5 × 5 Gaussian kernels. These are common high-pass and low-pass filters, respectively. We took the first 10 images from the first 50 classes of the Caltech101 dataset (Fei-Fei et al., 2004) as a single training set, and the first 10 images from the remaining 51 classes as 510 test sets. We constructed the A matrices by extracting each patch of each image as one row. The B matrices have two columns, corresponding to one pair of Sobel or Gaussian filters (since using these filters in pairs is common). We report the normalized mean-squared error (NMSE), defined as ‖Ĉ − AB‖²_F / ‖AB‖²_F, where Ĉ is the method's estimate of AB. An NMSE of 0 is perfect and an NMSE of 1 corresponds to always predicting 0.

In Figure 7, we see that it is only MADDNESS that offers any advantage over exact matrix products. This is likely because two columns afford almost no time to preprocess A; indeed, rival vector quantization methods cannot logically do less work than brute force in this setting, and dense linear methods can only save work by embedding rows of A in one-dimensional space. MADDNESS performs much worse on the high-pass filters (top) than the low-pass filters (bottom). This is likely because the former produce outputs with variance that is orders of magnitude lower than that of the original image, making the NMSE denominator tiny.

[Figure 7: Approximating a Sobel filter (top) and a Gaussian filter (bottom); panels plot 1 − NMSE against speedup over an exact matrix multiply. Despite there being only two columns in the matrix B, MADDNESS still achieves a significant speedup with reasonable accuracy. Methods that are Pareto dominated by exact matrix multiplication on both tasks are not shown; this includes all methods but MADDNESS and SparsePCA.]

6. Discussion and Conclusion

Because our work draws on a number of different fields but does not fit cleanly into any of them, it is useful to discuss what we have and have not demonstrated, as well as possible implications and extensions of our work.

Our main empirical finding is that our proposed method, MADDNESS, achieves order-of-magnitude speedups compared to existing AMM methods and up to two-order-of-magnitude speedups compared to the dense baseline. It also compresses matrices by up to three orders of magnitude. These results are evaluated on a CPU, and are obtainable only when there is a training set for one matrix. We also claim superior performance only when one matrix is larger than the other, and both matrices are tall—the regime wherein our extremely fast (but less accurate) encoding function is beneficial. Our method also loses utility when the larger matrix is known ahead of time; this assumption is common in similarity search, and eliminates the need for a fast encoding function entirely. Our approximate integer summation and fused table lookups would likely be useful independent of any of these assumptions, but demonstrating this is future work.

We also have several theoretical findings, taking the form of guarantees regarding the errors introduced by our method and its constituent subroutines. While we do obtain an overall generalization guarantee, this guarantee is not tight. In particular, it should grow looser with the large matrix's Frobenius norm and tighter as its singular values become more concentrated; at present, however, it simply grows looser as the largest singular value grows. The missing step is a guarantee that our encoding function will yield lower quantization errors when the singular values are more concentrated, which is its behavior in practice.

We have not demonstrated results using GPUs or other accelerators. While such accelerators are a small minority of hardware, they are often used in machine learning. Our method is not inherently tied to CPUs, but the differing performance characteristics of accelerators mean that adapting our method to them would require both algorithmic and implementation work, with the details depending on the device. We also have not evaluated a multi-CPU-threaded extension of our algorithm, though this is because our method is intended to serve as the low-level, compute-bound, block matrix product routine called by individual threads.

Finally, we have not demonstrated results using convolutional layers in neural networks, or results accelerating full networks. The weight reuse in convolutional layers presents many opportunities for algorithmic optimizations, and we hope to exploit them using a specialized extension of our method in future work. Accelerating overall networks will require two significant undertakings: first, the engineering work of building and integrating custom operators, data layouts, etc., into existing frameworks and networks; and second, the research necessary to determine when, how, and to what extent to include approximate kernels inspired by our approach. A particular difficulty with the latter is that our hash function is not differentiable.

We believe that accelerating full networks with our ideas is a promising direction, particularly for inference. This is especially true at the hardware level—our method requires only multiplexers, not multipliers, and can therefore be implemented easily and with far less power than current matrix product logic. Moreover, our encoded representation and lookup tables have contiguous and uniformly-sized elements, making our approximate GEMM inner loops nearly identical to their dense counterparts—i.e., there is no need for complex access patterns or sparsity handling.

In summary, we introduced MADDNESS, an algorithm that achieves up to a 10× better speed-quality tradeoff than existing methods for the well-studied problem of approximate matrix multiplication (AMM), as measured on a large, diverse, and challenging set of real-world matrices. Our approach is a significant departure from existing AMM work in that it relies on hashing and table lookups rather than multiply-add operations. Our results suggest that future methods similar to our own might hold promise for accelerating convolution, deep learning, and other workloads bottlenecked by linear transforms.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Achlioptas, D. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 274–281, 2001.

Ailon, N. and Chazelle, B. The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors. SIAM Journal on Computing (SICOMP), 39(1):302–322, 2009. doi: 10.1137/060673096.

André, F., Kermarrec, A.-M., and Le Scouarnec, N. Accelerated nearest neighbor search with quick adc. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 159–166, 2017.

André, F., Kermarrec, A.-M., and Le Scouarnec, N. Quicker adc: Unlocking the hidden potential of product quantization with simd. IEEE transactions on pattern analysis and machine intelligence, 2019.
Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931–938, 2014.

Babenko, A. and Lempitsky, V. Tree Quantization for Large-Scale Similarity Search and Classification. In CVPR, pp. 1–9, 2015.

Babenko, A., Arandjelović, R., and Lempitsky, V. Pairwise quantization. arXiv preprint arXiv:1606.01550, 2016.

Bakhtiary, A. H., Lapedriza, A., and Masip, D. Speeding up neural networks for large scale classification using wta hashing. arXiv preprint arXiv:1504.07488, 2015.

Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Blalock, D. W. and Guttag, J. V. Bolt: Accelerated data mining with fast vector compression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–735. ACM, 2017.

Blalock, D. W., Ortiz, J. J. G., Frankle, J., and Guttag, J. V. What is the state of neural network pruning? In Dhillon, I. S., Papailiopoulos, D. S., and Sze, V. (eds.), Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020. URL https://proceedings.mlsys.org/book/296.pdf.

Camacho, J., Smilde, A., Saccenti, E., and Westerhuis, J. All sparse pca models are wrong, but some are useful. part i: Computation of scores, residuals and explained variance. Chemometrics and Intelligent Laboratory Systems, 196:103907, 2020.

Chen, B., Medini, T., and Shrivastava, A. Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. arXiv preprint arXiv:1903.03129, 2019.

Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., and Chen, Y. Compressing neural networks with the hashing trick. In ICML, pp. 2285–2294, 2015.

Chen, Y.-H., Emer, J., and Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM SIGARCH Computer Architecture News, 44(3):367–379, 2016.

Dasgupta, A., Kumar, R., and Sarlós, T. A sparse Johnson-Lindenstrauss transform. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 341–350, 2010.

Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A., Yanping, Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., and Hexagon-ML. The ucr time series classification archive, October 2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.

Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., and Yagnik, J. Fast, accurate detection of 100,000 object classes on a single machine. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1814–1821, 2013.

Desai, A., Ghashami, M., and Phillips, J. M. Improved practical matrix sketching with guarantees. IEEE Transactions on Knowledge and Data Engineering, 28(7):1678–1690, 2016.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. SIAM Journal on Computing, 36(1):132–157, January 2006a. doi: 10.1137/S0097539704442684.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix. SIAM Journal on Computing, 36(1):158–183, January 2006b. doi: 10.1137/S0097539704442696.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition. SIAM Journal on Computing, 36(1):184–206, January 2006c. doi: 10.1137/S0097539704442702.

Dutta, S., Cadambe, V., and Grover, P. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances In Neural Information Processing Systems, pp. 2100–2108, 2016.

Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pp. 178–178. IEEE, 2004.

Francis, D. P. and Raimond, K. An improvement of the parameterized frequent directions algorithm. Data Mining and Knowledge Discovery, 32(2):453–482, March 2018a. doi: 10.1007/s10618-017-0542-x.

Francis, D. P. and Raimond, K. A practical streaming approximate matrix multiplication algorithm. Journal of King Saud University - Computer and Information Sciences, September 2018b. doi: 10.1016/j.jksuci.2018.09.010.

Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence, 36(4):744–755, 2014.

Geifman, Y. cifar-vgg, 3 2018. https://github.com/geifmany/cifar-vgg.

Ghashami, M., Liberty, E., Phillips, J. M., and Woodruff, D. P. Frequent Directions: Simple and Deterministic Matrix Sketching. SIAM Journal on Computing, 45(5):1762–1792, January 2016. doi: 10.1137/15M1009718.

Guennebaud, G., Jacob, B., et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

Gupta, C., Suggala, A. S., Goyal, A., Simhadri, H. V., Paranjape, B., Kumar, A., Goyal, S., Udupa, R., Varma, M., and Jain, P. Protonn: Compressed and accurate knn for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1331–1340. JMLR.org, 2017.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. Eie: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pp. 243–254. IEEE Press, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.

Huang, Z. Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices. Journal of Machine Learning Research, 20(1):23, February 2019.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, 1998.

Irony, D., Toledo, S., and Tiskin, A. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.

Ji, J., Li, J., Yan, S., Zhang, B., and Tian, Q. Super-bit locality-sensitive hashing. In Advances in Neural Information Processing Systems, pp. 108–116, 2012.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800, 2009.

Khudia, D., Basu, P., and Deng, S. Open-sourcing fbgemm for state-of-the-art server-side inference, 2018.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Kusner, M., Tyree, S., Weinberger, K., and Agrawal, K. Stochastic neighbor compression. In International Conference on Machine Learning, pp. 622–630, 2014.

Kyrillidis, A., Vlachos, M., and Zouzias, A. Approximate Matrix Multiplication with Application to Linear Embeddings. arXiv:1403.7683 [cs, math, stat], March 2014. URL http://arxiv.org/abs/1403.7683.
Liberty, E. Simple and Deterministic Matrix Sketching. arXiv:1206.0594 [cs], June 2012. URL http://arxiv.org/abs/1206.0594.

Liu, S., Shao, J., and Lu, H. Generalized Residual Vector Quantization for Large Scale Data. In Proceedings - IEEE International Conference on Multimedia and Expo, 2016. doi: 10.1109/ICME.2016.7552944.

Luo, L., Chen, C., Zhang, Z., Li, W.-J., and Zhang, T. Robust Frequent Directions with Application in Online Learning. Journal of Machine Learning Research, 20(1):41, February 2019.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pp. 689–696, 2009.

Manne, S. and Pal, M. Fast Approximate Matrix Multiplication by Solving Linear Systems. arXiv:1408.4230 [cs], August 2014. URL http://arxiv.org/abs/1408.4230.

Martinez, J., Hoos, H. H., and Little, J. J. Stacked quantizers for compositional vector compression. arXiv preprint arXiv:1411.2173, 2014.

Martinez, J., Clement, J., Hoos, H. H., and Little, J. J. Revisiting additive quantization. In European Conference on Computer Vision, pp. 137–153. Springer, 2016.

Mroueh, Y., Marcheret, E., and Goel, V. Co-Occuring Directions Sketching for Approximate Matrix Multiply. arXiv:1610.07686 [cs], October 2016. URL http://arxiv.org/abs/1610.07686.

Nelson, J. and Nguyên, H. L. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 ieee 54th annual symposium on foundations of computer science, pp. 117–126. IEEE, 2013.

Pagh, R. Compressed matrix multiplication. ACM Transactions on Computation Theory, 5(3):1–17, August 2013. doi: 10.1145/2493252.2493254.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. Scnn: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News, 45(2):27–40, 2017.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.

Sarlos, T. Improved Approximation Algorithms for Large Matrices via Random Projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 143–152, Berkeley, CA, October 2006. IEEE. doi: 10.1109/FOCS.2006.37.

Spring, R. and Shrivastava, A. Scalable and sustainable deep learning via randomized hashing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 445–454, 2017.

Teng, D. and Chu, D. A Fast Frequent Directions Algorithm for Low Rank Approximation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1279–1293, June 2019. doi: 10.1109/TPAMI.2018.2839198.

Wang, J., Shen, H. T., Song, J., and Ji, J. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014a.

Wang, J., Shen, H. T., Yan, S., Yu, N., Li, S., and Wang, J. Optimized distances for binary code ranking. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 517–526, 2014b.

Wang, J., Liu, W., Kumar, S., and Chang, S.-F. Learning to hash for indexing big data—a survey. Proceedings of the IEEE, 104(1):34–57, 2016a.

Wang, W., Chen, C., Chen, W., Rai, P., and Carin, L. Deep metric learning with data summarization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 777–794. Springer, 2016b.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 331–344. IEEE, 2019.

Ye, Q., Luo, L., and Zhang, Z. Frequent Direction Algorithms for Approximate Matrix Multiplication with Applications in CCA. In IJCAI, pp. 7, 2016.

Yu, Q., Maddah-Ali, M., and Avestimehr, S. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, pp. 4403–4413, 2017.

Yu, Q., Ali, M., and Avestimehr, A. S. Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding. IEEE Transactions on Information Theory, 2020.
Zhang, T., Du, C., and Wang, J. Composite Quantization
for Approximate Nearest Neighbor Search. Proceedings
of the 31st International Conference on Machine Learn-
ing (ICML-14), 32:838–846, 2014.
Zhang, T., Qi, G.-J., Tang, J., and Wang, J. Sparse compos-
ite quantization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4548–
4556, 2015.

Zhong, K., Guo, R., Kumar, S., Yan, B., Simcha, D., and
Dhillon, I. Fast classification with binary prototypes.
In Artificial Intelligence and Statistics, pp. 1255–1263,
2017.

Zou, H. and Xue, L. A selective overview of sparse principal component analysis. Proceedings of the IEEE, 106(8):1311–1320, 2018.
A. Quantizing Lookup Tables

Since the lookup table entries naturally occupy more than 8 bits even for 8-bit data (since products of 8-bit values require 16 bits), some means of quantizing these entries is necessary to enable vectorization. Unfortunately, existing quantization methods are not applicable to our problem setting. The scheme of Blalock & Guttag (2017) requires knowledge of B at training time, while the scheme of André et al. (2017) and André et al. (2019) is only applicable for nearest-neighbor search. We instead use the following approach, where T ∈ R^{M×C×K} is the tensor of lookup tables for all M columns of B, T^q is the quantized version of T, δ ∈ R^C is a vector of table-specific offsets, and α^{−1} is an overall scale factor:

    δ_c ≜ min_{m,k} T_{m,c,k}    (12)
    α^{−1} ≜ 2^l,   l = max_c ⌊ log2( 255 / max_{m,k}(T_{m,c,k} − δ_c) ) ⌋    (13)
    T^q_{m,c,k} ≜ α^{−1} (T_{m,c,k} − δ_c).    (14)

This is similar to equations 15 and 16, but with the scale factor pooled across all codebooks instead of unique to each input column. The α used here is the same as that in equation 2, and the matrix β in equation 2 has entries equal to Σ_c δ_c (plus the debiasing constant from our averaging-based aggregation).
4: ∀i swap(Xi,d , XN −i+1,d )
B. Quantization and M ADDNESS H ASH 5: end if
6: out ← empty(N )
The use of at most 16 leaves is so that the resulting codes 7: cumX ← empty(D)
use 4 bits. This allows the use of these same shuffle instruc- 8: cumX2 ← empty(D)
tions to accelerate the table lookups as in Blalock & Guttag // Initialize first row of output and cumulative values
(2017). 9: out1 ← 0
The only subtlety in vectorizing our hash function is that 10: for d ← 1 to D do
one must execute line 4 using shuffle instructions such as 11: cumXd ← X1,d
vpshufb on x86, vtbl on ARM, or vtbl on PowerPC. 12: cumX2d ← (X1,d )2
In order to do this, the split values and scalars xj t must be 13: end for
8-bit integers. We quantize them by learning for each split // Compute remaining output rows
index j a pair of scalars (γj , δj ), where 14: for n ← 2 to N do
15: outn ← 0
δj , min vij (15) 16: for d ← 1 to D do
i
$ !% 17: cumXd ← cumXd + X1,d
255 18: cumX2d ← cumX2d + (X1,d )2
γj , 2l , l = log2 (16)
maxi vij − δj 19: outn ← outn +cumX2d −(cumXd ×cumXd /n)
20: end for
This restriction of γj to powers of two allows one to quan- 21: end for
tize xj t values with only shifts instead of multiplies. The 22: return out
v values can be quantized at the end of the training phase,
while the xj t values must be quantized within Algorithm 1
before line 5.
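As a runnable counterpart to Algorithms 3 and 4, here is a vectorized NumPy sketch; the function names are ours, and cumulative sums replace the explicit loops (the extra reversal of sses_tail aligns its indices with those used on line 7 of Algorithm 3).

import numpy as np

def cumulative_sse(X, reverse=False):
    # Algorithm 4: out[n] = sum of squared errors of rows X[0..n] about their
    # column means, using only running sums of values and squared values.
    if reverse:
        X = X[::-1]
    counts = np.arange(1, X.shape[0] + 1)[:, None]
    cum_x = np.cumsum(X, axis=0)
    cum_x2 = np.cumsum(X * X, axis=0)
    return (cum_x2 - cum_x * cum_x / counts).sum(axis=1)

def optimal_split_threshold(bucket_points, j):
    # Algorithm 3: best threshold for splitting a bucket along dimension j.
    X = np.asarray(bucket_points, dtype=np.float64)
    X_sort = X[np.argsort(X[:, j], kind="stable")]
    sses_head = cumulative_sse(X_sort)                       # SSE of rows [0..n]
    sses_tail = cumulative_sse(X_sort, reverse=True)[::-1]   # SSE of rows [n..N-1]
    losses = sses_head.copy()
    losses[:-1] += sses_tail[1:]        # left-child SSE plus right-child SSE
    n_best = int(np.argmin(losses))
    nxt = min(n_best + 1, X_sort.shape[0] - 1)
    return 0.5 * (X_sort[n_best, j] + X_sort[nxt, j]), losses[n_best]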

D. Aggregation Using Pairwise Averages

Recall that we estimate sums of low-bitwidth integers by averaging pairs of values, then pairs of pairs, and so on. One could reduce all C values this way, but we find that one obtains a better speed-accuracy tradeoff by computing the average of blocks of U values and then upcasting to obtain exact sums of these averages. Multiplying this sum of averages by U and adding in a bias correction term gives one the overall estimate of the sum. One could tune U for a particular problem and hardware, but we simply set U = 16 in all our experiments. Having a larger U imposes less overhead because upcasting happens less often, but there are sharp diminishing returns to this; once upcasting is rare, doing it even less often is of little help thanks to Amdahl's law.

Because of our assumption that we are operating on matrices, rather than a matrix and a vector, we can also improve on the aggregation of existing methods (Blalock & Guttag, 2017; André et al., 2017; 2019) by fusing the aggregation of two or more output columns to hide read latency. Conceptually, this amounts to tiling the loop over output columns and alternating reads between the two corresponding tables within the innermost loop. This fusion does not change the output of the computation—only how efficiently it runs.

Having addressed these practical details, we may now proceed to the analysis of our estimator's bias.

Definition D.1 (Averaging Integer Sum Estimator). Let x ∈ {0, 1}^C, C % U = 0, U = 2^p, p ≥ 0. The Averaging Integer Sum Estimator (AISE) ŝ(x) is defined as:

  ŝ(x) ≜ Σ_{k=1}^{C/U} ŝ_U(x_{i_k:j_k})    (17)

  ŝ_U(x) ≜ x₁ if x ∈ R, and ŝ_U(x) ≜ ⌊ ½ (ŝ_U(x_left) + ŝ_U(x_right) + 1) ⌋ otherwise    (18)

where i_k = (k − 1) · U + 1, j_k = i_k + U, and x_left and x_right denote vectors formed by taking the initial and final D/2 indices of a given x ∈ R^D.

Definition D.2 (Pairwise Integer Sum and Sum Estimator). For integers a and b, define

  s(a, b) ≜ a + b    (19)
  ŝ(a, b) ≜ 2μ(a, b)    (20)

where μ(a, b) ≜ ⌊ ½ (a + b + 1) ⌋.

Lemma D.1 (Bias when averaging one pair). Consider two scalars a and b, with a, b ∼ Bernoulli(.5) i.i.d. Define ε(a, b) ≜ ŝ(a, b) − s(a, b). Then

  E[ε(a, b)] = 1/2

Proof. The proof follows immediately from considering the four equiprobable realizations of the pair a, b. In the cases (0, 0) and (1, 1), 2μ(a, b) = s(a, b). In the cases (0, 1) and (1, 0), 2μ(a, b) = 2, while s(a, b) = 1.

Lemma D.2 (Variance of error when averaging one pair). Consider two scalars a and b, a, b ∼ Bernoulli(.5) i.i.d. Then

  E[ε(a, b)²] − E[ε(a, b)]² = 1/4

Proof. Using Lemma D.1, the above can be rewritten as:

  E[ε(a, b)²] = 1/2

The proof then follows by again considering the four equiprobable cases as in Lemma D.1. In the cases (0, 0) and (1, 1), ε(a, b)² = 0. In the cases (0, 1) and (1, 0), ε(a, b)² = (2μ(a, b) − s(a, b))² = (2 − 1)² = 1.

Lemma D.3 (Bias of AISE within a subspace). Suppose that the scalar elements x_i of x are drawn from independent Bernoulli(.5) distributions. Then

  E[s_U(x) − ŝ_U(x)] = U log₂(U)/4    (21)

Proof. Observe that the computation graph can be cast as a balanced binary tree with U leaves and each parent equal to the integer average of its children. Consider the bias introduced at each level t of the tree, where t = 0 corresponds to the leaves and t = log₂(U) corresponds to the root. The expected error E[ξ(t, n)] introduced at a node n in level t > 0 is given by:

  E[ξ(t, n)] = ½ · 2^{t−1}    (22)

where the ½ follows from Lemma D.1 and the scale 2^{t−1} is the number of leaf nodes to which the bias is effectively applied. E.g., adding one to the estimated average of four leaf nodes would increase the estimated sum by four. Since there are U · 2^{−t} nodes per level, this means that the total expected error introduced at level t is U · 2^{−t} · ½ · 2^{t−1} = U/4. Summing from t = 1 to t = log₂(U) completes the proof of the expected error. Note that t = 0 is omitted since the leaf nodes are not the result of averaging operations and so introduce no error.

Theorem D.1 (Bias of AISE). Suppose that the scalar elements x_i of x are drawn from independent Bernoulli(.5) distributions. Then

  E[s(x) − ŝ(x)] = C log₂(U)/4    (23)

Proof. This follows immediately from Lemma D.3, the fact that the overall sum is estimated within each of C/U subspaces of size U, and the assumption that the errors in each subspace are independent.

We also verified Theorem D.1 numerically by summing large numbers of integers drawn uniformly from the interval 0, . . . , 255.
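A minimal NumPy version of this check is below; the block size U, the number of trials, and the function names are our choices, and the sketch follows the operational description at the start of this appendix (block averages scaled back up by U, without the bias-correction term).

import numpy as np

def block_average_sum(block):
    # Estimate a block's sum via repeated rounded pairwise averages,
    # then scale the final average back up by the block size U.
    vals = block.astype(np.int64)
    while vals.size > 1:
        vals = (vals[0::2] + vals[1::2] + 1) >> 1   # floor((a + b + 1) / 2)
    return int(vals[0]) * block.size

def aise(x, U=16):
    return sum(block_average_sum(x[i:i + U]) for i in range(0, x.size, U))

rng = np.random.default_rng(0)
C, U, trials = 64, 16, 2000
overshoot = [aise(x, U) - x.sum() for x in rng.integers(0, 256, size=(trials, C))]
print(np.mean(overshoot))        # mean amount by which the estimate overshoots the true sum
print(C * np.log2(U) / 4)        # magnitude predicted by Theorem D.1 (= 64 here)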
Note that the assumption that the elements are independent is not especially strong in reality. This is because this section focuses on the effects on the least significant bits (which are the ones affected by each averaging operation), and the least significant bit does tend to be nearly uniformly random in a great deal of real-world data.

E. Additional Experimental Details

E.1. Choice of Matrix Multiplication Tasks

Because nearly all existing work on approximate matrix multiplication either focuses on special cases that do not satisfy our problem definition (André et al., 2019; Jegou et al., 2011; Ge et al., 2014) or synthetic matrices, there is not a clear set of benchmark matrix multiply tasks to use. We therefore propose a collection of tasks that we believe are both reproducible and representative of many real-world matrices. To the best of our knowledge, our experiments use over an order of magnitude more matrices than any previous study.
E.2. Choice of Single-Threaded Benchmarks
Given the ubiquity of GPUs and multicore CPUs, it may
not be obvious why single-threaded experiments are desir-
able. There are a number of reasons we restrict our focus
to CPUs and the single-threaded case:

• To enable fair comparisons to existing work, particularly


the nearest rival, Bolt (Blalock & Guttag, 2017).
• To facilitate fair comparisons to our work by future
authors—single-threaded experiments are much easier to
reproduce and extend than multithreaded ones.
• Matrix multiplication is embarrassingly parallel with re-
spect to rows of the input and columns of the output.
There is therefore nothing “interesting” about how our
method parallelizes relative to any other; all methods re-
duce to a single-threaded kernel that can easily be ap-
plied to disjoint submatrices. While we certainly could
spend the considerable time required to construct and
debug multicore benchmarks, this would be unlikely to
yield any useful insights.
• Parallelization in modern numerical libraries is often
managed at a higher level than the lowest-level subrou-
tines. For example, the authors of FBGEMM (Khudia
et al., 2018) state: “Internally, FBGEMM is intention-
ally designed not to create any threads. Usually, such a
library is intended to be used as a backend by deep learn-
ing frameworks, such as PyTorch and Caffe2, that create
and manage their own threads.”3 I.e., a multithreaded
library calls into single-threaded subroutines (such as a
matrix multiplication function); it is this single-threaded
subroutine where we make contributions, and therefore
where we focus our experimental efforts. Independent
of common practices in modern libraries, this pattern is
also the only sensible strategy for small matrices, like
many of those we consider—the overhead of launching
and joining threads is extremely unlikely to be worth it
for sufficiently small matrices. We could perhaps characterize where this breakpoint is, but this is a hardware-specific result that has little to do with our contributions.

• While training of deep neural networks is typically done on GPUs or other accelerators, trained models (including, but not limited to, neural networks) are commonly deployed on smartphones with just CPUs and/or graphics acceleration that is no better than the CPU (Wu et al., 2019). Since most of the billions of smartphones in the world tend to be low-end or old, the need to deploy models on CPUs (including those with few cores) is unlikely to change for many years.

• Creating, benchmarking, and analyzing a performant implementation of our method for GPUs would require a great deal of engineering work. We plan to create such an implementation in the future, but believe that the many empirical and theoretical results we currently have are more than adequate proof of concept and already worth sharing with the community.

3 https://engineering.fb.com/ml-applications/fbgemm/

E.3. SparsePCA Details

We took steps to ensure that SparsePCA's results were not hampered by insufficient hyperparameter tuning. First, for each matrix product, we tried a range of λ values which we found to encompass the full gamut of nearly 0% to nearly 100% sparsity: λ ∈ {2^i}, i ∈ {−5, −4, −3, −2, −1, 0, 1, 2, 3}. Second, because different sparsity patterns may yield different execution times, we report not the time for the single matrix SparsePCA produces for a given (d, λ) pair, but the best time from any of 10 random matrices of the same size and at most the same sparsity. Finally and most importantly, we plot only the Pareto frontier of (speed, quality) pairs produced for a given matrix multiply. I.e., we let SparsePCA cherry-pick its best results on each individual matrix multiply.
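A sketch of this sweep and the Pareto-frontier selection is shown below; the use of scikit-learn's SparsePCA (with its alpha playing the role of λ), the timing harness, and all function names are our assumptions rather than the paper's actual benchmarking code.

import time
import numpy as np
from sklearn.decomposition import SparsePCA

def pareto_frontier(points):
    # Keep only (speed, quality) pairs not dominated by another pair.
    return sorted(p for p in points
                  if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points))

def sweep_sparsepca(A_train, A_test, B, d):
    exact = A_test @ B
    points = []
    for i in range(-5, 4):                                # lambda = 2^i, as above
        model = SparsePCA(n_components=d, alpha=2.0 ** i, random_state=0)
        model.fit(A_train)
        W = model.components_.T                           # (D, d) sparse projection
        t0 = time.perf_counter()
        approx = (A_test @ W) @ (W.T @ B)                 # project, then multiply
        elapsed = time.perf_counter() - t0
        nmse = np.mean((approx - exact) ** 2) / np.mean(exact ** 2)
        points.append((1.0 / elapsed, 1.0 - nmse))        # (speed, quality)
    return pareto_frontier(points)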
E.4. Exact Matrix Multiplication

We also implemented our own matrix product function specialized for tall, skinny matrices. In all cases, we report the timings based on the faster of this function and Eigen's (Guennebaud et al., 2010) matrix multiply function for a given matrix product.

E.5. Additional Baselines

We also tested Frequent Directions / Fast Frequent Directions (Liberty, 2012; Ghashami et al., 2016; Desai et al., 2016), many variations of the sampling method of Drineas et al. (2006a), projection using orthogonalized Gaussian random matrices (Ji et al., 2012), projection using matrices of scaled i.i.d. Rademacher random variables (Achlioptas, 2001), projection using orthonormalized matrices of Rademacher random variables, the co-occurring directions sketch (Mroueh et al., 2016), OSNAP (Nelson & Nguyên, 2013), Product Quantization (Jegou et al., 2011), and Optimized Product Quantization (Ge et al., 2014).

The poor performance of many of these methods is unsurprising in our setting. Given that we have access to a training set on which to learn the true principal components, the Eckart-Young-Mirsky theorem (Eckart & Young, 1936) indicates that PCA should outperform any other individual matrix sketching method employing dense projection matrices, at least in the limit of infinite training data. Also, since PQ and OPQ use 256 dense centroids (except in the Bolt / QuickerADC variations), it is also impossible for them to perform well when min(D, M) is not significantly larger than 256.

E.6. UCR Time Series Archive

We set the number of returned neighbors to 128 (results with 64 and 256 were similar). We omitted datasets with fewer than 128 training examples, since it is not possible for Stochastic Neighbor Compression to draw 128 samples without replacement in this case.

In addition to being a large, public corpus of over a hundred datasets from a huge variety of different domains, the UCR Time Series Archive also has the advantage that it can be used to produce matrix multiplication tasks of a fixed size. This is necessary for meaningful comparison of speed versus accuracy tradeoffs across datasets. We constructed training and test matrices Ã and A by resampling each time series in each dataset's train and test set to a length of 320 (the closest multiple of 32 to the median length of 310); we wanted the length to be a multiple of 32 since existing methods operate best with sizes that are either powers of two or, failing that, multiples of large powers of two. We obtained the matrix B for each dataset by running Stochastic Neighbor Compression (Kusner et al., 2014) on the training set with an RBF kernel of bandwidth one; with 128 returned neighbors, this yields a B matrix of size 320 × 128. Since different datasets have different test set sizes, all results are for a standardized test set size of 1000 rows.

We approximate Euclidean distances using the identity ‖x − y‖₂² = ‖x‖₂² − 2x^⊤y + ‖y‖₂². We approximate only the inner products x^⊤y, since ‖y‖₂² can be precomputed for fixed exemplars y and ‖x‖₂² doesn't affect the class prediction since it is constant across all exemplars for a given input x.
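Concretely, the classification step looks like the following sketch, where approx_matmul stands in for any AMM method (it defaults to the exact product here) and the names are ours:

import numpy as np

def classify_against_exemplars(A_test, B, exemplar_labels, approx_matmul=np.matmul):
    # Squared distance = ||x||^2 - 2 x^T y + ||y||^2; only x^T y is approximated.
    y_sq_norms = (B * B).sum(axis=0)            # ||y||^2, precomputed per exemplar
    inner = approx_matmul(A_test, B)            # approximate inner products x^T y
    # ||x||^2 is constant across exemplars for a given row, so it can be ignored.
    scores = 2.0 * inner - y_sq_norms           # larger score = smaller distance
    return np.asarray(exemplar_labels)[np.argmax(scores, axis=1)]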

E.7. Caltech101

We only extracted valid windows—i.e., never past the edge of an image. We extracted the windows in CHW order, meaning that scalars from the same color channel were placed at contiguous indices. The "first" images are based on filename in lexicographic order.

To allow meaningful speed comparisons across images, we resized and center-cropped each image to 224 × 224 as commonly done in image classification pipelines (He et al., 2016a;b; Huang et al., 2017). We then extracted sliding windows of the appropriate size and used each (flattened) window as one row of Ã or A. We similarly flattened the filters, with each set of coefficients forming one column of B. In both cases, B has two columns; we used pairs of filters because using a single filter would mean timing a matrix-vector product instead of a matrix-matrix product. Two columns also made sense since Sobel filters are often used in horizontal and vertical pairings, and Gaussian filters are often used together to perform difference-of-Gaussians transforms.

Even though the RGB values at each position are naturally unsigned 8-bit integers, we allowed all rival methods to operate on them as 32-bit floating point, without including the conversion when timing them. Because it only requires checking whether values are above a threshold, MADDNESS can operate on 8-bit data directly.
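The construction of A and B for these tasks can be sketched as follows; the 3×3 Sobel pair, the stride of one, and the random test image are illustrative choices on our part rather than the exact experimental configuration.

import numpy as np

def build_window_matrix(img_chw, kh, kw):
    # Flatten every valid kh x kw window (never past the image edge) into a row,
    # keeping scalars from the same channel contiguous (CHW order).
    C, H, W = img_chw.shape
    rows = [img_chw[:, i:i + kh, j:j + kw].ravel()
            for i in range(H - kh + 1) for j in range(W - kw + 1)]
    return np.stack(rows)                                  # (num_windows, C*kh*kw)

def build_filter_matrix(filters, num_channels):
    # One flattened filter per column; filters are replicated across channels
    # so the columns line up with the CHW-flattened windows.
    cols = [np.tile(f, (num_channels, 1, 1)).ravel() for f in filters]
    return np.stack(cols, axis=1)                          # (C*kh*kw, num_filters)

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
sobel_y = sobel_x.T
img = np.random.randint(0, 256, size=(3, 224, 224)).astype(np.float32)
A = build_window_matrix(img, 3, 3)
B = build_filter_matrix([sobel_x, sobel_y], num_channels=3)
out = A @ B        # the product that the AMM methods approximate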

E.8. Why Not Speed Up Whole Neural Nets?

Using our ideas to accelerate overall neural networks and other layer types would be a valuable contribution. In fact, we are actively working on this problem. However, as we state in the introduction and problem statement, our focus in this paper is approximate matrix multiplication (AMM) and we deliberately make no claim about accelerating entire neural networks or convolutional layers. We limit our scope in this way for several reasons:

1. Approximate matrix multiplication is an established research problem of general interest independent of deep learning.

2. Lifting a method from accelerating a single layer to an overall network is challenging. Just as scalar quantization of network parameters is simple for a single layer but an active area of research for an entire network, so too is using our method on multiple layers at once an open research problem. For example, it is not clear how to deal with the fact that distributions of activations change throughout training, or how to efficiently incorporate our non-differentiable hash function. We could show how to accelerate one internal FC layer in a network, but we don't want to risk misleading the reader—it would be unclear what conclusions to draw from such results, particularly given the difficulty of retraining / fine-tuning without introducing many lurking variables (cf. Blalock et al. (2020)).

3. It is correct that convolution can be reduced to GEMM using im2col, and that accelerating convolution using our ideas would be a valuable contribution. However, state-of-the-art algorithms for convolution exploit structure that is not available to general matrix multiply algorithms. To match the performance of specialized Winograd, direct, FFT-based, and hybrid convolution schemes that do exploit this additional structure, we would have to make modifications to our approach that would make it less general. For example, the individual spatial positions should be encoded only once, and then reused at multiple filter positions. Regarding Section 5.5: while we do test our method on matrices of flattened image patches, we do not claim that the overall pipeline of flattening + matrix multiply constitutes a state-of-the-art convolution method—we only claim that using our method in this pipeline outperforms using other AMM methods there.

In short, while we believe that our ideas show great promise for accelerating full neural networks and more layer types, making this happen requires much more research.

E.9. Additional Results

In Section 5, we showed the classification accuracy as a function of wall time for the CIFAR-10 and CIFAR-100 softmax classifiers, as well as on the UCR datasets. In Figure 8 and Figure 9, we instead show normalized mean squared error versus time. In Figure 10 and Figure 11, we show accuracy or NMSE versus number of operations performed, where one operation is either one multiply-add or one table lookup, depending on the method. The first two figures illustrate that NMSE is closely related to classification accuracy, but with imperfect NMSE still yielding excellent accuracy in many cases. The second two figures show that our method's superior results are not merely caused by the use of faster CPU instructions, but also by the use of fewer basic operations at the algorithm level.

[Figure 8 (plot): "Approximating Softmax Classifiers" — 1 − NMSE vs. speedup over exact matrix multiply, with CIFAR-10 and CIFAR-100 panels; methods shown: MADDNESS, MADDNESS-PQ, Exact, ScalarQuantize, Bolt, FastJL, HashJL, PCA, SparsePCA.]
Figure 8: MADDNESS achieves a far better speed versus squared error tradeoff than any existing method when approximating two softmax classifiers. These results parallel the speed versus classification accuracy results, except that the addition of our ridge regression is much more beneficial on CIFAR-100.

[Figure 9 (plot): "Approximating an RBF Kernel Classifier" — fraction of UCR datasets with 1 − NMSE above 0.5, 0.75, and 0.95 vs. speedup over exact matrix multiply; same methods as Figure 8.]
Figure 9: MADDNESS achieves the lowest squared error at high speedups on the UCR datasets. These results parallel the speed versus classification accuracy results.

[Figure 10 (plot): "Approximating Softmax Classifiers" — classification accuracy vs. number of operations on CIFAR-10 and CIFAR-100; same methods as Figure 8.]
Figure 10: MADDNESS achieves the best speed versus accuracy tradeoff on the CIFAR datasets of any method even when speed is measured as number of operations instead of wall time. Note that fewer operations with a high accuracy (up and to the left) is better.

[Figure 11 (plot): "Approximating a Sobel Filter" and "Approximating a Gaussian Filter" — 1 − NMSE vs. number of operations; methods shown: MADDNESS, MADDNESS-PQ, SparsePCA.]
Figure 11: MADDNESS still achieves the best speed versus squared error tradeoff on the image processing tasks when speed is measured as number of operations instead of wall time.

F. Theoretical Analysis of MADDNESS

F.1. Complexity

Our encoding function g(A), A ∈ R^{N×D} has complexity Θ(NC), since it does a constant amount of work per row per codebook. Our table creation function h(B), B ∈ R^{D×M} has complexity Θ(MKCD), since it must compute the inner product between each column of B and KC prototypes of length D. This is a factor of C worse than PQ since we do not require the prototypes for different codebooks to have disjoint nonzero indices. However, this reduction in the speed of h(·) is not a concern because N ≫ M, D; moreover, the matrix B is often known ahead of time in realistic settings, allowing h(B) to be computed offline. Finally, the complexity of our aggregation function f(·) is Θ(NCM), since it performs C table lookups for each of M output columns and N output rows. This means our overall algorithm has complexity Θ(MC(KD + N)), which reduces to Θ(NCM) since we fix K = 16 and our problem statement requires N ≫ D.

F.2. Proof of Generalization Guarantee

In this section, we prove Theorem 4.1, restated below for convenience.

Theorem (Generalization Error of MADDNESS). Let D be a probability distribution over R^D and suppose that MADDNESS is trained on a matrix Ã ∈ R^{N×D} whose rows are drawn independently from D and with maximum singular value bounded by σ_A. Let C be the number of codebooks used by MADDNESS and λ > 0 be the regularization parameter used in the ridge regression step. Then for any b ∈ R^D, any a ∼ D, and any 0 < δ < 1, we have with probability at least 1 − δ that

  E_D[L(a, b)] ≤ E_Ã[L(a, b)] + (C σ_A ‖b‖₂ / (2√λ)) · ( 1/256 + (8 + √(ν(C, D, δ))) / √(2n) )

where L(a, b) ≜ |a^⊤b − αf(g(a), h(b)) − β|, α is the scale used for quantizing the lookup tables, β collects the constants used in quantizing the lookup tables plus the debiasing constant of Section 4.4, and

  ν(C, D, δ) ≜ C(4⌈log₂(D)⌉ + 256) log 2 − log δ.

The proof relies on the observation that MADDNESS's training procedure can be decomposed into two sequential subroutines: Maddness-Build-Tree, which learns the function g(a) by constructing a binary decision tree, and Maddness-Regress, which learns the function h(b) by optimizing a prototype matrix P such that g(Ã)P ≈ Ã. This observation allows us to prove Theorem 4.1 by first providing a guarantee for Maddness-Regress for a fixed Maddness-Build-Tree hypothesis, and then union bounding over the hypothesis space for Maddness-Build-Tree. Bounding the size of the hypothesis space is straightforward (Lemma F.1), so the bulk of this section focuses on providing a guarantee for Maddness-Regress. We must also prove a bound on the loss contributed by quantizing the lookup tables array P^⊤b.

Lemma F.1 (Number of Hypotheses for Maddness-Build-Tree). Let C be the number of codebooks used by MADDNESS and let D be the number of columns in the matrix Ã on which MADDNESS is trained. Then there are at most 2^{C(4⌈log₂(D)⌉+256)} unique trees that Maddness-Build-Tree can generate.

Proof. Maddness-Build-Tree learns four sets of parameters for each of the C trees it produces: split indices, split offsets, split scales, and split values.

There are four split indices per tree because there is one for each of the tree's four levels. Each index requires ⌈log₂(D)⌉ bits to store, so the split indices require a total of 4⌈log₂(D)⌉ bits per tree. For each split index, there is one split offset and scale, used to map floating point data in an arbitrary range to the interval [0, 255] to match up with the 8-bit split values.

The offsets require at most 25 bits each for 32-bit floating point data, since the low seven bits can be dropped without affecting the post-scaling quantized output. The scales are constrained to be powers of two, and so require at most nine bits for non-subnormal 32-bit floating point inputs (which have one sign bit and eight exponent bits). The offsets and scales together therefore contribute 4(25 + 9) = 136 bits per tree.

There are 15 split values because there is one for the root of each tree, then two for the second level, four for the third, and eight for the fourth. Each split value is stored using eight bits, so each tree requires 15 · 8 = 120 bits for split values. The total number of bits used for all trees is therefore C(4⌈log₂(D)⌉ + 256). Note that the constant 256 being a power of two is just an accident of floating point formats. The claimed hypothesis count follows from the number of expressible hypotheses being at most two to the power of the largest number of bits used to store any hypothesis.
training procedure can be decomposed into two sequential
subroutines: Maddness-Build-Tree, which learns
the function g(a) by constructing a binary decision tree, We now turn our attention to bounding the errors of the
and Maddness-Regress, which learns the function regression component of training. Our strategy for doing so
h(b) by optimizing a prototype matrix P such that is to bound the largest singular value of the learned matrix
g(Ã)P ≈ Ã. This observation allows us to prove 4.1 by of prototypes P . Given such a bound, the norms of both
first providing a guarantee for g(a)> P and P > b can be bounded.

Lemma F.2 (Regularized Pseudoinverse Operator Norm Bound). Let X ∈ R^{N×D} be an arbitrary matrix with finite elements. Then every singular value σ_i of the matrix Z ≜ (X^⊤X + λI)^{−1}X^⊤, λ > 0 is at most 1/(2√λ).

Proof. Let UΣV^⊤ be the singular value decomposition of X. Then we have

  Z = (X^⊤X + λI)^{−1}X^⊤    (24)
    = (VΣU^⊤UΣV^⊤ + λI)^{−1}VΣU^⊤    (25)
    = (VΣ²V^⊤ + λI)^{−1}VΣU^⊤    (26)
    = (VΣ²V^⊤ + VλIV^⊤)^{−1}VΣU^⊤    (27)
    = (VΣ_λV^⊤)^{−1}VΣU^⊤    (28)
    = VΣ_λ^{−1}V^⊤VΣU^⊤    (29)
    = VΣ_λ^{−1}ΣU^⊤    (30)
    = VΣ′U^⊤    (31)

where Σ_λ ≜ Σ² + λI and Σ′ ≜ (Σ² + λI)^{−1}Σ. Step 27 follows from the equality VλIV^⊤ = λVV^⊤ = λI. Because the matrices V and U^⊤ are orthonormal and Σ′ is diagonal, the singular values of Z are equal to the diagonal entries of Σ′. Each entry σ′_i is equal to

  σ′_i = σ_i / (σ_i² + λ).    (32)

This expression attains its maximal value of 1/(2√λ) when σ_i² = λ.
X ∈ RN ×D and Y ∈ RD×M be arbitrary matrices and For each of the C codebooks, this means that the value to
let W , (X > X + λI)−1 X > Y , λ > 0 be the ridge be quantized lies in the interval [ −σ2A√kbk
λ
2 σA kbk2
, 2√λ ] of width
regression weight matrix. Then kW k∞ ≤ kY k
√ ∞ , where
2 λ
σA kbk2
√ . Because M ADDNESS quantizes the lookup tables
λ
k·k∞ denotes the largest singular value. such that largest and smallest entries for any row of P
are linearly mapped to 255.5 and −0.5,4 respectively, the
Proof. Observe that W = ZY , where Z , (X > X + worst-case quantization error is when the quantized value
λI)−1 X > . Then by applying Lemma F.2 and recalling lies exactly between two quantization levels. We there-
that Schatten norms are submultiplicative, we have fore need to compute the largest possible gap between a
value and its quantization. Using 256 quantization lev-
kY k∞ els, the largest possible gap is 1/(256/.5) = 1/512 of
kW k∞ ≤ kZk∞ kY k∞ ≤ √ . (33)
2 λ the interval width. Multiplying by the above interval width
yields a maximum quantization error for a given codebook
A kbk2
of σ512 √ . Because the errors in each subspace may not
λ
Lemma F.4 (Bound on M ADDNESS Embedding Norm). agree in sign, their sum is an upper bound on the overall
Let g = g(a) be the encoding of an arbitrary vector a us- quantization error.
ing C codebooks and let P be the prototype matrix learned
by M ADDNESS using training matrix à with ridge regres- At this point, we have all of the pieces necessary to prove a
sion parameter λ > 0. Then generalization guarantee for Maddness-Regress save
C 4
kg > P k2 ≤ √ kÃk∞ (34) We use 255.5 and −0.5 rather than 255 and 0 because the
2 λ latter only guarantees that a point is within 1/510 of the inter-
val width, not 1/512. This is not an important choice and either
where kÃk∞ denotes the largest singular value of Ã. option would be fine.

Kakade et al. (2009) provide such a guarantee, based on Rademacher complexity (Bartlett & Mendelson, 2002).

Theorem F.1 ((Kakade et al., 2009), Corollary 5). Let F = {w^⊤x : ‖w‖₂ ≤ W} be the class of linear functions with bounded L2 norms, let S be a set of n samples drawn i.i.d. from some distribution D over the L2 ball of radius X, and let L(f), f ∈ F be a loss function with Lipschitz constant L. Then for any 0 < δ < 1, it holds with probability at least 1 − δ over the sample S that

  E_D[L(f)] ≤ E_S[L(f)] + (LXW/√(2n)) (8 + √(−log(δ))).    (40)

We can now obtain our desired guarantee for the regression step.

Lemma F.6 (Generalization Error of Maddness-Regress). Let D be a probability distribution over R^D and suppose that MADDNESS is trained on a matrix Ã ∈ R^{N×D} whose rows are drawn independently from D and with maximum singular value bounded by σ_A. Let C be the number of codebooks used by MADDNESS and λ > 0 the regularization parameter used in the ridge regression step. Further let g(a) be a fixed (data-independent) function and L(a, b) ≜ |a^⊤b − f(g(a), h(b))|. Then for all vectors b, any vector a ∼ D, and any 0 < δ < 1, we have with probability at least 1 − δ that

  E_D[L(a, b)] ≤ E_Ã[L(a, b)] + Cσ_A‖b‖₂/(512√λ) + (Cσ_A‖b‖₂/(2√(2nλ))) (8 + √(−log(δ))).    (41)

Proof. The output of Maddness-Regress can be decomposed into

  ŷ ≜ f(g(a), h(b)) = g^⊤Pb + ε + ζ    (42)

where g = g(a), P is the matrix of prototypes, ε is data-independent noise from the averaging process⁵, and ζ is noise from quantizing the lookup table entries. By Lemma F.5, ζ ≤ Cσ_A‖b‖₂/(512√λ) (accounting for the second term in Equation 41). We therefore need only obtain a guarantee for |g^⊤Pb − a^⊤b|. Defining w ≜ Pb, we see that Maddness-Regress is a linear model, and therefore subject to Theorem F.1. Given an upper bound on the Lipschitz constant of the loss, a bound on the L2 norm of g, and a bound on the L2 norm of w, we can apply this theorem. The Lipschitz constant for the absolute loss is 1. The L2 norm of g is exactly C. The L2 norm of w can be bounded as

  ‖w‖₂ = ‖Pb‖₂ ≤ ‖P‖_∞‖b‖₂ ≤ σ_A‖b‖₂/(2√λ)    (43)

using Lemma F.3.

Using this lemma, the proof of Theorem 4.1 is immediate; we begin with Lemma F.6 and simply union bound over all 2^{C(4⌈log₂(D)⌉+256)} hypotheses from Lemma F.1.

⁵ We continue to make the assumption that the least significant bits of the lookup table entries are independent Bernoulli(0.5) random variables, which is nearly true in practice. Even if this assumption does not hold, this noise does not contribute to the generalization gap unless it differs between train and test sets.