Multiplying Matrices Without Multiplying

Davis Blalock¹  John Guttag²

¹MosaicML, San Francisco, CA, USA. ²MIT CSAIL, Cambridge, MA, USA. Correspondence to: Davis Blalock <[email protected]>. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Multiplying matrices is among the most fundamental and compute-intensive operations in machine learning. Consequently, there has been significant work on efficiently approximating matrix multiplies. We introduce a learning-based algorithm for this task that greatly outperforms existing methods. Experiments using hundreds of matrices from diverse domains show that it often runs 100× faster than exact matrix products and 10× faster than current approximate methods. In the common case that one matrix is known ahead of time, our method also has the interesting property that it requires zero multiply-adds. These results suggest that a mixture of hashing, averaging, and byte shuffling—the core operations of our method—could be a more promising building block for machine learning than the sparsified, factorized, and/or scalar quantized matrix products that have recently been the focus of substantial research and hardware investment.

1. Introduction

Matrix multiplication is among the most fundamental subroutines used in machine learning and scientific computing. As a result, there has been a great deal of work on implementing matrix multiplies efficiently, speeding up distributed matrix multiplication (Yu et al., 2017; Dutta et al., 2016; Yu et al., 2020; Irony et al., 2004), and designing efficient Approximate Matrix Multiplication (AMM) algorithms under various assumptions.

We focus on the AMM task under the assumptions that the matrices are tall, relatively dense, and resident in a single machine's memory. In this setting, the primary challenge is minimizing the compute time required to approximate linear operations with a given level of fidelity. This setting arises naturally when one has a data matrix A whose rows are samples and a linear operator B one wishes to apply to these samples. B could be a linear classifier, linear regressor, or an embedding matrix, among other possibilities.

As a concrete example, consider the task of approximating a softmax classifier trained to predict image labels given embeddings derived from a neural network. Here, the rows of A are the embeddings for each image, and the columns of B are the weight vectors for each class. Classification is performed by computing the product AB and taking the argmax within each row of the result. In Figure 1, we see the results of approximating AB using our method and its best-performing rivals (Dasgupta et al., 2010; Mairal et al., 2009) on the CIFAR-10 and CIFAR-100 datasets.

[Figure 1 plot: approximating softmax classifiers; classification accuracy vs. speedup over exact matrix multiply, with CIFAR-10 and CIFAR-100 panels; methods: Ours, Exact Matrix Multiply, Mairal et al., Dasgupta et al.]
Figure 1: Our method achieves a dramatically better speed-accuracy tradeoff than the best existing methods when approximating two linear classifiers.

Our method represents a significant methodological departure from most traditional approaches to this problem. Traditional AMM methods construct matrices V_A, V_B ∈ R^{D×d}, d ≪ D, such that

AB ≈ (A V_A)(V_B^⊤ B).   (1)
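To make Equation (1) concrete, the following is a minimal NumPy sketch (not the authors' released implementation) of a projection-based AMM baseline in which V_A = V_B = V is taken to be the top-d right singular vectors of a training matrix Ã, as in the PCA baseline of Section 5.1; all variable names are illustrative.

```python
import numpy as np

def pca_amm(A, B, A_train, d):
    """Projection-based AMM per Equation (1): AB ~= (A V)(V^T B), with V the
    top-d right singular vectors of a training matrix A_train."""
    _, _, Vt = np.linalg.svd(A_train, full_matrices=False)
    V = Vt[:d].T                    # D x d projection matrix
    return (A @ V) @ (V.T @ B)      # two skinny products instead of one dense one

rng = np.random.default_rng(0)
A_train = rng.normal(size=(1000, 64))
A, B = rng.normal(size=(100, 64)), rng.normal(size=(64, 10))
err = np.linalg.norm(pca_amm(A, B, A_train, d=8) - A @ B) / np.linalg.norm(A @ B)
```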
Often, V_A and V_B are sparse, embody some sort of sampling scheme, or have other structure such that these projection operations are faster than a dense matrix multiply. In short, these methods use linear functions to preprocess A and B and reduce the problem to exact matrix multiplication in a lower-dimensional space.

Our proposed method, MADDNESS, instead employs a nonlinear preprocessing function and reduces the problem to table lookups. Moreover, in the case that B is known ahead of time—which happens when applying a trained linear model to new data, among other situations—MADDNESS does not require any multiply-add operations.

Our method is most closely related to vector quantization methods used for similarity search (e.g., (Blalock & Guttag, 2017; André et al., 2017; 2019; Jegou et al., 2011; Ge et al., 2014)). However, instead of using an expensive quantization function that requires many multiply-adds, we introduce a family of quantization functions that require no multiply-adds.

Our contributions can be summarized as follows:

• An efficient family of learned vector quantization functions that can encode over 100GB of data per second in a single CPU thread.
• A high-speed summation algorithm for low-bitwidth integers that avoids upcasting, saturation, and overflow.
• An algorithm based on these functions for approximate matrix multiplication. Experiments across hundreds of diverse matrices demonstrate that this algorithm significantly outperforms existing alternatives. It also features theoretical quality guarantees.

1.1. Problem Formulation

Let A ∈ R^{N×D} and B ∈ R^{D×M} be two matrices, with N ≫ D ≥ M. Given a computation time budget τ, our task is to construct three functions g(·), h(·), and f(·), along with constants α and β, such that

‖α f(g(A), h(B)) + β − AB‖_F < ε(τ) ‖AB‖_F.   (2)

2. Related Work

Because our work draws on ideas from randomized algorithms, approximate matrix multiplication, vector quantization, and other fields, the body of work related to our own is vast. Here, we provide only a high-level overview, and refer the interested reader to (Wang et al., 2016a; 2014a; Desai et al., 2016) for more detailed surveys. We also defer discussion of related vector quantization methods to the following sections.

2.1. Linear Approximation

Most AMM methods work by projecting A and B into lower-dimensional spaces and then performing an exact matrix multiply. One simple option for choosing the projection matrices is to use matrix sketching algorithms. The most prominent deterministic matrix sketching methods are the Frequent Directions algorithm (Liberty, 2012; Ghashami et al., 2016) and its many variations (Teng & Chu, 2019; Francis & Raimond, 2018b; Ye et al., 2016; Huang, 2019; Luo et al., 2019; Francis & Raimond, 2018a). There are also many randomized sketching methods (Sarlos, 2006; Kyrillidis et al., 2014; Pagh, 2013; Dasgupta et al., 2010; Nelson & Nguyên, 2013) and sampling methods (Drineas et al., 2006b;c).

A weakness of matrix sketching methods in the context of matrix multiplication is that they consider each matrix in isolation. To exploit information about both matrices simultaneously, Drineas et al. (2006a) sample columns of A and rows of B according to a sampling distribution dependent upon both matrices. Later work by Manne & Pal (2014) reduces approximation of the matrices to an optimization problem, which is solved by steepest descent. Mroueh et al. (2016), Ye et al. (2016), and Francis & Raimond (2018a) introduce variations of the Frequent Directions algorithm that take into account both matrices.

All of the above methods differ from our own not merely in specifics, but also in problem formulation. These methods all assume that there is no training set Ã, and nearly all focus on large matrices, where provably reduced asymptotic complexity for a given level of error is the goal.
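The linear baselines above all instantiate Equation (1), and are evaluated, like our method, by the relative Frobenius error in Equation (2). A minimal helper for that metric (illustrative only; not part of the paper's code) is:

```python
import numpy as np

def relative_error(C_hat, A, B):
    """||C_hat - AB||_F / ||AB||_F, the quantity bounded by eps(tau) in Equation (2)."""
    exact = A @ B
    return np.linalg.norm(C_hat - exact) / np.linalg.norm(exact)
```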
…table is at most 255. Ignoring rounding error, this affine transform is invertible, and is reflected by the constants α and β in equation 2. See Appendix A for further details.

Aggregation, f(·,·): Given the encoding of a and the lookup tables for b, the product can be approximated as

a^⊤ b = Σ_{c=1}^{C} a^{(c)⊤} b^{(c)} ≈ Σ_{c=1}^{C} h^{(c)}(b)_k,  k = g^{(c)}(a).   (6)

Algorithm 1 MADDNESS HASH
1: Input: vector x, split indices j^1, …, j^4, split thresholds v^1, …, v^4
2: i ← 1   // node index within level of tree
3: for t ← 1 to 4 do
4:   v ← v^t_i   // lookup split threshold for node i at level t
5:   b ← x_{j^t} ≥ v ? 1 : 0   // above split threshold?
6:   i ← 2i − 1 + b   // assign to left or right child
7: end for
8: return i
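A direct Python transliteration of Algorithm 1 (0-indexed; the container layout for the thresholds is an assumption made for illustration) makes the control flow explicit: four splits walk a balanced binary tree, so each input lands in one of 2^4 = 16 buckets per codebook.

```python
import numpy as np

def maddness_hash(x, split_idxs, split_vals):
    """Algorithm 1 (MADDNESS HASH), 0-indexed.

    split_idxs: the four split indices j^1..j^4, one per tree level.
    split_vals: split_vals[t][i] is the threshold for node i at level t
                (level t has 2**t nodes).
    """
    i = 0                                       # node index within the current level
    for t in range(4):
        v = split_vals[t][i]                    # lookup split threshold for node i at level t
        b = 1 if x[split_idxs[t]] >= v else 0   # above split threshold?
        i = 2 * i + b                           # descend to left or right child
    return i                                    # bucket index in [0, 16)

x = np.array([0.2, -1.3, 0.7, 2.0])
split_idxs = [1, 3, 0, 2]
split_vals = [np.zeros(1), np.zeros(2), np.zeros(4), np.zeros(8)]
print(maddness_hash(x, split_idxs, split_vals))
```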
…two child buckets from each current bucket by grouping vectors whose jth entries are above or below the bucket's split threshold.

Algorithm 2 Adding The Next Level to the Hashing Tree
1: Input: buckets B^{t−1}_1, …, B^{t−1}_{2^{t−1}}, training matrix Ã
   // greedily choose next split index and thresholds
2: Ĵ ← heuristic_select_idxs(B^{t−1}_1, …, B^{t−1}_{2^{t−1}})
3: l^{min}, j^{min}, v^{min} ← ∞, NaN, NaN
4: for j ∈ Ĵ do
5:   l ← 0   // initialize loss for this index to 0
6:   v ← [ ]   // empty list of split thresholds
7:   for i ← 1 to 2^{t−1} do
8:     v_i, l_i ← optimal_split_threshold(j, B^{t−1}_i)
9:     append(v, v_i)   // append threshold for bucket i
10:    l ← l + l_i   // accumulate loss from bucket i
11:  end for
12:  if l < l^{min} then
13:    l^{min} ← l, j^{min} ← j, v^{min} ← v   // new best split
14:  end if
15: end for
   // create new buckets using chosen split
16: B ← [ ]
17: for i ← 1 to 2^{t−1} do
18:   B_below, B_above ← apply_split(v^{min}_i, B^{t−1}_i)
19:   append(B, B_below)
20:   append(B, B_above)
21: end for
22: return B, l^{min}, j^{min}, v^{min}

4.3. Optimizing the Prototypes

At this point, we have a complete algorithm. We could simply drop our hash-based encoding function into PQ and approximate matrix products. However, we contribute two additional enhancements: a means of optimizing the prototypes with no runtime overhead, and a means of quickly summing low-bitwidth integers.

Several works propose prototype or table optimizations based on knowledge of B (Babenko et al., 2016; Wang et al., 2014b), and others optimize them at the expense of slowing down the function g(·) (Zhang et al., 2014; 2015). In contrast, we introduce a method that does not do either of these. The idea is to choose prototypes such that Ã can be reconstructed from its prototypes with as little squared error as possible—this improves results since less error means that less information about Ã is being lost.

Let P ∈ R^{KC×D} be a matrix whose diagonal blocks of size K × |J^{(c)}| consist of the K learned prototypes in each subspace c. The training matrix Ã can be approximately reconstructed as Ã ≈ GP, where G serves to select the appropriate prototype in each subspace. Rows of G are formed by concatenating the one-hot encoded representations of each assignment for the corresponding row of Ã. For example, if a row were assigned prototypes ⟨3 1 2⟩ with K = 4, C = 3, its row in G would be ⟨0010 1000 0100⟩ ∈ R^{12}. Our idea is to optimize P conditioned on G and Ã. This is an ordinary least squares problem, and we solve it with ridge regression:

P ≜ (G^⊤ G + λI)^{−1} G^⊤ Ã.   (8)

One could obtain better performance by cross-validating to find λ, but for simplicity, we fix λ = 1.

This procedure allows the prototypes to be nonzero outside of their original subspaces. Because of our hashing procedure, we avoid the dramatically increased overhead faced by other methods with non-orthogonal prototypes (c.f. (Babenko & Lempitsky, 2015; 2014; Zhang et al., 2014; Liu et al., 2016; Martinez et al., 2016; 2014)).
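Equation (8) is an ordinary ridge regression over the one-hot assignment matrix G. A compact NumPy sketch (illustrative variable names, not the released implementation; bucket indices here are 0-indexed, whereas the ⟨3 1 2⟩ example above is 1-indexed) is:

```python
import numpy as np

def optimize_prototypes(A_train, assignments, K, lam=1.0):
    """Prototype optimization of Section 4.3.

    assignments: (N, C) integer array; assignments[n, c] is the bucket chosen by
    the hash function g^(c) for row n of A_train. G one-hot encodes these choices
    (one K-wide block per codebook), and P solves Equation (8):
    P = (G^T G + lam*I)^{-1} G^T A_train.
    """
    N, C = assignments.shape
    G = np.zeros((N, K * C))
    G[np.arange(N)[:, None], K * np.arange(C) + assignments] = 1.0
    P = np.linalg.solve(G.T @ G + lam * np.eye(K * C), G.T @ A_train)
    return P    # (K*C, D) matrix of optimized prototypes

rng = np.random.default_rng(0)
A_train = rng.normal(size=(500, 32))
assignments = rng.integers(0, 16, size=(500, 4))   # C = 4 codebooks, K = 16 buckets
P = optimize_prototypes(A_train, assignments, K=16)
```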
4.4. Fast 8-bit Aggregation, f(·,·)

Let T ∈ R^{M×C×K} be the tensor of lookup tables for all M columns of B. Given the encodings G, the function f(·,·) is defined as

f(g(A), h(B))_{n,m} ≜ Σ_{c=1}^{C} T_{m,c,k},  k = g^{(c)}(a_n).   (9)

Because the entries of T are stored as 8-bit values, exact summation requires immediately upcasting each looked-up entry to 16 bits before performing any addition instructions (Blalock & Guttag, 2017). This not only imposes overhead directly, but also means that one must perform 16-bit additions, which have half the throughput of 8-bit additions.

We propose an alternative that sacrifices a small amount of accuracy for a significant increase in speed. Instead of using addition instructions, we use averaging instructions, such as vpavgb on x86 or vrhadd on ARM. While non-saturating additions compute (a + b) % 256, these instructions compute (a + b + 1)/2. This means that they lose information about the low bit instead of the high bit of the sum. We estimate the overall mean by averaging pairs of values, then pairs of pairs, and so on. We refer the reader to Appendix D for details.

The challenging part of this approach is computing the bias in the estimated sum in order to correct for it. We prove in Appendix D that this bias has a closed-form solution under the realistic assumption that the low bits are equally likely to be 0 or 1.
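The following NumPy sketch emulates this averaging-based aggregation for a single block of 8-bit values; the real implementation uses SIMD instructions such as vpavgb, and the helper names and power-of-two block length here are assumptions for illustration. The small bias of the estimate is characterized and corrected in Appendix D; no correction is applied in this sketch.

```python
import numpy as np

def avg_u8(a, b):
    """Emulates the x86 vpavgb / ARM vrhadd instruction: (a + b + 1) >> 1 on uint8."""
    return ((a.astype(np.uint16) + b.astype(np.uint16) + 1) >> 1).astype(np.uint8)

def estimate_block_sum_u8(vals):
    """Estimate the sum of a block of 8-bit values by averaging pairs, then pairs
    of pairs, and so on, then scaling the final average by the block length."""
    v = np.asarray(vals, dtype=np.uint8)
    n = len(v)                       # assumed to be a power of two
    while len(v) > 1:
        v = avg_u8(v[0::2], v[1::2])
    return int(v[0]) * n

vals = np.random.default_rng(0).integers(0, 256, size=16, dtype=np.uint8)
print(estimate_block_sum_u8(vals), int(vals.sum()))
```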
4.5. Theoretical Guarantees

Our central theoretical result is a generalization guarantee for the overall approximation error of MADDNESS, stated below. See Appendix F for a proof and additional analysis, including a discussion of algorithmic complexity. Besides this main guarantee, we also inherit all of the guarantees for Bolt (Blalock & Guttag, 2017), modulo a small amount of additional error from averaging integers rather than summing exactly. This follows from Bolt's guarantees depending only on the quantization errors, rather than the method used to obtain them.

Theorem 4.1 (Generalization Error of MADDNESS). Let 𝒟 be a probability distribution over R^D and suppose that MADDNESS is trained on a matrix Ã ∈ R^{N×D} whose rows are drawn independently from 𝒟 and with maximum singular value bounded by σ_A. Let C be the number of codebooks used by MADDNESS and λ > 0 be the regularization parameter used in the ridge regression step. Then for any b ∈ R^D, any a ∼ 𝒟, and any 0 < δ < 1, we have with probability at least 1 − δ that

E_𝒟[L(a, b)] ≤ E_Ã[L(a, b)] + (C σ_A ‖b‖₂ / (2√λ)) (1/256 + (8 + √(ν(C, D, δ))) / √(2n)),   (10)

where L(a, b) ≜ |a^⊤b − αf(g(a), h(b)) − β|, α is the scale used for quantizing the lookup tables, β is the constants used in quantizing the lookup tables plus the debiasing constant of Section 4.4, and

ν(C, D, δ) ≜ C(4⌈log₂(D)⌉ + 256) log 2 − log δ.   (11)

5. Experiments

To assess MADDNESS's effectiveness, we implemented both it and existing algorithms in C++ and Python. All of our code and raw numerical results are publicly available at https://smarturl.it/Maddness. All experiments use a single thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ processor. Unless stated otherwise, all timing results use five trials, with each trial reporting the fastest among 20 executions. We use the best, rather than the average, since this is standard practice in performance benchmarking and is robust to the purely additive noise introduced by competing CPU tasks. Standard deviations are shown for all curves as shaded areas. Since training can be performed offline and all methods except SparsePCA (Mairal et al., 2009) train in at most a few minutes, we omit profiling of training times. We also do not profile the time to preprocess B, since 1) this time is inconsequential in most cases, and 2) B is fixed and could be processed offline in all the problems we consider. In order to avoid implementation bias, we build upon the source code provided by Blalock & Guttag (2017),² which includes highly tuned implementations of many algorithms to which we compare.

²https://github.com/dblalock/bolt

We do not need to tune any hyperparameters for MADDNESS, but we do take steps to ensure that other methods are not hindered by insufficient hyperparameter tuning. Concretely, we sweep a wide range of hyperparameter values and allow them to cherry-pick their best hyperparameters on each test matrix. Further details regarding our experimental setup and choices (e.g., use of a single thread) can be found in Appendix E.

5.1. Methods Tested

Recall that most baselines take the form of selecting a matrix V ∈ R^{D×d}, d < D, such that AB ≈ (AV)(V^⊤B). Here d is a free parameter that adjusts the quality versus speed tradeoff. We therefore characterize most of these methods by how they set V.

• PCA. Set V equal to the top principal components of Ã.
• SparsePCA (Mairal et al., 2009). Set V = argmin_V min_U (1/(2 N_train)) ‖Ã − U V^⊤‖²_F + λ‖V‖₁, where U^⊤U = I. This is not the only dictionary learning formulation referred to as SparsePCA (Zou & Xue, 2018; Camacho et al., 2020), but it is a good representative and is the only one with support in a major Python library.
• FastJL (Ailon & Chazelle, 2009). V is set to Rademacher random variables composed with a Fast Hadamard Transform (FHT). For simplicity, we exclude the FHT in the timing.
• HashJL (Dasgupta et al., 2010). V is zero except for a ±1 in each row, with both sign and position chosen uniformly at random (see the sketch at the end of this subsection).
• ScalarQuantize. The matrices are not projected, but instead linearly quantized to eight bits such that the smallest and largest entries map to either 0 and 255 or -128 and 127, as appropriate. We use FBGEMM (Khudia et al., 2018) to perform the quantized matrix multiplies. We neglect the time required to convert from other formats to eight bits, reflecting the optimistic scenario in which the matrices are already of the appropriate types.
• Bolt (Blalock & Guttag, 2017). Bolt is the most similar method to our own, differing only in the encoding function, the use of averaging instead of upcasting, and the optimization of centroids.
• Exact Multiplication. We simply compute the matrix product AB using a modern BLAS implementation.
• MADDNESS-PQ. A handicapped version of MADDNESS without the prototype optimization step. The gap between MADDNESS and MADDNESS-PQ is the gain from optimizing the prototypes.

We also compared to many additional methods (see Appendix E), but omit their results since they were not competitive with those listed here.
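As referenced in the HashJL bullet above, a minimal construction of its projection matrix under the description given there (an illustrative sketch, not the benchmarked implementation) is:

```python
import numpy as np

def hashjl_matrix(D, d, rng):
    """HashJL baseline (Dasgupta et al., 2010): V is zero except for a single
    +-1 in each row, at a uniformly random column with a uniformly random sign."""
    V = np.zeros((D, d))
    cols = rng.integers(0, d, size=D)
    signs = rng.choice([-1.0, 1.0], size=D)
    V[np.arange(D), cols] = signs
    return V

rng = np.random.default_rng(0)
A, B = rng.normal(size=(256, 64)), rng.normal(size=(64, 10))
V = hashjl_matrix(64, d=16, rng=rng)
approx = (A @ V) @ (V.T @ B)   # Equation (1) with V_A = V_B = V
```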
5.2. How Fast is MADDNESS?

MADDNESS is up to two orders of magnitude faster than existing methods, and its throughput increases with row length. This latter property is because its encoding cost per row is O(C) rather than O(D).

[Figure 3 plot: speed of g() functions for different encoding sizes; encoding speed (GB/s) vs. number of columns in matrix A, shown for several encoding sizes (e.g., 8B); methods: MADDNESS, Bolt, PQ, OPQ.]
Figure 3: MADDNESS encodes the A matrix orders of magnitude more quickly than existing vector quantization methods.

We also profile the speed of our aggregation function f(·,·) using the same baselines as Blalock & Guttag (2017). As Figure 4 shows, our average-based, matrix-aware aggregation is significantly faster than the upcasting-based method of Bolt, its nearest rival.

[Figure 4 plot: speed of f() functions for different encoding sizes.]
Figure 4: Given the preprocessed matrices, MADDNESS computes the approximate output twice as fast as the fastest existing method.

5.3. Softmax Classifier

As described in Section 1, we approximated linear classifiers on the widely used CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). The classifiers use as input features the 512-dimensional activations of open-source, VGG-like neural networks trained on each dataset (Geifman, 2018). The matrices A are the 10000 × 512-dimensional floating point activations for the full test sets, and the matrices B are each network's final dense layer. The 50000 × 512-dimensional activations from the training sets serve as the training matrices Ã.

[Figure 5 plot: classification accuracy vs. speedup over exact matrix multiply, CIFAR-10 and CIFAR-100 panels; methods: MADDNESS, MADDNESS-PQ, Exact, ScalarQuantize, Bolt, FastJL, HashJL, PCA, SparsePCA.]
Figure 5: MADDNESS achieves a far better speed-accuracy tradeoff than any existing method when approximating two softmax classifiers.

Moreover, our method achieves this performance despite having worse support from the hardware. More precisely, it obtains speedups much smaller than the level of compression it provides. For example, the third points from the right in both plots correspond to speedups of roughly 100×. However, they are compressing each 512 × 4B = 2048B row of the input down to a mere 4B, a savings of 512× (sizes not shown in figure). If the hardware could lookup-accumulate as many bytes per cycle as it can multiply-accumulate, our method could be over 4× faster. Combined with the fact that multiplexers require many fewer transistors than multipliers, this suggests that a hardware implementation of our method might offer large efficiency gains compared to existing accelerators.
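For concreteness, the accuracy metric reported in Figure 5 can be computed as in the sketch below (illustrative names; the released code is the authoritative version):

```python
import numpy as np

def softmax_task_accuracy(A, B, labels, approx_mm=None):
    """A: (N, 512) test-set embeddings, B: (512, n_classes) final dense layer,
    labels: (N,) ground-truth labels. Predictions are the row-wise argmax of the
    (approximate) product AB, as described in Section 1."""
    logits = A @ B if approx_mm is None else approx_mm(A, B)
    return float(np.mean(np.argmax(logits, axis=1) == labels))
```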
[Figure 6 plot: fraction of UCR datasets preserving more than 0.75 (top) and more than 0.95 (bottom) of the original accuracy vs. speedup over exact matrix multiply; methods: MADDNESS, MADDNESS-PQ, Exact, ScalarQuantize, Bolt, FastJL, HashJL, PCA, SparsePCA.]
Figure 6: Fraction of UCR datasets for which each method preserves a given fraction of the original accuracy. MADDNESS enables much greater speedups for a given level of accuracy degradation.

[Figure 7 plot: 1 − NMSE vs. speedup over exact matrix multiply when approximating image filters (including a Gaussian filter); methods: MADDNESS, MADDNESS-PQ, SparsePCA.]
Figure 7: Despite there being only two columns in the matrix B, MADDNESS still achieves a significant speedup with reasonable accuracy. Methods that are Pareto dominated by exact matrix multiplication on both tasks are not shown; this includes all methods but MADDNESS and SparsePCA.
6. Discussion and Conclusion

Because our work draws on a number of different fields but does not fit cleanly into any of them, it is useful to discuss what we have and have not demonstrated, as well as possible implications and extensions of our work.

Our main empirical finding is that our proposed method, MADDNESS, achieves order-of-magnitude speedups compared to existing AMM methods and up to two-order-of-magnitude speedups compared to the dense baseline. It also compresses matrices by up to three orders of magnitude. These results are evaluated on a CPU, and are obtainable only when there is a training set for one matrix. We also claim superior performance only when one matrix is larger than the other, and both matrices are tall—the regime wherein our extremely fast (but less accurate) encoding function is beneficial. Our method also loses utility when the larger matrix is known ahead of time; this assumption is common in similarity search, and eliminates the need for a fast encoding function entirely. Our approximate integer summation and fused table lookups would likely be useful independent of any of these assumptions, but demonstrating this is future work.

We also have several theoretical findings, taking the form of guarantees regarding the errors introduced by our method and its constituent subroutines. While we do obtain an overall generalization guarantee, this guarantee is not tight. In particular, it should grow looser with the large matrix's Frobenius norm and tighter as its singular values become more concentrated; at present, however, it simply grows looser as the largest singular value grows. The missing step is a guarantee that our encoding function will yield lower quantization errors when the singular values are more concentrated, which is its behavior in practice.

We have not demonstrated results using GPUs or other accelerators. While such accelerators are a small minority of hardware, they are often used in machine learning. Our method is not inherently tied to CPUs, but the differing performance characteristics of accelerators mean that adapting our method to them would require both algorithmic and implementation work, with the details depending on the device. We also have not evaluated a multi-CPU-threaded extension of our algorithm, though this is because our method is intended to serve as the low-level, compute-bound, block matrix product routine called by individual threads.

Finally, we have not demonstrated results using convolutional layers in neural networks, or results accelerating full networks. The weight reuse in convolutional layers presents many opportunities for algorithmic optimizations, and we hope to exploit them using a specialized extension of our method in future work. Accelerating overall networks will require two significant undertakings: first, the engineering work of building and integrating custom operators, data layouts, etc., into existing frameworks and networks; and second, the research necessary to determine when, how, and to what extent to include approximate kernels inspired by our approach. A particular difficulty with the latter is that our hash function is not differentiable.

We believe that accelerating full networks with our ideas is a promising direction, particularly for inference. This is especially true at the hardware level—our method requires only multiplexers, not multipliers, and can therefore be implemented easily and with far less power than current matrix product logic. Moreover, our encoded representation and lookup tables have contiguous and uniformly-sized elements, making our approximate GEMM inner loops nearly identical to their dense counterparts; i.e., there is no need for complex access patterns or sparsity handling.

In summary, we introduced MADDNESS, an algorithm that achieves up to a 10× better speed-quality tradeoff than existing methods for the well-studied problem of approximate matrix multiplication (AMM), as measured on a large, diverse, and challenging set of real-world matrices. Our approach is a significant departure from existing AMM work in that it relies on hashing and table lookups rather than multiply-add operations. Our results suggest that future methods similar to our own might hold promise for accelerating convolution, deep learning, and other workloads bottlenecked by linear transforms.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Achlioptas, D. Database-friendly random projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281, 2001.

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009. doi: 10.1137/060673096.

André, F., Kermarrec, A.-M., and Le Scouarnec, N. Accelerated nearest neighbor search with Quick ADC. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pp. 159–166, 2017.

André, F., Kermarrec, A.-M., and Le Scouarnec, N. Quicker ADC: Unlocking the hidden potential of product quantization with SIMD. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931–938, 2014.

Babenko, A. and Lempitsky, V. Tree quantization for large-scale similarity search and classification. In CVPR, pp. 1–9, 2015.

Babenko, A., Arandjelović, R., and Lempitsky, V. Pairwise quantization. arXiv preprint arXiv:1606.01550, 2016.

Bakhtiary, A. H., Lapedriza, A., and Masip, D. Speeding up neural networks for large scale classification using WTA hashing. arXiv preprint arXiv:1504.07488, 2015.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Blalock, D. W. and Guttag, J. V. Bolt: Accelerated data mining with fast vector compression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–735. ACM, 2017.

Blalock, D. W., Ortiz, J. J. G., Frankle, J., and Guttag, J. V. What is the state of neural network pruning? In Dhillon, I. S., Papailiopoulos, D. S., and Sze, V. (eds.), Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020. URL https://proceedings.mlsys.org/book/296.pdf.

Camacho, J., Smilde, A., Saccenti, E., and Westerhuis, J. All sparse PCA models are wrong, but some are useful. Part I: Computation of scores, residuals and explained variance. Chemometrics and Intelligent Laboratory Systems, 196:103907, 2020.

Chen, B., Medini, T., and Shrivastava, A. SLIDE: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. arXiv preprint arXiv:1903.03129, 2019.

Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., and Chen, Y. Compressing neural networks with the hashing trick. In ICML, pp. 2285–2294, 2015.

Chen, Y.-H., Emer, J., and Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM SIGARCH Computer Architecture News, 44(3):367–379, 2016.

Dasgupta, A., Kumar, R., and Sarlós, T. A sparse Johnson-Lindenstrauss transform. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pp. 341–350, 2010.

Dau, H. A., Keogh, E., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A., Yanping, Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G., and Hexagon-ML. The UCR time series classification archive, October 2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.

Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., and Yagnik, J. Fast, accurate detection of 100,000 object classes on a single machine. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1814–1821, 2013.

Desai, A., Ghashami, M., and Phillips, J. M. Improved practical matrix sketching with guarantees. IEEE Transactions on Knowledge and Data Engineering, 28(7):1678–1690, 2016.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006a. doi: 10.1137/S0097539704442684.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006b. doi: 10.1137/S0097539704442696.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184–206, 2006c. doi: 10.1137/S0097539704442702.

Dutta, S., Cadambe, V., and Grover, P. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pp. 2100–2108, 2016.

Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178. IEEE, 2004.

Francis, D. P. and Raimond, K. An improvement of the parameterized frequent directions algorithm. Data Mining and Knowledge Discovery, 32(2):453–482, 2018a. doi: 10.1007/s10618-017-0542-x.

Francis, D. P. and Raimond, K. A practical streaming approximate matrix multiplication algorithm. Journal of King Saud University - Computer and Information Sciences, 2018b. doi: 10.1016/j.jksuci.2018.09.010.

Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755, 2014.

Geifman, Y. cifar-vgg, 2018. https://github.com/geifmany/cifar-vgg.

Ghashami, M., Liberty, E., Phillips, J. M., and Woodruff, D. P. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016. doi: 10.1137/15M1009718.

Guennebaud, G., Jacob, B., et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

Gupta, C., Suggala, A. S., Goyal, A., Simhadri, H. V., Paranjape, B., Kumar, A., Goyal, S., Udupa, R., Varma, M., and Jain, P. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning, pp. 1331–1340. JMLR.org, 2017.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pp. 243–254. IEEE Press, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Huang, Z. Near optimal frequent directions for sketching dense and sparse matrices. Journal of Machine Learning Research, 20(1):23, 2019.

Indyk, P. and Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613, 1998.

Irony, D., Toledo, S., and Tiskin, A. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

Ji, J., Li, J., Yan, S., Zhang, B., and Tian, Q. Super-bit locality-sensitive hashing. In Advances in Neural Information Processing Systems, pp. 108–116, 2012.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. ACM, 2017.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800, 2009.

Khudia, D., Basu, P., and Deng, S. Open-sourcing FBGEMM for state-of-the-art server-side inference, 2018.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Kusner, M., Tyree, S., Weinberger, K., and Agrawal, K. Stochastic neighbor compression. In International Conference on Machine Learning, pp. 622–630, 2014.

Kyrillidis, A., Vlachos, M., and Zouzias, A. Approximate matrix multiplication with application to linear embeddings. arXiv preprint arXiv:1403.7683, 2014.
Liberty, E. Simple and deterministic matrix sketching. arXiv preprint arXiv:1206.0594, 2012.

Liu, S., Shao, J., and Lu, H. Generalized residual vector quantization for large scale data. In Proceedings of the IEEE International Conference on Multimedia and Expo, 2016. doi: 10.1109/ICME.2016.7552944.

Luo, L., Chen, C., Zhang, Z., Li, W.-J., and Zhang, T. Robust frequent directions with application in online learning. Journal of Machine Learning Research, 20(1):41, 2019.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 689–696, 2009.

Manne, S. and Pal, M. Fast approximate matrix multiplication by solving linear systems. arXiv preprint arXiv:1408.4230, 2014.

Martinez, J., Hoos, H. H., and Little, J. J. Stacked quantizers for compositional vector compression. arXiv preprint arXiv:1411.2173, 2014.

Martinez, J., Clement, J., Hoos, H. H., and Little, J. J. Revisiting additive quantization. In European Conference on Computer Vision, pp. 137–153. Springer, 2016.

Mroueh, Y., Marcheret, E., and Goel, V. Co-occurring directions sketching for approximate matrix multiply. arXiv preprint arXiv:1610.07686, 2016.

Nelson, J. and Nguyên, H. L. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 117–126. IEEE, 2013.

Pagh, R. Compressed matrix multiplication. ACM Transactions on Computation Theory, 5(3):1–17, 2013. doi: 10.1145/2493252.2493254.

Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News, 45(2):27–40, 2017.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Sarlos, T. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pp. 143–152. IEEE, 2006. doi: 10.1109/FOCS.2006.37.

Spring, R. and Shrivastava, A. Scalable and sustainable deep learning via randomized hashing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 445–454, 2017.

Teng, D. and Chu, D. A fast frequent directions algorithm for low rank approximation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1279–1293, 2019. doi: 10.1109/TPAMI.2018.2839198.

Wang, J., Shen, H. T., Song, J., and Ji, J. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014a.

Wang, J., Shen, H. T., Yan, S., Yu, N., Li, S., and Wang, J. Optimized distances for binary code ranking. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 517–526, 2014b.

Wang, J., Liu, W., Kumar, S., and Chang, S.-F. Learning to hash for indexing big data—a survey. Proceedings of the IEEE, 104(1):34–57, 2016a.

Wang, W., Chen, C., Chen, W., Rai, P., and Carin, L. Deep metric learning with data summarization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 777–794. Springer, 2016b.

Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 331–344. IEEE, 2019.

Ye, Q., Luo, L., and Zhang, Z. Frequent direction algorithms for approximate matrix multiplication with applications in CCA. In IJCAI, pp. 7, 2016.

Yu, Q., Maddah-Ali, M., and Avestimehr, S. Polynomial codes: An optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, pp. 4403–4413, 2017.
Zhong, K., Guo, R., Kumar, S., Yan, B., Simcha, D., and Dhillon, I. Fast classification with binary prototypes. In Artificial Intelligence and Statistics, pp. 1255–1263, 2017.
D. Aggregation Using Pairwise Averages

Recall that we estimate sums of low-bitwidth integers by averaging pairs of values, then pairs of pairs, and so on. One could reduce all C values this way, but we find that one obtains a better speed-accuracy tradeoff by computing the average of blocks of U values and then upcasting to obtain exact sums of these averages. Multiplying this sum of averages by U and adding in a bias correction term gives one the overall estimate of the sum. One could tune U for a particular problem and hardware, but we simply set U = 16 in all our experiments. Having a larger U imposes less overhead because upcasting happens less often, but there are sharp diminishing returns to this; once upcasting is rare, doing it even less often is of little help thanks to Amdahl's law.

Because of our assumption that we are operating on matrices, rather than a matrix and a vector, we can also improve on the aggregation of existing methods (Blalock & Guttag, 2017; André et al., 2017; 2019) by fusing the aggregation of two or more output columns to hide read latency. Conceptually, this amounts to tiling the loop over output columns and alternating reads between the two corresponding tables within the innermost loop. This fusion does not change the output of the computation—only how efficiently it runs.

Having addressed these practical details, we may now proceed to the analysis of our estimator's bias.

Definition D.1 (Averaging Integer Sum Estimator). Let x ∈ {0, 1}^C, C % U = 0, U = 2^p, p ≥ 0. The Averaging Integer Sum Estimator (AISE) ŝ(x) is defined as:

ŝ(x) ≜ Σ_{k=1}^{C/U} ŝ_U(x_{i_k : j_k})   (17)

ŝ_U(x) ≜ { x_1 if x ∈ R^1;  ⌊(ŝ_U(x_left) + ŝ_U(x_right) + 1)/2⌋ otherwise }   (18)

where i_k = (k − 1) · U + 1, j_k = i_k + U, and x_left and x_right denote vectors formed by taking the initial and final D/2 indices of a given x ∈ R^D.

Definition D.2 (Pairwise Integer Sum and Sum Estimator). For integers a and b, define

s(a, b) ≜ a + b   (19)
ŝ(a, b) ≜ 2µ(a, b)   (20)

where µ(a, b) ≜ ⌊(a + b + 1)/2⌋ is the output of the integer-averaging instruction of Section 4.4.

Proof of Lemma D.1. The proof follows immediately from considering the four equiprobable realizations of the pair a, b. In the cases (0, 0) and (1, 1), 2µ(a, b) = s(a, b). In the cases (0, 1) and (1, 0), 2µ(a, b) = 2, while s(a, b) = 1.

Lemma D.2 (Variance of error when averaging one pair). Consider two scalars a and b with a, b ∼ Bernoulli(.5) i.i.d., and let ε(a, b) ≜ ŝ(a, b) − s(a, b). Then

E[ε(a, b)²] − E[ε(a, b)]² = 1/4.

Proof. Using Lemma D.1, the above can be rewritten as

E[ε(a, b)²] = 1/2.

The proof then follows by again considering the four equiprobable cases as in Lemma D.1. In the cases (0, 0) and (1, 1), ε(a, b)² = 0. In the cases (0, 1) and (1, 0), ε(a, b)² = (2µ(a, b) − s(a, b))² = (2 − 1)² = 1.

Lemma D.3 (Bias of AISE within a subspace). Suppose that the scalar elements x_i of x are drawn from independent Bernoulli(.5) distributions. Then

E[s_U(x) − ŝ_U(x)] = U log₂(U)/4.   (21)

Proof. Observe that the computation graph can be cast as a balanced binary tree with U leaves and each parent equal to the integer average of its children. Consider the bias introduced at each level t of the tree, where t = 0 corresponds to the leaves and t = log₂(U) corresponds to the root. The expected error E[ξ(t, n)] introduced at a node n in level t > 0 is given by:

E[ξ(t, n)] = (1/2) · 2^{t−1},   (22)

where the 1/2 follows from Lemma D.1 and the scale 2^{t−1} is the number of leaf nodes to which the bias is effectively applied. E.g., adding one to the estimated average of four leaf nodes would increase the estimated sum by four. Since there are U · 2^{−t} nodes per level, this means that the total expected error introduced at level t is (1/2) · 2^{t−1} · U · 2^{−t} = U/4. Summing from t = 1 to t = log₂(U) completes the proof of the expected error. Note that t = 0 is omitted since the leaf nodes are not the result of averaging operations and so introduce no error.

Proof of Theorem D.1. This follows immediately from Lemma D.3, the fact that the overall sum is estimated within each of C/U subspaces of size U, and the assumption that the errors in each subspace are independent.

We also verified Theorem D.1 numerically by summing large numbers of integers drawn uniformly from the interval 0, …, 255.

Note that the assumption that the elements are independent is not especially strong in reality. This is because this section focuses on the effects on the least significant bits (which are the ones affected by each averaging operation), and the least significant bit does tend to be nearly uniformly random in a great deal of real-world data.
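A quick Monte Carlo check of the single-pair statistics used in Lemmas D.1 and D.2 (written for this summary, assuming Bernoulli(.5) inputs as the lemmas do):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1_000_000)
b = rng.integers(0, 2, size=1_000_000)
mu = (a + b + 1) // 2            # the rounding-average instruction applied to one pair
eps = 2 * mu - (a + b)           # error of the pairwise estimate s_hat(a, b) = 2*mu(a, b)
print(eps.mean(), eps.var())     # approximately 0.5 and 0.25, matching Lemmas D.1 and D.2
```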
E. Additional Experimental Details

E.1. Choice of Matrix Multiplication Tasks

Because nearly all existing work on approximate matrix multiplication either focuses on special cases that do not satisfy our problem definition (André et al., 2019; Jegou et al., 2011; Ge et al., 2014) or on synthetic matrices, there is not a clear set of benchmark matrix multiply tasks to use. We therefore propose a collection of tasks that we believe are both reproducible and representative of many real-world matrices. To the best of our knowledge, our experiments use over an order of magnitude more matrices than any previous study.
E.2. Choice of Single-Threaded Benchmarks

Given the ubiquity of GPUs and multicore CPUs, it may not be obvious why single-threaded experiments are desirable. There are a number of reasons we restrict our focus to CPUs and the single-threaded case:

• … for sufficiently small matrices. We could perhaps characterize where this breakpoint is, but this is a hardware-specific result that has little to do with our contributions.

• While training of deep neural networks is typically done on GPUs or other accelerators, trained models (including, but not limited to, neural networks) are commonly deployed on smartphones with just CPUs and/or graphics acceleration that is no better than the CPU (Wu et al., 2019). Since most of the billions of smartphones in the world tend to be low-end or old, the need to deploy models on CPUs (including those with few cores) is unlikely to change for many years.

• Creating, benchmarking, and analyzing a performant implementation of our method for GPUs would require a great deal of engineering work. We plan to create such an implementation in the future, but believe that the many empirical and theoretical results we currently have are more than adequate proof of concept and already worth sharing with the community.

E.3. SparsePCA Details

We took steps to ensure that SparsePCA's results were not hampered by insufficient hyperparameter tuning. First, for each matrix product, we tried a range of λ values which we found to encompass the full gamut of nearly 0% to nearly 100% sparsity: λ ∈ {2^i}, i ∈ {−5, −4, −3, −2, −1, 0, 1, 2, 3}. Second, because different sparsity patterns may yield different execution times, we report not times from the single matrix SparsePCA produces for a given (d, λ) pair, but the best times from any of 10 random matrices of the same size and at most the same sparsity. Finally and most importantly, we plot only the Pareto frontier of (speed, quality) pairs produced for a given matrix multiply. I.e., we let SparsePCA cherry-pick its best results on each individual matrix multiply.

E.4. Exact Matrix Multiplication

We also implemented our own matrix product function specialized for tall, skinny matrices. In all cases, we report the timings based on the faster of this function and Eigen's (Guennebaud et al., 2010) matrix multiply function for a given matrix product.

E.5. Additional Baselines

We also tested Frequent Directions / Fast Frequent Directions (Liberty, 2012; Ghashami et al., 2016; Desai et al., 2016), many variations of the sampling method of Drineas et al. (2006a), projection using orthogonalized Gaussian random matrices (Ji et al., 2012), projection using matrices of scaled i.i.d. Rademacher random variables (Achlioptas, 2001), projection using orthonormalized matrices of Rademacher random variables, the co-occurring directions sketch (Mroueh et al., 2016), OSNAP (Nelson & Nguyên, 2013), Product Quantization (Jegou et al., 2011), and Optimized Product Quantization (Ge et al., 2014).

The poor performance of many of these methods is unsurprising in our setting. Given that we have access to a training set on which to learn the true principal components, the Eckart-Young-Mirsky theorem (Eckart & Young, 1936) indicates that PCA should outperform any other individual matrix sketching method employing dense projection matrices, at least in the limit of infinite training data. Also, since PQ and OPQ use 256 dense centroids (except in the Bolt / QuickerADC variations), it is also impossible for them to perform well when min(D, M) is not significantly larger than 256.

E.6. UCR Time Series Archive

We set the number of returned neighbors to 128 (results with 64 and 256 were similar). We omitted datasets with fewer than 128 training examples, since it is not possible for Stochastic Neighbor Compression to draw 128 samples without replacement in this case.

In addition to being a large, public corpus of over a hundred datasets from a huge variety of different domains, the UCR Time Series Archive also has the advantage that it can be used to produce matrix multiplication tasks of a fixed size. This is necessary for meaningful comparison of speed versus accuracy tradeoffs across datasets. We constructed training and test matrices Ã and A by resampling each time series in each dataset's train and test set to a length of 320 (the closest multiple of 32 to the median length of 310). We obtained the matrix B for each dataset by running Stochastic Neighbor Compression (Kusner et al., 2014) on the training set with an RBF kernel of bandwidth one. We set the number of returned neighbors to 128 (results with 64 and 256 were similar), yielding a B matrix of size 320 × 128. Since different datasets have different test set sizes, all results are for a standardized test set size of 1000 rows. We wanted the length to be a multiple of 32 since existing methods operate best with sizes that are either powers of two or, failing that, multiples of large powers of two.

We approximate Euclidean distances using the identity ‖x − y‖²₂ = ‖x‖²₂ − 2x^⊤y + ‖y‖²₂. We approximate only the inner products x^⊤y, since ‖y‖²₂ can be precomputed for fixed exemplars y and ‖x‖²₂ doesn't affect the class prediction since it is constant across all exemplars for a given input x.
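A sketch of how these approximate inner products plug into the distance computation above (illustrative names and setup, not a released API; evaluation details follow this subsection):

```python
import numpy as np

def predict_ucr(X, B, exemplar_labels, approx_mm=None):
    """X: (N, 320) resampled test series, B: (320, 128) exemplars from Stochastic
    Neighbor Compression, exemplar_labels: (128,). Only the inner products X @ B
    are approximated; ||y||^2 is precomputed and ||x||^2 is dropped because it is
    constant across exemplars for a given input."""
    inner = X @ B if approx_mm is None else approx_mm(X, B)
    y_sq = np.sum(B * B, axis=0)                  # precomputed ||y||^2 per exemplar
    dists = y_sq - 2.0 * inner                    # ||x - y||^2 up to a per-row constant
    return exemplar_labels[np.argmin(dists, axis=1)]
```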
E.8. Why Not Speed Up Whole Neural Nets?

Using our ideas to accelerate overall neural networks and other layer types would be a valuable contribution. In fact, we are actively working on this problem. However, as we state in the introduction and problem statement, our focus in this paper is approximate matrix multiplication (AMM) and we deliberately make no claim about accelerating entire neural networks or convolutional layers. We limit our scope in this way for several reasons:

1. Approximate matrix multiplication is an established research problem of general interest independent of deep learning.

2. Lifting a method from accelerating a single layer to an overall network is challenging. Just as scalar quantization of network parameters is simple for a single layer but an active area of research for an entire network, so too is using our method on multiple layers at once an open research problem. For example, it is not clear how to deal with the fact that distributions of activations change throughout training, or how to efficiently incorporate our non-differentiable hash function. We could […]

In Section 5, we showed the classification accuracy as a function of wall time for the CIFAR-10 and CIFAR-100 softmax classifiers, as well as on the UCR datasets. In Figure 8 and Figure 9, we instead show normalized mean squared error versus time. In Figure 10 and Figure 11, we show accuracy or NMSE versus number of operations performed, where one operation is either one multiply-add or one table lookup, depending on the method. The first two figures illustrate that NMSE is closely related to classification accuracy, but with imperfect NMSE still yielding excellent accuracy in many cases. The second two figures show that our method's superior results are not merely caused by the use of faster CPU instructions, but also by the use of fewer basic operations at the algorithm level.
[Figure 8 plot: 1 − NMSE vs. speedup over exact matrix multiply, CIFAR-10 and CIFAR-100 panels; methods: MADDNESS, MADDNESS-PQ, Exact, ScalarQuantize, Bolt, FastJL, HashJL, PCA, SparsePCA.]
Figure 8: MADDNESS achieves a far better speed versus squared error tradeoff than any existing method when approximating two softmax classifiers. These results parallel the speed versus classification accuracy results, except that the addition of our ridge regression is much more beneficial on CIFAR-100.

[Figure 9 plot: fraction of UCR datasets exceeding 1 − NMSE thresholds (0.75, 0.95) vs. speedup over exact matrix multiply; same methods as Figure 8.]
Figure 9: MADDNESS achieves the lowest squared error at high speedups on the UCR datasets. These results parallel the speed versus classification accuracy results.

[Figure 10 plot: classification accuracy vs. number of operations, CIFAR-10 and CIFAR-100 panels; same methods as Figure 8.]
Figure 10: MADDNESS achieves the best speed versus accuracy tradeoff on the CIFAR datasets of any method even when speed is measured as number of operations instead of wall time. Note that fewer operations with a high accuracy (up and to the left) is better.

[Figure 11 plot: 1 − NMSE vs. number of operations when approximating a Sobel filter and a Gaussian filter; methods: MADDNESS, MADDNESS-PQ, SparsePCA.]
Figure 11: MADDNESS still achieves the best speed versus squared error tradeoff on the image processing tasks when speed is measured as number of operations instead of wall time.
where Σ_λ ≜ Σ² + λI and Σ′ ≜ (Σ² + λI)^{−1}Σ. Step 27 follows from the equality V λI V^⊤ = λVV^⊤ = λI. Because the matrices V and U^⊤ are orthonormal and Σ′ is diagonal, the singular values of Z are equal to the diagonal entries of Σ′. Each entry σ′_i is equal to

σ′_i = σ_i / (σ_i² + λ).   (32)

This expression attains its maximal value of 1/(2√λ) when σ_i² = λ.

Lemma F.3 (Ridge Regression Singular Value Bound). Let X ∈ R^{N×D} and Y ∈ R^{D×M} be arbitrary matrices and let W ≜ (X^⊤X + λI)^{−1}X^⊤Y, λ > 0, be the ridge regression weight matrix. Then ‖W‖_∞ ≤ ‖Y‖_∞ / (2√λ), where ‖·‖_∞ denotes the largest singular value.

Proof. Observe that W = ZY, where Z ≜ (X^⊤X + λI)^{−1}X^⊤. Then by applying Lemma F.2 and recalling that Schatten norms are submultiplicative, we have

‖W‖_∞ ≤ ‖Z‖_∞ ‖Y‖_∞ ≤ ‖Y‖_∞ / (2√λ).   (33)

Lemma F.4 (Bound on MADDNESS Embedding Norm). Let g = g(a) be the encoding of an arbitrary vector a using C codebooks and let P be the prototype matrix learned by MADDNESS using training matrix Ã with ridge regression parameter λ > 0. Then

‖g^⊤ P‖₂ ≤ (C / (2√λ)) ‖Ã‖_∞,   (34)

where ‖Ã‖_∞ denotes the largest singular value of Ã.

Proof. If MADDNESS had infinite-precision lookup tables, ŷ would exactly equal â^⊤b. We therefore need only bound the error introduced by the quantization. By Lemma F.4, ‖â‖₂ ≤ Cσ_A / (2√λ). This implies that

‖â^⊤ b‖ ≤ C σ_A ‖b‖₂ / (2√λ)   (38)

and therefore

−C σ_A ‖b‖₂ / (2√λ) ≤ â^⊤ b ≤ C σ_A ‖b‖₂ / (2√λ).   (39)

For each of the C codebooks, this means that the value to be quantized lies in the interval [−σ_A‖b‖₂/(2√λ), σ_A‖b‖₂/(2√λ)] of width σ_A‖b‖₂/√λ. Because MADDNESS quantizes the lookup tables such that largest and smallest entries for any row of P are linearly mapped to 255.5 and −0.5,⁴ respectively, the worst-case quantization error is when the quantized value lies exactly between two quantization levels. We therefore need to compute the largest possible gap between a value and its quantization. Using 256 quantization levels, the largest possible gap is half of 1/256, i.e., 1/512, of the interval width. Multiplying by the above interval width yields a maximum quantization error for a given codebook of σ_A‖b‖₂/(512√λ). Because the errors in each subspace may not agree in sign, their sum is an upper bound on the overall quantization error.

⁴We use 255.5 and −0.5 rather than 255 and 0 because the latter only guarantees that a point is within 1/510 of the interval width, not 1/512. This is not an important choice and either option would be fine.
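A quick numerical sanity check of the ridge regression singular value bound in Lemma F.3 (written for this summary, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
for _ in range(5):
    X = rng.normal(size=(200, 30))
    Y = rng.normal(size=(30, 8))
    W = np.linalg.solve(X.T @ X + lam * np.eye(30), X.T @ Y)   # ridge regression weights
    # largest singular value of W must not exceed ||Y||_inf / (2 sqrt(lam))
    assert np.linalg.norm(W, 2) <= np.linalg.norm(Y, 2) / (2 * np.sqrt(lam)) + 1e-9
```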
At this point, we have all of the pieces necessary to prove a generalization guarantee for MADDNESS-Regress save one: a theorem linking the norms of the various vectors and matrices involved to a probabilistic guarantee. Kakade et al. (2009) provide such a guarantee, based on Rademacher complexity (Bartlett & Mendelson, 2002).

Theorem F.1 ((Kakade et al., 2009), Corollary 5). Let F = {w^⊤x : ‖w‖₂ ≤ W} be the class of linear functions with bounded L2 norms, let S be a set of n samples drawn i.i.d. from some distribution 𝒟 over the L2 ball of radius X, and let L(f), f ∈ F, be a loss function with Lipschitz constant L. Then for any 0 < δ < 1, it holds with probability at least 1 − δ over the sample S that

E_𝒟[L(f)] ≤ E_S[L(f)] + (LXW / √(2n)) (8 + √(−log(δ))).   (40)

Since we have the Lipschitz constant of the loss, a bound on the L2 norm of g, and a bound on the L2 norm of w, we can apply this theorem. The Lipschitz constant for the absolute loss is 1. The L2 norm of g is exactly C. The L2 norm of w can be bounded as

‖w‖₂ = ‖Pb‖₂ ≤ ‖P‖_∞ ‖b‖₂ ≤ σ_A ‖b‖₂ / (2√λ)   (43)

using Lemma F.3.

Using this lemma, the proof of Theorem 4.1 is immediate; we begin with Lemma F.6 and simply union bound over all 2^{C(4⌈log₂(D)⌉+120)} hypotheses from Lemma F.1:

E_𝒟[L(a, b)] ≤ E_Ã[L(a, b)] + C σ_A ‖b‖₂ / (512√λ) + (C σ_A ‖b‖₂ / (2√(2nλ))) (8 + √(−log(δ))).   (41)