How Can We Be So Dense? The Benefits of Using Highly Sparse Representations
Abstract

Most artificial networks today rely on dense representations, whereas biological networks rely on sparse representations. In this paper we show how sparse representations can be more robust to noise and interference, as long as the underlying dimensionality is sufficiently high. A key intuition that we develop is that the ratio of the operable volume around a sparse vector divided by the volume of the representational space decreases exponentially with dimensionality. We then analyze computationally efficient sparse networks containing both sparse weights and activations. Simulations on MNIST and the Google Speech Commands dataset show that such networks demonstrate significantly improved robustness and stability compared to dense networks, while maintaining competitive accuracy. We discuss the potential benefits of sparsity on accuracy, noise robustness, hyperparameter tuning, learning speed, computational efficiency, and power requirements.

Correspondence to: Subutai Ahmad, Luiz Scheinkman <[sahmad, lscheinkman]@numenta.com>.

1. Introduction

The literature on sparse representations in neural networks dates back many decades, with neuroscience as one of the primary motivations. In 1988 Kanerva proposed the use of sparse distributed memories (Kanerva, 1988) to model the highly sparse representations seen in the brain. In 1997, (Olshausen & Field, 1997) showed that incorporating sparse priors and sparse cost functions in encoders can lead to receptive field representations that are remarkably close to what is observed in the primate visual cortex. More recently, (Lee et al., 2008; Chen et al., 2018) showed hierarchical sparse representations that qualitatively lead to natural looking hierarchical feature detectors. Several studies (Lee et al., 2009; Nair & Hinton, 2009; Srivastava et al., 2013; Rawlinson et al., 2018) have shown that sparse representations can also lead to improved test set accuracies.

Despite the above literature, the majority of neural networks today rely on dense representations. One exception is the pervasive use of dropout (Srivastava et al., 2014) as a regularizer. Dropout randomly "kills" a percentage of the units (in practice usually 50%) on every training input presentation. Variational dropout techniques tune the dropout rates individually per weight (Molchanov et al., 2017). Dropout introduces random sparse representations during learning, and has been shown to be an effective regularizer in many contexts.

In this paper we discuss certain inherent benefits of high dimensional sparse representations. We focus on robustness and sensitivity to interference. These are central issues with today's neural network systems, where even small (Szegedy et al., 2013) and large (Rosenfeld et al., 2018) perturbations can cause dramatic changes to a network's output. We offer two main contributions. First, we analyze high dimensional sparse representations and show that such representations are naturally more robust to noise and interference from random inputs. When matching sparse patterns, corrupted versions of a pattern are "close" to the original, whereas random patterns are exponentially hard to match.

Our second contribution is an efficient construction of sparse deep networks that is designed to exploit the above properties. We implement networks where the weights for each unit in a layer randomly sample from a sparse subset of the source layer below. In addition, the output of each layer is constrained such that only the k most active units are allowed to be non-zero, where k is much smaller than the number of units in that layer. In these networks, the number of non-zero products for each layer is approximately (sparsity of layer i) × (sparse weights of layer i + 1). This formulation results in simple differentiable sparse layers that can be dropped into both standard linear and convolutional layers.

We demonstrate significantly improved robustness to noise for MNIST and the Google Speech Commands dataset, while maintaining competitive accuracy in the standard zero-noise case.
these sets. As you decrease θ, matching is less strict and you can match noisier versions of each prototype. The cost is that the chance of matching the other vectors also increases. It turns out that for sparse vectors, this cost is offset as you increase n. That is, as n increases, the denominator in Eq. 3 (and the corresponding "free" space) increases much faster than the numerator. For a fixed sparsity level, you can maintain highly tolerant matches without the cost of additional false positives simply by increasing the dimensionality.

Fig. 2 illustrates this trend for some example sparsities. In this figure we simulated matching against random vectors and plotted the match rates as a function of the number of active bits and the underlying dimensionality. In the simulation we repeatedly generated a random prototype vector with |x_i| = 24 bits on and then attempted to match it against random test vectors with a bits on. We matched using a threshold θ of 12, which meant that even vectors that were up to 50% different from x_i would match. We varied a and the dimensionality of the vectors, n.

Figure 2. The probability of matches to random binary vectors (with a active bits) as a function of dimensionality, for various levels of sparsity. The probability decreases exponentially with n. Black circles denote the observed frequency of a match (based on a large number of trials). The dotted lines denote the theoretically predicted probabilities using Eq. 3.

The chart shows that for sparse binary vectors, match rates with random vectors drop rapidly as the underlying dimensionality increases. The horizontal line indicates the probability of matching x_i against dense vectors, with a = n/2. The probability of dense matches stays relatively high and is unaffected by dimensionality, indicating that both sparseness and high dimensionality are key to robust matches. In (Ahmad & Hawkins, 2016) we develop additional properties, including the probability of false negatives.
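For concreteness, the following NumPy sketch estimates the same false-match frequency empirically. It is an illustrative reconstruction of the experiment described above, not the code used to produce Figure 2; the trial count and function names are our own choices, while the prototype size (24 active bits) and the threshold (θ = 12) follow the text.

```python
import numpy as np

def random_binary_vector(n, num_active, rng):
    """Binary vector of length n with exactly num_active bits set to 1."""
    v = np.zeros(n, dtype=np.int64)
    v[rng.choice(n, size=num_active, replace=False)] = 1
    return v

def false_match_rate(n, a, proto_bits=24, theta=12, trials=50_000, seed=0):
    """Estimate the probability that a random vector with `a` active bits
    overlaps a random prototype (proto_bits active) in at least theta positions."""
    rng = np.random.default_rng(seed)
    matches = 0
    for _ in range(trials):
        prototype = random_binary_vector(n, proto_bits, rng)
        test = random_binary_vector(n, a, rng)
        if prototype @ test >= theta:  # dot product = overlap for binary vectors
            matches += 1
    return matches / trials

# Sparse test vectors (a = 128): false matches drop rapidly as n grows.
# Dense test vectors (a = n/2): the false-match rate stays high regardless of n.
for n in (500, 1000, 2000):
    print(n, false_match_rate(n, a=128), false_match_rate(n, a=n // 2))
```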
2.3. Matching Sparse Scalar Vectors

Deep networks operate on scalar vectors, and in this section we consider how the above ideas apply to sparse scalar representations. Binary and scalar vectors are similar in that the components containing zero do not affect the dot product, and thus the combinatorics in Eq. 3 are still applicable. Eq. 1 represents the set of scalar vectors where the number of non-zero multiplies in the dot product is exactly b, and Eq. 3 represents the probability that the number of non-zero multiplies is ≥ θ. However, an additional factor is the distribution of scalar values. If components in one vector are extremely large relative to θ, the likelihood of a significant match will be high even with a single shared non-zero component.

We wanted to see if the exponential drop in random matches for binary vectors, demonstrated by Figure 2, can be obtained using scalar vectors, and if so, the conditions under which they hold. Let x_w and x_i represent two sparse vectors such that ‖x_w‖_0 and ‖x_i‖_0 count the number of non-zero entries in each. Let each non-zero component be independent and sampled from the distributions P_{θ_w}(x_w) and P_{θ_i}(x_i). The probability of a significant match is then:

P(x_w · x_i ≥ θ) = [ Σ_{b=θ}^{‖x_w‖_0} p_b · |Ω_n(x_w, b, ‖x_i‖_0)| ] / (n choose ‖x_i‖_0)    (4)

where p_b is the probability that the dot product is ≥ θ given that the overlap is exactly b components:

p_b = P(x_w · x_i ≥ θ | ‖x_w · x_i‖_0 = b)    (5)

There does not appear to be a closed form way to compute p_b for normal or uniform distributions, so we resort to simulations that mimic our network structure.

As before, we generated a large number of random vectors x_w and x_i, and plotted the frequency of random matches. With ‖x_w‖_0 = k, we focus on simulations where the non-zero entries in x_w are uniform in [−1/k, 1/k], and the non-zero entries in x_i are uniform in S · [0, 2/k]. We focus on this formulation because of the relationship to common network structures and weight initialization. x_w is a putative weight vector and x_i is an input vector to this layer from the previous layer (we assume unit activations are positive, the result of a ReLU-like non-linearity). S controls the scale of x_i relative to x_w.
Figure 3. Left: The probability of matches to random scalar vectors (with a non-zero components) as a function of dimensionality, for various levels of sparsity. The probability of false matches decreases exponentially with n. Note that the probability for a dense vector, a = n/2, stays relatively high and does not decrease with dimensionality. Right: The impact of scale on vector matches with a fixed n = 1000. The larger the scaling discrepancy, the higher the probability of a false match.

Figure 3 (left) shows the behavior with k = 32 and S = 1. We varied the activity of the input vectors, ‖x_i‖_0 = a, and the dimensionality of the vectors, n. We set θ = E[x_w · x_w]/2. The chart demonstrates that under these conditions we can achieve robust behavior similar to that of binary vectors. Figure 3 (right) plots the effect of S on the match probabilities with a fixed n = 1000. As this chart shows, the error increases significantly as S increases. Taken together, these results show that the fundamental robustness properties of binary sparse vectors can also hold for sparse scalar vectors, as long as the overall scaling of the vectors is in a similar range.
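The scalar-vector experiment can be sketched the same way. The NumPy fragment below follows the setup stated above (k = 32 non-zero weight entries uniform in [−1/k, 1/k], input entries uniform in S · [0, 2/k], and θ = E[x_w · x_w]/2, computed analytically); it is a simplified stand-in for the code behind Figure 3, and the function names are ours.

```python
import numpy as np

def sparse_scalar_vector(n, num_nonzero, low, high, rng):
    """Length-n vector with num_nonzero entries drawn uniformly from [low, high]."""
    v = np.zeros(n)
    v[rng.choice(n, size=num_nonzero, replace=False)] = rng.uniform(low, high, size=num_nonzero)
    return v

def scalar_false_match_rate(n, a, k=32, s=1.0, trials=50_000, seed=0):
    """Estimate P(x_w . x_i >= theta) for random sparse scalar vectors."""
    rng = np.random.default_rng(seed)
    # For entries uniform in [-1/k, 1/k], E[x^2] = (1/k)^2 / 3, so
    # theta = E[x_w . x_w] / 2 = (k * (1/k)^2 / 3) / 2.
    theta = 0.5 * k * (1.0 / k) ** 2 / 3.0
    matches = 0
    for _ in range(trials):
        x_w = sparse_scalar_vector(n, k, -1.0 / k, 1.0 / k, rng)   # putative weights
        x_i = sparse_scalar_vector(n, a, 0.0, s * 2.0 / k, rng)    # positive "inputs"
        if x_w @ x_i >= theta:
            matches += 1
    return matches / trials

# Sparse inputs rarely produce spurious matches; dense inputs (a = n/2) do so far more often.
print(scalar_false_match_rate(n=1000, a=128))
print(scalar_false_match_rate(n=1000, a=500))
```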
2.4. Non-uniform Distribution of Vectors

Eq. 3 assumes the ideal case where vectors are chosen with a uniform random distribution. With a non-uniform distribution the error rates will be higher; the more non-uniform the distribution, the worse the error rates. For example, if you mostly end up observing 10 inputs, your error rates will be bounded at around 10%. Thus, to optimize error rates, it is important to be as close to a uniform distribution as possible.

3. Sparse Network Description

Here we discuss a particular sparse network implementation that is designed to exploit Eq. 3. This implementation is an extension of our previous work on the HTM Spatial Pooler, a binary sparse coding algorithm that models sparse code generation in the neocortex (Hawkins et al., 2011; Cui et al., 2017). Specifically, we formulate a version of the Spatial Pooler that is designed to be a drop-in layer for neural networks trained with back-propagation. Our work is also closely related to previous literature on k-winner-take-all networks (Majani et al., 1989) and fixed sparsity networks (Makhzani & Frey, 2015).

Consider a network with L hidden layers. Let y^l denote the vector of outputs from layer l, with y^0 as the input vector. W^l and u^l are the weights and biases for each layer. In a standard neural network the weights W^l are typically dense and initialized using a uniform random distribution. The feed forward outputs are then calculated as follows:

ŷ^l = W^l · y^{l−1} + u^l
y^l = f(ŷ^l)

where f is any activation function, such as tanh(·) or ReLU(·) (Figure 4, left).
Figure 4. This figure illustrates the differences between a generic dense network layer (left) and a sparse network layer (right). In the
sparse layer, the linear layer subsamples from its input layer (implemented via sparse weights, depicted with fewer arrows). In addition,
the ReLU layer is replaced by a k-winners layer.
To implement our sparse networks, we make two modifications to this basic formulation (Figure 4, right). First, we initialize the weights using a sparse random distribution, such that only a fraction of the weights contain non-zero values. Non-zero weights are initialized using standard Kaiming initialization (He et al., 2015b). The rest of the connections are treated as non-existent, i.e. the corresponding weights are zero throughout the life of the network. Second, only the top-k active units within each layer are maintained in y^l, and the rest are set to zero. This k-winners step is non-linear and can be thought of as a substitute for the ReLU function. Instead of a threshold of 0, the threshold here is adaptive and corresponds to the k-th largest activation (Makhzani & Frey, 2013).

The layer can be trained using standard gradient descent. Similar to ReLU, the gradient of the layer is calculated as 1 above the threshold and 0 elsewhere. During inference we increase k by 50%, which led to slightly better accuracies. In all our simulations the last layer of each network is a standard linear output layer with a log-softmax activation function.
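As a concrete illustration, here is a minimal PyTorch sketch of such a layer. It is not the released implementation (see Section 5.1 for the source code); the class names, the example weight density of 40%, and the exact masking details are our own simplifications of the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KWinners(torch.autograd.Function):
    """Keep the k largest activations per sample; the gradient is 1 for winners, 0 elsewhere."""
    @staticmethod
    def forward(ctx, x, k):
        indices = torch.topk(x, k, dim=1).indices
        mask = torch.zeros_like(x).scatter_(1, indices, 1.0)
        ctx.save_for_backward(mask)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        mask, = ctx.saved_tensors
        return grad_output * mask, None

class SparseLinear(nn.Linear):
    """Linear layer with a fixed random sparse weight mask chosen at construction."""
    def __init__(self, in_features, out_features, weight_density=0.4):
        super().__init__(in_features, out_features)
        # Each output unit keeps roughly weight_density of its incoming connections.
        self.register_buffer("mask", (torch.rand(out_features, in_features) < weight_density).float())
        with torch.no_grad():
            self.weight *= self.mask  # masked weights start at zero ...

    def forward(self, x):
        # ... and stay at zero: applying the mask here also zeroes their gradients.
        return F.linear(x, self.weight * self.mask, self.bias)

# A sparse hidden layer: ~40% non-zero weights, 30 of 300 units active after k-winners.
layer = SparseLinear(784, 300, weight_density=0.4)
y = KWinners.apply(layer(torch.rand(16, 784)), 30)
```

The inference-time behavior (increasing k by 50%) and the boosted winner selection described next are omitted here for brevity.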
3.1. Boosting

One practical issue with the above formulation is that it is possible for a small number of units to initially dominate and then, through learning, become active for a large percentage of patterns (this was also noted in (Makhzani & Frey, 2015; Cui et al., 2017)). Having a small number of active units negatively impacts the available representational volume. It is desirable for every unit to be equally active in order to maximize the robustness of the representation in Eq. 3.

To address this we employ a boosting term (Hawkins et al., 2011; Cui et al., 2017) which favors units that have not been active recently. We compute a running average of each unit's duty cycle (i.e. how frequently it has been one of the top k units):

d^l_i(t) = (1 − α) · d^l_i(t − 1) + α · [i ∈ topIndices^l]    (6)

A boost coefficient b^l_i is then calculated for each unit based on the target duty cycle and the current average duty cycle:

b^l_i(t) = e^{β (â^l − d^l_i(t))}    (7)

The target duty cycle â^l is a constant reflecting the percentage of units that are expected to be active, i.e. â^l = k/|y^l|. The boost factor, β, is a positive parameter that controls the strength of boosting. β = 0 implies no boosting (b^l_i = 1), and higher values lead to larger boost coefficients. In (Hawkins et al., 2011; Cui et al., 2017) we showed that Eq. 7 encourages each unit to have equal activation frequency and effectively maximizes the entropy of the layer.

The boost coefficients are used during the k-winners step to select which units remain active for this input. Through boosting, units which have not been active recently have a disproportionately higher impact and are more likely to win, whereas overly active units are de-emphasized. To determine the output of the layer, the non-boosted activity of the winning units is used.
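The following PyTorch-style fragment sketches how Eqs. 6 and 7 could be wired around the k-winners step. It is a paraphrase of the description above rather than the released code: the batch-averaged duty-cycle update and the default values of α and β are our own assumptions.

```python
import torch

def update_duty_cycles(duty_cycles, winner_mask, alpha=0.01):
    """Eq. 6: exponential running average of how often each unit is among the top k."""
    batch_frequency = winner_mask.float().mean(dim=0)  # fraction of this batch each unit won
    return (1.0 - alpha) * duty_cycles + alpha * batch_frequency

def boost_coefficients(duty_cycles, k, n, beta=1.5):
    """Eq. 7: units below the target duty cycle (k/n) receive a boost greater than 1."""
    target_duty_cycle = k / n
    return torch.exp(beta * (target_duty_cycle - duty_cycles))

def k_winners_with_boost(x, duty_cycles, k, beta=1.5):
    """Pick winners using the boosted activity, but output the non-boosted activity of the winners."""
    boosted = x * boost_coefficients(duty_cycles, k, x.shape[1], beta)
    indices = torch.topk(boosted, k, dim=1).indices
    mask = torch.zeros_like(x).scatter_(1, indices, 1.0)
    return x * mask, mask

# One training step's worth of bookkeeping for a 300-unit layer with k = 30.
activations = torch.rand(16, 300)
duty_cycles = torch.zeros(300)
output, winners = k_winners_with_boost(activations, duty_cycles, k=30)
duty_cycles = update_duty_cycles(duty_cycles, winners)
```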
Figure 5. A. Example MNIST images with varying levels of noise. B. Classification accuracy as a function of noise level.
Table 3. Key parameters for each network. L1F and L2F denote the number of filters at the corresponding CNN layer. L1,2,3 sparsity
indicates k/n, the percentage of outputs that were enforced to be non-zero. 100% indicates a special case where we defaulted to traditional
ReLU activations. Wt sparsity indicates the percentage of weights that were non-zero. All parameters are available in the source code.
vector matching in the context of binary sparse representations. We then constructed efficient neural network formulations of sparse networks that place internal representations in the sweet spot suggested by the theory. In particular, we aim to match sparse activations with sparse weights in relatively high dimensional settings. A boosting rule was used to increase the overall entropy of the internal layers in order to maximize the utilization of the representational space. We showed that this formulation increases the overall robustness of the system to noisy inputs using MNIST and the Google Speech Commands dataset. Both dense and sparse networks showed high accuracies, but the sparse nets were significantly more robust. These results suggest that it is important to look beyond pure test set performance, as test accuracy by itself is not a reliable indicator of overall robustness.

Our work extends the existing literature on sparsity and pruning. A recent theoretical paper showed that simple linear sparse networks may be more robust to adversarial attacks (Guo et al., 2018). A number of papers have shown that it is possible to effectively introduce sparsity through pruning and retraining (Han et al., 2015; Frankle & Carbin, 2018; Lee et al., 2018). The mechanisms introduced here can be seen as complementary to those techniques. Our network enforces sparse weights from the beginning by construction, and the sparse weights are learned as part of the training process. In addition, we reduce the overall computational complexity by enforcing sparse activations, which in turn significantly reduces the number of overall non-zero products. This should produce significant power savings for optimized hardware implementations.

We demonstrated increased robustness in our networks, whereas the papers on pruning typically do not explicitly test robustness. It is possible that such networks are also more robust, though this remains to be tested. Pruning techniques in general are quite orthogonal to ours, and it may be feasible to combine them with the mechanisms discussed here.

In our work we did not attempt to introduce sparsity into the convolutional filters themselves. (Li et al., 2016) have shown it is sometimes possible to remove entire filters from large CNNs, suggesting that sparsifying filter weights may also be possible, particularly in networks with larger filters. Introducing sparse convolutions within the context of the techniques in this paper is an area of future exploration. The techniques described here are straightforward to implement and can be extended to other architectures including RNNs. This is yet another promising area for future research.

5.1. Software

All code and experiments are available as open source at https://github.com/numenta/htmpapers.

Acknowledgements

We thank Jeff Hawkins, Ali Rahimi, and John Berkowitz for helpful discussions and comments.

References

Ahmad, S., & Hawkins, J. (2016). How do neurons operate on sparse distributed representations? A mathematical theory of sparsity, neurons and active dendrites. arXiv:1601.00720 [q-bio.NC]. URL https://arxiv.org/abs/1601.00720
Chen, Y., Paiton, D., & Olshausen, B. (2018). The Sparse Manifold Transform. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31 (pp. 10533–10544). Curran Associates, Inc.

Cui, Y., Ahmad, S., & Hawkins, J. (2017). The HTM Spatial Pooler: a neocortical algorithm for online sparse distributed coding. Frontiers in Computational Neuroscience, 11, 111. URL https://www.frontiersin.org/articles/10.3389/fncom.2017.00111/abstract

Frankle, J., & Carbin, M. (2018). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. URL http://arxiv.org/abs/1803.03635

Guo, Y., Zhang, C., Zhang, C., & Chen, Y. (2018). Sparse DNNs with Improved Adversarial Robustness. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31 (pp. 240–249). Curran Associates, Inc.

Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both Weights and Connections for Efficient Neural Network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 1135–1143). Curran Associates, Inc.

Hawkins, J., Ahmad, S., & Dubinsky, D. (2011). Cortical Learning Algorithm and Hierarchical Temporal Memory. URL http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf

He, K., Zhang, X., Ren, S., & Sun, J. (2015a). Deep Residual Learning for Image Recognition. URL http://arxiv.org/abs/1512.03385

He, K., Zhang, X., Ren, S., & Sun, J. (2015b). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. URL http://arxiv.org/abs/1502.01852

Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. URL http://arxiv.org/abs/1502.03167

Kanerva, P. (1988). Sparse Distributed Memory. Cambridge, MA: The MIT Press.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

Lee, H., Ekanadham, C., & Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems.

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09) (pp. 1–8).

Lee, N., Ajanthan, T., & Torr, P. H. S. (2018). SNIP: Single-shot Network Pruning based on Connection Sensitivity. URL http://arxiv.org/abs/1810.02340

Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning Filters for Efficient ConvNets. URL http://arxiv.org/abs/1608.08710

Majani, E., Erlanson, R., & Abu-Mostafa, Y. S. (1989). On the k-winners-take-all network. In Advances in Neural Information Processing Systems (pp. 634–642).

Makhzani, A., & Frey, B. (2013). k-Sparse Autoencoders. URL http://arxiv.org/abs/1312.5663

Makhzani, A., & Frey, B. (2015). Winner-take-all autoencoders. Advances in Neural Information Processing Systems. URL http://papers.nips.cc/paper/5783-winner-take-all-autoencoders

Molchanov, D., Ashukha, A., & Vetrov, D. (2017). Variational Dropout Sparsifies Deep Neural Networks. URL http://arxiv.org/abs/1701.05369

Nair, V., & Hinton, G. E. (2009). 3D Object Recognition with Deep Belief Nets. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (pp. 1339–1347). Curran Associates, Inc.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

Rawlinson, D., Ahmed, A., & Kowadlo, G. (2018). Sparse Unsupervised Capsules Generalize Better. URL http://arxiv.org/abs/1804.06094

Rosenfeld, A., Zemel, R., & Tsotsos, J. K. (2018). The Elephant in the Room. URL http://arxiv.org/abs/1808.03305

Sainath, T. N., & Parada, C. (2015). Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association.