
How Can We Be So Dense? The Benefits of Using Highly Sparse Representations

Subutai Ahmad, Luiz Scheinkman
Numenta, Redwood City, California, USA

arXiv:1903.11257v2 [cs.LG] 2 Apr 2019

Abstract

Most artificial networks today rely on dense representations, whereas biological networks rely on sparse representations. In this paper we show how sparse representations can be more robust to noise and interference, as long as the underlying dimensionality is sufficiently high. A key intuition that we develop is that the ratio of the operable volume around a sparse vector divided by the volume of the representational space decreases exponentially with dimensionality. We then analyze computationally efficient sparse networks containing both sparse weights and activations. Simulations on MNIST and the Google Speech Command Dataset show that such networks demonstrate significantly improved robustness and stability compared to dense networks, while maintaining competitive accuracy. We discuss the potential benefits of sparsity on accuracy, noise robustness, hyperparameter tuning, learning speed, computational efficiency, and power requirements.

1. Introduction

The literature on sparse representations in neural networks dates back many decades, with neuroscience as one of the primary motivations. In 1988 Kanerva proposed the use of sparse distributed memories (Kanerva, 1988) to model the highly sparse representations seen in the brain. In 1997, (Olshausen & Field, 1997) showed that incorporating sparse priors and sparse cost functions in encoders can lead to receptive field representations that are remarkably close to what is observed in the primate visual cortex. More recently (Lee et al., 2008; Chen et al., 2018) showed hierarchical sparse representations that qualitatively lead to natural looking hierarchical feature detectors. (Lee et al., 2009; Nair & Hinton, 2009; Srivastava et al., 2013; Rawlinson et al., 2018) showed that introducing sparsity terms can sometimes lead to improved test set accuracies.

Despite the above literature, the majority of neural networks today rely on dense representations. One exception is the pervasive use of dropout (Srivastava et al., 2014) as a regularizer. Dropout randomly "kills" a percentage of the units (in practice usually 50%) on every training input presentation. Variational dropout techniques tune the dropout rates individually per weight (Molchanov et al., 2017). Dropout introduces random sparse representations during learning, and has been shown to be an effective regularizer in many contexts.

In this paper we discuss certain inherent benefits of high dimensional sparse representations. We focus on robustness and sensitivity to interference. These are central issues with today's neural network systems, where even small (Szegedy et al., 2013) and large (Rosenfeld et al., 2018) perturbations can cause dramatic changes to a network's output. We offer two main contributions. First, we analyze high dimensional sparse representations, and show that such representations are naturally more robust to noise and interference from random inputs. When matching sparse patterns, corrupted versions of a pattern are "close" to the original whereas random patterns are exponentially hard to match.

Our second contribution is an efficient construction of sparse deep networks that is designed to exploit the above properties. We implement networks where the weights for each unit in a layer randomly sample from a sparse subset of the source layer below. In addition, the output of each layer is constrained such that only the k most active units are allowed to be non-zero, where k is much smaller than the number of units in that layer. In these networks, the number of non-zero products for each layer is approximately (sparsity of layer i) × (sparse weights of layer i + 1). This formulation results in simple differentiable sparse layers that can be dropped into both standard linear and convolutional layers.

We demonstrate significantly improved robustness to noise for MNIST and the Google Speech Commands dataset, while maintaining competitive accuracy in the standard zero noise scenario. We discuss the number of weights used by sparse networks in these datasets, and the impact of additional pruning. Our work extends the existing literature on sparse networks and pruning (see Section 5 for a comparison with some prior work). At the end of the paper we discuss some possible areas for future work.

Correspondence to: Subutai Ahmad, Luiz Scheinkman <[sahmad, lscheinkman]@numenta.com>.

2. High Dimensional Sparse Representations


In this section we develop some basic properties of sparse representations as they relate to noise robustness and interference. In a typical neural network an input vector is matched against a stored weight vector using a dot product. This is then followed by a threshold-like non-linearity such as tanh(·) or ReLU(·).

Ideally we would like the outputs of each layer to be invariant to noise or corrupted inputs. When comparing two sparse vectors via a dot product, the results are unaffected by the zero components of either vector. A key quantity we consider is the ratio of the matching volume around a prototype vector divided by the volume of the whole space. The larger the match volume around a vector, the more robust it is to noise. The smaller the ratio, the less likely it is that random inputs can affect the match.

2.1. Matching Sparse Binary Vectors

We quantify the above ratio using binary vectors (following our previous work in (Ahmad & Hawkins, 2016)). In this section we show that the ratio decreases exponentially with increased dimensionality, while maintaining a large match volume. Let x be a binary vector of length n, and let |x| denote the number of non-zero entries. The dot product x_i · x_j counts the overlap, or number of shared bits, between two such vectors. We would like to understand the probability of two vectors having significant overlap, i.e. overlap greater than some threshold θ.

[Figure 1. An illustration of the conceptual effect of decreasing the match threshold θ and increasing n, the dimensionality. The large grey circles denote the universe of possible patterns. The smaller circles each represent the set of matches around one vector. When θ is high (A), very few random vectors can match these vectors (small white circles). As you decrease θ, the set of potential matches increases (larger white circles in B). If you then increase n, the universe of possible patterns increases, and the relative sizes of the white circles shrink rapidly.]

We define the overlap set, Ω_n(x_i, b, k), as the set of all vectors of size k that have exactly b bits of overlap with x_i. The number of such vectors can be calculated as:

$$|\Omega_n(x_i, b, k)| = \binom{|x_i|}{b}\binom{n - |x_i|}{k - b} \tag{1}$$

The left half of the above product counts all the ways we can select exactly b bits out of the active bits in |x_i|. The right half counts the number of ways we can select the remaining k − b bits from the components of x_i that are zero. The product of these two quantities represents the number of all vectors with exactly b bits of overlap with x_i. We can now count the number of vectors that match x_i, i.e. where x_i · x_j ≥ θ, as:

$$\sum_{b=\theta}^{|x_i|} |\Omega_n(x_i, b, |x_j|)| \tag{2}$$

If we select vectors from a uniform random distribution, the probability of significant overlap can be calculated as:

$$P(x_i \cdot x_j \ge \theta) = \frac{\sum_{b=\theta}^{|x_i|} |\Omega_n(x_i, b, |x_j|)|}{\binom{n}{|x_j|}} \tag{3}$$

where $\binom{n}{|x_j|}$ is the size of the set of all possible comparison vectors.
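As a concrete illustration of Eqs. 1-3, the short sketch below computes the exact match probability directly from the binomial coefficients (a minimal example; the function names are ours, and the parameter values mirror the simulation described in Section 2.2, with |x_i| = 24 active bits, θ = 12, and random comparison vectors with a = 64 active bits):

```python
from math import comb

def overlap_set_size(n, xi_on, b, k):
    """|Omega_n(x_i, b, k)|: number of k-bit vectors with exactly b bits of
    overlap with a vector x_i that has xi_on active bits (Eq. 1)."""
    if b > xi_on or b > k or (k - b) > (n - xi_on):
        return 0
    return comb(xi_on, b) * comb(n - xi_on, k - b)

def match_probability(n, xi_on, xj_on, theta):
    """P(x_i . x_j >= theta) for a uniformly random comparison vector x_j
    with xj_on active bits (Eq. 3)."""
    numerator = sum(overlap_set_size(n, xi_on, b, xj_on)
                    for b in range(theta, xi_on + 1))
    return numerator / comb(n, xj_on)

# Prototype with 24 active bits, threshold 12, random vectors with 64 active bits.
for n in (500, 1000, 2000, 4000):
    print(n, match_probability(n, xi_on=24, xj_on=64, theta=12))
```

For a fixed sparsity, the printed probabilities fall off rapidly as n grows, which is the trend plotted in Figure 2.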
2.2. Impact of Dimensionality and Sparsity

Two key factors in Eq. 3 are the number of non-zero components, |x_i|, and the dimensionality, n.
Figure 1 provides an intuitive description of their impact. Assume we have M prototype vectors, and we want to match noisy versions of these vectors. Around each prototype there is a set of matching vectors. If the threshold is very high, the set of matching vectors is small (illustrated by the small circles in Figure 1A) and there will be quite a bit of space between these sets. As you decrease θ, matching is less strict and you can match noisier versions of each prototype. The cost is that the chance of matching the other vectors also increases because there is less free space in between (Figure 1B). It turns out that for sparse vectors, this cost is offset as you increase n. That is, as n increases, the denominator in Eq. 3 (and the corresponding "free" space) increases much faster than the numerator. For a fixed sparsity level, you can maintain highly tolerant matches without the cost of additional false positives simply by increasing the dimensionality.

[Figure 2. The probability of matches to random binary vectors (with a active bits) as a function of dimensionality, for various levels of sparsity. The probability decreases exponentially with n. Black circles denote the observed frequency of a match (based on a large number of trials). The dotted lines denote the theoretically predicted probabilities using Eq. 3.]

Figure 2 illustrates this trend for some example sparsities. In this figure we simulated matching with random vectors and plotted match rates with random vectors as a function of the number of active bits and the underlying dimensionality. In the simulation we repeatedly generated a random prototype vector with |x_i| = 24 bits on and then attempted to match against random test vectors with a bits on. We matched using a threshold θ of 12, which meant that even vectors that were up to 50% different from x_i would match. We varied a and the dimensionality of the vectors, n.

The chart shows that for sparse binary vectors, match rates with random vectors drop rapidly as the underlying dimensionality increases. The horizontal line indicates the probability of matching x_i against dense vectors, with a = n/2. The probability of dense matches stays relatively high and unaffected by dimensionality, indicating that both sparseness and high dimensionality are key to robust matches. In (Ahmad & Hawkins, 2016) we develop additional properties, including the probability of false negatives.

2.3. Matching Sparse Scalar Vectors

Deep networks operate on scalar vectors, and in this section we consider how the above ideas apply to sparse scalar representations. Binary and scalar vectors are similar in that the components containing zero do not affect the dot product, and thus the combinatorics in Eq. 3 are still applicable. Eq. 1 represents the set of scalar vectors where the number of non-zero multiplies in the dot product is exactly b, and Eq. 3 represents the probability that the number of non-zero multiplies is ≥ θ. However, an additional factor is the distribution of scalar values. If components in one vector are extremely large relative to θ, the likelihood of a significant match will be high even with a single shared non-zero component.

We wanted to see if the exponential drop in random matches for binary vectors, demonstrated by Figure 2, can be obtained using scalar vectors, and if so, the conditions under which they hold. Let x_w and x_i represent two sparse vectors such that ||x_w||_0 and ||x_i||_0 count the number of non-zero entries in each. Let each non-zero component be independent and sampled from the distributions P_θw(x_w) and P_θi(x_i). The probability of a significant match is then:

$$P(x_w \cdot x_i \ge \theta) = \frac{\sum_{b=\theta}^{\|x_w\|_0} p_b \, |\Omega_n(x_w, b, \|x_i\|_0)|}{\binom{n}{\|x_i\|_0}} \tag{4}$$

where p_b is the probability that the dot product is ≥ θ given that the overlap is exactly b components:

$$p_b = P\left(x_w \cdot x_i \ge \theta \;\middle|\; \|x_w \odot x_i\|_0 = b\right) \tag{5}$$

There does not appear to be a closed form way to compute p_b for normal or uniform distributions, so we resort to simulations that mimic our network structure.

As before, we generated a large number of random vectors x_w and x_i, and plotted the frequency of random matches. With ||x_w||_0 = k, we focus on simulations where the non-zero entries in x_w are uniform in [−1/k, 1/k], and the non-zero entries in x_i are uniform in S · [0, 2/k]. We focus on this formulation because of the relationship to common network structures and weight initialization. x_w is a putative weight vector and x_i is an input vector to this layer from the previous layer (we assume unit activations are positive, the result of a ReLU-like non-linearity). S controls the scale of x_i relative to x_w.
[Figure 3. Left: The probability of matches to random scalar vectors (with a non-zero components) as a function of dimensionality, for various levels of sparsity. The probability of false matches decreases exponentially with n. Note that the probability for a dense vector, a = n/2, stays relatively high, and does not decrease with dimensionality. Right: The impact of scale on vector matches with a fixed n = 1000. The larger the scaling discrepancy, the higher the probability of a false match.]

Figure 3 (left) shows the behavior with k = 32 and S = 1. We varied the activity of the input vectors, ||x_i||_0 = a, and the dimensionality of the vectors, n. We set θ = E[x_w · x_w]/2.0. The chart demonstrates that under these conditions we can achieve robust behavior similar to that of binary vectors. Figure 3 (right) plots the effect of S on the match probabilities with a fixed n = 1000. As this chart shows, the error increases significantly as S increases. Taken together, these results show that the fundamental robustness properties of binary sparse vectors can also hold for sparse scalar vectors, as long as the overall scaling of the vectors is in a similar range.
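A minimal Monte Carlo sketch of this scalar-matching experiment is shown below (NumPy; k = 32, a = 64, n = 1000, the scale factor S, and the θ = E[x_w · x_w]/2 threshold follow the text, while the function names, the trial count, and the empirical estimate of θ are our assumptions):

```python
import numpy as np

def random_sparse(n, nnz, low, high, rng):
    """Vector of length n with nnz non-zero entries drawn uniformly from [low, high]."""
    v = np.zeros(n)
    idx = rng.choice(n, size=nnz, replace=False)
    v[idx] = rng.uniform(low, high, size=nnz)
    return v

def false_match_rate(n=1000, k=32, a=64, S=1.0, trials=20_000, seed=0):
    """Frequency with which a random sparse input x_i 'matches' a random sparse
    weight vector x_w, mimicking the simulation described in Section 2.3."""
    rng = np.random.default_rng(seed)
    # theta = E[x_w . x_w] / 2, estimated empirically from sample weight vectors.
    sample = [random_sparse(n, k, -1.0 / k, 1.0 / k, rng) for _ in range(1000)]
    theta = np.mean([v @ v for v in sample]) / 2.0
    matches = sum(
        random_sparse(n, k, -1.0 / k, 1.0 / k, rng)
        @ random_sparse(n, a, 0.0, S * 2.0 / k, rng) >= theta
        for _ in range(trials)      # increase trials to resolve the rarer regimes
    )
    return matches / trials

print(false_match_rate())           # S = 1: false matches are rare
print(false_match_rate(S=3.0))      # larger scale mismatch: more false matches
```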
2.4. Non-uniform Distribution of Vectors

Eq. 3 assumes the ideal case where vectors are chosen with a uniform random distribution. With a non-uniform distribution the error rates will be higher, and the more non-uniform the distribution, the worse the error rates. For example, if you mostly end up observing 10 inputs, your error rates will be bounded at around 10%. Thus, to optimize error rates, it is important to be as close to a uniform distribution as possible.

3. Sparse Network Description

Here we discuss a particular sparse network implementation that is designed to exploit Eq. 3. This implementation is an extension of our previous work on the HTM Spatial Pooler, a binary sparse coding algorithm that models sparse code generation in the neocortex (Hawkins et al., 2011; Cui et al., 2017). Specifically, we formulate a version of the Spatial Pooler that is designed to be a drop-in layer for neural networks trained with back-propagation. Our work is also closely related to previous literature on k-winner-take-all networks (Majani et al., 1989) and fixed sparsity networks (Makhzani & Frey, 2015).

Consider a network with L hidden layers. Let y^l denote the vector of outputs from layer l, with y^0 as the input vector. W^l and u^l are the weights and biases for each layer. In a standard neural network the weights W^l are typically dense and initialized using a uniform random distribution. The feed forward outputs are then calculated as follows:

$$\hat{y}^l = W^l \cdot y^{l-1} + u^l$$
$$y^l = f(\hat{y}^l)$$

where f is any activation function, such as tanh(·) or ReLU(·) (Figure 4, left).
[Figure 4. This figure illustrates the differences between a generic dense network layer (left) and a sparse network layer (right). In the sparse layer, the linear layer subsamples from its input layer (implemented via sparse weights, depicted with fewer arrows). In addition, the ReLU layer is replaced by a k-winners layer.]

To implement our sparse networks, we make two modifications to this basic formulation (Figure 4, right). First, we initialize the weights using a sparse random distribution, such that only a fraction of the weights contain non-zero values. Non-zero weights are initialized using standard Kaiming initialization (He et al., 2015b). The rest of the connections are treated as non-existent, i.e. the corresponding weights are zero throughout the life of the network. Second, only the top-k active units within each layer are maintained in y^l, and the rest are set to zero. This k-winners step is non-linear and can be thought of as a substitute for the ReLU function. Instead of a threshold of 0, the threshold here is adaptive and corresponds to the k'th largest activation (Makhzani & Frey, 2013).

The layer can be trained using standard gradient descent. Similar to ReLU, the gradient of the layer is calculated as 1 above the threshold and 0 elsewhere. During inference we increase k by 50%, which led to slightly better accuracies. In all our simulations the last layer of each network is a standard linear output layer with a log-softmax activation function.
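A minimal PyTorch sketch of the first modification, a linear layer with a fixed sparse weight support, is shown below (an illustration under our own naming and masking choices, not the authors' released implementation; see Section 5.1 for the actual source code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseWeightsLinear(nn.Module):
    """Linear layer with a fixed random sparse weight support: connections
    outside the mask are zero for the life of the network."""

    def __init__(self, in_features, out_features, weight_sparsity=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight, nonlinearity="relu")
        # Fixed random support: roughly weight_sparsity of the entries are kept.
        mask = (torch.rand(out_features, in_features) < weight_sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Masked weights receive zero gradient, so pruned connections stay pruned.
        return F.linear(x, self.weight * self.mask, self.bias)
```

Because the mask multiplies the weights inside forward, the zeroed connections never contribute to the output and cannot be revived by gradient updates.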
3.1. Boosting

One practical issue with the above formulation is that it is possible for a small number of units to initially dominate and then, through learning, become active for a large percentage of patterns (this was also noted in (Makhzani & Frey, 2015; Cui et al., 2017)). Having a small number of active units negatively impacts the available representational volume. It is desirable for every unit to be equally active in order to maximize the robustness of the representation in Eq. 3. To address this we employ a boosting term (Hawkins et al., 2011; Cui et al., 2017) which favors units that have not been active recently. We compute a running average of each unit's duty cycle (i.e. how frequently it has been one of the top k units):

$$d_i^l(t) = (1 - \alpha)\, d_i^l(t-1) + \alpha \cdot [\, i \in \text{topIndices}^l \,] \tag{6}$$

A boost coefficient b_i^l is then calculated for each unit based on the target duty cycle and the current average duty cycle:

$$b_i^l(t) = e^{\beta\,(\hat{a}^l - d_i^l(t))} \tag{7}$$

The target duty cycle â^l is a constant reflecting the percentage of units that are expected to be active, i.e. â^l = k / |y^l|. The boost factor, β, is a positive parameter that controls the strength of boosting. β = 0 implies no boosting (b_i^l = 1), and higher numbers lead to larger boost coefficients. In (Hawkins et al., 2011; Cui et al., 2017) we showed that Eq. 7 encourages each unit to have equal activation frequency and effectively maximizes the entropy of the layer.

The boost coefficients are used during the k-winners step to select which units remain active for this input. Through boosting, units which have not been active recently have a disproportionately higher impact and are more likely to win, whereas overly active units are de-emphasized. To determine the output of the layer, the non-boosted activity of each winning unit is kept and the remaining units are set to zero. The duty cycle is then updated. The complete pseudo-code for the k-winners layer is given in Algorithm 1. In our simulations we used β = 1.0 or 1.5 for all sparse simulations.

Algorithm 1: k-winners layer
  1: ŷ^l = W^l · y^{l−1} + u^l
  2: b_i^l(t) = e^{β(â^l − d_i^l(t))}
  3: topIndices^l = topk(b^l ⊙ ŷ^l)
  4: y^l = 0
  5: y^l[topIndices^l] = ŷ^l[topIndices^l]
  6: d_i^l(t) = (1 − α) d_i^l(t−1) + α · [ i ∈ topIndices^l ]
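The sketch below puts the pieces of Algorithm 1 together as a PyTorch module (again an illustrative reconstruction rather than the released code; the α value, the batch-averaged duty cycle update, and keeping the boost active at inference are our assumptions):

```python
import torch
import torch.nn as nn

class KWinners(nn.Module):
    """k-winners activation with boosting, in the spirit of Algorithm 1.
    Keeps the k units with the largest boosted activations per sample and zeroes
    the rest; the kept units retain their non-boosted values."""

    def __init__(self, n_units, k, boost_strength=1.0, duty_cycle_alpha=0.05):
        super().__init__()
        self.k = k
        self.beta = boost_strength
        self.alpha = duty_cycle_alpha
        self.target_duty = k / n_units                       # a_hat = k / |y^l|
        self.register_buffer("duty_cycle", torch.zeros(n_units))

    def forward(self, x):                                    # x: (batch, n_units)
        # At inference k is increased by 50%, as described in the text.
        k = self.k if self.training else min(int(self.k * 1.5), x.shape[1])
        boost = torch.exp(self.beta * (self.target_duty - self.duty_cycle))
        top_idx = (x * boost).topk(k, dim=1).indices
        mask = torch.zeros_like(x).scatter_(1, top_idx, 1.0)
        if self.training:
            with torch.no_grad():
                # Duty cycle update (Eq. 6), averaged over the mini-batch.
                self.duty_cycle.mul_(1 - self.alpha).add_(self.alpha * mask.mean(dim=0))
        # Winners keep their non-boosted values; gradient is 1 for winners, 0 elsewhere.
        return x * mask
```

A sparse hidden layer is then obtained by composing a sparse-weight linear layer (as sketched above) with KWinners in place of ReLU.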
3.2. Sparse Convolutional Layers

We can apply the above algorithm to convolutional networks (CNNs) (LeCun et al., 1989). A canonical CNN layer uses a linear convolutional layer containing a number of filters, followed by a max-pooling (downsampling) layer, followed by ReLU. In order to implement sparse CNN layers, the k-winners layer is applied to the output of the max-pooling layer instead of ReLU (just as in our non-convolutional layers). However, since each filter in a CNN shares weights across the image, duty cycles are accumulated per filter.

In our simulations dense and sparse CNN nets both have a hidden layer (which is dense or sparse, respectively) after the last convolutional layer, followed by a linear plus softmax layer. We used 5×5 filters throughout with a stride of 1. In our tests, the weight sparsity of CNN layers did not impact the results. We suspect this is due to the small size of each kernel, and we did not use sparse weights for the CNN filters in our experiments.
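One plausible way to realize the per-filter bookkeeping (not necessarily the authors' exact scheme) is to select the k winners across all units of each sample and then average the winner mask over batch and spatial positions, giving one duty cycle per filter:

```python
import torch

def kwinners2d(x, k):
    """x: (N, C, H, W) pooled conv outputs. Keep the k largest units per sample."""
    n, c, h, w = x.shape
    flat = x.reshape(n, -1)
    idx = flat.topk(k, dim=1).indices
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(n, c, h, w), mask.reshape(n, c, h, w)

def update_filter_duty(duty, winner_mask, alpha=0.05):
    """duty: (C,) running duty cycle per filter; winner_mask: (N, C, H, W)."""
    per_filter = winner_mask.mean(dim=(0, 2, 3))   # fraction of winning units per filter
    return (1 - alpha) * duty + alpha * per_filter
```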
4. Results

4.1. MNIST

We first trained our networks on MNIST (LeCun et al., 1998). We trained both dense and sparse implementations. Each network consisted of one or two convolutional layers, followed by a hidden layer, followed by a linear + softmax output layer. Sparse nets consisted of sparse convolutional layers followed by a sparse hidden layer.

Networks were trained using standard stochastic gradient descent to minimize cross entropy loss. We used starting learning rates in the range 0.01 − 0.04, and the learning rate was decreased by a factor between 0.5 and 0.9 after each epoch. We also tried batch normalization (Ioffe & Szegedy, 2015) and found it did not help for MNIST (it did help significantly for Google Speech Commands results; see below). For sparse networks, we used a small mini-batch size (around 4) for the first epoch only, in order to let duty cycle calculations update frequently and settle. Hyperparameters such as the learning rate and network size were chosen using a validation set consisting of 10,000 randomly chosen training samples. We then report final results on the test set using networks trained on the full training set.

Results Without Noise: State of the art accuracies on MNIST using convolutional neural networks (without distortions or other training augmentation) are in the range 98.3 − 99% (source: http://yann.lecun.com/exdb/mnist). Table 1 (left column) lists the classification accuracies for the networks in our experiments. Our accuracies are in the same range, for both sparse and dense networks. Table 3 lists the key parameters for each of the listed networks (see also the next section for a more in-depth discussion).

| Network | Test Score | Noise Score |
| Dense CNN-1 | 99.14 ± 0.03 | 74,569 ± 3,200 |
| Dense CNN-2 | 99.31 ± 0.06 | 97,040 ± 2,853 |
| Sparse CNN-1 | 98.41 ± 0.08 | 100,306 ± 1,735 |
| Sparse CNN-2 | 99.09 ± 0.05 | 103,764 ± 1,125 |
| Dense CNN-2 SP3 | 99.13 ± 0.07 | 100,318 ± 2,762 |
| Sparse CNN-2 D3 | 98.89 ± 0.13 | 102,328 ± 1,720 |
| Sparse CNN-2 W1 | 98.2 ± 0.19 | 100,322 ± 2,082 |
| Sparse CNN-2 DSW | 98.92 ± 0.09 | 70,566 ± 2,857 |

Table 1. MNIST results for dense and sparse architectures. We show classification accuracies and total noise scores (the total number of correct classifications for all noise levels). Results are averaged over 10 random seeds, ± one standard deviation. CNN-1 and CNN-2 indicate one or two convolutional layers, respectively.

Results With Noise: In order to test noise robustness we generated MNIST images with varying levels of additive noise. For each test image we randomly set η% of the pixels to a constant value near white (the constant value was two standard deviations over the mean pixel intensity). Figure 5 (A) shows sample images for different noise levels. We generated 11 different noise levels with η ranging between 0 and 0.5 in increments of 0.05. We also computed an overall noise score which counted the total number of correct classifications across all noise levels.
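The noise procedure can be reproduced along the following lines (a sketch: the near-white constant and the η levels come from the text, while the array layout and the function name are assumptions):

```python
import numpy as np

def add_mnist_noise(images, eta, rng=None):
    """Set a random fraction eta of each image's pixels to a constant value
    'near white' (two standard deviations above the mean pixel intensity)."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise_value = images.mean() + 2 * images.std()
    noisy = images.copy()                          # images: float array (N, 28, 28)
    n_pixels = noisy.shape[1] * noisy.shape[2]
    n_noise = int(round(eta * n_pixels))
    for img in noisy:
        idx = rng.choice(n_pixels, size=n_noise, replace=False)
        img.reshape(-1)[idx] = noise_value
    return noisy

noise_levels = np.linspace(0.0, 0.5, 11)           # eta = 0.0, 0.05, ..., 0.5
```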
The right column of Table 1 shows the noise scores for each of the architectures. Networks in the top section of the table (Dense CNN-1 and Dense CNN-2) are composed of standard dense convolutional and hidden layers. Networks in the middle section (Sparse CNN-1 and Sparse CNN-2) are composed of sparse convolutional and sparse hidden layers. Networks in the last section contain a mixture of dense and sparse layers. Overall the architectures with sparse layers performed significantly better on the noise score than the fully dense networks.
Sparse CNN-2, the two layer completely sparse network, had the best noise score. The two fully dense networks performed substantially worse than the others on noise, even though their test accuracies were comparable. Figure 5 plots the accuracy of fully dense and sparse networks at different noise levels. Note that raw test score was not a predictor of noise robustness, suggesting that focusing on pure test set accuracy alone is not sufficient for gauging performance under adverse conditions.
Ablation studies: In order to judge the relative contributions of sparse layers we ran experiments where we replaced various sparse components with their dense counterparts, i.e. dense CNNs with sparse hidden layers, and vice versa. Dense CNN-2 SP3 contained two dense CNN layers followed by the sparse third layer from Sparse CNN-2. Sparse CNN-2 D3 contained the same CNN layers as Sparse CNN-2 followed by the dense third layer from Dense CNN-2. Sparse CNN-2 W1 was identical to Sparse CNN-2 except that the weight sparsity was 1 (i.e. fully dense weights). Sparse CNN-2 DSW contained a third layer with dense outputs, but with a weight sparsity of 0.3.

The results of these networks are shown in the bottom third of Table 1. From a noise robustness perspective, most of the variants (except for Sparse CNN-2 DSW) performed well, better than the best pure dense network. This supports the idea that sparsity in many forms may be helpful with robustness. It is interesting to note that the standard deviation of the noise score in these variants was also higher than that of the pure sparse networks. Overall the results with mixed networks were encouraging, and suggest a clear benefit to introducing sparsity at any level.

Impact of Dropout: The above results did not use dropout (Srivastava et al., 2014), which is generally thought to improve robustness. We found that dropout did occasionally improve the robustness of dense networks, but any improvements were modest and the dropout percentage had to be tuned carefully. For sparse nets dropout consistently reduced accuracies. Even with the optimal dropout percentage, the noise scores of dense networks were significantly lower than those of sparse nets.

4.2. Google Speech Commands Dataset

In order to test sparse nets on a different domain, we applied them to the Google Speech Commands dataset (GSC). This audio dataset was made publicly available in 2017 (Warden, 2017) and consists of 65,000 one-second long utterances of 30 keywords spoken by thousands of individuals. The dataset contains predefined training, validation, and test sets.

Reference convolutional nets using ten of the keyword categories (plus artificial "silence" and "unknown" categories created during training augmentation) achieve accuracies in the range 91 − 92% (Sainath & Parada, 2015; Tang & Lin, 2017). In (Tang & Lin, 2017) they demonstrated improved accuracies in the range of 95 − 96% using residual networks (ResNets (He et al., 2015b;a)).

A Kaggle competition using GSC (also limited to 10 categories) took place between November 2017 and early 2018 (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge). For our simulations we use the preprocessing code provided by one of the top-10 contestants (Tuguldur, 2018), who achieved around 97 − 97.5% accuracies using variants of ResNet and VGG (Simonyan & Zisserman, 2014) architectures. Following this implementation, audio samples in our simulations are converted to 32-band Mel spectrograms before being fed to the network. During training we augment the data by randomly adjusting the amplitude, speed, and pitch of each training sample, and by randomly shifting and stretching samples in the frequency domain. No data augmentation is performed on the validation or test sets.

We trained dense and sparse convolutional networks, with hyperparameters chosen based on the validation set. We were able to achieve reasonable accuracies using two convolutional layers, followed by a hidden layer and then a linear + softmax output layer. Our sparse networks had sparse convolutional layers as well as a sparse hidden layer. Unlike MNIST, we found that batch normalization (Ioffe & Szegedy, 2015) accelerated learning significantly, and we used it for every layer.

Using the above setup we were able to achieve test set accuracies in the range of 96.5 − 97.2% classifying the ten categories corresponding to the digits "zero" through "nine". Table 2 (left column) shows mean accuracy on the test set. Both dense and sparse networks had about the same accuracy. Dropout had a negative effect on the accuracy. Table 3 lists the key parameters in each network.

| Network | Test Score | Noise Score |
| Dense CNN-2 (Dr=0.0) | 96.37 ± 0.37 | 8,730 ± 471 |
| Dense CNN-2 (Dr=0.5) | 95.69 ± 0.48 | 7,681 ± 368 |
| Sparse CNN-2 | 96.65 ± 0.21 | 11,233 ± 1,013 |
| Super-Sparse CNN-2 | 96.57 ± 0.16 | 10,752 ± 942 |

Table 2. Classification on Google Speech Commands for a number of architectures. We show test and noise scores, averaged over 10 random seeds, ± one standard deviation. Dr corresponds to different dropout levels.
[Figure 5. A. Example MNIST images with varying levels of noise. B. Classification accuracy as a function of noise level.]

Results With Noise: As with MNIST, we again created noisy versions of the test set. For each test audio sample A we generated a random white noise sample and blended them together:

$$A = (1 - \eta)\, A + \eta \cdot \text{whiteNoise}$$

We generated 11 different noise levels, with η ranging from 0 to 0.5 in increments of 0.05. Our overall noise score counted the total number of correct classifications across all noise levels.
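In code, the blending amounts to a one-liner per sample (a sketch; the scaling of the white noise relative to the signal is an assumption):

```python
import numpy as np

def blend_white_noise(audio, eta, rng=None):
    """A = (1 - eta) * A + eta * whiteNoise, applied to a 1-D array of samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(audio.shape)
    return (1 - eta) * audio + eta * noise
```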
As can be seen in Table 2, sparse networks performed significantly better than the best dense network. We included a "Super-Sparse CNN-2" with a significantly sparser hidden layer. The hidden layer for this network had 10% weight sparsity, and a lower output sparsity (Table 3). This network had a slightly lower noise score, but its score was still significantly higher than that of the dense networks. Overall these results demonstrate that the robustness of sparse networks seen with MNIST can scale to other domains.

4.3. Computational Considerations

In standard networks, the size of each weight matrix is |W^l| = |y^{l−1}| |y^l| and the order of complexity of the feed-forward operation can be approximated by the number of multiplications, |y^{l−1}| |y^l|. The computational efficiency of sparse systems is closely related to the fraction of non-zeros. In our sparse hidden layers, both activations and weight values are sparse, and the number of non-zero product terms in the forward computation is proportional to k^{l−1} w^l |y^{l−1}| |y^l|, where 0 < w^l ≤ 1 is the fraction of non-zero weights. In our convolutional layers, only activation values are sparse and the number of non-zero product terms in the forward computation is proportional to k^{l−1} · K^l · K^l · |y^l|, where K^l is the kernel width of each filter.

As an example, the number of non-zero multiplies between the first two convolutional layers in the GSC Sparse CNN-2 network is 12,544 ∗ 1600 ∗ 6400 = 1.23 × 10^10, about 10.5× smaller than the corresponding dense network. The number of non-zero multiplies between the second convolutional layer and the hidden layer in the same network is 200 ∗ 640,000 ∗ 1000 = 1.28 × 10^11, about 20× smaller than the dense network. For Super-Sparse CNN-2, that ratio is 35× as compared to the dense version.

As can be seen, the number of non-zero products is significantly smaller in the sparse net implementations. Unfortunately we found that current versions of deep learning frameworks, including PyTorch and TensorFlow, do not have adequate support for sparse matrices to exploit these properties, and our implementations ran at the same speed as the corresponding dense networks. We suspect this is due to the fact that highly sparse networks are not sufficiently popular in practice. We hope that studies such as this one will encourage highly optimized sparse implementations. (Note that such optimizations may be non-trivial as the set of k-winners changes on every step.) When this becomes feasible our numbers suggest there is a strong possibility for large performance gains and/or improvements in power usage. It is also worth noting that this reduction in computational complexity does not come at a cost. Rather, our experiments showed that sparse representations can lead to improved accuracies under noisy conditions.
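To make the bookkeeping concrete, the sketch below evaluates the hidden-layer expression for a hypothetical layer, interpreting k^{l-1} as the fraction of active units in the previous layer (as in the k/n values of Table 3); the sizes and sparsities used here are illustrative only:

```python
def hidden_layer_multiplies(n_prev, n_units, activation_fraction, weight_fraction):
    """Approximate non-zero products for one sparse hidden layer versus its dense
    counterpart, following the proportionality k^(l-1) * w^l * |y^(l-1)| * |y^l|."""
    dense = n_prev * n_units
    sparse = activation_fraction * weight_fraction * n_prev * n_units
    return sparse, dense

sparse, dense = hidden_layer_multiplies(
    n_prev=1600, n_units=1000, activation_fraction=0.1, weight_fraction=0.3)
print(f"sparse ~ {sparse:.2e}, dense = {dense:.2e}, ratio ~ {dense / sparse:.1f}x")
```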
| Network | L1 F | L1 Sparsity | L2 F | L2 Sparsity | L3 N | L3 Sparsity | Wt Sparsity |
| MNIST | | | | | | | |
| Dense CNN-1 | 30 | 100% | – | – | 1000 | 100% | 100% |
| Dense CNN-2 | 30 | 100% | 30 | 100% | 1000 | 100% | 100% |
| Sparse CNN-1 | 30 | 9.3% | – | – | 150 | 33.3% | 30% |
| Sparse CNN-2 | 32 | 8.7% | 64 | 29.3% | 700 | 14.3% | 30% |
| Dense CNN-2 SP3 | 30 | 100% | 30 | 100% | 700 | 14.3% | 30% |
| Sparse CNN-2 D3 | 32 | 8.7% | 64 | 29.3% | 1000 | 100% | 100% |
| Sparse CNN-2 W1 | 32 | 8.7% | 64 | 29.3% | 700 | 14.3% | 100% |
| Sparse CNN-2 DSW | 32 | 8.7% | 64 | 29.3% | 1000 | 100% | 30% |
| GSC | | | | | | | |
| Dense CNN-2 | 64 | 100% | 64 | 100% | 1000 | 100% | 100% |
| Sparse CNN-2 | 64 | 9.5% | 64 | 12.5% | 1000 | 10% | 40% |
| Super-Sparse CNN-2 | 64 | 9.5% | 64 | 12.5% | 1500 | 6.7% | 10% |

Table 3. Key parameters for each network. L1F and L2F denote the number of filters at the corresponding CNN layer. L1, L2, and L3 sparsity indicate k/n, the percentage of outputs that were enforced to be non-zero. 100% indicates a special case where we defaulted to traditional ReLU activations. Wt sparsity indicates the percentage of weights that were non-zero. All parameters are available in the source code.

5. Discussion
In this paper we illustrated benefits of sparse representations. We developed intuitions and theory for the structure of vector matching in the context of binary sparse representations. We then constructed efficient neural network formulations of sparse networks that place internal representations in the sweet spot suggested by the theory. In particular we aim to match sparse activations with sparse weights in relatively high dimensional settings. A boosting rule was used to increase the overall entropy of the internal layers in order to maximize the utilization of the representational space. We showed that this formulation increases the overall robustness of the system to noisy inputs using MNIST and the Google Speech Command Dataset. Both dense and sparse networks showed high accuracies, but the sparse nets were significantly more robust. These results suggest that it is important to look beyond pure test set performance, as test accuracy by itself is not a reliable indicator of overall robustness.

Our work extends the existing literature on sparsity and pruning. A very recent theoretical paper showed that simple linear sparse networks may be more robust to adversarial attacks (Guo et al., 2018). A number of papers have shown that it is possible to effectively introduce sparsity through pruning and retraining (Han et al., 2015; Frankle & Carbin, 2018; Lee et al., 2018). The mechanisms introduced here can be seen as complementary to those techniques. Our network enforces sparse weights from the beginning by construction, and sparse weights are learned as part of the training process. In addition, we reduce the overall computational complexity by enforcing sparse activations, which in turn significantly reduces the number of overall non-zero products. This should produce significant power savings for optimized hardware implementations.

We demonstrated increased robustness in our networks, whereas the papers on pruning typically do not explicitly test robustness. It is possible that such networks are also more robust, though this remains to be tested. Pruning techniques in general are quite orthogonal to ours, and it may be feasible to combine them with the mechanisms discussed here.

In our work we did not attempt to introduce sparsity into the convolutional filters themselves. (Li et al., 2016) have shown it is sometimes possible to remove entire filters from large CNNs, suggesting that sparsifying filter weights may also be possible, particularly in networks with larger filters. Introducing sparse convolutions within the context of the techniques in this paper is an area of future exploration. The techniques described here are straightforward to implement and can be extended to other architectures including RNNs. This is yet another promising area for future research.

5.1. Software

All code and experiments are available at https://github.com/numenta/htmpapers as open source.

Acknowledgements

We thank Jeff Hawkins, Ali Rahimi, and John Berkowitz for helpful discussions and comments.

References

Ahmad, S., & Hawkins, J. (2016). How do neurons operate on sparse distributed representations? A mathematical theory of sparsity, neurons and active dendrites. arXiv:1601.00720 [q-bio.NC]. URL https://arxiv.org/abs/1601.00720
Chen, Y., Paiton, D., & Olshausen, B. (2018). The Sparse Manifold Transform. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31, (pp. 10533–10544). Curran Associates, Inc.

Cui, Y., Ahmad, S., & Hawkins, J. (2017). The HTM Spatial Pooler: a neocortical algorithm for online sparse distributed coding. Frontiers in Computational Neuroscience, 11, 111. URL https://www.frontiersin.org/articles/10.3389/fncom.2017.00111/abstract

Frankle, J., & Carbin, M. (2018). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. URL http://arxiv.org/abs/1803.03635

Guo, Y., Zhang, C., Zhang, C., & Chen, Y. (2018). Sparse DNNs with Improved Adversarial Robustness. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31, (pp. 240–249). Curran Associates, Inc.

Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both Weights and Connections for Efficient Neural Network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28, (pp. 1135–1143). Curran Associates, Inc.

Hawkins, J., Ahmad, S., & Dubinsky, D. (2011). Cortical Learning Algorithm and Hierarchical Temporal Memory. URL http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf

He, K., Zhang, X., Ren, S., & Sun, J. (2015a). Deep Residual Learning for Image Recognition. URL http://arxiv.org/abs/1512.03385

He, K., Zhang, X., Ren, S., & Sun, J. (2015b). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. URL http://arxiv.org/abs/1502.01852

Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. URL http://arxiv.org/abs/1502.03167

Kanerva, P. (1988). Sparse Distributed Memory. Cambridge, MA: The MIT Press.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

Lee, H., Ekanadham, C., & Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems.

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), (pp. 1–8).

Lee, N., Ajanthan, T., & Torr, P. H. S. (2018). SNIP: Single-shot Network Pruning based on Connection Sensitivity. URL http://arxiv.org/abs/1810.02340

Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning Filters for Efficient ConvNets. URL http://arxiv.org/abs/1608.08710

Majani, E., Erlanson, R., & Abu-Mostafa, Y. S. (1989). On the k-winners-take-all network. In Advances in Neural Information Processing Systems, (pp. 634–642).

Makhzani, A., & Frey, B. (2013). k-Sparse Autoencoders. URL http://arxiv.org/abs/1312.5663

Makhzani, A., & Frey, B. (2015). Winner-take-all autoencoders. Advances in Neural Information Processing Systems. URL http://papers.nips.cc/paper/5783-winner-take-all-autoencoders

Molchanov, D., Ashukha, A., & Vetrov, D. (2017). Variational Dropout Sparsifies Deep Neural Networks. URL http://arxiv.org/abs/1701.05369

Nair, V., & Hinton, G. E. (2009). 3D Object Recognition with Deep Belief Nets. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, (pp. 1339–1347). Curran Associates, Inc.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

Rawlinson, D., Ahmed, A., & Kowadlo, G. (2018). Sparse Unsupervised Capsules Generalize Better. URL http://arxiv.org/abs/1804.06094

Rosenfeld, A., Zemel, R., & Tsotsos, J. K. (2018). The Elephant in the Room. URL http://arxiv.org/abs/1808.03305

Sainath, T. N., & Parada, C. (2015). Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association.
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. URL http://arxiv.org/abs/1409.1556

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929–1958. URL http://jmlr.org/papers/v15/srivastava14a.html

Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., & Schmidhuber, J. (2013). Compete to Compute. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, (pp. 2310–2318). Curran Associates, Inc. URL http://papers.nips.cc/paper/5059-compete-to-compute.pdf

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. URL http://arxiv.org/abs/1312.6199

Tang, R., & Lin, J. (2017). Deep Residual Learning for Small-Footprint Keyword Spotting. URL https://arxiv.org/abs/1710.10361

Tuguldur, E.-O. (2018). pytorch-speech-commands. URL https://github.com/tugstugi/pytorch-speech-commands

Warden, P. (2017). Speech Commands: A public dataset for single-word speech recognition. Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
