(ii) We explore how different sparsity levels (k) impact representations and classification performance. (iii) We show that by solely relying on sparsity as the regularizer and as the only nonlinearity, we can achieve much better results than the other methods, including RBMs, denoising autoencoders (Vincent et al., 2008) and dropout (Hinton et al., 2012). (iv) We demonstrate that k-sparse autoencoders are suitable for pre-training and achieve results comparable to the state-of-the-art on the MNIST and NORB datasets.

In this paper, Γ is an estimated support set and Γᶜ is its complement. W† is the pseudo-inverse of W, and supp_k(x) is an operator that returns the indices of the k largest coefficients of its input vector. z_Γ is the vector obtained by restricting the elements of z to the indices of Γ, and W_Γ is the matrix obtained by restricting the columns of W to the indices of Γ.
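To make the notation concrete, the support operator and the restriction of z and W to a support set can be written in a few lines of NumPy. This is an illustrative sketch only; the array sizes and variable names are our own choices, not the paper's:

    import numpy as np

    def supp_k(v, k):
        # supp_k(v): indices of the k largest coefficients of v.
        return np.argpartition(v, -k)[-k:]

    rng = np.random.default_rng(0)
    W = rng.standard_normal((784, 1000))   # columns of W are the dictionary atoms
    z = rng.standard_normal(1000)

    Gamma = supp_k(z, 40)                  # estimated support set
    z_Gamma = z[Gamma]                     # z restricted to the indices in Gamma
    W_Gamma = W[:, Gamma]                  # W restricted to the columns in Gamma
    W_pinv = np.linalg.pinv(W_Gamma)       # pseudo-inverse of W_Gamma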
2. Description of the Algorithm

2.1. The Basic Autoencoder

A shallow autoencoder maps an input vector x to a hidden representation using the function z = f(Px + b), parameterized by {P, b}. f is the activation function, e.g., linear, sigmoidal or ReLU. The hidden representation is then mapped linearly to the output using x̂ = Wz + b′. The parameters are optimized to minimize the mean square error ∥x̂ − x∥₂² over all training points. Often, tied weights are used, so that P = W⊺.
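For reference, a minimal NumPy sketch of this shallow autoencoder with tied weights could look as follows. The sizes, the identity activation and the random data are our own illustrative choices:

    import numpy as np

    def basic_autoencoder_forward(x, W, b, b_prime, f=lambda a: a):
        # Encoder z = f(P x + b), with tied weights P = W^T.
        z = f(W.T @ x + b)
        # Linear decoder x_hat = W z + b'.
        x_hat = W @ z + b_prime
        # Squared reconstruction error for this training point.
        return z, x_hat, float(np.sum((x_hat - x) ** 2))

    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((784, 1000))   # 784 inputs, 1000 hidden units
    b, b_prime = np.zeros(1000), np.zeros(784)
    x = rng.standard_normal(784)
    z, x_hat, err = basic_autoencoder_forward(x, W, b, b_prime)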
2.2. The k-Sparse Autoencoder

The sparse encoding stage used for classification does not exactly match the encoding used for dictionary training (Coates & Ng, 2011). For example, while in k-means it is natural to have a hard assignment of the points to the nearest cluster in the encoding stage, it has been shown in (Van Gemert et al., 2008) that soft assignments result in better classification performance. Similarly, for the k-sparse autoencoder, instead of using the k largest elements of W⊺x + b as the features, we have observed that slightly better performance is obtained by using the αk largest hidden units, where α ≥ 1 is selected using validation data. So at test time, we use the support set defined by supp_{αk}(W⊺x + b). The algorithm is summarized as follows.

k-Sparse Autoencoders:

Training:
1) Perform the feedforward phase and compute
      z = W⊺x + b
2) Find the k largest activations of z and set the rest to zero:
      z_{Γᶜ} = 0, where Γ = supp_k(z)
3) Compute the output and the error using the sparsified z:
      x̂ = Wz + b′
      E = ∥x − x̂∥₂²
4) Back-propagate the error through the k largest activations defined by Γ, and iterate.

Sparse Encoding:
Compute the features h = W⊺x + b. Find the αk largest activations of h and set the rest to zero:
      h_{Γᶜ} = 0, where Γ = supp_{αk}(h)
The Iterative Thresholding with Inversion (ITI) algorithm consists of the following steps.

1. Support Estimation Step:

      Γ = supp_k(z^n + W⊺(x − Wz^n))   (2)

2. Inversion Step:

      z^{n+1}_Γ = W_Γ† x = (W_Γ⊺ W_Γ)⁻¹ W_Γ⊺ x
      z^{n+1}_{Γᶜ} = 0   (3)

Assume H = W⊺W − I and z⁰ is the true sparse solution. The first step of ITI estimates the support set as Γ = supp_k(W⊺x) = supp_k(z⁰ + Hz⁰). If W were orthogonal, we would have Hz⁰ = 0 and the algorithm would succeed in the first iteration. But if W is overcomplete, Hz⁰ behaves as a noise vector whose variance decreases after each iteration. After estimating the support set of z as Γ, we restrict W to the indices included in Γ and form W_Γ. We then use the pseudo-inverse of W_Γ to estimate the non-zero values by minimizing ∥x − W_Γ z_Γ∥₂². Lastly, we refine the support estimate and repeat the whole process until convergence.
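A direct NumPy transcription of these two steps might look as follows. This is our own sketch; the problem sizes, the fixed iteration count and the use of np.linalg.lstsq for the inversion are assumptions:

    import numpy as np

    def iti(x, W, k, n_iter=10):
        # Iterative Thresholding with Inversion: alternate the support
        # estimation step (Equation 2) and the inversion step (Equation 3).
        z = np.zeros(W.shape[1])
        for _ in range(n_iter):
            # Support estimation on the current residual.
            Gamma = np.argpartition(z + W.T @ (x - W @ z), -k)[-k:]
            # Inversion: least squares on the columns indexed by Gamma,
            # i.e. z_Gamma = pinv(W_Gamma) x, with zeros elsewhere.
            W_Gamma = W[:, Gamma]
            z = np.zeros(W.shape[1])
            z[Gamma] = np.linalg.lstsq(W_Gamma, x, rcond=None)[0]
        return z

    # Toy example: recover a 10-sparse code from an overcomplete dictionary.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((100, 400))
    W /= np.linalg.norm(W, axis=0)          # unit-norm atoms
    z_true = np.zeros(400)
    support = rng.choice(400, size=10, replace=False)
    z_true[support] = rng.standard_normal(10)
    x = W @ z_true
    z_hat = iti(x, W, k=10)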
3.2. Sparse Coding with the k-Sparse Autoencoder

Here, we show that we can derive the k-sparse autoencoder training algorithm by approximating a sparse coding algorithm that uses the ITI algorithm jointly with a dictionary update stage.

The conventional approach of sparse coding is to fix the sparse code matrix Z while updating the dictionary. Here, however, after estimating the support set in the first step of the ITI algorithm, we jointly perform the inversion step of ITI and the dictionary update step, while fixing just the support set of the sparse code Z. In other words, we update the atoms of the dictionary and allow the corresponding non-zero values to change at the same time, so as to minimize ∥X − W_Γ Z_Γ∥₂² over both W_Γ and Z_Γ.

When we perform sparse recovery with the ITI algorithm using a fixed dictionary, we should perform a fixed number of iterations to obtain a perfect reconstruction of the signal. But in sparse coding, since we have learnt a dictionary that is adapted to the signals, as shown in Section 3.3, we can find the support set with just the first iteration of ITI:

      Γ_z = supp_k(W⊺x)   (4)

In the inversion step of the ITI algorithm, once we estimate the support set, we use the pseudo-inverse of W_Γ to find the non-zero values on the support set. The pseudo-inverse of the matrix W_Γ is a matrix P_Γ that minimizes the following cost function:

      W_Γ† = argmin_{P_Γ} ∥x − W_Γ z_Γ∥₂² = argmin_{P_Γ} ∥x − W_Γ P_Γ x∥₂²   (5)

Finding the exact pseudo-inverse of W_Γ is computationally expensive, so instead we perform a single step of gradient descent. The gradient with respect to P_Γ is found as follows:

      ∂∥x − W_Γ z_Γ∥₂² / ∂P_Γ = (∂∥x − W_Γ z_Γ∥₂² / ∂z_Γ) x⊺   (6)

The first term on the right-hand side of Equation (6) is the dictionary update stage, which is computed as follows:

      ∂∥x − W_Γ z_Γ∥₂² / ∂z_Γ = (W_Γ z_Γ − x) z_Γ⊺   (7)

Therefore, in order to approximate the pseudo-inverse, we first find the dictionary derivative and then "back-propagate" it to find the update of the pseudo-inverse.

We can view these operations in the context of an autoencoder with linear activations, where P is the encoder weight matrix and W is the decoder weight matrix. At each iteration, instead of back-propagating through all the hidden units, we back-propagate only through the units with the k largest activities, defined by supp_k(W⊺x), which is the first iteration of ITI. Keeping the k largest hidden activities and ignoring the others is the same as forming W_Γ by restricting W to the estimated support set. Back-propagation on the decoder weights is the same as gradient descent on the dictionary, and back-propagation on the encoder weights is the same as approximating the pseudo-inverse of the corresponding W_Γ.

We can perform support estimation in the feedforward phase by assuming P = W⊺ (i.e., the autoencoder has tied weights). In this case, support estimation can be done by computing z = W⊺x + b and picking the k largest activations; the bias just accounts for the mean and subtracts its contribution. Then the "inversion" and "dictionary update" steps are done at the same time by back-propagation through just the units with the k largest activities.

In summary, we can view k-sparse autoencoders as an approximation of a sparse coding algorithm that uses ITI in the sparse recovery stage.
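As a rough illustration of this section's argument, the sketch below performs the support estimation of Equation (4) and then takes a single gradient step on both the dictionary atoms W_Γ and the corresponding rows of the encoder P_Γ. It is our own reading of Equations (5)-(7); the step size, shapes and update order are assumptions:

    import numpy as np

    def joint_update(x, W, P, k, lr=0.01):
        # Support estimation: first (and only) ITI iteration, Equation (4).
        Gamma = np.argpartition(W.T @ x, -k)[-k:]
        # Current code restricted to the support, z_Gamma = P_Gamma x.
        z_Gamma = (P @ x)[Gamma]
        residual = W[:, Gamma] @ z_Gamma - x      # W_Gamma z_Gamma - x
        backprop = W[:, Gamma].T @ residual       # error signal through the decoder
        # Dictionary update: one gradient step on the atoms in the support.
        W[:, Gamma] -= lr * np.outer(residual, z_Gamma)
        # Encoder update: one gradient step of Equation (5), i.e. a single
        # step towards the pseudo-inverse of W_Gamma.
        P[Gamma, :] -= lr * np.outer(backprop, x)
        return W, P

    rng = np.random.default_rng(0)
    W = rng.standard_normal((100, 400))
    P = W.T.copy()                                # tied initialization
    x = rng.standard_normal(100)
    W, P = joint_update(x, W, P, k=10)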
…images of CIFAR-10. Each patch is then locally contrast-normalized and ZCA whitened. This preprocessing pipeline is the same as the one used in (Coates et al., 2011) for feature extraction.

4.2. Training of k-Sparse Autoencoders

4.2.1. Scheduling of the Sparsity Level

When we are enforcing low sparsity levels in k-sparse autoencoders (e.g., k = 15 on MNIST), one issue that might arise is that in the first few epochs, the algorithm greedily assigns individual hidden units to groups of training cases, in a manner similar to k-means clustering. In subsequent epochs, these hidden units will be picked and reinforced, while other hidden units will not be adjusted. That is, too much sparsity can prevent gradient back-propagation from adjusting the weights of these other 'dead' hidden units. We can address this problem by scheduling the sparsity level over epochs as follows.

Suppose we are aiming for a sparsity level of k = 15. Then, we start off with a large sparsity level (e.g., k = 100) for which the k-sparse autoencoder can train all the hidden units. We then linearly decrease the sparsity level from k = 100 to k = 15 over the first half of the epochs. This initializes the autoencoder in a good regime, in which all of the hidden units have a significant chance of being picked. Then, we keep k = 15 for the second half of the epochs. With this scheduling, we can train all of the filters, even for low sparsity levels.
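A linear schedule of this kind takes only a few lines. The sketch below assumes the k = 100 → 15 example from the text and an arbitrary epoch budget:

    def sparsity_schedule(epoch, n_epochs, k_start=100, k_target=15):
        # Linearly decrease k from k_start to k_target over the first half
        # of training, then hold it at k_target for the second half.
        half = n_epochs // 2
        if epoch >= half:
            return k_target
        return int(round(k_start + (epoch / half) * (k_target - k_start)))

    # Example over a hypothetical 200-epoch run.
    ks = [sparsity_schedule(e, 200) for e in range(200)]
    assert ks[0] == 100 and ks[-1] == 15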
4.2.2. Training Hyper-parameters

We optimized the model parameters using stochastic gradient descent with momentum as follows:

      v_{k+1} = m_k v_k − η_k ∇f(x_k)
      x_{k+1} = x_k + v_{k+1}   (11)

Here, v_k is the velocity vector, m_k is the momentum and η_k is the learning rate at the k-th iteration. We also use a Gaussian distribution with a standard deviation of σ for the initialization of the weights. We use different momentum values, learning rates and initializations based on the task and the dataset, and validation is used to select hyperparameters. In the unsupervised MNIST task, the values were σ = 0.01, m_k = 0.9 and η_k = 0.01, for 5000 epochs. In the supervised MNIST task, training started with m_k = 0.25 and η_k = 1, and then the learning rate was linearly decreased to 0.001 over 200 epochs. In the unsupervised NORB task, the values were σ = 0.01, m_k = 0.9 and η_k = 0.0001, for 5000 epochs. In the supervised NORB task, training started with m_k = 0.9 and η_k = 0.01, and then the learning rate was linearly decreased to 0.001 over 200 epochs.
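Equation (11) is the classical momentum update. A minimal sketch, with a toy quadratic objective standing in for the autoencoder loss, is:

    import numpy as np

    def sgd_momentum_step(x, v, grad, m=0.9, lr=0.01):
        # v_{k+1} = m_k v_k - eta_k grad f(x_k);  x_{k+1} = x_k + v_{k+1}
        v_next = m * v - lr * grad
        return x + v_next, v_next

    # Toy example: minimize f(x) = 0.5 ||x||^2, whose gradient is x.
    x = np.array([5.0, -3.0])
    v = np.zeros_like(x)
    for _ in range(100):
        x, v = sgd_momentum_step(x, v, grad=x)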
4.2.3. Implementations

While most of the conventional sparse coding algorithms require complex matrix operations such as matrix inversion or SVD decomposition, k-sparse autoencoders only need matrix multiplications and sorting operations, in both the dictionary learning stage and the sparse encoding stage. (For a parallel, distributed implementation, the sorting operation can be replaced by a method that recursively applies a threshold until k values remain.) We used an efficient GPU implementation obtained using the publicly available gnumpy library (Tieleman, 2010) on a single Nvidia GTX 680 GPU.
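Selecting the k largest activations does not require a full sort. The sketch below shows two sort-free options: NumPy's partial sort (np.argpartition) and a simple bisection on a threshold in the spirit of the recursive-thresholding scheme mentioned above. The iteration count and the bisection scheme itself are our own choices:

    import numpy as np

    def topk_by_partition(z, k):
        # Partial sort: indices of the k largest activations of z.
        return np.argpartition(z, -k)[-k:]

    def topk_by_threshold(z, k, iters=100):
        # Bisect a threshold until (approximately) k activations exceed it;
        # a sort-free alternative that suits parallel hardware.
        lo, hi = float(z.min()), float(z.max())
        t = 0.5 * (lo + hi)
        for _ in range(iters):
            n_above = int(np.count_nonzero(z > t))
            if n_above == k:
                break
            if n_above > k:
                lo = t
            else:
                hi = t
            t = 0.5 * (lo + hi)
        return np.flatnonzero(z > t)

    z = np.random.default_rng(0).standard_normal(1000)
    assert len(topk_by_partition(z, 25)) == 25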
4.3. Effect of Sparsity Level

In k-sparse autoencoders, we are able to tune the value of k to obtain the desired sparsity level, which makes the algorithm suitable for a wide variety of datasets. For example, one application could be pre-training a shallow or deep discriminative neural network. For large values of k (e.g., k = 100 on MNIST), the algorithm tends to learn very local features, as shown in Figures 1a and 2a. These features are too primitive to be used for classification using a shallow architecture, since a naive linear classifier does not have enough capacity to combine these features and achieve a good classification rate. However, these features could be used for pre-training deep neural nets.

As we decrease the sparsity level (e.g., k = 40 on MNIST), the output is reconstructed using a smaller number of hidden units, and thus the features tend to be more global, as can be seen in Figures 1b, 1c and 2b. For example, in the MNIST dataset, the lengths of the strokes increase when the sparsity level is decreased. These less local features are suitable for classification using a shallow architecture. Nevertheless, forcing too much sparsity (e.g., k = 10 on MNIST) results in features that are too global and do not factor the input into parts, as depicted in Figures 1d and 2c.

Fig. 3 shows the visualization of filters of the k-sparse autoencoder with 1000 hidden units and a sparsity level of k = 50, learnt from random image patches extracted from the CIFAR-10 dataset. We can see that the k-sparse autoencoder has learnt localized Gabor filters from natural image patches.

Fig. 4 plots histograms of the hidden unit activities for various unsupervised learning algorithms, including…

Figure 1. Filters of the k-sparse autoencoder for different sparsity levels k, learnt from MNIST with 1000 hidden units: (a) k = 70, (b) k = 40, (c) k = 25, (d) k = 10.

Figure 2. Filters of the k-sparse autoencoder for different sparsity levels k, learnt from NORB with 4000 hidden units: (a) k = 200, (b) k = 150, (c) k = 50.
Table 1. Performance of unsupervised learning methods (without fine-tuning) with 1000 hidden units on MNIST.

    Method                                                    Error Rate
    Raw Pixels                                                7.20%
    RBM                                                       1.81%
    Dropout Autoencoder (50% hidden)                          1.80%
    Denoising Autoencoder (20% input dropout)                 1.95%
    Dropout + Denoising Autoencoder (20% input, 50% hidden)   1.60%
    k-Sparse Autoencoder, k = 40                              1.54%
    k-Sparse Autoencoder, k = 25                              1.35%
    k-Sparse Autoencoder, k = 10                              2.10%

Table 2. Performance of unsupervised learning methods (without fine-tuning) with 4000 hidden units on NORB.

    Method                                        Error Rate
    Raw Pixels                                    23%
    RBM (weight decay)                            10.6%
    Dropout Autoencoder                           10.1%
    Denoising Autoencoder (20% input dropout)     9.5%
    k-Sparse Autoencoder, k = 200                 10.4%
    k-Sparse Autoencoder, k = 150                 8.6%
    k-Sparse Autoencoder, k = 75                  9.5%
Table 3. Performance of supervised learning methods on MNIST. Pre-training was performed using the corresponding unsupervised learning algorithm with 1000 hidden units, and then the model was fine-tuned.

    Method                                                         Error
    Without Pre-Training                                           1.60%
    RBM + F.T.                                                     1.24%
    Shallow Dropout AE + F.T. (50% hidden)                         1.05%
    Denoising AE + F.T. (20% input dropout)                        1.20%
    Deep Dropout AE + F.T. (layer-wise pre-training, 50% hidden)   0.85%
    k-Sparse AE + F.T. (k = 25)                                    1.08%
    Deep k-Sparse AE + F.T. (layer-wise pre-training)              0.97%

Table 4. Performance of supervised learning methods on NORB. Pre-training was performed using the corresponding unsupervised learning algorithm with 4000 hidden units, and then the model was fine-tuned.

    Method                                                         Error
    Without Pre-Training                                           12.7%
    DBN                                                            8.3%
    DBM                                                            7.2%
    Third-order RBM                                                6.5%
    Shallow Dropout AE + F.T. (50% hidden)                         8.2%
    Shallow Denoising AE + F.T. (20% input dropout)                7.9%
    Deep Dropout AE + F.T. (layer-wise pre-training, 50% hidden)   7.0%
    Shallow k-Sparse AE + F.T. (k = 150)                           7.8%
    Deep k-Sparse AE + F.T. (k = 150, layer-wise pre-training)     7.4%
…to adjust the weights of the last hidden layer and also to fine-tune the weights in the previous layers. This procedure is often referred to as discriminative fine-tuning. In this section, we report results using unsupervised learning algorithms such as RBMs, DBNs (Salakhutdinov & Larochelle, 2010), DBMs (Salakhutdinov & Larochelle, 2010), third-order RBMs (Nair & Hinton, 2009), dropout autoencoders, denoising autoencoders and k-sparse autoencoders to initialize a shallow discriminative neural network for the MNIST and NORB datasets. We used back-propagation to fine-tune the weights. The regularization method used in the fine-tuning stage of each algorithm is the same as the one used in the training of the corresponding unsupervised learning task. For instance, we fine-tuned the weights obtained from the dropout autoencoder with dropout regularization, and for the denoising autoencoder, we fine-tuned the discriminative neural net by adding noise to the input. In a similar manner, in the fine-tuning stage of the k-sparse autoencoder, we used the αk largest hidden units in the corresponding discriminative neural network, as explained in Section 2.2. Tables 3 and 4 report the error rates obtained by the different methods.

4.6. Deep Supervised Learning Results

The k-sparse autoencoder can be used as a building block of a deep neural network, using greedy layer-wise pre-training (Bengio et al., 2007). We first train a shallow k-sparse autoencoder and obtain the hidden codes. We then fix the features and train another k-sparse autoencoder on top of them to obtain another set of hidden codes. Then we use the parameters of these autoencoders to initialize a discriminative neural network with two hidden layers.

In the fine-tuning stage of the deep neural net, we first fix the parameters of the first and second layers and train a softmax classifier on top of the second layer. We then hold the weights of the first layer fixed, and train the second layer and the softmax jointly, using the initialization of the softmax that we found in the previous step. Finally, we jointly fine-tune all of the layers with the previous initialization. We have observed that this method of layer-wise fine-tuning can improve the classification performance compared to the case where we fine-tune all the layers at the same time.

In all of the fine-tuning steps, we keep the αk largest hidden codes, where k = 25, α = 3 on MNIST and k = 150, α = 2 on NORB, in both hidden layers. Tables 3 and 4 report the classification results of the different deep supervised learning methods.
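The staged procedure described above (softmax first, then the upper layers, then everything jointly) can be expressed as a simple schedule of trainable parameter groups. The sketch below only illustrates this bookkeeping, with hypothetical layer names and epoch counts:

    # Greedy layer-wise pre-training followed by stage-wise fine-tuning.
    # Each stage lists the parameter groups that are updated; everything else
    # is held fixed. Layer names and epoch counts are illustrative only.
    FINE_TUNING_STAGES = [
        {"trainable": ["softmax"],                     "epochs": 20},
        {"trainable": ["layer2", "softmax"],           "epochs": 50},
        {"trainable": ["layer1", "layer2", "softmax"], "epochs": 200},
    ]

    def run_fine_tuning(train_stage, stages=FINE_TUNING_STAGES):
        # `train_stage` is any callable that updates the listed parameter
        # groups for the given number of epochs (e.g. SGD with momentum,
        # keeping the alpha*k largest hidden codes in each hidden layer).
        for stage in stages:
            train_stage(stage["trainable"], stage["epochs"])

    run_fine_tuning(lambda groups, epochs: print("training", groups, "for", epochs, "epochs"))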
5. Conclusion

In this work, we proposed a very fast sparse coding method called the k-sparse autoencoder, which achieves exact sparsity in the hidden representation. The main message of this paper is that we can use the resulting representations to achieve state-of-the-art classification results, solely by enforcing sparsity in the hidden units and without using any other nonlinearity or regularization. We also discussed how the k-sparse autoencoder could be used for pre-training shallow and deep neural networks.
Engan, Kjersti, Aase, Sven Ole, and Hakon Husoy, J. Method of optimal directions for frame design. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, volume 5, pp. 2443–2446. IEEE, 1999.

Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, Manzagol, Pierre-Antoine, Vincent, Pascal, and Bengio, Samy. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010.

Gregor, Karol and LeCun, Yann. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 399–406, 2010.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Tieleman, Tijmen. Gnumpy: an easy way to use GPU boards in Python. Department of Computer Science, University of Toronto, 2010.

Tropp, Joel A and Gilbert, Anna C. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655–4666, 2007.

Van Gemert, Jan C, Geusebroek, Jan-Mark, Veenman, Cor J, and Smeulders, Arnold WM. Kernel codebooks for scene categorization. In Computer Vision–ECCV 2008, pp. 696–709. Springer, 2008.

Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM, 2008.