
Regularization of Neural Networks using DropConnect

Li Wan [email protected]
Matthew Zeiler [email protected]
Sixin Zhang [email protected]
Yann LeCun [email protected]
Rob Fergus [email protected]
Dept. of Computer Science, Courant Institute of Mathematical Sciences, New York University

Abstract

We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations is set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. We then evaluate DropConnect on a range of datasets, comparing to Dropout, and show state-of-the-art results on several image recognition benchmarks by aggregating multiple DropConnect-trained models.

1. Introduction

Neural network (NN) models are well suited to domains where large labeled datasets are available, since their capacity can easily be increased by adding more layers or more units in each layer. However, big networks with millions or billions of parameters can easily overfit even the largest of datasets. Correspondingly, a wide range of techniques for regularizing NNs have been developed. Adding an ℓ2 penalty on the network weights is one simple but effective approach. Other forms of regularization include: Bayesian methods (Mackay, 1995), weight elimination (Weigend et al., 1991) and early stopping of training. In practice, using these techniques when training big networks gives superior test performance to smaller networks trained without regularization.

Recently, Hinton et al. proposed a new form of regularization called Dropout (Hinton et al., 2012). For each training example, forward propagation involves randomly deleting half the activations in each layer. The error is then backpropagated only through the remaining activations. Extensive experiments show that this significantly reduces over-fitting and improves test performance. Although a full understanding of its mechanism is elusive, the intuition is that it prevents the network weights from collaborating with one another to memorize the training examples.

In this paper, we propose DropConnect, which generalizes Dropout by randomly dropping the weights rather than the activations. Like Dropout, the technique is suitable for fully connected layers only. We compare and contrast the two methods on four different image datasets.

2. Motivation

To demonstrate our method we consider a fully connected layer of a neural network with input v = [v1, v2, ..., vn]^T and weight parameters W (of size d × n). The output of this layer, r = [r1, r2, ..., rd]^T, is computed as a matrix multiply between the input vector and the weight matrix, followed by a non-linear activation function a (biases are included in W with a corresponding fixed input of 1 for simplicity):

    r = a(u) = a(Wv)    (1)

2.1. Dropout

Dropout was proposed by (Hinton et al., 2012) as a form of regularization for fully connected neural network layers. Each element of a layer's output is kept with probability p, otherwise being set to 0 with probability (1 − p). Extensive experiments show that Dropout improves the network's generalization ability, giving improved test performance.

[Figure 1 here: (a) model layout with input x, feature extractor g(x; Wg), DropConnect weights W (d × n) with mask M, activation a(u), softmax s(r; Ws) and predictions o (k × 1); (b) DropConnect mask M; (c) effective Dropout mask M′.]

Figure 1. (a): An example model layout for a single DropConnect layer. After running the feature extractor g() on input x, a random instantiation of the mask M (e.g. (b)) masks out the weight matrix W. The masked weights are multiplied with this feature vector to produce u, which is the input to an activation function a and a softmax layer s. For comparison, (c) shows the effective weight mask for elements that Dropout uses when applied to the previous layer's output (red columns) and this layer's output (green rows). Note the lack of structure in (b) compared to (c).

When Dropout is applied to the outputs of a fully connected layer, we can write Eqn. 1 as:

    r = m ⋆ a(Wv)    (2)

where ⋆ denotes element-wise product and m is a binary mask vector of size d with each element, j, drawn independently from m_j ∼ Bernoulli(p).

Many commonly used activation functions, such as tanh, centered sigmoid and relu (Nair and Hinton, 2010), have the property that a(0) = 0. Thus, Eqn. 2 could be re-written as r = a(m ⋆ Wv), where Dropout is applied at the inputs to the activation function.

2.2. DropConnect

DropConnect is the generalization of Dropout in which each connection, rather than each output unit, can be dropped with probability 1 − p. DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights W, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting W to be a fixed sparse matrix during training.

For a DropConnect layer, the output is given as:

    r = a((M ⋆ W)v)    (3)

where M is a binary matrix encoding the connection information and M_ij ∼ Bernoulli(p). Each element of the mask M is drawn independently for each example during training, essentially instantiating a different connectivity for each example seen. Additionally, the biases are also masked out during training. From Eqn. 2 and Eqn. 3, it is evident that DropConnect is the generalization of Dropout to the full connection structure of a layer¹.

¹ This holds when a(0) = 0, as is the case for tanh and relu functions.

The paper structure is as follows: we outline details on training and running inference in a model using DropConnect in Section 3, followed by theoretical justification for DropConnect in Section 4, GPU implementation specifics in Section 5, and experimental results in Section 6.

3. Model Description

We consider a standard model architecture composed of four basic components (see Fig. 1a):

1. Feature Extractor: v = g(x; Wg), where v are the output features, x is the input data to the overall model, and Wg are the parameters for the feature extractor. We choose g() to be a multi-layered convolutional neural network (CNN) (LeCun et al., 1998), with Wg being the convolutional filters (and biases) of the CNN.
2. DropConnect Layer: r = a(u) = a((M ⋆ W)v), where v is the output of the feature extractor, W is a fully connected weight matrix, a is a non-linear activation function and M is the binary mask matrix.
3. Softmax Classification Layer: o = s(r; Ws) takes as input r and uses parameters Ws to map this to a k-dimensional output (k being the number of classes).
4. Cross Entropy Loss: A(y, o) = − Σ_{i=1}^k y_i log(o_i) takes probabilities o and the ground truth labels y as input.
Regularization of Neural Networks using DropConnect

The overall model f(x; θ, M) therefore maps input data x to an output o through a sequence of operations given the parameters θ = {Wg, W, Ws} and randomly-drawn mask M. The correct value of o is obtained by summing out over all possible masks M:

    o = E_M[f(x; θ, M)] = Σ_M p(M) f(x; θ, M)    (4)

This reveals the mixture model interpretation of DropConnect (and Dropout), where the output is a mixture of 2^{|M|} different networks, each with weight p(M). If p = 0.5, then these weights are equal and o = (1/|M|) Σ_M f(x; θ, M) = (1/|M|) Σ_M s(a((M ⋆ W)v); Ws).

3.1. Training

Training the model described in Section 3 begins by selecting an example x from the training set and extracting features for that example, v. These features are input to the DropConnect layer, where a mask matrix M is first drawn from a Bernoulli(p) distribution to mask out elements of both the weight matrix and the biases in the DropConnect layer. A key component to successfully training with DropConnect is the selection of a different mask for each training example. Selecting a single mask for a subset of training examples, such as a mini-batch of 128 examples, does not regularize the model enough in practice. Since the memory requirement for the M's now grows with the size of each mini-batch, the implementation needs to be carefully designed, as described in Section 5.

Once a mask is chosen, it is applied to the weights and biases in order to compute the input to the activation function. This results in r, the input to the softmax layer, which outputs class predictions from which the cross entropy with the ground truth labels is computed. The parameters θ throughout the model can then be updated via stochastic gradient descent (SGD) by backpropagating gradients of the loss function with respect to the parameters, A′_θ. To update the weight matrix W in a DropConnect layer, the mask is applied to the gradient to update only those elements that were active in the forward pass. Additionally, when passing gradients down to the feature extractor, the masked weight matrix M ⋆ W is used. A summary of these steps is provided in Algorithm 1.

Algorithm 1 SGD Training with DropConnect

    Input: example x, parameters θ_{t−1} from step t − 1, learning rate η
    Output: updated parameters θ_t
    Forward Pass:
        Extract features: v ← g(x; Wg)
        Randomly sample mask: M_ij ∼ Bernoulli(p)
        Compute activations: r = a((M ⋆ W)v)
        Compute output: o = s(r; Ws)
    Backpropagate Gradients:
        Differentiate loss A with respect to parameters θ to obtain A′_θ:
        Update softmax layer: Ws = Ws − η A′_{Ws}
        Update DropConnect layer: W = W − η (M ⋆ A′_W)
        Update feature extractor: Wg = Wg − η A′_{Wg}

3.2. Inference

At inference time, we need to compute r = (1/|M|) Σ_M a((M ⋆ W)v), which naively requires the evaluation of 2^{|M|} different masks — plainly infeasible.

The Dropout work (Hinton et al., 2012) made the approximation Σ_M a((M ⋆ W)v) ≈ a(Σ_M (M ⋆ W)v), i.e. averaging before the activation rather than after. Although this seems to work in practice, it is not justified mathematically, particularly for the relu activation function.²

We take a different approach. Consider a single unit u_i before the activation function a(): u_i = Σ_j (W_ij v_j) M_ij. This is a weighted sum of Bernoulli variables M_ij, which can be approximated by a Gaussian via moment matching. The mean and variance of the units u are E_M[u] = pWv and V_M[u] = p(1 − p)(W ⋆ W)(v ⋆ v). We can then draw samples from this Gaussian and pass them through the activation function a() before averaging them and presenting them to the next layer. Algorithm 2 summarizes the method. Note that the sampling can be done efficiently, since the samples for each unit and example can be drawn in parallel. This scheme is only an approximation in the case of a multi-layer network, but it works well in practice, as shown in the Experiments.

Algorithm 2 Inference with DropConnect

    Input: example x, parameters θ, number of samples Z
    Output: prediction u
    Extract features: v ← g(x; Wg)
    Moment matching of u: μ ← E_M[u], σ² ← V_M[u]
    for z = 1 : Z do                  %% draw Z samples
        for i = 1 : d do              %% loop over units in r
            Sample from 1D Gaussian: u_{i,z} ∼ N(μ_i, σ_i²)
            r_{i,z} ← a(u_{i,z})
        end for
    end for
    Pass result r̂ = (1/Z) Σ_{z=1}^Z r_z to the next layer

² Consider u ∼ N(0, 1), with a(u) = max(u, 0). Then a(E_M(u)) = 0 but E_M(a(u)) = 1/√(2π) ≈ 0.4.
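A minimal NumPy sketch of the inference procedure of Algorithm 2 (our illustration, not the authors' code; relu and p = 0.5 assumed): the pre-activation u is approximated by a Gaussian with the moments given above, Z samples are pushed through the activation, and the results are averaged.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(u):
    return np.maximum(u, 0.0)

def dropconnect_inference(W, v, p=0.5, Z=1000):
    # Moment matching (Section 3.2): u_i = sum_j W_ij v_j M_ij is a weighted
    # sum of Bernoulli variables, approximated by a Gaussian.
    mu = p * (W @ v)                          # E_M[u] = p W v
    var = p * (1 - p) * ((W * W) @ (v * v))   # V_M[u] = p(1-p)(W*W)(v*v)
    # Draw Z samples per unit, apply the activation, then average (Algorithm 2).
    u = rng.normal(mu, np.sqrt(var), size=(Z, mu.shape[0]))
    return relu(u).mean(axis=0)

d, n = 4, 6
W = rng.standard_normal((d, n))
v = rng.standard_normal(n)
print(dropconnect_inference(W, v))
```

Averaging after the activation is the point of the scheme: for relu, a(E[u]) and E[a(u)] differ (footnote 2), which is why the mean-inference shortcut used for Dropout is not adopted here.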

Implementation        Mask Weight                    fprop (ms)   bprop acts (ms)   bprop weights (ms)   total (ms)   Speedup
CPU                   float                          480.2        1228.6            1692.8               3401.6       1.0×
CPU                   bit                            392.3        679.1             759.7                1831.1       1.9×
GPU                   float (global memory)          21.6         6.2               7.2                  35.0         97.2×
GPU                   float (tex1D memory)           15.1         6.1               6.0                  27.2         126.0×
GPU                   bit (tex2D aligned memory)     2.4          2.7               3.1                  8.2          414.8×
GPU (lower bound)     cuBlas + read mask weight      0.3          0.3               0.2                  0.8          —

Table 1. Performance comparison between different implementations of our DropConnect layer on an NVidia GTX 580 GPU relative to a 2.67 GHz Intel Xeon (compiled with the -O3 flag). Input and output dimensions are 1024 and the mini-batch size is 128. As a reference we provide traditional matrix multiplication using the cuBlas library.

4. Model Generalization Bound

We now show a novel bound for the Rademacher complexity of the model, R̂_ℓ(F), on the training set (see the appendix for the derivation):

    R̂_ℓ(F) ≤ p ( 2√(kd) B_s n √d B_h ) R̂_ℓ(G)    (5)

where max|Ws| ≤ B_s, max|W| ≤ B_h, k is the number of classes, R̂_ℓ(G) is the Rademacher complexity of the feature extractor, and n and d are the dimensionality of the input and output of the DropConnect layer respectively. The important result from Eqn. 5 is that the complexity is a linear function of the probability p of an element being kept in DropConnect or Dropout. When p = 0, the model complexity is zero, since the input has no influence on the output. When p = 1, it returns to the complexity of a standard model.

5. Implementation Details

Our system involves three components implemented on a GPU: 1) a feature extractor, 2) our DropConnect layer, and 3) a softmax classification layer. For 1 and 3 we utilize the Cuda-convnet package (Krizhevsky, 2012), a fast GPU-based convolutional network library. We implement a custom GPU kernel for performing the operations within the DropConnect layer. Our code is available at http://cs.nyu.edu/~wanli/dropc.

A typical fully connected layer is implemented as a matrix-matrix multiplication between the input vectors for a mini-batch of training examples and the weight matrix. The difficulty in our case is that each training example requires its own random mask matrix applied to the weights and biases of the DropConnect layer. This leads to several complications:

1. For a weight matrix of size d × n, the corresponding mask matrix is of size d × n × b, where b is the size of the mini-batch. For a 4096 × 4096 fully connected layer with a mini-batch size of 128, the mask matrix would be too large to fit into GPU memory if each element is stored as a floating point number, requiring 8G of memory.
2. Once a random instantiation of the mask is created, it is non-trivial to access all the elements required during the matrix multiplications so as to maximize performance.

The first problem is not hard to address. Each element of the mask matrix is stored as a single bit to encode the connectivity information rather than as a float. The memory cost is thus reduced by 32 times, which becomes 256M for the example above. This not only reduces the memory footprint, but also reduces the bandwidth required, as 32 elements can be accessed with each 4-byte read. We overcome the second problem using an efficient memory access pattern based on 2D texture aligned memory. These two improvements are crucial for an efficient GPU implementation of DropConnect, as shown in Table 1. Here we compare to a naive CPU implementation with floating point masks and get a 415× speedup with our efficient GPU design.
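The bit-mask storage described above is easy to picture in a few lines. The following NumPy sketch is our illustration, not the paper's CUDA kernel, and uses a deliberately small mini-batch for the demo; it packs per-example Bernoulli masks into one bit per element and unpacks a single example's mask when it is needed to form (M ⋆ W)v.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, b = 1024, 1024, 8   # layer output dim, input dim, (small) mini-batch for the demo
p = 0.5

# One independent mask per example. Stored as float32 this would take d*n*b*4 bytes
# (~8 GB for the 4096 x 4096 x 128 case quoted in the text); one bit per element is 32x less.
mask_bool = rng.random((b, d, n)) < p
packed = np.packbits(mask_bool, axis=-1)        # 8 mask elements per byte
print(mask_bool.nbytes / 2**20, "MB at one byte per element")
print(packed.nbytes / 2**20, "MB packed at one bit per element")

# Unpack example 0's mask when it is needed for the masked multiply (M * W) v.
M0 = np.unpackbits(packed[0], axis=-1, count=n).astype(np.float32)
assert np.array_equal(M0, mask_bool[0].astype(np.float32))
```

The texture-aligned access pattern mentioned above is a GPU-specific detail with no NumPy analogue, so it is not reproduced here.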

6. Experiments

We evaluate our DropConnect model for regularizing deep neural networks trained for image classification. All experiments use mini-batch SGD with momentum on batches of 128 images, with the momentum parameter fixed at 0.9.

We use the following protocol for all experiments unless otherwise stated:

• Augment the dataset by: 1) randomly selecting cropped regions from the images, 2) flipping images horizontally, 3) introducing 15% scaling and rotation variations.
• Train 5 independent networks with random permutations of the training sequence.
• Manually decrease the learning rate if the network stops improving, as in (Krizhevsky, 2012), according to a schedule determined on a validation set.
• Train the fully connected layer using Dropout, DropConnect, or neither (No-Drop).
• At inference time for DropConnect we draw Z = 1000 samples at the inputs to the activation function of the fully connected layer and average their activations.

To anneal the initial learning rate we choose a fixed multiplier for different stages of training. We report three numbers of epochs, such as 600-400-200, to define our schedule. We multiply the initial rate by 1 for the first such number of epochs. Then we use a multiplier of 0.5 for the second number of epochs, followed by 0.1 again for this second number of epochs. The third number of epochs is used for multipliers of 0.05, 0.01, 0.005, and 0.001 in that order, after which point we report our results. We determine the epochs to use for our schedule using a validation set to look for plateaus in the loss function, at which point we move to the next multiplier.³

³ In all experiments the bias learning rate is 2× the learning rate for the weights. Additionally, weights are initialized with N(0, 0.1) random values for fully connected layers and N(0, 0.01) for convolutional layers.

Once the 5 networks are trained we report two numbers: 1) the mean and standard deviation of the classification errors produced by each of the 5 independent networks, and 2) the classification error that results when averaging the output probabilities from the 5 networks before making a prediction. We find in practice that this voting scheme, inspired by (Ciresan et al., 2012), provides significant performance gains, achieving state-of-the-art results on many standard benchmarks when combined with our DropConnect layer.

6.1. MNIST

The MNIST handwritten digit classification task (LeCun et al., 1998) consists of 28 × 28 black and white images, each containing a digit 0 to 9 (10 classes). Each digit in the 60,000 training images and 10,000 test images is normalized to fit in a 20 × 20 pixel box while preserving the aspect ratio. We scale the pixel values to the [0, 1] range before inputting them to our models.

For our first experiment on this dataset, we train models with two fully connected layers, each with 800 output units, using either tanh, sigmoid or relu activation functions to compare to Dropout in (Hinton et al., 2012). The first layer takes the image pixels as input, while the second layer's output is fed into a 10-class softmax classification layer. In Table 2 we show the performance of various activation functions, comparing No-Drop, Dropout and DropConnect in the fully connected layers. No data augmentation is utilized in this experiment. We use an initial learning rate of 0.1 and train for 600-400-20 epochs using our schedule.

neuron     model          5 network error(%)   voting error(%)
relu       No-Drop        1.62 ± 0.037          1.40
           Dropout        1.28 ± 0.040          1.20
           DropConnect    1.20 ± 0.034          1.12
sigmoid    No-Drop        1.78 ± 0.037          1.74
           Dropout        1.38 ± 0.039          1.36
           DropConnect    1.55 ± 0.046          1.48
tanh       No-Drop        1.65 ± 0.026          1.49
           Dropout        1.58 ± 0.053          1.55
           DropConnect    1.36 ± 0.054          1.35

Table 2. MNIST classification error rate for models with two fully connected layers of 800 neurons each. No data augmentation is used in this experiment.

From Table 2 we can see that both Dropout and DropConnect perform better than not using either method. DropConnect mostly performs better than Dropout in this task, with the gap widening when utilizing the voting over the 5 models.

To further analyze the effects of DropConnect, we show three explanatory experiments in Fig. 2, using a 2-layer fully connected model on MNIST digits. Fig. 2a shows test performance as the number of hidden units in each layer varies. As the model size increases, No-Drop overfits while both Dropout and DropConnect improve performance. DropConnect consistently gives a lower error rate than Dropout. Fig. 2b shows the effect of varying the drop rate p for Dropout and DropConnect for a 400-400 unit network. Both methods give optimal performance in the vicinity of 0.5, the value used in all other experiments in the paper. Our sampling approach gives a performance gain over mean inference (as used by Hinton (Hinton et al., 2012)), but only for the DropConnect case. In Fig. 2c we plot the convergence properties of the three methods throughout training on a 400-400 network. We can see that No-Drop overfits quickly, while Dropout and DropConnect converge slowly to ultimately give superior test performance. DropConnect is even slower to converge than Dropout, but yields a lower test error in the end.

In order to improve our classification result, we choose a more powerful feature extractor network described in (Ciresan et al., 2012) (relu is used rather than tanh). This feature extractor consists of a 2-layer CNN with 32-64 feature maps in each layer respectively. The last layer's output is treated as input to the fully connected layer, which has 150 relu units on which No-Drop, Dropout or DropConnect are applied. We report results in Table 3 from training the network on: a) the original MNIST digits, b) cropped 24 × 24 images from random locations, and c) rotated and scaled versions of these cropped images. We use an initial learning rate of 0.01 with a 700-200-100 epoch schedule, no momentum, and preprocess by subtracting the image mean.

[Figure 2 here, three panels: (a) % test error vs. number of hidden units for No-Drop, Dropout and DropConnect; (b) % test error vs. % of elements kept, comparing mean inference and sampling for Dropout and DropConnect; (c) train/test cross entropy vs. epoch for the three methods.]

Figure 2. Using the MNIST dataset, in a) we analyze the ability of Dropout and DropConnect to prevent overfitting as the size of the 2 fully connected layers increases. b) Varying the drop rate in a 400-400 network shows near-optimal performance around the p = 0.5 proposed by (Hinton et al., 2012). c) We show the convergence properties of the train/test sets. See text for discussion.

crop   rotation/scaling   model          5 network error(%)   voting error(%)
no     no                 No-Drop        0.77 ± 0.051          0.67
                          Dropout        0.59 ± 0.039          0.52
                          DropConnect    0.63 ± 0.035          0.57
yes    no                 No-Drop        0.50 ± 0.098          0.38
                          Dropout        0.39 ± 0.039          0.35
                          DropConnect    0.39 ± 0.047          0.32
yes    yes                No-Drop        0.30 ± 0.035          0.21
                          Dropout        0.28 ± 0.016          0.27
                          DropConnect    0.28 ± 0.032          0.21

Table 3. MNIST classification error. Previous state of the art is 0.47% (Zeiler and Fergus, 2013) for a single model without elastic distortions and 0.23% with elastic distortions and voting (Ciresan et al., 2012).

We note that our approach surpasses the state-of-the-art result of 0.23% (Ciresan et al., 2012), achieving a 0.21% error rate, without the use of elastic distortions (as used by (Ciresan et al., 2012)).

6.2. CIFAR-10

CIFAR-10 is a dataset of natural 32x32 RGB images (Krizhevsky, 2009) in 10 classes, with 50,000 images for training and 10,000 for testing. Before inputting these images to our network, we subtract the per-pixel mean computed over the training set from each image.

The first experiment on CIFAR-10 (summarized in Table 4) uses the simple convolutional network feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) that is designed for rapid training rather than optimal performance. On top of the 3-layer feature extractor we have a 64-unit fully connected layer which uses No-Drop, Dropout, or DropConnect. No data augmentation is utilized for this experiment.

Since this experiment is not aimed at optimal performance, we report a single model's performance without voting. We train for 150-0-0 epochs with an initial learning rate of 0.001 and their default weight decay. DropConnect prevents overfitting of the fully connected layer better than Dropout in this experiment.

model          error(%)
No-Drop        23.5
Dropout        19.7
DropConnect    18.7

Table 4. CIFAR-10 classification error using the simple feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) and with no data augmentation.

Table 5 shows classification results of the network using a larger feature extractor with 2 convolutional layers and 2 locally connected layers, as described in (Krizhevsky, 2012) (layers-conv-local-11pct.cfg). A 128-neuron fully connected layer with relu activations is added between the softmax layer and the feature extractor. Following (Krizhevsky, 2012), images are cropped to 24x24 with horizontal flips, and no rotation or scaling is performed. We use an initial learning rate of 0.001 and train for 700-300-50 epochs with their default weight decay. Model voting significantly improves performance when using Dropout or DropConnect, the latter reaching an error rate of 9.41%. Additionally, we trained a model with 12 networks with DropConnect and achieved a state-of-the-art result of 9.32%, indicating the power of our approach.

6.3. SVHN

The Street View House Numbers (SVHN) dataset includes 604,388 images (both training set and extra set) and 26,032 testing images (Netzer et al., 2011). Similar to MNIST, the goal is to classify the digit centered in each 32x32 RGB image.

model          5 network error(%)   voting error(%)
No-Drop        11.18 ± 0.13          10.22
Dropout        11.52 ± 0.18          9.83
DropConnect    11.10 ± 0.13          9.41

Table 5. CIFAR-10 classification error using a larger feature extractor. Previous state-of-the-art is 9.5% (Snoek et al., 2012). Voting with 12 DropConnect networks produces an error rate of 9.32%, significantly beating the state-of-the-art.

Due to the large variety of colors and brightness variations in the images, we preprocess the images using local contrast normalization, as in (Zeiler and Fergus, 2013). The feature extractor is the same as in the larger CIFAR-10 experiment, but we instead use a larger 512-unit fully connected layer with relu activations between the softmax layer and the feature extractor. After contrast normalizing, the training data is randomly cropped to 28 × 28 pixels and is rotated and scaled. We do not do horizontal flips. Table 6 shows the classification performance for 5 models trained with an initial learning rate of 0.001 for a 100-50-10 epoch schedule.

Due to the large training set size, both Dropout and DropConnect achieve nearly the same performance as No-Drop. However, using our data augmentation techniques and careful annealing, the per-model scores easily surpass the previous 2.80% state-of-the-art result of (Zeiler and Fergus, 2013). Furthermore, our voting scheme reduces the relative error of the previous state-of-the-art by 30% to achieve 1.94% error.

model          5 network error(%)   voting error(%)
No-Drop        2.26 ± 0.072          1.94
Dropout        2.25 ± 0.034          1.96
DropConnect    2.23 ± 0.039          1.94

Table 6. SVHN classification error. The previous state-of-the-art is 2.8% (Zeiler and Fergus, 2013).

6.4. NORB

In the final experiment we evaluate our models on the 2-fold NORB (jittered-cluttered) dataset (LeCun et al., 2004), a collection of stereo images of 3D models. For each image, one of 6 classes appears on a random background. We train on 2 folds of 29,160 images each and test on a total of 58,320 images. The images are downsampled from 108×108 to 48×48, as in (Ciresan et al., 2012).

We use the same feature extractor as in the larger CIFAR-10 experiment. There is a 512-unit fully connected layer with relu activations placed between the softmax layer and the feature extractor. Rotation and scaling of the training data is applied, but we do not crop or flip the images, as we found that to hurt performance on this dataset. We trained with an initial learning rate of 0.001 and anneal for 100-40-10 epochs.

model          5 network error(%)   voting error(%)
No-Drop        4.48 ± 0.78           3.36
Dropout        3.96 ± 0.16           3.03
DropConnect    4.14 ± 0.06           3.23

Table 7. NORB classification error for the jittered-cluttered dataset, using 2 training folds. The previous state-of-the-art is 3.57% (Ciresan et al., 2012).

In this experiment we beat the previous state-of-the-art result of 3.57% using No-Drop, Dropout and DropConnect with our voting scheme. While Dropout surpasses DropConnect slightly, both methods improve over No-Drop on this benchmark, as shown in Table 7.

7. Discussion

We have presented DropConnect, which generalizes Hinton et al.'s Dropout (Hinton et al., 2012) to the entire connectivity structure of a fully connected neural network layer. We provide both theoretical justification and empirical results to show that DropConnect helps regularize large neural network models. Results on a range of datasets show that DropConnect often outperforms Dropout. While our current implementation of DropConnect is slightly slower than No-Drop or Dropout, in large models the feature extractor is the bottleneck, so there is little difference in overall training time. DropConnect allows us to train large models while avoiding overfitting. This yields state-of-the-art results on a variety of standard benchmarks using our efficient GPU implementation of DropConnect.

8. Appendix

8.1. Preliminaries

Definition 1 (DropConnect Network). Given a data set S with ℓ entries {x_1, x_2, ..., x_ℓ} and labels {y_1, y_2, ..., y_ℓ}, we define the DropConnect network as a mixture model:

    o = Σ_M p(M) f(x; θ, M) = E_M[f(x; θ, M)]

Each network f(x; θ, M) has weight p(M), and the network parameters are θ = {Ws, W, Wg}. Ws are the softmax layer parameters, W are the DropConnect layer parameters and Wg are the feature extractor parameters. Furthermore, M is the DropConnect layer mask.

Now we reformulate the cross-entropy loss on top of the softmax into a single-parameter function that combines the softmax output and labels, as a logistic.

Definition 2 (Logistic Loss). The following loss function defined on k-class classification is called the logistic loss function:

    A_y(o) = − Σ_i y_i ln( exp(o_i) / Σ_j exp(o_j) ) = −o_i + ln Σ_j exp(o_j)

where y is a binary vector with the i-th bit set on.

Lemma 1. The logistic loss function A has the following properties: 1) A_y(0) = ln k, 2) −1 ≤ A′_y(o) ≤ 1, and 3) A″_y(o) ≥ 0.

Definition 3 (Rademacher complexity). For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on a set X and a real-valued function class F in domain X, the empirical Rademacher complexity of F is the random variable:

    R̂_ℓ(F) = E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^ℓ σ_i f(x_i) |  |  x_1, ..., x_ℓ ]

where σ = {σ_1, ..., σ_ℓ} are independent uniform {±1}-valued (Rademacher) random variables. The Rademacher complexity of F is R_ℓ(F) = E_S[ R̂_ℓ(F) ].

8.2. Bound Derivation

Lemma 2 ((Ledoux and Talagrand, 1991)). Let F be a class of real functions and H = [F_j]_{j=1}^k be a k-dimensional function class. If A: R^k → R is a Lipschitz function with constant L and satisfies A(0) = 0, then R̂_ℓ(A ∘ H) ≤ 2kL R̂_ℓ(F).

Lemma 3 (Classifier Generalization Bound). The generalization bound of a k-class classifier with logistic loss function is directly related to the Rademacher complexity of that classifier:

    E[A_y(o)] ≤ (1/ℓ) Σ_{i=1}^ℓ A_{y_i}(o_i) + 2k R̂_ℓ(F) + 3 √( ln(2/δ) / (2ℓ) )

Lemma 4. For all neuron activations sigmoid, tanh and relu, we have: R̂_ℓ(a ∘ F) ≤ 2 R̂_ℓ(F).

Lemma 5 (Network Layer Bound). Let G be the class of real functions R^d → R with input dimension F, i.e. G = [F_j]_{j=1}^d, and let H_B be a linear transform function parametrized by W with ‖W‖_2 ≤ B. Then R̂_ℓ(H ∘ G) ≤ √d B R̂_ℓ(F).

Proof.
    R̂_ℓ(H ∘ G) = E_σ[ sup_{h∈H, g∈G} (2/ℓ) Σ_{i=1}^ℓ σ_i h∘g(x_i) ]
               = E_σ[ sup_{g∈G, ‖W‖≤B} ⟨ W, (2/ℓ) Σ_{i=1}^ℓ σ_i g(x_i) ⟩ ]
               ≤ B E_σ[ sup_{f^j∈F} ‖ [ (2/ℓ) Σ_{i=1}^ℓ σ_i f^j(x_i) ]_{j=1}^d ‖ ]
               = B √d E_σ[ sup_{f∈F} (2/ℓ) Σ_{i=1}^ℓ σ_i f(x_i) ] = √d B R̂_ℓ(F)

Remark 1. Given a layer in our network, we denote the function of all layers before it as G = [F_j]_{j=1}^d. This layer has the linear transformation function H and activation function a. By Lemma 4 and Lemma 5, we know the network complexity is bounded by:

    R̂_ℓ(H ∘ G) ≤ c √d B R̂_ℓ(F)

where c = 1 for the identity neuron and c = 2 for others.

Lemma 6. Let F_M be the class of real functions that depend on M; then R̂_ℓ(E_M[F_M]) ≤ E_M[ R̂_ℓ(F_M) ].

Proof. R̂_ℓ(E_M[F_M]) = R̂_ℓ( Σ_M p(M) F_M ) ≤ Σ_M R̂_ℓ( p(M) F_M ) ≤ Σ_M |p(M)| R̂_ℓ(F_M) = E_M[ R̂_ℓ(F_M) ]

Theorem 1 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let R̂_ℓ(G) be the empirical Rademacher complexity of the feature extractor and R̂_ℓ(F) be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameter of the DropConnect layer satisfies |W| ≤ B_h;
2. the weight parameter of s satisfies |Ws| ≤ B_s (its L2-norm is bounded by √(dk) B_s).

Then we have: R̂_ℓ(F) ≤ p ( 2√(kd) B_s n √d B_h ) R̂_ℓ(G)

Proof.
    R̂_ℓ(F) = R̂_ℓ( E_M[f(x; θ, M)] ) ≤ E_M[ R̂_ℓ( f(x; θ, M) ) ]        (6)
            ≤ ( √(dk) B_s ) √d E_M[ R̂_ℓ( a ∘ h_M ∘ g ) ]                 (7)
            = 2 √(kd) B_s E_M[ R̂_ℓ( h_M ∘ g ) ]                          (8)

where h_M = (M ⋆ W)v. Equation (6) is based on Lemma 6, Equation (7) is based on Lemma 5 and Equation (8) follows from Lemma 4.

    E_M[ R̂_ℓ( h_M ∘ g ) ]
            = E_{M,σ}[ sup_{h∈H, g∈G} (2/ℓ) Σ_{i=1}^ℓ σ_i W^T D_M g(x_i) ]                                      (9)
            = E_{M,σ}[ sup_{h∈H, g∈G} ⟨ D_M W, (2/ℓ) Σ_{i=1}^ℓ σ_i g(x_i) ⟩ ]
            ≤ E_M[ max_W ‖ D_M W ‖ ] E_σ[ sup_{g^j∈G} ‖ [ (2/ℓ) Σ_{i=1}^ℓ σ_i g^j(x_i) ]_{j=1}^n ‖ ]            (10)
            ≤ B_h p √(nd) √n R̂_ℓ(G) = p n √d B_h R̂_ℓ(G)

where D_M in Equation (9) is a diagonal matrix with diagonal elements equal to m, and inner product properties lead to Equation (10). Thus, we have:

    R̂_ℓ(F) ≤ p ( 2√(kd) B_s n √d B_h ) R̂_ℓ(G)
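The appendix stops at Theorem 1. For readability, the following LaTeX fragment is our addition, a direct substitution rather than a result stated in the paper: it records the generalization bound obtained by plugging Theorem 1 into Lemma 3.

```latex
% Combining Lemma 3 (generalization via Rademacher complexity) with
% Theorem 1 (DropConnect network complexity). This substitution is our
% addition for readability; it is not stated explicitly in the paper.
\[
  \mathbb{E}[A_y(o)] \;\le\;
  \frac{1}{\ell}\sum_{i=1}^{\ell} A_{y_i}(o_i)
  \;+\; 2k\, p\left(2\sqrt{kd}\,B_s\, n \sqrt{d}\, B_h\right)\hat{R}_\ell(\mathcal{G})
  \;+\; 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}
\]
% The gap between training and expected loss again scales linearly with the
% keep probability p, which is the point made in Section 4.
```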

References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, pages 3642–3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet/, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '04, pages 97–104, Washington, DC, USA, 2004. IEEE Computer Society.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

D. J. C. Mackay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. In Bayesian Methods for Backpropagation Networks. Springer, 1995.

V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
