Regularization of Neural Networks using DropConnect
Li Wan [email protected]
Matthew Zeiler [email protected]
Sixin Zhang [email protected]
Yann LeCun [email protected]
Rob Fergus [email protected]
Dept. of Computer Science, Courant Institute of Mathematical Sciences, New York University
nected layer, we can write Eqn. 1 as:

$$r = m \star a(Wv) \qquad (2)$$

where $\star$ denotes element-wise product and $m$ is a binary mask vector of size $d$ with each element $j$ drawn independently from $m_j \sim \mathrm{Bernoulli}(p)$.

Many commonly used activation functions, such as tanh, centered sigmoid and relu (Nair and Hinton, 2010), have the property that $a(0) = 0$. Thus, Eqn. 2 could be re-written as $r = a(m \star Wv)$, where Dropout is applied at the inputs to the activation function.

2.2. DropConnect

DropConnect is the generalization of Dropout in which each connection, rather than each output unit, can be dropped with probability $1 - p$. DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights $W$, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting $W$ to be a fixed sparse matrix during training.

For a DropConnect layer, the output is given as:

$$r = a\left((M \star W)\,v\right) \qquad (3)$$

where $M$ is a binary matrix encoding the connection information and $M_{ij} \sim \mathrm{Bernoulli}(p)$. Each element of the mask $M$ is drawn independently for each example during training, essentially instantiating a different connectivity for each example seen. Additionally, the biases are also masked out during training. From Eqn. 2 and Eqn. 3, it is evident that DropConnect is the generalization of Dropout to the full connection structure of a layer.¹

The paper structure is as follows: we outline details on training and running inference in a model using DropConnect in section 3, followed by theoretical justification for DropConnect in section 4, GPU implementation specifics in section 5, and experimental results in section 6.

¹ This holds when $a(0) = 0$, as is the case for the tanh and relu functions.

3. Model Description

We consider a standard model architecture composed of four basic components (see Fig. 1a); a code sketch of the DropConnect layer follows the list:

1. Feature Extractor: $v = g(x; W_g)$ where $v$ are the output features, $x$ is input data to the overall model, and $W_g$ are parameters for the feature extractor. We choose $g()$ to be a multi-layered convolutional neural network (CNN) (LeCun et al., 1998), with $W_g$ being the convolutional filters (and biases) of the CNN.

2. DropConnect Layer: $r = a(u) = a((M \star W)v)$ where $v$ is the output of the feature extractor, $W$ is a fully connected weight matrix, $a$ is a non-linear activation function and $M$ is the binary mask matrix.

3. Softmax Classification Layer: $o = s(r; W_s)$ takes as input $r$ and uses parameters $W_s$ to map this to a $k$-dimensional output ($k$ being the number of classes).

4. Cross Entropy Loss: $A(y, o) = -\sum_{i=1}^{k} y_i \log(o_i)$ takes probabilities $o$ and the ground truth labels $y$ as input.
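A minimal NumPy sketch of the DropConnect layer forward pass (component 2 above); the layer sizes, the relu choice for $a()$, and the helper name dropconnect_forward are illustrative assumptions rather than part of the model specification:

import numpy as np

def dropconnect_forward(W, b, v, p=0.5, rng=np.random.default_rng(0)):
    """One DropConnect forward pass: r = a((M * W) v), with the bias masked too."""
    M_W = rng.binomial(1, p, size=W.shape)   # weight mask, M_ij ~ Bernoulli(p)
    M_b = rng.binomial(1, p, size=b.shape)   # biases are also masked during training
    u = (M_W * W) @ v + M_b * b              # masked pre-activation
    return np.maximum(u, 0.0)                # a() = relu (any a with a(0) = 0 behaves similarly)

# toy usage: 512-dim features into a 128-unit DropConnect layer
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(128, 512))
b = np.zeros(128)
v = rng.normal(size=512)
r = dropconnect_forward(W, b, v, p=0.5, rng=rng)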
The overall model $f(x; \theta, M)$ therefore maps input data $x$ to an output $o$ through a sequence of operations given the parameters $\theta = \{W_g, W, W_s\}$ and randomly-drawn mask $M$. The correct value of $o$ is obtained by summing out over all possible masks $M$:

$$o = E_M[f(x; \theta, M)] = \sum_M p(M) f(x; \theta, M) \qquad (4)$$

This reveals the mixture model interpretation of DropConnect (and Dropout), where the output is a mixture of $2^{|M|}$ different networks, each with weight $p(M)$. If $p = 0.5$, then these weights are equal and $o = \frac{1}{|M|} \sum_M f(x; \theta, M) = \frac{1}{|M|} \sum_M s(a((M \star W)v); W_s)$.

Algorithm 1 SGD Training with DropConnect
  Input: example $x$, parameters $\theta_{t-1}$ from step $t-1$, learning rate $\eta$
  Output: updated parameters $\theta_t$
  Forward Pass:
    Extract features: $v \leftarrow g(x; W_g)$
    Randomly sample mask $M$: $M_{ij} \sim \mathrm{Bernoulli}(p)$
    Compute activations: $r = a((M \star W)v)$
    Compute output: $o = s(r; W_s)$
  Backpropagate Gradients:
    Differentiate loss $A$ with respect to parameters $\theta$ to obtain $A'_\theta$:
    Update softmax layer: $W_s = W_s - \eta A'_{W_s}$
    Update DropConnect layer: $W = W - \eta (M \star A'_W)$
    Update feature extractor: $W_g = W_g - \eta A'_{W_g}$
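To make Eqn. 4 concrete, the small sketch below enumerates every mask of a tiny 2×3 DropConnect layer and averages the resulting sub-network outputs with equal weight (the $p = 0.5$ case). The toy sizes and the identity feature extractor are assumptions made only for this illustration:

import itertools
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))          # tiny DropConnect weight matrix
Ws = rng.normal(size=(4, 2))         # softmax-layer weights (4 classes)
v = rng.normal(size=3)               # features (identity feature extractor assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f(mask):
    """One member of the mixture: s(a((M * W) v); Ws) for a fixed mask M."""
    r = np.maximum((mask * W) @ v, 0.0)
    return softmax(Ws @ r)

# enumerate all 2^|M| binary masks; with p = 0.5 every mask has equal weight
masks = [np.array(bits).reshape(W.shape)
         for bits in itertools.product([0, 1], repeat=W.size)]
o = np.mean([f(M) for M in masks], axis=0)   # o = E_M[f(x; theta, M)]
print(len(masks), o)                         # 64 mixture components for a 2x3 mask

For realistic layer sizes this enumeration is clearly infeasible, which is what motivates the sampling-based inference of Section 3.2.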
3.1. Training

Training the model described in Section 3 begins by selecting an example $x$ from the training set and extracting features for that example, $v$. These features are input to the DropConnect layer, where a mask matrix $M$ is first drawn from a $\mathrm{Bernoulli}(p)$ distribution to mask out elements of both the weight matrix and the biases in the DropConnect layer. A key component of successfully training with DropConnect is the selection of a different mask for each training example. Selecting a single mask for a subset of training examples, such as a mini-batch of 128 examples, does not regularize the model enough in practice. Since the memory requirement for the $M$'s now grows with the size of each mini-batch, the implementation needs to be carefully designed as described in Section 5.

Once a mask is chosen, it is applied to the weights and biases in order to compute the input to the activation function. This results in $r$, the input to the softmax layer, which outputs class predictions from which the cross entropy with the ground truth labels is computed. The parameters throughout the model $\theta$ can then be updated via stochastic gradient descent (SGD) by backpropagating gradients of the loss function with respect to the parameters, $A'_\theta$. To update the weight matrix $W$ in a DropConnect layer, the mask is applied to the gradient to update only those elements that were active in the forward pass. Additionally, when passing gradients down to the feature extractor, the masked weight matrix $M \star W$ is used. A summary of these steps is provided in Algorithm 1.
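A sketch of one SGD step following Algorithm 1, under simplifying assumptions: the feature extractor is a single linear map rather than a CNN, biases are omitted, the activation is relu, and all names (sgd_step, Wg, Ws, etc.) are illustrative. The DropConnect-specific details are the fresh mask per example, the mask applied to the weight gradient, and the masked weights M * W used when backpropagating to the feature extractor:

import numpy as np

def sgd_step(x, y, Wg, W, Ws, p=0.5, lr=0.1, rng=np.random.default_rng(0)):
    """One DropConnect training step on a single example (y is a one-hot label vector)."""
    # Forward pass
    v = Wg @ x                                  # feature extractor (linear stand-in for the CNN)
    M = rng.binomial(1, p, size=W.shape)        # fresh mask drawn for this example
    u = (M * W) @ v                             # masked pre-activation
    r = np.maximum(u, 0.0)                      # a() = relu
    logits = Ws @ r
    probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax output o

    # Backward pass for the cross entropy loss A(y, o)
    dlogits = probs - y                         # gradient at the softmax inputs
    dWs = np.outer(dlogits, r)
    dr = Ws.T @ dlogits
    du = dr * (u > 0)                           # relu gate
    dW = np.outer(du, v)                        # raw gradient w.r.t. W
    dv = (M * W).T @ du                         # masked weights used going down to g()
    dWg = np.outer(dv, x)

    # Parameter updates; only connections active in the forward pass change
    Ws -= lr * dWs
    W  -= lr * (M * dW)
    Wg -= lr * dWg
    return Wg, W, Ws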
3.2. Inference

At inference time, we need to compute $r = \frac{1}{|M|}\sum_M a((M \star W)v)$, which naively requires the evaluation of $2^{|M|}$ different masks – plainly infeasible. The Dropout work (Hinton et al., 2012) made the approximation $\sum_M a((M \star W)v) \approx a\left(\sum_M (M \star W)v\right)$, i.e. averaging before the activation rather than after. Although this seems to work in practice, it is not justified mathematically, particularly for the relu activation function.²

We take a different approach. Consider a single unit $u_i$ before the activation function $a()$: $u_i = \sum_j (W_{ij} v_j) M_{ij}$. This is a weighted sum of Bernoulli variables $M_{ij}$, which can be approximated by a Gaussian via moment matching. The mean and variance of the units $u$ are: $E_M[u] = pWv$ and $V_M[u] = p(1-p)(W \star W)(v \star v)$. We can then draw samples from this Gaussian and pass them through the activation function $a()$ before averaging them and presenting them to the next layer. Algorithm 2 summarizes the method. Note that the sampling can be done efficiently, since the samples for each unit and example can be drawn in parallel. This scheme is only an approximation in the case of a multi-layer network, but it works well in practice as shown in the experiments.

Algorithm 2 Inference with DropConnect
  Input: example $x$, parameters $\theta$, # of samples $Z$
  Output: prediction $u$
  Extract features: $v \leftarrow g(x; W_g)$
  Moment matching of $u$: $\mu \leftarrow E_M[u]$, $\sigma^2 \leftarrow V_M[u]$
  for $z = 1:Z$ do  %% Draw Z samples
    for $i = 1:d$ do  %% Loop over units in r
      Sample from 1D Gaussian: $u_{i,z} \sim \mathcal{N}(\mu_i, \sigma_i^2)$
      $r_{i,z} \leftarrow a(u_{i,z})$
    end for
  end for
  Pass result $\hat{r} = \sum_{z=1}^{Z} r_z / Z$ to the next layer

² Consider $u \sim \mathcal{N}(0, 1)$ with $a(u) = \max(u, 0)$. Then $a(E_M(u)) = 0$ but $E_M(a(u)) = 1/\sqrt{2\pi} \approx 0.4$.
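A NumPy sketch of the sampling-based inference of Algorithm 2; the relu activation, the sample count Z = 1000, and the function name are assumptions for illustration only:

import numpy as np

def dropconnect_inference(W, v, p=0.5, Z=1000, rng=np.random.default_rng(0)):
    """Gaussian moment-matching approximation of a DropConnect layer at test time."""
    mu = p * (W @ v)                              # E_M[u] = p W v
    var = p * (1 - p) * ((W * W) @ (v * v))       # V_M[u] = p(1-p)(W * W)(v * v)
    # Draw Z samples per unit in parallel, apply a() = relu, then average
    u_samples = rng.normal(mu, np.sqrt(var), size=(Z, mu.shape[0]))
    r_hat = np.maximum(u_samples, 0.0).mean(axis=0)
    return r_hat                                  # passed on to the softmax layer

Setting Z = 1 gives a noisy single-sample prediction; larger Z approaches the expectation in Eqn. 4 under the Gaussian approximation.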
samples at the inputs to the activation function of the fully connected layer and average their activations.

To anneal the initial learning rate we choose a fixed multiplier for different stages of training. We report three numbers of epochs, such as 600-400-200, to define our schedule. We multiply the initial rate by 1 for the first such number of epochs. Then we use a multiplier of 0.5 for the second number of epochs, followed by 0.1 again for this second number of epochs. The third number of epochs is used for multipliers of 0.05, 0.01, 0.005, and 0.001 in that order, after which point we report our results. We determine the epochs to use for our schedule using a validation set to look for plateaus in the loss function, at which point we move to the next multiplier.³

³ In all experiments the bias learning rate is 2× the learning rate for the weights. Additionally, weights are initialized with $N(0, 0.1)$ random values for fully connected layers and $N(0, 0.01)$ for convolutional layers.
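One way to read that schedule as code is sketched below; the function name and the exact stage boundaries are our interpretation of the description above (the schedule itself only specifies the multipliers and the three epoch counts):

def lr_multiplier(epoch, schedule=(600, 400, 200)):
    """Map an epoch index to the learning-rate multiplier for a schedule like 600-400-200."""
    e1, e2, e3 = schedule
    # stage lengths: e1 at 1.0, then e2 at 0.5, e2 again at 0.1,
    # then e3 epochs for each of the remaining small multipliers
    stages = [(e1, 1.0), (e2, 0.5), (e2, 0.1),
              (e3, 0.05), (e3, 0.01), (e3, 0.005), (e3, 0.001)]
    for length, mult in stages:
        if epoch < length:
            return mult
        epoch -= length
    return stages[-1][1]   # after the schedule ends, keep the final multiplier

# e.g. with the initial learning rate of 0.1 used in the first MNIST experiment
base_lr = 0.1
lr_at_epoch_650 = base_lr * lr_multiplier(650)   # second stage -> 0.1 * 0.5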
Once the 5 networks are trained we report two numbers: 1) the mean and standard deviation of the classification errors produced by each of the 5 independent networks, and 2) the classification error that results when averaging the output probabilities from the 5 networks before making a prediction. We find in practice this voting scheme, inspired by (Ciresan et al., 2012), provides significant performance gains, achieving state-of-the-art results in many standard benchmarks when combined with our DropConnect layer.
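The voting scheme amounts to averaging the class probabilities of the independently trained networks before taking the arg-max; a minimal sketch, where the predictor callables stand in for the 5 trained models:

import numpy as np

def vote(predictors, x):
    """Average class probabilities from several trained networks, then take the arg-max class.

    `predictors` is any sequence of callables mapping an input to a probability vector;
    in the setting above these would be the 5 independently trained networks.
    """
    probs = np.mean([predict(x) for predict in predictors], axis=0)
    return int(np.argmax(probs))

# toy usage with two stand-in "networks"
fake_nets = [lambda x: np.array([0.7, 0.2, 0.1]), lambda x: np.array([0.4, 0.5, 0.1])]
print(vote(fake_nets, x=None))   # -> 0, since the averaged probabilities peak at class 0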
6.1. MNIST

The MNIST handwritten digit classification task (LeCun et al., 1998) consists of 28×28 black and white images, each containing a digit 0 to 9 (10 classes). Each digit in the 60,000 training images and 10,000 test images is normalized to fit in a 20×20 pixel box while preserving its aspect ratio. We scale the pixel values to the [0, 1] range before inputting them to our models.

For our first experiment on this dataset, we train models with two fully connected layers, each with 800 output units, using either tanh, sigmoid or relu activation functions to compare to Dropout in (Hinton et al., 2012). The first layer takes the image pixels as input, while the second layer's output is fed into a 10-class softmax classification layer. In Table 2 we show the performance of the various activation functions, comparing No-Drop, Dropout and DropConnect in the fully connected layers. No data augmentation is utilized in this experiment. We use an initial learning rate of 0.1 and train for 600-400-20 epochs using our schedule.

neuron | model | error (%), 5 networks | voting error (%)
relu | No-Drop | 1.62 ± 0.037 | 1.40
relu | Dropout | 1.28 ± 0.040 | 1.20
relu | DropConnect | 1.20 ± 0.034 | 1.12
sigmoid | No-Drop | 1.78 ± 0.037 | 1.74
sigmoid | Dropout | 1.38 ± 0.039 | 1.36
sigmoid | DropConnect | 1.55 ± 0.046 | 1.48
tanh | No-Drop | 1.65 ± 0.026 | 1.49
tanh | Dropout | 1.58 ± 0.053 | 1.55
tanh | DropConnect | 1.36 ± 0.054 | 1.35

Table 2. MNIST classification error rate for models with two fully connected layers of 800 neurons each. No data augmentation is used in this experiment.

From Table 2 we can see that both Dropout and DropConnect perform better than not using either method. DropConnect mostly performs better than Dropout in this task, with the gap widening when utilizing the voting over the 5 models.

To further analyze the effects of DropConnect, we show three explanatory experiments in Fig. 2 using a 2-layer fully connected model on MNIST digits. Fig. 2a shows test performance as the number of hidden units in each layer varies. As the model size increases, No-Drop overfits while both Dropout and DropConnect improve performance. DropConnect consistently gives a lower error rate than Dropout. Fig. 2b shows the effect of varying the drop rate p for Dropout and DropConnect for a 400-400 unit network. Both methods give optimal performance in the vicinity of 0.5, the value used in all other experiments in the paper. Our sampling approach gives a performance gain over mean inference (as used by Hinton (Hinton et al., 2012)), but only for the DropConnect case. In Fig. 2c we plot the convergence properties of the three methods throughout training on a 400-400 network. We can see that No-Drop overfits quickly, while Dropout and DropConnect converge slowly to ultimately give superior test performance. DropConnect is even slower to converge than Dropout, but yields a lower test error in the end.

In order to improve our classification result, we choose a more powerful feature extractor network described in (Ciresan et al., 2012) (relu is used rather than tanh). This feature extractor consists of a 2-layer CNN with 32-64 feature maps in each layer respectively. The last layer's output is treated as input to the fully connected layer, which has 150 relu units on which No-Drop, Dropout or DropConnect are applied. We report results in Table 3 from training the network on a) the original MNIST digits, b) cropped 24×24 images from random locations, and c) rotated and scaled versions of these cropped images.
[Figure 2: three panels plotting (a) % test error vs. number of hidden units for No-Drop, Dropout and DropConnect; (b) % test error vs. % of elements kept, comparing Dropout and DropConnect under mean and sampling inference; (c) train and test cross entropy vs. epoch for No-Drop, Dropout and DropConnect.]

Figure 2. Using the MNIST dataset, in a) we analyze the ability of Dropout and DropConnect to prevent overfitting as the size of the 2 fully connected layers increases. b) Varying the drop-rate in a 400-400 network shows near optimal performance around the p = 0.5 proposed by (Hinton et al., 2012). c) we show the convergence properties of the train/test sets. See text for discussion.
We use an initial learning rate of 0.01 with a 700-200-100 epoch schedule, no momentum, and preprocess by subtracting the image mean.

crop | rotation & scaling | model | error (%), 5 networks | voting error (%)
no | no | No-Drop | 0.77 ± 0.051 | 0.67
no | no | Dropout | 0.59 ± 0.039 | 0.52
no | no | DropConnect | 0.63 ± 0.035 | 0.57
yes | no | No-Drop | 0.50 ± 0.098 | 0.38
yes | no | Dropout | 0.39 ± 0.039 | 0.35
yes | no | DropConnect | 0.39 ± 0.047 | 0.32
yes | yes | No-Drop | 0.30 ± 0.035 | 0.21
yes | yes | Dropout | 0.28 ± 0.016 | 0.27
yes | yes | DropConnect | 0.28 ± 0.032 | 0.21

Table 3. MNIST classification error. Previous state of the art is 0.47% (Zeiler and Fergus, 2013) for a single model without elastic distortions and 0.23% with elastic distortions and voting (Ciresan et al., 2012).

We note that our approach surpasses the state-of-the-art result of 0.23% (Ciresan et al., 2012), achieving a 0.21% error rate, without the use of elastic distortions (as used by (Ciresan et al., 2012)).
6.2. CIFAR-10

CIFAR-10 is a data set of natural 32×32 RGB images (Krizhevsky, 2009) in 10 classes, with 50,000 images for training and 10,000 for testing. Before inputting these images to our network, we subtract the per-pixel mean computed over the training set from each image.
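A small sketch of that preprocessing step, assuming the images are held in NumPy arrays of shape (N, 32, 32, 3); the function name is illustrative:

import numpy as np

def per_pixel_mean_normalize(train_images, test_images):
    """Subtract the per-pixel mean of the training set from every image."""
    mean = train_images.mean(axis=0)        # shape (32, 32, 3): one mean per pixel location and channel
    return train_images - mean, test_images - mean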
The first experiment on CIFAR-10 (summarized in Table 4) uses the simple convolutional network feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) that is designed for rapid training rather than optimal performance. On top of the 3-layer feature extractor we have a 64 unit fully connected layer which uses No-Drop, Dropout, or DropConnect. No data augmentation is utilized for this experiment. Since this experiment is not aimed at optimal performance we report a single model's performance without voting. We train for 150-0-0 epochs with an initial learning rate of 0.001 and their default weight decay. DropConnect prevents overfitting of the fully connected layer better than Dropout in this experiment.

model | error (%)
No-Drop | 23.5
Dropout | 19.7
DropConnect | 18.7

Table 4. CIFAR-10 classification error using the simple feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) and with no data augmentation.

Table 5 shows classification results of the network using a larger feature extractor with 2 convolutional layers and 2 locally connected layers as described in (Krizhevsky, 2012) (layers-conv-local-11pct.cfg). A 128 neuron fully connected layer with relu activations is added between the softmax layer and feature extractor. Following (Krizhevsky, 2012), images are cropped to 24×24 with horizontal flips, and no rotation or scaling is performed. We use an initial learning rate of 0.001 and train for 700-300-50 epochs with their default weight decay. Model voting significantly improves performance when using Dropout or DropConnect, the latter reaching an error rate of 9.41%. Additionally, we trained a model with 12 networks with DropConnect and achieved a state-of-the-art result of 9.32%, indicating the power of our approach.

6.3. SVHN

The Street View House Numbers (SVHN) dataset includes 604,388 images (both training set and extra set) and 26,032 testing images (Netzer et al., 2011). Similar to MNIST, the goal is to classify the digit centered in each 32×32 RGB image. Due to the large variety of colors and brightness variations in the images, we pre-
Definition 2 (Logistic Loss). The following loss function defined on $k$-class classification is called the logistic loss function:
$$A_y(o) = -\sum_i y_i \ln \frac{\exp(o_i)}{\sum_j \exp(o_j)} = -o_i + \ln \sum_j \exp(o_j)$$
where $y$ is a binary vector with the $i$-th bit set on.

Lemma 1. The logistic loss function $A$ has the following properties: 1) $A_y(0) = \ln k$, 2) $-1 \le A'_y(o) \le 1$, and 3) $A''_y(o) \ge 0$.
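As a quick check of Lemma 1 (a short derivation added here for readability, not part of the original argument), the first two properties follow directly from the one-hot form of the loss; $i$ denotes the true class and $p_m$ the softmax probability of class $m$:

% Property 1: at o = 0 every class is equally likely
A_y(0) = -0 + \ln \sum_{j=1}^{k} e^{0} = \ln k .

% Property 2: the gradient is a difference of probabilities and labels
\frac{\partial A_y(o)}{\partial o_m}
  = -\mathbf{1}[m = i] + \frac{e^{o_m}}{\sum_j e^{o_j}}
  = p_m - y_m \in [-1, 1],
\quad \text{since } 0 \le p_m \le 1 \text{ and } y_m \in \{0, 1\}.

Property 3 follows from the convexity of the log-sum-exp function.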
Definition 3 (Rademacher complexity). For a sample $S = \{x_1, \ldots, x_\ell\}$ generated by a distribution $D$ on a set $X$ and a real-valued function class $F$ in domain $X$, the empirical Rademacher complexity of $F$ is the random variable:
$$\hat{R}_\ell(F) = E_\sigma\left[\sup_{f \in F} \left|\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i f(x_i)\right| \;\middle|\; x_1, \ldots, x_\ell\right]$$
where $\sigma = \{\sigma_1, \ldots, \sigma_\ell\}$ are independent uniform $\{\pm 1\}$-valued (Rademacher) random variables. The Rademacher complexity of $F$ is $R_\ell(F) = E_S\left[\hat{R}_\ell(F)\right]$.

8.2. Bound Derivation

Lemma 2 ((Ledoux and Talagrand, 1991)). Let $F$ be a class of real functions and $H = [F_j]_{j=1}^{k}$ be a $k$-dimensional function class. If $A: \mathbb{R}^k \to \mathbb{R}$ is a Lipschitz function with constant $L$ and satisfies $A(0) = 0$, then $\hat{R}_\ell(A \circ H) \le 2kL\hat{R}_\ell(F)$.

Lemma 3 (Classifier Generalization Bound). The generalization bound of a $k$-class classifier with logistic loss function is directly related to the Rademacher complexity of that classifier:
$$E[A_y(o)] \le \frac{1}{\ell}\sum_{i=1}^{\ell} A_{y_i}(o_i) + 2k\hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$$
Lemma 4. For all neuron activations sigmoid, tanh and relu, we have: $\hat{R}_\ell(a \circ F) \le 2\hat{R}_\ell(F)$.

Lemma 5 (Network Layer Bound). Let $G$ be the class of real functions $\mathbb{R}^d \to \mathbb{R}$ with input dimension $F$, i.e. $G = [F_j]_{j=1}^{d}$, and let $H_B$ be a linear transform function parametrized by $W$ with $\|W\|_2 \le B$. Then $\hat{R}_\ell(H \circ G) \le \sqrt{d}B\hat{R}_\ell(F)$.

Proof.
$$\hat{R}_\ell(H \circ G) = E_\sigma\left[\sup_{h \in H, g \in G} \left|\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i\, h \circ g(x_i)\right|\right] = E_\sigma\left[\sup_{g \in G, \|W\| \le B} \left\langle W, \frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i g(x_i) \right\rangle\right]$$
$$\le B\, E_\sigma\left[\sup_{f^j \in F} \left\|\left[\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i f^j(x_i)\right]_{j=1}^{d}\right\|\right] = B\sqrt{d}\, E_\sigma\left[\sup_{f \in F} \frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i f(x_i)\right] = \sqrt{d}B\hat{R}_\ell(F)$$

Remark 1. Given a layer in our network, we denote the function of all layers before it as $G = [F_j]_{j=1}^{d}$. This layer has the linear transformation function $H$ and activation function $a$. By Lemma 4 and Lemma 5, we know the network complexity is bounded by:
$$\hat{R}_\ell(H \circ G) \le c\sqrt{d}B\hat{R}_\ell(F)$$
where $c = 1$ for the identity neuron and $c = 2$ for others.

Lemma 6. Let $F_M$ be the class of real functions that depend on $M$. Then $\hat{R}_\ell(E_M[F_M]) \le E_M\left[\hat{R}_\ell(F_M)\right]$.

Proof. $\hat{R}_\ell(E_M[F_M]) = \hat{R}_\ell\left(\sum_M p(m) F_M\right) \le \sum_M \hat{R}_\ell(p(m) F_M) \le \sum_M |p(m)|\, \hat{R}_\ell(F_M) = E_M\left[\hat{R}_\ell(F_M)\right]$

Theorem 1 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let $\hat{R}_\ell(G)$ be the empirical Rademacher complexity of the feature extractor and $\hat{R}_\ell(F)$ be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameter of the DropConnect layer satisfies $|W| \le B_h$;

2. the weight parameter of $s$ satisfies $|W_s| \le B_s$ (its L2-norm is bounded by $\sqrt{dk}B_s$).

Then we have: $\hat{R}_\ell(F) \le p\, 2\sqrt{kd}B_s\, n\sqrt{d}B_h\, \hat{R}_\ell(G)$.

Proof.
$$\hat{R}_\ell(F) = \hat{R}_\ell\left(E_M[f(x; \theta, M)]\right) \le E_M\left[\hat{R}_\ell(f(x; \theta, M))\right] \qquad (6)$$
$$\le \left(\sqrt{dk}B_s\right)\sqrt{d}\; E_M\left[\hat{R}_\ell(a \circ h_M \circ g)\right] \qquad (7)$$
$$= 2\sqrt{kd}B_s\; E_M\left[\hat{R}_\ell(h_M \circ g)\right] \qquad (8)$$
where $h_M = (M \star W)v$. Equation (6) is based on Lemma 6, Equation (7) is based on Lemma 5 and Equation (8) follows from Lemma 4.
$$E_M\left[\hat{R}_\ell(h_M \circ g)\right] = E_{M,\sigma}\left[\sup_{h \in H, g \in G} \frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i W^T D_M g(x_i)\right] \qquad (9)$$
$$= E_{M,\sigma}\left[\sup_{h \in H, g \in G} \left\langle D_M W, \frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i g(x_i)\right\rangle\right]$$
$$\le E_M\left[\max_W \|D_M W\|\right] E_\sigma\left[\sup_{g^j \in G} \left\|\left[\frac{2}{\ell}\sum_{i=1}^{\ell} \sigma_i g^j(x_i)\right]_{j=1}^{n}\right\|\right] \qquad (10)$$
$$\le B_h\, p\sqrt{nd}\,\sqrt{n}\,\hat{R}_\ell(G) = p\, n\sqrt{d}B_h\, \hat{R}_\ell(G)$$
where $D_M$ in Equation (9) is a diagonal matrix with diagonal elements equal to $m$, and inner product properties lead to Equation (10). Thus, we have: $\hat{R}_\ell(F) \le p\, 2\sqrt{kd}B_s\, n\sqrt{d}B_h\, \hat{R}_\ell(G)$.