Convolutional Neural Networks With Alternately Updated Clique

Abstract

Improving information flow in deep networks helps to ease the training difficulties and utilize parameters more efficiently. Here we propose a new convolutional neural network architecture with alternately updated clique (CliqueNet). In contrast to prior networks, there are both forward and backward connections between any two layers in the same block. The layers are constructed as a loop and are updated alternately. The CliqueNet has some unique properties. Each layer is both the input and output of any other layer in the same block, so that the information flow among layers is maximized. During propagation, the newly updated layers are concatenated to re-update the previously updated layers, and the parameters are reused multiple times. This recurrent feedback structure is able to bring higher-level visual information back to refine low-level filters and achieve spatial attention. We analyze the features generated at different stages and observe that using refined features leads to a better result. We adopt a multi-scale feature strategy that effectively avoids the progressive growth of parameters. Experiments on image recognition datasets including CIFAR-10, CIFAR-100, SVHN and ImageNet show that our proposed models achieve state-of-the-art performance with fewer parameters.¹

∗ Corresponding author
¹ Code address: http://github.com/iboing/CliqueNet

Figure 1. An illustration of a block with 4 layers. Any layer is both the input and output of another one. Node 0 denotes the input layer of this block.

1. Introduction

In recent years, the structure and topology of deep neural networks have attracted significant research interest, since convolutional neural network (CNN) based models have achieved huge success in a wide range of computer vision tasks. A notable trend of these CNN architectures is that the layers are going deeper, from AlexNet [23] with 5 convolutional layers, the VGG network and GoogLeNet with 19 and 22 layers, respectively [32, 36], to recent ResNets [13] whose deepest model has more than one thousand layers. However, inappropriately designed deep networks make it hard for later layers to access the gradient information from previous layers, which may cause gradient vanishing and parameter redundancy problems [17, 18].

Successfully adopted in ResNet [13] and Highway Network [34], the skip connection is an efficient way to make top layers accessible to the information from bottom layers and to ease the network training at the same time, due to its relief of the gradient vanishing problem. The residual block structure in ResNet [13] also inspired a series of ResNet variations, including ResNeXt [40], WRN [41], PolyNet [44], etc. To further activate the gradient and information flow in networks, DenseNet [17] is a newly proposed structure, where any layer in a block is the output of all preceding layers and the input of all subsequent layers.
Recent studies show that the skip connection mechanism can be extrapolated as a recurrent neural network (RNN) or LSTM [14] when weights are shared among different layers [27, 5, 21]. In this way, the deep residual network is treated as a long sequence and hidden units are linked by skip connections. While this recurrent structure benefits feature re-usage and iterative learning, the residual information is restricted to neighboring layers and cannot be considered across multiple layers, because the recurrence only happens once at each single layer.

The attention mechanism is another focus of recent studies on network structure [39, 37, 1, 28] and applications [3, 29, 24, 8]. When people watch a picture or a scene, the information on the target is better captured if we re-look at or re-think the target with additional attention. In cognition theory, the activity of a neuron in the visual cortex is influenced by the responses of other cortical areas transferred through feedback connections [19, 15]. This motivates the introduction of feedback into deep networks [35, 42]. The feedback connections that bring back higher-level semantic information in a top-down manner are able to re-weight the focus and suppress the non-relevant neuron activations of background and noise.

Inspired by the recurrent structure and attention mechanism, in this study we propose a new convolutional neural network architecture with alternately updated clique (CliqueNet). In contrast to prior network structures, there are both forward and feedback connections between any two layers in the same block. As illustrated in Figure 1, the layers in a Clique Block are constructed as a clique and are updated alternately. Concretely, several previous layers are concatenated to update the next layer, after which the newly updated layer is concatenated to re-update the previous layers, so that information flow and the feedback mechanism are maximized. Each layer in a block is both the input and output of another one, which means the layers are more densely connected than in DenseNets [17]. We adopt a multi-scale feature strategy to compose the final representation with the block features at different map sizes.

The CliqueNet architecture has some unique properties. Intuition suggests that our proposal is parameter-demanding, because given a block with n layers, DenseNet [17] needs C_n^2 groups of parameters, while ours needs A_n^2 (C and A denote the combination and permutation operators, respectively). However, the filters in DenseNet increase linearly as the depth rises [5], which may lead to a rapid growth of parameters. In our architecture, only the Stage-II feature in each block is fed into the next block. It turns out that this is a more parameter-efficient way. In addition, traditional neural networks add a new layer together with its corresponding parameters. As for CliqueNet, the weights among layers in a block keep recycling during propagation. The layers can be updated alternately multiple times so that a deeper representation space is attained with a fixed number of parameters.
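For instance, with n = 5 layers per block (the setting of Table 1 below), the counts are C_5^2 = (5 · 4)/2 = 10 weight groups for a Dense Block versus A_5^2 = 5 · 4 = 20 for a Clique Block; although the clique owns twice as many weight groups at equal depth, only the Stage-II feature is passed on and the number of filters per layer stays fixed, so the overall parameter count remains comparable (see Table 4).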
CliqueNet also shows a strong ability for representation learning due to the combination of the recurrent structure and the feedback mechanism. In each Clique Block, both the forward and feedback connections are densely connected. The information flow is maximized and feature maps are repeatedly refined by attention. We show that our network architecture can suppress the activations of background and noise, and achieve competitive results without resorting to data augmentation. The contributions of this study are listed as follows:

• We propose a new convolutional neural network architecture called CliqueNet, which incorporates both forward and backward connections between any two layers in the same block. The layers, constructed as a loop, are updated alternately. The CliqueNet, which combines both a recurrent structure and an attention mechanism, is able to maximize information flow and achieve feature refinement. We show that the refined features are more discriminative and lead to a better performance.

• We adopt a multi-scale feature strategy that effectively circumvents the progressive increment of parameters, despite the extra feedback connections.

• We conduct experiments on four benchmark datasets including CIFAR-10, CIFAR-100, SVHN and ImageNet to demonstrate the superiority of our models.

2. Related Work

A number of deep networks with large model capacity have been proposed. For widening the network, the Inception modules in GoogLeNet [36] fuse features of different map sizes to construct a multi-scale representation. Multi-column nets [6] and Deeply-Fused Nets [38] also use a fusion strategy and have a wide network structure. Wide residual networks [41] increase the width and decrease the depth to improve the performance, while FractalNet [25] deepens and widens at the same time. However, simply widening the network tends to consume more runtime and memory [44]. For deepening the networks, skip connections or shortcut paths are widely adopted strategies to ease the network training [13, 34]. In [18], it is shown that some of the layers in ResNets are dispensable and cause parameter redundancy, so the authors randomly drop a subset of layers to ease the training and achieve a better performance. To further increase information flow, DenseNets [17] replace the identity mapping in the residual block by a concatenating operation, so that new feature learning can be reinforced while keeping old feature re-usage. In line with this view, dual path networks (DPN) [5] are proposed to combine the advantages of both the residual path and the densely connected path.

Both the residual path and the densely connected path correspond to a recurrent propagation, and their success has been attributed to the recurrent structure and iterative refinement [27, 11, 21]. Studies incorporating recurrent connections into CNNs also show superiority in object recognition [26], scene parsing [31] and some other tasks. CliqueNet differs from these structures in that the iterative mechanism
exists in each step of the propagation, instead of just between neighboring layers or from the top layer to the bottom layer; all layers in a block participate in the recurrent loop so that the filters are communicated sufficiently and the blocks play the roles of both information carrier and refiner.

Recent studies have embraced the attention mechanism as an effective technique to strengthen the neurons that feature the target and thus improve the performance. It has proved fruitful in many applications, including image recognition [37, 8], image captioning [3], image-text matching [29], and saliency detection [24]. In general, visual attention can be achieved by formulating an optimization problem [1], weighting the activations spatially or channel-wise [3, 16], and introducing feedback connections [39, 35, 42]. In [42], the model makes consecutive decisions for a more accurate prediction via feedback connections; the input of the next decision is based on the output of the last decision. Experiments show that the top-down propagation is capable of refining lower-level features and improving classification performance [35], especially on datasets with noise and occlusion [39, 28]. But how to design a proper attention mechanism and boost the supervision between layers requires further exploration.

There are also some studies that design attention mechanisms tied with recurrent neural networks [28, 24, 8]. A recent report [2] tries to propose a loopy net, but it just repeats the skip connections and does not make layers communicate. The loopy inference adopted in [4, 45] shares a similar motivation with our work. However, they do not incorporate feedback connections, which are important for feature refinement. CliqueNet enables true cycling because of the alternate propagation. Although alternate updating has been an important method in optimization theory [9], it has not been introduced into deep learning. To the best of our knowledge, we are the first to use updated layers to re-update previous layers alternately, with these layers constructing a loop that cycles for multiple times.

3. CliqueNet Architecture

The CliqueNet architecture has two main ingredients: the block with alternately updated clique (Clique Block) to enable feature refinement, and the multi-scale feature strategy that facilitates parameter efficiency.

Figure 2. A CliqueNet with three blocks. The input layer together with the Stage-II feature in each block are concatenated to be the block feature, and form part of the final representation after global pooling. The Stage-II feature passes through transition layers, which include a convolution and an average pooling to change map sizes, and then becomes the input of the next block.

3.1. Clique Block

In order to maximize the information flow among layers, we design the Clique Block. Any two layers in the same block are connected bidirectionally except for the input node. Compared with the Dense Block [17], where each layer is the output of all previous layers and the input of all subsequent layers, the Clique Block makes each layer both the input and output of any other layer. The propagation of a Clique Block with 5 layers is illustrated in Table 1. At the first stage, the input layer (X_0) initializes all layers in this block by single directional connections. Each updated layer is concatenated to update the next layer. From the second stage, the layers begin updating alternately. All layers except the top layer to be updated are concatenated as the bottom layer, and their corresponding parameters are also concatenated. Accordingly, the i-th (i ≥ 1) layer in the k-th (k ≥ 2) loop can be formulated as:

X_i^{(k)} = g\left( \sum_{l<i} W_{li} * X_l^{(k)} + \sum_{m>i} W_{mi} * X_m^{(k-1)} \right)    (1)

where * denotes the convolution operation with parameters W, and g is the non-linear activation function. W_{ij} is reused in different stages. Each layer always receives feedback information from the layers that are updated more recently. This achieves a spatial attention mechanism due to the top-down refinement brought by each propagation. The recurrent feedback structure ensures that the communication is maximized among all layers in the block.
Bottom Layers | Weights | Top Layer | Feature
X_0 | {W_01} | X_1^(1) | Stage-I
{X_0, X_1^(1)} | {W_02, W_12} | X_2^(1) | Stage-I
{X_0, X_1^(1), X_2^(1)} | {W_03, W_13, W_23} | X_3^(1) | Stage-I
{X_0, X_1^(1), X_2^(1), X_3^(1)} | {W_04, W_14, W_24, W_34} | X_4^(1) | Stage-I
{X_0, X_1^(1), X_2^(1), X_3^(1), X_4^(1)} | {W_05, W_15, W_25, W_35, W_45} | X_5^(1) | Stage-I
{X_2^(1), X_3^(1), X_4^(1), X_5^(1)} | {W_21, W_31, W_41, W_51} | X_1^(2) | Stage-II
{X_3^(1), X_4^(1), X_5^(1), X_1^(2)} | {W_32, W_42, W_52, W_12} | X_2^(2) | Stage-II
{X_4^(1), X_5^(1), X_1^(2), X_2^(2)} | {W_43, W_53, W_13, W_23} | X_3^(2) | Stage-II
{X_5^(1), X_1^(2), X_2^(2), X_3^(2)} | {W_54, W_14, W_24, W_34} | X_4^(2) | Stage-II
{X_1^(2), X_2^(2), X_3^(2), X_4^(2)} | {W_15, W_25, W_35, W_45} | X_5^(2) | Stage-II
···

Table 1. A diagram of CliqueNet's propagation in a block with 5 layers. W_ij denotes the weight parameters from X_i to X_j and is reused across stages. "{}" denotes the concatenation operator. The Stage-II feature is transited as the input layer (X_0) of the next block.
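To make the two-stage propagation of Eq. (1) and Table 1 concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: it uses a plain ReLU for g, omits batch normalization, and replaces the concatenation-plus-single-convolution form with the equivalent sum of per-pair convolutions.

import torch
import torch.nn as nn

class CliqueBlock(nn.Module):
    """Minimal sketch of an alternately updated Clique Block (Eq. 1 / Table 1)."""

    def __init__(self, in_channels, num_layers=5, filters=36):
        super().__init__()
        self.num_layers = num_layers
        # W_{0j}: from the input node X_0 to every layer (used in Stage-I only).
        self.from_input = nn.ModuleList(
            [nn.Conv2d(in_channels, filters, kernel_size=3, padding=1)
             for _ in range(num_layers)]
        )
        # W_{ij}, i != j, i,j >= 1: one convolution per ordered pair, reused in every stage.
        self.pairwise = nn.ModuleDict({
            f"{i}_{j}": nn.Conv2d(filters, filters, kernel_size=3, padding=1)
            for i in range(1, num_layers + 1)
            for j in range(1, num_layers + 1) if i != j
        })
        self.act = nn.ReLU(inplace=True)

    def forward(self, x0, num_stages=2):
        # Stage-I: X_0 plus the already updated layers initialize X_1 .. X_n.
        feats = {}
        for j in range(1, self.num_layers + 1):
            pre = self.from_input[j - 1](x0)
            for i in range(1, j):
                pre = pre + self.pairwise[f"{i}_{j}"](feats[i])
            feats[j] = self.act(pre)
        # Later stages: each layer is re-updated from the most recent version of all
        # other layers, reusing the same W_{ij}; the input node is no longer involved.
        for _ in range(num_stages - 1):
            for j in range(1, self.num_layers + 1):
                pre = 0
                for i in range(1, self.num_layers + 1):
                    if i != j:
                        pre = pre + self.pairwise[f"{i}_{j}"](feats[i])
                feats[j] = self.act(pre)
        # Stage-II feature of the block: concatenation of the refined layers.
        return torch.cat([feats[j] for j in range(1, self.num_layers + 1)], dim=1)

# Usage: block = CliqueBlock(in_channels=64, num_layers=5, filters=36)
#        y = block(torch.randn(2, 64, 32, 32))   # y has shape (2, 5 * 36, 32, 32)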
3.2. Feature at Different Stages

We analyze the features produced at different stages, and adopt a multi-scale feature strategy to avoid the rapid increment of parameters.

The first stage is used to initialize all layers in the block, and the layers are refined repeatedly from the second stage on. Given that the Stage-II feature is refined with attention and assimilates more high-level visual information, we concatenate the Stage-II feature together with the input layer of each block as the block feature, which is then fed to the loss function after global pooling. Only the Stage-II feature is fed into the next block as its input layer X_0; see Figure 2. In this way, the final representation is characterized by multi-scale feature maps, and the dimensionality in each block does not increase progressively. Because higher-stage propagation comes with more computational cost and amplifies the model complexity, we only consider the first two stages.

For the purpose of analyzing the features generated at different stages, we conduct experiments on the CIFAR-10 dataset (with no data augmentation) using different versions of CliqueNets. As Table 2 shows, the CliqueNet (I+I) only considers the Stage-I feature. The CliqueNet (I+II) uses the Stage-I feature and the input layer as the block feature fed to the loss function, but transits the Stage-II feature into the next block. The CliqueNet (II+II) adopts our aforementioned strategy. They all have 3 blocks with 5 layers in each block, and each layer contains 36 filters. The experimental settings follow [17]. The main results are shown in Figure 3. It is found that the introduction of the Stage-II feature indeed leads to a better result by a significant margin. We adopt the CliqueNet (II+II) structure for the following experiments.

Figure 3. Training and testing curves of different versions of CliqueNets. The learning rate is divided by 10 at epoch 150 and 225.
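The multi-scale readout described above can be sketched as follows; this is a simplified illustration in which CliqueBlock refers to the sketch after Table 1, the transition is a bare 1 × 1 convolution with 2 × 2 average pooling (without the optional attentional transition of Section 3.3), and all module names are assumptions.

import torch
import torch.nn as nn

class MultiScaleCliqueNet(nn.Module):
    """Sketch: concatenate (input layer, Stage-II feature) of every block,
    globally pool each block feature, and classify on the concatenation."""

    def __init__(self, num_classes=10, num_layers=5, filters=36):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # X_0 of block 1
        channels_in = [64, num_layers * filters, num_layers * filters]
        # CliqueBlock is the sketch from Section 3.1.
        self.blocks = nn.ModuleList(
            [CliqueBlock(c, num_layers, filters) for c in channels_in]
        )
        # Transition: 1x1 convolution + 2x2 average pooling on the Stage-II feature.
        self.transitions = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(num_layers * filters, num_layers * filters, kernel_size=1),
                nn.AvgPool2d(2),
            )
            for _ in range(2)
        ])
        block_feature_dims = [c + num_layers * filters for c in channels_in]
        self.classifier = nn.Linear(sum(block_feature_dims), num_classes)

    def forward(self, x):
        x0 = self.stem(x)
        pooled = []
        for idx, block in enumerate(self.blocks):
            stage2 = block(x0)                            # Stage-II feature
            block_feat = torch.cat([x0, stage2], dim=1)   # input layer + Stage-II
            pooled.append(block_feat.mean(dim=(2, 3)))    # global average pooling
            if idx < len(self.transitions):
                x0 = self.transitions[idx](stage2)        # only Stage-II moves on
        return self.classifier(torch.cat(pooled, dim=1))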
3.3. Extra Techniques

In addition to the structures mentioned above, we consider some techniques to help strengthen the model and improve the state of the art. In the experimental section, we conduct experiments with and without these additional techniques to show the effectiveness of our model.

Attentional transition. The CliqueNet includes feedback connections to refine lower-level activations using higher-level visual information. This attention mechanism weights the feature maps spatially to weaken the noise and background. Channel-wise attention, adopted in [3, 37, 16], also benefits recognition problems because it recalibrates
different filters to prevent overfitting and inspire new feature learning. In CliqueNet, we incorporate a channel-wise attention mechanism in the transition layers, following the method proposed in [16]. As depicted in Figure 4, the filters are globally averaged after the convolution in the transition. They are followed by two fully connected (FC) layers. The first FC layer has half the number of filters and is activated by the Relu function. The second FC layer has the same number of filters and is activated by the Sigmoid function, so that the activation is scaled into [0, 1] and acts on the input layer by filter-wise multiplication. Different from [16], which sets this module at each residual layer, we only add it to the transition layers in order to adjust the filters passed into the next block.

Figure 4. A schema for attentional transition. The transition layer consists of convolution and pooling. The filter-wise multiplication happens after the convolution and before the down pooling. W, H and C are the width, height and channels of the feature maps.
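A minimal PyTorch sketch of such an attentional transition as described above (a hedged illustration, not the authors' released code: the halving of filters in the first FC layer follows the text, while module and variable names are assumptions):

import torch
import torch.nn as nn

class AttentionalTransition(nn.Module):
    """Sketch: 1x1 convolution, channel-wise (filter-wise) attention, 2x2 average pooling."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.fc1 = nn.Linear(channels, channels // 2)   # half the filters, Relu
        self.fc2 = nn.Linear(channels // 2, channels)   # back to C filters, Sigmoid
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        x = self.conv(x)
        w = x.mean(dim=(2, 3))                # global average pooling: (N, C)
        w = torch.relu(self.fc1(w))
        w = torch.sigmoid(self.fc2(w))        # scaled into [0, 1]
        x = x * w.view(w.size(0), -1, 1, 1)   # filter-wise multiplication
        return self.pool(x)                   # down pooling into the next block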
Bottleneck and compression. Bottleneck is an effective way to decrease the number of parameters and provide further potential to enlarge the model capacity. It is conjectured [41] that the bottleneck architecture is suitable for deeper networks and large datasets like ImageNet, and recent studies have embraced bottleneck for a better performance [13, 17, 37, 5]. So we introduce bottleneck into our large models. The 3 × 3 convolution kernels in each block are replaced by 1 × 1 kernels that produce a middle layer, after which a 3 × 3 convolution layer follows to produce the top layer. The middle layer and the top layer contain the same number of feature maps. Compression is another tool, adopted in [17], to make the model more compact. Instead of compressing the number of filters in transition layers as they do, we only compress the features that are fed to the loss function, i.e. the Stage-II feature concatenated with its input layer. The models with compression have an extra convolutional layer with 1 × 1 kernel size before global pooling. It generates half the number of filters to enhance model compactness and keep the dimensionality of the final feature in a proper range.
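A sketch of these two tools, under the same simplifying assumptions as before (plain ReLU, no batch normalization, illustrative names):

import torch.nn as nn

def bottleneck_layer(in_channels, filters):
    # A 1x1 convolution produces a middle layer, then a 3x3 convolution produces the
    # top layer; both contain the same number of feature maps.
    return nn.Sequential(
        nn.Conv2d(in_channels, filters, kernel_size=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(filters, filters, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

def compression_layer(block_feature_channels):
    # Extra 1x1 convolution before global pooling, generating half the filters.
    return nn.Conv2d(block_feature_channels, block_feature_channels // 2, kernel_size=1)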
3.4. Implementation

In our experiments, we test our models on the benchmark datasets without the aforementioned extra techniques to show the effectiveness of CliqueNet, and further improve the state-of-the-art performance with them. There are two structure parameters: the total number of layers in all blocks, T, and the number of filters per layer, k. For our models without bottleneck, the convolution layers in each block have 3 × 3 kernels and are padded by one pixel to keep the feature maps the same size. Blocks are linked by transition layers, where a convolution layer with 1 × 1 kernel size is followed by 2 × 2 average pooling. All convolutions are performed in a unit composed of three consecutive operations: batch normalization [20], Relu, and the convolution. The Stage-II features with their input layers from all blocks are concatenated after global pooling, and end with a fully connected layer with softmax.
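The convolution unit above could be sketched as follows (a minimal illustration of the BN-Relu-Conv ordering described in this section; the default kernel size follows the 3 × 3 setting without bottleneck):

import torch.nn as nn

def conv_unit(in_channels, out_channels, kernel_size=3, padding=1):
    # Each convolution is wrapped as batch normalization -> Relu -> convolution.
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=padding),
    )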
For experiments on CIFAR and SVHN, there are three blocks in total, in which the feature map sizes are 32 × 32, 16 × 16, and 8 × 8, respectively. Before entering the first block, the input images pass through a 3 × 3 convolution with the output channels set to 64 as the input layer (X_0) of the first block. As for ImageNet, we use four blocks with bottleneck and compression, and compare our results with and without attentional transition. The initial transition has a 7 × 7 convolution with stride 2 and 3 × 3 max pooling with stride 2 on the 224 × 224 input images. Our four network structures on ImageNet are shown in Table 3.

Layer | S0 | S1 | S2 | S3
Convolution (112 × 112) | conv (7 × 7), 64, stride 2
Pooling (56 × 56) | max pool (3 × 3), stride 2
Block 1 (56 × 56) | 36 × 5 | 36 × 5 | 36 × 5 | 40 × 6
Transition | conv (1 × 1), avg pool (2 × 2)
Block 2 (28 × 28) | 64 × 6 | 80 × 6 | 80 × 5 | 80 × 6
Transition | conv (1 × 1), avg pool (2 × 2)
Block 3 (14 × 14) | 100 × 6 | 120 × 6 | 150 × 6 | 160 × 6
Transition | conv (1 × 1), avg pool (2 × 2)
Block 4 (7 × 7) | 80 × 6 | 100 × 6 | 120 × 6 | 160 × 6

Table 3. Structures on ImageNet. The first number in each block is the number of filters per layer, and the second denotes the number of layers in this block.
Model A B C FLOPs Params CIFAR-10 CIFAR-100 SVHN
Recurrent CNN [26] - - - - 1.86M 8.69 31.75 1.80
Stochastic Depth ResNet [18] - - - - 1.7M 11.66 37.8 1.75
dasNet [35] - - - - - 9.22 33.78 -
FractalNet [25] - - - - 38.6M 7.33 28.2 1.87
DenseNet (k = 12, T = 36) [17] - - - 0.53G 1.0M 7.00 27.55 1.79
DenseNet (k = 12, T = 96) [17] - - - 3.54G 7.0M 5.77 23.79 1.67
DenseNet (k = 24, T = 96) [17] - - - 13.78G 27.2M 5.83 23.42 1.59
CliqueNet (k = 36, T = 12) - - - 0.91G 0.94M 5.93 27.32 1.77
CliqueNet (k = 64, T = 15) - - - 4.21G 4.49M 5.12 23.98 1.62
CliqueNet (k = 80, T = 15) - - - 6.45G 6.94M 5.10 23.32 1.56
CliqueNet (k = 80, T = 18) - - - 9.45G 10.14M 5.06 23.14 1.51
DenseNet (k = 12, T = 96) [17] - X X 0.58G 0.8M 5.92 24.15 1.76
DenseNet (k = 24, T = 246) [17] - X X 10.84G 15.3M 5.19 19.64 1.74
CliqueNet (k = 36, T = 12) X - - 0.91G 0.98M 5.8 26.41 -
CliqueNet (k = 36, T = 12) - - X 0.98G 1.04M 5.69 26.45 -
CliqueNet (k = 36, T = 12) X - X 0.98G 1.08M 5.61 25.55 1.69
CliqueNet (k = 80, T = 15) X - X 6.88G 8M 5.17 22.78 1.53
CliqueNet (k = 150, T = 30) X X X 8.49G 10.02M 5.06 21.83 1.64
Table 4. Error rates (%) on CIFAR-10, CIFAR-100, and SVHN without any data augmentation. In CliqueNets and DenseNets, k is the number of filters per layer, and T is the total number of layers in three blocks. "A, B, C" represent attentional transition, bottleneck and compression, respectively; an "X" marks that the corresponding technique is used. The FLOPs of DenseNets are calculated by ourselves.
4. Experiments

We evaluate the CliqueNet on benchmark classification datasets, including CIFAR-10, CIFAR-100, SVHN and ImageNet, and compare our results with the state of the art.

4.1. Datasets and Training Details

CIFAR. The CIFAR-10 and CIFAR-100 datasets [22] both consist of 32 × 32 colored images. The CIFAR-10 dataset consists of 60,000 images in 10 classes, with 6,000 images in each class. There are 50,000 images for training and 10,000 images for testing. The CIFAR-100 dataset is similar to CIFAR-10 but has 100 classes, each of which contains 600 images. For data normalization, we preprocess the datasets by subtracting the mean and dividing by the standard deviation.

SVHN. The Street View House Number (SVHN) dataset [30] contains 32 × 32 colored images of house numbers cropped from Google Street View. There are 73,257 images in the training set, 26,032 in the testing set and 531,131 digits for additional training. Following the common practice [41, 18, 25, 17], we use all training samples without augmentation and divide images by 255 for normalization. We report the lowest error rate on the testing set.

ImageNet. We also conduct experiments on the ILSVRC 2012 dataset [7], which contains 1.2 million training images, 50,000 validation images, and 100,000 test images with 1,000 classes. Following [13, 17], we adopt the standard data augmentation for the training set. A 224 × 224 crop is randomly sampled from each image or its horizontal flip. The images are normalized into [0, 1] using mean values and standard deviations. We report the single-crop error rate on the validation set.

Training Details. For a fair comparison, we do not perform much hyper-parameter tuning, and most of our training strategies follow [13, 17]. We train our models using stochastic gradient descent (SGD) with 0.9 Nesterov momentum and 10^-4 weight decay. The parameters are initialized according to [12] and the weights of the fully connected layer use Xavier initialization [10]. For CIFAR and SVHN, we train for 300 epochs and 40 epochs, respectively, with a batch size of 64. The learning rate is set to 0.1 initially and is divided by 10 at 50% and 75% of the training procedure. In contrast to ImageNet, the experiments on CIFAR and SVHN do not use any data augmentation, and we add a dropout layer [33] with dropout rate 0.2 after each convolution layer, following [17]. For ImageNet, we train our models for 100 epochs and drop the learning rate by 0.1 at epochs 30, 60, and 90. Because we only have a server with 4 GPUs and are constrained by GPU memory, the batch size is 160 for our models on ImageNet, instead of 256 as in most studies.
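A sketch of this optimization setup for the CIFAR schedule (300 epochs, so the learning-rate drops land at epochs 150 and 225; the variable model stands for any CliqueNet variant and the helper name is an assumption):

import torch

def make_optimizer(model, epochs=300):
    # SGD with 0.9 Nesterov momentum and 1e-4 weight decay, initial learning rate 0.1.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9, nesterov=True, weight_decay=1e-4
    )
    # Divide the learning rate by 10 at 50% and 75% of training.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1
    )
    return optimizer, scheduler

# Usage: optimizer, scheduler = make_optimizer(model); call scheduler.step() once per epoch.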
4.2. Results on CIFAR and SVHN

Our experimental results on CIFAR and SVHN are shown in Table 4. The first part of the table includes some methods proposed before DenseNets and some other studies that also incorporate feedback connections or an attention mechanism. The second and third parts compare the CliqueNets with
DenseNets when both have no extra techniques. The last two parts show the situation with extra techniques. The best and the second best results are marked in red bold and bold, respectively.

Without extra techniques. The first three parts show that, when extra techniques are not considered, CliqueNets outperform most previous methods on CIFAR-10, CIFAR-100, and SVHN with significantly fewer parameters. Because the layers in CliqueNet are re-updated yet contribute features in each cycle, the depth of CliqueNet is much shallower than that of other models. For our smallest model, CliqueNet (36-12) (representing k = 36 and T = 12), each block contains 4 layers. It has the same number of filters per block, 144, as DenseNet (12-36), but reduces the error rate from 7% to 5.93% on CIFAR-10 with slightly fewer parameters than its counterpart DenseNet (12-36). Although the ResNet with stochastic depth [18] achieved a slightly better performance on SVHN with 1.7M parameters than CliqueNet (36-12), our model drops the error rate on CIFAR-10 and CIFAR-100 by a large margin. As the model capacity grows, we find that the performance of CliqueNets keeps improving without overfitting. Our model CliqueNet (80-15) already achieves the state of the art on the three datasets, and even outperforms the DenseNets that use extra techniques on CIFAR-10 and SVHN. It has only 6.94M parameters, which is a quarter of DenseNet (24-96) with 27.2M parameters, and a half of DenseNet (24-246) using bottleneck and compression with 15.3M parameters.

With extra techniques. The CliqueNets realize a spatial attention mechanism due to their recurrent feedback propagation. When armed with channel-wise attention, they achieve an improved performance. This is demonstrated by the CliqueNet (36-12) with attentional transition, which obtains a better result on CIFAR-10 and CIFAR-100 with slightly more parameters. Compression has a similar effect by making the model more compact. It is shown that the attentional transition is compatible with compression. The CliqueNet (36-12) with both attentional transition and compression leads to a better result than its original version and than the versions with only attentional transition or only compression. Compared with its counterpart DenseNet (12-36), it drops the error rate by 1.39% on CIFAR-10, 2% on CIFAR-100, and 0.1% on SVHN, with just 0.08M more parameters. The CliqueNet (80-15) with attentional transition and compression also improves on its original version, and raises the state of the art on SVHN to 1.53% with 8M parameters, while the previously best result of 1.59% on SVHN, achieved by DenseNet (24-96), requires three times more parameters. The bottleneck architecture is effective for saving parameters, and our largest model, CliqueNet (150-30) with bottleneck, further improves the performance on CIFAR-10 and CIFAR-100 while increasing the parameter and computation cost only moderately.

Model | Params | top-1 | top-5
ResNet-18 [13] | 11.7M | 30.43 | 10.76
CliqueNet-S0∗ | 5.7M | 27.52 | 8.98
ResNet-34 [13] | 21.8M | 26.73 | 8.74

Table 5. Single-crop error rates (%) on ImageNet. The ∗ indicates the models without attentional transition.

4.3. Results on ImageNet

Because we have limited computational resources and can only spread a batch among 4 GPUs, we use a batch size of 160 on ImageNet, instead of 256 as in most studies. Although a smaller batch size impairs the performance when training for the same number of epochs, the CliqueNets achieve results on ImageNet comparable with ResNets and DenseNets; see Table 5. This indicates that our proposed models can also be applied to large datasets.

The CliqueNet-S0∗ and CliqueNet-S1∗ outperform ResNet-18 and ResNet-34 with only half of their parameters. Larger models also achieve results on par with the state of the art performed by ResNets and DenseNets. When the attentional transition is considered, the CliqueNet contains both spatial attention and channel-wise attention, and has a better performance accordingly. The CliqueNet-S2 and CliqueNet-S3 both reduce the top-1 error rate by about 1% compared with their original versions, CliqueNet-S2∗ and CliqueNet-S3∗, which do not have attentional transition.

Figure 5. Visualization of the weights in the first block of pre-trained DenseNet (left) and CliqueNet (right), obtained by calculating the average absolute value of W_ij. Node 0 denotes the input layer of the block.
4.4. Further Discussion