
DeepCaps: Going Deeper with Capsule Networks

Jathushan Rajasegaran¹, Vinoj Jayasundara¹, Sandaru Jayasekara¹, Hirunima Jayasekara¹, Suranga Seneviratne², Ranga Rodrigo¹

¹ Department of Electronic and Telecommunication Engineering, University of Moratuwa
² School of Computer Science, University of Sydney
{brjathu, vinojjayasundara, sandaruamashan, nhirunima}@gmail.com
[email protected], [email protected]

arXiv:1904.09546v1 [cs.CV] 21 Apr 2019

Abstract

Capsule networks are a promising concept in deep learning, yet their true potential has not been fully realized thus far: they provide sub-par performance on several key benchmark datasets with complex data. Drawing intuition from the success achieved by Convolutional Neural Networks (CNNs) by going deeper, we introduce DeepCaps¹, a deep capsule network architecture which uses a novel 3D-convolution-based dynamic routing algorithm. With DeepCaps, we surpass the state-of-the-art results in the capsule network domain on CIFAR10, SVHN and Fashion MNIST, while achieving a 68% reduction in the number of parameters. Further, we propose a class-independent decoder network, which strengthens the use of reconstruction loss as a regularization term. This leads to an interesting property of the decoder, which allows us to identify and control the physical attributes of the images represented by the instantiation parameters.

¹ https://github.com/brjathu/deepcaps

1. Introduction

In the last few years, convolutional neural networks (CNNs) have made breakthroughs in many computer vision tasks and have significantly outperformed many conventional models driven by curated features. Two common themes of increasing the performance of CNNs are to increase the depth and the width of the network (e.g., the number of levels of the network and the number of units at each level) and to use as much training data as possible. Although CNNs have been successful, they have a few limitations, such as the invariance caused by pooling and the inability to understand spatial relationships between features. To address these limitations, Sabour et al. proposed Capsule Networks [19], which have shown promising results comparable to CNNs on several standard datasets. Intuitively, attempting to go deeper with capsule networks is a step in the right direction to further enhance their performance.

The capsule network (CapsNet) model proposed by Sabour et al. [19] comprises only one convolution layer and one fully-connected capsule layer. The proposed architecture works well with the MNIST [16] dataset; nonetheless, its performance on datasets with more complex objects, such as CIFAR10 [14], is not on par with CNNs, due to the complex shapes in CIFAR10 in comparison to MNIST.

A naive attempt at creating a deep CapsNet by simply stacking such fully-connected capsule layers will result in an architecture similar to an MLP model, which has several limitations. First, dynamic routing used in capsule networks is an extremely computationally expensive procedure, and having multiple routing layers incurs higher training and inference costs. Second, it has recently been shown that stacking fully-connected capsule layers on top of each other results in poor learning in the middle layers [24]. This is because, when there are too many capsules, the coupling coefficients tend to be too small, consequently dampening the gradient flow and inhibiting learning. Third, it has been shown that, especially in the lower layers, correlated units tend to concentrate in local regions [21]. Although localized routing can clearly take advantage of this observation, such localized routing cannot be implemented in fully-connected capsules.

In order to address these limitations caused by stacking capsule layers, we propose the following solutions. To reduce the computational complexity introduced by multiple layers needing dynamic routing, several avenues are possible. Reducing the number of routing iterations in the initial layers, which are larger in size, reduces the complexity while not affecting the features, as they need not be complex in nature. In addition, using 3D-convolution-inspired routing in the middle layers reduces the number of parameters, owing to parameter sharing. We can address the problem of poor learning in the middle layers due to naive stacking by improving the gradient flow, which involves skip connections coupled with convolutions. Moreover, while reducing the

complexity, the deep capsule network must be able to handle richer datasets than MNIST. We propose that localized routing will be able to capture higher-level information better than fully-connected routing.

Sabour et al. [19] used regularization through the incorporation of reconstruction error (which is generated by the decoder network) to reduce overfitting. Nevertheless, a stronger regularization than [19] is necessary to reduce overfitting when developing deeper networks, due to the inherent increase in model complexity with model depth. Hence, in an attempt to enhance the regularization, we propose a class-independent decoder. We observed an interesting property of this decoder, which provides controllability over the learning and perturbation of instantiation parameters. In existing capsule networks and decoders, it is not possible to guarantee that the physical property represented by a given instantiation parameter is the same across all the classes. In the proposed decoder, it is guaranteed that the represented property will be the same for any given instantiation parameter across all the classes, providing higher controllability, which is immensely advantageous in practical applications and theoretical studies.

To this end, in this paper, we propose DeepCaps: a deep capsule network architecture that leverages two key ideas: dynamic routing and going deeper in the network. The novel dynamic routing algorithm that we propose achieves parameter reduction and localized routing, making the routing possible in a convolutional framework rather than resorting to fully-connected capsules, while skip connections allow us to train deeper networks. More specifically, we make the following contributions in the paper:

• Proposing a novel deep architecture for capsule networks, termed DeepCaps, that aims at improving the performance of capsule networks on more complex image datasets. Further, we propose a novel 3D-convolution-based dynamic routing algorithm to aid the learning process of DeepCaps.

• Proposing a novel class-independent decoder network, which acts as a better regularization term. We further investigate the observation that this novel decoder has the ability to provide controllability over the instantiation parameters.

• Evaluating the performance of DeepCaps on several benchmark datasets: we significantly outperform the existing state-of-the-art capsule network architectures, while requiring a significantly lower number of parameters. For example, for the CIFAR10 dataset, DeepCaps achieves a 3% improvement in accuracy in comparison to [19], along with a 68% reduction in the number of parameters.

The rest of the paper is organized as follows: In Section 2, we discuss the related work on capsule networks. Section 3 describes our DeepCaps architecture and the novel 3D routing algorithm, and Section 4 outlines the class-independent decoder network. Section 5 shows our results. Finally, Section 6 concludes the paper.

2. Related Work

One of the major issues we face with deep networks is vanishing/exploding gradients. When the error signal is passed through many layers, it can vanish and wash out by the time it reaches the beginning of the network [2], [4], which hinders convergence. This issue is addressed in many proposed models: ResNets [5] and Highway Networks [20] bypass signals from one layer to the next via identity connections. Stochastic depth [10] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. DenseNets [9] ensure the maximum information flow between layers in the network by connecting all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. They create short paths from early layers to later layers.

The idea of grouping neurons was proposed in Hinton et al. [7]. As an extension to this, Sabour et al. [19] proposed a dynamic routing algorithm between capsules, using the concept of routing by agreement between capsules. Dynamic routing helps the network to achieve equivariance, whereas CNNs can only achieve invariance through the pooling operation. In addition to dynamic routing, Hinton et al. [8] used EM routing for matrix capsules, representing each entity by a pose matrix. There have been many extensions to these: HitNet [3] uses a hybrid hit-and-miss layer for data augmentation. Dilin et al. [23] treat dynamic routing as an optimization problem and achieve better performance by introducing a KL divergence between the coupling distributions. CapsGan [11] uses a capsule network as the discriminator in the GAN pipeline to get visually better results than convolutional GANs. In contrast to these, our work focuses on going deeper with capsule networks and increasing their performance on more complex datasets.

SegCaps [15] uses capsules for image segmentation and achieves state-of-the-art results on the LUNA16 dataset. This is the closest work to ours on the basis of routing. It uses 2D convolutions for the voting procedure. A 2D convolution takes all the capsules along the depth as inputs for the transformation, thus mixing the information contained in the capsules. In our 3D-convolution-based routing, we set the stride along the depth to be the capsule dimension, as a result of which each capsule along the depth dimension is voted on separately.
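The difference between the two voting schemes can be made concrete with a minimal NumPy sketch. It is an illustration only, not the authors' implementation: the function names, the tensor sizes, and the restriction to a kernel that is 1 × 1 spatially are assumptions made to keep the example short. It assumes a capsule tensor of shape (w, w, c, n), that is, c capsules of n atoms at each spatial position, flattened along the depth to c × n channels. Sliding a depth-n kernel with stride n touches exactly one capsule per step, whereas a 2D convolution spans all c × n channels at once.

```python
import numpy as np

def depth_strided_votes(phi, kernel):
    """Slide a depth-only kernel over the flattened depth axis with stride n.

    phi:    (w, w, c, n) capsule tensor (hypothetical sizes).
    kernel: (n,) kernel, 1x1 spatially, for illustration only.
    Returns one scalar vote per capsule: shape (w, w, c).
    """
    h, w, c, n = phi.shape
    flat = phi.reshape(h, w, c * n)            # capsule s occupies channels s*n .. (s+1)*n - 1
    votes = np.empty((h, w, c))
    for s in range(c):                         # stride n along the depth
        window = flat[:, :, s * n:(s + 1) * n]
        votes[:, :, s] = window @ kernel       # only capsule s contributes
    return votes

def mixed_votes_2d(phi, kernel_full):
    """A 1x1 2D convolution over all c*n channels: every capsule is mixed into one output."""
    h, w, c, n = phi.shape
    return phi.reshape(h, w, c * n) @ kernel_full

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 4, 3, 8))            # assumed sizes: w = 4, c = 3, n = 8
kernel = rng.normal(size=8)
kernel_full = rng.normal(size=3 * 8)

votes = depth_strided_votes(phi, kernel)
per_capsule = np.einsum("ijsr,r->ijs", phi, kernel)   # same kernel applied to each capsule separately
mixed = mixed_votes_2d(phi, kernel_full)
```

Here `votes` matches `per_capsule`, so perturbing one capsule changes only its own vote, while `mixed` has a single output per position that depends on every capsule.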

Our work explores the possibilities of creating deeper networks consisting of multiple capsule layers. To the best of our knowledge, this is the first attempt to go deeper with capsule networks. Further, the instantiation parameters of capsule networks have shown a novel way of representing images, by encoding physical variations such as rotation and skewness in a vector. A small perturbation in an instantiation parameter will change the corresponding physical variations in the reconstructed image. Still, which parameter causes what kind of change in the reconstructed images has not been studied.

3. DeepCaps

One of the main drawbacks of dynamic routing in its current form [19] is that it can only be implemented in a fully-connected manner (e.g., it cannot be implemented in a convolutional manner). In [19], after the primary capsule layer, capsule vectors are flattened and dynamically routed to the classification capsules. Thus, if it is necessary to go deep into the architecture with the dynamic routing algorithm in [19], we need to keep stacking fully-connected capsule layers, which is equivalent to stacking fully-connected layers in MLP models. This is not computationally efficient, as the feature space is large at the start of the network. Hence, in order to stack convolutional capsule layers similar to conventional CNNs, a novel dynamic routing algorithm is required.

3.1. 3D Convolution Based Dynamic Routing

Let the output of capsule layer $l$ be $\Phi^l \in \mathbb{R}^{(w^l, w^l, c^l, n^l)}$, where $w^l$ is the height and the width of the feature map, $c^l$ is the number of 3D capsule tensors, and $n^l$ is the number of atoms (i.e., the capsule dimension). In this section, we illustrate the novel mechanism that we propose in order to route the 3D capsule tensors from layer $l$ to predict the new 3D capsule tensor $\Phi^{l+1} \in \mathbb{R}^{(w^{l+1}, w^{l+1}, c^{l+1}, n^{l+1})}$.

First, we reshape $\Phi^l$ into a single-channel tensor $\tilde{\Phi}^l$, which has a shape of $(w^l, w^l, c^l \times n^l, 1)$, and convolve it with $(c^{l+1} \times n^{l+1})$ 3D convolutional kernels. Let $\Psi^l_t$ be the $t$-th kernel in layer $l$, where $t \in [c^{l+1} \times n^{l+1}]$. The convolution results in the intermediate votes $V$, which have the shape $(w^{l+1}, w^{l+1}, c^l, c^{l+1} \times n^{l+1})$. Keeping both the size of $\Psi^l_t$ and the stride along the depth equal to $n^l$ allows us to get a vote from a single capsule in layer $l$ (see Fig. 1). Using a 3D convolution kernel whose height and width are greater than 1 as the transformation matrix allows us to predict higher-level capsules using a set of lower-level capsules.

Each element $v_{i,j,k,m}$ in $V$ can be obtained by performing the 3D convolution operation, which is defined according to Eq. 1 below:

$$v_{i,j,k,m} = \sum_{p} \sum_{q} \sum_{r} \tilde{\Phi}^l(i - p,\, j - q,\, k - r) \cdot \Psi^l_t(p, q, r) \tag{1}$$

In order to keep the shape of the intermediate votes $V$ consistent with the number of channels in the input capsule tensor $\tilde{\Phi}^l$, we use $(1, 1, n^l)$ as the stride for the 3D convolution operations.

Subsequently, we reshape the intermediate votes $V$ into the inceptive votes $\tilde{V}$ for the proposed iterative routing algorithm. $\tilde{V}$ has the shape $(w^{l+1}, w^{l+1}, n^{l+1}, c^{l+1}, c^l)$, since we are predicting $c^{l+1}$ capsule tensors for each $s \in [c^l]$. Here, the value of $w^{l+1}$ can be calculated analytically using Eq. 2 below:

$$w^{l+1} = \frac{w^l - \text{Kernel size} + 2 \times \text{Padding}}{\text{Stride}} + 1 \tag{2}$$

If the dynamic routing algorithm in [19] were used for routing, it would route all capsules in layer $l$ to all the capsules in layer $l + 1$. However, the feature maps resulting from the convolution operation have localized features; thus, adjacent capsules share similar information. We can eliminate this redundancy by routing a block of capsules $s$ from layer $l$ to the capsules in layer $l + 1$, instead of routing each capsule in layer $l$ individually. This modification results in a significant reduction of the number of parameters, by a factor of $c \cdot (w^l w^{l+1})^2$, in comparison to the dynamic routing algorithm.

Similarly, with a 3D convolutional kernel transforming a subset of capsules in a block to one vote, we achieve localized voting. For example, a $3 \times 3 \times 8$ kernel will transform the 9 adjacent capsules into one vote. In other words, in layer $l$, a low-level entity may be represented by either a single capsule or, more often, a group of capsules which are adjacent to each other. Hence, rather than routing them separately to a higher-level capsule, we group them together and route. Because these additional requirements are not fulfilled by the existing routing algorithms, we propose the following novel routing algorithm.

First, we initialize the logits $B_s$ as 0, where $B_s \in \mathbb{R}^{(w^{l+1}, w^{l+1}, c^{l+1})}$, for each $s \in [c^l]$. The corresponding coupling coefficients $K_s$ are calculated using a softmax_3D function, as defined by Eq. 3, which we propose as a 3D version of the existing softmax function [19].

$$K_s = \text{softmax\_3D}(B_s), \qquad k_{pqrs} = \frac{\exp(b_{pqrs})}{\sum_{x} \sum_{y} \sum_{z} \exp(b_{xyzs})} \tag{3}$$

Here, the logits are normalized among all the predicted capsules from capsule tensor $s$ in layer $l$. This is due to the fact that a single capsule tensor in layer $l$ predicts all

the possible outputs of every $(p, q, r)$-th capsule in layer $l + 1$. In other words, each capsule tensor in layer $l + 1$ will have $c^l$ corresponding predictions from layer $l$. Each prediction will be weighted with $k_{pqrs}$ to get a single prediction $S_{pqr}$, which will be passed through the squash_3D function, as defined by Eq. 4, to limit the length of a capsule vector to between 0 and 1, as it represents the probability of existence of an entity.

$$\hat{S}_{pqr} = \text{squash\_3D}(S_{pqr}) = \frac{\|S_{pqr}\|^2}{1 + \|S_{pqr}\|^2} \cdot \frac{S_{pqr}}{\|S_{pqr}\|} \tag{4}$$

The key concept of the routing algorithm proposed by [19] is routing by agreement between the outputs of the capsules. The agreement between $\hat{S}$ and $\tilde{V}_s$ is measured by their dot product, and the logits are updated with the agreement measure.

We iterate through the proposed routing algorithm $i$ times, where we empirically set $i = 3$ following [19]. After the iterations, the output of layer $l + 1$, $\Phi^{l+1}$, is given by $\hat{S}$.

Figure 1. Dynamic routing using 3D convolutions: At a high level, each capsule tensor in layer $l$ will predict $c^{l+1}$ capsule tensors. Therefore, $c^l$ predictions are available for a capsule tensor in layer $l + 1$. In the first routing iteration, all are equally weighted and summed together to get the final prediction $S$. Then, in the following iterations, coupling coefficients are updated according to the agreement between $S$ and $\tilde{V}$.

Algorithm 1 Dynamic Routing using 3D convolution
 1: procedure ROUTING
 2:   Require: $\Phi^l \in \mathbb{R}^{(w^l, w^l, c^l, n^l)}$, $r$ and $c^{l+1}$, $n^{l+1}$
 3:   $\tilde{\Phi}^l \leftarrow \text{Reshape}(\Phi^l) \in \mathbb{R}^{(w^l, w^l, c^l \times n^l, 1)}$
 4:   $V \leftarrow \text{Conv3D}(\tilde{\Phi}^l) \in \mathbb{R}^{(w^{l+1}, w^{l+1}, c^l, c^{l+1} \times n^{l+1})}$
 5:   $\tilde{V} \leftarrow \text{Reshape}(V) \in \mathbb{R}^{(w^{l+1}, w^{l+1}, n^{l+1}, c^{l+1}, c^l)}$
 6:   $B \leftarrow 0 \in \mathbb{R}^{(w^{l+1}, w^{l+1}, c^{l+1}, c^l)}$
      Let $p \in w^{l+1}$, $q \in w^{l+1}$, $r \in c^{l+1}$ and $s \in c^l$
 7:   for $i$ iterations do
 8:     for all $p, q, r$: $k_{pqrs} \leftarrow \text{softmax\_3D}(b_{pqrs})$
 9:     for all $s$: $S_{pqr} \leftarrow \sum_s k_{pqrs} \cdot \tilde{V}_{pqrs}$
10:     for all $s$: $\hat{S}_{pqr} \leftarrow \text{squash\_3D}(S_{pqr})$
11:     for all $s$: $b_{pqrs} \leftarrow b_{pqrs} + \hat{S}_{pqr} \cdot \tilde{V}_{pqrs}$
12:   return $\Phi^{l+1} = \hat{S}$

3.2. DeepCaps Architecture

Even though the architecture proposed by [19] performs well with MNIST, Fashion-MNIST [25] and similar datasets, its performance on CIFAR10 and other datasets containing complex objects can be considered sub-par. This is because MNIST images can be easily classified with low-level features such as edges and blobs, while CIFAR10 images require a high-level understanding of features. Thus, in this paper we propose a novel deep capsule architecture which contains 16 convolutional capsule layers and a fully-connected capsule layer. However, going deep with capsule networks poses several challenges, which we discuss and attempt to solve by proposing customized layers below.

In the first few layers of the network, the feature map space is large, so routing is computationally expensive at the start. Hence, we keep the number of routing iterations at one in the first few layers. We need to stack layers to build a deep capsule network. However, since all the operations are required to be in capsule form, stacking convolutional layers will not be useful, as they produce scalar feature maps as outputs. Therefore, in order to address these requirements, we propose the ConvCaps layer, which is similar to a convolutional layer, except that its outputs are squashed 4D tensors. We use a ConvCaps layer where $i = 1$, and for any $i > 1$ we use a ConvCaps3D layer.

Let $\Phi^l \in \mathbb{R}^{(w^l, w^l, c^l, n^l)}$ be the input to the ConvCaps layer and $\Phi^{l+1} \in \mathbb{R}^{(w^{l+1}, w^{l+1}, c^{l+1}, n^{l+1})}$ be the output from layer $l$. $w^{l+1}$ is obtained from the convolutional strides and padding (see Eq. 2). First, $\Phi^l$ is reshaped into $(w^l, w^l, c^l \times n^l)$ and convolved with $(c^{l+1} \times n^{l+1})$ filters, producing $(c^{l+1} \times n^{l+1})$ feature maps of width

and height $(w^{l+1}, w^{l+1})$. This is then reshaped into a $\Phi^{l+1}$ tensor of shape $(w^{l+1}, w^{l+1}, c^{l+1}, n^{l+1})$, and the squash function is applied to the capsules. This helps us to convert the feature maps into the capsule domain. In [19], when $i = 1$, the predictions are an equally weighted sum of the votes. The convolution operation is an alternative, except that it gives a weighted sum of the input capsules to predict the next layer's votes. Further, when $i$ is set to a value greater than 1, the ConvCaps3D layer is used with the 3D-convolution-based dynamic routing of Algorithm 1.

In order to reshape ConvCaps, we introduce FlatCaps, which are used to remove the spatial relationship between adjacent capsules in ConvCaps layer $l$, while keeping the part-whole relationships between the capsules in ConvCaps layer $l$ and FC caps layer $l + 1$. Thus, FlatCaps takes a tensor of shape $(w^l, w^l, c^l, n^l)$ and reshapes it into a matrix of shape $(a^l, n^l)$, where $a^l = w^l \times w^l \times c^l$.

FC caps are similar to the fully-connected layers in deep neural networks. Here, $\Phi^l \in \mathbb{R}^{(a^l, n^l)}$ is mapped into $\Phi^{l+1} \in \mathbb{R}^{(a^{l+1}, n^{l+1})}$. Each capsule in $\Phi^l$ is transformed into a capsule in $\Phi^{l+1}$ by a transformation matrix $W_{i,j} \in \mathbb{R}^{n^l \times n^{l+1}}$. Here, the $W$s are learned during the training process via backpropagation.

With the use of these layers, we build our DeepCaps architecture as illustrated in Fig. 2. The model contains four main blocks: skip-connected CapsCells, 3D convolutional CapsCells, a fully-connected capsule layer and a decoder network. A skip-connected capsule cell has three ConvCaps layers, with the first layer's output convolved and skip-connected to the last layer's output. The motivation behind skip connections is to reduce vanishing gradients in deep models. In addition, this allows us to route low-level capsules to high-level capsules with skip connections. We use element-wise layer addition to join the two capsule layers' outputs after the skip connections. Since the capsules are represented by vectors, a channel-wise concatenation was not used, as it duplicates the same capsule, whereas element-wise addition reduces the bias and the susceptibility to noise. Subsequently, we have a cell with a ConvCaps3D layer, where the number of routing iterations is kept at 3. Then, the ConvCaps outputs are flattened and concatenated with the outputs of the capsules before 3D routing (in CapsCell 3) prior to the dynamic routing. Intuitively, this step helps to generalize the model to a broad range of diverse datasets. For example, low-level capsules from cell 1 or 2 would be sufficient for datasets consisting of images with poor information content, such as MNIST, while we need to go as deep as the 3D ConvCaps capsules for datasets consisting of images with rich information content, such as CIFAR10. Once all the capsules are collected and concatenated, they are routed to the class capsules via the FC caps layer. Here, the decision making happens and the input image is encoded into the final capsule vector. Finally, we use a decoder network to reconstruct the input image, as proposed in [19]. However, the decoder proposed in [19] merely consists of two fully-connected layers, which cannot properly reconstruct the spatial relationships learned by the capsule network. Hence, we replace the decoder in [19] with a deconvolutional decoder, which is better at reconstructing spatial relationships.

3.3. Loss Function

We use the margin loss [19] as the loss function for DeepCaps. The margin loss function enhances the class probability of the true class, while suppressing the class probabilities of the other classes.

$$L_k = T_k \max(0,\, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0,\, \|v_k\| - m^-)^2 \tag{5}$$

Here $T_k$ is 1 if the true class is $k$ and zero otherwise. We use $m^+ = 0.9$ and $m^- = 0.1$ as the lower bound for the correct class and the upper bound for the incorrect class, as in Sabour et al. [19]. $\lambda$ is used to control the effect of gradient backpropagation in the initial phase of the training.

4. Class Independent Decoder Network

Our decoder network consists of deconvolutional layers [26], which reconstruct the input data by utilizing the instantiation parameters extracted from the DeepCaps model. In comparison with the fully-connected decoder [19], this captures more spatial relationships while reconstructing the images. Further, we use binary cross entropy as the loss function for improved performance [12].

The existing decoder, which is used as regularization for capsule networks, is class dependent. Let $P \in \mathbb{R}^{a \times b}$ contain the activity vectors for all classes, where $a$ is the number of classes in the final class capsule layer and $b$ is the capsule dimension. $P$ is masked by the class with the highest probability, which results in $\hat{P}$ as shown in Eq. 6 below:

$$\hat{p}_{i,j} = \begin{cases} p_{i,j} & i = t \\ 0 & i \neq t \end{cases} \tag{6}$$

Here $i \in [a]$, $j \in [b]$, and $t = \arg\max_i (\|P_i\|_2^2)$ in the inference stage, while $t$ is the true label in the training stage. The matrix $\hat{P}$ is vectorized and fed into the decoder network, as illustrated in Fig. 3. This vectorized $\hat{P} \in \mathbb{R}^{a \times b}$ contains non-zero values in dimensions $t \cdot b$ to $(t + 1) \cdot b$ and zeros elsewhere. Therefore, the decoder network gets the class information from the dimension-specific distribution, which provides class information to the decoder indirectly, making the decoder class dependent.

Hence, we propose a novel class-independent decoder network which acts as a better regularizer for the capsule

networks, since it is forced to learn the activity vectors jointly within a constrained $\mathbb{R}^b$ space. In our setting, only the vector $P_t \in \mathbb{R}^{1 \times b}$ is fed into the decoder, where $t$ is the true label in the training stage, and $t = \arg\max_i (\|P_i\|_2^2)$ in the inference stage.

Figure 2. A four-cell DeepCaps model, with the first three cells using $i = 1$; in the last cell, 3D-convolution-based dynamic routing is applied.

Figure 3. Decoder network used in [19], which takes all the vectorized masked activity vectors.

Figure 4. Proposed decoder, which takes only the activity vectors of the predicted class.

Apart from regularization, a key advantage of having a decoder network is that it can be utilized for tasks such as data generation [19]. However, a significant limitation of these decoders is the lack of controllability over which physical parameter is captured by which instantiation parameter. For example, if a certain instantiation parameter for a given class causes rotation for that particular class, there is no guarantee that the same instantiation parameter would cause rotation in any other class. As a result, generating data with similar requirements, such as the same thickness or skewness, is a challenge.

As a solution to these issues, we propose the following procedure. Instead of masking the instantiation parameters of the non-predicted classes, we send only $P_t \in \mathbb{R}^{1 \times b}$, as illustrated in Fig. 4. In contrast to the decoder learning procedure in [19], each instantiation parameter in the proposed method is learned from the same joint distribution. Hence, the entity encapsulated by any given instantiation parameter, as learned by the decoder, will be the same irrespective of the image label.

Further, this procedure helps us to understand the types of variations in the MNIST dataset. For example, the fact that rotation and elongation are dominant variations in the dataset, while localized changes are less dominant among characters, is reflected in the variance of the activity vector. In other words, the instantiation parameters causing rotations have higher variance, whereas those causing localized changes have lower variance.
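The two decoder inputs, and the variance ranking mentioned above, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the sizes a = 10 and b = 32 are chosen only for the example, and the random matrices stand in for activity vectors that a trained network would produce.

```python
import numpy as np

def predicted_class(P):
    """t = argmax_i ||P_i||_2^2, the inference-time choice of class."""
    return int(np.argmax(np.sum(P ** 2, axis=1)))

def masked_decoder_input(P, t):
    """Class-dependent input (Eq. 6): zero every row except t, then vectorize."""
    P_hat = np.zeros_like(P)
    P_hat[t] = P[t]
    return P_hat.reshape(-1)           # length a * b, non-zero only in [t*b, (t+1)*b)

def class_independent_input(P, t):
    """Class-independent input: only the activity vector P_t, of length b."""
    return P[t].copy()

def rank_by_variance(vectors):
    """Indices of instantiation parameters, from highest to lowest variance over a batch."""
    return np.argsort(-np.var(vectors, axis=0))

a, b = 10, 32                          # assumed: 10 classes, 32-dimensional class capsules
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(a, b))     # stand-in for the class capsule outputs

t = predicted_class(P)
masked = masked_decoder_input(P, t)        # position of the non-zero block reveals the class
compact = class_independent_input(P, t)    # same b slots are used whatever the class

# Stand-in batch of P_t vectors whose dimensions have different spreads.
batch = rng.normal(size=(1000, b)) * np.linspace(0.01, 0.2, b)
order = rank_by_variance(batch)
```

In the masked input, the class is recoverable from where the non-zero block sits, which is the indirect class information described above. The compact input always occupies the same `b` positions, so a given index has to carry the same meaning for every class.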

5. Experiments and Results We rescaled the images only for CIFAR10 and SVHN
datasets as a data augmentation, since they have richer high-
5.1. Implementation level features compared to MNIST and F-MNIST. Having
We used Keras and Tensorflow libraries for the devel- 64 × 64 resolution images allows us to add more layers to
opment of DeepCaps. For the training procedure, we used go down deep in the network.
Adam optimizer [13] with an initial learning rate of 0.001, For the models trained on CIFAR10, DeepCaps has only
which is reduced by half after each 20 epochs. During the 7.22 million parameters, while CapsNet [19] has 22.48 mil-
initial phases of the training, λ in Eq. 5 is set to 0.2 and in- lion parameters. Still we achieved 91.01% on CIFAR10
creased to 0.5 in the latter part of the training. The models with a single model, where CapsNet has a 7 ensembles
were trained on GTX-1080 and V100 GPUs, and a weighted accuracy of 89.40%. We tested both models’ inference
average ensembling was used for the 7-ensemble models re- time on NVIDIA V100 GPU, CapsNet takes 2.86 ms for
ported in Table 1. a 32 × 32 × 3 image, while our model takes only 1.38 ms
for a 64 × 64 × 3 image.
5.2. Classification Results
We test our DeepCaps model with several benchmark 5.3. Class-Independent Decoder Image Reconstruc-
datasets, CIFAR10 [14], SVHN [18] , Fashion-MNIST [25] tion
and MNIST [16], and compare its performance with the Our class-independent decoder acts as a better regular-
existing capsule network architectures. For CIFAR10 and ization term, yet it also helps to jointly learn the inter class
SVHN, we resize the 32 × 32 × 3 images to 64 × 64 × 3 and reconstruction. Hence, all the instantiation parameters are
for other datasets, original image sizes are used throughout our experiments.

Table 1. Classification accuracies of DeepCaps, CapsNet [19] and other variants of Capsule Networks, alongside state-of-the-art results. DeepCaps outperforms all the capsule-domain networks on the CIFAR10, SVHN and Fashion-MNIST datasets, while achieving similar performance on the MNIST dataset.

Model                     CIFAR10   SVHN     F-MNIST   MNIST
DenseNet [9]              96.40%    98.41%   95.40%    -
ResNet [6]                93.57%    -        -         99.59%
DPN [1]                   96.35%    -        95.70%    -
Wan et al. [22]           -         -        -         99.79%
Zhong et al. [27]         96.92%    -        96.35%    -
Sabour et al. [19]        89.40%    95.70%   93.60%    99.75%
Nair et al. [17]          67.53%    91.06%   89.80%    99.50%
HitNet [3]                73.30%    94.50%   92.30%    99.68%
DeepCaps                  91.01%    97.16%   94.46%    99.72%
DeepCaps (7-ensembles)    92.74%    97.56%   94.73%    -

Even though our results are slightly below or on par with the state-of-the-art results, they comfortably surpass all the existing capsule network models on the CIFAR10, SVHN and Fashion-MNIST datasets. If we take the capsule network implementations with the best results, there is a 3.25% improvement in CIFAR10 and a 1.86% improvement in SVHN compared to the capsule network model proposed in [19]. For the Fashion-MNIST dataset, we outperform the results of HitNet [3] by 1.62%, and for MNIST, DeepCaps produced results on par with the state of the art. Table 1 shows our results in comparison with the existing capsule network results and the state-of-the-art results for the corresponding datasets. We highlight that we were able to achieve near state-of-the-art performance across the datasets while surpassing the results of all the existing capsule network models.

distributed in the same space. For example, specific variations in a handwritten digit, such as boldness, rotation and skewness, are captured at the same locations of the instantiation parameter vector for all the classes. In other words, if the 7th instantiation parameter is responsible for rotation for class ‘9’, then the same 7th parameter causes rotation in every other class as well. The outputs of the decoder used in [19] also change when the activity vectors are perturbed; yet a specific instantiation parameter may cause rotation in the reconstructed output for one class, while a different instantiation parameter causes rotation in another class. This is due to the fact that the activity vectors are distributed in a dimension-wise separable activity vector space. Using our class-independent decoder, we can generate data for any class with a certain requirement. For example, if we want to generate bold data from a text, once we find the instantiation parameter responsible for boldness for any one class, we can perturb it to generate bold letters across all classes. We cannot do this with [19] unless we know the locations of the instantiation parameters corresponding to boldness for all the classes. See Fig. 5.

With this class-independent decoder, we can label each instantiation parameter that causes a specific change in the reconstructed images. For the models that we trained, we observed that the 28th parameter always causes vertical elongation, and the 1st parameter is responsible for thickness. Further, we observed that, when we rank these instantiation parameters by variance, the parameters with higher variance cause global variations such as rotation, elongation and thickness, while the parameters with lower variance are responsible for localized changes. See Fig. 6. The instantiation parameter space is not restricted to be orthogonal; hence, a few instantiation parameters share

Figure 5. The left half of the images is generated by our decoder network, and the right half by the decoder used in [19]. When the 28th dimension of the activity vector is varied over [-0.075, 0.075], all the variations in the left half are the same across classes, namely elongation in the vertical direction. In the right half, the variations differ for each class. For example, ‘7’ shrinks in the vertical direction, ‘1’ is elongated in the vertical direction, and ‘9’ shows some rotation.
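The perturbation behind Fig. 5, and the variance ranking discussed in the text, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The function names `sweep_dimension` and `rank_by_variance` are hypothetical. It also assumes that the perturbation is an additive offset, that the "28th dimension" is index 27 in zero-based indexing, and that each perturbed vector would then be passed to a trained decoder (not shown here).

```python
from statistics import pvariance


def sweep_dimension(activity_vector, dim, low=-0.075, high=0.075, steps=11):
    """Return `steps` copies of `activity_vector`, each with an evenly
    spaced offset from [low, high] added to component `dim` (assumed additive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (high - low) / (steps - 1)
    swept = []
    for i in range(steps):
        perturbed = list(activity_vector)  # copy, so the input is left unchanged
        perturbed[dim] += low + i * step
        swept.append(perturbed)
    return swept


def rank_by_variance(activity_vectors):
    """Rank instantiation parameters by their variance over a set of
    activity vectors, highest variance first."""
    columns = list(zip(*activity_vectors))  # one tuple per parameter
    variances = [pvariance(column) for column in columns]
    order = sorted(range(len(variances)), key=lambda k: variances[k], reverse=True)
    return order, variances


# Sweep the (assumed zero-based) index 27 of a 32-dimensional activity vector.
base = [0.0] * 32
swept = sweep_dimension(base, dim=27, steps=7)

# Toy activity vectors with three parameters, to show the ranking.
vectors = [[0.0, 1.0, 0.1], [0.0, -1.0, -0.1], [0.0, 1.0, 0.1], [0.0, -1.0, -0.1]]
order, variances = rank_by_variance(vectors)
```

With a class-independent decoder, the same swept dimension would be expected to produce the same kind of change for every class, so a single sweep of this form could be reused across classes.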

a common attribute of an image. Yet, the instantiation parameters with higher variance demonstrate clearly separable variations, as illustrated by Fig. 7.

[Figure 6 plot: variance across classes (y-axis, 0.00 to 0.04) against instantiation parameter index (x-axis, 0 to 30).]
Figure 6. All 32 instantiation parameters and their variance across the MNIST dataset. Although the instantiation parameter space is not orthogonal, high-variance instantiation parameters show clearly separable changes in the reconstructed images, while low-variance instantiation parameters show mixed changes.

[Figure 7 panels: Rotation (10), Vertical elongation (18), Thickness (1), Vertical expansion (30), Localized skewness (6).]
Figure 7. Perturbations of a single instantiation parameter of the above digit show that high-variance instantiation parameters cause global changes and low-variance instantiation parameters cause localized changes.

6. Conclusion

In this paper, we proposed a new deep architecture for Capsule Networks, termed DeepCaps, drawing intuition from the concepts of skip connections and 3D convolutions. Skip connections within a capsule cell allow good gradient flow in back propagation. At the bottom of the network, we use a higher number of routing iterations when the skip connections jump more than a layer. 3D convolutions are used to generate votes from the capsule tensors, which are then used for dynamic routing. This helps us to route a localized group of capsules to a certain higher-level capsule. As a result, we were able to go deeper with capsules using less computational complexity compared to Sabour et al. [19]. Within the Capsule Network domain, our model surpasses the state-of-the-art performance on CIFAR10, SVHN and Fashion-MNIST, while achieving state-of-the-art performance on MNIST.

Further, we introduced a novel class-independent decoder network, which acts as a regularizer for DeepCaps. Since it learns from activity vectors that are distributed in the same space, we observed that across all the classes, a specific instantiation parameter captures a specific change. This opens up new avenues in practical applications such as data generation.

Furthermore, we were able to get better performance on comparatively complex datasets such as CIFAR10, where the CapsNet in [19] did not show significant performance. As future work, we would like to build even deeper models with higher-level understanding and apply them to the ImageNet dataset. The class-independent decoder network also showed potential in data generation applications with specific requirements, such as generating text data in the same style. Further, we hope to investigate eliminating the correlation between the instantiation parameters.

7. Acknowledgement

The authors thank the National Research Council, Sri Lanka (Grant 12-018), and the Faculty of Information Technology of the University of Moratuwa, Sri Lanka for providing computational resources.

References

[1] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in NIPS, Long Beach, CA, 2017, pp. 4467–4475.
[2] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang, “Adanet: Adaptive structural learning of artificial neural networks,” in ICML, Sydney, Australia, 2017, pp. 874–883.
[3] A. Deliège, A. Cioppa, and M. Van Droogenbroeck, “Hitnet: a neural network with capsules embedded in a hit-or-miss layer, extended with hybrid data augmentation and ghost capsules,” CoRR, 2018.
[4] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, Boston, MA, 2015, pp. 447–456.
[5] K. He, X. Zhang, and S. Ren, “Deep residual learning for image recognition,” in CVPR, Las Vegas, NV, 2016, pp. 770–778.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, Las Vegas, NV, 2016, pp. 770–778.
[7] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in ICANN, Espoo, Finland, 2011, pp. 44–51.
[8] G. E. Hinton, S. Sabour, and N. Frosst, “Matrix capsules with em routing,” in ICLR, Vancouver, BC, 2018.
[9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, vol. 1, no. 2, Honolulu, HI, 2017, p. 3.
[10] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in ECCV. Amsterdam, Netherlands: Springer, 2016, pp. 646–661.
[11] A. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan, “Capsulegan: Generative adversarial capsule network,” in ECCV, Munich, Germany, 2018, pp. 526–535.
[12] V. Jayasundara, S. Jayasekara, H. Jayasekara, J. Rajasegaran, S. Seneviratne, and R. Rodrigo, “Textcaps: Handwritten character recognition with very small datasets,” in WACV, Waikoloa Village, HI, 2019, pp. 254–262.
[13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, San Diego, CA, 2015.
[14] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[15] R. LaLonde and U. Bagci, “Capsules for object segmentation,” arXiv preprint, 2018.
[16] Y. LeCun, C. Cortes, and C. J. C. Burges, “The mnist database of handwritten digits,” 1998.
[17] P. Q. Nair, R. Doshi, and S. Keselj, “Pushing the limits of capsule networks,” 2018.
[18] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS, Granada, Spain, 2011.
[19] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in NIPS, Long Beach, CA, 2017, pp. 3856–3866.
[20] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in NIPS, Montreal, QC, 2015, pp. 2377–2385.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, Boston, MA, 2015, pp. 1–9.
[22] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using dropconnect,” ICML, vol. 28, no. 3, pp. 1058–1066, Jun 2013.
[23] D. Wang and Q. Liu, “An optimization view on dynamic routing between capsules,” 2018.
[24] E. Xi, S. Bing, and Y. Jin, “Capsule network performance on complex data,” CoRR, 2017.
[25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” CoRR, 2017.
[26] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in CVPR, San Francisco, CA, June 2010, pp. 2528–2535.
[27] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” CoRR, 2017.
