DeepCaps: Going Deeper with Capsule Networks
[email protected], [email protected], [email protected], [email protected]
complexity, the deep capsule network must be able to handle richer datasets than MNIST. We propose that localized routing will be able to capture higher-level information better than fully connected routing.

Sabour et al. [19] used regularization through the incorporation of a reconstruction error (generated by the decoder network) to reduce overfitting. Nevertheless, stronger regularization than in [19] is necessary to reduce overfitting when developing deeper networks, due to the inherent increase in model complexity with model depth. Hence, in an attempt to enhance the regularization, we propose a class-independent decoder. We observed an interesting property of this decoder: it provides controllability over the learning and perturbation of instantiation parameters. In existing capsule networks and decoders, it is not possible to guarantee that the physical property represented by a given instantiation parameter is the same across all the classes. In the proposed decoder, it is guaranteed that the represented property will be the same for any given instantiation parameter across all the classes, providing higher controllability, which is immensely advantageous in practical applications and theoretical studies.

To this end, in this paper we propose DeepCaps, a deep capsule network architecture that leverages two key ideas: dynamic routing and going deeper in the network. The novel dynamic routing algorithm that we propose achieves parameter reduction and localized routing, making routing possible in a convolutional framework rather than resorting to fully connected capsules, while skip connections allow us to train deeper networks. More specifically, we make the following contributions in this paper:

• Proposing a novel deep architecture for capsule networks, termed DeepCaps, that aims at improving the performance of capsule networks on more complex image datasets. Further, we propose a novel 3D-convolution-based dynamic routing algorithm to aid the learning process of DeepCaps.

• Proposing a novel class-independent decoder network, which acts as a better regularization term. We further investigate the observation that this novel decoder provides controllability over the instantiation parameters.

• Evaluating the performance of DeepCaps on several benchmark datasets: we significantly outperform the existing state-of-the-art capsule network architectures, while requiring a significantly lower number of parameters. For example, on the CIFAR10 dataset, DeepCaps achieves a 3% improvement in accuracy over [19], along with a 68% reduction in the number of parameters.

The rest of the paper is organized as follows: in Section 2, we discuss related work on capsule networks; Section 3 describes our DeepCaps architecture and the novel 3D routing algorithm; Section 4 outlines the class-independent decoder network; and Section 5 presents our results. Finally, Section 6 concludes the paper.

2. Related Work

One of the major issues we face with deep networks is vanishing/exploding gradients. When the error signal is passed through many layers, it can vanish and wash out by the time it reaches the beginning of the network [2], [4], which hinders convergence. This issue has been addressed by many proposed models: ResNets [5] and Highway Networks [20] bypass signals from one layer to the next via identity connections, and stochastic depth [10] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. DenseNets [9] ensure maximum information flow between layers in the network by connecting all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers, creating short paths from early layers to later layers.

The idea of grouping neurons was proposed by Hinton et al. [7]. As an extension to this, Sabour et al. [19] proposed a dynamic routing algorithm between capsules, using the concept of routing by agreement. Dynamic routing helps the network achieve equivariance, whereas CNNs can only achieve invariance through the pooling operation. In addition to dynamic routing, Hinton et al. [8] used EM routing for matrix capsules, representing each entity by a pose matrix. There have been many extensions to these: HitNet [3] uses a hybrid hit-and-miss layer for data augmentation; Wang and Liu [23] solve dynamic routing as an optimization problem, achieving better performance by introducing a KL divergence term between the coupling distributions; and CapsGAN [11] uses a capsule network as the discriminator in a GAN pipeline, obtaining visually better results than convolutional GANs. In contrast to these, our work focuses on going deeper with capsule networks and increasing their performance on more complex datasets.

SegCaps [15] uses capsules for image segmentation and achieves state-of-the-art results on the LUNA16 dataset. This is the closest work to ours on the basis of routing. They use 2D convolution for the voting procedure; by using 2D convolutions, all the capsules along the depth are taken as inputs to the transformation, thus mixing the information contained in the capsules. In our 3D-convolution-based routing, we set the stride along the depth to the capsule dimension, as a result of which each capsule along the depth dimension is voted on separately.
3. DeepCaps

Our work explores the possibilities of creating deeper networks consisting of multiple capsule layers. To the best of our knowledge, this is the first attempt to go deeper with capsule networks. Further, the instantiation …

3.1. 3D Convolution Based Dynamic Routing

Let Ψ^l_t be the t-th kernel in layer l, where t ∈ [c^{l+1} × n^{l+1}]; convolving with these kernels results in the intermediate votes V, which have the shape (w^{l+1}, w^{l+1}, c^l, c^{l+1} × n^{l+1}). Setting both the kernel depth of Ψ^l_t and the stride along the depth to n^l allows us to get a vote for a single capsule from layer l; see Fig. 1. Using a 3D convolution kernel whose height and width are greater than 1 as the transformation matrix allows us to predict higher-level capsules from a set of lower-level capsules. Each element v_{i,j,k,m} in V can be obtained by performing the 3D convolution operation defined in Eq. 1:

    v_{i,j,k,m} = Σ_p Σ_q Σ_r Φ̃^l(i − p, j − q, k − r) · Ψ^l_t(p, q, r)    (1)

Routing logits B_s ∈ R^{(w^{l+1}, w^{l+1}, c^{l+1})} are maintained for each s ∈ [c^l] (cf. line 6 of Algorithm 1). The corresponding coupling coefficients K_s are calculated using a softmax3D function, defined in Eq. 3, which we propose as a 3D version of the existing softmax function of [19]:

    K_s = softmax3D(B_s),    k_{pqrs} = exp(b_{pqrs}) / Σ_x Σ_y Σ_z exp(b_{xyzs})    (3)

Here, the logits are normalized over all the capsules predicted from capsule tensor s in layer l. This is due to the fact that a single capsule tensor in layer l predicts all the possible outputs of every (p, q, r)-th capsule in layer l + 1. In other words, each capsule tensor in layer l + 1 has c^l corresponding predictions from layer l.
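To make the voting step concrete, here is a minimal tf.keras sketch of Eq. 1; this is our own illustration, not the authors' released code, and all sizes and variable names are assumptions:

import tensorflow as tf

# Assumed sizes: c_l capsule types of dimension n_l on a w x w grid (layer l).
w, c_l, n_l = 8, 32, 8
c_next, n_next = 32, 8

phi = tf.random.normal((1, w, w, c_l, n_l))            # capsule tensor Phi^l
phi_tilde = tf.reshape(phi, (1, w, w, c_l * n_l, 1))   # reshaped Phi~^l

# A single Conv3D produces all c^{l+1} x n^{l+1} votes (Eq. 1): the kernel
# depth and the depth stride are both n_l, so each capsule is voted separately.
votes = tf.keras.layers.Conv3D(filters=c_next * n_next,
                               kernel_size=(3, 3, n_l),
                               strides=(1, 1, n_l),
                               padding="same")(phi_tilde)
print(votes.shape)  # (1, 8, 8, 32, 256) = (1, w, w, c_l, c_next * n_next)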
Figure 1. Dynamic routing using 3D convolutions: at a high level, each capsule tensor in layer l predicts c^{l+1} capsule tensors. Therefore, c^l predictions are available for each capsule tensor in layer l + 1. In the first routing iteration, all of them are equally weighted and summed together to get the final prediction S. Then, in the following iterations, the coupling coefficients are updated according to the agreement between S and Ṽ.
Each prediction will be weighted with k_{pqrs} and summed to obtain a single prediction S_{pqr} (line 9 of Algorithm 1), which is then squashed, as defined by Eq. 4, to limit the length of a capsule vector to between 0 and 1, as it represents the probability of existence of an entity:

    Ŝ_{pqr} = squash3D(S_{pqr}) = (‖S_{pqr}‖² / (1 + ‖S_{pqr}‖²)) · (S_{pqr} / ‖S_{pqr}‖)    (4)

The key concept of the routing algorithm proposed by [19] is routing by agreement between the outputs of the capsules. The agreement between Ŝ and Ṽ_s is measured by their dot product, and the logits are updated with this agreement measure.

Algorithm 1 Dynamic routing using 3D convolution
 1: procedure ROUTING(Φ^l, i, c^{l+1}, n^{l+1})
 2:   Require: Φ^l ∈ R^{(w^l, w^l, c^l, n^l)}, i and c^{l+1}, n^{l+1}
 3:   Φ̃^l ← Reshape(Φ^l) ∈ R^{(w^l, w^l, c^l × n^l, 1)}
 4:   V ← Conv3D(Φ̃^l, Ψ^l)        ⊲ intermediate votes, Eq. 1
 5:   Ṽ ← Reshape(V) ∈ R^{(w^{l+1}, w^{l+1}, n^{l+1}, c^{l+1}, c^l)}
 6:   B ← 0 ∈ R^{(w^{l+1}, w^{l+1}, c^{l+1}, c^l)}
      Let p ∈ [w^{l+1}], q ∈ [w^{l+1}], r ∈ [c^{l+1}] and s ∈ [c^l]
 7:   for i iterations do
 8:     for all p, q, r, s:  k_{pqrs} ← softmax3D(b_{pqrs})
 9:     for all p, q, r:    S_{pqr} ← Σ_s k_{pqrs} · Ṽ_{pqrs}
10:     for all p, q, r:    Ŝ_{pqr} ← squash3D(S_{pqr})
11:     for all p, q, r, s: b_{pqrs} ← b_{pqrs} + Ŝ_{pqr} · Ṽ_{pqrs}
12:   return Φ^{l+1} = Ŝ
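The routing loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under an assumed vote layout of (p, q, r, s, m); the Conv3D vote step of line 4 is omitted, and all names are ours:

import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Eq. 4: scales the norm into [0, 1) while keeping the orientation.
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax_3d(b):
    # Eq. 3: normalize logits over all (p, q, r) positions for each s.
    e = np.exp(b - b.max(axis=(0, 1, 2), keepdims=True))
    return e / e.sum(axis=(0, 1, 2), keepdims=True)

def route_3d(v, iters=3):
    # v: votes, assumed layout (p, q, r, s, m) =
    #    (w^{l+1}, w^{l+1}, c^{l+1}, c^l, n^{l+1})
    p, q, r, s, m = v.shape
    b = np.zeros((p, q, r, s))                           # logits B
    s_hat = None
    for _ in range(iters):
        k = softmax_3d(b)                                # coupling coefficients
        s_sum = np.einsum("pqrs,pqrsm->pqrm", k, v)      # line 9
        s_hat = squash(s_sum)                            # line 10
        b = b + np.einsum("pqrm,pqrsm->pqrs", s_hat, v)  # line 11, agreement
    return s_hat                                         # Phi^{l+1}

v = np.random.randn(6, 6, 32, 32, 8)
print(route_3d(v).shape)  # (6, 6, 32, 8)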
We iterate through the proposed routing algorithm i times, where we empirically set i = 3 following [19]. After the iterations, the output of layer l + 1, Φ^{l+1}, is obtained from Ŝ.

3.2. DeepCaps Architecture

Even though the architecture proposed by [19] performs well on MNIST, Fashion-MNIST [25] and similar datasets, its performance on CIFAR10 and other datasets containing complex objects can be considered sub-par. This is due to the fact that MNIST images can be easily classified with low-level features such as edges and blobs, while CIFAR10 images require a high-level understanding of features. Thus, in this paper we propose a novel deep capsule architecture which contains 16 convolutional capsule layers and a fully connected capsule layer. However, going deep with capsule networks poses several challenges, which we discuss and attempt to solve by proposing customized layers below.

In the first few layers of the network, where the feature-map space is large, routing is computationally expensive at the start; hence, we keep the number of routing iterations at one in the first few layers. We need to stack layers to build a deep capsule network. However, since all the operations are required to be in capsule form, stacking plain convolutional layers is not useful, as they produce their outputs as scalar feature maps. Therefore, in order to address these requirements, we propose the ConvCaps layer, which is similar to a convolutional layer except that its outputs are squashed 4D tensors. We use the ConvCaps layer where i = 1, and for any i > 1 we use the ConvCaps3D layer.

Let Φ^l ∈ R^{(w^l, w^l, c^l, n^l)} be the input to the ConvCaps layer and Φ^{l+1} ∈ R^{(w^{l+1}, w^{l+1}, c^{l+1}, n^{l+1})} be the output of layer l. Here, w^{l+1} is obtained from the convolutional strides and padding (refer to Eq. 2). First, Φ^l is reshaped into (w^l, w^l, c^l × n^l) and convolved with c^{l+1} × n^{l+1} filters, producing c^{l+1} × n^{l+1} feature maps of width
and height (w^{l+1}, w^{l+1}). This is then reshaped into a (w^{l+1}, w^{l+1}, c^{l+1}, n^{l+1})-shaped tensor Φ^{l+1}, and the squash function is applied to the capsules. This helps us convert the feature maps into the capsule domain. In [19], when i = 1, the predictions are an equally weighted sum of the votes. The convolution operation is an alternative way of achieving this, except that it gives a weighted sum of the input capsules to predict the next-layer votes. Further, when i is set to a value greater than 1, the ConvCaps3D layer is used with the 3D-convolution-based dynamic routing of Algorithm 1.
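As an illustration, a ConvCaps forward pass for i = 1 might look as follows in TensorFlow; this is a sketch under assumed sizes, and the kernel size, stride and names are ours, not the exact DeepCaps configuration:

import tensorflow as tf

def squash(s, axis=-1, eps=1e-8):
    # Eq. 4 applied along the capsule dimension.
    sq = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / tf.sqrt(sq + eps)

def conv_caps(phi, c_next, n_next, kernel=3, stride=2):
    # phi: (batch, w, w, c, n) capsule tensor; with i = 1 a plain strided
    # convolution plays the role of the equally weighted vote sum.
    _, w, _, c, n = phi.shape
    x = tf.reshape(phi, (-1, w, w, c * n))
    x = tf.keras.layers.Conv2D(c_next * n_next, kernel,
                               strides=stride, padding="same")(x)
    w_next = x.shape[1]
    return squash(tf.reshape(x, (-1, w_next, w_next, c_next, n_next)))

phi = tf.random.normal((1, 16, 16, 8, 8))
print(conv_caps(phi, c_next=8, n_next=16).shape)  # (1, 8, 8, 8, 16)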
In order to reshape the ConvCaps outputs, we introduce FlatCaps, which remove the spatial relationship between adjacent capsules in ConvCaps layer l, while keeping the part-whole relationships between the capsules in ConvCaps layer l and the FC caps layer l + 1. Thus, FlatCaps take a (w^l, w^l, c^l, n^l)-shaped tensor and reshape it into an (a^l, n^l)-shaped matrix, where a^l = w^l × w^l × c^l.
FC caps are similar to the fully connected layers in deep neural networks. Here, Φ^l ∈ R^{(a^l, n^l)} is mapped into Φ^{l+1} ∈ R^{(a^{l+1}, n^{l+1})}. Each capsule in Φ^l is transformed into a capsule in Φ^{l+1} by a transformation matrix W_{i,j} ∈ R^{n^l × n^{l+1}}, and the W_{i,j} are learned during the training process via backpropagation.
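A minimal NumPy sketch of these two steps — the FlatCaps reshape followed by the FC caps transformation — under assumed sizes (all names and shapes are ours, and the subsequent routing step is only indicated):

import numpy as np

w, c, n = 4, 8, 8                   # assumed ConvCaps output: (w, w, c, n)
a_next, n_next = 10, 32             # output (class) capsules and their dimension

conv_out = np.random.randn(w, w, c, n)

# FlatCaps: drop the spatial arrangement, keep the capsules themselves.
phi = conv_out.reshape(w * w * c, n)            # (a_l, n_l), a_l = w * w * c

# FC caps: every input capsule i votes for every output capsule j via W_{i,j}.
a_l, n_l = phi.shape
W = np.random.randn(a_l, a_next, n_l, n_next)   # learned by backpropagation
votes = np.einsum("in,ijnm->ijm", phi, W)       # (a_l, a_next, n_next)
# The votes are then combined by the dynamic routing of [19] to give Phi^{l+1}.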
With the use of these layers, we build our DeepCaps architecture, as illustrated in Fig. 2. The model contains four main blocks: skip-connected CapsCells, 3D-convolutional CapsCells, a fully connected capsule layer and a decoder network. A skip-connected capsule cell has three ConvCaps layers, with the first layer's output convolved and skip-connected to the last layer's output. The motivation behind skip connections is to reduce vanishing gradients in deep models. In addition, this allows us to route low-level capsules to high-level capsules through the skip connections. We use element-wise addition to join the two capsule layers' outputs after the skip connections. Since the capsules are represented by vectors, channel-wise concatenation was not used, as it duplicates the same capsule; element-wise addition reduces the bias and the susceptibility to noise. Subsequently, we have a cell with a ConvCaps3D layer, where the number of routing iterations is kept at 3.
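As a sketch of the element-wise join described above: both branches keep the capsule layout, so merging a skip connection reduces to a sum of two identically shaped tensors (shapes and names below are our assumptions):

import tensorflow as tf

# (batch, w, w, c, n) capsule tensors; addition merges them without
# duplicating capsules the way concatenation along c would.
caps_skip = tf.random.normal((1, 16, 16, 8, 8))  # first ConvCaps output, convolved
caps_out = tf.random.normal((1, 16, 16, 8, 8))   # third ConvCaps output
cell_out = caps_skip + caps_out                  # element-wise join of the cell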
Then, the ConvCaps outputs are flattened and concatenated with the outputs of the capsules before 3D routing (in CapsCell 3), prior to the dynamic routing. Intuitively, this step helps generalize the model to a broad range of diverse datasets. For example, low-level capsules from cells 1 or 2 would be sufficient for datasets consisting of images with poor information content, such as MNIST, while we need to go deep, up to the 3D ConvCaps capsules, for datasets consisting of images with rich information content, such as CIFAR10. Once all the capsules are collected and concatenated, they are routed to the class capsules via the FC caps layer. Here, the decision making happens, and the input image is encoded into the final capsule vector. Finally, we use a decoder network to reconstruct the input image, as proposed in [19]. However, the decoder proposed in [19] merely consists of two fully connected layers, which cannot properly reconstruct the spatial relationships learned by the capsule network. Hence, we replace the decoder in [19] with a deconvolutional decoder, which is better at reconstructing spatial relationships.

3.3. Loss Function

We use the margin loss [19] as the loss function for DeepCaps. The margin loss enhances the class probability of the true class, while suppressing the class probabilities of the other classes:

    L_k = T_k max(0, m⁺ − ‖v_k‖)² + λ(1 − T_k) max(0, ‖v_k‖ − m⁻)²    (5)

Here, T_k is 1 if the true class is k and zero otherwise. We use m⁺ = 0.9 as the lower bound for the correct class and m⁻ = 0.1 as the upper bound for the incorrect classes, as in Sabour et al. [19]. λ is used to control the effect of gradient backpropagation during the initial phase of training.
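A direct NumPy transcription of Eq. 5 (the helper name and toy values are ours):

import numpy as np

def margin_loss(v_norm, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    # Eq. 5: v_norm holds ||v_k|| per class, T is the one-hot true label.
    L = (T * np.maximum(0.0, m_pos - v_norm) ** 2
         + lam * (1 - T) * np.maximum(0.0, v_norm - m_neg) ** 2)
    return L.sum()

# toy usage: 10 classes, class 3 is the ground truth
v_norm = np.random.rand(10)
T = np.eye(10)[3]
print(margin_loss(v_norm, T, lam=0.2))  # lam starts at 0.2 early in training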
4. Class-Independent Decoder Network

Our decoder network consists of deconvolutional layers [26], which reconstruct the input data by utilizing the instantiation parameters extracted from the DeepCaps model. In comparison with the fully connected decoder of [19], it captures more spatial relationships while reconstructing the images. Further, we use binary cross-entropy as the loss function for improved performance [12].

The existing decoder, which is used as a regularizer for capsule networks, is class dependent. Let P ∈ R^{a×b} contain the activity vectors for all classes, where a is the number of classes in the final class capsule layer and b is the capsule dimension. P is masked by the class with the highest probability, resulting in P̂, as shown in Eq. 6:

    p̂_{i,j} = { p_{i,j},  i = t
              { 0,        i ≠ t    (6)

Here, i ∈ [a], j ∈ [b], and t = argmax_i ‖P_i‖²₂ at the inference stage, while t is the true label at the training stage. The matrix P̂ is vectorized and fed into the decoder network, as illustrated in Fig. 3. This vectorized P̂ ∈ R^{a×b} contains non-zero values only in dimensions t · b to (t + 1) · b and zeros elsewhere. Therefore, the decoder network gets the class information from this dimension-specific distribution, which provides class information to the decoder indirectly, making the decoder class dependent.

Hence, we propose a novel class-independent decoder network, which acts as a better regularizer for the capsule network.
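For illustration, a deconvolutional decoder of this kind could be sketched with tf.keras as below; the layer widths target a 32 × 32 × 1 reconstruction and are our assumptions, not the paper's exact configuration:

import tensorflow as tf

b = 32  # capsule dimension fed to the decoder
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(8 * 8 * 16, activation="relu", input_shape=(b,)),
    tf.keras.layers.Reshape((8, 8, 16)),
    tf.keras.layers.Conv2DTranspose(8, 3, strides=2, padding="same",
                                    activation="relu"),     # 16 x 16
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                    activation="sigmoid"),  # 32 x 32 image
])
decoder.compile(optimizer="adam", loss="binary_crossentropy")  # per [12]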
Apart from regularization, a key advantage of having a decoder network is that it can be utilized for tasks such as data generation [19]. However, a significant limitation of these decoders is the lack of controllability over which physical parameter is captured by which instantiation parameter. For example, if a certain instantiation parameter for a given class causes rotation for that particular class, there is no guarantee that the same instantiation parameter would cause rotation in any other class. As a result, generating data that meets requirements such as the same thickness or skewness across classes is a challenge.

As a solution to these issues, we propose the following procedure: instead of masking the non-predicted class instantiation parameters, we send only P_t ∈ R^{1×b}, as illustrated in Fig. 4. In contrast to the decoder learning procedure in [19], the learning of each instantiation parameter in the proposed method is drawn from the same joint distribution. Hence, the entity encapsulated by any given instantiation parameter, which is learned by the decoder, will be the same irrespective of the image label.
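The difference between the two decoder inputs can be sketched in a few lines of NumPy (sizes and names are assumptions):

import numpy as np

a, b = 10, 32                             # classes x capsule dimension (assumed)
P = np.random.randn(a, b)                 # activity vectors of the class capsules
t = np.argmax(np.sum(P ** 2, axis=1))     # predicted class (Eq. 6, inference)

# Class-dependent decoder input [19]: mask and flatten -> a*b dims (Eq. 6).
P_hat = np.zeros_like(P)
P_hat[t] = P[t]
dec_in_dependent = P_hat.reshape(-1)      # non-zeros only in [t*b, (t+1)*b)

# Class-independent decoder input (ours): only the winning vector, b dims.
dec_in_independent = P[t]                 # P_t in R^{1 x b}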
Further, this procedure helps us to understand the types of variation in the MNIST dataset. For example, the fact that rotation and elongation are dominant variations in the dataset, while localized changes are less dominant among characters, is reflected by the variance of the activity vector. In other words, the instantiation parameters causing rotations have higher variance, whereas those causing localized changes have lower variance.

Figure 2. A four-cell DeepCaps model, with the first three cells using i = 1; in the last cell, 3D-convolution-based dynamic routing is applied.

Figure 3. The decoder network used in [19], which takes all the vectorized masked activity vectors.
5. Experiments and Results

5.1. Implementation

We used the Keras and TensorFlow libraries for the development of DeepCaps. For the training procedure, we used the Adam optimizer [13] with an initial learning rate of 0.001, which is halved every 20 epochs. During the initial phase of training, λ in Eq. 5 is set to 0.2, and it is increased to 0.5 in the latter part of the training. The models were trained on GTX-1080 and V100 GPUs, and weighted-average ensembling was used for the 7-ensemble models reported in Table 1.
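A minimal sketch of this training configuration in tf.keras (names are ours; the λ warm-up is only indicated):

import tensorflow as tf

def lr_schedule(epoch, lr=None):
    # initial learning rate 0.001, halved after every 20 epochs
    return 1e-3 * (0.5 ** (epoch // 20))

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule(0))
lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# lambda in Eq. 5 is similarly warmed up from 0.2 to 0.5 during training.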
5.2. Classification Results

We test our DeepCaps model on several benchmark datasets, CIFAR10 [14], SVHN [18], Fashion-MNIST [25] and MNIST [16], and compare its performance with the existing capsule network architectures. For CIFAR10 and SVHN, we resize the 32 × 32 × 3 images to 64 × 64 × 3; for the other datasets, the original image sizes are used throughout our experiments.

We rescaled the images only for the CIFAR10 and SVHN datasets as a data augmentation, since they have richer high-level features compared to MNIST and F-MNIST. Having 64 × 64 resolution images allows us to add more layers and go deeper in the network.

For the models trained on CIFAR10, DeepCaps has only 7.22 million parameters, while CapsNet [19] has 22.48 million parameters. Still, we achieved 91.01% on CIFAR10 with a single model, whereas CapsNet achieves a 7-ensemble accuracy of 89.40%. We tested both models' inference times on an NVIDIA V100 GPU: CapsNet takes 2.86 ms for a 32 × 32 × 3 image, while our model takes only 1.38 ms for a 64 × 64 × 3 image.

Table 1. Classification accuracies of DeepCaps, CapsNet [19] and other variants of capsule networks, together with state-of-the-art results. We outperform all capsule-domain networks on the CIFAR10, SVHN and Fashion-MNIST datasets, while achieving similar performance on MNIST.

Model                     CIFAR10   SVHN     F-MNIST   MNIST
DenseNet [9]              96.40%    98.41%   95.40%    -
ResNet [6]                93.57%    -        -         99.59%
DPN [1]                   96.35%    -        95.70%    -
Wan et al. [22]           -         -        -         99.79%
Zhong et al. [27]         96.92%    -        96.35%    -
Sabour et al. [19]        89.40%    95.70%   93.60%    99.75%
Nair et al. [17]          67.53%    91.06%   89.80%    99.50%
HitNet [3]                73.30%    94.50%   92.30%    99.68%
DeepCaps                  91.01%    97.16%   94.46%    99.72%
DeepCaps (7-ensembles)    92.74%    97.56%   94.73%    -

Even though our results are slightly below or on par with the state-of-the-art results, they comfortably surpass all the existing capsule network models on the CIFAR10, SVHN and Fashion-MNIST datasets. Compared with the best-performing capsule network implementations, there is a 3.25% improvement on CIFAR10 and a 1.86% improvement on SVHN over the capsule network model proposed in [19]. On the Fashion-MNIST dataset, we outperform the results of HitNet [3] by 1.62%, and on MNIST, DeepCaps produces on-par, state-of-the-art results. Table 1 shows our results in comparison with the existing capsule network results and the state-of-the-art results for the corresponding datasets. We highlight that we were able to achieve near state-of-the-art performance across the datasets, while surpassing the results of all the existing capsule network models.

5.3. Class-Independent Decoder Image Reconstruction

Our class-independent decoder acts as a better regularization term, and it also helps to jointly learn the inter-class reconstruction. Hence, all the instantiation parameters are distributed in the same space. For example, specific variations in a handwritten digit, such as boldness, rotation and skewness, are captured at the same locations of the instantiation parameter vector for all the classes. In other words, if the 7th instantiation parameter is responsible for rotation for class '9', then the same 7th parameter causes rotation in every other class as well. The outputs of the decoder used in [19] are also subject to change under perturbation of the activity vectors; yet, a specific instantiation parameter may cause rotation in the reconstructed output for one class while not being the parameter that causes rotation in another class. This is because the activity vectors are distributed in a dimension-wise separable activity-vector space. Using our class-independent decoder, we can generate data for any class with a certain requirement. For example, if we want to generate bold text, once we find the instantiation parameter responsible for boldness for any class, we can perturb it to generate bold letters across all classes. This cannot be done in [19] unless we know the locations of the instantiation parameters corresponding to boldness for all the classes. See Fig. 5.
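The perturbation experiment behind Figs. 5-7 can be sketched as follows; the trained decoder itself is omitted (only indicated in a comment), and everything beyond the stated range [-0.075, 0.075] and the 28th dimension is an assumption:

import numpy as np

b = 32
p_t = np.random.randn(b) * 0.05        # stand-in for the winning vector P_t
for delta in np.linspace(-0.075, 0.075, 11):
    v = p_t.copy()
    v[27] = delta                      # perturb the 28th instantiation parameter
    # recon = decoder.predict(v[None, :])
    # With our decoder this produces the same change (vertical elongation)
    # for every class; with the decoder of [19] the change is class specific.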
With this class-independent decoder, we can label each instantiation parameter that causes a specific change in the reconstructed images. For the models that we trained, we observed that the 28th parameter always causes vertical elongation, and the 1st parameter is responsible for thickness. Further, we observed that, when we rank these instantiation parameters by variance, the instantiation parameters with higher variance cause global variations such as rotation, elongation and thickness, while parameters with lower variance are responsible for localized changes. See Fig. 6. The instantiation parameter space is not restricted to be orthogonal; hence, a few instantiation parameters may share a common attribute of an image. Yet, the instantiation parameters with higher variance demonstrate clearly separable variations, as illustrated in Fig. 7.

Figure 5. The left half of the images is generated by our decoder network, and the right half by the decoder used in [19]. When the 28th dimension of the activity vector is varied within [-0.075, 0.075], we can clearly observe that all the variations in the left half of the images are the same, namely elongation in the vertical direction. In the right half, the variations differ for each class: for example, '7' is shrunk in the vertical dimension, '1' is elongated in the vertical direction, and '9' shows some rotation.

Figure 7. Perturbations on a single instantiation parameter of the above digit show that high-variance instantiation parameters cause global changes, such as vertical elongation (18), thickness (1) and vertical expansion (30), while low-variance instantiation parameters cause localized changes, such as localized skewness (6).

6. Conclusion

In this paper, we proposed a new deep architecture for capsule networks, termed DeepCaps, drawing intuition from the concepts of skip connections and 3D convolutions. With our class-independent decoder, across all classes, a specific instantiation parameter captures a specific change. This opens up new avenues in practical applications such as data generation.

Furthermore, we were able to obtain better performance on comparatively complex datasets such as CIFAR10, where the CapsNet in [19] did not show significant performance. As future work, we would like to build even deeper models with higher-level understanding and apply them to the ImageNet dataset. The class-independent decoder network also showed potential in data generation applications with specific requirements, such as generating text data with the same style. Further, we hope to investigate eliminating the correlation between the instantiation parameters.

7. Acknowledgement

The authors thank the National Research Council, Sri Lanka (Grant 12-018), and the Faculty of Information Technology of the University of Moratuwa, Sri Lanka, for providing computational resources.
References

[1] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in NIPS, Long Beach, CA, 2017, pp. 4467–4475.
[2] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang, “AdaNet: Adaptive structural learning of artificial neural networks,” in ICML, Sydney, Australia, 2017, pp. 874–883.
[3] A. Deliège, A. Cioppa, and M. Van Droogenbroeck, “HitNet: A neural network with capsules embedded in a hit-or-miss layer, extended with hybrid data augmentation and ghost capsules,” CoRR, 2018.
[4] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, Boston, MA, 2015, pp. 447–456.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, Las Vegas, NV, 2016, pp. 770–778.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, Las Vegas, NV, 2016, pp. 770–778.
[7] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in ICANN, Espoo, Finland, 2011, pp. 44–51.
[8] G. E. Hinton, S. Sabour, and N. Frosst, “Matrix capsules with EM routing,” in ICLR, Vancouver, BC, 2018.
[9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, Honolulu, HI, 2017.
[10] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in ECCV, Amsterdam, Netherlands, 2016, pp. 646–661.
[11] A. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan, “CapsuleGAN: Generative adversarial capsule network,” in ECCV, Munich, Germany, 2018, pp. 526–535.
[12] V. Jayasundara, S. Jayasekara, H. Jayasekara, J. Rajasegaran, S. Seneviratne, and R. Rodrigo, “TextCaps: Handwritten character recognition with very small datasets,” in WACV, Waikoloa Village, HI, 2019, pp. 254–262.
[13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, San Diego, CA, 2015.
[14] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[15] R. LaLonde and U. Bagci, “Capsules for object segmentation,” arXiv preprint, 2018.
[16] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” 1998.
[17] P. Q. Nair, R. Doshi, and S. Keselj, “Pushing the limits of capsule networks,” 2018.
[18] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS, Granada, Spain, 2011.
[19] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in NIPS, Long Beach, CA, 2017, pp. 3856–3866.
[20] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in NIPS, Montreal, QC, 2015, pp. 2377–2385.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, Boston, MA, 2015, pp. 1–9.
[22] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using DropConnect,” in ICML, 2013, pp. 1058–1066.
[23] D. Wang and Q. Liu, “An optimization view on dynamic routing between capsules,” 2018.
[24] E. Xi, S. Bing, and Y. Jin, “Capsule network performance on complex data,” CoRR, 2017.
[25] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms,” CoRR, 2017.
[26] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in CVPR, San Francisco, CA, 2010, pp. 2528–2535.
[27] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” CoRR, 2017.