1. INTRODUCTION
In 1986, Geoffrey E. Hinton revolutionized artificial neural networks with the use of the backpropagation algorithm. Since then, artificial intelligence has made a big leap forward, especially in computer vision. Deep Learning gained in popularity when deep convolutional networks performed extraordinarily well on the ImageNet challenge in 2012 [10]. It has since been a very active field of research, with companies like Google and Microsoft dedicated to it. This brought forth ideas like Inception [23] and Residual blocks [5], boosting the performance of CNNs. However, all of these advancements build upon the basic structure of a CNN.

Based on his research in human vision, Geoffrey E. Hinton stated that there is something fundamentally wrong with CNNs [7]. By trying to replicate the human visual cortex, he came up with the idea of a capsule as a group of neurons. In "Dynamic routing between capsules" [20], Hinton et al. developed a first working implementation, proving this theory.

In computer graphics, a scene is built by putting known parts into relation, forming a more complex object. Inverse graphics does the exact opposite: the scene is deconstructed into its parts and their relationships. The main goal of the Capsule Network is to be capable of performing inverse graphics [7]. To achieve this, Hinton proposed to encode the idea of an entity inside a neural network, a capsule [7].

In the following, we first explain the basic structure of a Convolutional Neural Network and its possible shortcomings in Section 2. In Sections 3.1 and 3.2, we describe the architecture of the Capsule Network.

2. CONVOLUTIONAL NEURAL NETWORKS

Figure 1: Basic architecture of a CNN; figure from [15]

Fig. 1 shows the basic architecture of a Convolutional Neural Network. The network first extracts learned features, which are then fed through a fully connected neural network that produces a classification. The network learns features by chaining together convolutional blocks. Such a block consists of a convolutional layer, an activation function and a pooling layer. The convolutional layer learns multiple simple features, also called kernels. To learn more complex, non-linear problems, the output is fed through a non-linear activation function (e.g. ReLU). To connect the blocks together, the outputs of the previous block need to be routed to the next block's inputs. The most commonly used routing algorithm is max pooling (Fig. 2).

Figure 2: Max pooling example; figure from [2]

CNNs try to make the neural activities invariant to small changes in the viewpoint by pooling them together [7]. This is helpful for classification tasks, since the label should be the same no matter where the object is (spatial invariance).

Figure 4: Both images are classified as "face" by a CNN; figure from [9]
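To make the pooling step mentioned above concrete, the following is a minimal sketch of 2 × 2 max pooling in Python/NumPy (the input values are made up for illustration):

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """Downsample a 2D feature map by keeping the maximum of each 2x2 window."""
    h, w = feature_map.shape
    # Group the map into non-overlapping 2x2 blocks, then take the maximum per block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 5, 7],
                        [1, 1, 3, 2]])
print(max_pool_2x2(feature_map))
# [[6 2]
#  [2 7]]
```

Only the strongest activation in each window survives; the exact position of the feature inside the window is discarded, which is precisely the source of the spatial invariance discussed above.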
3. CAPSULE NETWORKS
Hinton's basic idea was to create a neural network capable of inverse graphics. In other words, the network should be able to deconstruct a scene into co-related parts. To achieve this, the architecture of a neural network needs to be changed to reflect the idea of an entity. Every entity gets its own part of the network, encapsulating a number of neurons. This entity is called a capsule.

3.1 The capsule
A normal layer of neurons is divided into many capsules, which in turn contain the neurons [20] (see Sec. 3.2).

Figure 5: A capsule and a neuron in comparison [8]

A capsule is therefore a wrapper around a dedicated group of neurons. Fig. 5 shows a simplified comparison between a capsule and a neuron. A neuron computes a scalar value from a list of scalar values. Since a capsule essentially wraps a group of neurons, it computes a vector from a list of input vectors. It is thereby able to encode entity parameters like location, skew, etc. [20]. However, this also means that it no longer represents the probability for the existence of a feature. Instead, the length of the output vector can be used as the probability of feature existence, without losing the important pose information. Furthermore, this also enables the network to learn the parameters by itself, removing the need to craft them by hand. This means that an n-dimensional (nD) capsule can learn n parameters and outputs an n-dimensional vector.

For the output vector to model a probability, its length has to stay between 0 and 1. Normal activation functions like ReLU only work on scalar values, hence a novel non-linear squashing function (Eq. 1) was introduced:

v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}    (1)
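A minimal NumPy sketch of the squashing function in Eq. 1 (function and variable names are our own):

```python
import numpy as np

def squash(s: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Squash a capsule input vector s_j to a length between 0 and 1 (Eq. 1),
    while preserving its orientation (the pose information)."""
    squared_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    norm = np.sqrt(squared_norm + eps)
    return (squared_norm / (1.0 + squared_norm)) * (s / norm)

s_j = np.array([2.0, 1.0, 2.0])      # example 3D capsule input with length 3
v_j = squash(s_j)
print(np.linalg.norm(v_j))           # 0.9: long inputs map close to 1
```

Short vectors are shrunk towards length 0 and long vectors saturate just below 1, which is what allows the length to be read as a probability.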
To understand how the inputs of a capsule are combined into s_j, we will now look into the architecture of the Capsule Network presented in [20].

3.2 Architecture

3.2.2 PrimaryCaps
The next layer consists of 32 channels, each channel being a 6 × 6 grid of so-called primary capsules. They serve as the transition from the scalar values of the convolution to 8D vector outputs. The primary capsules can be seen as another convolutional layer with a 9 × 9 × 256 kernel, just with squashing as its activation function.
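A sketch of how the PrimaryCaps stage can be interpreted, assuming the convolution has already produced a 256 × 6 × 6 feature map (the shapes follow the description above; the reshape details are our own simplification):

```python
import numpy as np

# Hypothetical output of the preceding convolutional layer: 256 channels on a 6 x 6 grid.
conv_out = np.random.randn(256, 6, 6)

# Interpret the 256 scalar channels as 32 capsule channels of 8 values each,
# giving 32 * 6 * 6 = 1152 primary capsules with an 8D output vector.
primary = conv_out.reshape(32, 8, 6, 6)       # (capsule channels, capsule dim, height, width)
primary = primary.transpose(0, 2, 3, 1)       # -> (32, 6, 6, 8)
u = primary.reshape(-1, 8)                    # -> (1152, 8)

# Apply the squashing function of Eq. 1 to each 8D vector.
sq = np.sum(u ** 2, axis=-1, keepdims=True)
u = (sq / (1.0 + sq)) * u / np.sqrt(sq + 1e-8)
print(u.shape)                                # (1152, 8)
```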
3.2.3 DigitCaps
Following is the DigitCaps layer, which is fully connected to the primary capsules. These are now pure 16D capsules getting their inputs from the previous primary capsules. The weight matrix W_ij transforms the 8D output u_i of primary capsule i into a 16D prediction vector \hat{u}_{j|i}, which serves as input for digit capsule j (Eq. 2).

s_j = \sum_i c_{ij} \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i    (2)

Therefore each digit capsule receives as input s_j a weighted sum over all 32 × 6 × 6 transformed primary capsule outputs. Instead of using pooling, the new technique of routing-by-agreement is used to focus on the most important inputs. This is discussed in Section 3.3.
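A NumPy sketch of Eq. 2 with the shapes described above (1152 primary capsules, 10 digit capsules); the uniform coupling coefficients are only a stand-in here, since the actual c_ij come from the routing described in Section 3.3:

```python
import numpy as np

num_primary, num_digit, in_dim, out_dim = 32 * 6 * 6, 10, 8, 16

u = np.random.randn(num_primary, in_dim)                      # primary capsule outputs u_i
W = np.random.randn(num_primary, num_digit, out_dim, in_dim)  # one 16x8 matrix W_ij per pair (i, j)

# Prediction vectors u_hat[i, j] = W_ij @ u_i   (Eq. 2, right)
u_hat = np.einsum('ijkd,id->ijk', W, u)                       # (1152, 10, 16)

# With uniform coupling coefficients c_ij (no routing yet), Eq. 2 (left):
c = np.full((num_primary, num_digit), 1.0 / num_digit)
s = np.einsum('ij,ijk->jk', c, u_hat)                         # (10, 16) digit capsule inputs s_j
print(s.shape)
```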
3.2.4 Class predictions
The ten 16D output vectors correspond to the digits 0-9 (10 classes). Because of the squashing function, the length of each vector can be used directly as the probability of the corresponding class. Hence no additional fully connected layers are needed for classification (compare with Fig. 1).
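Reading off the class prediction then amounts to taking vector lengths (a small sketch; digit_caps stands in for the ten squashed 16D output vectors):

```python
import numpy as np

digit_caps = np.random.rand(10, 16) * 0.02    # placeholder: nine weakly active capsules
digit_caps[3] = 0.225                         # capsule for digit 3, vector length ~0.9

class_probs = np.linalg.norm(digit_caps, axis=-1)   # length of each output vector = class probability
print(int(np.argmax(class_probs)))                  # 3
```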
3.3 Routing-by-agreement
Routing-by-agreement is a novel dynamic routing technique. In contrast to pooling, the routing happens at runtime. The goal of this technique is to redirect a previous capsule's output to the following capsule where it agrees with the other inputs. In the scope of inverse graphics, this can be compared to routing a detected nose to the face capsule and not to the car capsule. A detected nose, eye and mouth agree with each other in the face capsule, while the nose would not agree with a wheel and a door in the car capsule.

This works because of "coincidence filtering": in a high-dimensional space - in this case the parameter space - it is very unlikely for agreeing inputs to lie close to one another by chance. A cluster of agreements can therefore, in a probabilistic sense, not be a coincidence.

In simple terms, the algorithm finds the mean vector of the cluster (s_j), weighs all inputs based on their distance to this mean (b_ij) and normalizes these weights with the "routing softmax" (c_ij) in Eq. 3.

c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}    (3)
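The following is a simplified sketch of this routing loop, along the lines of the dynamic routing described in [20], using the dot product between a digit capsule's squashed output and each prediction vector as the agreement score (variable names and the number of iterations are our own choices):

```python
import numpy as np

def squash(s, eps=1e-8):
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat: np.ndarray, iterations: int = 3) -> np.ndarray:
    """u_hat: prediction vectors of shape (num_primary, num_digit, dim)."""
    b = np.zeros(u_hat.shape[:2])                              # initial logits b_ij
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # routing softmax (Eq. 3)
        s = np.einsum('ij,ijk->jk', c, u_hat)                  # weighted sum s_j (Eq. 2)
        v = squash(s)                                          # squashed digit capsule outputs v_j
        b += np.einsum('ijk,jk->ij', u_hat, v)                 # raise b_ij where prediction agrees with v_j
    return v

u_hat = np.random.randn(1152, 10, 16)    # e.g. the prediction vectors from the DigitCaps sketch
print(route(u_hat).shape)                # (10, 16)
```

Inputs that agree with the emerging cluster get ever larger coupling coefficients, while disagreeing inputs are effectively routed away.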
3.6 Matrix Capsules with EM routing
Hinton et al. published another paper, currently under open review, called "Matrix Capsules with EM routing" [6]. They propose to use an EM algorithm [6] instead of the routing algorithm (Algorithm 1) from [20]. Additionally, they changed the capsules to use a 4 × 4 pose matrix instead of a vector. Such a matrix is used in computer graphics to compute the scene as it would be seen through a virtual camera; this is called the viewport. Since the network is able to learn this matrix, it is able to become viewport invariant [6].

They tested this new network on the smallNORB dataset (Fig. 11) and outperformed the current state-of-the-art CNN by 45%, reducing the error rate from 2.56% to 1.4% [6]. Furthermore, they conducted an experiment in which the network was trained only on specific viewpoints and tested on unseen viewpoints [6]. Both networks were trained to the same error of 3.7% on seen viewpoints. While the baseline CNN's error increased to 20% on unseen viewpoints, the Capsule Network still achieved 13.5%. Based on these results, CapsNets seem to generalize better than CNNs, being able to adapt to 3D viewpoints in 2D images.
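To illustrate the role of such a 4 × 4 pose matrix, here is a small sketch of how a homogeneous transformation matrix maps a 3D point into a new viewpoint, as used in computer graphics (the concrete rotation and translation values are made up):

```python
import numpy as np

theta = np.deg2rad(30)                           # rotate the viewpoint by 30 degrees around the z-axis
pose = np.array([
    [np.cos(theta), -np.sin(theta), 0.0, 1.0],   # rotation part plus a translation of (1, 0, 2)
    [np.sin(theta),  np.cos(theta), 0.0, 0.0],
    [0.0,            0.0,           1.0, 2.0],
    [0.0,            0.0,           0.0, 1.0],
])

point = np.array([1.0, 0.0, 0.0, 1.0])           # a 3D point in homogeneous coordinates
print(pose @ point)                              # the same point seen from the new viewpoint
```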
3.7.3 Better generalization to new viewpoints
CNNs memorize that an object can be viewed from different viewpoints. This requires the network to "see" all possible transformations during training. Capsule Networks, however, generalize better to new viewpoints, because the parameter information of a capsule can capture these viewpoints as mere linear transformations [7]. Therefore CapsNets are not as prone to misclassifying unseen data, as shown in Sec. 3.6.

3.7.4 Defense against white-box adversarial attacks
Common attacks on CNNs use the Fast Gradient Sign Method (FGSM). It evaluates the gradient of the network's loss with respect to each pixel. The pixels are then changed marginally to maximize the loss without visibly distorting the original image. This method can drop the accuracy of CNNs to below 20%, whereas Capsule Networks maintain an accuracy of over 70% [4].
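A minimal sketch of the FGSM perturbation step (the gradient is assumed to have been computed by some framework beforehand; epsilon and the helper name are our own):

```python
import numpy as np

def fgsm_perturb(image: np.ndarray, grad: np.ndarray, epsilon: float = 0.05) -> np.ndarray:
    """Shift every pixel a small step in the direction that increases the loss.

    image: input image with values in [0, 1]
    grad:  gradient of the network's loss with respect to the image pixels
    """
    adversarial = image + epsilon * np.sign(grad)
    return np.clip(adversarial, 0.0, 1.0)          # keep the result a valid image

# Toy usage with random placeholders for the image and its loss gradient.
image = np.random.rand(28, 28)
grad = np.random.randn(28, 28)
adv = fgsm_perturb(image, grad)
print(np.abs(adv - image).max())                   # perturbation bounded by epsilon
```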
3.7.5 Validatable
A problem for industry usage of CNNs is their black-box behaviour: it is neither predictable how a CNN will perform on new data, nor can its performance be properly analyzed and understood. Because Capsule Networks build upon the concept of inverse graphics, the network's reasoning can be explained considerably better than that of CNNs. Shahroudnejad