
Dynamic Routing Between Capsules

3 min read · May 15, 2018


[Link](https://arxiv.org/abs/1710.09829)

* The goal of capsules is to add equivariance (rotational, color, etc.) to deep learning models
* The main idea of capsules is to represent a neuron with a vector value instead of a scalar value. Each capsule represents an object. The dimensions of the vector represent what Hinton calls “instantiation parameters” of the object, which are essentially its aesthetic properties: pose, deformation, hue, texture, etc. The magnitude of the vector represents whether the object is present. The number of dimensions of the vector increases with depth, since higher layers represent higher-level objects.
* There is an assumed hierarchy of objects represented by capsules. At lower levels, capsules represent low-level objects such as a nose, eyes, or a mouth. A higher-level capsule would represent a face.
* If a higher-level capsule turns on, then the evidence from the lower-level capsules must be consistent. E.g. a horizontal mouth and a diagonal nose do not make a face. If the evidence is not consistent, the higher-level capsule is pruned as a hypothesis. This is the essence of the algorithm.

The forward pass is described below for two capsules: $i$ in layer $L$ and $j$ in layer $L+1$.
$$u_i: \text{output of the capsule $i$ below}$$
$$\hat u_{j|i}: \text{prediction vector (predicted instantiation of object $j$)}$$
$$c_{ij}: \text{coupling coefficient}$$
$$b_{ij}: \text{unnormalized coupling coefficient}$$
$$s_j : \text{weighted instantiation of object $j$}$$
$$v_j : \text{output of capsule $j$}$$
$$W_{ij}: \text{weights of the neural network, learned by backpropagation as usual}$$

* The idea is that there is an outer loop learning parameters $W_{ij}$, which make proposals of higher-level objects based on lower-level objects (e.g. $W_{ij}$ tells you how to propose a 45-degree-tilted mouth given two 45-degree-tilted parallel lines). There is also an inner loop checking whether the learned proposals are consistent (e.g. if two lines say a mouth is tilted 90 degrees and two other lines say the same mouth is tilted 50 degrees…).
* $\hat u_{j|i} = W_{ij} u_i$ is capsule $i$’s proposed state of capsule $j$ and is fixed during the inner loop
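To make the outer/inner split concrete, here is a minimal NumPy sketch of the prediction-vector step. The capsule counts and dimensions are illustrative choices, not values from the paper.

```python
import numpy as np

# Illustrative sizes (not from the paper): 6 lower capsules of dimension 8,
# 3 higher capsules of dimension 16.
num_in, dim_in = 6, 8
num_out, dim_out = 3, 16

rng = np.random.default_rng(0)
u = rng.normal(size=(num_in, dim_in))                    # u_i: outputs of the lower capsules
W = rng.normal(size=(num_in, num_out, dim_out, dim_in))  # W_ij: learned by backprop in the outer loop

# u_hat[i, j] = W_ij @ u_i: capsule i's prediction of capsule j's instantiation.
# These stay fixed while the inner routing loop runs.
u_hat = np.einsum('ijkl,il->ijk', W, u)                  # shape (num_in, num_out, dim_out)
```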

The inner loop consists of the following:
$$c_{ij} = \frac{e^{b_{ij}}}{\sum_k e^{b_{ik}}}$$
$$s_j = \sum_i c_{ij} \hat u_{j|i}$$
$$v_j = \frac{||s_j||^2}{1+||s_j||^2} \left(\frac{s_j}{||s_j||}\right)$$
$$a_{ij} = v_j \cdot \hat u_{j|i}$$
$$b_{ij} = b_{ij} + a_{ij}$$

1. quite simply: each capsule in the lower layer gets a limited amount of voting power in a democratic process to decide which objects exist in the next layer. $c_{ij}$ is the amount of “votes” capsule $i$ devotes to its proposal of capsule $j$. Naturally this is computed by considering all other capsules $k$ that $i$ can contribute to in the next layer.
2. capsule $j$ receives a ballot of votes from capsules in the lower layer and combines them into a weighted sum.
3. the input $s_j$ to capsule $j$ is passed through a regularized sigmoid-like squashing function. The scaling $\frac{x^2}{1+x^2}$ has the flavor of a sigmoid: when $s_j$ is large, the output is squashed toward length one; when it is small, it keeps its small magnitude.
4. by dotting the community proposal $v_j$ against individual proposals $\hat u_{j|i}$, we see how well each lower capsule’s proposal (eyes at 30 degrees) agrees with the community (eyes at 70 degrees). If there is good agreement between community and individual, the individual is sucked into voting for this object proposal by driving $b_{ij}$ up.
5. The algorithm is repeated until convergence in an inner loop.
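Putting the five steps together, a minimal NumPy sketch of the inner routing loop (continuing from the `u_hat` computed above) might look like the following. The fixed number of iterations stands in for “until convergence”; the paper uses a small fixed number of routing iterations.

```python
import numpy as np  # continues from the u_hat sketch above

def squash(s, eps=1e-8):
    # Step 3: squashing nonlinearity -- long vectors are pushed toward unit length,
    # short vectors keep their small magnitude.
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def route(u_hat, num_iters=3):
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))        # b_ij: unnormalized coupling coefficients
    for _ in range(num_iters):
        # Step 1: each lower capsule splits its voting power over the higher capsules
        # (softmax over the next-layer capsules, shifted for numerical stability).
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Step 2: each higher capsule takes a weighted sum of the predictions it receives.
        s = np.einsum('ij,ijk->jk', c, u_hat)
        # Step 3: squash so the magnitude can act like a presence probability.
        v = squash(s)
        # Steps 4-5: agreement between the community proposal v_j and each
        # individual proposal u_hat_{j|i} drives b_ij up.
        b = b + np.einsum('jk,ijk->ij', v, u_hat)
    return v                               # v_j: outputs of the higher capsules

v = route(u_hat)
print(v.shape)   # (num_out, dim_out) = (3, 16)
```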

* the final layer contains N capsules where N is the number of classification classes. The magnitude of each capsule represents the probability of that class existing in the image.
* in the CNN framework, the routing between two capsule layers requires $[\text{vec}_1 \cdot \text{vec}_2] \cdot [\text{chan}_1 \cdot \text{chan}_2] \cdot [k \cdot k]$ weights, where $\text{vec}$, $\text{chan}$, and $k$ are the capsule dimension, number of channels, and filter size of the layers. This is typically on the order of $10^6$ (see the worked count after this list).
* The authors showed moderate robustness to affine transformations despite not training on data with affine transformations (why?)
* Also moderate improvements over CNN on overlapping digits
* The problem is always, given the capacity to model equivariance, why would the model learn it?
* e.g. why wouldn’t two images with different hue use 2 capsules? Answer: regularization (but how would you make sure the model takes the simplest solution possible?)
* they demonstrate equivariance by reconstructing the original image from the last layer of capsules. They also train on this reconstruction loss to make sure capsules learn instantiation parameters.
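As a rough sanity check of the weight count above (the numbers here are illustrative, chosen to be in the ballpark of the paper’s architecture rather than taken from it): with 8- and 16-dimensional capsules, 32 channels in each layer, and a 3×3 filter,

$$[8 \cdot 16] \cdot [32 \cdot 32] \cdot [3 \cdot 3] = 128 \cdot 1024 \cdot 9 \approx 1.2 \times 10^6$$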

Written by Kevin Shen, MSc. at University of Toronto
