Capsule neural network
A capsule neural network (CapsNet) is a type of artificial neural network (ANN) designed to better model hierarchical relationships. The approach attempts to mimic biological neural organization more closely.[1]
The idea is to add structures called "capsules" to a convolutional neural network (CNN), and to reuse the output from several of those capsules to form more stable (with respect to various perturbations) representations for higher-order capsules.[2] The output is a vector consisting of the probability of an observation and a pose for that observation, similar to the output produced when performing classification with localization in CNNs.
Among other benefits, capsnets address the "Picasso problem" in image recognition: images that have all the right parts but not in the correct spatial relationship (e.g., a "face" in which the positions of the mouth and one eye are switched). For image recognition, capsnets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level.[3] This can be compared to inverting the rendering of an object composed of multiple parts.[4]
History
In 2000, Geoffrey Hinton et al. described an imaging system that combined segmentation and recognition into a single inference process using parse trees. So-called credibility networks described the joint distribution over the latent variables and over the possible parse trees. That system proved useful on the MNIST handwritten digit database.[4]
A dynamic routing mechanism for capsule networks was introduced by Hinton and his team in 2017. The approach was claimed to reduce error rates on MNIST and to reduce training set sizes. Results were claimed to be considerably better than a CNN on highly overlapped digits.[1]
In Hinton's original idea one minicolumn would represent and detect one multidimensional entity.[5][note 1]
Transformations
An invariant is an object property that does not change as a result of some transformation. For example, the area of a circle does not change if the circle is shifted to the left.
Informally, an equivariant is a property that changes predictably under transformation. For example, the center of a circle moves by the same amount as the circle when shifted.[6]
A nonequivariant is a property whose value does not change predictably under a transformation. For example, transforming a circle into an ellipse means that its perimeter can no longer be computed as π times the diameter.
In computer vision, the class of an object is expected to be an invariant over many transformations. I.e., a cat is still a cat if it is shifted, turned upside down or shrunken in size. However, many other properties are instead equivariant. The volume of a cat changes when it is scaled.
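To make the distinction concrete, the following NumPy sketch (an illustration, not from the cited sources) contrasts an invariant quantity (the internal distances of a shape) with an equivariant one (its centroid) under translation:

```python
import numpy as np

# Three 2-D points forming a triangle.
shape = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.5]])
translated = shape + np.array([5.0, -3.0])  # shift the whole shape

def pairwise_distances(pts):
    diff = pts[:, None, :] - pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Invariant: internal distances are unchanged by the translation.
assert np.allclose(pairwise_distances(shape), pairwise_distances(translated))

# Equivariant: the centroid moves by exactly the translation vector.
assert np.allclose(translated.mean(axis=0), shape.mean(axis=0) + np.array([5.0, -3.0]))
```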
Equivariant properties such as a spatial relationship are captured in a pose, data that describes an object's translation, rotation, scale and reflection. Translation is a change in location in one or more dimensions. Rotation is a change in orientation. Scale is a change in size. Reflection is a mirror image.[1]
Unsupervised capsnets learn a global linear manifold between an object and its pose as a matrix of weights. In other words, capsnets can identify an object independent of its pose, rather than having to learn to recognize the object while including its spatial relationships as part of the object. In capsnets, the pose can incorporate properties other than spatial relationships, e.g., color (cats can be of various colors).
Multiplying the object's representation by the manifold (the weight matrix) poses the object in space.[7]
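As a minimal sketch of what a pose can encode, assume a 2-D pose represented as a rotation plus translation in homogeneous coordinates (the names and shapes here are illustrative assumptions, not from the cited papers):

```python
import numpy as np

def pose_matrix(theta, tx, ty):
    """Homogeneous 2-D pose: rotation by theta, then translation by (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# A canonical "part" described by its corner points (homogeneous coordinates).
part = np.array([[0.0, 1.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0, 1.0]])

# Applying the pose places the part in the scene. At this level the
# viewpoint change is linear, even though its effect on pixels is not.
posed_part = pose_matrix(np.pi / 4, 2.0, 3.0) @ part
```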
Pooling
Capsnets reject the pooling layer strategy of conventional CNNs that reduces the amount of detail to be processed at the next higher layer. Pooling allows a degree of translational invariance (it can recognize the same object in a somewhat different location) and allows a larger number of feature types to be represented. Capsnet proponents argue that pooling:[1]
- violates biological shape perception in that it has no intrinsic coordinate frame;
- provides invariance (discarding positional information) instead of equivariance (disentangling that information);
- ignores the linear manifold that underlies many variations among images;
- routes statically instead of communicating a potential "find" to the feature that can appreciate it;
- damages nearby feature detectors, by deleting the information they rely upon.
Capsules
A capsule is a set of neurons that individually activate for various properties of a type of object, such as position, size and hue. Formally, a capsule is a set of neurons that collectively produce an activity vector with one element for each neuron to hold that neuron's instantiation value (e.g., hue).[1] Graphics programs use instantiation values to draw an object; capsnets attempt to derive these values from their input. The probability of the entity's presence in a specific input is the vector's length, while the vector's orientation quantifies the capsule's properties.[1][3]
Artificial neurons traditionally output a scalar, real-valued activation that loosely represents the probability of an observation. Capsnets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement.[1]
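To make the vector-output idea concrete, here is an illustrative sketch (the values and names are assumptions, not from the paper) in which a capsule's activity vector encodes presence as its length and properties as its direction:

```python
import numpy as np

# A hypothetical capsule's activity vector: one element per neuron,
# e.g., encoding the position, size and hue of a detected entity.
activity = np.array([0.4, -0.2, 0.7, 0.1])

presence_probability = np.linalg.norm(activity)   # length ~ probability
pose_direction = activity / presence_probability  # orientation ~ properties

# The "squash" nonlinearity introduced below keeps the length under 1,
# so it can be read as a probability.
```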
Because capsules are independent, the probability of correct detection is much higher when multiple capsules agree. A minimal cluster of two capsules considering a six-dimensional entity would agree within 10% by chance only once in a million trials (0.1^6 = 10^-6, if agreement in each dimension is independent). As the number of dimensions increases, the likelihood of a chance agreement across a larger cluster with higher dimensions decreases exponentially.[1]
Capsules in higher layers take outputs from capsules at lower layers, and accept those whose outputs cluster. A cluster causes the higher capsule to output a high probability that an entity is present, along with a high-dimensional (20-50+) pose.[1]
Higher-level capsules ignore outliers, concentrating on clusters. This is similar to the Hough transform, the randomized Hough transform (RHT) and RANSAC from classical digital image processing.[1]
Routing by agreement
The outputs from one capsule (child) are routed to capsules in the next layer (parent) according to the child's ability to predict the parents' outputs. Over the course of a few iterations, each parent's outputs may converge with the predictions of some children and diverge from those of others, indicating whether that parent is present in or absent from the scene.[1]
For each possible parent, each child computes a prediction vector by multiplying its output by a weight matrix (trained by backpropagation).[3] The output of the parent is then computed as the scalar product of a prediction with a coefficient representing the probability that this child belongs to that parent. A child whose predictions are relatively close to the resulting output successively increases the coefficient between that parent and child, and decreases it for parents that it matches less well. This increases the contribution that the child makes to that parent, thus increasing the scalar product of the capsule's prediction with the parent's output. After a few iterations, the coefficients strongly connect a parent to its most likely children, indicating that the presence of the children implies the presence of the parent in the scene.[1] The more children whose predictions are close to a parent's output, the more quickly the coefficients grow, driving convergence. The pose of the parent (reflected in its output) progressively becomes compatible with that of its children.[3]
The coefficients' initial logits are the log prior probabilities that a child belongs to a parent. The priors can be trained discriminatively along with the weights; they depend on the location and type of the child and parent capsules, but not on the current input. At each iteration, the coefficients are adjusted via a "routing" softmax so that they continue to sum to 1 (expressing the probability that a given capsule is the parent of a given child). Softmax amplifies larger values and diminishes smaller values beyond their proportion of the total. Similarly, the probability that a feature is present in the input is exaggerated by a nonlinear "squashing" function that reduces values (smaller ones drastically, larger ones so that they remain less than 1).[3]
This dynamic routing mechanism provides the necessary deprecation of alternatives ("explaining away") that is needed for segmenting overlapped objects.
This learned routing of signals has no clear biological equivalent. Some related operations can be found in cortical layers, but they do not appear to correspond to this technique.
Math/code
The pose vector $\mathbf{u}_i$ is rotated and translated by a weight matrix $\mathbf{W}_{ij}$ into a vector $\hat{\mathbf{u}}_{j|i}$ that predicts the output of the parent capsule:

$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i$$
Capsules in the next higher level are fed the sum of the predictions from all capsules in the lower layer, each weighted by a coupling coefficient $c_{ij}$:

$$\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}$$
Procedure softmax
The coupling coefficients from a capsule $i$ in layer $l$ to all capsules in layer $l+1$ sum to one, and are defined by a "routing softmax". The initial logits $b_{ij}$ are the prior log probabilities for the routing, that is, the prior probability that capsule $i$ in layer $l$ should connect to capsule $j$ in layer $l+1$. The coupling coefficients are normalized as:[1]

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$
For this procedure to be optimal it would have to memoize several values and reset them on each iteration: if the logit vector $\mathbf{b}_i$ changes, the memoized values must be updated. How this should be done is not shown, nor is how the divisor should be memoized.[1]
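A direct NumPy reading of the formula above (an illustrative sketch, not the authors' code):

```python
import numpy as np

def routing_softmax(b):
    """Turn routing logits b[i, j] into coupling coefficients c[i, j].

    For each child capsule i, the coefficients over all parent capsules j
    sum to one. Subtracting the per-row maximum is a standard trick for
    numerical stability and does not change the result.
    """
    exp_b = np.exp(b - b.max(axis=1, keepdims=True))
    return exp_b / exp_b.sum(axis=1, keepdims=True)
```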
Procedure squash
Because the length of a capsule's output vector represents a probability, it should lie between zero and one; to ensure this, a squashing function is applied:[1]

$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$
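Putting the pieces together, here is an illustrative NumPy sketch of the squash nonlinearity and the dynamic routing loop it feeds, written from the formulas above and reusing the routing_softmax sketched earlier (the array shapes and names are assumptions, not the authors' code):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in (0, 1), keeping its direction."""
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Route predictions u_hat[i, j, :] from child i to parent j.

    u_hat has shape (num_children, num_parents, dim) and holds the
    prediction vectors u_hat_{j|i} = W_ij @ u_i.
    """
    num_children, num_parents, _ = u_hat.shape
    b = np.zeros((num_children, num_parents))   # initial logits (uniform prior)
    for _ in range(num_iterations):
        c = routing_softmax(b)                  # coupling coefficients
        s = np.einsum('ij,ijd->jd', c, u_hat)   # weighted sum per parent
        v = squash(s)                           # parent outputs
        b += np.einsum('ijd,jd->ij', u_hat, v)  # agreement raises the logits
    return v
```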
Capsnets explore the intuition that the human visual system creates a tree-like structure for each focal point and coordinates these trees to recognize objects. However, with capsnets each tree is "carved" from a fixed network (by adjusting coefficients) rather than assembled on the fly.[1]
Alternatives
CapsNets are claimed to have four major conceptual advantages over convolutional neural networks (CNNs):
- Viewpoint invariance: the use of pose matrices allows capsule networks to recognize objects regardless of the perspective from which they are viewed.
- Fewer parameters: Because capsules group neurons, the connections between layers require fewer parameters.
- Better generalization to new viewpoints: CNNs, when trained to understand rotations, often learn that an object can be viewed similarly from several different rotations. However, capsule networks generalize better to new viewpoints because pose matrices can capture these characteristics as linear transformations.
- Defense against white-box adversarial attacks: the Fast Gradient Sign Method (FGSM) is a typical method for attacking CNNs. It evaluates the gradient of the loss with respect to each pixel and changes each pixel by at most epsilon (the perturbation budget) to maximize the loss. Although this method can drop the accuracy of CNNs dramatically (e.g., to below 20%), capsule networks maintain accuracy above 70%. A generic sketch of the attack follows this list.
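The following sketch shows the attack itself in NumPy form, assuming the gradient of the loss with respect to the input image has already been obtained from the attacked network (how depends on the framework; the function name is illustrative):

```python
import numpy as np

def fgsm_attack(image, loss_gradient, epsilon=0.03):
    """Fast Gradient Sign Method: nudge every pixel by +/- epsilon in the
    direction that increases the network's loss the most."""
    adversarial = image + epsilon * np.sign(loss_gradient)
    return np.clip(adversarial, 0.0, 1.0)  # keep pixels in a valid range
```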
Purely convolutional nets cannot generalize to unlearned viewpoints (other than translation). For other affine transformations, either feature detectors must be repeated on a grid that grows exponentially with the number of transformation dimensions, or the size of the labelled training set must (exponentially) expand to encompass those viewpoints. These exponential explosions make them unsuitable for larger problems.[1]
Capsnet's transformation matrices learn the (viewpoint independent) spatial relationship between a part and a whole, allowing the latter to be recognized based on such relationships. However, capsnets assume that each location displays at most one instance of a capsule's object. This assumption allows a capsule to use a distributed representation (its activity vector) of an object to represent that object at that location.[1]
Capsnets use neural activities that vary with viewpoint. They do not have to normalize objects (as in spatial transformer networks) and can even recognize multiply transformed objects. Capsnets can also process segmented objects.[1]
Notes
- ^ In Hinton's own words this is "wild speculation".
References
- ^ a b c d e f g h i j k l m n o p q r Sabour, Sara; Frosst, Nicholas; Hinton, Geoffrey E. (2017-10-26). "Dynamic Routing Between Capsules". arXiv:1710.09829 [cs.CV].
- ^ Hinton, Geoffrey E.; Krizhevsky, Alex; Wang, Sida D. (2011-06-14). Transforming Auto-Encoders. Lecture Notes in Computer Science. Vol. 6791. Springer, Berlin, Heidelberg. pp. 44–51. CiteSeerX 10.1.1.220.5099. doi:10.1007/978-3-642-21735-7_6. ISBN 9783642217340.
- ^ a b c d e Srihari, Sargur. "Capsule Nets" (PDF). University of Buffalo. Retrieved 2017-12-07.
- ^ a b Hinton, Geoffrey E; Ghahramani, Zoubin; Teh, Yee Whye (2000). Solla, S. A.; Leen, T. K.; Müller, K. (eds.). Advances in Neural Information Processing Systems 12 (PDF). MIT Press. pp. 463–469.
- ^ Meher Vamsi (2017-11-15), Geoffrey Hinton Capsule theory, retrieved 2017-12-06
- ^ "Understanding Matrix capsules with EM Routing (Based on Hinton's Capsule Networks)". jhui.github.io. Retrieved 2017-12-31.
- ^ Tan, Kendrick (November 10, 2017). "Capsule Networks Explained". kndrck.co. Retrieved 2017-12-26.
External links
- Guo, Xifeng (2017-12-08), CapsNet-Keras: A Keras implementation of CapsNet in NIPS2017 paper "Dynamic Routing Between Capsules". Now test error = 0.34%., retrieved 2017-12-08
- Liao, Huadong (2017-12-08), CapsNet-Tensorflow: A Tensorflow implementation of CapsNet(Capsules Net) in Hinton's paper Dynamic Routing Between Capsules, retrieved 2017-12-08
- A PyTorch implementation of the NIPS 2017 paper "Dynamic Routing Between Capsules", Gram.AI, 2017-12-08, retrieved 2017-12-08
- What's wrong with convolutional neural nets on YouTube
- "Deep Learning". www.cedar.buffalo.edu. Retrieved 2017-12-07.
- Anonymous authors (November 2017). "Matrix Capsules with EM Routing".
- De Brabandere, Bert; Jia, Xu; Tuytelaars, Tinne; Van Gool, Luc (2016-05-31). "Dynamic Filter Networks". arXiv:1605.09673 [cs.LG].
- Dai, Jifeng; Qi, Haozhi; Xiong, Yuwen; Li, Yi; Zhang, Guodong; Hu, Han; Wei, Yichen (2017-03-17). "Deformable Convolutional Networks". arXiv:1703.06211 [cs.CV].