M-theory (learning framework)
In machine learning and computer vision, M-theory is a learning framework inspired by the functioning of the visual cortex and originally developed for recognition and classification of objects in visual scenes. M-theory was later applied to other areas, such as speech recognition. On certain image recognition tasks, algorithms based on M-theory achieved human-level performance.[1] The core principle of M-theory is the use of representations that are invariant to various transformations of images (such as rotation, translation and scale). In contrast with other approaches using invariant representations, in M-theory the invariances are not hardcoded into the algorithms but learned. M-theory also builds on developments in compressed sensing. The theory proposes a multilayered hierarchical learning architecture, similar to that of the visual cortex. In contrast with some other models exploiting similar ideas (such as the memory-prediction framework), the M-theory architecture is purely feedforward: it does not consider the feedback flow of information from higher levels of the cortical hierarchy.
Intuition
Invariant Representations
A great challenge in vision is that the same object can be seen in a variety of conditions. It can be seen from different distances, from different viewpoints, under different lighting conditions, partially occluded, etc. For particular classes of objects, such as faces, highly complex specific transformations may be relevant, such as changing facial expressions. In the process of learning to recognize images, it would be greatly beneficial to abstract these variations away. Doing so would result in a much simpler classification problem and, consequently, in a great reduction of the sample complexity of the model.
A simple computational experiment illustrates this idea. Two classifiers are trained to distinguish images of planes from those of cars. For the first classifier, images with arbitrary viewpoints are used for training and testing. The second classifier is given only rectified images seen from a fixed viewpoint. Rectification emulates the existence of a subsystem that constructs an invariant representation of images. The second classifier performs quite well even after receiving a single example from each category, while the performance of the first classifier is close to random guessing even after seeing 20 examples.
Invariant representations have been incorporated into several learning architectures, such as neocognitrons. Most of these architectures, however, provided invariance through custom-designed features or properties of the architecture itself. While this helps to take into account some sorts of transformations, such as translations, it is very nontrivial to accommodate other sorts of transformations, such as 3D rotations and changing facial expressions. M-theory provides a framework in which such transformations can be learned. In addition to higher flexibility, the theory also suggests how the human brain may acquire similar capabilities.
Results from Compressed Sensing
Another idea underlying M-theory comes from the field of compressed sensing. An implication of the Johnson–Lindenstrauss lemma is that a given number of images can be embedded into a low-dimensional feature space, with distances between images approximately preserved, by using random projections. This result suggests that the dot product between the observed image and some other image stored in memory, called a template, can be used as a feature that helps to distinguish the image from other images. The template need not be related to the image in any way; it can be chosen randomly.
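The distance-preserving effect of random templates can be checked numerically. The following is a minimal sketch, not taken from the cited papers: the "images" are random vectors, and the sizes, the Gaussian templates and the 1/sqrt(K) scaling are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 20 images, 4096 pixels, 512 templates.
n_images, dim, n_templates = 20, 4096, 512

# Stand-in "images": random vectors of length dim.
images = rng.standard_normal((n_images, dim))

# Random templates; the 1/sqrt(K) scaling keeps squared distances unbiased.
templates = rng.standard_normal((n_templates, dim)) / np.sqrt(n_templates)

# Feature k of an image is its dot product with template k.
features = images @ templates.T


def pairwise_distances(x):
    """Euclidean distances between all distinct pairs of rows of x."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists[np.triu_indices(len(x), k=1)]


# Ratio of each distance after projection to the distance before it.
ratios = pairwise_distances(features) / pairwise_distances(images)
max_distortion = np.abs(ratios - 1).max()
print(f"largest relative distance error: {max_distortion:.3f}")
```

The feature vectors are eight times shorter than the images, yet every pairwise distance changes only by a small relative amount; using fewer templates increases the distortion.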
Combining Compressed Sensing and Invariance
The two ideas outlined in the previous sections can be brought together to construct a framework for learning invariant representations. To do so, consider how the dot product between an image $I$ and a template $t$ behaves when the image is transformed (by a transformation such as translation, rotation or scaling). If the transformation $g$ is a member of a unitary group of transformations $G$ (one whose elements preserve dot products), then the following holds:
$$\langle gI, t \rangle = \langle I, g^{-1}t \rangle \qquad (1)$$
In other words, the dot product of the transformed image and a template is equal to the dot product of the original image and the inversely transformed template. For instance, for an image rotated by 90 degrees, the inversely transformed template would be rotated by -90 degrees (see picture).
Consider the set of dot products of an image $I$ with all possible transformations of a template: $\{\langle I, g't \rangle \mid g' \in G\}$. If one applies a transformation $g$ to $I$, the set becomes $\{\langle gI, g't \rangle \mid g' \in G\}$. Because of property (1), this is equal to $\{\langle I, g^{-1}g't \rangle \mid g' \in G\}$. The set $\{g^{-1}g' \mid g' \in G\}$ is just the set of all elements of $G$. To see this, note that every $g^{-1}g'$ is in $G$ due to the closure property of groups, and for every $g''$ in $G$ there exists a $g'$ such that $g'' = g^{-1}g'$ (namely, $g' = gg''$). Thus, $\{\langle I, g^{-1}g't \rangle \mid g' \in G\} = \{\langle I, g''t \rangle \mid g'' \in G\}$. The set of dot products therefore remains the same even though a transformation was applied to the image. This set by itself may serve as a (very cumbersome) invariant representation of an image. More computationally manageable invariant representations are derived in the next section.
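The argument can be verified on a toy example. The sketch below assumes the group of cyclic shifts of a 1-D signal, which is unitary because a shift only permutes coordinates; the helper names `shift` and `dot_product_set` are hypothetical and chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
image = rng.standard_normal(n)
template = rng.standard_normal(n)   # random, unrelated to the image


def shift(x, s):
    """Group element g_s: cyclic shift of a 1-D signal by s positions."""
    return np.roll(x, s)


def dot_product_set(x, t):
    """Dot products of x with every transformed template, sorted so that
    the result represents a set (the order carries no information)."""
    return np.sort([x @ shift(t, s) for s in range(len(t))])


# Property (1): <g I, t> = <I, g^{-1} t>
g = 5
assert np.isclose(shift(image, g) @ template, image @ shift(template, -g))

# The set of dot products is unchanged when the image is transformed.
original = dot_product_set(image, template)
transformed = dot_product_set(shift(image, g), template)
print(np.allclose(original, transformed))
```

Sorting discards the order in which the group elements are enumerated, which is exactly the information that changes when the image is shifted.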
In the introductory section, it was claimed that M-theory allows invariant representations to be learned. This is because templates and their transformed versions can be learned from visual experience, by exposing the system to ‘slideshows’ of moving, rotating and deforming objects. It is plausible that similar visual experiences occur in the early period of human life, for instance when infants twiddle toys in their hands. Because templates may be totally unrelated to the images that the system will later try to classify, memories of these visual experiences may serve as a basis for recognizing many different kinds of objects in later life. This framework works for generic transformations; however, it is shown later that for some kinds of transformations, specific templates are preferable.
Theoretical Aspects
From Orbits to Distribution Measures
This section shows that an image can be characterized by a set of one-dimensional probability distributions. These probability distributions can in turn be described either by histograms or by a set of statistical moments. This provides computationally manageable invariant representations of an image.
An orbit $O_I$ is the set of images $gI$ generated from a single image $I$ under the action of the group $G$, for all $g \in G$.
In other words, images of an object and of its transformations correspond to an orbit $O_I$. If two orbits have a point in common, they are identical everywhere,[2] i.e. an orbit is an invariant and unique representation of an image. So, two images are called equivalent when they belong to the same orbit: $I \sim I'$ if there exists $g \in G$ such that $I' = gI$. Conversely, two orbits are different if none of the images in one orbit coincides with any image in the other.[3]
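For a small finite group, orbit membership can be tested by brute force. This is a minimal sketch assuming the group of cyclic shifts of a 1-D signal; `orbit` and `same_orbit` are hypothetical helper names.

```python
import numpy as np


def orbit(image):
    """All cyclic shifts of a 1-D signal: its orbit under the shift group."""
    return np.stack([np.roll(image, s) for s in range(len(image))])


def same_orbit(a, b):
    """True if b equals some transformed version g a, i.e. a ~ b."""
    return bool(np.any(np.all(np.isclose(orbit(a), b), axis=1)))


a = np.array([1.0, 2.0, 3.0, 4.0])
print(same_orbit(a, np.array([3.0, 4.0, 1.0, 2.0])))  # a shifted by 2
print(same_orbit(a, np.array([1.0, 3.0, 2.0, 4.0])))  # not a shift of a
```

Enumerating the whole orbit is only feasible for tiny groups, which is one motivation for the distribution-based comparison discussed next.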
A natural question arises: how can one compare two orbits? There are several possible approaches. One of them employs the fact that, intuitively, two empirical orbits are the same irrespective of the ordering of their points. Thus, one can consider a probability distribution $P_I$ induced by the group’s action on images $I$ ($gI$ can be seen as a realisation of a random variable).
This probability distribution $P_I$ can be almost uniquely characterized by $K$ one-dimensional probability distributions $P_{\langle I, t^k \rangle}$ induced by the (one-dimensional) results of projections $\langle I, t^k \rangle$, where $t^k$, $k = 1, \ldots, K$, are a set of templates (randomly chosen images). This is based on the Cramer-Wold theorem[4] and concentration of measures.
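Consider a set $\mathcal{X}_n$ of $n$ images. Let $K \geq \frac{2}{c\epsilon^2} \log\frac{n}{\delta}$, where $c$ is a universal constant. Then
$$|d(P_I, P_{I'}) - \hat{d}_K(P_I, P_{I'})| \leq \epsilon$$
with probability $1 - \delta^2$, for all $I, I' \in \mathcal{X}_n$. Here $d$ is a distance between the distributions associated with two orbits and $\hat{d}_K$ is its estimate from $K$ templates.
This theorem (informally) says that an approximately invariant and unique signature of an image $I$ can be obtained from the estimates of $K$ 1-D probability distributions $P_{\langle I, t^k \rangle}$ for $k = 1, \ldots, K$. The number $K$ of projections needed to discriminate $n$ orbits, induced by $n$ images, up to precision $\epsilon$ (and with confidence $1 - \delta^2$) is $K \geq \frac{2}{c\epsilon^2} \log\frac{n}{\delta}$, where $c$ is a universal constant.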
To classify an image, the following “recipe” can be used:
- Memorize a set of images/objects called templates
- Memorize observed transformations for each template
- Compute dot products of each template's transformations with the image to be classified
- Compute a histogram of the resulting values for each template
- Compare the obtained histogram with signatures stored in memory
These 1-D probability distributions can be characterized with N-bin histograms or with a set of statistical moments. For example, HMAX is an architecture in which pooling is done with a max operation.
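The recipe above can be sketched end to end on 1-D signals. The code below is an illustration under stated assumptions rather than the authors' implementation: transformations are cyclic shifts, templates are random unit vectors, signatures are compared with an L1 distance, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_templates, n_bins = 32, 8, 10

# Steps 1-2: memorize templates together with all their transformations.
templates = rng.standard_normal((n_templates, n))
templates /= np.linalg.norm(templates, axis=1, keepdims=True)
stored = np.array([[np.roll(t, s) for s in range(n)] for t in templates])


def signature(image):
    """Steps 3-4: dot products with the stored transformations, then one
    histogram per template, concatenated into a single vector."""
    image = image / np.linalg.norm(image)
    dots = stored @ image                      # shape (n_templates, n)
    hists = [np.histogram(d, bins=n_bins, range=(-1, 1))[0] for d in dots]
    return np.concatenate(hists)


def classify(image, memory):
    """Step 5: return the label whose stored signature is closest (L1)."""
    sig = signature(image)
    return min(memory, key=lambda label: np.abs(memory[label] - sig).sum())


# A single labeled example per class is used to build the memory.
examples = {"a": rng.standard_normal(n), "b": rng.standard_normal(n)}
memory = {label: signature(img) for label, img in examples.items()}

# A shifted version of example "b" should be assigned to class "b".
print(classify(np.roll(examples["b"], 7), memory))
```

Because the histograms ignore which shift produced which dot product, the shifted image yields the same signature as the stored example, so one example per class suffices in this toy setting.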
Non-Compact Groups of Transformations
Approximating a group of transformations with a finite number of transformations is guaranteed to be possible when the group is compact. Indeed, according to the open cover definition of a compact set, for any collection of open sets covering the compact set, it is possible to select a finite number of these open sets that also cover the compact set. Thus, it is possible to select a finite number of balls with radius $\epsilon$ that cover the group, regardless of how small $\epsilon$ is. For a non-compact group, this may not be true.
Such groups as all translations and all scalings of the image are not compact, as they allow arbitrarily big transformations. However, even with a non-compact group, invariance is achievable within a certain range of transformations.[2]
Assume that $G_0$ is a subset of transformations from $G$ for which the transformed patterns exist in memory. For an image $I$ and template $t^k$, assume that $\langle I, g^{-1}t^k \rangle$ is equal to zero everywhere except on some subset of $G_0$. This subset is called the support of $\langle I, g^{-1}t^k \rangle$ and denoted as $\mathrm{supp}(\langle I, g^{-1}t^k \rangle)$. It can be proven that if, for a transformation $g'$, the support set also lies within $g'G_0$, then the representation of $I$ transformed by $g'$ is the same as that of the original $I$.[2] This theorem determines the range of transformations for which invariance is guaranteed to hold.
One can see that the smaller $\mathrm{supp}(\langle I, g^{-1}t^k \rangle)$ is, the larger is the range of transformations for which invariance is guaranteed to hold. This means that for a non-compact group, not all templates work equally well. Templates that have a reasonably small $\mathrm{supp}(\langle I, g^{-1}t^k \rangle)$ for a generic image are preferable. This property is called localization: templates are sensitive only to images within a small range of transformations. Note that although minimizing the support is not absolutely necessary for the system to work, it can improve the system's characteristics. Requiring localization simultaneously for translation and scale yields a very specific kind of templates: Gabor functions.[2]
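The range-limited invariance described above can be illustrated with translations on a line, where only a finite window of shifted templates is stored. This is a toy sketch under assumed values: the box-shaped patterns, the window half-width `R` and the helper names are illustrative choices.

```python
import numpy as np

N = 200   # length of the 1-D "retina"
R = 20    # stored shifts form the window {-R, ..., R}


def bump(center, width=3):
    """Localized pattern: ones on [center-width, center+width], else zero."""
    x = np.zeros(N)
    x[center - width:center + width + 1] = 1.0
    return x


# Memory holds the template at every shift inside the stored window only.
template_center = 100
stored = np.stack([bump(template_center + s) for s in range(-R, R + 1)])


def signature(image):
    """Sorted dot products with the finite set of stored templates."""
    return np.sort(stored @ image)


image = bump(100)
small = bump(100 + 10)   # nonzero dot products stay inside the window
large = bump(100 + 18)   # nonzero dot products partly leave the window

print(np.allclose(signature(image), signature(small)))
print(np.allclose(signature(image), signature(large)))
```

With a narrower pattern (smaller `width`) the nonzero dot products occupy fewer shifts, so the image can move further before the signature changes, which is the localization effect.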
The desirability of custom templates for non-compact groups is in conflict with the principle of learning invariant representations. However, it appears plausible that for certain kinds of regularly encountered image transformations, evolutionary adaptations for processing them have developed in living organisms. One such adaptation would be specific templates for translation and scaling. Neurobiological data suggests that there is Gabor-like tuning in the first layer of the visual cortex.[5] The optimality of Gabor templates for translations and scalings is a possible explanation of this phenomenon.
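For reference, a Gabor function is a sinusoidal wave multiplied by a Gaussian envelope, which makes it localized both in position and in spatial frequency. The sketch below generates one such 2-D template; the parameter names and default values are assumptions made for illustration and are not taken from the cited work.

```python
import numpy as np


def gabor(size=21, sigma=4.0, wavelength=8.0, theta=0.0, phase=0.0):
    """2-D Gabor function: a Gaussian envelope multiplying a plane wave.

    All parameter names and default values are illustrative assumptions.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the direction in which the wave oscillates.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier


template = gabor()
print(template.shape)
# The response is concentrated near the centre and negligible at the corners.
print(abs(template[0, 0]) < 0.01 * template[10, 10])
```

Changing `sigma` and `wavelength` together rescales the template, and changing `theta` rotates it, so a family of such templates can be produced from a single function.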
Non-Group Transformations
Many interesting transformations of images do not form groups. For instance, transformations of images associated with 3D rotation of the corresponding 3D object do not form a group. Moreover, such transformations are not even functions of the image. Indeed, two different 3D objects may look the same from a certain angle but different from another angle. If image $I$ corresponds to both objects seen from the first viewpoint, and images $I'_1$ and $I'_2$ correspond to the two objects seen from the second viewpoint, then the transformation of $I$ should result in $I'_1$ and $I'_2$ simultaneously. A similar problem makes it impossible to define an inverse for such a transformation (note that transformations of images, not of the associated objects, are considered). This makes the machinery described above not directly applicable to such transformations.
However, approximate invariance is still achievable even for non-group transformations, if certain conditions on the transformation and the template set are met:
- The transformation should be an at least twice differentiable function of its parameters;
- The localization condition, outlined in previous section, must hold approximately. It means that if , then for sufficiently big transformation , .
As stated in the previous section, for the specific case of translations and scaling, the localization condition can be satisfied by the use of generic Gabor templates. However, for a general (non-group) transformation, the localization condition can be satisfied only for a specific class of objects.[2] More specifically, in order to satisfy the condition, templates must be similar to the objects one would like to recognize. For instance, if one would like to build a system to recognize 3D rotated faces, one needs to use other 3D rotated faces as templates. This may explain the existence of such specialized modules in the brain as the one responsible for face recognition.[2]
It should be noted that even with custom templates, the localization condition is unlikely to hold for raw images. Consider the dot product of an image of a head looking forward and an image of a head rotated 90 degrees around the vertical axis. The two images have a strong overlap in the central area, so their dot product is far greater than 0. However, it is possible to satisfy the localization condition if one is working with a noise-like encoding of images. This encoding may result from the functioning of the lower levels of a hierarchical image recognition architecture. This suggests an architecture with at least two levels, where the first level operates with generic Gabor-like templates, and higher levels operate with class-specific templates to handle non-group transformations.
Hierarchical Architectures
The previous section suggests one motivation for hierarchical image recognition architectures. However, they have other benefits as well.
Firstly, hierarchical architectures best accomplish the goal of ‘parsing’ a complex visual scene with many objects consisting of many parts, whose relative position may greatly vary. In this case, different elements of the system must react to different objects and parts. Hierarchical architectures have required elements in place, while single-layer architectures don’t.
Secondly, hierarchical architectures which have invariant representations for parts of objects may facilitate learning of complex compositional concepts. This facilitation may happen through the reuse of learned representations of parts that were constructed earlier, in the process of learning other concepts. As a result, the sample complexity of learning compositional concepts may be greatly reduced.
Finally, hierarchical architectures have better tolerance to clutter. The clutter problem arises when the target object appears on a cluttered background, which may produce a distracting signal for the vision system. A hierarchical architecture provides signatures for parts of target objects, which do not include parts of the background and are not affected by background variations.[6]
In order to build a functional hierarchical architecture, one must make the lower layers of the hierarchy in some sense 'transparent' to transformations they don't handle. In order for a higher level to handle some transformation that was not handled by lower levels, representations generated by lower levels must interact with the transformation in a similar way as raw images do. Specifically, it is required that $\mathrm{distr}(\langle \mu_l(gI), \mu_l(t) \rangle) = \mathrm{distr}(\langle \mu_l(I), \mu_l(g^{-1}t) \rangle)$, where $g$ is a transformation from some class that is not handled by the lower layer $l$, $\mu_l$ is the pooling function of that layer, and $\mathrm{distr}(f(g))$ stands for "distribution of values of the expression $f(g)$ for all $g \in G$". This property is called covariance. It is necessary to make the basic logic outlined in the "Intuition" section applicable to hierarchical architectures.
Relation to Biology
M-theory is based on a quantitative theory of the ventral stream of the visual cortex.[7][8] Understanding how the visual cortex works in object recognition is still a challenging task for neuroscience. Humans and primates are able to memorize and recognize objects after seeing just a couple of examples, unlike state-of-the-art machine vision systems, which usually require a lot of data in order to recognize objects. Previously, the use of visual neuroscience in computer vision had been limited to early vision, for deriving stereo algorithms (e.g.,[9]) and for justifying the use of DoG (derivative-of-Gaussian) filters and, more recently, of Gabor filters.[10][11] No real attention had been given to biologically plausible features of higher complexity. While mainstream computer vision has always been inspired and challenged by human vision, it seems never to have advanced past the very first stages of processing in the simple cells in V1 and V2. Although some of the systems inspired, to various degrees, by neuroscience had been tested on at least some natural images, neurobiological models of object recognition in cortex had not been extended to deal with real-world image databases.[12]
The M-theory learning framework employs a novel hypothesis about the main computational function of the ventral stream: the representation of new objects/images in terms of a signature, which is invariant to transformations learned during visual experience. This allows recognition from very few labeled examples - in the limit, just one.
Neuroscience suggests that a natural functional for a neuron to compute is a high-dimensional dot product between an “image patch” and another image patch (called a template), which is stored in terms of synaptic weights (synapses per neuron). The standard computational model of a neuron is based on a dot product and a threshold. Another important feature of the visual cortex is that it consists of simple and complex cells. This idea was originally proposed by Hubel and Wiesel.[9] M-theory employs this idea. Simple cells compute dot products of an image and transformations of templates $\langle I, g_i t^k \rangle$ for $i = 1, \ldots, |G|$ ($|G|$ is the number of simple cells), and complex cells are responsible for pooling and computing empirical histograms or statistical moments of these dot products.
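The division of labour between simple and complex cells can be sketched as two small functions. This is an illustrative toy model, not a biological simulation: the transformations are assumed to be cyclic shifts of a 1-D signal, and the particular pooling statistics (mean, second moment, max) are examples of the pooling choices mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 24
template = rng.standard_normal(n)
image = rng.standard_normal(n)


def simple_cells(image, template):
    """One simple cell per stored transformation g_i of the template
    (here: cyclic shifts); each outputs the dot product <I, g_i t>."""
    return np.array([image @ np.roll(template, i) for i in range(len(template))])


def complex_cell(responses):
    """Pool simple-cell responses with a few order-independent statistics."""
    return {
        "mean": responses.mean(),             # first moment
        "energy": (responses ** 2).mean(),    # second moment
        "max": responses.max(),               # max pooling, as in HMAX
    }


before = complex_cell(simple_cells(image, template))
after = complex_cell(simple_cells(np.roll(image, 9), template))
for name in before:
    print(name, np.isclose(before[name], after[name]))
```

Each pooled statistic depends only on the set of simple-cell responses, not on their order, so shifting the image leaves the complex-cell outputs unchanged.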
Applications
Applications to Computer Vision
In [13][14], the authors applied M-theory to unconstrained face recognition in natural photographs. Unlike the DAR (detection, alignment, and recognition) method, which handles clutter by detecting objects and cropping closely around them so that very little background remains, this approach accomplishes detection and alignment implicitly by storing transformations of training images (templates), rather than explicitly detecting and aligning or cropping faces at test time. The system is built according to the principles of the theory of invariance in hierarchical networks and can evade the clutter problem that generally troubles feedforward systems. The resulting end-to-end system achieves a drastic improvement in the state of the art on this task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (with no outside training data). It also performs well on two newer datasets that are similar to LFW (Labeled Faces in the Wild) but more difficult: a significantly jittered (misaligned) version of LFW, and SUFR-W. For example, the model’s accuracy in the LFW “unaligned & no outside data used” category is 87.55±1.41%, compared to 81.70±1.78% for the state-of-the-art APEM (adaptive probabilistic elastic matching).
The theory was also applied to a range of recognition tasks: from invariant single object recognition in clutter to multiclass categorization problems on publicly available data sets (CalTech5, CalTech101, MIT-CBCL) and complex (street) scene understanding tasks that require the recognition of both shape-based and texture-based objects (on the StreetScenes data set).[12] The approach is capable of learning from only a few training examples and was shown to outperform several more complex state-of-the-art systems (constellation models, the hierarchical SVM-based face-detection system). A key element in the approach is a new set of scale- and position-tolerant feature detectors, which are biologically plausible and agree quantitatively with the tuning properties of cells along the ventral stream of the visual cortex. These features are adaptive to the training set, though the authors also show that a universal feature set, learned from a set of natural images unrelated to any categorization task, likewise achieves good performance.
Applications to Speech Recognition
This theory can also be extended to the speech recognition domain. As an example, in [15] the authors proposed an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluated its validity for voiced speech sound classification. They demonstrated that a single-layer, phone-level representation, extracted from base speech features, improves segment classification accuracy and decreases the number of training examples needed, in comparison with standard spectral and cepstral features, for an acoustic classification task on the TIMIT dataset.[16]
Citations
- Serre T., Oliva A., Poggio T. (2007) A feedforward architecture accounts for rapid categorization. PNAS, vol. 104, no. 15, pp. 6424-6429.
- F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio (2014) Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158.
- H. Schulz-Mirbach (1994) Constructing invariant features by averaging techniques. In Pattern Recognition, 1994. Vol. 2 - Conference B: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 2, pages 387–390.
- H. Cramer and H. Wold (1936) Some theorems on distribution functions. J. London Math. Soc., 4:290–294.
- F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, T. Poggio (2013) Magic Materials: a theory of deep hierarchical architectures for learning sensory representations. CBCL paper, Massachusetts Institute of Technology, Cambridge, MA.
- Liao Q., Leibo J., Mroueh Y., Poggio T. (2014) Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? CBMM Memo No. 003, Massachusetts Institute of Technology, Cambridge, MA.
- M. Riesenhuber and T. Poggio (1999) Hierarchical Models of Object Recognition in Cortex. Nature Neuroscience, vol. 2, no. 11, pp. 1019-1025.
- T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio (2005) A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex. AI Memo 2005-036/CBCL Memo 259, Massachusetts Inst. of Technology, Cambridge.
- D.H. Hubel and T.N. Wiesel (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology 160.
- D. Gabor (1946) Theory of Communication. J. IEE, vol. 93, pp. 429-459.
- J.P. Jones and L.A. Palmer (1987) An Evaluation of the Two-Dimensional Gabor Filter Model of Simple Receptive Fields in Cat Striate Cortex. J. Neurophysiology, vol. 58, pp. 1233-1258.
- Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and Tomaso Poggio (2007) Robust Object Recognition with Cortex-Like Mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3.
- Qianli Liao, Joel Z Leibo, Youssef Mroueh, Tomaso Poggio (2014) Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? CBMM Memo No. 003.
- Qianli Liao, Joel Z Leibo, and Tomaso Poggio (2014) Learning invariant representations and applications to face verification. NIPS 2014.
- Georgios Evangelopoulos, Stephen Voinea, Chiyuan Zhang, Lorenzo Rosasco, Tomaso Poggio (2014) Learning An Invariant Speech Representation. CBMM Memo No. 022.
- https://catalog.ldc.upenn.edu/LDC93S1