Autoencoder
An auto-encoder is an artificial neural network used for learning efficient codings. The aim of an auto-encoder is to learn a compressed representation (encoding) for a set of data, which means it is used for dimensionality reduction; more specifically, it is a feature extraction method. Auto-encoders use three or more layers:
- An input layer. For example, in a face recognition task, the neurons in the input layer could map to pixels in the photograph.
- A number of considerably smaller hidden layers, which will form the encoding.
- An output layer, where each neuron has the same meaning as in the input layer.
If linear neurons are used, an auto-encoder is very similar to PCA. Auto-encoders are used in MediCoder Premium (a medical coding and terminology tool), where they encode a large number of terms offline, and are also used to interactively code individual terms, automatically performing a fuzzy match against historical data and presenting the results to the user.
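The three-layer architecture described above can be written down directly. Below is a minimal sketch assuming PyTorch; the layer sizes and the sigmoid activations are illustrative choices, not part of the definition. With purely linear units (omitting the sigmoids), the learned code spans essentially the same subspace as PCA.

```python
# Minimal sketch of the three-layer auto-encoder architecture (assumes PyTorch).
# Layer sizes and activation functions are illustrative, not prescribed by the text.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs=784, n_hidden=32):
        super().__init__()
        # Encoder: input layer -> considerably smaller hidden layer (the encoding)
        self.encoder = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Sigmoid())
        # Decoder: hidden layer -> output layer with the same meaning as the input
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_inputs), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)        # compressed representation
        return self.decoder(code)     # reconstruction of the input

model = Autoencoder()
x = torch.rand(16, 784)              # e.g. a batch of flattened images
reconstruction = model(x)
```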
A transforming auto-encoder can force the outputs of a capsule to represent any property of an image that can be manipulated in a known way. It is easy, for example, to scale up all of the pixel intensities. If a first-level capsule outputs a number that is first multiplied by the brightness scaling factor and then used to scale the outputs of its generation units when predicting the brightness-transformed output, this number will learn to represent brightness. This allows the capsule to disentangle the probability that an instance of its visual entity is present from the brightness of that instance. If the direction of lighting of a scene can be varied in a controlled way, a capsule can be forced to output two numbers representing this direction, but only if the visual entity is complex enough for the lighting direction to be extracted from the activities of the recognition units.
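As a rough illustration of the brightness example, the sketch below (assuming PyTorch; all unit counts and module names are hypothetical) shows a single capsule whose brightness output is multiplied by the known scaling factor before it scales the outputs of the generation units, alongside a separate presence probability:

```python
# Hedged sketch of one capsule in a transforming auto-encoder, loosely following the
# brightness example above (assumes PyTorch; sizes and names are illustrative).
import torch
import torch.nn as nn

class BrightnessCapsule(nn.Module):
    def __init__(self, n_pixels=784, n_recog=30, n_gen=30):
        super().__init__()
        self.recognition = nn.Sequential(nn.Linear(n_pixels, n_recog), nn.Sigmoid())
        self.presence = nn.Linear(n_recog, 1)    # probability the visual entity is present
        self.brightness = nn.Linear(n_recog, 1)  # number that comes to represent brightness
        self.generation = nn.Sequential(nn.Linear(n_recog, n_gen), nn.Sigmoid(),
                                        nn.Linear(n_gen, n_pixels))

    def forward(self, image, brightness_factor):
        h = self.recognition(image)
        p = torch.sigmoid(self.presence(h))         # presence probability
        b = self.brightness(h) * brightness_factor  # apply the known brightness scaling
        # The generation units' output is scaled by the transformed brightness number
        # and gated by the presence probability.
        return p * b * self.generation(h)

# Training would compare this output against the brightness-transformed target image,
# forcing b to represent brightness while p represents presence.
```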
Types
There are mainly three types:
Methods to increase capacity
Capacity here means the variety of patterns an auto-encoder can successfully learn: it is much harder to learn all the digits plus the English alphabet than a single class, say only 0's. This capacity is constrained by the number of units in the middle layer and by the total number of layers, but increasing both makes learning more difficult, since any trained auto-encoder corresponds to a single minimum of the energy landscape, and even the global minimum may not be enough. Changing the structure of the auto-encoder is therefore one way to try to increase capacity.
One simple idea is to use several auto-encoders with parallel RBMs, so that several local minima are found which together give a better representation of the data; ideally the parallelism itself can also be exploited.
The capacity of single-layer auto-encoders can be increased in a variety of ways (most of these observations apply to both auto-encoders and RBMs):
1. Vary the number of hidden units: more units should in principle translate to more representational power, since the space of functions to choose from during optimization (i.e. the capacity) is simply larger. As noted above, however, this can pose problems for optimization (though see the next point) because of local minima, finite datasets and a finite number of restarts.
2. Increase the number of hidden units, but constrain the representation in some way, e.g. via sparsity. This is a classical trick that makes it possible, for instance, to have sparse over-complete representations of natural images (Olshausen and Field, 1996); see the sketch after this list.
3. Use more powerful, more non-linear hidden units. For instance, units could be quadratic representations of the input (Bergstra et al., 2011; Turian et al., 2009) or rectified linear units (Nair and Hinton, 2010), which have become a common choice in recent deep learning papers.
4. Use more data: data can be used to "pre-train" the auto-encoder even if it is not strictly part of the training distribution. Unsupervised/unlabeled data is cheap to obtain for most problems (e.g. 20 Newsgroups), and using it in a self-taught learning scenario (Raina et al., 2007) is straightforward with either RBMs or auto-encoders.
5. Ensembles or combinations of auto-encoders: there are a variety of ways to move in this direction.
6. Stacking auto-encoders or RBMs.
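As a concrete illustration of point 2, the following sketch (assuming PyTorch; the L1 penalty, sizes and learning rate are illustrative, and a KL-divergence penalty on average activations is another common choice) performs one training step of an over-complete auto-encoder with a sparsity term added to the reconstruction loss:

```python
# Over-complete hidden layer constrained by a sparsity penalty (assumes PyTorch).
import torch
import torch.nn as nn

n_inputs, n_hidden = 256, 1024          # more hidden units than inputs (over-complete)
encoder = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
decoder = nn.Linear(n_hidden, n_inputs)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
sparsity_weight = 1e-3

x = torch.rand(64, n_inputs)            # stand-in for a batch of image patches
code = encoder(x)
reconstruction = decoder(code)
# Reconstruction error plus an L1 penalty that pushes most code units toward zero.
loss = nn.functional.mse_loss(reconstruction, x) + sparsity_weight * code.abs().mean()
loss.backward()
optimizer.step()
```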
Training
An auto-encoder is often trained using one of the many backpropagation variants (conjugate gradient method, steepest descent, etc.). Though often reasonably effective, there are fundamental problems with using backpropagation to train networks with many hidden layers. By the time the errors have been backpropagated to the first few layers, they are minuscule and quite ineffectual. This causes the network to almost always learn to reconstruct the average of all the training data. Though more advanced backpropagation methods (such as the conjugate gradient method) help with this to some degree, it still results in very slow learning and poor solutions. This problem is remedied by using initial weights that approximate the final solution. The process to find these initial weights is often called pretraining.
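The plain backpropagation setup described above amounts to minimising a reconstruction error end-to-end. A minimal sketch (assuming PyTorch, with illustrative layer sizes and optimizer) is:

```python
# Reconstruction training of a deep auto-encoder with plain backpropagation
# (assumes PyTorch; sizes, optimizer and epoch count are illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(                  # deep auto-encoder with several hidden layers
    nn.Linear(784, 256), nn.Sigmoid(),
    nn.Linear(256, 64),  nn.Sigmoid(),  # small central layer (the code)
    nn.Linear(64, 256),  nn.Sigmoid(),
    nn.Linear(256, 784), nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.rand(1000, 784)            # stand-in for training images

for epoch in range(10):
    for batch in data.split(100):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(batch), batch)  # reconstruct the input
        loss.backward()                 # gradients shrink toward the early layers
        optimizer.step()
# Without good initial weights, such a deep network often just learns to reconstruct
# the mean of the training data, which motivates the pretraining described next.
```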
A pretraining technique developed by Geoffrey Hinton for training many-layered "deep" auto-encoders involves treating each neighboring pair of layers as a Restricted Boltzmann Machine, so that pretraining approximates a good solution, and then using backpropagation to fine-tune.
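A highly simplified sketch of that pretraining step, treating one pair of layers as an RBM trained with one-step contrastive divergence (CD-1), might look as follows (assuming PyTorch; learning rate, batch size and layer sizes are illustrative):

```python
# Greedy layer-wise RBM pretraining with CD-1 (assumes PyTorch; a simplified sketch).
import torch

def pretrain_rbm(data, n_hidden, lr=0.1, epochs=10):
    """Pretrain one layer as an RBM using one-step contrastive divergence."""
    n_visible = data.shape[1]
    W = 0.01 * torch.randn(n_visible, n_hidden)
    b_v = torch.zeros(n_visible)
    b_h = torch.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data.split(100):
            h0 = torch.sigmoid(v0 @ W + b_h)            # hidden probabilities
            h_sample = torch.bernoulli(h0)
            v1 = torch.sigmoid(h_sample @ W.T + b_v)    # reconstruction
            h1 = torch.sigmoid(v1 @ W + b_h)
            W   += lr * (v0.T @ h0 - v1.T @ h1) / v0.shape[0]
            b_v += lr * (v0 - v1).mean(0)
            b_h += lr * (h0 - h1).mean(0)
    return W, b_h

# Each neighboring pair of layers is pretrained in turn; the hidden activations of one
# RBM become the "data" for the next. The resulting weights initialize the deep
# auto-encoder, which is then fine-tuned with backpropagation as above.
data = torch.rand(1000, 784)
W1, b1 = pretrain_rbm(data, 256)
h1 = torch.sigmoid(data @ W1 + b1)
W2, b2 = pretrain_rbm(h1, 64)
```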
High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution. There are effective ways of initializing the weights that allow deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
External links
- Reducing the Dimensionality of Data with Neural Networks (Science, 28 July 2006, Hinton & Salakhutdinov)