Factorial code

Most real world data sets consist of data vectors whose individual components are not statistically independent, that is, they are redundant in the statistical sense.^{[clarification needed]} Then it is desirable to create a factorial code of the data, i. e., a new vector-valued representation of each data vector such that it gets uniquely encoded by the resulting code vector (loss-free coding), but the code components are statistically independent.

Later supervised learning usually works much better when the raw input data is first translated into such a factorial code. For example, suppose the final goal is to classify images with highly redundant pixels. A naive Bayes classifier will assume the pixels are statistically independent random variables and therefore fail to produce good results. If the data are first encoded in a factorial way, however, then the naive Bayes classifier will achieve its optimal performance (compare Schmidhuber et al. 1996).

To create factorial codes, Horace Barlow and co-workers suggested to minimize the sum of the bit entropies of the code components of binary codes (1989). Jürgen Schmidhuber (1992) re-formulated the problem in terms of predictors and binary feature detectors, each receiving the raw data as an input. For each detector there is a predictor that sees the other detectors and learns to predict the output of its own detector in response to the various input vectors or images. But each detector uses a machine learning algorithm to become as unpredictable as possible. The global optimum of this objective function corresponds to a factorial code represented in a distributed fashion across the outputs of the feature detectors.

References

Horace Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1:412-423, 1989.
Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992
J. Schmidhuber and M. Eldracher and B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996

See also

References