
Cascading classifiers


Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output of a given classifier as additional information for the next classifier in the cascade. Unlike voting or stacking ensembles, which are multiexpert systems, cascading is a multistage one.

The first cascading classifier was the face detector of Viola and Jones (2001). The requirement was that the classifier be fast enough to run on low-power CPU systems, such as those in cameras and phones.

The algorithm can be summed up as follows, for every rectangle in the image (a code sketch of this control flow follows the list):

  • stage 1: is there a face in the current rectangle according to classifier 1? If yes, go to stage 2; if no, the rectangle does not contain a face.
  • stage 2: is there a face in the current rectangle according to classifier 2? If yes, go to stage 3; if no, the rectangle does not contain a face.

...

  • stage n: is there a face in the current rectangle according to classifier n? If yes, the current rectangle contains a face.
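In Python-like code, this control flow is simply an early-exit loop (a minimal sketch; the stage functions and the rectangle iterator are hypothetical placeholders):

    def cascade_contains_face(rectangle, stages):
        # Run the rectangle through every stage; reject on the first 'no'.
        for stage in stages:
            if not stage(rectangle):
                return False  # rejected early: no face in this rectangle
        return True  # passed all n stages: the rectangle contains a face

    # faces = [r for r in all_rectangles(image)
    #          if cascade_contains_face(r, stages)]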

The question "is there a face in the current rectangle according to classifier i?", for stage number i ranging from 1 to n, is answered by a collection of weak learners. For example, the 'face in current rectangle' score of stage 1 might be 0.3*(centerpixels - sidepixels) + 0.5*(-pixels 1/3 from the top + rest of pixels) > 0.8. The first condition expresses the fact that the nose catches the light, and the second that the eyebrows are darker than the rest of the face.
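Written out as code, stage 1 of this example might look as follows (a sketch only; the region names are hypothetical attributes standing for the mean pixel intensities described above):

    def stage1_contains_face(rect):
        # Weak learner 1: the nose catches the light, so the
        # center of the rectangle is brighter than its sides.
        f1 = rect.centerpixels - rect.sidepixels
        # Weak learner 2: the eyebrows are darker than the rest of
        # the face, so the band 1/3 from the top is darker.
        f2 = -rect.pixels_one_third_from_top + rect.rest_of_pixels
        # Weighted vote of the weak learners against the stage threshold.
        return 0.3 * f1 + 0.5 * f2 > 0.8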

It can be seen from this description that the classifier will not accept faces that are upside down (the eyebrows are not in the expected position) or faces seen from the side (the nose is no longer in the center, and there might not be shadows on the side of the nose). Separate cascade classifiers have to be trained for every rotation out of the image plane (profile faces), and have to be retrained, or run on rotated features, for every rotation in the image plane (upside-down faces). Scaling is not a problem, since the features have dimensions defined only relative to the rectangle examined (centerpixels and sidepixels scale with it). In more recent cascades, raw comparisons of pixel values between parts of a rectangle have been replaced with Haar wavelets (Haar-like features), which do essentially the same thing: a Haar wavelet multiplies the pixels by a square wave that is positive on one rectangular band and negative on the adjacent ones within a given window, so (centerpixels - sidepixels) corresponds to a feature with a positive center band and negative side bands.
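Because the regions are defined relative to the examined rectangle, the same feature can be evaluated at any scale. A minimal sketch of such a relative center-minus-sides feature, assuming the window is a 2-D NumPy array of pixel intensities (the one-third proportions are illustrative):

    import numpy as np

    def center_minus_sides(window):
        # Regions are defined as fractions of the window, so the
        # feature scales with the rectangle being examined.
        h, w = window.shape
        center = window[:, w // 3 : 2 * w // 3]  # middle third (the nose)
        sides = np.concatenate([window[:, : w // 3],
                                window[:, 2 * w // 3 :]], axis=1)
        return center.mean() - sides.mean()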

To have good overall performance, the following criteria must be met:

  • each stage must validate nearly all faces, even at the cost of producing many false positives. For example, if stage 1 were to mark as 'does not contain a face' 20% of the rectangles that do contain a face (a false negative rate of 20%), then the total performance of the chain cannot exceed an 80% true positive rate, whatever the later stages do, since 20% of the faces have already been rejected.
  • this suggests that a good stage should have a 100% true positive rate and, for example, a 40% false positive rate: accept all rectangles containing faces, and erroneously keep many rectangles that do not, to be eliminated by later stages. Note that for a first stage, 100% true positives and 40% false positives still leaves a very impure output: if only 0.1% of all rectangles in an image contain a face, the false positives outweigh the true positives roughly 400 to 1 (a worked calculation of how these rates compose follows this list).
  • if the first stage is very fast (a few operations), it eliminates 60% of the rectangles not containing a face very cheaply.
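The per-stage rates compose multiplicatively along the cascade, which is what makes a chain of individually weak filters effective. An illustrative calculation (the stage count and rates are assumed numbers, not taken from a real detector):

    # Assumed: 10 stages, each keeping 99.9% of faces (true positive
    # rate) and 40% of non-faces (false positive rate).
    n_stages = 10
    tpr_stage, fpr_stage = 0.999, 0.40

    overall_tpr = tpr_stage ** n_stages  # ~0.99: almost all faces survive
    overall_fpr = fpr_stage ** n_stages  # ~0.0001: ~1 in 10,000 non-faces survive
    print(overall_tpr, overall_fpr)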

The training procedure for one stage is therefore to take many weak learners (simple pixel-difference operators) and train them as a group (raising the weight of those that give the correct result), while keeping only a few weak learners active so the computation time remains low; this is usually done through cost-aware AdaBoost. The sensitivity threshold (0.8 in our example) is then adjusted so that close to 100% of true positives are kept, at the cost of some false positives. The procedure is then repeated for stage 2, and so on, until the desired accuracy/computation-time trade-off is reached.
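The final threshold adjustment can be sketched as picking the threshold from the stage's scores on known faces (a sketch under the assumption that scores_on_faces holds the stage's weighted weak-learner sums for the positive training rectangles):

    import numpy as np

    def adjust_threshold(scores_on_faces, target_tpr=0.999):
        # Choose the threshold so that target_tpr of the known faces
        # score above it; the false positives this lets through are
        # left for the later stages to eliminate.
        return np.quantile(scores_on_faces, 1.0 - target_tpr)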

The first detector of Viola and Jones had 38 stages, with 1 feature in the first stage, then 10, 25, 25 and 50 in the next four stages, for a total of about 6000 features. The first stages remove unwanted rectangles rapidly to avoid paying the computational cost of the later stages, so that computation time is spent analyzing deeply only the parts of the image that have a high probability of containing the object.
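A back-of-the-envelope calculation shows why this pays off. Assuming a typical background window passes each stage with probability 0.4 (the illustrative false positive rate from above):

    # Features per stage for the first five stages cited above.
    features = [1, 10, 25, 25, 50]
    pass_rate = 0.4  # assumed chance a background window passes a stage

    expected = sum(f * pass_rate ** i for i, f in enumerate(features))
    # 1 + 10*0.4 + 25*0.16 + 25*0.064 + 50*0.0256 = ~11.9 features
    # evaluated on average, versus 111 if every stage always ran.
    print(expected)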

After this initial algorithm, it was understood that training the cascade as a whole can be optimized to achieve a desired accuracy goal with minimal complexity. Examples of such algorithms are RCBoost, ECBoost and RCECBoost. In their most basic versions, they can be understood as choosing, at each step, between adding a stage and adding a weak learner to a previous stage, whichever is less costly, until the desired accuracy has been reached. Since the classifier (and hence every stage) cannot have an accuracy below the desired total accuracy, this is a constrained optimization problem.
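In that most basic form, the optimization can be sketched as a greedy loop (pure pseudocode in Python syntax; add_stage, add_weak_learner, cost and accuracy are hypothetical placeholders, not the actual published algorithms):

    def grow_cascade(cascade, target_accuracy):
        # Greedily take the cheaper of the two possible moves until
        # the accuracy constraint is satisfied.
        while accuracy(cascade) < target_accuracy:
            a = add_weak_learner(cascade.last_stage())
            b = add_stage(cascade)
            cascade = a if cost(a) <= cost(b) else b
        return cascade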

Cascade classifiers are available in OpenCV, with pre-trained cascades for frontal faces. Training a cascade on a different object class is possible (search for training a Haar cascade, for example), but can currently take a few days.
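For example, running one of OpenCV's pre-trained frontal-face cascades from Python looks roughly like this (photo.jpg is a placeholder; the parameter values are common defaults):

    import cv2

    # Load a pre-trained frontal-face cascade shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("photo.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Scan rectangles at multiple scales; each detection is a
    # rectangle that passed every stage of the cascade.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)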

