Cascading classifiers
Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output of a given classifier as additional information for the next classifier in the cascade. Unlike voting or stacking ensembles, which are multiexpert systems, cascading is a multistage one.
The first cascading classifier was the face detector of Viola and Jones (2001). The requirement was that the classifier be fast enough to run on low-power CPUs, such as those in cameras and phones.
Basic algorithm
The algorithm can be summed up as follows. For every rectangle in the image:
- stage 1: is there a face in the current rectangle according to classifier 1? If yes, go to stage 2; if no, the rectangle does not contain a face.
- stage 2: is there a face in the current rectangle according to classifier 2? If yes, go to stage 3; if no, the rectangle does not contain a face.
...
- stage n: is there a face in the current rectangle according to classifier n? If yes, there is a face in the current rectangle.
The question 'is there a face in the current rectangle according to classifier k?', for stage numbers k ranging from 1 to n, is answered by a collection of weak learners: simple rules that cannot perform the classification on their own, but are expressive enough to do so when combined. For example, the 'face in current rectangle' score of stage 1 might be 0.3 × (center pixels − left pixels − right pixels) + 0.5 × (rest of the pixels − pixels in the top third) > 0.8. The first term expresses the fact that the nose catches the light (the center will be light and the sides dark, giving a high numerical result), and the second that the eyebrows are darker than the rest of the face.
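The resulting sliding-window evaluation can be sketched as follows. This is only a minimal illustration of the control flow: the stage scores and thresholds (stage_score, STAGE_THRESHOLDS) are toy placeholders standing in for learned stages, not the actual Viola–Jones features, and the image is assumed to be a 2-D grayscale numpy array.

```python
# Toy per-stage acceptance thresholds; in a real detector these are learned.
STAGE_THRESHOLDS = [0.8, 1.2, 2.5]

def stage_score(stage, window):
    """Stand-in for one stage's weighted sum of weak-learner responses.

    Stage 1 mimics the example above:
    0.3*(center - left - right) + 0.5*(rest - top third).
    """
    h, w = window.shape
    center = window[:, w // 3:2 * w // 3].mean()
    left = window[:, :w // 3].mean()
    right = window[:, 2 * w // 3:].mean()
    top_third = window[:h // 3, :].mean()
    rest = window[h // 3:, :].mean()
    if stage == 0:
        return 0.3 * (center - left - right) + 0.5 * (rest - top_third)
    # Later stages would use more, and more discriminative, features.
    return center - 0.5 * (left + right)

def window_contains_face(window):
    """Run the window through every stage; reject as soon as one stage fails."""
    for stage, threshold in enumerate(STAGE_THRESHOLDS):
        if stage_score(stage, window) <= threshold:
            return False   # early rejection: most windows stop at an early stage
    return True            # survived every stage: report a face

def detect_faces(image, size=24, step=4):
    """Slide a fixed-size window over a grayscale image (2-D numpy array)."""
    detections = []
    for y in range(0, image.shape[0] - size, step):
        for x in range(0, image.shape[1] - size, step):
            if window_contains_face(image[y:y + size, x:x + size]):
                detections.append((x, y, size, size))
    return detections
```

The essential property is the early exit: the vast majority of windows are rejected by the first, very cheap test, so the expensive later stages only run on a small fraction of the image.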
Algorithm properties
Scaling and rotations
It can be seen from this description that the classifier will not accept faces that are upside down (the eyebrows are not in the correct position) or faces seen from the side (the nose is no longer in the center, and the shadows on the side of the nose may be missing). Separate cascade classifiers have to be trained for every rotation that is not in the image plane (side view of the face), and the classifier has to be retrained, or run on rotated features, for every rotation that is in the image plane (face upside down or tilted to the side). Scaling is not a problem, since the features can be scaled (center pixels, left pixels and right pixels have a dimension only relative to the rectangle examined). In more recent cascades, comparing the pixel values of one part of a rectangle with another has been replaced by Haar wavelets, which extend the idea: the pixels are multiplied by a weighting function over a given window, so (center pixels − left pixels − right pixels) is replaced, for example, by a cosine between −π and π over the window, which gives a negative weight to the left and right parts and a positive weight in the center.
Stage properties
To have good overall performance, the following criteria must be met:
- each stage must accept all faces, and may produce many false positives. For example, if stage 1 marked as 'does not contain a face' 20% of the rectangles that do contain a face (a false negative rate of 20%), then the total performance of the chain could not be higher than 80% true positives, whatever the later stages do, since 20% of the faces have already been rejected.
- this suggests that a good stage should have 100% true positives and, for example, 40% false positives: it accepts all rectangles containing faces while erroneously marking many other rectangles as potentially containing a face, to be eliminated by later stages. Note that 100% true positives and 40% false positives in the first stage still leave many non-faces: if only 1 in 1,000 rectangles in an image contains a face, there will still be roughly 400 false candidates for every true face after the first stage.
- if the first stage is very fast (a few operations), then 60% of the rectangles that do not contain a face have been eliminated at very little cost.
The training procedure for one stage is therefore to take many weak learners (simple pixel-difference operators) and train them as a group (raising their weight when they give the correct result), while keeping only a few weak learners active so that the computation time remains low.
The first detector of Viola & Jones had 38 stages, with 1 feature in the first stage, then 10, 25, 25 and 50 in the next four stages, for a total of about 6,000 features. The first stages remove unwanted rectangles rapidly to avoid paying the computational cost of the later stages, so that computation time is spent analysing in depth the parts of the image that have a high probability of containing the object.
Cascade training
Cascades are usually trained with cost-aware AdaBoost. The sensitivity threshold (0.8 in our example) can be adjusted so that close to 100% of the true positives are kept while some false positives remain. The procedure is then started again for stage 2, and so on, until the desired accuracy or computation time is reached.
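A hedged sketch of this per-stage step, using scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 decision stump) and a quantile of the positive scores to set the acceptance threshold; the function names and the 99.5% target are illustrative choices, not part of any particular reference implementation.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_stage(X, y, target_detection_rate=0.995, n_weak_learners=10):
    """Train one cascade stage on numpy arrays X (window features) and y (1 = face).

    The default weak learner of AdaBoostClassifier is a depth-1 decision stump.
    After boosting, the acceptance threshold is lowered until the stage keeps
    (almost) all faces, accepting the extra false positives this causes.
    """
    stage = AdaBoostClassifier(n_estimators=n_weak_learners).fit(X, y)

    # Boosted scores of the positive (face) windows.
    face_scores = stage.decision_function(X[y == 1])
    # Threshold chosen so that target_detection_rate of faces score above it.
    threshold = np.quantile(face_scores, 1.0 - target_detection_rate)
    return stage, threshold

def stage_accepts(stage, threshold, X):
    """A window passes the stage if its boosted score clears the threshold."""
    return stage.decision_function(X) >= threshold
```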
After the initial algorithm, it was understood that training the cascade as a whole can be optimized, to achieve a desired true detection rate with minimal complexity. Examples of such algorithms are RCBoost, ECBoost and RCECBoost. In their most basic versions, they can be understood as choosing, at each step, between adding a stage or adding a weak learner to a previous stage, whichever is less costly, until the desired accuracy has been reached. No stage of the classifier can have a detection rate (sensitivity) below the desired rate, so this is a constrained optimization problem. To be precise, the total sensitivity will be the product of the stage sensitivities.
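As an illustration of this product, a 38-stage cascade in which every stage keeps 99.7% of the faces that reach it (an illustrative figure, not the one used in the original detector) has an overall sensitivity of at most 0.997^38 ≈ 0.89, which is why per-stage detection rates have to be pushed very close to 1.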
Cascade classifiers are available in OpenCV, with already-trained cascades for frontal faces. Training a cascade on a different object is possible (search for 'training a Haar cascade', for example), but can currently take a few days.
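A minimal usage sketch with the pretrained frontal-face cascade that ships with the opencv-python package; the image path is a placeholder, and the scaleFactor and minNeighbors values are common choices rather than the only reasonable ones.

```python
import cv2

# Load the pretrained frontal-face cascade shipped with opencv-python.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("photo.jpg")                     # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale runs the cascade over sliding windows at several scales.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each accepted window and save the result.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", image)
```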
Cascading classifiers in statistics
The term is also used in statistics to describe a model that is staged. For example, a classifier (for example k-means) takes a vector of features (decision variables) and outputs, for each possible classification result, the probability that the vector belongs to that class. This output is usually used to take a decision (classify into the class with the highest probability), but cascading classifiers use it as the input to another model (another stage). This is particularly useful for models that have highly combinatorial or counting rules (for example, class1 if exactly two features are negative, class2 otherwise), which cannot be fitted without looking at all the interaction terms. Cascading classifiers enable the successive stages to gradually approximate the combinatorial nature of the classification, or to add interaction terms to classification algorithms that cannot express them in one stage.
As a simple example, if we try to match the rule (class1 if exactly 2 features out of 3 are negative, class2 otherwise), a decision tree would be:
- feature 1 negative
  - feature 2 negative
    - feature 3 negative -> class2
    - feature 3 positive -> class1
  - feature 2 positive
    - feature 3 negative -> class1
    - feature 3 positive -> class2
- feature 1 positive
  - feature 2 negative
    - feature 3 negative -> class1
    - feature 3 positive -> class2
  - feature 2 positive
    - feature 3 negative -> class2
    - feature 3 positive -> class2
The tree needs all the combinations of possible leaves to express the full ruleset, whereas (feature1 positive, feature2 negative) and (feature1 negative, feature2 positive) should actually lead to the same rule. This gives a tree with too few samples on each leaf. A two-stage algorithm can effectively merge these two cases by giving a medium-high probability to class1 if feature1 or (exclusively) feature2 is negative. The second classifier can pick up this higher probability and make a decision on the sign of feature3.
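A minimal sketch of such a two-stage cascade on this three-feature rule, in the spirit of cascade generalization: a first-stage logistic regression produces a class-1 probability that is appended to the original features before a second-stage decision tree makes the final decision. The data generation and the choice of models are illustrative assumptions, not prescribed by any particular reference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data for the rule: class 1 iff exactly two of the three features are negative.
X = rng.normal(size=(2000, 3))
y = ((X < 0).sum(axis=1) == 2).astype(int)

# Stage 1: a simple model whose class-1 probability captures part of the rule
# (roughly, how strongly the features look negative overall).
stage1 = LogisticRegression().fit(X, y)
p1 = stage1.predict_proba(X)[:, [1]]

# Stage 2: original features plus the stage-1 probability as an extra input,
# letting the second model add the missing interaction with feature 3.
stage2 = DecisionTreeClassifier(max_depth=3).fit(np.hstack([X, p1]), y)

# Prediction runs the same two stages in order.
X_new = rng.normal(size=(5, 3))
p1_new = stage1.predict_proba(X_new)[:, [1]]
print(stage2.predict(np.hstack([X_new, p1_new])))
```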
In a bias-variance decomposition, cascaded models are usually seen as lowering bias while raising variance.
See also
FCBoost
FCBoost is a boosting algorithm for learning detector cascades for face detection, proposed by Mohammad Saberian and Nuno Vasconcelos[1] in 2014. It is based on the Viola and Jones algorithm[2][3].
Motivation
Paul Viola and Michael Jones[4] proposed an influential cascade learning classifier based on AdaBoost, but it did not solve how to determine the configuration of the cascade, i.e. the number of stages and the number of features per stage of the AdaBoost classifiers. In Rapid Object Detection using a Boosted Cascade of Simple Features, Viola and Jones give only a crude recipe: the numbers of features in the first five layers of the detector are 1, 10, 25, 25 and 50, and the remaining layers have increasingly more features.
Contents of the algorithm
FCBoost is intended to solve the problem of learning the cascade configuration. It can automatically learn both the configuration and the stages of a high-detection-rate detector cascade. The algorithm is derived from a Lagrangian risk that trades off detection performance and speed, and FCBoost optimizes this risk with respect to a predictor that complies with the sequential decision-making structure of the cascade architecture.
Two new families of boosting predictors
Last-stage cascades
The first family of cascade predictors is derived from a generator with an associated predictor recursion in which the stage predictors are built independently of one another (see the last-stage iteration example below).
Multiplicative cascades
The second family of cascade predictors has a generator whose predictor recursion couples consecutive stages multiplicatively (see the multiplicative cascade iteration example below).
How to Learn the Cascade Configuration
This consists of determining the number of cascade stages and the number of weak learners per stage. It is accomplished by minimizing a Lagrangian risk that trades off the accuracy of the detector against its complexity, since a more accurate detector generally requires more computation. The Lagrangian risk combines the classification risk of the cascade predictor with a complexity penalty weighted by a multiplier η. Using a greedy strategy, the boosting algorithm itself adds a cascade stage whenever doing so reduces the Lagrangian risk, and a new stage can only be added at the end of the existing cascade. Because it often happens that no such update is taken, the authors introduce the concept of neutral predictors.
Cascade Learning Algorithm
FCBoost is initialized with a neutral predictor. At each iteration, it finds the best weak learner to add to each of the existing cascade stages and the best stage to add at the end of the cascade, and selects the update that reduces the Lagrangian risk the most. The only parameter is the multiplier η in the Lagrangian risk, which sets the trade-off between complexity and accuracy. FCBoost can therefore automatically determine both the cascade configuration (the number of stages and the number of weak learners per stage, through the iterations) and the predictor of each stage, so as to optimize the trade-off between detection accuracy and complexity for the given parameter η.
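The greedy selection step can be sketched conceptually as follows. This is not the actual FCBoost implementation: the cascade representation (lists of decision stumps), the complexity measure (a simple count of weak learners), the initial "accept everything" cascade standing in for the neutral predictor, and the value of η are all simplifying assumptions, used only to illustrate "evaluate every candidate update and keep the one that most reduces classification error + η × complexity".

```python
import numpy as np

ETA = 0.02  # Lagrange multiplier: relative cost of extra computation (illustrative)

def stage_score(stage, x):
    """Response of one stage: a sum of decision stumps (feature, threshold, sign)."""
    return sum(sign * (1.0 if x[f] > t else -1.0) for f, t, sign in stage)

def cascade_predict(cascade, x):
    """Sequential decision: any stage with a negative score rejects the example."""
    for stage in cascade:
        if stage_score(stage, x) < 0:
            return 0
    return 1

def lagrangian_risk(cascade, X, y):
    """Classification error plus ETA times a crude complexity measure."""
    error = np.mean([cascade_predict(cascade, x) != yi for x, yi in zip(X, y)])
    complexity = sum(len(stage) for stage in cascade)
    return error + ETA * complexity

def greedy_step(cascade, stump_pool, X, y):
    """One greedy iteration: try adding each candidate stump to every existing
    stage, or as a new stage at the end; keep the update (if any) that lowers
    the Lagrangian risk the most, otherwise leave the cascade unchanged."""
    best, best_risk = cascade, lagrangian_risk(cascade, X, y)
    for stump in stump_pool:
        candidates = [cascade[:k] + [cascade[k] + [stump]] + cascade[k + 1:]
                      for k in range(len(cascade))]
        candidates.append(cascade + [[stump]])  # open a new stage at the end
        for candidate in candidates:
            risk = lagrangian_risk(candidate, X, y)
            if risk < best_risk:
                best, best_risk = candidate, risk
    return best
```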
The two cascade predictors (last-stage and multiplicative) cover the two predominant cascade structures. The first (last-stage) has independent stages, i.e. the stage predictors are designed independently (Viola and Jones, 2001). The second (multiplicative) has embedded stages, in which the predictors of consecutive stages are related through a recursion involving w(x), a single weak learner or a linear combination of weak learners.
Last-stage:
itr: 1) g1
itr: 2) g1 -> g1 + g2
itr: 3) g1 + g3 -> g1 + g2
itr: 4) g1 + g3 -> g1 + g2 -> g1 + g2 + g4
Multiplicative Cascades:
itr: 1) 1 + g1
itr: 2) 1 + g1 -> 1 + g2
itr: 3) 1 + g1 + g3 -> 1 + g2
itr: 4) 1 + g1 + g3 -> 1 + g2 -> 1 + g4
Conclusion
FCBoost is a cascade boosting algorithm proposed by Saberian and Vasconcelos. It determines the cascade learning configuration automatically, given only a Lagrange multiplier η that expresses the relative importance of detection accuracy and speed. The algorithm is efficient to implement and was reported to give good performance.
References
- ^ Saberian, M.; Vasconcelos, N. "Boosting Algorithms for Detector Cascade Learning".
- ^ Viola, P.; Jones, M. (2001). "Rapid Object Detection using a Boosted Cascade of Simple Features".
- ^ Viola, P.; Jones, M. (2001). "Robust Real-time Object Detection". IJCV. See page 11.
- ^ Viola, P.; Jones, M. (2001). "Robust Real-time Object Detection". IJCV. See pages 1, 3.
- Gama, J.; Brazdil, P. (2000). "Cascade Generalization". Machine Learning. 41 (3): 315–343. doi:10.1023/a:1007652114878. CiteSeerx: 10.1.1.46.635.
- Minguillón, J. (2002). On Cascading Small Decision Trees (PhD thesis). Universitat Autònoma de Barcelona.
- Zhao, H.; Ram, S. (2004). "Constrained Cascade Generalization of Decision Trees". IEEE Transactions on Knowledge and Data Engineering. 16 (6): 727–739. doi:10.1109/tkde.2004.3.