
Random forest


Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman[1] and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forests, which were first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features, introduced independently by Ho[2][3] and by Amit and Geman,[4] in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement stochastic discrimination[5] proposed by Eugene Kleinberg.

History

The early development of random forests was influenced by the work of Amit and Geman,[4] which introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho[3] was also influential in the design of random forests. In this method a forest of trees is grown, and variation between the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Dietterich.[6]

The introduction of random forests proper was first made in a paper by Leo Breiman.[1] This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, the paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:

  1. Using out-of-bag error as an estimate of the generalization error.
  2. Measuring variable importance through permutation.

The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.

More recently, several major advances in this area have come from Microsoft Research,[7] which incorporate and extend the earlier work of Breiman.

Framework

It is better to think of random forests as a framework rather than as a particular model. The framework consists of several interchangeable parts which can be mixed and matched to create a large number of particular models, all built around the same central theme. Constructing a model in this framework requires making several choices, listed here and sketched in code after the list.

  1. The shape of the decision to use in each node.
  2. The type of predictor to use in each leaf.
  3. The splitting objective to optimize in each node.
  4. The method for injecting randomness into the trees.
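
As a rough illustration only, the following Python sketch bundles these four interchangeable choices into a single configuration object; the names and placeholder implementations are assumptions made for this sketch, not part of any standard random forest implementation.

    from dataclasses import dataclass
    from typing import Callable
    import numpy as np

    # Hypothetical configuration object bundling the four framework choices.
    @dataclass
    class ForestConfig:
        decision_shape: Callable     # e.g. axis-aligned threshold, linear split
        leaf_predictor: Callable     # e.g. class histogram, constant/mean
        split_objective: Callable    # e.g. information gain, Gini gain
        inject_randomness: Callable  # e.g. bootstrap resampling, feature subsampling

    # One possible instantiation, with deliberately simple placeholder parts.
    config = ForestConfig(
        decision_shape=lambda x, dim, thr: x[dim] <= thr,                        # axis-aligned test
        leaf_predictor=lambda y: np.bincount(y).argmax(),                        # majority class in the leaf
        split_objective=lambda y: 1.0 - np.sum((np.bincount(y) / len(y)) ** 2),  # Gini impurity
        inject_randomness=lambda n, rng: rng.integers(0, n, size=n),             # bootstrap indices
    )

    print(config.leaf_predictor(np.array([0, 1, 1, 2])))  # majority class -> 1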

The simplest type of decision to make at each node is to apply a threshold to a single dimension of the input. This is a very common choice and leads to trees that partition the space into hyper-rectangular regions. However, other decision shapes, such as linear or quadratic splits of a node, are also possible.
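
For concreteness, a minimal sketch of an axis-aligned (threshold) decision and an oblique (linear) decision; the function names are illustrative only.

    import numpy as np

    def axis_aligned_decision(x, dim, threshold):
        """Axis-aligned split: threshold a single input dimension."""
        return x[dim] <= threshold

    def linear_decision(x, weights, threshold):
        """Oblique split: threshold a linear combination of the input dimensions."""
        return np.dot(weights, x) <= threshold

    x = np.array([0.4, 1.7])
    print(axis_aligned_decision(x, dim=0, threshold=0.5))                    # True
    print(linear_decision(x, weights=np.array([1.0, -0.5]), threshold=0.0))  # 0.4 - 0.85 <= 0 -> True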

Leaf predictors determine how a prediction is made for a point, given that it falls in a particular cell of the space partition defined by the tree. Simple and common choices here include using a histogram for categorical data, or constant predictors for real valued outputs.
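
A brief sketch of these two simple leaf predictors, assuming the leaf is handed the training targets that fall in its cell; the names are illustrative only.

    import numpy as np

    def histogram_leaf(y_class):
        """Categorical output: the empirical class distribution of the leaf's training points."""
        counts = np.bincount(y_class)
        return counts / counts.sum()

    def constant_leaf(y_real):
        """Real-valued output: a constant prediction, here the mean of the leaf's training targets."""
        return np.mean(y_real)

    print(histogram_leaf(np.array([0, 0, 1, 2, 2, 2])))  # [0.333... 0.166... 0.5]
    print(constant_leaf(np.array([1.2, 0.8, 1.0])))      # 1.0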

In principle there is no restriction on the type of predictor that can be used; for example, one could fit a support vector machine or a spline in each leaf. In practice, however, this is uncommon for several reasons: if the tree is large then each leaf may have very few points, making it difficult to fit complex models, and the tree-growing procedure itself may be complicated if, for example, it is difficult to compute the splitting objective based on a complex leaf model. However, many of the more exotic generalizations of random forests, e.g. to density or manifold estimation, rely on replacing the leaf models.

The splitting objective is a function which is used to rank candidate splits of a leaf as the tree is being grown. This is commonly based on an impurity measure, such as the information gain or the Gini gain.
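
A sketch of these two objectives, each expressed as the decrease in impurity produced by a candidate split; the helper names are assumptions of this sketch.

    import numpy as np

    def gini_impurity(y):
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def entropy(y):
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def split_gain(y_parent, y_left, y_right, impurity=gini_impurity):
        """Impurity decrease of a candidate split: Gini gain or information gain."""
        n, n_l, n_r = len(y_parent), len(y_left), len(y_right)
        return impurity(y_parent) - (n_l / n) * impurity(y_left) - (n_r / n) * impurity(y_right)

    y = np.array([0, 0, 0, 1, 1, 1])
    print(split_gain(y, y[:3], y[3:]))                    # Gini gain of a perfect split: 0.5
    print(split_gain(y, y[:3], y[3:], impurity=entropy))  # information gain: 1.0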

The method for injecting randomness into each tree is the component of the random forests framework which affords the most freedom to model designers. Breiman's original algorithm achieves this in two ways (sketched in code after the list):

  1. Each tree is trained on a bootstrapped sample of the original data set.
  2. Each time a leaf is split, only a randomly chosen subset of the dimensions is considered for splitting.
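
A minimal sketch of these two sources of randomness; the helper names and the NumPy random generator are conveniences of this sketch, not part of Breiman's description.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_sample(X, y, rng):
        """Source 1: a bootstrap sample, drawing rows with replacement."""
        idx = rng.integers(0, len(X), size=len(X))
        return X[idx], y[idx]

    def random_feature_subset(n_features, m, rng):
        """Source 2: at each split, consider only m randomly chosen dimensions."""
        return rng.choice(n_features, size=m, replace=False)

    X = rng.normal(size=(6, 4))
    y = np.array([0, 1, 0, 1, 0, 1])
    Xb, yb = bootstrap_sample(X, y, rng)
    print(random_feature_subset(n_features=4, m=2, rng=rng))  # two distinct dimension indices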

In Breiman's model, once the dimensions are chosen the splitting objective is evaluated at every possible split point in each dimension and the best is chosen. This can be contrasted with the method of Criminisi et al.,[7] which performs no bootstrapping or subsampling of the data between trees, but uses a different approach for choosing the decisions in each node. Their model selects entire decisions at random (e.g. a dimension-threshold pair rather than a dimension alone). The optimization in each node is performed over a fixed number of these randomly selected decisions, rather than over every possible decision involving some fixed set of dimensions.
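
The randomized node optimization described above can be sketched as follows; this is a loose illustration of the idea (randomly drawn complete decisions scored by Gini gain), not the authors' published code.

    import numpy as np

    def gini(y):
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def best_of_random_decisions(X, y, n_candidates, rng):
        """Score a fixed number of randomly drawn (dimension, threshold) decisions
        and keep the best, instead of searching every split point in chosen dimensions."""
        best = None
        for _ in range(n_candidates):
            dim = int(rng.integers(X.shape[1]))
            thr = rng.uniform(X[:, dim].min(), X[:, dim].max())
            mask = X[:, dim] <= thr
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or gain > best[0]:
                best = (gain, dim, thr)
        return best  # (gain, dimension, threshold), or None if no valid candidate was drawn

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int)
    print(best_of_random_decisions(X, y, n_candidates=10, rng=rng))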

Breiman's algorithm

Each tree is constructed using the following algorithm (a code sketch follows the list):

  1. Let the number of training cases be N, and the number of variables in the classifier be M.
  2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
  3. Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases (the out-of-bag cases) to estimate the error of the tree, by predicting their classes.
  4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
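
The following Python sketch follows these steps using scikit-learn's decision trees as the base learner; scikit-learn is a convenience of this sketch rather than part of the original description, with the bootstrap sample implementing step 3 and max_features=m implementing step 4.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def grow_forest(X, y, n_trees, m, rng):
        """Grow a forest of fully grown, unpruned trees, each trained on a bootstrap
        sample and restricted to m randomly chosen variables at every node."""
        N = len(X)
        forest = []
        for _ in range(n_trees):
            idx = rng.integers(0, N, size=N)          # step 3: bootstrap sample of size N
            tree = DecisionTreeClassifier(
                max_features=m,                       # step 4: m random variables per node
                random_state=int(rng.integers(1 << 31)),
            )
            tree.fit(X[idx], y[idx])                  # step 5: grown fully, no pruning
            forest.append(tree)
        return forest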

For prediction, a new sample is pushed down each tree and assigned the label of the training samples in the terminal node it ends up in. This procedure is iterated over all trees in the ensemble, and the modal (majority) vote of all trees is reported as the random forest prediction.
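
Continuing the hypothetical grow_forest helper above, prediction by majority vote might be sketched as follows, assuming integer class labels.

    import numpy as np

    def forest_predict(forest, X_new):
        """Collect one vote per tree and report the mode (majority class) per sample."""
        votes = np.stack([tree.predict(X_new) for tree in forest])  # shape: (n_trees, n_samples)
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    forest = grow_forest(X, y, n_trees=25, m=2, rng=rng)  # grow_forest from the sketch above
    print(forest_predict(forest, X[:5]))
    print(y[:5])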

Features and Advantages

According to the original authors, the advantages of random forest are:[8]

  • It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.[9]
  • It runs efficiently on large databases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses (illustrated in the sketch following this list).
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It has methods for balancing error in data sets with unbalanced class populations.
  • Prototypes are computed that give information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) giving interesting views of the data.
  • The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
  • It offers an experimental method for detecting variable interactions.
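
Two of these capabilities, the internal out-of-bag error estimate and variable importance scores, are exposed by common implementations. The sketch below uses scikit-learn as an illustration; it is not the original authors' implementation, and scikit-learn's feature_importances_ is an impurity-based measure rather than the permutation importance described above (permutation importance is available separately, e.g. via sklearn.inspection.permutation_importance).

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data purely for illustration.
    X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)

    print("Out-of-bag accuracy (internal generalization estimate):", rf.oob_score_)
    print("Variable importances:", rf.feature_importances_.round(3))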

Disadvantages

  • Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.[10]
  • Unlike decision trees, the classifications made by random forests are difficult for humans to interpret.[11]
  • For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forests are not reliable for this type of data. Methods such as partial permutations have been used to address the problem.[12][13]
  • If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.[14]

Example

[Figures: training data consisting of two Gaussian point clouds; a visualization of the random forest model space after training on these data; and, for comparison, a logistic regression model trained on the same data.]

Consider a dataset consisting of 200 random points (100 green points and 100 red points), where the green points are drawn from a Gaussian distribution with centroid at (0,1) and the red points are drawn from a Gaussian distribution with centroid at (1,0), both with circular variance and an average radius of 1. After training a random forest model consisting of 50 trees on these data, the purity of the color at each location of the model space indicates the proportion of the 50 trees that voted in agreement. Significant overfitting can be observed in this visualization. For contrast, a logistic regression model (which is somewhat less prone to overfitting) trained on the same data is shown for comparison.
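
A sketch that approximately reproduces this setup; the random seed, the exact spread of the clouds, and the plotting step are not specified in the text, so the choices below are assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # 100 green points around (0, 1) and 100 red points around (1, 0); unit standard
    # deviation is used here as a stand-in for "circular variance with an average radius of 1".
    green = rng.normal(loc=(0.0, 1.0), scale=1.0, size=(100, 2))
    red = rng.normal(loc=(1.0, 0.0), scale=1.0, size=(100, 2))
    X = np.vstack([green, red])
    y = np.array([0] * 100 + [1] * 100)

    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    lr = LogisticRegression().fit(X, y)

    # Evaluate both models on a grid; for fully grown trees the averaged class
    # probabilities are roughly the fraction of the 50 trees voting for each class,
    # i.e. the "purity of the color" described above. Plotting (e.g. with matplotlib) is omitted.
    xx, yy = np.meshgrid(np.linspace(-3, 4, 200), np.linspace(-3, 4, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    rf_surface = rf.predict_proba(grid)[:, 1].reshape(xx.shape)
    lr_surface = lr.predict_proba(grid)[:, 1].reshape(xx.shape)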

Typically, random forest is best suited for use with categorical features, but continuous features were used in this illustration because they are easier to visualize.


References

  1. ^ a b Breiman, Leo (2001). "Random Forests". Machine Learning. 45 (1): 5–32. doi:10.1023/A:1010933404324.
  2. ^ Ho, Tin Kam (1995). Random Decision Forest (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282.
  3. ^ a b Ho, Tin Kam (1998). "The Random Subspace Method for Constructing Decision Forests" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832–844. doi:10.1109/34.709601.
  4. ^ a b Amit, Yali; Geman, Donald (1997). "Shape quantization and recognition with randomized trees" (PDF). Neural Computation. 9 (7): 1545–1588. doi:10.1162/neco.1997.9.7.1545.
  5. ^ Kleinberg, Eugene (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition" (PDF). Annals of Statistics. 24 (6): 2319–2349. doi:10.1214/aos/1032181157. MR 1425956.
  6. ^ Dietterich, Thomas (2000). "An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization". Machine Learning: 139–157.
  7. ^ a b Criminisi, Antonio; Shotton, Jamie; Konukoglu, Ender (2011). "Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning". Foundations and Trends in Computer Vision. 7: 81–227. doi:10.1561/0600000035.
  8. ^ [1]
  9. ^ Caruana, Rich; Karampatziakis, Nikos; Yessenalina, Ainur (2008). An empirical evaluation of supervised learning in high dimensions (PDF). Proceedings of the 25th International Conference on Machine Learning (ICML).
  10. ^ Segal, Mark R. (2004). Machine Learning Benchmarks and Random Forest Regression. Center for Bioinformatics & Molecular Biostatistics.
  11. ^ Berthold, Michael R. (2010). Guide to Intelligent Data Analysis. Springer London.
  12. ^ Deng, H.; et al. (2011). Bias of importance measures for multi-valued attributes and solutions. Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300.
  13. ^ Altmann A, Tolosi L, Sander O, Lengauer T (2010). "Permutation importance: a corrected feature importance measure". Bioinformatics. doi:10.1093/bioinformatics/btq134.
  14. ^ Tolosi L, Lengauer T (2011). "Classification with correlated features: unreliability of feature ranking and solutions". Bioinformatics. doi:10.1093/bioinformatics/btr300.