Bootstrap aggregating
Bootstrap aggregating, usually abbreviated as bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
Description of the technique
Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n each set D_i is expected to contain the fraction (1 − 1/e) (≈63.2%) of the unique examples of D, the rest being duplicates.[1] This kind of sample is known as a bootstrap sample. Because each bootstrap sample is drawn without reference to the previously drawn samples, the m samples are independent of one another. The m models are then fitted using the m bootstrap samples and combined by averaging the outputs (for regression) or voting (for classification).
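A minimal sketch of this procedure in Python is shown below (using numpy and scikit-learn; the base learner, the synthetic data, and the choice of m = 25 are illustrative assumptions, not part of the description above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a training set D of size n: a noisy nonlinear signal.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

m, n = 25, len(X)                      # m bootstrap samples, each of size n' = n
models = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)   # sample n indices uniformly with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Regression: average the m predictions (for classification, take a majority vote).
X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
y_bagged = np.mean([model.predict(X_new) for model in models], axis=0)
```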

Bagging leads to "improvements for unstable procedures",[2] which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression.[3] Bagging was shown to improve preimage learning.[4][5] On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors.[2]
Process of the algorithm
Original Dataset:
The original dataset contains five samples, s1 through s5. Each sample has five features (Gene 1 to Gene 5) and is labeled Yes or No for a binary classification problem.
Samples | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Result |
---|---|---|---|---|---|---|
s1 | 1 | 0 | 1 | 0 | 0 | No |
s2 | 1 | 0 | 0 | 0 | 1 | No |
s3 | 0 | 1 | 1 | 0 | 1 | Yes |
s4 | 1 | 1 | 1 | 0 | 1 | Yes |
s5 | 0 | 0 | 0 | 1 | 1 | No |
Creation of Bootstrapped Datasets
To classify a new sample using the table above, a bootstrapped dataset must first be created from the data in the original dataset. A bootstrapped dataset is typically the same size as the original dataset, or smaller.
In this example, the size is five (s1 through s5). The bootstrapped dataset is created by randomly selecting samples from the original dataset, and repeat selections are allowed. Any samples that are not chosen for the bootstrapped dataset are placed in a separate dataset called the out-of-bag dataset.
An example bootstrapped dataset is shown below. It has five entries, the same size as the original dataset, and contains duplicated entries (s3 appears twice) because samples are selected randomly with replacement.
Samples | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Result |
---|---|---|---|---|---|---|
s1 | 1 | 0 | 1 | 0 | 0 | No |
s3 | 0 | 1 | 1 | 0 | 1 | Yes |
s2 | 1 | 0 | 0 | 0 | 1 | No |
s3 | 0 | 1 | 1 | 0 | 1 | Yes |
s4 | 1 | 1 | 1 | 0 | 1 | Yes |
This step is repeated to generate m bootstrapped datasets.
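A minimal sketch of this step (Python with numpy; only the sample names from the table above are used, and the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
samples = ["s1", "s2", "s3", "s4", "s5"]

# Draw a bootstrapped dataset of the same size as the original, with replacement.
boot_idx = rng.integers(0, len(samples), size=len(samples))
bootstrapped = [samples[i] for i in boot_idx]

# Samples that were never drawn form the out-of-bag dataset.
out_of_bag = [s for i, s in enumerate(samples) if i not in boot_idx]

print("bootstrapped dataset:", bootstrapped)   # duplicates are possible
print("out-of-bag dataset:  ", out_of_bag)
```

Repeating this draw m times yields the m bootstrapped datasets mentioned above.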
Creation of Decision Trees
A decision tree is then created for each bootstrapped dataset, using randomly selected columns (features) to split the nodes.
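One possible realization of this step with scikit-learn is sketched below; the training data are the Gene 1 to Gene 5 values from the table above, and restricting max_features makes each split consider only a random subset of the columns (the parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Gene 1-5 values and Yes/No labels for s1-s5, copied from the original table.
X = np.array([[1, 0, 1, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 1, 0, 1],
              [1, 1, 1, 0, 1],
              [0, 0, 0, 1, 1]])
y = np.array(["No", "No", "Yes", "Yes", "No"])

m = 10                                            # number of bootstrapped datasets / trees
trees = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt",  # random column subset per split
                                  random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))
```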
Wisdom of Crowds
The wisdom of crowds is the collective opinion of a group of individuals rather than that of a single expert.[6] A random forest built from multiple decision trees draws on this idea: each decision tree acts as a person in the crowd, each tree casts a vote, and aggregating these votes determines the result per the wisdom of the crowd.
In order to use this method, the crowd must satisfy the prerequisite criteria listed in the table below.
Criteria | Description |
---|---|
Diversity of Opinion | Each person should have private information even if it is just an eccentric interpretation of the known facts. |
Independence | People's opinions are not determined by the opinions of those around them. |
Decentralization | People are able to specialize and draw on their local knowledge. |
Aggregation | Some mechanism exists for turning private judgements into a collective decision. |
Trust | Each person trusts the collective group to be fair and valid. |
Although there are situations where the wisdom of the crowd is not accurate and produces a failed judgement, it yields a wise judgement when the above criteria are met. In a random forest, aggregating the votes of many decision trees in this way is therefore a reliable means of determining the overall result.
Predicting using Multiple Decision Trees
When a new sample is added to the table, the bootstrapped datasets are used to determine the new entry's class.
The new sample is run through the random forest built from the bootstrapped datasets, and each tree produces a prediction for it. For classification, a process called voting is used to determine the final result: the class predicted most frequently across the trees is assigned to the sample. For regression, the sample is assigned the average of the values predicted by the trees.
After the sample has been run through the random forest, a class is assigned to the sample and it is added to the table.
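A sketch of the voting step is shown below; it is self-contained, so the bootstrapping and tree fitting from the previous sketch are repeated, and the new sample's feature values are invented purely for illustration:

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Training data: the gene table from earlier in the article.
X = np.array([[1, 0, 1, 0, 0], [1, 0, 0, 0, 1], [0, 1, 1, 0, 1],
              [1, 1, 1, 0, 1], [0, 0, 0, 1, 1]])
y = np.array(["No", "No", "Yes", "Yes", "No"])

# One tree per bootstrapped dataset.
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

# A hypothetical new sample (feature values invented for illustration).
x_new = np.array([[1, 1, 0, 0, 1]])

# Each tree votes; the most frequent class wins. For regression, average instead.
votes = [tree.predict(x_new)[0] for tree in trees]
prediction = Counter(votes).most_common(1)[0][0]
print(votes, "->", prediction)
```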
Improving Random Forests and Bagging
Random forests are useful on their own, but there are certain methods that can be used to improve their training and voting time, their prediction accuracy, and their overall performance. The following are key steps in creating an efficient random forest (a configuration sketch follows the list):
- Specify the maximum depth of the trees: instead of allowing the random forest's trees to grow until all nodes are pure, cut them off at a certain depth to further decrease the chance of overfitting.
- Prune the dataset: an extremely large dataset may produce results that are less indicative of the problem at hand than a smaller set that more accurately represents what is being focused on.
- Continue pruning the data at each node split rather than only during the original bagging process.
- Decide on accuracy or speed: depending on the desired results, increasing or decreasing the number of trees within the forest can help. Increasing the number of trees generally provides more accurate results, while decreasing the number of trees provides quicker results.
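These steps map onto standard hyperparameters in common implementations; for example, scikit-learn's RandomForestClassifier exposes the knobs shown below (the specific values are illustrative assumptions, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative configuration; tune these values on your own data.
forest = RandomForestClassifier(
    n_estimators=200,      # more trees: usually more accurate, but slower to train and vote
    max_depth=10,          # cap tree depth instead of growing until every leaf is pure
    min_samples_leaf=5,    # a pruning-like constraint applied at every node split
    max_samples=0.8,       # each bootstrapped dataset uses 80% of the rows
    n_jobs=-1,             # fit the trees in parallel
    random_state=0,        # record the seed so the bootstrap draws can be reproduced
)
```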
Pros | Cons |
---|---|
There are fewer requirements for normalization and scaling overall, making the use of random forests more convenient.[8] | The model may change significantly if there is a slight change to the data being bootstrapped and used within the forests.[9] |
Data preparation is easy. | Random forests are more complex to implement than single decision trees or other algorithms. |
Consisting of multiple decision trees, forests are able to make predictions more accurately than single trees. | Requires much more time to train than a single decision tree. |
Works well with non-linear data. | Requires much more time to make predictions, since every tree in the random forest classifier must vote. |
Lower risk of overfitting, and runs efficiently on even large data sets.[10] | Requires much more computational power and computational resources. |
Use of the random forest classifier is popular because of its high accuracy and speed.[11] | Does not predict beyond the range of the training data. |
Deals well with missing data and datasets with many outliers. | Reproducing specific results requires keeping track of the exact random seed used to generate the bootstrap sets. |
Bootstrapping combines the predictions of many trees, so the aggregated vote within a random forest is more accurate than that of any single tree. | Random forests cannot guarantee optimal trees.[12] |
Algorithm (classification)

For classification, use a training set D, an inducer I, and the number of bootstrap samples m as input. Generate a classifier C* as output.[13]
- Create m new training sets D_i by sampling from D with replacement
- A classifier C_i is built from each set D_i, using I to determine the classification of D_i
- Finally, the classifier C* is generated by applying the previously created classifiers C_i to the original data set D; the classification predicted most often by the sub-classifiers is the final classification
for i = 1 to m {
    D' = bootstrap sample from D    (sample with replacement)
    C_i = I(D')
}
C*(x) = argmax_{y ∈ Y} #{i : C_i(x) = y}    (most often predicted label y)
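A direct Python translation of this pseudocode might look as follows; the inducer I is taken to be a scikit-learn decision tree purely for illustration, and any classifier could play that role:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_classifier(X, y, inducer, m, seed=0):
    """Fit m classifiers C_i on bootstrap samples of the training set (X, y)
    and return a function C* that predicts by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                         # D' = bootstrap sample from D
        classifiers.append(clone(inducer).fit(X[idx], y[idx]))   # C_i = I(D')

    def C_star(X_query):
        votes = np.stack([c.predict(X_query) for c in classifiers])  # shape (m, n_queries)
        predictions = []
        for column in votes.T:                             # votes for one query point
            labels, counts = np.unique(column, return_counts=True)
            predictions.append(labels[np.argmax(counts)])  # argmax_y #{i : C_i(x) = y}
        return np.array(predictions)

    return C_star

# Usage sketch on a tiny made-up dataset:
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
y = np.array(["Yes", "Yes", "No", "No"])
C = bagging_classifier(X, y, DecisionTreeClassifier(), m=25)
print(C(np.array([[1, 1], [0, 0]])))
```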
Example: ozone data
To illustrate the basic principles of bagging, below is an analysis of the relationship between ozone and temperature (data from Rousseeuw and Leroy (1986); analysis done in R).
The relationship between temperature and ozone appears to be nonlinear in this data set, based on the scatter plot. To mathematically describe this relationship, LOESS smoothers (with bandwidth 0.5) are used. Rather than building a single smoother for the complete data set, 100 bootstrap samples were drawn. Each sample is drawn with replacement from the original data and preserves a semblance of the original set's distribution and variability. For each bootstrap sample, a LOESS smoother was fit, and predictions from these 100 smoothers were then made across the range of the data. The black lines represent these initial predictions. The lines do not agree in their predictions and tend to overfit their data points, as is evident from their wobbly flow.

By taking the average of 100 smoothers, each corresponding to a subset of the original data set, we arrive at one bagged predictor (red line). The red line's flow is stable and does not overly conform to any data point(s).
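The analysis described above was done in R; the sketch below reproduces the idea in Python with statsmodels' lowess. The synthetic temperature/ozone-style data is only a stand-in for the Rousseeuw and Leroy dataset, which is not reproduced here, and the xvals argument assumes a reasonably recent statsmodels release:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)

# Synthetic stand-in for the temperature/ozone data: a nonlinear trend plus noise.
temperature = np.sort(rng.uniform(50, 100, size=100))
ozone = 0.02 * (temperature - 50) ** 2 + rng.normal(scale=5, size=100)

grid = np.linspace(temperature.min(), temperature.max(), 200)
smoothers = []
for _ in range(100):                                       # 100 bootstrap samples
    idx = rng.integers(0, len(temperature), size=len(temperature))
    # LOESS smoother (bandwidth 0.5) fitted to one bootstrap sample,
    # evaluated on a common grid so the fits can be averaged.
    smoothers.append(lowess(ozone[idx], temperature[idx], frac=0.5, xvals=grid))

# The bagged predictor is the pointwise average of the 100 smoothers.
bagged = np.mean(smoothers, axis=0)
```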
Advantages and disadvantages
Advantages:
- Many weak learners aggregated together typically outperform a single learner over the entire set, and the aggregate is less prone to overfitting
- Reduces the variance of high-variance, low-bias weak learners[14]
- Can be performed in parallel, as each separate bootstrap sample can be processed on its own before combination[15] (a parallel-fitting sketch follows the disadvantages below)
Disadvantages:
- For a weak learner with high bias, bagging will also carry that high bias into its aggregate[14]
- Loss of interpretability compared with a single model.
- Can be computationally expensive depending on the data set
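Because each bootstrap model can be fitted independently, implementations typically offer a parallelism switch; for example, scikit-learn's BaggingClassifier accepts n_jobs (the synthetic dataset below is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Illustrative synthetic dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The 50 bootstrap/estimator fits are independent, so they can run on all cores.
bag = BaggingClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)
```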
History
The concept of bootstrap aggregating is derived from the concept of bootstrapping, which was developed by Bradley Efron.[16] Bootstrap aggregating was proposed by Leo Breiman, who also coined the abbreviated term "bagging" (bootstrap aggregating). Breiman developed the concept of bagging in 1994 to improve classification by combining classifications of randomly generated training sets. He argued, "If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy".[3]
See also
- Boosting (meta-algorithm)
- Bootstrapping (statistics)
- Cross-validation (statistics)
- Out-of-bag error
- Random forest
- Random subspace method (attribute bagging)
- Resampled efficient frontier
- Predictive analysis: Classification and regression trees
References
- ^ Aslam, Javed A.; Popa, Raluca A.; Rivest, Ronald L. (2007). "On Estimating the Size and Confidence of a Statistical Audit". Proceedings of the Electronic Voting Technology Workshop (EVT '07), Boston, MA, August 6, 2007. More generally, when drawing with replacement n′ values out of a set of n (different and equally likely), the expected number of unique draws is n(1 − (1 − 1/n)^n′).
- ^ a b Breiman, Leo (1996). "Bagging predictors". Machine Learning. 24 (2): 123–140. CiteSeerX 10.1.1.32.9399. doi:10.1007/BF00058655. S2CID 47328136.
- ^ a b Breiman, Leo (September 1994). "Bagging Predictors" (PDF). Technical Report (421). Department of Statistics, University of California Berkeley. Retrieved 2019-07-28.
- ^ Sahu, A., Runger, G., Apley, D., Image denoising with a multi-phase kernel principal component approach and an ensemble version, IEEE Applied Imagery Pattern Recognition Workshop, pp.1-7, 2011.
- ^ Shinde, Amit, Anshuman Sahu, Daniel Apley, and George Runger. "Preimages for Variation Patterns from Kernel PCA and Bagging." IIE Transactions, Vol.46, Iss.5, 2014
- ^ "Wisdom of the crowd", Wikipedia, 2021-11-17, retrieved 2021-11-29
- ^ "The Wisdom of Crowds", Wikipedia, 2021-10-10, retrieved 2021-11-29
- ^ "Random Forest Pros & Cons". HolyPython.com. Retrieved 2021-11-26.
- ^ K, Dhiraj (2020-11-22). "Random Forest Algorithm Advantages and Disadvantages". Medium. Retrieved 2021-11-26.
- ^ Towards AI Team. "Why Choose Random Forest and Not Decision Trees". Towards AI. Retrieved 2021-11-26.
- ^ "Random Forest". Corporate Finance Institute. Retrieved 2021-11-26.
- ^ Towards AI Team. "Why Choose Random Forest and Not Decision Trees". Towards AI. Retrieved 2021-11-26.
- ^ Bauer, Eric; Kohavi, Ron (1999). "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants". Machine Learning. 36: 108–109. doi:10.1023/A:1007515423169. S2CID 1088806. Retrieved 6 December 2020.
- ^ a b "What is Bagging (Bootstrap Aggregation)?". CFI. Corporate Finance Institute. Retrieved December 5, 2020.
- ^ Zoghni, Raouf (September 5, 2020). "Bagging (Bootstrap Aggregating), Overview". The Startup – via Medium.
- ^ Efron, B. (1979). "Bootstrap methods: Another look at the jackknife". The Annals of Statistics. 7 (1): 1–26. doi:10.1214/aos/1176344552.
Further reading
- Breiman, Leo (1996). "Bagging predictors". Machine Learning. 24 (2): 123–140. CiteSeerX 10.1.1.32.9399. doi:10.1007/BF00058655. S2CID 47328136.
- Alfaro, E., Gámez, M. and García, N. (2012). "adabag: An R package for classification with AdaBoost.M1, AdaBoost-SAMME and Bagging".
- Kotsiantis, Sotiris (2014). "Bagging and boosting variants for handling classifications problems: a survey". Knowledge Engineering Review. 29 (1): 78–100. doi:10.1017/S0269888913000313.
- Boehmke, Bradley; Greenwell, Brandon (2019). "Bagging". Hands-On Machine Learning with R. Chapman & Hall. pp. 191–202. ISBN 978-1-138-49568-5.