Alternating decision tree

An Alternating Decision Tree (ADTree) is a machine learning method for classification. The ADTree data structure and algorithm are a generalization of decision trees and have connections to boosting.

History

ADTrees were introduced by Yoav Freund and Llew Mason^[1]. However, the algorithm as presented had several typographical errors. Clarifications and optimizations were later presented by Bernhard Pfahringer, Geoffrey Holmes and Richard Kirkby ^[2]. Implementations are available in Weka and JBoost.

Motivation

Original boosting algorithms typically used either decision stumps or decision trees as weak hypotheses. As an example, boosting decision stumps creates a set of $T$ weighted decision stumps (where $T$ is the number of boosting iterations), which then vote on the final classification according to their weights. Individual decision stumps are weighted according to their ability to classify the data.

Boosting a simple learner results in an unstructured set of $T$ hypotheses, making it difficult to infer correlations between attributes. Alternating decision trees introduce structure to the set of hypotheses by requiring that they build off a hypothesis that was produced in an earlier iteration. The resulting set of hypotheses can be visualized in a tree based on the relationship between a hypothesis and its "parent."

Another important feature of boosted algorithms is that the data is given a different distribution at each iteration. Instances that are misclassified are given a larger weight while accurately classified instances are given reduced weight.

Description of the structure

The alternating decision tree structure consists of two components: decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes specify a value to add to the score based on the result of the decision node. Each decision node can be seen as a conjunction between a precondition (the decision node was reached) and the condition specified in the decision node.

Perhaps the easiest way to understand the interaction of decision and prediction nodes is through an example.The following example is taken from JBoost performing boosting for 6 iterations on the spambase dataset (available from the UCI Machine Learning Repository). Positive examples indicate that the message is spam and negative examples are not spam. During each iteration, a single node is added to the ADTree. The ADTree determined by the learning algorithm implemented in JBoost is:

The tree construction algorithm is described below in the Description of the algorithm section. We now show how to interpret the tree once it has been constructed. We focus on one specific instance:

An instance to be classified
Feature	Value
char_freq_bang	0.08
word_freq_hp	0.4
capital_run_length_longest	4
char_freq_dollar	0
word_freq_remove	0.9
word_freq_george	0
Other features	...

For this instance, we obtain a score that determines the classification of the instance. This score not only acts as a classification, but also as a measure of confidence. The actual order that the ADTree nodes are evaluated will likely be different than the order in which they were created. That is, the node from iteration 4 can be evaluated before the node from iteration 1. There are constraints to this (e.g. node from iteration 2 must be evaluated before the node from iteration 5). In general, either breadth-first or depth-first evaluation will yield the correct interpretation.

The following table shows how the score is created (progressive score) for our above example instance:

Score for the above instance
Iteration	0	1	2	3	4	5	6
Instance values	N/A	.08 < .052 = n	.4 < .195 = n	0 < .01 = y	0 < 0.005 = y	N/A	.9 < .225 = n
Prediction	-0.093	0.74	-1.446	-0.38	0.176	0	1.66
Progressive Score	-0.093	0.647	-0.799	-1.179	-1.003	-1.003	0.657

There are a few observations that we should make

The final classification of the example is positive (0.657), meaning that the example is considered to be spam.
All nodes at depth 1 have their predicate evaluated and one of their prediction nodes contributes to the score. Thus a tree with depth 1 is the equivalent of boosted decision stumps.
If a decision node is not reached (the node from iteration 5 in the above example) then the node's predicate and subsequent prediction nodes will not be evaluated.

Description of the algorithm

The alternating decision tree learning algorithm is described in the original paper^[1]. The general idea involves a few main concepts:

The root decision node is always TRUE or FALSE
The tree is grown iteratively. The total number of iterations is generally decided prior to starting the algorithm.
Each decision node ( $c_{2}$ ) is selected by the algorithm based on how well it discriminates between positive and negative examples.
Once a decision node is created, the prediction node is determined by how well the decision node discriminates.

Before the algorithm, we first define some notation. Let $c$ be a predicate, then

$W_{+}(c)$ is the weight of all positively labeled examples that satisfy $c$
$W_{-}(c)$ is the weight of all negatively labeled examples that satisfy $c$
$W(c)$ is the weight of all examples that satisfy $c$
We call $c$ a precondition when it is a conjunction of previous base conditions and negations of previous base conditions

The exact algorithm is:

INPUT: $m$ examples and labels $(x_{1},y_{1}),\ldots ,(x_{m},y_{m})$

Set the weight of all examples to $W_{1}(x_{j})=1/m$

Set the margin of all examples to $r_{1}(x_{j})=0$

The root decision node is always $c=$ TRUE, with a single prediction node

$a={\frac {1}{2}}{\textrm {ln}}{\frac {W_{+}(T)}{W_{-}(T)}}$

For $t=1,\ldots ,T$ do:

Let $c_{1}\in P_{t}$ be a precondition (that is, the node being created can be reached via $c_{1}$ ) and $c_{2}$ be a condition (the new node). Then each decision node ( $c_{2}$ ) is selected by the algorithm based on how well it discriminates between positive and negative examples. The original ADTree algorithm minimizes the criterion $2\left({\sqrt {W_{+}(c_{1}\wedge c_{2})W_{-}(c_{1}\wedge c_{2})}}+{\sqrt {W_{+}(c_{1}\wedge \neg c_{2})W_{-}(c_{1}\wedge \neg c_{2})}}\right)+W(\neg c_{2})$ .
Once a decision node is created, the prediction nodes are determined by $a={\frac {1}{2}}{\textrm {ln}}{\frac {W_{+}(c_{1}\wedge c_{2})}{W_{-}(c_{1}\wedge c_{2})}}$ and $b={\frac {1}{2}}{\textrm {ln}}{\frac {W_{+}(c_{1}\wedge \neg c_{2})}{W_{-}(c_{1}\wedge \neg c_{2})}}$
Add the conditions $c_{1}\wedge c_{2}$ and $c_{1}\wedge \neg c_{2}$ to the set of possible preconditions $P_{t+1}$
Update the weights: $W_{t+1}(x_{j})=W_{t}(x_{j})e^{-r_{t}(x_{j})y_{j}}$

Empirical Results

Figure 6 in the original paper^[1] demonstrates that ADTrees are typically as robust as boosted decision trees and boosted decision stumps.

References

^ ^a ^b ^c Yoav Freund and Llew Mason. The Alternating Decision Tree Algorithm. Proceedings of the 16th International Conference on Machine Learning, pages 124-133 (1999)
^ Bernhard Pfahringer and Geoffrey Holmes and Richard Kirkby. Optimizing the Induction of Alternating Decision Trees. Proceedings of the Fifth Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2001, pp. 477-487

External links

An introduction to Boosting and ADTrees (Has many graphical examples of alternating decision trees in practice).
JBoost software implementing ADTrees.

[Freund99-1] Yoav Freund and Llew Mason. The Alternating Decision Tree Algorithm. Proceedings of the 16th International Conference on Machine Learning, pages 124-133 (1999)

[Pfahringer-2] Bernhard Pfahringer and Geoffrey Holmes and Richard Kirkby. Optimizing the Induction of Alternating Decision Trees. Proceedings of the Fifth Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. 2001, pp. 477-487

[1]

[2]