Random binary tree

In computer science and probability theory, a random binary tree refers to a binary tree selected at random from some probability distribution on binary trees. Two different distributions are commonly used, both on trees with a fixed number n of nodes: binary trees formed by inserting nodes one at a time according to a random permutation, and binary trees chosen from a uniform discrete distribution in which all distinct trees are equally likely. Adding and removing nodes directly in a random binary tree will in general disrupt its random structure, but the treap and related randomized binary search tree data structures use the principle of binary trees formed from a random permutation in order to maintain a balanced binary search tree dynamically as nodes are inserted and deleted.

Binary trees from random permutations

For any set of numbers (or more generally values from some total order), one may form a binary search tree in which each number is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted numbers. The position into which each number should be inserted is uniquely determined by a binary search in the tree formed by the previous numbers. For instance, if the three numbers (1,3,2) are inserted into a tree in that sequence, the number 1 will be placed at the root of the tree (the only possible structure for a single-node tree), the number 3 will be placed as its right child, and the number 2 will be placed as the left child of the number 3. There are six different permutations of the numbers (1,2,3), but only five trees that may be constructed from them; one of the trees (the one with 2 at its root) has probability 1/3 of being formed from a random permutation of these three numbers, twice the probability of each of the other four trees.

Expected depth of a node

For any fixed choice of a value x in the given set of numbers, if one randomly permutes the numbers and forms a binary tree from them as described above, the expected value of the length of the path from the root of the tree to x is at most 2 log x + O(1), where “log” denotes the natural logarithm function and the O introduces big O notation. For, the expected number of ancestors of x is by linearity of expectation equal to the sum, over all other values y in the set, of the probability that y is an ancestor of x. And a value y is an ancestor of x exactly when y is the first element to be inserted from the elements in the interval [x,y]. Thus, the values that are adjacent to x in the sorted sequence of values have probability 1/2 of being an ancestor of x, the values one step away have probability 1/3, etc. Adding these probabilities for all positions in the sorted sequence gives twice a Harmonic number, leading to the bound above. A bound of this form holds also for the expected search length of a path to a fixed value x that is not part of the given set.^[1]

The longest path

Although not as easy to analyze as the average path length, there has also been much research on determining the expectation (or high probability bounds) of the length of the longest path in a binary search tree generated from a random insertion order. It is now known that this length, for a tree with n nodes, is almost surely

{\frac {1}{\beta }}\ln n\approx 4.312\ln n,

where β is the unique number in the range 0 < β < 1 satisfying the equation

2\beta e^{1-\beta }=1.

^[2]

Treaps and randomized binary search trees

In applications of binary search tree data structures, it is rare for the values in the tree to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow insertions and deletions to be performed in a binary search tree, at each step maintaining as an invariant the property that the shape of the tree is a random variable with the same distribution as a random binary search tree.

If a given set of ordered numbers is assigned numeric priorities (distinct numbers unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, a binary tree that has as its inorder traversal sequence the sorted sequence of the numbers and that is heap-ordered by priorities. Although more efficient construction algorithms are known, it is helpful to think of a Cartesian tree as being constructed by inserting the given numbers into a binary search tree in priority order. Thus, by choosing the priorities either to be a set of independent random real numbers in the unit interval, or by choosing them to be a random permutation of the numbers from 1 to n (where n is the number of nodes in the tree), and by maintaining the heap ordering property using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree.^[3]

Uniformly random binary trees

The number of binary trees with n nodes is a Catalan number: for n = 1, 2, 3, ... these numbers of trees are

1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, … (sequence A000108 in the OEIS).

Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. This model of random trees is not generally used for binary search trees, but it has been applied to problems of modeling the parse trees of algebraic expressions in compiler design^[4] and for modeling evolutionary trees.^[5] In some cases the analysis of random binary trees under the random permutation model can be automatically transferred to the uniform model.^[6]

Notes

^ Hibbard (1962); Knuth (1973); Mahmoud (1992), p. 75.
^ Robson (1979); Pittel (1985); Devroye (1986); Mahmoud (1992), pp. 91–99; Reed (2003).
^ Martinez & Roura (1992); Seidel & Aragon (1996).
^ Mahmoud (1992), p. 63.
^ Aldous (1996).
^ Mahmoud (1992), p. 70.

References

Aldous, David (1996), "Probability distributions on cladograms", in Aldous, David; Pemantle, Robin (eds.), Random Discrete Structures, The IMA Volumes in Mathematics and its Applications, vol. 76, Springer-Verlag, pp. 1–18.
Devroye, Luc (1986), "A note on the height of binary search trees", Journal of the ACM, 33 (3): 489–498, doi:10.1145/5925.5930.
Hibbard, T. (1962), "Some combinatorial properties of certain trees with applications to searching and sorting", Journal of the ACM, 9 (1): 13–28, doi:10.1145/321105.321108.
Knuth, Donald M. (1973), "6.2.2 Binary Tree Searching", The Art of Computer Programming, vol. III, Addison-Wesley, pp. 422–451.
Mahmoud, Hosam M. (1992), Evolution of Random Search Trees, John Wiley & Sons.
Martinez, Conrado; Roura, Salvador (1998), "Randomized binary search trees", Journal of the ACM, 45 (2), ACM Press: pp. 288-323, doi:10.1145/274787.274812 {{citation}}: |pages= has extra text (help).
Pittel, B. (1985), "Asymptotical growth of a class of random trees", Annals of Probability, 13 (2): 414–427, doi:10.1214/aop/1176993000.
Reed, Bruce (2003), "The height of a random binary search tree", Journal of the ACM, 50 (3): 306–332, doi:10.1145/765568.765571.
Robson, J. M. (1979), "The height of binary search trees", Australian Computer Journal, 11: 151–153.
Seidel, Raimund; Aragon, Cecilia R. (1996), "Randomized Search Trees", Algorithmica, 16 (4/5): pp. 464-497, doi:10.1007/s004539900061 {{citation}}: |pages= has extra text (help).

[1] Hibbard (1962); Knuth (1973); Mahmoud (1992), p. 75.

[2] Robson (1979); Pittel (1985); Devroye (1986); Mahmoud (1992), pp. 91–99; Reed (2003).

[3] Martinez & Roura (1992); Seidel & Aragon (1996).

[4] Mahmoud (1992), p. 63.

[5] Aldous (1996).

[6] Mahmoud (1992), p. 70.

[1]

[2]

[3]

[4]

[5]

[6]