Talk:Random binary tree/GA1

GA Review

Article (edit | visual edit | history) · Article talk (edit | history) · Watch

Nominator: David Eppstein (talk · contribs)

Reviewer: Czarking0 (talk · contribs) 02:28, 28 March 2024 (UTC)[reply]

Hello and thanks for your contribution. At a glance this seems like a cool article. I will attempt to provide a though and well paced review. I am placing this table below too keep track of the criteria but feel free to suggest another format.

Rate	Attribute	Review Comment
1. Well-written:
	1a. the prose is clear, concise, and understandable to an appropriately broad audience; spelling and grammar are correct.	I am not sure if there is a guideline on this but it seems like this section could really be served by having the mathematical notation written as like a derivation. "By linearity of expectation, the expected number of ancestors of x {\displaystyle x} equals the sum, over other keys y {\displaystyle y}, of the probability that y {\displaystyle y} is an ancestor of x {\displaystyle x}. A key y {\displaystyle y} is an ancestor of x {\displaystyle x} exactly when y {\displaystyle y} is the first key to be inserted from the interval [ x , y ] {\displaystyle [x,y]}. Because each key in the interval is equally likely to be first, this happens with probability inverse to the length of the interval." "Its length, for a tree with n {\displaystyle n} nodes, is almost surely " this seems like a very odd place for an "almost surely ". After reading the page on this I understand. In my professional life where I in part work as a statistician just say that it occurs with probability 1 which I find more clear, but I recognize that this language may be the norm here. "Expected number of leaves" Wanted to highlight how well written this section is "(or extended binary trees with n {\displaystyle n} internal nodes and n + 1 {\displaystyle n+1} external nodes) " I would remove this
	1b. it complies with the Manual of Style guidelines for lead sections, layout, words to watch, fiction, and list incorporation.	Does the stuff about treaps really belong in the lead section? Overall it feels like the lead could be better but I recommend we come back to this point after making the other edits. It may be worthwhile to break out critical Galton-Watson trees into a subsection. Or maybe just a paragraph break before this sentence: "Another application of critical Galton–Watson trees (in the version where the root must be internal) arises in"
2. Verifiable with no original research:
	2a. it contains a list of all references (sources of information), presented in accordance with the layout style guideline.	"However, only the weaker upper bound log 3 ⁡ n + o ( log ⁡ n ) {\displaystyle \log _{3}n+o(\log n)} has actually been proven" I am weary of WP pages claiming that something has never been proven since that is usually hard to verify. Do you have a source for that?
	2b. reliable sources are cited inline. All content that could reasonably be challenged, except for plot summaries and that which summarizes cited content elsewhere in the article, must be cited no later than the end of the paragraph (or line if the content is not in prose).
	2c. it contains no original research.
	2d. it contains no copyright violations or plagiarism.
3. Broad in its coverage:
	3a. it addresses the main aspects of the topic.	"By choosing the priorities to be independent random real numbers in the unit interval" Is it important that the distribution is uniform? You can tell me to read the sources as an answer. Why does the reader want to know about radix trees? What are they used in?
	3b. it stays focused on the topic without going into unnecessary detail (see summary style).
	4. Neutral: it represents viewpoints fairly and without editorial bias, giving due weight to each.
	5. Stable: it does not change significantly from day to day because of an ongoing edit war or content dispute.
6. Illustrated, if possible, by media such as images, video, or audio:
	6a. media are tagged with their copyright statuses, and valid non-free use rationales are provided for non-free content.
	6b. media are relevant to the topic, and have suitable captions.
	7. Overall assessment.

Some replies:

Re 1a: "Derivation": you mean a formal proof? Really? Or am I misunderstanding this point? "Almost surely": should have been with high probability (that is, probability tending to 1 in the limit of large $n$ rather than probability 1 even for fixed $n$ ; fixed. Re "(or extended...)": removed.

Re 1b: There is only one sentence about treaps in the lead. The paragraph it is in is mostly not about treaps. It is intended as a brief summary of the long "from random permutations" section, most of which is motivated by the average-case analysis of insertion-only binary search trees without any rebalancing. Treaps are a trick to get that same analysis to work for worst-case inputs with both insertions and deletions; I think they are worth mentioning in the lead, but not for more than a sentence. Re critical Galton-Watson trees: it's difficult to split out the p=1/2 cases from the p<1/2 and p>1/2 cases in the analysis, but I added a paragraph break before the algorithmic application as suggested, and added subsection headers to the whole section.

Re 2a and "only the weaker upper bound": removed "only" and "actually" to avoid the implication that this is the last word on the subject (although I think it is).

Re 3a: "Is it important that the distribution is uniform?" (for treap priorities) No. It is important that they be independent and (likely) distinct, but uniformity doesn't matter. That's why I omitted it from this part. Some of the later works cited in the last paragraph of this section address the issue that getting computer representations of uniform reals to enough precision to make sure they're all distinct needs a lot of bits of randomness per number (and then all these bits need to be stored in the treap), leading to unnecessary inefficiencies. Instead they use more complicated schemes to save bits, but then they don't get trees whose distribution perfectly matches the uniform insertion order distribution.

Re: "Why does the reader want to know about radix trees?": If the reader cares about probability distributions over random trees, the radix tree gives you a predetermined number of internal nodes, unlike the trie. If the reader cares about data structures, they need to know about both tries and radix trees as both are standard ideas in this area. If the reader cares about algorithms, both tries and radix trees model the behavior of certain binary-representation-based divide-and-conquer sorting algorithms on random inputs, but different algorithms. The trie models a sorting algorithm that divide on the bits of the binary numbers given, in strict succession from most significant to least significant. The radix tree models a sorting algorithm that finds and then divides on the next bit that separates some inputs. My personal interests are in data structures and algorithms, but actually my own motivation for including the section on tries and radix trees in this article was less about algorithms and more from the point of view of probability distributions over random trees: they give a quite different distribution than the other ones described (more strongly balanced, especially in the radix tree case) and I thought that made an interesting contrast to the other distributions.