Top-p sampling
Top-p sampling, also called nucleus sampling, is a technique for language model decoding introduced by Holtzman et al. in 2019.[1] Naively selecting the highest-probability token at each step of auto-regressive decoding is known to produce text that is repetitive and otherwise unnatural. Top-p sampling avoids this by setting a threshold p and restricting sampling to the smallest set of most probable tokens whose cumulative probability reaches p; the probabilities within this "nucleus" are renormalized and the next token is drawn from it.
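A minimal sketch of the procedure in Python/NumPy, assuming a vector of next-token logits (the function name `top_p_sample` and the default `p=0.9` are illustrative, not taken from the cited paper):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Sample a token index from the nucleus (top-p set) of a logit vector."""
    rng = rng or np.random.default_rng()
    # Softmax with the max subtracted for numerical stability.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Sort token indices by descending probability.
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    # The nucleus is the smallest prefix whose cumulative probability reaches p.
    cutoff = int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1
    nucleus, nucleus_probs = order[:cutoff], sorted_probs[:cutoff]
    # Renormalize within the nucleus and draw the next token from it.
    return int(rng.choice(nucleus, p=nucleus_probs / nucleus_probs.sum()))

# Example: with these logits the first two tokens carry most of the mass,
# so at p=0.9 the tail tokens are rarely (or never) sampled.
print(top_p_sample(np.array([2.0, 1.5, 0.2, 0.1]), p=0.9))
```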
Top-k sampling is similar, except that the sample is drawn from the k highest-probability tokens regardless of their cumulative probability. The advantage of top-p sampling is that it avoids the difficult problem of choosing an optimal value of k, which can vary depending on the shape of the output distribution and on the particular task and dataset.[2]
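For contrast, a top-k variant under the same assumptions differs only in how the candidate set is cut off: by a fixed count rather than by cumulative probability.

```python
import numpy as np

def top_k_sample(logits, k=50, rng=None):
    """Sample a token index from the k highest-probability tokens."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Keep a fixed number of candidates, regardless of how much
    # probability mass they actually cover.
    top = np.argsort(probs)[::-1][:k]
    # Renormalize over the top-k set and sample from it.
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))
```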
The top-p sampling technique is used in popular large language model applications like ChatGPT and is implemented in language modeling libraries and APIs such as Hugging Face Transformers and Cohere.[3]
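In the Hugging Face Transformers generation API referenced above, nucleus sampling is exposed through the `top_p` parameter of `generate`. A brief sketch; the `gpt2` checkpoint and the prompt are arbitrary illustrative choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The curious case of", return_tensors="pt")
# do_sample=True switches from greedy decoding to sampling; top_p=0.92
# restricts sampling to the nucleus, and top_k=0 disables top-k filtering.
outputs = model.generate(**inputs, do_sample=True, top_p=0.92,
                         top_k=0, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```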
- ^ Holtzman, Ari; Buys, Jan; Du, Li; Forbes, Maxwell; Choi, Yejin (22 April 2019). "The Curious Case of Neural Text Degeneration". Retrieved 23 August 2023.
- ^ McCaffrey, James D. "Nucleus Sampling for Natural Language Processing". Retrieved 23 August 2023.
- ^ von Platen, Patrick. "How to generate text: using different decoding methods for language generation with Transformers". Hugging Face. Retrieved 23 August 2023.