Talk:Language model benchmark
This article is rated B-class on Wikipedia's content assessment scale. It is of interest to several WikiProjects.
comment
If some of the benchmarks look weirdly obscure, apologies. My criterion is simple: if a frontier model is advertised by showing how good it is on *this* or *that* benchmark, then I put that benchmark in. For example, today I put in "Vibe-Eval", not because it is particularly interesting (I think it is not), but simply because the latest Google Gemini 2.5 (2025-06-05) advertised its ability on Vibe-Eval, so I had to put it in. pony in a strange land (talk) 20:58, 7 June 2025 (UTC)
MLCommons
A coherent text on the subject, published in 2024 and added by me to Sources, describes something called MLCommons as the only benchmark standardization game in town. I do not know anything about it, but here is a question to those who do: is the omission of MLCommons from this long list deliberate (e.g., due to the bias of the source I have used) or accidental? Викидим (talk) 19:37, 12 September 2025 (UTC)