Talk:Language model benchmark
This article is rated B-class on Wikipedia's content assessment scale. It is of interest to several WikiProjects.
comment
If some of the benchmarks look weirdly obscure, apologies. My criterion is simple: if a frontier model is advertised by showing how good it is on *this* or *that* benchmark, then I put that benchmark in. For example, today I put in "Vibe-Eval", not because it is particularly interesting (I think it is not), but simply because the latest Google Gemini 2.5 (2025-06-05) advertised its ability on Vibe-Eval, so I had to put it in. pony in a strange land (talk) 20:58, 7 June 2025 (UTC)
MLCommons
A coherent text on the subject, published in 2024 and added by me to Sources, describes something called MLCommons as the only benchmark standardization game in town. I do not know anything about it, but here is a question to those who do: is the omission of MLCommons from this long list deliberate (e.g., due to the bias of the source I have used) or accidental? Викидим (talk) 19:37, 12 September 2025 (UTC)