Talk:Language model benchmark

comment

If some of the benchmarks look weirdly obscure, then apologies. My criterion is simple: if a frontier model is advertised by showing how good it is on *this* or *that* benchmark, then I put that benchmark in. For example, today I put in "Vibe-Eval", not because it is particularly interesting (I think it is not), but simply because the latest Google Gemini 2.5 (2025-06-05) advertised its ability on Vibe-Eval, so I had to put it in. pony in a strange land (talk) 20:58, 7 June 2025 (UTC)