Language model benchmark


Language model benchmarks are standardized tests designed to evaluate the performance of language models on various natural language processing tasks. They are used to compare different models' capabilities in areas such as language understanding, generation, and reasoning.

Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like question answering, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field.
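
As a minimal illustration of this dataset-plus-metric structure, the sketch below scores a stand-in "model" on two hypothetical question–answer pairs using exact-match accuracy. The data, the model_answer function, and the choice of metric are illustrative assumptions, not drawn from any particular benchmark.

    # Illustrative only: a tiny "benchmark" as a dataset plus an evaluation metric.
    dataset = [
        {"question": "What is 2 + 2?", "reference": "4"},
        {"question": "What is the capital of France?", "reference": "Paris"},
    ]

    def model_answer(question: str) -> str:
        # Stand-in for querying a language model.
        return "4" if "2 + 2" in question else "Paris"

    def exact_match_accuracy(data, answer_fn) -> float:
        # Fraction of items whose predicted answer exactly matches the reference.
        correct = sum(answer_fn(item["question"]).strip() == item["reference"].strip()
                      for item in data)
        return correct / len(data)

    print(exact_match_accuracy(dataset, model_answer))  # 1.0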

Reasoning benchmarks

  • GSM8K (Grade School Math): 8.5K linguistically diverse elementary school math word problems that require 2 to 8 basic arithmetic operations to solve (see the grading sketch after this list).[1]
  • MMLU (Measuring Massive Multitask Language Understanding): 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine.[2]
  • MATH: 12,500 competition-level math problems.[3]
  • MathEval: An omnibus benchmark combining 20 other benchmarks, such as GSM8K, MATH, and the math subsection of MMLU, with over 20,000 math problems ranging in difficulty from elementary school to high school competition level.[4]
  • GPQA (Google-Proof Q&A): 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to require PhD-level expertise to solve.[5]
  • HumanEval: Programming problems where the solution is always a Python function, often just a few lines long, judged by whether it passes the task's unit tests (see the sketch after this list).[6]
  • SWE-bench: 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase and an issue, the task is to edit the codebase to solve the issue.[7]
  • ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence): Visual puzzles over small colored grids in which the solver must infer a transformation rule from a few input–output examples, similar in spirit to a Raven's Progressive Matrices test.[8]
  • LiveBench: A benchmark with new questions released monthly to limit training-data contamination, including high school math competition questions, competitive coding questions, logic puzzles, and other tasks.[9]
  • FrontierMath: Questions from areas of modern research mathematics that are difficult even for professional mathematicians to solve. Each question has an integer answer.[10]
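
For math word-problem benchmarks such as GSM8K, scoring typically reduces to extracting a final numeric answer from the model's free-form solution and comparing it with the reference. The sketch below assumes GSM8K's convention of ending each reference solution with "#### <answer>"; the regular expression used to read the model's final number is an illustrative heuristic, not part of the benchmark itself.

    import re

    def reference_answer(solution: str) -> str:
        # GSM8K reference solutions end with a line of the form "#### <answer>".
        return solution.split("####")[-1].strip().replace(",", "")

    def model_final_number(output: str):
        # Illustrative heuristic: take the last number in the model's output.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
        return numbers[-1] if numbers else None

    reference = "She sold 48 / 2 = 24 clips in May.\nIn total she sold 48 + 24 = 72 clips.\n#### 72"
    model_output = "48 clips in April plus 24 in May gives 72 clips in total."

    print(model_final_number(model_output) == reference_answer(reference))  # True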
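
Code benchmarks such as HumanEval are scored by functional correctness rather than textual similarity: a completion counts as solved only if the completed function passes the task's unit tests. The toy prompt, completion, and check function below are stand-ins for that setup, not items from the actual HumanEval dataset.

    # Illustrative only: a HumanEval-style prompt, model completion, and unit tests.
    prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
    completion = "    return a + b\n"   # hypothetical model output

    tests = (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    )

    namespace = {}
    exec(prompt + completion, namespace)  # define the completed function
    exec(tests, namespace)                # define the test harness
    try:
        namespace["check"](namespace["add"])
        print("pass")
    except AssertionError:
        print("fail")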

See also

References

  1. ^ Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv:2110.14168
  2. ^ Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021-01-12), Measuring Massive Multitask Language Understanding, arXiv:2009.03300
  3. ^ Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob (2021-11-08), Measuring Mathematical Problem Solving With the MATH Dataset, arXiv:2103.03874
  4. ^ math-eval (2025-01-26), math-eval/MathEval, retrieved 2025-01-27
  5. ^ Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R. (2023-11-20), GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv:2311.12022
  6. ^ Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv:2107.03374
  7. ^ Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik (2024-11-11), SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv:2310.06770
  8. ^ "ARC Prize". ARC Prize. Retrieved 2025-01-27.
  9. ^ "LiveBench". livebench.ai. Retrieved 2025-01-27.
  10. ^ Glazer, Elliot; Erdil, Ege; Besiroglu, Tamay; Chicharro, Diego; Chen, Evan; Gunning, Alex; Olsson, Caroline Falkman; Denain, Jean-Stanislas; Ho, Anson (2024-12-20), FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, arXiv:2411.04872