Reasoning language model
Reasoning language models (RLMs) are large language models that are further trained to solve tasks requiring several steps of reasoning.[1] They tend to perform better than standard LLMs on logic, math, and programming tasks, can revisit and revise earlier steps, and can use additional compute at inference time as a further way to scale performance, alongside the number of training examples, model parameters, and training compute.[2]
History
2024
In September 2024, OpenAI released o1-preview, an LLM with enhanced reasoning.[3] The full version, o1, followed in December 2024. OpenAI also began sharing results on its successor, o3.[4][5][6]
The development of reasoning LLMs has illustrated what Rich Sutton called the "bitter lesson": that scaling compute often outperforms methods that rely on specific human insights.[7] For example, the Generative AI Research Lab (GAIR) explored complex methods such as tree search and reinforcement learning to replicate o1's capabilities. In their "o1 Replication Journey" papers they reported that knowledge distillation (training a smaller model to imitate o1's outputs) worked surprisingly well. This highlighted the effectiveness of distillation in this context.[8][9]
Alibaba released reasoning versions of its Qwen LLMs in November 2024.[10] In December 2024, the team introduced QvQ-72B-Preview, an experimental visual reasoning model.[11]
In December 2024, Google introduced Deep Research in Gemini,[12] a feature that runs multi-step research tasks.[13]
On December 16, 2024, an experiment with a Llama 3B model showed that by scaling test-time compute, a relatively small model could outperform a much larger Llama 70B model on challenging reasoning tasks. This suggested that better inference strategies can unlock useful reasoning capabilities even in small models.[14][15]
2025
In January 2025, DeepSeek released R1, a model with comparable performance to o1 at lower cost. The release demonstrated the effectiveness of Group Relative Policy Optimization (GRPO).[16][17] On January 25, 2025, DeepSeek added a feature to DeepSeek R1 that lets the model search the web while it reasons, making it easier to combine retrieval with reasoning.[18] OpenAI subsequently released o3-mini, followed by Deep Research based on o3.[19] The effectiveness of distillation was shown again by s1-32B, which reached strong performance using budget forcing, a simple test-time scaling method.[20][9]
On February 2, 2025, OpenAI released Deep Research,[21] a tool that integrates reasoning and web search in one workflow so users can run complex research that needs several steps and sources. It is based on o3 and can take from 5 to 30 minutes to generate comprehensive reports.[21]
Supervised finetuning
A large language model (LLM) can be fine-tuned on a dataset of reasoning tasks paired with example solutions and step-by-step (reasoning) traces. The fine-tuned model can then produce its own reasoning traces for new problems.[22][23]
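As an illustration, the sketch below shows one way a (problem, reasoning trace, answer) triple might be serialized into a single training string for supervised finetuning. The `<think>` delimiters and field names are assumptions made here for illustration, not any particular model's format.

```python
# A minimal sketch of formatting one supervised-finetuning example from a reasoning
# task, its step-by-step trace, and the final answer. The delimiters are illustrative.

def format_sft_example(problem: str, trace_steps: list[str], answer: str) -> str:
    """Serialize a (problem, reasoning trace, answer) triple into one training string."""
    trace = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(trace_steps))
    return f"Problem: {problem}\n<think>\n{trace}\n</think>\nAnswer: {answer}"

print(format_sft_example(
    problem="What is 12 * 13?",
    trace_steps=["12 * 13 = 12 * 10 + 12 * 3", "120 + 36 = 156"],
    answer="156",
))
```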
Because human-written traces are costly to collect, researchers have proposed ways to build such datasets automatically. In rejection sampling finetuning (RFT), new reasoning traces are gathered in a loop (a minimal sketch in Python follows the list):[24]
- Sample a task prompt.
- Generate many reasoning traces for the prompt.
- Use a verifier to discard reasoning traces whose final answer is wrong, optionally remove duplicates, and fine-tune the model on the remaining traces.
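The sketch below shows the data-collection part of one RFT round. It assumes hypothetical callables `generate` (samples reasoning traces from the model) and `final_answer` (extracts the final answer from a trace); the collected dataset is then used for finetuning.

```python
from typing import Callable, Iterable

def rft_collect(
    tasks: Iterable[tuple[str, str]],              # (prompt, reference answer) pairs
    generate: Callable[[str, int], list[str]],     # samples n reasoning traces for a prompt
    final_answer: Callable[[str], str],            # extracts the final answer from a trace
    n_samples: int = 16,
) -> list[tuple[str, str]]:
    """Keep only traces whose final answer the verifier accepts, dropping duplicates."""
    dataset = []
    for prompt, reference in tasks:                # 1. sample a task prompt
        seen = set()
        for trace in generate(prompt, n_samples):  # 2. generate many reasoning traces
            # 3. keep the trace only if its answer is correct and it is not a duplicate
            if final_answer(trace) == reference and trace not in seen:
                seen.add(trace)
                dataset.append((prompt, trace))
    return dataset                                 # the model is then fine-tuned on this dataset
```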
Reinforcement learning
A pretrained language model can be further trained with RL. In the RL formalism, a generative language model is a policy $\pi_\theta$. A task prompt is an environmental state $x$, and the model's response is an action $y$. The probability that the model responds to $x$ with $y$ is $\pi_\theta(y \mid x)$.
Training a reasoning language model with RL means constructing a reward model to guide the RL process. Intuitively, the reward $r(x, y)$ says how good a response $y$ is for a prompt $x$. For a reasoning task, the reward is high if the response solves the task and low if it does not.
A response may be broken down into multiple reasoning steps, written $y = (y_1, y_2, \dots, y_n)$.
Most recent systems use policy-gradient methods such as Proximal Policy Optimization (PPO) because PPO constrains each policy update with a clipped objective, which stabilises training for very large policies.[25]
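The sketch below illustrates the clipped surrogate objective at the heart of PPO, computed in plain Python over per-token log-probabilities and advantage estimates. It is a simplified illustration of the published objective, not any system's actual training code.

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss (to be minimized) over a batch of token-level terms.

    logp_new / logp_old: log-probabilities of the sampled tokens under the current
    policy and the policy that collected the data; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                       # pi_new(token) / pi_old(token)
        clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        total += min(ratio * adv, clipped_ratio * adv)          # pessimistic (clipped) objective
    return -total / len(advantages)                             # negate: maximize objective = minimize loss

# Tiny example batch of three tokens
print(ppo_clipped_loss([-1.0, -0.5, -2.0], [-1.1, -0.7, -1.9], [0.5, -0.2, 1.0]))
```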
Outcome reward model
An outcome reward model, or outcome-supervised RM (ORM),[22] gives the reward for a step based on the final answer: $r(x, y_1, \dots, y_i) = r(x, y)$, i.e. the reward of every step is determined by the final answer of the completed response $y$. Such models are often called "verifiers".
For tasks with answers that are easy to verify, such as math word problems, the outcome reward can be binary: 1 if the final answer is correct, 0 otherwise.[22] If automatic verification is hard, humans can label answers as correct or not, and those labels can be used to finetune a base model that predicts the human label.[23] For tasks like creative writing, where quality is not simply true or false, one can train a reward model on human-ranked preference data, as in reinforcement learning from human feedback.[26] A base model can also be fine-tuned to predict, from a partial thinking trace $(x, y_1, \dots, y_i)$, whether the final answer will be correct, and this prediction can serve as a binary reward.[22]
The ORM is usually trained with logistic regression, i.e. by minimizing cross-entropy loss.[27]
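A minimal sketch of this training objective: the ORM produces a scalar score for each (prompt, response) pair, the score is squashed through a sigmoid, and the binary cross-entropy against the verified correctness label is minimized. The function below is illustrative, not a specific system's implementation.

```python
import math

def orm_cross_entropy(scores, labels):
    """Binary cross-entropy (logistic regression) loss for an outcome reward model.

    scores: raw scalar outputs of the reward model for each (prompt, response) pair;
    labels: 1 if the final answer was verified correct, 0 otherwise.
    """
    loss = 0.0
    for score, label in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))   # predicted probability that the answer is correct
        loss += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return loss / len(labels)

# Example: one response verified correct, one wrong
print(orm_cross_entropy(scores=[2.3, -0.4], labels=[1, 0]))
```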
Given a PRM, an ORM can be constructed by multiplying the process rewards along the reasoning trace,[26] by taking their minimum,[27] or by other ways of aggregating process rewards. DeepSeek used a simple ORM to train the R1 model.[17]
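As a small illustration of these aggregation rules, assuming per-step process rewards in the range [0, 1]:

```python
import math

def orm_from_prm(step_rewards, mode="product"):
    """Aggregate per-step process rewards into a single outcome-style reward."""
    if mode == "product":
        return math.prod(step_rewards)   # trace reward = product of step rewards
    if mode == "min":
        return min(step_rewards)         # trace reward = weakest step
    raise ValueError(f"unknown mode: {mode}")

steps = [0.9, 0.8, 0.95]
print(orm_from_prm(steps, "product"), orm_from_prm(steps, "min"))
```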
Process reward model
A process reward model, or process-supervised RM (PRM),[22] gives the reward for a step based only on the steps so far: $r(x, y_1, \dots, y_i)$.
Given a partial thinking trace $(x, y_1, \dots, y_i)$, a human can judge whether the steps so far are correct, without looking at the final answer. This yields a binary reward. Because human labels are costly, a base model can be fine-tuned to predict them.[22] The PRM is usually trained with logistic regression on the human labels, i.e. by minimizing the cross-entropy loss between true and predicted labels.[27]
As an example, a 2023 OpenAI paper collected 800K process labels for 75K thinking traces. A labeler saw a trace and marked each step as "positive" if it moved toward a solution, "neutral" if it was not wrong but did not help, and "negative" if it was a mistake. After the first "negative" label, the labeler stopped on that trace and moved to another. The authors argued that labeling up to the first error was enough to train a capable PRM, even though labeling later steps could give richer signals.[26][28]
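The sketch below shows one simple way such step labels could be turned into binary training targets for a PRM, stopping at the first error. Treating "neutral" steps as correct is an assumption made here for illustration, not necessarily the paper's exact scheme.

```python
def prm_targets(step_labels):
    """Convert human step labels ('positive', 'neutral', 'negative') into binary targets,
    labeling steps only up to and including the first error.

    Assumption for illustration: 'positive' and 'neutral' steps count as correct (1),
    the first 'negative' step counts as incorrect (0), and later steps stay unlabeled.
    """
    targets = []
    for label in step_labels:
        if label == "negative":
            targets.append(0)
            break                 # labeling stops at the first mistake
        targets.append(1)
    return targets

print(prm_targets(["positive", "neutral", "negative", "positive"]))  # -> [1, 1, 0]
```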
To avoid human labels, researchers have proposed methods to create PRMs without human labels on the processes. Inspired by Monte Carlo tree search (MCTS), the Math-Shepherd method samples multiple continuations to a final answer, starting at each reasoning step $y_i$, and sets the reward at that step to $r_i = \frac{\#\{\text{correct continuations}\}}{\#\{\text{continuations}\}}$ in the case of "soft estimation", or to $r_i = 1$ if any continuation reaches a correct final answer and $r_i = 0$ otherwise in the case of "hard estimation". This creates process rewards from an ORM, which is often easier or cheaper to construct. A PRM can then be trained on these labels.[27] Some work has tried a fully MCTS approach.[29]
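A minimal sketch of this Monte Carlo estimation, assuming hypothetical callables `rollout` (samples continuations from a partial trace to a final answer) and `is_correct` (checks the final answer of a completed continuation):

```python
from typing import Callable

def mc_step_reward(
    prefix_steps: list[str],
    rollout: Callable[[list[str], int], list[str]],  # samples n continuations to a final answer
    is_correct: Callable[[str], bool],               # verifies the final answer of a continuation
    n_rollouts: int = 8,
    mode: str = "soft",
) -> float:
    """Monte Carlo process reward for the last step of `prefix_steps`.

    soft estimation: fraction of continuations that reach a correct final answer;
    hard estimation: 1 if any continuation reaches a correct answer, else 0.
    """
    continuations = rollout(prefix_steps, n_rollouts)
    hits = sum(1 for c in continuations if is_correct(c))
    return hits / n_rollouts if mode == "soft" else float(hits > 0)
```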
One can also use an ORM to implicitly construct a PRM, similar to direct preference optimization.[30]
Guided sampling
A trained ORM can be used to pick the best response. The policy generates several responses, and the ORM selects the best one. This implements a simple form of test-time compute scaling ("best-of-N").[23][31]
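A minimal best-of-N sketch, assuming hypothetical callables `generate` (samples responses from the policy) and `orm_score` (the trained verifier):

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str, int], list[str]],  # samples n responses from the policy
    orm_score: Callable[[str, str], float],     # ORM score for a (prompt, response) pair
    n: int = 16,
) -> str:
    """Generate n responses and return the one the ORM scores highest."""
    responses = generate(prompt, n)
    return max(responses, key=lambda r: orm_score(prompt, r))
```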
A trained PRM can guide reasoning by a greedy tree search: the policy proposes several next steps, the PRM picks one, and the process repeats. This mirrors using an ORM to pick a whole response.[32] Beam search, which keeps several of the highest-scoring partial traces at each step instead of only one, performs better than greedy search.
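The sketch below shows PRM-guided step-level beam search, of which greedy search is the special case beam_width = 1. The callables `propose`, `prm_score`, and `is_finished` are assumed helpers, not a specific library's API.

```python
from typing import Callable

def prm_beam_search(
    prompt: str,
    propose: Callable[[str, list[str], int], list[str]],  # proposes candidate next steps
    prm_score: Callable[[str, list[str]], float],         # scores a partial trace
    is_finished: Callable[[list[str]], bool],             # detects a completed solution
    beam_width: int = 4,
    expansions: int = 4,
    max_steps: int = 20,
) -> list[str]:
    """Step-level beam search guided by a PRM; beam_width=1 gives greedy search."""
    beams = [[]]                                           # each beam is a partial list of steps
    for _ in range(max_steps):
        candidates = []
        for steps in beams:
            if is_finished(steps):
                candidates.append(steps)                   # keep finished traces unchanged
            else:
                for step in propose(prompt, steps, expansions):
                    candidates.append(steps + [step])      # expand with each proposed next step
        # keep the beam_width partial traces the PRM scores highest
        beams = sorted(candidates, key=lambda s: prm_score(prompt, s), reverse=True)[:beam_width]
        if all(is_finished(s) for s in beams):
            break
    return beams[0]                                        # highest-scoring trace
```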
Lookahead search is another tree search method. The policy proposes several next steps, then makes a short rollout for each. If a solution is found during rollout, the search stops early. Otherwise, the PRM scores each rollout, and the step with the highest score is chosen.[15]
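A sketch of choosing one next step by lookahead, with the same kind of assumed helpers as above plus `rollout` (extends a partial trace a few steps) and `solved` (checks whether a rollout reached a solution):

```python
from typing import Callable, Optional

def lookahead_choose_step(
    prompt: str,
    steps: list[str],
    propose: Callable[[str, list[str], int], list[str]],  # proposes candidate next steps
    rollout: Callable[[str, list[str], int], list[str]],  # extends a partial trace a few steps
    prm_score: Callable[[str, list[str]], float],         # scores a partial trace
    solved: Callable[[list[str]], bool],                   # checks whether a rollout found a solution
    n_candidates: int = 4,
    rollout_depth: int = 3,
) -> Optional[str]:
    """Score a short rollout of each candidate next step with the PRM and pick the best."""
    best_step, best_score = None, float("-inf")
    for step in propose(prompt, steps, n_candidates):
        trial = steps + [step] + rollout(prompt, steps + [step], rollout_depth)
        if solved(trial):
            return step                                    # a rollout reached a solution: stop early
        score = prm_score(prompt, trial)
        if score > best_score:
            best_step, best_score = step, score
    return best_step
```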
Self-consistency can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster has the same final answer. The ORM scores each answer, scores in each cluster are summed, and the answer from the highest-scoring cluster is returned.[27]
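A minimal sketch of ORM-weighted self-consistency, assuming helpers `final_answer` and `orm_score` as above:

```python
from collections import defaultdict
from typing import Callable

def self_consistency_with_orm(
    prompt: str,
    responses: list[str],
    final_answer: Callable[[str], str],      # extracts the final answer from a response
    orm_score: Callable[[str, str], float],  # ORM score for a (prompt, response) pair
) -> str:
    """Cluster responses by final answer, sum ORM scores per cluster, return the best answer."""
    cluster_scores = defaultdict(float)
    for response in responses:
        cluster_scores[final_answer(response)] += orm_score(prompt, response)
    return max(cluster_scores, key=cluster_scores.get)
```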
Benchmarks
Reasoning models generally score higher than non-reasoning models on many benchmarks, especially on tasks requiring multi-step reasoning.
Some benchmarks exclude reasoning models because their responses take longer and cost more.[33][34][35][36]
Humanity's Last Exam
The HLE benchmark tests expert-level reasoning across mathematics, the humanities, and the natural sciences, and shows large performance gaps between models. State-of-the-art reasoning models still score low on HLE, leaving substantial room for improvement. For example, the full reasoning model o3 reached 26.6%,[21] while the lighter o3-mini-high (on text-only questions) reached 13%.[37]
AIME
On the American Invitational Mathematics Examination (AIME), a difficult math competition, non-reasoning models usually solve under 30% of problems. Models that use reasoning methods score between 50% and 80%.[2][17][20] While OpenAI's o1 maintained or slightly improved its accuracy from reported 2024 results to 2025 AIME results, o3-mini (high) reached a higher accuracy (80%) at a much lower cost (about 12 times cheaper).[38]
o3-mini performance
According to OpenAI's January 2025 report on o3-mini, adjusting "reasoning effort" significantly affects performance, especially for STEM tasks. Moving from low to high reasoning effort raises accuracy on AIME 2024, GPQA Diamond, and Codeforces, typically by 10–30%. With high effort, o3-mini (high) achieved 87.3% on AIME (different from the MathArena AIME benchmark), 79.7% on GPQA Diamond, 2130 Elo on Codeforces, and 49.3 on SWE-bench Verified.[38]
Drawbacks
Computational cost
Reasoning models often need far more compute while answering than non-reasoning models. On AIME, they were 10 to 74 times more expensive[26] than non-reasoning counterparts.
Generation time
Reasoning increases response time, with current models taking from a few seconds to several minutes to answer. As depth of reasoning grows, future models may need even longer.
Models
- DeepSeek R1 (based on DeepSeek V3)
- DeepSeek R1-Lite-Preview (test version based on V2.5)
- Alibaba's QvQ-72B-Preview, an experimental visual reasoning model launched on December 24, 2024, which integrates image understanding with verbal chain-of-thought reasoning.
- Alibaba's QwQ-32B-Preview, an experimental text-based reasoning model released in late November 2024 that emphasizes complex, step-by-step analysis.
- Anthropic's Claude 3.7 Sonnet, which has an adjustable budget of 'thinking' tokens.
- Mistral AI's Magistral (Medium and Small)
References
- ^ Besta, Maciej; Barth, Julia; Schreiber, Eric; Kubicek, Ales; Catarino, Afonso; Gerstenberger, Robert; Nyczyk, Piotr; Iff, Patrick; Li, Yueling (2025-01-23). "Reasoning Language Models: A Blueprint". arXiv:2501.11223 [cs.CL].
- ^ a b "Learning to reason with LLMs". OpenAI. 2024-09-12. Retrieved 2025-07-26.
- ^ Edwards, Benj (2024-09-12). "OpenAI's new "reasoning" AI models are here: o1-preview and o1-mini". Ars Technica. Retrieved 2025-02-06.
- ^ "OpenAI o1 System Card" (PDF). OpenAI. 2024-12-05. Retrieved 2025-07-26.
- ^ Robison, Kylie (2024-12-05). "OpenAI launches ChatGPT Pro, a $200/month plan with unlimited access to o1, GPT-4o, and more". The Verge. Retrieved 2025-07-26.
- ^ Singh, Jaspreet (2024-12-20). "OpenAI unveils 'o3' model, touting advances in reasoning". Reuters. Retrieved 2025-07-26.
- ^ Sutton, Richard S. "The Bitter Lesson". Incomplete Ideas. Retrieved 2025-02-27.
- ^ Huang, Zhen; Zou, Haoyang; Li, Xuefeng; Liu, Yixiu; Zheng, Yuxiang; Chern, Ethan; Xia, Shijie; Qin, Yiwei; Yuan, Weizhe (2024-11-25). "O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?". arXiv:2411.16489 [cs.CL].
- ^ a b Zeff, Maxwell (2025-02-05). "Researchers created an open rival to OpenAI's o1 'reasoning' model for under $50". TechCrunch. Retrieved 2025-07-26.
- ^ "QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown". Qwen (Alibaba Cloud). 2024-11-28. Retrieved 2025-07-26.
- ^ "QVQ: To See the World with Wisdom". Qwen. Alibaba Cloud. 2024-12-25. Retrieved 2025-07-26.
- ^ "Try Deep Research and our new experimental model in Gemini, your AI assistant". Google. 2024-12-11. Retrieved 2025-02-05.
- ^ Roth, Emma (2024-12-11). "Google built an AI tool that can do research for you". The Verge. Retrieved 2025-07-26.
- ^ "Scaling test-time compute". Hugging Face. 2024-12-16. Retrieved 2025-07-26.
- ^ a b Snell, Charlie; Lee, Jaehoon; Xu, Kelvin; Kumar, Aviral (2025). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters". International Conference on Learning Representations (ICLR 2025). arXiv:2408.03314. Retrieved 2025-07-26.
- ^ Orland, Kyle (2025-01-28). "How does DeepSeek R1 really fare against OpenAI's best reasoning models?". Ars Technica. Retrieved 2025-02-06.
- ^ a b c DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv:2501.12948 [cs.CL].
- ^ DeepSeek 支持“深度思考+联网检索”能力 [DeepSeek adds a search feature supporting simultaneous deep thinking and web search]. People’s Daily Online (in Chinese). 2025-01-29. Retrieved 2025-07-26.
- ^ Milmo, Dan (2025-02-03). "OpenAI launches 'deep research' tool that it says can match research analyst". The Guardian. ISSN 0261-3077. Retrieved 2025-03-16.
- ^ a b Muennighoff, Niklas; Yang, Zitong; Shi, Weijia; Li, Xiang Lisa; Fei-Fei, Li; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Liang, Percy; Candès, Emmanuel (2025-02-03). "s1: Simple test-time scaling". arXiv:2501.19393 [cs.CL].
- ^ a b c "Introducing deep research". OpenAI. 2025-02-02. Retrieved 2025-02-05.
- ^ a b c d e f Uesato, Jonathan; Kushman, Nate; Kumar, Ramana; Song, Francis; Siegel, Noah; Wang, Lisa; Creswell, Antonia; Irving, Geoffrey; Higgins, Irina (2022-11-25). "Solving math word problems with process- and outcome-based feedback". arXiv:2211.14275 [cs.LG].
- ^ a b c Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18). "Training Verifiers to Solve Math Word Problems". arXiv:2110.14168 [cs.LG].
- ^ Yuan, Zheng; Yuan, Hongyi; Li, Chengpeng; Dong, Guanting; Lu, Keming; Tan, Chuanqi; Zhou, Chang; Zhou, Jingren (2023-09-13). "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models". arXiv:2308.01825 [cs.CL].
- ^ "Aligning language models to follow instructions". OpenAI Blog. 2022-01-27. Retrieved 2025-05-04.
- ^ a b c d Lightman, Hunter; Kosaraju, Vineet; Burda, Yura; Edwards, Harri; Baker, Bowen; Lee, Teddy; Leike, Jan; Schulman, John; Sutskever, Ilya (2024). "Let's Verify Step by Step". International Conference on Learning Representations (ICLR 2024). arXiv:2305.20050. Retrieved 2025-07-26.
- ^ a b c d e Wang, Peiyi; Li, Lei; Shao, Zhihong; Xu, Runxin; Dai, Damai; Li, Yifei; Chen, Deli; Wu, Yu; Sui, Zhifang (August 2024). Ku, Lun-Wei; Martins, Andre; Srikumar, Vivek (eds.). "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics: 9426–9439. arXiv:2312.08935. doi:10.18653/v1/2024.acl-long.510.
- ^ "prm800k". GitHub. OpenAI. 2025-01-27. Retrieved 2025-01-27.
- ^ Chen, Guoxin; Liao, Minpeng; Li, Chengxi; Fan, Kai (2024-09-27). "AlphaMath Almost Zero: Process Supervision without Process". arXiv:2405.03553 [cs.LG].
- ^ Yuan, Lifan; Li, Wendi; Chen, Huayu; Cui, Ganqu; Ding, Ning; Zhang, Kaiyan; Zhou, Bowen; Liu, Zhiyuan; Peng, Hao (2024-12-02). "Free Process Rewards without Process Labels". arXiv:2412.01981 [cs.CL].
- ^ Zhang, Di; Wu, Jianbo; Lei, Jingdi; Che, Tong; Li, Jiatong; Xie, Tong; Huang, Xiaoshui; Zhang, Shufei; Pavone, Marco (2024-11-21). "LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning". arXiv:2410.02884 [cs.CL].
- ^ Ma, Qianli; Zhou, Haotian; Liu, Tingkai; Yuan, Jianbo; Liu, Pengfei; You, Yang; Yang, Hongxia (2023-10-16). "Let's reward step by step: Step-Level reward model as the Navigators for Reasoning". arXiv:2310.10080 [cs.CL].
- ^ Huang, Yuting; Zois, Christos; Wang, Yue; Zhang, Yue; Mavromatis, Christos; Zeng, Jiachen; Yin, Shihao; Voulkidis, Antonios; Shepard, Daniel (2025). "Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study". Proceedings of the 26th International Conference on Information Processing in Sensor Networks (IPSN ’25). ACM.
Although we did not evaluate o1 and o3 models … their high cost and inference time make them impractical for online CED, which requires frequent, low-latency API requests.
- ^ Hu, Zihao; Wang, Yuqing; Sun, Rui; Lu, Haoran; Gong, Qian; Wang, Jinshuai; Gong, Yunlong; Huang, Yiming; He, Peng (2025-02-13). "Inference-Time Compute: More Faithful? A Research Note". arXiv:2502.09673 [cs.CL].
we were unable to evaluate O1 and R1 …
- ^ Chen, Guoliang; Zhu, Zhiyao; Meng, Qinxiang; Liang, Weilin; Ji, Zijie; Liu, Jiangning; Zeng, Jie (2025-03-07). "RealBench: Evaluating LLMs as Verilog Engineers". arXiv:2503.04914 [cs.AI].
For O1-preview, we sample only once due to high cost.
- ^ Gupta, Arpit; Schapira, Michael; Gill, Phillipa; Seetharaman, Srinivasan (2025-01-30). "On the Feasibility of Using LLMs to Execute Multistage Network Attacks". arXiv:2501.16466 [cs.CR].
We were unable to evaluate o1 … the public API has a safeguard that prevents o1 from executing attacks.
- ^ "Humanity's Last Exam leaderboard". Safe.ai. Center for AI Safety. Retrieved 2025-07-26.
- ^ a b "OpenAI o3-mini". OpenAI. 2025-01-31. Retrieved 2025-02-09.
- ^ "Open-R1: a fully open reproduction of DeepSeek-R1". Hugging Face. 2025-02-24. Retrieved 2025-07-26.
- ^ "OlympicCoder-7B". Hugging Face. 2025-03-11. Retrieved 2025-07-26.
External links
- Fortes, Armando (2025-01-27), atfortes/Awesome-LLM-Reasoning, retrieved 2025-01-27
- Huang, Jie; Chang, Kevin Chen-Chuan (2023-05-26), Towards Reasoning in Large Language Models: A Survey, arXiv:2212.10403
- Besta, Maciej; Barth, Julia; Schreiber, Eric; Kubicek, Ales; Catarino, Afonso; Gerstenberger, Robert; Nyczyk, Piotr; Iff, Patrick; Li, Yueling (2025-01-23), Reasoning Language Models: A Blueprint, arXiv:2501.11223