Reasoning language model


Reasoning language models are artificial intelligence systems that combine natural language processing with structured reasoning capabilities. They are usually constructed by applying prompting, supervised finetuning (SFT), and reinforcement learning (RL) to pretrained language models.

Prompting

A language model is a generative model of a training dataset of texts. Prompting means constructing a text prompt such that, conditional on the prompt, the language model generates a solution to the task. Prompting can be applied to a pretrained model ("base model"), or to a base model that has undergone SFT, RL, or both.[1]
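
The following sketch illustrates the idea using the Hugging Face Transformers library; the choice of model and the prompt template are illustrative rather than taken from the cited sources.

```python
from transformers import pipeline  # Hugging Face Transformers

# Load a small pretrained ("base") language model; the model choice is illustrative.
generator = pipeline("text-generation", model="gpt2")

def solve_by_prompting(task: str) -> str:
    # The prompt is constructed so that, conditional on it, the model's
    # likely continuation is a solution to the task.
    prompt = f"Question: {task}\nAnswer:"
    output = generator(prompt, max_new_tokens=64, do_sample=False)
    # The pipeline returns the prompt plus continuation; keep only the continuation.
    return output[0]["generated_text"][len(prompt):]
```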

Chain of thought

Chain of thought prompting (CoT) prompts the model to answer a question by first generating a "chain of thought", i.e. steps of reasoning that mimic a train of thought.[2] It was published in 2022 by the Brain team of Google on the PaLM-540B model.[3] In CoT prompting, the prompt is of the form "<Input> Let's think step by step", and the model responds with a chain of reasoning steps that ends with an answer.

Tree of Thoughts (ToT) prompting generalizes CoT by prompting the model to generate one or more "possible next steps", and then running the model on each of the possible next steps by breadth-first, beam, or some other method of tree search.[4] Similarly, Graph of Thoughts generalizes CoT so that the reasoning steps form a directed acyclic graph.[5]
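
A minimal sketch of chain-of-thought and tree-of-thought prompting is given below; `generate` and `score_trace` are hypothetical placeholders for a language model call and a heuristic that scores partial reasoning traces, and are not taken from the cited papers.

```python
def generate(prompt: str) -> str:
    """Placeholder: sample a continuation from a language model."""
    raise NotImplementedError("Replace with a real language model call.")

def score_trace(trace: str) -> float:
    """Placeholder: heuristic or model-based score for a partial reasoning trace."""
    raise NotImplementedError("Replace with a trace-evaluation method.")

def chain_of_thought(question: str) -> str:
    # A cue such as "Let's think step by step" elicits reasoning steps
    # followed by a final answer.
    return generate(f"Q: {question}\nA: Let's think step by step.")

def tree_of_thought(question: str, branching: int = 3,
                    beam_width: int = 3, depth: int = 4) -> str:
    # Breadth-first search over partial chains of thought: each trace is
    # extended by several candidate next steps, and only the best-scoring
    # traces are kept at each level.
    frontier = [""]
    for _ in range(depth):
        candidates = []
        for trace in frontier:
            for _ in range(branching):
                step = generate(f"Q: {question}\nSteps so far:\n{trace}\nPossible next step:")
                candidates.append(trace + "\n" + step)
        frontier = sorted(candidates, key=score_trace, reverse=True)[:beam_width]
    return max(frontier, key=score_trace)
```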

Self-consistency decoding performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts.[6] If the rollouts disagree significantly, a human can be queried for the correct chain of thought.[7]
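
A minimal sketch of self-consistency decoding, assuming hypothetical `sample_chain_of_thought` and `extract_answer` helpers:

```python
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    """Placeholder: sample one chain-of-thought rollout from the model."""
    raise NotImplementedError("Replace with a real language model call.")

def extract_answer(rollout: str) -> str:
    """Placeholder: parse the final answer out of a rollout."""
    raise NotImplementedError("Replace with an answer-extraction rule.")

def self_consistency(question: str, n_rollouts: int = 20) -> str:
    # Sample several independent chains of thought and return the answer
    # reached most often across the rollouts.
    answers = [extract_answer(sample_chain_of_thought(question))
               for _ in range(n_rollouts)]
    answer, _count = Counter(answers).most_common(1)[0]
    # If agreement is weak (low _count relative to n_rollouts), the question
    # could instead be referred to a human for the correct chain of thought.
    return answer
```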

Retrieval-augmented generation

A language model may answer a query by first using the query to retrieve relevant documents from a database. The document retrieval can be via a vector database, summary index, tree index, or keyword table index.[8] Following document retrieval, the LLM generates an output that incorporates information from both the query and the retrieved documents.[9]
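
The following sketch shows retrieval over a vector index by cosine similarity; `embed` and `generate` are hypothetical placeholders for an embedding model and a language model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: map text to an embedding vector."""
    raise NotImplementedError("Replace with a real embedding model.")

def generate(prompt: str) -> str:
    """Placeholder: language model call."""
    raise NotImplementedError("Replace with a real language model call.")

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    # Rank documents by cosine similarity between query and document embeddings.
    q = embed(query)
    def cosine(doc: str) -> float:
        v = embed(doc)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(documents, key=cosine, reverse=True)[:k]

def retrieval_augmented_generation(query: str, documents: list[str]) -> str:
    # The model conditions on both the retrieved documents and the query.
    context = "\n\n".join(retrieve(query, documents))
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```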

Tool use

Language models can perform long reasoning steps by calling external methods, such as numerical recipes, program interpreters, API calls, and so on. This can be prompt-engineered by describing the external methods in-context (an example of in-context learning) or finetuned into the model.[10]
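
A sketch of in-context tool use is given below: the tools are described in the prompt, and tool calls emitted by the model are executed and substituted back into the output. The tool-call syntax, the `generate` placeholder, and the calculator tool are illustrative and not taken from the cited paper.

```python
import re

def generate(prompt: str) -> str:
    """Placeholder: language model call that may emit tool-call markup."""
    raise NotImplementedError("Replace with a real language model call.")

TOOLS = {
    # Each tool maps a string argument to a string result.
    # eval is used only for this arithmetic demo; it is unsafe for untrusted input.
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_with_tools(question: str) -> str:
    # The prompt describes the available tools in-context; the model is expected
    # to emit calls such as [Calculator(12*7)], which are executed and substituted.
    tool_docs = "Available tools: [Calculator(expression)] evaluates arithmetic."
    output = generate(f"{tool_docs}\nQ: {question}\nA:")

    def substitute(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        return tool(arg) if tool else match.group(0)  # leave unknown calls unchanged

    return re.sub(r"\[(\w+)\((.*?)\)\]", substitute, output)
```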

Supervised finetuning

A base model can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The finetuned model would then be able to generate reasoning traces for a given problem.[11][12]
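
A sketch of how such a finetuning example might be formatted; the field names, prompt template, and word problem are illustrative rather than taken from the cited datasets.

```python
def format_example(problem: str, reasoning_trace: str, answer: str) -> dict:
    # Each training example pairs a problem with a worked reasoning trace and
    # final answer; the model is finetuned to produce the target given the prompt.
    return {
        "prompt": f"Problem: {problem}\nSolution:",
        "target": f"{reasoning_trace}\nFinal answer: {answer}",
    }

# The finetuning objective is the usual next-token cross-entropy on the target,
# so the finetuned model learns to emit a reasoning trace before its answer.
dataset = [
    format_example(
        "A train travels 60 km in the first hour and 40 km in the second hour. "
        "How far does it travel in total?",
        "In the first hour it covers 60 km. In the second hour it covers 40 km. "
        "60 + 40 = 100.",
        "100 km",
    ),
]
```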

Reinforcement learning

A pretrained language model can be further trained by RL. In the RL formalism, a generative language model is a policy π. A prompt specifying a task to solve is an environmental state x, and the response of the language model to the prompt is an action y. The probability that the language model responds to prompt x with response y is π(y | x).

Training a reasoning language model by RL then consists of constructing a reward model to guide the RL process. Intuitively, a reward model describes how desirable, appropriate, or good the response is for the prompt. For a reasoning language model, the prompt describes a reasoning task, and the reward is high if the response solves the task, and low if it does not.

For reasoning language models, the model's response y may be broken down into multiple steps, in which case it is written as y = (y_1, y_2, ..., y_n).
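
A minimal sketch of this formalism follows; the per-token log-probability helper is a hypothetical placeholder, and splitting a response into steps by line breaks is just one possible convention.

```python
import math

def token_log_probs(prompt: str, response: str) -> list[float]:
    """Placeholder: per-token log-probabilities of the response given the prompt."""
    raise NotImplementedError("Replace with a real language model call.")

def response_probability(prompt: str, response: str) -> float:
    # pi(y | x): the policy's probability of responding y to prompt x is the
    # product of its per-token probabilities.
    return math.exp(sum(token_log_probs(prompt, response)))

def split_into_steps(response: str) -> list[str]:
    # One simple convention: treat each non-empty line of the response as a
    # reasoning step y_1, ..., y_n, with the last step carrying the final answer.
    return [line for line in response.splitlines() if line.strip()]
```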

Outcome reward model

The outcome reward model, or outcome-supervised RM (ORM),[11] is a reward model in which the reward of a step is determined by the final answer: r(x, y_1, ..., y_i) depends only on the prompt x and the final answer y_n. Such models are also called "verifiers".

For tasks with an answer that is easy to verify, such as math word problems, the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise.[11] If the answer is not easy to verify programmatically, humans can manually label the answers as correct or not, and the labels can then be used to finetune a base model that predicts the human label.[13] For other kinds of tasks, such as creative writing, where verification is difficult, one can train a reward model by finetuning a base model on human preference data, as in reinforcement learning from human feedback.[14] A base model can also be finetuned to predict, given a partial thinking trace (x, y_1, ..., y_i), whether the final answer would be correct, and this prediction can then be used as a binary reward signal.[11]
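
A sketch of a binary outcome reward for programmatically verifiable answers; `extract_final_answer` is a hypothetical parsing helper.

```python
def extract_final_answer(response: str) -> str:
    """Placeholder: parse the final answer out of a model response."""
    raise NotImplementedError("Replace with an answer-extraction rule.")

def outcome_reward(response: str, reference_answer: str) -> float:
    # Binary outcome reward: 1 if the final answer matches the reference, else 0.
    # Every step of the trace receives the same reward, since the outcome reward
    # depends only on the final answer.
    correct = extract_final_answer(response).strip() == reference_answer.strip()
    return 1.0 if correct else 0.0
```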

Process reward model

The process reward model, or process-supervised RM (PRM),[11] is a reward model in which the reward of a step is determined by the steps so far: r(x, y_1, ..., y_i) depends only on the prompt x and the steps y_1, ..., y_i, not on the final answer.

Given a partial thinking trace (x, y_1, ..., y_i), a human can be queried as to whether the steps so far are correct, regardless of whether the ultimate answer would be correct. This judgment can then be used as a binary reward signal. As human labels are expensive, a base model can be finetuned to predict the human labels.[11]
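
A sketch of a per-step process reward; the `step_is_correct` judge is a hypothetical placeholder for either a human label or a finetuned classifier.

```python
def step_is_correct(prompt: str, steps_so_far: list[str]) -> bool:
    """Placeholder: a human label or learned classifier judging whether the
    steps taken so far are correct, regardless of the eventual final answer."""
    raise NotImplementedError("Replace with a human label or classifier call.")

def process_rewards(prompt: str, steps: list[str]) -> list[float]:
    # The reward of step y_i depends only on the prefix (y_1, ..., y_i),
    # not on the final answer.
    return [1.0 if step_is_correct(prompt, steps[: i + 1]) else 0.0
            for i in range(len(steps))]
```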

Applications

Prompt engineering emerged with GPT-3 as "few-shot learning",[15] which began a period of research into "eliciting" the capacities of pretrained language models. It was then found that a model could be prompted to perform CoT reasoning, which improves its performance on reasoning tasks.

See also

References

  1. ^ Qiao, Shuofei; Ou, Yixin; Zhang, Ningyu; Chen, Xiang; Yao, Yunzhi; Deng, Shumin; Tan, Chuanqi; Huang, Fei; Chen, Huajun (2023-09-18), Reasoning with Language Model Prompting: A Survey, arXiv, doi:10.48550/arXiv.2212.09597, arXiv:2212.09597
  2. ^ Wei, Jason; Wang, Xuezhi; Schuurmans, Dale; Bosma, Maarten; Ichter, Brian; Xia, Fei; Chi, Ed H.; Le, Quoc V.; Zhou, Denny (31 October 2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS 2022). Vol. 35. arXiv:2201.11903.
  3. ^ Sharan Narang and Aakanksha Chowdhery (2022-04-04). "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance".
  4. ^ Yao, Shunyu; Yu, Dian; Zhao, Jeffrey; Shafran, Izhak; Griffiths, Thomas L.; Cao, Yuan; Narasimhan, Karthik (2023-05-17). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models". arXiv:2305.10601 [cs.CL].
  5. ^ Besta, Maciej; Blach, Nils; Kubicek, Ales; Gerstenberger, Robert; Podstawski, Michal; Gianinazzi, Lukas; Gajda, Joanna; Lehmann, Tomasz; Niewiadomski, Hubert; Nyczyk, Piotr; Hoefler, Torsten (2024-03-24). "Graph of Thoughts: Solving Elaborate Problems with Large Language Models". Proceedings of the AAAI Conference on Artificial Intelligence. 38 (16): 17682–17690. doi:10.1609/aaai.v38i16.29720. ISSN 2374-3468.
  6. ^ Wang, Xuezhi; Wei, Jason; Schuurmans, Dale; Le, Quoc; Chi, Ed; Narang, Sharan; Chowdhery, Aakanksha; Zhou, Denny (2022-03-01). "Self-Consistency Improves Chain of Thought Reasoning in Language Models". arXiv:2203.11171 [cs.CL].
  7. ^ Diao, Shizhe; Wang, Pengcheng; Lin, Yong; Zhang, Tong (2023-02-01). "Active Prompting with Chain-of-Thought for Large Language Models". arXiv:2302.12246 [cs.CL].
  8. ^ "How Each Index Works - LlamaIndex 🦙 v0.10.17". docs.llamaindex.ai. Retrieved 2024-04-08.
  9. ^ Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 9459–9474. arXiv:2005.11401.
  10. ^ Schick, Timo; Dwivedi-Yu, Jane; Dessi, Roberto; Raileanu, Roberta; Lomeli, Maria; Hambro, Eric; Zettlemoyer, Luke; Cancedda, Nicola; Scialom, Thomas (2023-12-15). "Toolformer: Language Models Can Teach Themselves to Use Tools". Advances in Neural Information Processing Systems. 36: 68539–68551.
  11. ^ a b c d e f Uesato, Jonathan; Kushman, Nate; Kumar, Ramana; Song, Francis; Siegel, Noah; Wang, Lisa; Creswell, Antonia; Irving, Geoffrey; Higgins, Irina (2022-11-25), Solving math word problems with process- and outcome-based feedback, arXiv, doi:10.48550/arXiv.2211.14275, arXiv:2211.14275
  12. ^ Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv, doi:10.48550/arXiv.2110.14168, arXiv:2110.14168
  13. ^ Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv, doi:10.48550/arXiv.2110.14168, arXiv:2110.14168
  14. ^ Lightman, Hunter; Kosaraju, Vineet; Burda, Yura; Edwards, Harri; Baker, Bowen; Lee, Teddy; Leike, Jan; Schulman, John; Sutskever, Ilya (2023-05-31), Let's Verify Step by Step, arXiv, doi:10.48550/arXiv.2305.20050, arXiv:2305.20050
  15. ^ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon (2020-12-06). "Language models are few-shot learners". Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS '20. Red Hook, NY, USA: Curran Associates Inc.: 1877–1901. ISBN 978-1-7138-2954-6.