Reflection (artificial intelligence)

Reflection in artificial intelligence, notably used in large language models, specifically in Reasoning Language Models (RLMs), is the ability for an artificial neural network to provide top-down feedback to its input or previous layers, based on their outputs or subsequent layers. This process involves self-assessment and internal deliberation, aiming to enhance reasoning accuracy, minimize errors (like hallucinations), and increase interpretability. Reflection is a form of "test-time compute", where additional computational resources are used during inference.

Introduction

Traditional neural networks process inputs in a feedforward manner, generating outputs in a single pass. However, their limitations in handling complex tasks, and especially compositional ones, have led to the development of methods that simulate internal deliberation. Techniques such as chain-of-thought prompting encourage models to generate intermediate reasoning steps, thereby improving their performance in such tasks.

The feedback can take place either after a full network pass and decoding to tokens, or continuously in latent space (the last layer can be fed back to the first layer).^[1]^[2] In LLMs, special tokens can mark the beginning and end of reflection before producing a final response (e.g., <thinking>).

This internal process of "thinking" about the steps leading to an answer is analogous to human metacognition or "thinking about thinking". It helps AI systems approach tasks that require multi-step reasoning, planning, and logical thought.

Techniques

Increasing the length of the Chain-of-Thought reasoning process, by passing the output of the model back to its input and doing multiple network passes, increases inference-time scaling.^[3] Reinforcement learning frameworks have also been used to steer the Chain-of-Thought. One example is Group Relative Policy Optimization (GRPO), used in DeepSeek-R1,^[4] a variant of policy gradient methods that eliminates the need for a separate "critic" model by normalizing rewards within a group of generated outputs, reducing computational cost. Simple techniques like "budget forcing" (forcing the model to continue generating reasoning steps) have also proven effective in improving performance.^[5]

Types of reflection

Post-hoc reflection

Analyzes and critiques an initial output separately, often involving prompting the model to identify errors or suggest improvements after generating a response. The Reflexion framework follows this approach.^[6]^[7]

Iterative reflection

Revises earlier parts of a response dynamically during generation. Self-monitoring mechanisms allow the model to adjust reasoning as it progresses. Methods like Tree-of-Thoughts exemplify this, enabling backtracking and alternative exploration.

Intrinsic reflection

Integrates self-monitoring directly into the model architecture rather than relying solely on external prompts, enabling models with inherent awareness of their reasoning limitations and uncertainties. This has been used by Google DeepMind in a technique called Self-Correction via Reinforcement Learning (SCoRe) which rewards the model for improving its responses.^[8]

Process reward models and limitations

Early research explored PRMs to provide feedback on each reasoning step, unlike traditional reinforcement learning which rewards only the final outcome. However, PRMs have faced challenges, including computational cost and reward hacking. DeepSeek-R1's developers found them to be not beneficial.^[9]^[10]

Benchmarks

Reflective models generally outperform non-reflective models in most benchmarks, especially on tasks requiring multi-step reasoning.

However, some benchmarks exclude reflective models due to longer response times.

Humanity's Last Exam

The HLE, a rigorous benchmark designed to assess expert-level reasoning across mathematics, humanities, and the natural sciences, reveals substantial performance gaps among models. State-of-the-art reasoning models have demonstrated low accuracy on HLE, highlighting significant room for improvement. In particular, the full reasoning model o3 achieved an accuracy of 26.6%,^[11] while its lighter counterpart, o3‑mini-high (evaluated on text‑only questions), reached 13%.^[12]

AIME

The American Invitational Mathematics Examination (AIME) benchmark, a challenging mathematics competition, demonstrates significant performance differences between model types. Non-reasoning models typically solve less than 30% of AIME. In contrast, models employing reasoning techniques score between 50% and 80%.^[13] While OpenAI's o1 maintained or slightly improved its accuracy from reported 2024^{[citation needed]} metrics to 2025 AIME results, o3-mini (high) achieved a higher accuracy (80%) at a significantly lower cost (approximately 12 times cheaper).

o3-mini performance

According to OpenAI's January 2025 report on o3-mini, adjustable "reasoning effort" significantly affects performance, particularly in STEM. Increasing reasoning effort from low to high boosts accuracy on benchmarks like AIME 2024, GPQA Diamond, and Codeforces, providing performance gains typically in the range of 10-30%. With high reasoning effort, o3-mini (high) achieved 87.3% in AIME (different from the MathArena AIME benchmark results), 79.7% in GPQA Diamond, 2130 Elo in Codeforces, and 49.3 in SWE-bench Verified.^[14]

Drawbacks

Computational cost

Reflective models require significantly more test-time compute than non-reasoning models. On the AIME benchmark, reasoning models were 10 to 74 times more expensive^[13] than non-reasoning counterparts.

Generation time

Reflective reasoning increases response times, with current models taking anywhere from three seconds to several minutes to generate an answer. As reasoning depth improves, future models may require even longer processing times.

References

^ "1 Scaling by Thinking in Continuous Space". arxiv.org. Retrieved 2025-02-14.
^ "Training Large Language Models to Reason in a Continuous Latent Space". arxiv.org. Retrieved 2025-02-14.
^ "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arxiv.org. Retrieved 2025-02-23.
^ "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arxiv.org. Retrieved 2025-02-23.
^ Muennighoff, Niklas; Yang, Zitong; Shi, Weijia; Li, Xiang Lisa; Fei-Fei, Li; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Liang, Percy; Candès, Emmanuel (2025-02-03), s1: Simple test-time scaling, arXiv:2501.19393
^ Shinn, Noah (2023-10-10), Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366
^ Shinn, Noah; Cassano, Federico; Berman, Edward; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu (2023-10-10), Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366
^ Dickson, Ben (1 October 2024). "DeepMind's SCoRe shows LLMs can use their internal knowledge to correct their mistakes". VentureBeat. Retrieved 20 February 2025.
^ Uesato, Jonathan (2022). "Solving math word problems with process- and outcome-based feedback". arXiv:2211.14275 [cs.LG].
^ Lightman, Hunter (2024). "Let's Verify Step by Step". arXiv:2305.20050 [cs.LG].
^ McKenna, Greg. "OpenAI's deep research can complete 26% of Humanity's Last Exam". Fortune. Retrieved 2025-03-16.
^ John-Anthony Disotto (2025-02-04). "OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake". TechRadar. Retrieved 2025-03-16.
^ ^a ^b "MathArena". 2025-02-10. Archived from the original on 10 February 2025. Retrieved 2025-02-10.
^ "OpenAI o3-mini". OpenAI. 2025-01-31. Retrieved 2025-02-09.

[1] "1 Scaling by Thinking in Continuous Space". arxiv.org. Retrieved 2025-02-14.

[2] "Training Large Language Models to Reason in a Continuous Latent Space". arxiv.org. Retrieved 2025-02-14.

[3] "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arxiv.org. Retrieved 2025-02-23.

[4] "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". arxiv.org. Retrieved 2025-02-23.

[5] Muennighoff, Niklas; Yang, Zitong; Shi, Weijia; Li, Xiang Lisa; Fei-Fei, Li; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Liang, Percy; Candès, Emmanuel (2025-02-03), s1: Simple test-time scaling, arXiv:2501.19393

[6] Shinn, Noah (2023-10-10), Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366

[7] Shinn, Noah; Cassano, Federico; Berman, Edward; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu (2023-10-10), Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366

[8] Dickson, Ben (1 October 2024). "DeepMind's SCoRe shows LLMs can use their internal knowledge to correct their mistakes". VentureBeat. Retrieved 20 February 2025.

[9] Uesato, Jonathan (2022). "Solving math word problems with process- and outcome-based feedback". arXiv:2211.14275 [cs.LG].

[10] Lightman, Hunter (2024). "Let's Verify Step by Step". arXiv:2305.20050 [cs.LG].

[11] McKenna, Greg. "OpenAI's deep research can complete 26% of Humanity's Last Exam". Fortune. Retrieved 2025-03-16.

[12] John-Anthony Disotto (2025-02-04). "OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake". TechRadar. Retrieved 2025-03-16.

[:1-13] "MathArena". 2025-02-10. Archived from the original on 10 February 2025. Retrieved 2025-02-10.

[14] "OpenAI o3-mini". OpenAI. 2025-01-31. Retrieved 2025-02-09.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]