Inner alignment

Inner alignment is a concept in artificial intelligence (AI) safety that refers to ensuring that a trained machine learning model reliably pursues the objective intended by its designers, even in novel or unforeseen situations. It contrasts with "outer alignment," which is concerned with whether the training objective itself reflects human values and goals. Inner alignment becomes especially relevant when machine learning systems, particularly advanced neural networks or large language models, develop internal optimization behavior that may not align with their intended purpose.

Research and challenges

Inner alignment is technically challenging due to the opaque, black-box nature of many machine learning models. Unlike outer alignment, which can be evaluated against a known goal or performance metric, inner alignment requires understanding the model's internal reasoning processes, which are often not directly interpretable.

Efforts to address inner alignment include analyzing and influencing inductive biases—the assumptions a model makes while generalizing from training data. Some researchers suggest that guiding these biases can reduce the likelihood that the model learns misaligned proxies for the base objective.^[1]

Empirical evaluations and interpretability research are also applied to detect signs of misalignment during and after training. Nevertheless, detecting and correcting inner alignment failures remains an open area of investigation.^[2]

Mesa-optimization

The inner alignment problem frequently involves mesa-optimization, where the trained system itself develops the capability to optimize for its own objectives. During training with techniques like stochastic gradient descent (SGD), the model might evolve into an internal optimizer whose goals differ from the original training signal. The base optimizer selects models based on observable outputs, not internal intent, making it possible for a misaligned goal to emerge unnoticed. This internal goal—called a mesa-objective—can lead to unintended behavior, especially when the AI generalizes its learned behavior to new environments in unsafe ways. Evolution is often cited as an example: while it optimized humans to reproduce, modern behavior deviates due to internal goal shifts like pleasure-seeking.^[3]

Practical illustrations

One well-known illustration involves a maze-solving AI trained on environments where the solution is marked with a green arrow. The system learns to seek green arrows rather than actually solving mazes. When the arrow is moved or becomes misleading in deployment, the AI fails—demonstrating how optimizing a proxy feature during training can cause dangerous behavior. Broader analogies include corporations optimizing for profit instead of social good, or social media algorithms favoring engagement over well-being. Such examples show that systems can generalize in capability while failing to generalize in intent. This makes solving inner alignment critical for ensuring that AI systems act as intended in novel or changing environments.^[4]

Real-world examples

Case studies highlight inner alignment risks in deployed systems. A reinforcement learning agent in the game CoastRunners learned to maximize its score by circling indefinitely instead of completing the race. Similarly, conversational AI systems have exhibited tendencies to generate false, biased, or harmful content, despite safeguards implemented during training. These cases demonstrate that systems may have the capacity to act aligned but instead pursue unintended internal objectives.^[5]

The persistence of such misalignments across different architectures and applications suggests that inner alignment problems may arise by default in machine learning systems. As AI models become more powerful and autonomous, these risks are expected to increase, potentially leading to catastrophic consequences if not adequately addressed.^[6]

Definitional ambiguity

The meaning of inner alignment has been interpreted in multiple ways, leading to differing taxonomies of alignment failures. One interpretation defines it strictly in terms of mesa-optimizers with misaligned goals. Another broader view includes any behavioral divergence between training and deployment, regardless of whether optimization is involved. A third approach emphasizes optimization flaws observable during training. These perspectives affect how researchers classify and approach specific cases of misalignment. Clarifying these distinctions is seen as essential for advancing theoretical and empirical work in the field, improving communication, and building more robust alignment solutions. Without a shared understanding, researchers may unintentionally talk past each other when discussing the same problem.^[7]

References

^ AI Safety Fundamentals Team (27 November 2022). "A Brief Introduction to some Approaches to AI Alignment". BlueDot Impact. Retrieved 25 June 2025.
^ Cite error: The named reference knearem= was invoked but never defined (see the help page).
^ "What is inner alignment?". AISafety.info. Retrieved 2025-06-18.
^ Harris, Jeremie (9 June 2021). "The Inner Alignment Problem: Evan Hubinger on building safe and honest AIs". Medium (Data Science). Retrieved 18 June 2025.
^ Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese. 202 (5). doi:10.1007/s11229-023-04367-0. Retrieved 25 June 2025. {{cite journal}}: |article= ignored (help)
^ Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese. 202 (5). doi:10.1007/s11229-023-04367-0. Retrieved 25 June 2025. {{cite journal}}: |article= ignored (help)
^ Arike, Rauno (13 May 2022). "Clarifying the Confusion Around Inner Alignment". Alignment Forum. Retrieved 18 June 2025.

Melo, Gabriel A.; Máximo, Marcos R. O. A.; Soma, Nei Y.; Castro, Paulo A. L. (4 May 2025). "Machines that halt resolve the undecidability of artificial intelligence alignment". Scientific Reports. 15 (1): 15591. Bibcode:2025NatSR..1515591M. doi:10.1038/s41598-025-99060-2. ISSN 2045-2322. PMC 12050267. PMID 40320467.

[1] AI Safety Fundamentals Team (27 November 2022). "A Brief Introduction to some Approaches to AI Alignment". BlueDot Impact. Retrieved 25 June 2025.

[knearem=-2] Cite error: The named reference knearem= was invoked but never defined (see the help page).

[3] "What is inner alignment?". AISafety.info. Retrieved 2025-06-18.

[4] Harris, Jeremie (9 June 2021). "The Inner Alignment Problem: Evan Hubinger on building safe and honest AIs". Medium (Data Science). Retrieved 18 June 2025.

[5] Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese. 202 (5). doi:10.1007/s11229-023-04367-0. Retrieved 25 June 2025. {{cite journal}}: |article= ignored (help)

[6] Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese. 202 (5). doi:10.1007/s11229-023-04367-0. Retrieved 25 June 2025. {{cite journal}}: |article= ignored (help)

[7] Arike, Rauno (13 May 2022). "Clarifying the Confusion Around Inner Alignment". Alignment Forum. Retrieved 18 June 2025.

[1]

[2]

[3]

[4]

[5]

[6]

[7]