Inner alignment
Inner alignment is a concept in artificial intelligence (AI) safety that refers to ensuring that a trained machine learning model reliably pursues the objective intended by its designers, even in novel or unforeseen situations. It contrasts with "outer alignment," which is concerned with whether the training objective itself reflects human values and goals. Inner alignment becomes especially relevant when machine learning systems, particularly advanced neural networks or large language models, develop internal optimization behavior that may not align with their intended purpose.[1]
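The distinction can be illustrated with a simplified, hypothetical example that is not drawn from the cited sources: if the intended goal always coincides with a spurious cue during training, a learned policy may come to pursue the cue rather than the goal, scoring well on the training objective but diverging from it once the correlation breaks. The following Python sketch uses an invented one-dimensional gridworld, and deliberately restricts the policy to the spurious feature as a stand-in for a learning bias that favors the proxy; it is an illustrative toy, not an implementation from the literature.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CELLS = 5

def make_episode(correlated=True):
    """Return (features, intended_action) for a one-dimensional world.

    features = [agent_pos, goal_pos, green_pos]. The intended action is
    +1 (move right) if the goal lies at or to the right of the agent,
    otherwise -1 (move left). During training the green marker always
    sits on the goal, so the two cues are perfectly correlated.
    """
    agent = rng.integers(0, N_CELLS)
    goal = rng.integers(0, N_CELLS)
    green = goal if correlated else rng.integers(0, N_CELLS)
    action = 1 if goal >= agent else -1
    return np.array([agent, goal, green], dtype=float), action

# Build a training set in which the goal and the green marker coincide.
train = [make_episode(correlated=True) for _ in range(1000)]
X = np.stack([f for f, _ in train])
y = np.array([a for _, a in train], dtype=float)

# Fit a linear policy that only attends to the agent and green-marker
# positions (a stand-in for an inductive bias favoring the proxy cue).
design = np.c_[X[:, 0], X[:, 2], np.ones(len(X))]
w = np.linalg.lstsq(design, y, rcond=None)[0]

def policy(features):
    agent, _, green = features
    return 1 if w[0] * agent + w[1] * green + w[2] >= 0 else -1

def accuracy(episodes):
    return np.mean([policy(f) == a for f, a in episodes])

# On the training distribution, the proxy objective ("move toward green")
# matches the intended objective ("move toward the goal") almost perfectly.
print("training-distribution accuracy:", accuracy(train))

# Once the correlation breaks, the learned proxy diverges from the goal.
shifted = [make_episode(correlated=False) for _ in range(1000)]
print("shifted-distribution accuracy:", accuracy(shifted))
```

In this toy setting the policy appears aligned as long as the training correlation holds; the misalignment only becomes visible under distribution shift, which is the pattern the inner alignment literature is concerned with.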
Case studies highlight inner alignment risks in deployed systems. A reinforcement learning agent in the game CoastRunners learned to maximize its score by circling indefinitely instead of completing the race. Similarly, conversational AI systems have generated false, biased, or harmful content despite safeguards implemented during training. These cases suggest that a system can have the capacity to behave as intended yet nonetheless pursue unintended internal objectives.[2]
The persistence of such misalignments across different architectures and applications suggests that inner alignment problems may arise by default in machine learning systems. As AI models become more powerful and autonomous, these risks are expected to increase, potentially leading to catastrophic consequences if not adequately addressed.[2]
References
- Melo, Gabriel A.; Máximo, Marcos R. O. A.; Soma, Nei Y.; Castro, Paulo A. L. (4 May 2025). "Machines that halt resolve the undecidability of artificial intelligence alignment". Scientific Reports. 15 (1): 15591. Bibcode:2025NatSR..1515591M. doi:10.1038/s41598-025-99060-2. ISSN 2045-2322. PMC 12050267. PMID 40320467.
- Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese. 202 (5). Springer. doi:10.1007/s11229-023-04367-0. ISSN 0039-7857. Retrieved 26 June 2025.
- Li, Kanxue; Zheng, Qi; Zhan, Yibing; Zhang, Chong; Zhang, Tianle; Lin, Xu; Qi, Chongchong; Li, Lusong; Tao, Dapeng (August 2024). "Alleviating Action Hallucination for LLM-based Embodied Agents via Inner and Outer Alignment". 2024 7th International Conference on Pattern Recognition and Artificial Intelligence (PRAI): 613–621. doi:10.1109/PRAI62207.2024.10826957.