Jump to content

Inner alignment

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by A. B. (talk | contribs) at 01:14, 28 June 2025 (References: Wikipedia Library). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Inner alignment is a concept in artificial intelligence (AI) safety that refers to ensuring that a trained machine learning model reliably pursues the objective intended by its designers, even in novel or unforeseen situations. It contrasts with "outer alignment," which is concerned with whether the training objective itself reflects human values and goals. Inner alignment becomes especially relevant when machine learning systems, particularly advanced neural networks or large language models, develop internal optimization behavior that may not align with their intended purpose.[1]

Case studies highlight inner alignment risks in deployed systems. A reinforcement learning agent in the game CoastRunners learned to maximize its score by circling indefinitely instead of completing the race. Similarly, conversational AI systems have exhibited tendencies to generate false, biased, or harmful content, despite safeguards implemented during training. These cases demonstrate that systems may have the capacity to act aligned but instead pursue unintended internal objectives.[2]

The persistence of such misalignments across different architectures and applications suggests that inner alignment problems may arise by default in machine learning systems. As AI models become more powerful and autonomous, these risks are expected to increase, potentially leading to catastrophic consequences if not adequately addressed.[2]

See also

References

  1. ^ Melo, Gabriel A.; Máximo, Marcos R. O. A.; Soma, Nei Y.; Castro, Paulo A. L. (4 May 2025). "Machines that halt resolve the undecidability of artificial intelligence alignment". Scientific Reports. 15 (1): 15591. Bibcode:2025NatSR..1515591M. doi:10.1038/s41598-025-99060-2. ISSN 2045-2322. PMC 12050267. PMID 40320467.
  2. ^ a b Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese (Original research). 202 (5). Springer. doi:10.1007/s11229-023-04367-0. ISSN 0039-7857. Retrieved 26 June 2025. {{cite journal}}: |article= ignored (help)CS1 maint: date and year (link)