Wikipedia:Articles for deletion/Inner alignment

Inner alignment

[Hide this box] New to Articles for deletion (AfD)? Read these primers!

(Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL)

The article does not currently cite reliable sources. Current citations include the forums "LessWrong" and "AI Alignment Forum", and blog articles on "AISafety.info", Medium, and LinkedIn. A web search turned up the following primary source articles:

I am recommending this article for deletion since I could find no references to this concept in reliable secondary sources. Elestrophe (talk) 01:40, 25 June 2025 (UTC)[reply]

If you came here because someone asked you to, or you read a message on another website, please note that this is not a majority vote, but instead a discussion among Wikipedia contributors. Wikipedia has policies and guidelines regarding the encyclopedia's content, and consensus (agreement) is gauged based on the merits of the arguments, not by counting votes.

However, you are invited to participate and your opinion is welcome. Remember to assume good faith on the part of others and to sign your posts on this page by adding ~~~~ at the end.

Note: Comments may be tagged as follows: suspected single-purpose accounts: {{subst:spa|username}}; suspected canvassed users: {{subst:canvassed|username}}; accounts blocked for sockpuppetry: {{subst:csm|username}} or {{subst:csp|username}}.

Keep: This concept seems to exist and be a confounding factor in artificial intelligence spaces, and therefore has some value to the overall encyclopedia. Because AI is advancing at such a rate, and because such advancements raise challenges faster than scientific study of those challenges can be adequately conducted, I would argue that there is some limited room for article creation before full adequate sourcing exists. There is a fine line between what I am talking about and a violation of WP:CRYSTALBALL and WP:NOR; but I would raise that it is better to have an article in this case than not have an article. Foxtrot620 (talk) 18:23, 25 June 2025 (UTC)[reply]
Creating an article "before full adequate sourcing exists" is a violation of the No Original Research policy, full stop. Stepwise Continuous Dysfunction (talk) 00:20, 26 June 2025 (UTC)[reply]
Note: This discussion has been included in the list of Technology-related deletion discussions. WCQuidditch ☎ ✎ 02:27, 25 June 2025 (UTC)[reply]
Keep - this is a notable concept. I just added a reference to the article from Scientific Reports. A Google Scholar search for "inner alignment" artificial intelligence turns up 300+ results. Many are preprints but there remain many peer-reviewed papers and books. Books, too. --A. B. ^{(talk • contribs • global count)} 20:43, 25 June 2025 (UTC)[reply]
Scientific Reports is not a good journal. It's the cash-grab of the Nature company. The majority of Wikipedia's own article about it is the "Controversies" section, for goodness sake. Stepwise Continuous Dysfunction (talk) 00:12, 26 June 2025 (UTC)[reply]
Keep The version has been improved and the concept itself is notable and increasingly discussed in the academic literature. The notion of “inner alignment” is widely cited in alignment research and has been already formalized. While the original discussions emerged on platforms like the AI Alignment Forum and LessWrong, the term has since migrated into peer-reviewed academic publications. Southernhemisphere (talk) 23:15, 25 June 2025 (UTC)[reply]
Delete In the absence of actual serious literature, i.e., multiple reliably-published articles that cover the topic in depth, this is just an advertisement for an ideology. The current sourcing is dreadful, running the gamut from LessWrong to LinkedIn, and a search for better options did not turn up nearly enough to indicate that this needs an article rather than, at most, a sentence somewhere else. Stepwise Continuous Dysfunction (talk) 00:17, 26 June 2025 (UTC)[reply]
LessWrong and LinkedIn referenced texts were deleted. While the article requires further refinement, the topic remains highly relevant. Southernhemisphere (talk) 05:27, 26 June 2025 (UTC)[reply]
OK, now remove "aisafety.info" (a primary, non-independent source with no editorial standards that can be discerned). And "Bluedot Impact" (likewise). And the blog post about a podcast episode on Medium, which fails every test one could want for a source good enough to build an encyclopedia article upon. What's left? Not much. Stepwise Continuous Dysfunction (talk) 06:42, 26 June 2025 (UTC)[reply]
Keep Deleting by what is in the article today vs what is out there is not how it works. Poorly or incompletely written is not grounds to delete. Google this: "Inner alignment" artificial intelligence. Lots of stuff if we but look: [1], [2], [3], [4], [5]. Exists and is notable, and newer sciences, so you have to dig more. -- Very Polite Person (talk) 03:50, 26 June 2025 (UTC)[reply]
The first link is to the arXiv preprint version of a conference proceedings paper in a conference with unknown standards. The lead author was at OpenAI, which means that the paper has to be judged for the possibility of criti-hype, and in any event, should be regarded as primary and not independent. The second is a page of search results from a search engine that does not screen for peer review and even includes a self-published book. The third is in Scientific Reports, which via this essay I learned has published crackpot physics. The fifth is a thesis, which is generally not a good kind of source to use. In short, there is much less here than meets the eye. Stepwise Continuous Dysfunction (talk) 06:38, 26 June 2025 (UTC)[reply]
I will note that a doctoral thesis is an allowable reliable source. However hinging an article like this on a single source is not appropriate. This is why I proposed draftification. This topic could very well be one that generates reliable sources but it's clearly not there yet. Simonm223 (talk) 13:34, 26 June 2025 (UTC)[reply]
Delete The only source that looks halfway like credible computer science is a wildly speculative pre-print from 2024 sponsored by Google and Microsoft. The article looks like covert advertising for AIsafety.info. Jujodon (talk) 10:14, 26 June 2025 (UTC)[reply]
Draftify as WP:TOOSOON. If reliable academic sources come forward then this article then that's fine but preprints and blogs are not reliable sources. Simonm223 (talk) 13:31, 26 June 2025 (UTC)[reply]
Delete or draftify. Is there a single RS for this? Perhaps we could move the article to arXiv too, or maybe viXra - David Gerard (talk) 18:50, 26 June 2025 (UTC)[reply]
Keep. Inner alignment is a notable and emerging concept in AI safety, now cited in peer-reviewed sources such as Scientific Reports (Melo et al., 2025) and PRAI 2024 (Li et al.). While the article began with less formal sources, newer academic literature confirms its relevance. Per WP:GNG, the topic has significant coverage in reliable sources. Improvements are ongoing, and deletion would be premature for a concept gaining scholarly traction. Sebasargent (talk) 19:05, 26 June 2025 (UTC) — Sebasargent (talk • contribs) has made few or no other edits outside this topic. [reply]
- "emerging concept" places it squarely as WP:TOOSOON - David Gerard (talk) 23:54, 26 June 2025 (UTC)[reply]
  Inner alignment is an urgent topic because it addresses a core safety challenge in the development of powerful AI systems, especially those based on LLMs or other ML techniques. Southernhemisphere (talk) 00:04, 27 June 2025 (UTC)[reply]
I have just removed the many paragraphs cited solely to blog posts, arXiv preprints, Medium posts, some guy's website, or nothing at all. This is now a three-paragraph article with two cites. Is that really all there is to this? Nothing else in a solid RS? - David Gerard (talk) 00:03, 27 June 2025 (UTC)[reply]
The article should be fixed and enhanced, not deleted. Inner alignment is crucial to preventing both existential risks and suffering risks. Misaligned AI systems may pursue unintended goals, leading to human extinction or vast suffering. Ensuring AI internal goals match human values is key to avoiding catastrophic outcomes as AI systems become more capable and autonomous. Southernhemisphere (talk) 00:06, 27 June 2025 (UTC)[reply]
If you seriously claim that LLMs will lead to the end of humanity, then this sounds like the topic is squarely within the purview of WP:FRINGE. This puts upon it strong RS requirements. Right now it has two RSes, one of those the topic is merely a passing mention in a footnote. Given this, you really, really need more solid sourcing. I just posted a call on WP:FTN asking for good sourcing - David Gerard (talk) 00:10, 27 June 2025 (UTC)[reply]
The article doesn’t assert that LLMs will end humanity, but notes that some researchers view inner alignment as a potential contributor to AI risk. I agree that stronger secondary sources are needed and will work on adding more reliable references to reflect the seriousness of the topic neutrally. Southernhemisphere (talk) 00:14, 27 June 2025 (UTC)[reply]

To speak to your point, User:David Gerard, As an expert in Emergency Management, and someone who has spent a great deal of time studying global catastrophic risk, the idea that AI could lead to the end of humanity is far from fringe science. The fact that essentially every AI company working towards AGI has a team working on Catostrophic Risk is more than enough evidence that AI poses a possible existential threat. Essentially no one on either side of the AI debate disagrees that AI poses a general catastrophic risk. They may disagree on the level of risk and everything else, but the risk is universally acknowledged to be there. - Foxtrot620 (talk) 00:50, 27 June 2025 (UTC)[reply]
Every "AI" company having a team working on catastrophic risk is not significant evidence, because they would still have those teams just for hype under the null hypothesis of lack of belief in catastrophic risk. It would almost certainly fail to reject the null with p < .05, and the Bayes factor would be so small that it shouldn't convince you of anything that you don't already have very high priors for. (Which, sure, might be reasonable for some narrow statements, like companies believing actual AGI "possibly" posing existential risks. Companies believing the current marginal dollar spent on this providing more benefit to them on the "actual risk" side compared to the "attract investment and other hype" is going to be a nah from me) Alpha3031 (t • c) 03:42, 27 June 2025 (UTC)[reply]
I want to pause and reframe, because I don't think this is conveying the point I need to be heard here. While your points are valid, they don't invalidate the concerns I'm raising about AI risk. I want to present this from an emergency management perspective, my area of expertise in order to insure that it's fully understood.

discussion of the general subject of AI risk, not the article nor the specific topic

In emergency management, we assess risk based on three core factors: scale, likelihood, and severity. A risk is worth planning for if any two factors are high. If all three factors are high, or if the likelihood is certain, planning is essential.

Let's illustrate this with some examples in a hypothetical Midwest US town, "Anytown," with a population of 70,000:

Tornado:

Likelihood: High (Midwest location).

Scale: High (could impact the entire town).

Severity: High (could destroy Anytown).

Conclusion: A tornado is a critical risk to prepare for.

Asteroid Impact:

Likelihood: Very low.

Scale: Variable (could be a house or the entire city), but large impacts are extremely low likelihood.

Severity: Variable (from a ruined garden to flattening the town).

Conclusion: Not a primary risk for Anytown to plan for due to low likelihood.

Pandemic:

Likelihood: Certain (history shows pandemics recur).

Scale: High (will impact the entire town).

Severity: Generally high if classified as a pandemic.

Conclusion: A pandemic is an essential risk to prepare for.

Tsunami:

Likelihood: Essentially impossible (Anytown is landlocked).

Conclusion: Not a risk for Anytown to plan for.

Now, applying this established emergency management framework to AI and AGI, we have multiple companies actively developing AGI, often with questionable ethical guidelines and insufficient safeguards. While the likelihood of AGI reaching a critical stage where it poses a significant threat is currently unknown, its potential scale and severity could both be of the absolute highest level, impacting the entire globe. According to the same emergency management principles, that tell us a tornado is a threat to prepare for, so is AI. This is not fringe science; it's a direct application of widely accepted risk assessment principles.

It's also crucial to differentiate here, as the risk isn't just with the theoretical AGI. While AGI poses a potential Global Catastrophic Risk, the issue of AI risk isn't limited to hypothetical future scenarios. AI is already demonstrating tangible risks at various levels:

We know, indisputably, that current, AI has already contributed to loss of life. For instance, when UnitedHealthcare implemented an AI system for prior authorizations, it wrongfully denied countless claims, leading to treatment delays and, tragically, patient deaths. This wasn't AGI; it was basic AI with real-world, life-or-death consequences. While not a global risk, it was certainly a significant risk for the over 22 million patients insured by UHC. It was a national level impact from AI, and it's one that happened.

AI is a pervasive risk that demands comprehensive planning. The inherent flaws that lead to these risks, including the very subject of this page, are a critical part of this conversation and cannot be dismissed as fringe. Foxtrot620 (talk) 20:41, 27 June 2025 (UTC)[reply]
You have just posted a massive forum-style discussion on the general topic of AI to an AFD about a specific article, and you're not even talking about the article at hand. I have not removed your text, but I have collapsed it so you don't flood out discussion on the AFD. Please don't do this again - David Gerard (talk) 21:31, 27 June 2025 (UTC)[reply]

Then I eagerly await you bringing the solid RSes on inner alignment - David Gerard (talk) 08:15, 27 June 2025 (UTC)[reply]
The existence of references on Inner Alignment have no bearing on the validity of AI as a general risk, global or otherwise, which is what this comment was about. Foxtrot620 (talk) 20:54, 27 June 2025 (UTC)[reply]
This page is about a specific article. It is expected that AFD discussions will be about the article - David Gerard (talk) 21:32, 27 June 2025 (UTC)[reply]

Foxtrot620 you make a number of important points about AI risks and the potential utility of AI-specifice risk management tools.

This discussion is much more parochial: do we yet have sufficient independent, reliable sources to support a Wikipedia article on inner alignment. The concern expressed by others is that, no, we don't. The idea may have merit but the scientific community hasn't adequately analyzed yet. Perhaps this will A. B. ^{(talk • contribs • global count)} 21:32, 27 June 2025 (UTC)[reply]

The existence of references is central to whether this article is appropriate to Wikipedia. I personally think the main risk of the technology we call "AI" presently is its massive climate inpact but, reading the article and its discussion of bot map navigation and green arrows, I thought "yeah, this might be the basis for an interesting article." But if the sources don't exist to our standards yet then the article should not exist yet. Simonm223 (talk) 10:30, 28 June 2025 (UTC)[reply]
I left a brief notice of this discussion at Wikipedia talk:WikiProject Artificial Intelligence. --A. B. ^{(talk • contribs • global count)} 03:53, 27 June 2025 (UTC)[reply]
See also Outer alignment, which was sourced to a similar combination of blog posts, forum posts and some guy's web site as this article was, and now has only the Science Reports link. We are seriously lacking in RSes that either of these is a thing outside a WP:FRINGE blog network - David Gerard (talk) 08:20, 27 June 2025 (UTC)[reply]
@David, I have to take exception to the use of "fringe" with this topic. Much of the material on the topic of AI inner- and outer-alignment is self-published on a couple of particular forums and arxiv.org. That doesn't mean this work is fringe. The field is moving very rapidly.

Yes, arxiv.org papers are not peer-reviewed and we don't cite them but other papers have cited one arxiv.org alignment paper 354 times.The contributors to that paper were from Peking University, the University of Cambridge, the University of Oxford, Carnegie Mellon University, Hong Kong University of Science and Technology and the University of Southern California -- hardly fringe-y places.

The two relevant forums are the Alignment Forum and LessWrong. The Alignment Forum restricts posts to a group of selected AI experts. Peer-reviewed AI papers cite specific posts on these forums.

Our guidelines may limit the use of some of this material but that doesn't mean this topic or its community of researchers are fringe. A. B. ^{(talk • contribs • global count)} 03:09, 28 June 2025 (UTC)[reply]
Delete. Wikipedia articles are not for "emerging concepts" but only for topics that "the outside world has already taken notice of". Bishonen | tålk 12:40, 27 June 2025 (UTC).[reply]
Bishonen, the cutting-edge stuff is either published on arXiv.org or else posted on the Alignment Forum. The topic has emerged, though -- there are peer-reviewed papers that show up 1-3 years later. I've added several to the article. A. B. ^{(talk • contribs • global count)} 03:15, 28 June 2025 (UTC)[reply]

Comment - I have added 3 refs to the article that I got from a quick check of the Wikipedia Library:

Li, Kanxue; Zheng, Qi; Zhan, Yibing; Zhang, Chong; Zhang, Tianle; Lin, Xu; Qi, Chongchong; Li, Lusong; Tao, Dapeng (August 2024). "Alleviating Action Hallucination for LLM-based Embodied Agents via Inner and Outer Alignment". 2024 7th International Conference on Pattern Recognition and Artificial Intelligence (PRAI): 613–621. doi:10.1109/PRAI62207.2024.10826957. Accessed via The Wikipedia Library.
Kilian, Kyle A.; Ventura, Christopher J.; Bailey, Mark M. (1 August 2023). "Examining the differential risk from high-level artificial intelligence and the question of control". Futures. 151: 103182. doi:10.1016/j.futures.2023.103182. ISSN 0016-3287. Accessed via The Wikipedia Library.
Hartridge, Samuel; Walker-Munro, Brendan (4 April 2025). "Autonomous Weapons Systems and the ai Alignment Problem". Journal of International Humanitarian Legal Studies. 16 (1). Brill | Nijhoff: 38–65. doi:10.1163/18781527-bja10107. ISSN 1878-1373. Accessed via The Wikipedia Library.

--A. B. ^{(talk • contribs • global count)} 22:00, 27 June 2025 (UTC)[reply]

And yet you did not check them - the third only mentions "inner alignment" in a footnote pointing somewhere else. Please review WP:REFBOMB - David Gerard (talk) 00:35, 28 June 2025 (UTC)[reply]

The third ref discusses alignment in general and is written for less technical people.

David, what's your analysis of the other two references? Thanks, --A. B. ^{(talk • contribs • global count)} 00:57, 28 June 2025 (UTC)[reply]

The third ref is literally not about the article topic! - David Gerard (talk) 08:40, 28 June 2025 (UTC)[reply]

Comment - we have several other AI alignment articles. The main one is AI alignment. There are also mesa-optimization, alignment faking and outer alignment. I'm confident we have enough reliable sources now to establish notability but would we better served combining this and the outer alignment article into the main article which already mentions both. --A. B. ^{(talk • contribs • global count)} 03:38, 28 June 2025 (UTC)[reply]
Comment: I looked through the list of sources referenced by the current version of the article. Here are my thoughts on them:

Melo, Gabriel A.; Máximo, Marcos R. O. A.; Soma, Nei Y.; Castro, Paulo A. L. (4 May 2025). "Machines that halt resolve the undecidability of artificial intelligence alignment". Scientific Reports. 15 (1): 15591. Bibcode:2025NatSR..1515591M. doi:10.1038/s41598-025-99060-2. ISSN 2045-2322. PMC 12050267. PMID 40320467.

This paper considers the inner alignment problem in the context of determining whether an AI model (formalized as a Turing machine) satisfies an arbitrary nontrivial semantic property. They show that this problem is algorithmically undecidable in general, by observing that this is just the statement of Rice's theorem, which has been known for 74 years. Not exactly earthshattering research, but it at least supports the definition of "inner alignment".

Li, Kanxue; Zheng, Qi; Zhan, Yibing; Zhang, Chong; Zhang, Tianle; Lin, Xu; Qi, Chongchong; Li, Lusong; Tao, Dapeng (August 2024). "Alleviating Action Hallucination for LLM-based Embodied Agents via Inner and Outer Alignment". 2024 7th International Conference on Pattern Recognition and Artificial Intelligence (PRAI). pp. 613–621. doi:10.1109/PRAI62207.2024.10826957. ISBN 979-8-3503-5089-0. Retrieved 28 June 2025. Accessed via The Wikipedia Library.

This article seems to use "inner alignment" and "outer alignment" in a very different way from the exposition in Inner alignment.

The widely adopted approach for model alignment follows a two-stage alignment paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL) [29]. However, implementing RL to achieve action space alignment for LLM-based embodied agents in embodied environments presents several challenges [...] To address the above challenges, this paper proposes an innovative alignment method that synergizes inner alignment with outer alignment, as illustrated in Fig. 2. Specifically, in the inner alignment, parameter-efficient fine-tuning (PEFT) methods including Q-Lora [24] and Deepspeed [26] are utilized [...] The second stage is outer alignment, which differs from traditional methods that update model parameters using reinforcement learning. In this stage, a retrieval-augmented generation (RAG) [17] method is employed.

It seems to me that "inner alignment" and "outer alignment" are used here only to signify two separate stages of LLM training. It doesn't obviously have a connection to the topic as defined in the head of the article.

Kilian, Kyle A.; Ventura, Christopher J.; Bailey, Mark M. (1 August 2023). "Examining the differential risk from high-level artificial intelligence and the question of control". Futures. 151: 103182. arXiv:2211.03157. doi:10.1016/j.futures.2023.103182. ISSN 0016-3287. Retrieved 28 June 2025. Accessed via The Wikipedia Library.

This is an article in a futures studies journal. The method of study was via a large-scale survey of researchers at universities, research groups, and leading AI companies, but also "popular AI alignment forums and existential risk conferences". Participants "self-report[ed] on level of expertise". The authors asked participants to assess the likelihood and impacts of various future possibilities including "Inner Alignment" and "AGI"; see the Appendix (arXiv) for the full survey. They then found various correlations among the participants' responses.

Sicari, Sabrina; Cevallos M., Jesus F.; Rizzardi, Alessandra; Coen-Porisini, Alberto (10 December 2024). "Open-Ethical AI: Advancements in Open-Source Human-Centric Neural Language Models". ACM Computing Surveys. 57 (4): 83:1–83:47. doi:10.1145/3703454. ISSN 0360-0300. Retrieved 28 June 2025. Accessed via The Wikipedia Library.

A survey article which mentions "inner alignment" once in the Related Works section: "The survey in [265] focuses instead on the alignment of LLMs, distinguishing between techniques devoted to the correct encoding of alignment goals (outer alignment) and techniques that ensure a robust extrapolation of the encoded goals over OOD scenarios (inner alignment)." (Here [265] is a different survey published on arXiv.)

For comparison, here is the current text of Inner alignment supported by this citation:

Inner alignment as a key element in achieving human-centric AI has been outlined, particularly models that satisfy the "3H" criteria: Helpful, Honest, and Harmless. In this context, inner alignment refers to the reliable generalization of externally defined objectives across novel or adversarial inputs.

A range of techniques to support this goal has been highlighted, including parameter-efficient fine-tuning, interpretability-focused design, robust training, and factuality enhancement. These strategies aim to ensure that models not only learn aligned behavior but also retain and apply it across deployment contexts. Inner alignment is thus viewed as critical to making aligned AI behavior stable and generalizable.

I am not certain how we got all that from this sentence.

Safron, Adam; Sheikhbahaee, Zahra; Hay, Nick; Orchard, Jeff; Hoey, Jesse (2023). "Value Cores for Inner and Outer Alignment: Simulating Personality Formation via Iterated Policy Selection and Preference Learning with Self-World Modeling Active Inference Agents". Active Inference: Third International Workshop, IWAI 2022, Grenoble, France, September 19, 2022, Revised Selected Papers. Springer Nature Switzerland. pp. 343–354. doi:10.1007/978-3-031-28719-0_24. ISBN 978-3-031-28719-0. Retrieved 28 June 2025. Accessed via The Wikipedia Library.

This article mentions "inner alignment" once to define it, and then never mentions it separately from "outer alignment" again. Most of the article reads like total nonsense to me, but I gather that the authors speculate that AI could be designed using analogies to certain biological processes in the brain.

Dung, Leonard (26 October 2023). "Current cases of AI misalignment and their implications for future risks". Synthese (Original research). 202 (5). Springer. doi:10.1007/s11229-023-04367-0. ISSN 0039-7857. Retrieved 26 June 2025. {{cite journal}}: |article= ignored (help)CS1 maint: date and year (link)

Careful not to confuse this with an identically-titled article by "Chris herny, uniy taiwo". In any case ~~as noted by David Gerard~~ (edit: that was about a different article) this article only mentions "inner alignment" once, in a footnote discussing the views of an arXiv paper and an Alignment Forum post.

Here is the current text of Inner alignment supported by this citation:

Case studies highlight inner alignment risks in deployed systems. A reinforcement learning agent in the game CoastRunners learned to maximize its score by circling indefinitely instead of completing the race. Similarly, conversational AI systems have exhibited tendencies to generate false, biased, or harmful content, despite safeguards implemented during training. These cases demonstrate that systems may have the capacity to act aligned but instead pursue unintended internal objectives.

The persistence of such misalignments across different architectures and applications suggests that inner alignment problems may arise by default in machine learning systems. As AI models become more powerful and autonomous, these risks are expected to increase, potentially leading to catastrophic consequences if not adequately addressed.

This seems like WP:SYN to me, since the actual article does not mention inner alignment in connection with these considerations.

Elestrophe (talk) 09:28, 28 June 2025 (UTC)[reply]

This is excellent work here Elestrophe. Simonm223 (talk) 10:35, 28 June 2025 (UTC)[reply]