Talk:Mechanistic interpretability

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by AryamanA (talk | contribs) at 20:08, 19 August 2025 (Bad sourcing, COI editing: Reply). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Creating the Mechanistic Interpretability article

I started this page after consulting with the meta-page XAI/Interpretability in Talk:Explainable artificial intelligence#Mechanistic Interpretability. I believe that mech interp is a sufficiently distinct movement and field of study, with a growing body of work (see main article), that it warrants a separate page. I plan to continuously improve and maintain this page, and will respond to feedback. Thank you! JoNeedsSleep (talk) 03:28, 12 May 2025 (UTC)

Requested move 12 May 2025

The following is a closed discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. Editors desiring to contest the closing decision should consider a move review after discussing it on the closer's talk page. No further edits should be made to this discussion.

The result of the move request was: Speedy close as moved; agreed to be a non-controversial "technical" request. (closed by non-admin page mover) —⁠ ⁠BarrelProof (talk) 21:22, 12 May 2025 (UTC)[reply]


Mechanistic Interpretability → Mechanistic interpretability – titles should be in sentence case (see MOS:TITLECAPS). Alenoach (talk) 08:03, 12 May 2025 (UTC)

Agree. I don't see why this would be controversial. WeyerStudentOfAgrippa (talk) 14:32, 12 May 2025 (UTC)
Yes, it's not controversial indeed, but there is already a redirect page named "Mechanistic interpretability", so renaming it requires special permissions. Alenoach (talk) 14:58, 12 May 2025 (UTC)
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Wiki Education assignment: Computation, Culture, and Society

This article was the subject of a Wiki Education Foundation-supported course assignment, between 25 March 2025 and 31 May 2025. Further details are available on the course page. Student editor(s): JoNeedsSleep (article contribs).

— Assignment last updated by Laurelli7 (talk) 19:03, 30 May 2025 (UTC)[reply]

Just curious

@JoNeedsSleep: Glad to see this page exists. Just curious, was any of the content taken from User:AryamanA/Draft:Mechanistic interpretability? (Don't really care if it was or wasn't, just glad if it was useful!) AryamanA (talk, contribs) 01:34, 1 July 2025 (UTC)

Hi Aryaman, appreciate the message! I didn’t see this sandbox page when I created the page; it would’ve been very helpful. Just had a look at your draft: your techniques section, especially the extensive entry on SAEs and their evals, would be a great addition to the current methods section. JoNeedsSleep (talk) 17:07, 1 July 2025 (UTC)
Awesome, yes I'll be contributing to the page as I have time. I've merged in the new stuff I had (mainly SAEs). AryamanA (talk, contribs) 22:07, 1 July 2025 (UTC)

Bad sourcing, COI editing

This article seems substantially cited to primary sources and unreliable sources such as arXiv preprints, and there's a call to action on an external site from the people who seem to have coined a lot of this stuff. At best it may be OR if not just straight up COI.

What's the coverage of the topic like in RSes from people who are not directly involved? - David Gerard (talk) 22:53, 13 August 2025 (UTC)

I am one of the main contributors, a current Ph.D. student at Stanford working on mechanistic interpretability and not involved in the LessWrong/AI safety sphere. I was working on this article before the call above. I have spotlight papers at NeurIPS and ICML on these topics. I intended to work on this article as a sort of survey of the field, to collate important works I frequently refer to. I do not appreciate the removal of the citations that I had added. Yes, some of these are arXiv preprints, blog posts, or LessWrong posts, but the ones I added are widely used in the field. What are the appropriate guidelines on citations for scientific sources? I am not sure about the removal of my contributions by User:Stepwise Continuous Dysfunction and would at the very least like a reference to the justifying guidelines. I do not believe there are many contributors qualified to work on this article, and it would be great to have a constructive exchange, since I hope to continue working on this article in a conducive environment. AryamanA (talk, contribs) 20:08, 19 August 2025 (UTC)