Mechanistic interpretability


Mechanistic interpretability (often shortened to "Mech Interp" or "MI") is a subfield of interpretability that seeks to reverse-engineer neural networks, which are generally perceived as black boxes, into human-understandable components or "circuits", revealing the causal pathways by which models process information.[1] Objects of study include, but are not limited to, vision models and Transformer-based large language models (LLMs).

History


Chris Olah is generally credited with coining the term "mechanistic interpretability" and spearheading its early development.[2] In the 2018 paper The Building Blocks of Interpretability, Olah (then at Google Brain) and his colleagues combined existing interpretability techniques, including feature visualization, dimensionality reduction, and attribution, with human-computer interface methods to explore the features represented by neurons in the vision model Inception v1. In the March 2020 paper Zoom In: An Introduction to Circuits, Olah and the OpenAI Clarity team described "an approach inspired by neuroscience or cellular biology", hypothesizing that features, like individual cells, are the basis of computation for neural networks and connect to form circuits, which can be understood as "sub-graphs in a network".[3] In this paper, the authors described their line of work as understanding the "mechanistic implementations of neurons in terms of their weights".

In 2021, Chris Olah co-founded the company Anthropic and established its Interpretability team, which publishes its results on the Transformer Circuits Thread.[4] In December 2021, the team published A Mathematical Framework for Transformer Circuits, reverse-engineering toy transformers with one and two attention layers. Notably, they discovered the complete algorithm of induction circuits, which are responsible for in-context learning of repeated token sequences. The team further elaborated on this result in the March 2022 paper In-context Learning and Induction Heads.[5]

Notable results in mechanistic interpretability from 2022 include the theory of superposition wherein a model represents more features than there are directions in its representation space;[6] a mechanistic explanation for grokking, the phenomenon where test-set loss begins to decay only after a delay relative to training-set loss;[7] and the introduction of sparse autoencoders, a sparse dictionary learning method to extract interpretable features from LLMs.[8]

Mechanistic interpretability has garnered significant interest, talent, and funding in the AI safety community. In 2021, Open Philanthropy called for proposals that advanced "mechanistic understanding of neural networks", alongside other projects aimed at reducing risks from advanced AI systems.[9] The interpretability topic prompt in the request for proposals was written by Chris Olah.[10] The ML Alignment & Theory Scholars (MATS) program, a research seminar focused on AI alignment, has historically supported numerous projects in mechanistic interpretability. In its summer 2023 cohort, for example, 20% of the research projects were on mechanistic interpretability.[11]

Many organizations and research groups work on mechanistic interpretability, often with the stated goal of improving AI safety. Max Tegmark runs the Tegmark AI Safety Group at MIT, which focuses on mechanistic interpretability.[12] In February 2023, Neel Nanda started the mechanistic interpretability team at Google DeepMind. Apollo Research, an AI evals organization with a focus on interpretability research, was founded in May 2023.[13] EleutherAI has published multiple papers on interpretability.[14] Goodfire, an AI interpretability startup, was founded in 2024.[15]

In recent years, mechanistic interpretability has expanded in scope, in number of practitioners, and in attention within the ML community. In July 2024, the first ICML Mechanistic Interpretability Workshop was held, aiming to bring together "separate threads of work in industry and academia".[16] In November 2024, Chris Olah discussed mechanistic interpretability on the Lex Fridman podcast as part of the Anthropic team.[17]

Distinction between interpretability and mechanistic interpretability


The term mechanistic interpretability designates both a class of technical methods (explainability methods such as saliency maps are generally not considered mechanistic interpretability research[18]) and a community. Mechanistic interpretability's early development was rooted in the AI safety community, though the term is increasingly adopted in academia. In "Mechanistic?", Saphra and Wiegreffe identify four senses of "mechanistic interpretability":[19]

1. Narrow technical definition: A technical approach to understanding neural networks through their causal mechanisms.

2. Broad technical definition: Any research that describes the internals of a model, including its activations or weights.

3. Narrow cultural definition: Any research originating from the MI community.

4. Broad cultural definition: Any research in the field of AI interpretability, especially language model interpretability.

As the scope and popular recognition of mechanistic interpretability have increased, many have come to recognize that other communities, such as natural language processing researchers, have pursued similar objectives in their work.

Key concepts


Linear representation


The latent space of an LLM is generally conceptualized as a vector space.[20] There is a growing body of work around the linear representation hypothesis, which posits that high-level features correspond to approximately linear directions in a model's activation space.[21][22]
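
Under this hypothesis, the strength of a feature at a given token can be approximated by projecting the activation vector onto a fixed direction. The following minimal sketch (NumPy, with randomly generated activations and a hypothetical feature direction standing in for quantities extracted from a real model) illustrates the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residual-stream activations: 4 token positions, hidden size 8.
activations = rng.standard_normal((4, 8))

# Hypothetical unit-norm direction assumed to encode some high-level feature.
feature_direction = rng.standard_normal(8)
feature_direction /= np.linalg.norm(feature_direction)

# Under the linear representation hypothesis, the feature's strength at each
# position is approximately the dot product of the activation with the direction.
feature_scores = activations @ feature_direction
print(feature_scores)  # one scalar "feature reading" per token position
```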

Superposition


Superposition is the phenomenon where many unrelated features are "packed" into the same subspace or even into single neurons, making a network highly over-complete yet still linearly decodable after nonlinear filtering.[23] Recent formal analysis links the amount of polysemanticity to feature "capacity" and input sparsity, predicting when neurons become monosemantic or remain polysemantic.[24]
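
A toy illustration of the core geometric point, in the spirit of the toy-model setting:[6] when there are more features than dimensions, the features cannot each have a dedicated orthogonal direction, so they are stored as nearly (but not exactly) orthogonal directions that interfere with one another. This NumPy sketch uses made-up sizes and random directions rather than directions recovered from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_dims = 50, 10  # more features than dimensions in the space
# Assign each feature a random unit direction in the low-dimensional space.
directions = rng.standard_normal((n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Pairwise interference: these off-diagonal dot products would all be zero
# if every feature had its own orthogonal dimension; here they are merely small.
overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)
print("mean |overlap|:", np.abs(overlaps).mean())
print("max  |overlap|:", np.abs(overlaps).max())
```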

Methods


Probing


Probing involves training a linear classifier on model activations to test whether a feature is linearly decodable at a given layer or subset of neurons.[25] Generally, a linear probe is trained on a labelled dataset encoding the desired feature. While linear probes are popular amongst mechanistic interpretability researchers, the technique dates back to 2016 and has been widely used in the NLP community.[26]
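
A minimal probing sketch using scikit-learn, with synthetic activations and labels standing in for activations cached from a real model and a real labelled dataset: a logistic-regression probe is fit on one layer's activations, and its held-out accuracy indicates how linearly decodable the feature is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: activations from one layer (1000 examples, hidden
# size 64) and binary labels for the feature of interest.
X = rng.standard_normal((1000, 64))
planted_direction = rng.standard_normal(64)  # direction that encodes the feature
y = (X @ planted_direction + 0.5 * rng.standard_normal(1000)) > 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```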

In 2023, Nanda et al. showed that world-model features such as in-context truth values emerge as linearly decodable directions early in training, strengthening the case for linear probes as faithful monitors of internal state.[27]

Difference-in-Means


Difference-in-Means, or Diff-in-Means, constructs a steering vector by subtracting the mean activation for one class of examples from the mean for another. Unlike learned probes, Diff-in-Means has no trainable parameters and often generalises better out-of-distribution.[28] Diff-in-Means has been used to isolate model representations of refusal/compliance, true/false statements, and sentiment.[29][30][31]
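
A minimal sketch of the construction (NumPy, with synthetic activation matrices standing in for activations collected from a model on two contrasting sets of prompts): the direction is simply the difference of the two class means, with no parameters to train.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations at one layer for two contrasting classes of prompts
# (e.g. refusal vs. compliance), hidden size 64. The small offset added to
# class A plants a separating direction purely for illustration.
acts_class_a = rng.standard_normal((200, 64)) + 0.3
acts_class_b = rng.standard_normal((200, 64))

# Difference-in-means steering vector: no trainable parameters.
diff_in_means = acts_class_a.mean(axis=0) - acts_class_b.mean(axis=0)
steering_vector = diff_in_means / np.linalg.norm(diff_in_means)
```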

Steering


Steering adds or subtracts a direction (often obtained via probing, Diff-in-Means, or K-means) from the residual stream to causally change model behaviour.
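
A sketch of steering with a PyTorch forward hook. The module path model.layers[10] and the precomputed steering_vector (such as the one from the sketch above) are hypothetical placeholders, and the hook assumes the hooked module returns the residual-stream tensor directly (real model implementations may return tuples, which would need unpacking).

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, scale: float = 5.0):
    # Returns a forward hook that shifts the module's output along the given
    # direction; a negative scale would instead suppress the feature.
    def hook(module, inputs, output):
        return output + scale * steering_vector
    return hook

# Hypothetical usage, with model.layers[10] standing in for the module whose
# output is the residual stream at the chosen layer:
# handle = model.layers[10].register_forward_hook(make_steering_hook(steering_vector))
# ...generate text with the hook active...
# handle.remove()
```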

Activation/attribution patching


Activation patching replaces or ablates the activations of model components or neurons to trace which parts of the model are causally necessary for a behaviour. Automated Circuit Discovery (ACDC) prunes the computational graph by iteratively patch-testing edges, localising minimal sub-circuits without manual effort.[32]
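
A sketch of activation patching with PyTorch forward hooks. The module path model.layers[5], the clean_tokens and corrupted_tokens inputs, and the assumption that the hooked module returns a plain tensor are all hypothetical placeholders; the point is that an activation cached on a clean run is substituted into the corresponding position of a corrupted run, and the resulting change in output measures that component's causal contribution.

```python
import torch

cache = {}

def save_hook(module, inputs, output):
    # Record this module's activation during the clean run.
    cache["clean"] = output.detach()

def patch_hook(module, inputs, output):
    # During the corrupted run, replace the activation with the cached clean one.
    return cache["clean"]

# Hypothetical usage:
# h = model.layers[5].register_forward_hook(save_hook)
# _ = model(clean_tokens); h.remove()
# h = model.layers[5].register_forward_hook(patch_hook)
# patched_logits = model(corrupted_tokens); h.remove()
# Comparing patched_logits with the unpatched corrupted-run logits indicates
# how much of the behaviour this component accounts for.
```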

Sparse autoencoders (SAEs) and dictionary learning


SAEs learn a sparse, over-complete basis that reconstructs residual-stream activations; coefficients often activate monosemantically for human-interpretable concepts (e.g. “\newline”, “Latin-alphabet”). Scaling experiments recover thousands of such features in GPT-class models, hinting at a parts-list for larger systems.[33]
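
A minimal sparse-autoencoder sketch in PyTorch, with made-up sizes and random data in place of real residual-stream activations: activations are encoded into an over-complete set of non-negative coefficients, a sparsity penalty (here a simple L1 term) pushes most coefficients to zero, and each decoder row can then be inspected as a candidate feature direction.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # over-complete: d_dict >> d_model
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))     # sparse, non-negative coefficients
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                        # stand-in for residual-stream activations
reconstruction, features = sae(acts)
loss = ((reconstruction - acts) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()                                    # gradients for one training step
```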

Circuit tracing and automated graph discovery


Circuit tracing substitutes parts of the model, in particular the MLP blocks, with more interpretable components called "transcoders", with the goal of recovering explicit computational graphs. Like SAEs, transcoders rely on sparse dictionary learning; instead of reconstructing model activations, however, a transcoder is trained to predict the output of a non-linear component given its input. The technique was introduced in the paper "Circuit Tracing: Revealing Computational Graphs in Language Models", published by Anthropic in March 2025.[34] Circuit tracing has been used to study how a model plans the rhyme in a poem, performs medical diagnosis, and produces unfaithful chain-of-thought reasoning.[35]
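
A minimal transcoder sketch in PyTorch, with made-up sizes and random tensors in place of real MLP inputs and outputs. Unlike an SAE, which reconstructs the same activations it reads, the transcoder is trained to predict an MLP block's output from the block's input through a sparse bottleneck, so that the opaque MLP can later be replaced by this more interpretable stand-in when tracing a circuit.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, mlp_input):
        features = torch.relu(self.encoder(mlp_input))  # sparse feature coefficients
        return self.decoder(features), features         # prediction of the MLP's output

transcoder = Transcoder()
mlp_in = torch.randn(64, 512)    # stand-in for the MLP block's inputs
mlp_out = torch.randn(64, 512)   # stand-in for the MLP block's outputs to predict
prediction, features = transcoder(mlp_in)
loss = ((prediction - mlp_out) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```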

Critique


Critiques of mechanistic interpretability can be roughly divided into two categories: critiques of the real-world impact of interpretability, and critiques of the mechanistic approach relative to other approaches.

Critics have argued that current circuits-level analyses generalise poorly and risk giving false confidence about alignment, citing cherry-picking, small-scale demos, and the absence of worst-case guarantees.[36][37] In response to Dario Amodei's blog post "The Urgency of Interpretability", which claimed that researchers are on track to achieve MI's goal of developing "the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model", Neel Nanda, research lead of the Google DeepMind mechanistic interpretability team, argued that mechanistic interpretability will never act as a highly reliable monitor for safety-relevant features, such as deception.[38][39] In particular, Nanda highlighted that only a small portion of the model seems to be interpretable and that current methods cannot guarantee the absence of particular features or circuits of interest.

Some have framed mechanistic interpretability as "bottom-up interpretability", where the emphasis is on neuron-level circuits, in contrast to "top-down interpretability", which focuses on emergent high-level concepts.[40] They argue that LLMs are complex systems whose high-level concepts cannot be simulated, predicted, or understood through low-level components alone. As one critic puts it, "If you wanted to explain why England won World War II using particle physics, you would just be on the wrong track."[41]

References

  1. ^ "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases". transformer-circuits.pub. Retrieved 2025-05-03.
  2. ^ Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?". arXiv:2410.09087 [cs.AI].
  3. ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020-03-10). "Zoom In: An Introduction to Circuits". Distill. 5 (3). doi:10.23915/distill.00024.001. ISSN 2476-0757.
  4. ^ "Transformer Circuits Thread". transformer-circuits.pub. Retrieved 2025-05-12.
  5. ^ Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; Drain, Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Johnston, Scott; Jones, Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, Dario; Brown, Tom; Clark, Jack; Kaplan, Jared; McCandlish, Sam; Olah, Chris (2022). "In-context Learning and Induction Heads". arXiv:2209.11895 [cs.LG].
  6. ^ Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; Olah, Christopher (2022). "Toy Models of Superposition". arXiv:2209.10652 [cs.LG].
  7. ^ Nanda, Neel; Chan, Lawrence; Lieberum, Tom; Smith, Jess; Steinhardt, Jacob (2023). "Progress measures for grokking via mechanistic interpretability". arXiv:2301.05217 [cs.LG].
  8. ^ Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; et al.; Olah, C. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning". Transformer Circuits Thread.
  9. ^ "Request for proposals for projects in AI alignment that work with deep learning systems". Open Philanthropy. Retrieved 2025-05-12.
  10. ^ "Interpretability". Alignment Forum. 2021-10-29.
  11. ^ Gil, Juan; Kidd, Ryan; Smith, Christian (December 1, 2023). "MATS Summer 2023 Retrospective".
  12. ^ "Tegmark Group". tegmark.org. Retrieved 2025-05-12.
  13. ^ Hobbhahn, Marius; Millidge, Beren; Sharkey, Lee; Bushnaq, Lucius; Braun, Dan; Balesni, Mikita; Scheurer, Jérémy (2023-05-30). "Announcing Apollo Research".
  14. ^ "Interpretability". EleutherAI. 2024-02-06. Retrieved 2025-05-12.
  15. ^ Sullivan, Mark (2025-04-22). "This startup wants to reprogram the mind of AI—and just got $50 million to do it". Fast Company. Archived from the original on 2025-05-04. Retrieved 2025-05-12.
  16. ^ "ICML 2024 Mechanistic Interpretability Workshop". icml2024mi.pages.dev. Retrieved 2025-05-12.
  17. ^ "Mechanistic Interpretability explained – Chris Olah and Lex Fridman". YouTube. 14 November 2024. Retrieved 2025-05-03.
  18. ^ "Mechanistic Interpretability explained – Chris Olah and Lex Fridman". YouTube. 14 November 2024. Retrieved 2025-05-03.
  19. ^ Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?". arXiv:2410.09087 [cs.AI].
  20. ^ Elhage, Nelson; Nanda, Neel; Olsson, Catherine; et al. (2021-12-22). "A Mathematical Framework for Transformer Circuits". Transformer Circuits Thread. Retrieved 2025-05-30.
  21. ^ Park, Kiho; Choe, Yo Joong; Veitch, Victor (2024-07-17), The Linear Representation Hypothesis and the Geometry of Large Language Models, arXiv:2311.03658
  22. ^ Elhage, Nelson (2021-06-21), A Mathematical Framework for Transformer Circuits
  23. ^ Elhage, Nelson (2022-03-10), Toy Models of Superposition
  24. ^ Scherlis, Adam (2025-03-25), Polysemanticity and Capacity in Neural Networks, arXiv:2210.01892
  25. ^ Bereska, Leonard; Gavves, Efstratios (2024-08-23), Mechanistic Interpretability for AI Safety -- A Review, arXiv, doi:10.48550/arXiv.2404.14082, arXiv:2404.14082, retrieved 2025-05-28
  26. ^ Alain, Guillaume; Bengio, Yoshua (2018-11-22), Understanding intermediate layers using linear classifier probes, arXiv, doi:10.48550/arXiv.1610.01644, arXiv:1610.01644, retrieved 2025-05-30
  27. ^ Nanda, Neel (2023-09-07), Emergent Linear Representations in World Models of Self-Supervised Sequence Models, arXiv:2309.00941
  28. ^ Marks, Samuel; Tegmark, Max (2024-08-19), The Geometry of Truth: Emergent Linear Structure in LLM Representations, arXiv:2310.06824
  29. ^ Arditi, Andy; Obeso, Oscar; Syed, Aaquib; Paleka, Daniel; Panickssery, Nina; Gurnee, Wes; Nanda, Neel (2024-10-30), Refusal in Language Models Is Mediated by a Single Direction, arXiv, doi:10.48550/arXiv.2406.11717, arXiv:2406.11717, retrieved 2025-05-31
  30. ^ Marks, Samuel; Tegmark, Max (2024-08-19), The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets, arXiv, doi:10.48550/arXiv.2310.06824, arXiv:2310.06824, retrieved 2025-05-31
  31. ^ Tigges, Curt; Hollinsworth, Oskar John; Geiger, Atticus; Nanda, Neel (2023-10-23), Linear Representations of Sentiment in Large Language Models, arXiv, doi:10.48550/arXiv.2310.15154, arXiv:2310.15154, retrieved 2025-05-31
  32. ^ Conmy, Arthur (2023-04-28), Towards Automated Circuit Discovery for Mechanistic Interpretability, arXiv:2304.14997
  33. ^ Olsson, Catherine (2023-11-01), Decomposing Language Models with Dictionary Learning
  34. ^ Ameisen, Emmanuel; et al. (2025-03-27). "Circuit Tracing: Revealing Computational Graphs in Language Models". Transformer Circuits. Anthropic. https://transformer-circuits.pub/2025/attribution-graphs/methods.html. Retrieved 2025-05-29.
  35. ^ Lindsey, Jack; et al. (2025-03-27). "On the Biology of a Large Language Model". Transformer Circuits. Anthropic. https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Retrieved 2025-05-29.
  36. ^ Casper, Stephen (2023-02-17), EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
  37. ^ Segerie, Charbel-Raphael (17 August 2023). "Against Almost Every Theory of Impact of Interpretability". AI Alignment Forum. Retrieved 31 May 2025.
  38. ^ "Dario Amodei — The Urgency of Interpretability". www.darioamodei.com. Retrieved 2025-05-29.
  39. ^ Nanda, Neel (2025-05-04). "Interpretability Will Not Reliably Find Deceptive AI". AI Alignment Forum. Retrieved 2025-05-30.
  40. ^ Hendrycks, Dan; Hiscott, Laura (May 15, 2025). "The Misguided Quest for Mechanistic AI Interpretability". AI Frontiers. Retrieved May 31, 2025.
  41. ^ Patel, Dwarkesh (May 22, 2025). "How Does Claude 4 Think? — Sholto Douglas & Trenton Bricken". Dwarkesh Podcast. Retrieved May 31, 2025.