
User:AryamanA/Draft:Mechanistic interpretability


Mechanistic interpretability is a subfield of research within explainable artificial intelligence which seeks to fully reverse-engineer neural networks (akin to reverse-engineering a compiled binary of a computer program), with the ultimate goal of understanding the mechanisms underlying their computations.[1][2][3] A key notion shared by much work in the area is that neural networks are composed of circuits which operate on atomic features that are represented as linear directions in activation space.[2] Features need not be basis-aligned, leading to polysemanticity in neurons and superposition of features when they are too numerous relative to the available neurons.[4] New methods in the field are thus designed to recover circuits and features despite these difficulties; these approaches include sparse autoencoders and transcoders, causal interventions, and activation steering.

The field of mechanistic interpretability arose from early investigations of feature visualisation in vision models by Chris Olah and collaborators at Google Brain. By 2020, Olah had joined OpenAI as head of the Clarity team,[5] which worked on finding circuits in vision models and published its findings in Distill. In 2021, much of the team moved to the newly founded Anthropic and shifted focus to reverse-engineering transformer language models, documenting its work in the Transformer Circuits Thread. Since then, mechanistic interpretability has become a growing area of research within AI safety, conducted by teams at frontier research labs including Anthropic,[6] OpenAI, and Google DeepMind,[7] a variety of smaller research organisations and startups, and numerous academic machine learning departments, including at Harvard University.[8]

Name and definitions


Chris Olah is usually credited with coining the term "mechanistic interpretability". His motivation was to differentiate this nascent approach to interpretability from the established saliency-map-based approaches that dominated computer vision at the time.[9]

In-field explanations of the goal of mechanistic interpretability make an analogy to reverse-engineering computer programs,[3][10] the argument being that, rather than being arbitrary functions, neural networks are composed of independent, reverse-engineerable mechanisms that are compressed into the weights.

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

— Chris Olah, "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases"[1] [emphasis added]

One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining the weights and activations of neural networks to identify circuits [Cammarata et al., 2020, Elhage et al., 2021] that implement particular behaviors.

— Mechanistic Interpretability Workshop 2024[11] [emphasis added]

However, the lack of strict definitional constraints beyond the broad goal of reverse-engineering leads Naomi Saphra and Sarah Wiegreffe to argue that mechanistic interpretability is primarily a cultural grouping of researchers largely motivated by AI safety and from a mainly machine learning background (as opposed to natural language processing).[12]

History


InceptionV1 and the Distill Circuits thread (2020–2021)


Early work on transformer language models (2021–2022)


Causal interventions and circuit discovery (2022–)


Polysemanticity, superposition, and sparse autoencoders (2022–)


Concepts


Features and circuits


Universality


Polysemanticity and superposition


Linear representation hypothesis


Techniques


Sparse decomposition methods


Sparse autoencoders (SAEs)


Sparse autoencoders (SAEs) for mechanistic interpretability were proposed in order to address the superposition problem by decomposing the feature space into an overcomplete basis (i.e. with more features than dimensions) of monosemantic concepts, guided by the intuition that features can only be manipulated under superposition if they are sparsely activated (otherwise, interference between features would be too high).[13][14]

Given a vector \(x \in \mathbb{R}^{d}\) representing an activation collected from some model component (in a transformer, usually the MLP inner activation or the residual stream), the sparse autoencoder computes the following:
\[ z = \operatorname{ReLU}(W_{\text{enc}} x + b_{\text{enc}}), \qquad \hat{x} = W_{\text{dec}} z + b_{\text{dec}}. \]
Here, \(W_{\text{enc}} \in \mathbb{R}^{m \times d}\) projects the activation into an \(m\)-dimensional latent space (with \(m > d\)), \(\operatorname{ReLU}\) applies the nonlinearity, and finally the decoder \(W_{\text{dec}} \in \mathbb{R}^{d \times m}\) aims to reconstruct the original activation from this latent representation. The bias terms are \(b_{\text{enc}}\) and \(b_{\text{dec}}\); the latter is omitted in some formulations. The encoder and decoder matrices may also be tied, i.e. \(W_{\text{dec}} = W_{\text{enc}}^{\top}\).

Given a dataset of activations \(x\), the SAE is trained with gradient descent to minimise the following loss function:
\[ \mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1, \]
where the first term is the reconstruction loss (i.e. the standard autoencoding objective) and the second is a sparsity loss on the latent representation \(z\), which aims to minimise its \(\ell_1\)-norm, weighted by a coefficient \(\lambda\).
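
The following is a minimal sketch of such a sparse autoencoder in PyTorch, written to mirror the equations above; the dimensions and the sparsity coefficient are illustrative assumptions rather than values from any particular paper.

```python
# Minimal sparse autoencoder sketch (illustrative; dimensions and
# hyperparameters are assumptions, not values from a specific paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        # Overcomplete dictionary: d_latent is typically much larger than d_model.
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))   # sparse latent representation
        x_hat = self.decoder(z)       # reconstruction of the activation
        return z, x_hat

def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, z: torch.Tensor,
             sparsity_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction (MSE) term plus an L1 sparsity penalty on the latents.
    reconstruction = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return reconstruction + sparsity_coeff * sparsity

# Example usage on a batch of residual-stream activations:
# sae = SparseAutoencoder(d_model=768, d_latent=16384)
# z, x_hat = sae(activations)
# loss = sae_loss(activations, x_hat, z)
```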

Alternative designs

Several works motivate alternative nonlinearities to ReLU based on improved downstream performance or training stability; both alternatives listed below are sketched in code after the list.

  • TopK, which keeps only the top-\(k\) activating latents and zeroes out the rest; this also allows the sparsity loss to be dropped entirely.[15]
  • JumpReLU, defined as \(\operatorname{JumpReLU}_{\theta}(z) = z \odot H(z - \theta)\), where \(H\) is the Heaviside step function and \(\theta\) is a learnable threshold.[16]
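
The two activation functions can be sketched as follows, applied to the encoder's pre-activations; the value of \(k\) and the per-latent threshold are placeholders, and the gradient estimator needed to train the JumpReLU threshold is omitted.

```python
# Illustrative sketches of the TopK and JumpReLU activations (assumption:
# `pre_acts` are encoder pre-activations of shape (batch, d_latent)).
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k largest pre-activations per example; zero out the rest.
    values, indices = torch.topk(pre_acts, k, dim=-1)
    return torch.zeros_like(pre_acts).scatter_(-1, indices, values)

def jumprelu_activation(pre_acts: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # z * H(z - theta): pass a value through only where it exceeds the threshold.
    # Training the per-latent threshold requires a straight-through-style
    # gradient estimator, which this sketch omits.
    return pre_acts * (pre_acts > theta).to(pre_acts.dtype)
```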
Evaluation

The core metrics for evaluating SAEs are sparsity, measured by the \(\ell_0\)-norm of the latent representations over the dataset, and fidelity, which may be the MSE reconstruction error as in the loss function or a downstream metric obtained by substituting the SAE output back into the model, such as the loss recovered or the KL-divergence from the original model's behaviour.[17]
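
As a sketch, the two core metrics might be computed from a batch of original and reconstructed activations as follows; the tensor names are assumptions.

```python
# Sketch of the core SAE evaluation metrics (sparsity and reconstruction
# fidelity); `z`, `x`, and `x_hat` are assumed to be batched tensors.
import torch

def l0_sparsity(z: torch.Tensor) -> float:
    # Average number of non-zero latents per activation (the "L0 norm").
    return (z != 0).float().sum(dim=-1).mean().item()

def reconstruction_mse(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    # Fidelity measured as mean squared reconstruction error; downstream
    # metrics (loss recovered, KL-divergence) additionally require running
    # the model with the SAE output spliced in.
    return (x - x_hat).pow(2).mean().item()
```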

SAE latents are usually labelled using an autointerpretability pipeline. Most such pipelines feed highly-activating dataset exemplars (i.e. inputs whose activations \(x\) yield a large latent value \(z_i\) for feature \(i\), repeated over all features) to a large language model, which generates a natural-language description based on the contexts in which the latent is active.

Early works directly adapt the neuron-labelling and evaluation pipeline of Bills et al. (2023)[18] and report higher interpretability scores than alternative methods (the standard basis, PCA, etc.).[13][14] However, such scores can be misleading, since explanations achieve high recall but usually low precision; this has led later works to introduce more nuanced evaluation metrics, including neuron-to-graph explanations (or other approaches) that report both precision and recall,[15] and intervention-based metrics that measure the downstream effect of manipulating a latent feature.[19][20]
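
A sketch of the exemplar-selection step of such a pipeline is given below; the prompt construction and LLM-labelling steps are omitted, and the function and tensor names are illustrative.

```python
# Sketch of selecting top-activating dataset exemplars for each SAE latent
# (the input to an autointerpretability labelling prompt). Names are illustrative.
import torch

def top_activating_positions(latents: torch.Tensor, n: int = 20):
    """latents: (num_positions, d_latent) SAE activations over a token dataset.
    Returns, for each latent, the n dataset positions where it fires most strongly."""
    exemplars = {}
    for i in range(latents.shape[-1]):
        values, positions = torch.topk(latents[:, i], n)
        # The surrounding token context of each position would be formatted
        # into a prompt for the labelling language model.
        exemplars[i] = [(int(p), float(v)) for p, v in zip(positions, values)]
    return exemplars
```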

Transcoders


Transcoders are formulated identically to SAEs, with the caveat that they seek to approximate the input-output behaviour of a model component (usually the MLP).[21] This is useful for measuring how latent features in different layers of the model affect each other in an input-invariant manner (i.e. by directly comparing encoder and decoder weights). A transcoder thus computes the following:
\[ z = \operatorname{ReLU}(W_{\text{enc}} x + b_{\text{enc}}), \qquad \hat{y} = W_{\text{dec}} z + b_{\text{dec}}, \]
where \(\hat{y}\) approximates the output \(y\) of the component on input \(x\), and is trained to minimise the loss:
\[ \mathcal{L}(x) = \lVert y - \hat{y} \rVert_2^2 + \lambda \lVert z \rVert_1. \]
When ignoring or holding attention components constant (which may obscure some information), transcoders trained on different layers of a model can then be used to conduct circuit analysis without having to process individual inputs and collect latent activations, unlike SAEs.
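
A minimal transcoder sketch is shown below, mirroring the SAE sketch above but trained against the component's output; the assumed dataset of (input, output) activation pairs and the hyperparameters are illustrative.

```python
# Minimal transcoder sketch: same architecture as an SAE, but the target is
# the component's output rather than its input (names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, component_input: torch.Tensor):
        z = F.relu(self.encoder(component_input))
        y_hat = self.decoder(z)   # approximates the component's *output*
        return z, y_hat

def transcoder_loss(component_output: torch.Tensor, y_hat: torch.Tensor,
                    z: torch.Tensor, sparsity_coeff: float = 1e-3) -> torch.Tensor:
    # Same form as the SAE loss, but reconstructing the output y instead of x.
    reconstruction = (component_output - y_hat).pow(2).sum(dim=-1).mean()
    return reconstruction + sparsity_coeff * z.abs().sum(dim=-1).mean()
```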

Transcoders generally outperform SAEs, achieving lower loss and better automated interpretability scores.[22][21]

Sparse crosscoders


A disadvantage of single-layer SAEs and transcoders is that they produce duplicate features when trained on multiple layers, if those features persist throughout the residual stream. This complicates understanding layer-to-layer feature propagation and also wastes latent parameters. Crosscoders were introduced to enable cross-layer representation of features, which minimises these issues.[23]

A crosscoder computes the cross-layer latent representation \(z\) using a set of layer-wise activations \(x^{1}, \ldots, x^{L}\) over \(L\) layers obtained from some input as follows:
\[ z = \operatorname{ReLU}\!\left( \sum_{l=1}^{L} W_{\text{enc}}^{l} x^{l} + b_{\text{enc}} \right). \]
The reconstruction is done independently for each layer using this cross-layer representation:
\[ \hat{x}^{l} = W_{\text{dec}}^{l} z + b_{\text{dec}}^{l}. \]
Alternatively, the targets may be layer-wise component outputs if using the transcoder objective. The model is then trained to minimise a loss of the form:
\[ \mathcal{L}(x) = \sum_{l=1}^{L} \lVert x^{l} - \hat{x}^{l} \rVert_2^2 + \lambda \sum_{i} z_i \sum_{l=1}^{L} \lVert W_{\text{dec}}^{l,i} \rVert_2. \]
Note that the regularisation term weights each latent by the \(\ell_1\)-norm of its vector of per-layer decoder norms; the \(\ell_2\)-norm is an alternative choice, considered but not used in the original paper.[23]
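
The crosscoder equations above can be sketched as follows; the number of layers, the dimensions, and the exact form of the decoder-norm weighting follow the reconstruction above and should be treated as illustrative rather than as the original paper's implementation.

```python
# Minimal crosscoder sketch: a shared latent computed from all layers'
# activations, with per-layer decoders (layer count and dims are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Crosscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, d_latent: int):
        super().__init__()
        # One encoder and one decoder per layer, sharing a single latent space.
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, d_latent, bias=False) for _ in range(n_layers))
        self.enc_bias = nn.Parameter(torch.zeros(d_latent))
        self.decoders = nn.ModuleList(
            nn.Linear(d_latent, d_model) for _ in range(n_layers))

    def forward(self, xs: list):
        # Sum the per-layer encoder projections, then apply the shared ReLU.
        z = F.relu(sum(enc(x) for enc, x in zip(self.encoders, xs)) + self.enc_bias)
        # Reconstruct each layer's activation independently from the shared latent.
        x_hats = [dec(z) for dec in self.decoders]
        return z, x_hats

def crosscoder_loss(xs, x_hats, z, model: Crosscoder,
                    sparsity_coeff: float = 1e-3) -> torch.Tensor:
    # Per-layer reconstruction errors, summed over layers.
    reconstruction = sum((x - xh).pow(2).sum(dim=-1).mean()
                         for x, xh in zip(xs, x_hats))
    # Weight each latent's activity by the sum of its per-layer decoder norms.
    decoder_norms = torch.stack(
        [dec.weight.norm(dim=0) for dec in model.decoders]).sum(dim=0)
    sparsity = (z * decoder_norms).sum(dim=-1).mean()
    return reconstruction + sparsity_coeff * sparsity
```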

Causal interventions


Activation steering and top-down interpretability


References

  1. ^ a b Olah, Chris (June 2022). "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases". Transformer Circuits Thread. Anthropic. Retrieved 28 March 2025.
  2. ^ a b Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). "Zoom In: An Introduction to Circuits". Distill. doi:10.23915/distill.00024.001.
  3. ^ a b Elhage, Nelson; et al. (2021). "A Mathematical Framework for Transformer Circuits". Transformer Circuits Thread. Anthropic.
  4. ^ Elhage, Nelson; et al. (2022). "Toy Models of Superposition". Transformer Circuits Thread. Anthropic.
  5. ^ Olah, Chris [@ch402] (April 16, 2020). "OpenAI has been *incredibly* supportive! We have the Clarity team (@ludwigschubert @shancarter @gabeeegoooh @nicklovescode and myself) as a full time effort to "reverse engineer" neural networks" (Tweet) – via Twitter.
  6. ^ Amodei, Dario (April 2025). "The Urgency of Interpretability".
  7. ^ Mulligan, Scott J. (14 November 2024). "Google DeepMind has a new way to look inside an AI's "mind"". MIT Technology Review. Retrieved 28 March 2025.
  8. ^ "Insight + Interaction Lab". Insight + Interaction Lab. Harvard University. Retrieved 28 March 2025.
  9. ^ Olah, Chris [@ch402] (July 29, 2024). "I was motivated by many of my colleagues at Google Brain being deeply skeptical of things like saliency maps. When I started the OpenAI interpretability team, I used it to distinguish our goal: understand how the weights of a neural network map to algorithms" (Tweet) – via Twitter.
  10. ^ Nanda, Neel (January 31, 2023). "Mechanistic Interpretability Quickstart Guide". Neel Nanda. Retrieved 28 March 2025.
  11. ^ "Mechanistic Interpretability Workshop 2024". 2024.
  12. ^ Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?". Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics: 480–498. Retrieved 28 March 2025.
  13. ^ a b Cunningham, Hoagy; Ewart, Aidan; Riggs, Logan; Huben, Robert; Sharkey, Lee (May 7–11, 2024). "Sparse Autoencoders Find Highly Interpretable Features in Language Models". The Twelfth International Conference on Learning Representations (ICLR 2024). Vienna, Austria: OpenReview.net. Retrieved 2025-04-29.
  14. ^ a b Bricken, Trenton; et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning". Transformer Circuits Thread. Retrieved 2025-04-29.
  15. ^ a b Gao, Leo; et al. (2024). "Scaling and evaluating sparse autoencoders". arXiv:2406.04093.
  16. ^ Rajamanoharan, Senthooran; et al. (2024). "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders". arXiv:2407.14435.
  17. ^ Karvonen, Adam; et al. (2025). "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability". arXiv:2503.09532.
  18. ^ Bills, Steven; et al. (2023). "Language models can explain neurons in language models". Retrieved 2025-04-29.
  19. ^ Paulo, Gonçalo; et al. (2024). "Automatically Interpreting Millions of Features in Large Language Models". arXiv:2410.13928.
  20. ^ Wu, Zhengxuan; et al. (2025). "AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders". arXiv:2501.17148.
  21. ^ a b Dunefsky, Jacob; et al. (December 10–15, 2024). "Transcoders find interpretable LLM feature circuits". Advances in Neural Information Processing Systems 38 (NeurIPS 2024). Vancouver, BC, Canada. Retrieved 2025-04-29.
  22. ^ Paulo, Gonçalo; et al. (2025). "Transcoders Beat Sparse Autoencoders for Interpretability". arXiv:2501.18823 [cs.LG].
  23. ^ a b Lindsey, Jack; et al. (2024). "Sparse Crosscoders for Cross-Layer Features and Model Diffing". Transformer Circuits Thread. Anthropic. Retrieved 2025-04-30.