A brief primer for technically minded people unfamiliar with the discipline of mechanistic interpretability—a field of research developing a causal understanding of the internal mechanisms of "black box" artificial neural networks. In writing this, I have liberally simplified and lightly stolen from the extensive body of research on this topic; please read the sources I reference and beyond.
On the one hand, I write this as a neuroscientist who believes the techniques developed by mechanistic interpretability have an underappreciated potential for extracting scientific facts about the world from machine learning models—especially in the study of biological brains. On the other hand, I find mechanistic interpretability to be a tool of great importance for people involved in AI policy and governance. From both perspectives, an influx of interest in mechanistic interpretability from experts outside of the field may be broadly useful for AI safety.
Large machine learning models like AlphaFold and the ESM series have proven incredibly effective at predicting three-dimensional protein structures from amino acid sequences. Recent work across multiple labs1,2 has begun to apply the methods of mechanistic interpretability to biology foundation models, identifying previously undiscovered biological mechanisms these models have learned. Similar investigations into the inner workings of foundation models trained on other medical tasks, such as Alzheimer's detection, have identified a novel class of biomarkers3. This paradigm can be applied across scientific disciplines, uncovering facts about the world from black-box models.
Therefore, I direct this piece towards a generally technical audience. For the sake of accessibility, I do not define my terms with complete formal mathematics; please read the sources I reference for that type of discussion. My goal is to foster intuition about what types of questions this field asks, what methods it uses and why, and what outcomes have guided its progression over the past few years.
In 2020, OpenAI's interpretability team led by Chris Olah published Zoom In: An Introduction to Circuits4. This paper laid out three speculative claims about neural networks that provide a foundation for mechanistic interpretability:
Claim 1: Features — Features are the fundamental unit of neural networks. They correspond to directions in activation space. These features can be rigorously studied and understood.
Claim 2: Circuits — Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.
Claim 3: Universality — Analogous features and circuits form across models and tasks.
This paper analyzed which images maximally activate individual neurons of a computer vision model in order to understand their function—much like how one can study the tuning curve of a biological neuron. The researchers identified some neurons that cleanly respond to individual features, like the orientation of a curve, and traced circuits that construct complex features through weighted combinations of simpler ones.
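As a concrete illustration, this kind of feature visualization can be done by gradient ascent on the input image. The sketch below is minimal and assumes an off-the-shelf torchvision model with an arbitrarily chosen layer and channel; it is not the specific setup used in the Circuits work.

```python
# Minimal sketch of feature visualization by activation maximization: optimize
# a random input image so that one channel of an intermediate layer fires
# strongly. The model, layer index, and channel are illustrative stand-ins.
import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

layer, channel = model.features[17], 42   # an arbitrary conv layer and "neuron"
captured = {}
handle = layer.register_forward_hook(lambda m, i, o: captured.update(out=o))

image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -captured["out"][0, channel].mean()   # maximize the channel's activation
    loss.backward()
    optimizer.step()

handle.remove()
# `image` now approximates a stimulus that strongly drives this channel.
```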
Unfortunately, neural networks are not reducible to interpretable neurons. While some circuits construct complex features from natural-seeming primitives like "windows + car body + wheels = car," many neurons are polysemantic: they respond to multiple unrelated features, a consequence of superposition.
Olah and others went on to investigate superposition at Anthropic in 2022 in Toy Models of Superposition5. Here, they identified superposition as a natural way for neural networks to represent more features than they have dimensions.
As an example, imagine a layer of a network has three neurons. If the optimal number of features to represent at this layer is three or fewer, then each neuron can represent one feature: picture a three-dimensional space where each axis corresponds to both a feature and a neuron, with every feature orthogonal to the others. However, if it is optimal to store more than three features, the additional features must be represented along non-orthogonal directions within that 3D space. This should imply interference between features, but if features are sparse—that is, rarely active—the network can avoid a meaningful loss in performance.
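The sketch below reproduces the spirit of that setup: six sparse features compressed into three dimensions using the toy model x → ReLU(WᵀWx + b) from the paper. The sizes, sparsity level, and training length are my own illustrative choices.

```python
# Toy superposition sketch: six sparse features compressed into three
# dimensions with the x -> ReLU(W^T W x + b) toy model. Sizes, sparsity,
# and training length are illustrative.
import torch

n_features, n_dims, sparsity = 6, 3, 0.9
W = torch.nn.Parameter(0.1 * torch.randn(n_dims, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Each feature is active only rarely (with probability 1 - sparsity).
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity)
    recon = torch.relu(x @ W.T @ W + b)
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each column of W is the direction assigned to a feature; with six features
# in three dimensions, they cannot all be orthogonal to one another.
with torch.no_grad():
    dirs = torch.nn.functional.normalize(W, dim=0)
    print(dirs.T @ dirs)   # off-diagonal entries show the interference
```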
This presents a general problem for interpretability. If superposition is a predictable behavior for neural networks, then features will not simply fall out of analyzing the function of individual neurons. The field has developed two responses. The first is to develop methods that extract features from compressed representations. The second is to design interpretable architectures that disincentivize superposition. For a concrete example of how these approaches have matured, let us focus on the interpretability of transformers, the field's primary area of interest.
In A Mathematical Framework for Transformer Circuits6, a team at Anthropic led by Chris Olah and including Nelson Elhage, Neel Nanda, and Catherine Olsson examined transformers with two or fewer layers to derive some general principles of their internal mechanisms. Transformers are the fundamental attention-based architecture behind every major LLM, among other models. One layer of a transformer takes in tokens—small strings in the fundamental vocabulary of the model—and outputs logits for the probability of the next token. The pathway running from "embed" to "unembed," which carries information from the input token to the output logits, is called the residual stream: a linear, high-dimensional vector space.
You can think of an LLM as a stack of transformer layers that apply more and more operations to the contents of the residual stream. At each layer, attention heads direct how much each token attends to every other token, reading from and adding onto the residual stream. Then a shallow neural network called a multi-layer perceptron (MLP) takes the residual stream as input, applies an activation function, and adds onto the stream again. In short, attention heads move information between positions, then an MLP applies some computation to that information.
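Here is a minimal sketch of one such layer, with illustrative dimensions. This is a generic pre-norm decoder block (causal masking omitted for brevity), not the exact architecture of any particular model; the point is that attention and the MLP each read from the residual stream and add their output back onto it.

```python
# Minimal transformer layer: attention and the MLP each read from the
# residual stream and write their output back to it. Sizes are illustrative.
import torch
import torch.nn as nn

d_model, n_heads = 256, 4

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, resid):
        # Attention heads move information between token positions...
        x = self.ln1(resid)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        resid = resid + attn_out
        # ...then the MLP computes on each position and writes back.
        resid = resid + self.mlp(self.ln2(resid))
        return resid

tokens = torch.randn(1, 10, d_model)    # stand-in for embedded tokens
print(Block()(tokens).shape)            # residual stream keeps shape (1, 10, 256)
```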
The importance of this paper comes from the discovery of an interpretable circuit formed by the composition of attention heads. Specifically, the researchers identified the algorithm through which a two-layer attention-only transformer predicts patterns of the form: see [A][B], later see [A] → predict [B]. On the first attention layer, some heads attend to the token immediately before a given token, so "[A] precedes [B]" is stored in the residual stream. Making use of this, the second-layer induction head attends from the current [A] to the [B] that followed the previous [A] and copies the value [B] to predict the next token.
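Written as a plain function rather than as attention heads, the algorithm this circuit implements is roughly the following (the tokenization here is illustrative):

```python
# The induction pattern "see [A][B] ... see [A] -> predict [B]", written as a
# plain function. The real circuit computes this with a previous-token head
# composed with an induction head.
def induction_predict(tokens):
    current = tokens[-1]
    # Look back for an earlier position whose previous token matches the
    # current token, and copy the token that followed it.
    for i in range(len(tokens) - 2, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None

print(induction_predict(["The", "cat", "sat", ".", "The", "cat"]))  # -> "sat"
```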
This was a proof of concept that trained models can implement baroque yet interpretable algorithms—a guiding light in an era of mechanistic interpretability that sought to completely reverse-engineer toy models. Nanda et al. identified that small transformers7 use Fourier transforms to implement modular addition. Li et al.8 could recover the board state from an eight-layer model trained to play Othello. Wang et al. recovered a circuit9 for the identification of indirect objects in GPT-2 small. Each of these papers is elegant and worth a closer read.
At this point, researchers began to question the utility of this approach. It had become clear that LLMs are large, dense networks of uninterpretable neurons, with interpretable functions buried beneath that complexity. Identifying the baroque algorithms an LLM implements at each layer established the promise of the field, but repeating this type of work for every possible computation is infeasible. This problem motivated a new goal for interpretability: the automatic detection of features at scale.
Multiple teams began to investigate the efficacy of sparse autoencoders (SAEs). An SAE is a wide, single-layer neural network trained to learn sparse representations that reconstruct a layer of neural activations. This is one approach to dictionary learning: the decomposition of a neural network's distributed representations into many interpretable features.
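A minimal SAE sketch follows, with illustrative sizes and an L1 sparsity penalty; real implementations differ in details such as weight tying, normalization, and the exact sparsity mechanism.

```python
# Minimal sparse autoencoder: reconstruct a layer's activations from a wide,
# sparse hidden code. Sizes and the L1 coefficient are illustrative.
import torch
import torch.nn as nn

d_model, d_sae, l1_coeff = 768, 768 * 16, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(4096, d_model)                   # activations from one layer
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(-1).mean()
```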
While Bricken et al. at Anthropic studied10 SAEs in the MLPs of one-layer transformers, Cunningham et al. at EleutherAI researched11 SAEs in the residual stream of GPT-2-scale models, which Templeton et al. at Anthropic scaled up12 to a large dictionary of features of Claude 3 Sonnet. In all cases, researchers successfully identified interpretable features in an unsupervised manner—though the goal of accurately reducing a frontier model to a dictionary of features remained out of reach.
In pursuit of this goal, researchers discovered that identified features, among other directions in activation space, could be used to direct the behavior of LLMs—a process called activation steering. This opened up a new approach to mechanistic interpretability: even if we cannot learn the entire model, we can learn what matters to succeed at the task at hand.
Identifying and activating directions within the model to control behavior is also referred to as representation engineering, another branch of research that began to converge with mechanistic interpretability at this time. The term was coined by Zou et al. at the Center for AI Safety13, who made use of the linear representation of features to identify and manipulate directions in activation space that induce behaviors like honesty or power-seeking. While one could target features identified by SAEs, the simplest way to identify a useful axis for manipulation is to construct contrastive pairs: run prompts that elicit honesty, run prompts that elicit dishonesty, and calculate an honesty axis by subtracting the mean of the dishonest activations from the mean of the honest activations.
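A sketch of the contrastive-pair recipe is below. The activation tensors are random stand-ins for activations collected from a real model on the two prompt sets, and the hook registration is left commented out because the exact module and output format depend on the model being steered.

```python
# Sketch of contrastive activation steering. The activation tensors and the
# commented-out model line are stand-ins; in practice the activations come
# from running a real LLM on prompts that elicit or suppress the behavior.
import torch

honest_acts = torch.randn(100, 4096)      # layer activations on "honest" prompts
dishonest_acts = torch.randn(100, 4096)   # layer activations on "dishonest" prompts

# The steering vector is the difference of the two means.
honesty_direction = honest_acts.mean(0) - dishonest_acts.mean(0)
honesty_direction = honesty_direction / honesty_direction.norm()

def steering_hook(_module, _inputs, output, alpha=5.0):
    # Add the honesty direction onto the residual stream at this layer.
    return output + alpha * honesty_direction

# handle = model.layers[20].register_forward_hook(steering_hook)  # hypothetical
# ... generate with the hook attached, then handle.remove()
```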
Representation engineering has been effective at eliciting truthful answers and at identifying axes that mediate the refusal of harmful requests or model persona14,15,16. That said, developing a useful axis is a subjective task, and the resulting axis may not generalize out of distribution or may affect behaviors beyond the one of interest. For example, when trying to mitigate social bias17, Anthropic found that steering features related to gender bias produced off-target effects on age bias.
While this approach falls short of mechanistic interpretability's goal of complete reverse engineering, it provides a clear use case for the methods the field has developed, along with an avenue for making causal claims connecting model internals to model behavior. A Pragmatic Vision for Interpretability18 argues that mechanistic interpretability research should choose the level of analysis most relevant to the safety-relevant task at hand, erring towards method minimalism.
In this vein—while SAEs remain useful for discovering new features and analyzing the geometry of a representation—the field has embraced simpler methods for identifying features of interest in a supervised manner. When trying to monitor an abstract feature like "harmful user intent" outside of the training distribution, researchers at Google DeepMind found that SAEs were outperformed by linear probes19: linear classifiers or regressors trained on a layer of model activations. Probes provide a computationally cheap and interpretable way to monitor internals given a labeled dataset.
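A minimal probe sketch with random stand-in data; in practice the activations come from running the model on labeled prompts.

```python
# Sketch of a linear probe: a logistic regression classifier trained on one
# layer's activations with labels for the concept of interest. The data here
# is random, so test accuracy will hover around chance.
import numpy as np
from sklearn.linear_model import LogisticRegression

acts = np.random.randn(2000, 4096)           # activations at the chosen layer
labels = np.random.randint(0, 2, size=2000)  # e.g., harmful vs. benign intent

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print(probe.score(acts[1500:], labels[1500:]))  # held-out accuracy
```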
While the methods above are useful for operating on representations, they do not explain how the model computes. If we detect that the model is about to engage in deception at some layer, both scientific curiosity and pragmatism may lead us to ask how the processing of prior inputs led to that behavior—especially if activation steering is insufficient to alter the behavior of interest. To do this, we cannot simply learn the representation within an MLP; we must learn the transformation from the MLP's inputs to its outputs. This is the purpose of a transcoder, a wide MLP trained to sparsely approximate the function of an MLP20, input-to-output, in a trained LLM.
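A minimal transcoder sketch under assumed sizes: a frozen stand-in MLP supplies the input-output pairs, and a wide, sparse MLP is trained to imitate it (one training step shown).

```python
# Minimal transcoder: a wide, sparse MLP trained to imitate a frozen MLP
# block of a trained model, mapping that block's inputs to its outputs.
# Sizes and the sparsity penalty are illustrative.
import torch
import torch.nn as nn

d_model, d_features, l1_coeff = 768, 768 * 16, 1e-3

original_mlp = nn.Sequential(                 # stand-in for a frozen LLM MLP block
    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
).eval()

transcoder = nn.Sequential(nn.Linear(d_model, d_features), nn.ReLU(),
                           nn.Linear(d_features, d_model))
opt = torch.optim.Adam(transcoder.parameters(), lr=1e-4)

mlp_in = torch.randn(4096, d_model)           # MLP inputs collected from the model
with torch.no_grad():
    mlp_out = original_mlp(mlp_in)            # the outputs we try to reproduce

features = torch.relu(transcoder[0](mlp_in))  # sparse feature activations
recon = transcoder[2](features)
loss = ((recon - mlp_out) ** 2).mean() + l1_coeff * features.abs().sum(-1).mean()
loss.backward(); opt.step()
```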
The optimal structure of a transcoder is an open question. The original approach trains one transcoder per layer. A skip transcoder21 adds an affine skip connection to separate the linear component of the MLP from the nonlinear features. Cross-layer transcoders write to the MLP outputs22 of all future layers, enabling clean attribution across the entire model at the expense of extra compute.
Furthermore, one can measure how attention moves information between interpretable features through QK attribution. This is done by decomposing attention scores into sums of feature-pair dot products23 between query and key positions. Together, transcoders and QK attribution make nearly the full forward pass interpretable.
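The decomposition itself is just bilinearity of the dot product: if the query-side and key-side residual streams are each written as a sum of feature vectors, the attention score equals the sum of per-feature-pair contributions. A sketch with random stand-in matrices:

```python
# QK attribution sketch: the attention score from summed residual streams
# equals the sum of pairwise feature contributions. W_Q, W_K, and the feature
# decompositions here are random stand-ins.
import torch

d_model, d_head = 64, 16
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)

query_feats = torch.randn(5, d_model)   # features active at the query position
key_feats = torch.randn(3, d_model)     # features active at the key position

# Full score from the summed residual streams...
score = (query_feats.sum(0) @ W_Q) @ (key_feats.sum(0) @ W_K)
# ...equals the sum of per-feature-pair contributions.
pair_contribs = (query_feats @ W_Q) @ (key_feats @ W_K).T   # shape (5, 3)
print(torch.allclose(score, pair_contribs.sum(), rtol=1e-4))
```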
We can use these methods to create attribution graphs that trace a computation throughout a model. This can be used to understand reasoning, diagnose reasoning errors, or understand jailbreaks24,25,26 in a causal manner. That said, it is important to remember that the replacement model—the original model with its components swapped for their interpretable approximations—is still an approximation, and not every prompt is neatly explainable.
At a more top-down level of explanation, these principles allow researchers to train a language model to explain the internals of another model. Caden Juang et al. at Stanford developed LatentQA27, which fine-tunes a decoder LLM to answer specific domains of questions about a target LLM from the target's activations plus the input prompt. For example, this activation explainer can read what persona a model has been told to adopt via an otherwise hidden prompt. Furthermore, because this decoder has learned to translate directions in activation space into natural language, it can also compute a loss from natural-language prompts like "be an unbiased person" and backpropagate it to the target model's activations to change behavior during inference. Chen et al. at Anthropic generalized this framework through diversified training to develop activation oracles that can answer natural language queries28 about model activity beyond the domain of their training distribution.
Relatedly, researchers at Transluce found that models trained29 for LatentQA-type tasks work best when the decoder LLM is generated by training the target LLM to explain its own activations. In combination with evidence from Lindsey et al. at Anthropic that models can detect external manipulations30 of internal activations above chance, this provides preliminary evidence that frontier models have some capacity for introspection. It remains an open question how training models to perform introspection affects model capabilities.
In the field of AI safety, training models on the measures we use for mechanistic interpretability has been called "The Most Forbidden Technique31." If we use a probe to detect the representation of some ideal feature like honesty, then training the model to activate that probe could merely teach it to represent honesty while finding some less interpretable way to engage in dishonest behavior—making the probe useless.
That said, research may uncover useful and safe ways to train on mechanistic interpretability—transforming this area of study from a taboo into a series of complex engineering questions. The main line of research thus far incorporates linear probes into training, since probes do not dramatically increase the compute required. Concerning model honesty, Cundy and Gleave at FAR.AI found that maintaining honest behavior32 when incorporating lie detectors into training required careful choices about probe design and training hyperparameters. From a feature-based perspective, Casademunt et al. demonstrated that ablating misaligned features33 during fine-tuning can reduce emergent misalignment without degrading performance. Goodfire has recently found success using a linear probe that distinguishes activations underlying factually true outputs from false ones to train an LLM against hallucination from the outset34.
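As a generic illustration of what incorporating a probe into training can mean, not the specific recipe of any paper cited above, one can add a penalty whenever a frozen probe fires on the model's activations. The probe direction and weighting below are hypothetical stand-ins.

```python
# Illustrative pattern only: fold a frozen linear probe into a fine-tuning
# loss by penalizing activations the probe scores as "dishonest".
import torch
import torch.nn.functional as F

d_model, lambda_probe = 4096, 0.1
probe_w = torch.randn(d_model)      # frozen probe: positive score = "dishonest"
probe_b = torch.tensor(0.0)

def probe_penalty(resid_acts):
    logits = resid_acts @ probe_w + probe_b
    return F.softplus(logits).mean()   # push the probe's dishonesty score down

# Inside the training loop, with task_loss and resid_acts from the forward pass:
# loss = task_loss + lambda_probe * probe_penalty(resid_acts)
```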
It is unclear what the trade-off between capabilities and safety will be in this domain, on both practical and methodological dimensions. Neel Nanda of Google DeepMind has suggested35 that integrating these techniques into frontier model training stacks would be such a pain that it is safe to ask empirical questions about the long-term effects of this approach with minimal imminent risk. Thomas McGrath of Goodfire suggests that training36 on one set of methods and testing safety with another could mitigate the chance that models learn to obfuscate their internal representations. This remains an unresolved question of active debate and research.
The capabilities of these complex computational systems seem to increase more quickly every year; it is essential that we understand how they function. At the core of these systems are artificial neural networks which represent features in superposition, making individual neurons uninterpretable. In the face of this challenge, the field has developed tools to extract interpretable features, trace computation through a model, explain behavior in natural language, and shape the development of new models.
Mechanistic interpretability is still early. Attribution graphs are approximations. SAE features don't perfectly reconstruct the model they decompose. Training on interpretability signals raises adversarial concerns that are not yet resolved. The field has produced elegant results on toy models and increasingly compelling results on frontier models, but a complete mechanistic understanding of a system as large as a modern LLM remains out of reach. Whether this is a temporary bottleneck or a fundamental limitation is an open question.
That said, the trajectory is promising enough to warrant broad attention. For those in the sciences, application of these methods to biology foundation models is already producing novel discoveries. For those in policy, the ability to audit a model's internal reasoning is a prerequisite for meaningful regulation of AI systems making consequential decisions. The methods described here are the best candidates we have—so far.
I have tried to present this material at a level of detail that builds intuition without requiring the reader to have followed the field from the beginning. If I have succeeded, you should now be able to engage with the primary sources with some fluency; I encourage you to do so. The field is moving fast, and the papers referenced throughout this piece are a good starting point.