Mechanistic Interpretability of In-Context Learning in Transformer Models
Abstract
In-context learning (ICL) — the ability of transformer models to adapt to new tasks from a few demonstration examples without weight updates — remains one of the most striking yet poorly understood capabilities of large language models. We reverse-engineer the internal circuits responsible for ICL by combining activation patching, causal tracing, and probing classifiers across a family of GPT-2-scale transformer models. We identify a three-phase circuit architecture consisting of induction heads, task-encoding subspaces, and output heads. Our ablation studies demonstrate that disrupting fewer than 5% of attention heads eliminates over 80% of ICL performance, confirming the sparsity of the ICL circuit.
1. Introduction
The emergence of in-context learning (ICL) in large language models has fundamentally altered the paradigm of machine learning deployment. Given a prompt of the form , a pretrained transformer can produce consistent with the implicit task defined by the demonstrations — all without any gradient-based updates to its parameters. This capability, first documented systematically by Brown et al. (2020), raises a fundamental question: what computational mechanism within the transformer implements this learning algorithm?
Prior work has approached this question from both theoretical and empirical angles. Theoretical analyses have shown that transformers can implement gradient descent in their forward pass (Akyürek et al., 2023; Von Oswald et al., 2023), while empirical studies have identified attention patterns — particularly "induction heads" — that correlate with ICL performance (Olsson et al., 2022). However, a unified mechanistic account of the complete ICL circuit remains elusive.
In this paper, we present a comprehensive mechanistic analysis of ICL in transformer models. Our contributions are threefold:
- We identify a three-phase circuit for ICL: pattern-matching induction heads, task-encoding residual stream subspaces, and label-predicting output heads.
- We demonstrate the sparsity of this circuit through targeted ablations, showing that a small fraction of components are causally responsible for the majority of ICL capability.
- We characterize the developmental trajectory of ICL circuits during pretraining, revealing a staged emergence pattern.
2. Related Work
Mechanistic interpretability. The field of mechanistic interpretability seeks to reverse-engineer neural networks into human-understandable algorithms (Elhage et al., 2021; Conmy et al., 2023). Key techniques include activation patching (Geiger et al., 2021), causal tracing (Meng et al., 2022), and circuit discovery via path patching (Wang et al., 2023). Our work applies these tools specifically to the ICL phenomenon.
Theoretical models of ICL. Several works have established that transformer architectures can, in principle, implement learning algorithms. Garg et al. (2022) showed that transformers trained on linear regression tasks learn to implement ridge regression. Akyürek et al. (2023) demonstrated that transformer layers can simulate gradient descent steps on least-squares objectives. Formally, a single attention layer can compute:
which corresponds to one step of gradient descent on .
Induction heads. Olsson et al. (2022) identified induction heads — attention heads that implement a "copy previous token's successor" operation — as a key mechanism for in-context learning. These heads perform a two-step composition: a previous-token head writes positional information into the residual stream, and the induction head uses this to attend to tokens following previous occurrences of the current token.
3. Methodology
3.1 Experimental Setup
We study ICL in GPT-2 Small (12 layers, 12 heads, ) and GPT-2 Medium (24 layers, 16 heads, ) on synthetic classification tasks. Each task is defined by a randomly sampled linear classifier with inputs and labels . We construct ICL prompts with demonstrations.
3.2 Circuit Discovery via Activation Patching
We employ activation patching (also known as causal mediation analysis) to identify the components causally responsible for ICL. For each attention head at layer and head index , we measure the indirect effect (IE) on ICL accuracy:
where is a valid ICL prompt and has shuffled labels. High IE indicates that the head carries information essential for correct ICL predictions.
3.3 Task-Encoding Probes
To identify where task representations are formed, we train linear probes on residual stream activations at each layer. Specifically, we train a logistic regression classifier to predict the task identity (the underlying weight vector , discretized into clusters) from the residual stream activation at the query position after layer :
We measure probe accuracy across layers to identify the formation point of task representations.
4. Results and Discussion
4.1 Three-Phase ICL Circuit
Our analysis reveals a clear three-phase architecture:
Phase 1: Induction Heads (Layers 2–5). We identify 4 attention heads in GPT-2 Small (specifically , , , and ) that exhibit strong induction behavior. These heads attend from the query token to demonstration inputs that are semantically similar, with attention weights approximating:
where is a learned similarity matrix. Ablating these four heads reduces ICL accuracy from 87.3% to 54.1% (near random baseline of 50%).
Phase 2: Task Encoding (Layers 5–8). Probing accuracy for task identity jumps sharply between layers 5 and 8, reaching 91.2% by layer 8 (compared to 52.3% at layer 2). PCA analysis of residual stream activations reveals a low-dimensional task subspace of effective dimension , computed as:
where are eigenvalues of the activation covariance matrix.
Phase 3: Output Heads (Layers 9–11). Three attention heads in the final layers (, , ) read from the task-encoding subspace and produce the output logits. These heads show high indirect effect scores (mean IE = 0.31) and their output subspaces align strongly with the label direction ().
4.2 Circuit Sparsity
A key finding is the remarkable sparsity of the ICL circuit. Out of 144 total attention heads in GPT-2 Small, only 7 heads (4.9%) are causally critical for ICL. Ablating these 7 heads reduces ICL accuracy to near-chance levels, while ablating a random set of 7 heads reduces accuracy by less than 3%. This result follows a power-law distribution of indirect effects:
where is the head with rank in indirect effect.
4.3 Developmental Trajectory
By analyzing checkpoints saved every 1,000 training steps, we observe a staged emergence of the ICL circuit:
| Training Phase | Steps | Circuit Component | ICL Accuracy |
|---|---|---|---|
| Phase I | 0–10K | Bigram statistics only | 52.1% |
| Phase II | 10K–30K | Induction heads form | 68.4% |
| Phase III | 30K–60K | Task-encoding subspace emerges | 81.7% |
| Phase IV | 60K–100K | Output heads specialize | 87.3% |
Notably, induction heads emerge in a sharp phase transition around step 15K, consistent with observations by Olsson et al. (2022). The task-encoding capability develops more gradually, suggesting a qualitatively different learning mechanism.
5. Conclusion
We have presented a mechanistic account of in-context learning in transformer models, identifying a sparse, three-phase circuit architecture. Our findings demonstrate that ICL is implemented by a remarkably small fraction of model components organized in a hierarchical pipeline: pattern matching, task encoding, and label prediction. The staged developmental trajectory we observe suggests that these capabilities build upon each other during pretraining, with simpler copying mechanisms providing the scaffold for more abstract task representations.
These results have practical implications for model design and training. The sparsity of the ICL circuit suggests that targeted fine-tuning of a small number of attention heads could substantially improve few-shot performance. Furthermore, understanding the developmental trajectory may enable curriculum design strategies that accelerate the formation of ICL capabilities.
Future work should extend this analysis to larger models and more complex tasks, investigate the relationship between ICL circuits and in-weights learning, and explore whether similar circuit motifs appear across different architectures.
References
- Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2023). What learning algorithm is in-context learning? Investigations with linear models. ICLR 2023.
- Brown, T. B., et al. (2020). Language models are few-shot learners. NeurIPS 2020.
- Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. NeurIPS 2023.
- Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
- Garg, S., Tsipras, D., Liang, P., & Valiant, G. (2022). What can transformers learn in-context? A case study of simple function classes. NeurIPS 2022.
- Geiger, A., Lu, H., Icard, T., & Potts, C. (2021). Causal abstractions of neural networks. NeurIPS 2021.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.
- Olsson, C., et al. (2022). In-context learning and induction heads. Transformer Circuits Thread.
- Von Oswald, J., et al. (2023). Transformers learn in-context by gradient descent. ICML 2023.
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. ICLR 2023.


