So far, we've learned to observe representations (visualization), manipulate them (steering), and test causal importance (patching). Now we put it all together: reverse-engineering complete computational circuits. A circuit is a minimal, faithful subgraph of the network that implements a specific behavior. This week focuses on circuit discovery methodology, with the Indirect Object Identification (IOI) circuit in GPT-2 small as our main case study: how the model identifies who received an action (e.g., "Mary gave the milk to ____" → "John"). We also cover binding (attribute-entity associations), concept induction (semantic-level patterns), automated discovery (ACDC), and testing circuit faithfulness.
By the end of this week, you should be able to:
A circuit is a computational subgraph of a neural network that implements a specific algorithm or behavior. Think of it like reverse-engineering: given a complex machine (the full network), identify the minimal set of components that perform a particular function.
Path patching extends activation patching to trace specific paths through the network. Instead of patching an entire layer, we patch only the pathway from component A to component B.
Consider two runs: clean and corrupted. We want to know: does information flow from component A (e.g., attention head 3.2) to component B (e.g., attention head 7.5)?
In transformers, component A writes to the residual stream and component B reads from it. To patch the path A→B: cache A's output from the corrupted run, then rerun the clean input with A's contribution to B's input replaced by the cached corrupted version, while every other path into B stays clean.
A high path-patching effect means that A causally influences B along this specific pathway.
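The procedure above can be sketched end-to-end on a toy two-component model. Here `comp_a`, `comp_b`, and the feature vectors are illustrative stand-ins, not a real transformer's API; in this toy, overriding A's output on the path into B is exactly a path patch:

```python
import numpy as np

# Toy "network": component A writes to a residual stream, component B
# reads from it. All names here are illustrative, not a real model's API.

def comp_a(x):
    return np.array([x[0], 0.0])      # A encodes feature 0

def comp_b(resid):
    return resid[0] + 0.1 * resid[1]  # B mostly reads what A wrote

def run(x, a_out=None):
    """Forward pass; optionally override A's output on the path into B."""
    resid = x.copy()
    a = comp_a(x) if a_out is None else a_out
    resid = resid + a
    return comp_b(resid)

clean, corrupt = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

# Path patch A -> B: run on the clean input, but give B the A-output
# computed from the corrupted input. Everything else stays clean.
patched = run(clean, a_out=comp_a(corrupt))
effect = run(clean) - patched   # large effect => A causally feeds B
```

A real path patch additionally has to keep A's output unchanged on every path *except* the one into B; tools like TransformerLens implement this with activation hooks.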
Multi-head attention circuits communicate through composition: one head's output influences another head's computation. There are three main types:
Head A's output affects what Head B pays attention to by modifying B's query vectors.
Example: Binding circuits use Q-composition. An earlier head identifies an entity, and a later head uses this information to determine where to look for attributes.
Head A's output affects what Head B attends to by modifying B's key vectors.
Example: In the IOI circuit, earlier heads modify key representations so that name mover heads can attend to the correct name (the indirect object) and not the subject.
Head A's output affects what information Head B extracts by modifying B's value vectors.
How to distinguish: Path patch from A to B's Q/K/V inputs separately. Whichever has the highest effect tells you the composition type.
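This diagnostic can be sketched on a toy head whose true wiring routes A's output through the query; all matrices, vectors, and names below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy head B reading query/key/value inputs over two positions.
def head_b(q_in, k_in, v_in):
    attn = softmax(k_in @ q_in)     # attention scores from Q and K
    return attn @ v_in              # weighted readout of the values

# In the "true" wiring, head A's output enters B only via the query.
def inputs_to_b(a_out):
    q = np.array([1.0, 0.0]) + a_out        # A feeds Q
    k = np.array([[1.0, 0.0], [0.0, 1.0]])  # keys (A-independent)
    v = np.array([10.0, -10.0])             # values (A-independent)
    return q, k, v

a_clean, a_corrupt = np.array([3.0, 0.0]), np.array([0.0, 3.0])
qc, kc, vc = inputs_to_b(a_clean)      # clean inputs to B
qx, kx, vx = inputs_to_b(a_corrupt)    # corrupted inputs to B
base = head_b(qc, kc, vc)

# Patch A's (corrupted) influence into each of B's inputs separately.
effects = {
    "Q": abs(base - head_b(qx, kc, vc)),
    "K": abs(base - head_b(qc, kx, vc)),
    "V": abs(base - head_b(qc, kc, vx)),
}
comp_type = max(effects, key=effects.get)   # largest effect wins
```

Because only the query depends on A in this toy, the K- and V-patches have zero effect and the procedure correctly reports Q-composition.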
Indirect Object Identification (IOI) is a natural language task: given a sentence like "When Mary and John went to the store, Mary gave the milk to ____", the model must predict the indirect object—the person who received the action—i.e., "John", not "Mary" (the subject). Wang et al. (2022) reverse-engineered how GPT-2 small implements this behavior using causal interventions.
Input: "When Mary and John went to the store, Mary gave the milk to"
Model predicts: "John" (the indirect object—who received the milk—not "Mary", the subject)
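IOI performance is typically scored by the logit difference between the indirect object and the subject; a positive score means the model prefers the correct completion. A minimal sketch, with illustrative logit values in place of a real forward pass:

```python
# Toy logits standing in for a real model's output distribution.
logits = {"John": 14.2, "Mary": 12.9, "the": 11.0}  # illustrative values

def ioi_logit_diff(logits, io="John", s="Mary"):
    """Standard IOI metric: logit(indirect object) - logit(subject)."""
    return logits[io] - logits[s]

score = ioi_logit_diff(logits)   # > 0: model predicts the indirect object
```

This single scalar is the metric that ablation and patching experiments on the IOI circuit track.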
The IOI circuit in GPT-2 small comprises 26 attention heads grouped into 7 main classes. Key roles include:
Name Mover Heads
Duplicate Token Heads
S-Inhibition (Subject Inhibition)
Other Supporting Heads
The circuit was discovered using causal interventions (ablation, path patching). The paper evaluates the explanation with three criteria:
Necessity (ablation):
Sufficiency and minimality:
Binding is about associating attributes with entities: remembering that "tall" goes with person 1 and "short" goes with person 2.
"The tall person and the short person walked into a room.
The tall person sat down."
Question: Which person sat? → person 1 (the tall one)
Nikhil's work identified a multi-head circuit for maintaining attribute-entity bindings:
Phase 1: Create Bindings (during first mention)
Phase 2: Retrieve Bindings (during second mention)
The composition is through queries (Q-composition):
| Aspect | IOI Circuit | Binding Circuit |
|---|---|---|
| Composition Type | Mix of Q/K/V (name movers, duplicate token, S-inhibition) | Q-composition |
| What's Modified | Which name is attended to (subject vs. indirect object) | Where the retrieving head looks (its queries) |
| Complexity | 26 heads in 7 classes (GPT-2 small) | Multiple heads (3-5) |
| Task | Indirect object identification | Attribute-entity association |
| Information Used | Duplicate names, subject vs. object roles | Binding vectors |
The Feucht paper discovered that induction operates at two levels: token-level and concept-level.
The IOI circuit we studied relies on token-level structure: duplicate detection and name moving. Similar pattern-copying (e.g., induction) also works at the token level.
"When Mary and John went to the store, Mary gave a drink to"
→ Predicts: "John" (via the IOI mechanism, or via token-level copying of the literal string "John" seen earlier in the context)
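The token-level route can be written out as a literal algorithm ("A B … A → B": find the previous occurrence of the current token and copy its successor). This plain-Python sketch operates on strings rather than attention heads:

```python
def induct(tokens):
    """Token-level induction: copy the token that followed the most
    recent previous occurrence of the current token."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == cur:
            return tokens[i + 1]              # copy the successor
    return None                               # no earlier occurrence

seq = ["Mr", "Dursley", "said", "Mr"]
pred = induct(seq)   # -> "Dursley"
```

The concept-level route described next cannot be written this way: it matches on meaning, so it survives even when the literal token is changed.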
A parallel mechanism that copies semantic patterns, not just tokens.
"When Mary and John went to the store, Mary gave a drink to"
BUT if "John" is edited in the model's weights to "Jonathan":
→ Still predicts: "Jonathan" (semantic association, not token)
The model uses both routes simultaneously:
Route 1: Token Circuit (attention-based)
Route 2: Concept Circuit (MLP-based)
This reveals an important lesson: behaviors can have multiple implementations. Circuit discovery must account for:
Manually finding circuits is tedious. Automated Circuit Discovery (ACDC) systematically searches for circuits using path patching.
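The core ACDC loop can be sketched as greedy edge pruning against a performance threshold. The "model" below is just a weighted sum over illustrative edges, standing in for activation patching on a real computational graph:

```python
# Illustrative edge "importances"; in real ACDC, removing an edge means
# patching in its corrupted activation and re-measuring the task metric.
edges = {"a->b": 2.0, "b->out": 1.5, "c->out": 0.01, "d->out": 0.02}

def metric(active_edges):
    # Stand-in for task performance (e.g., IOI logit diff) with only
    # the given edges un-patched.
    return sum(edges[e] for e in active_edges)

tau = 0.1                       # pruning threshold
circuit = set(edges)
for e in sorted(edges):         # try removing each edge in turn
    trial = circuit - {e}
    if metric(circuit) - metric(trial) < tau:
        circuit = trial         # removal barely hurts: prune the edge
```

The threshold `tau` controls the sparsity/faithfulness trade-off: a larger `tau` yields a smaller circuit that explains less of the behavior.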
A circuit is faithful if it actually explains how the full model works, not just a spurious correlation.
1. Ablation Test:
2. Out-of-Distribution Test:
3. Adversarial Test:
4. Intervention Test:
5. Necessity and Sufficiency:
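One common way to quantify the ablation and necessity/sufficiency tests above is to compare task performance across three runs: the full model, the circuit alone (everything else ablated), and the complement (the circuit ablated). The numbers below are illustrative placeholders for real logit-diff measurements:

```python
# Illustrative measurements; in practice these come from real runs.
full_perf = 3.1        # full model logit diff
circuit_perf = 2.9     # circuit-only run, non-circuit heads mean-ablated
ablated_perf = 0.2     # complement run, circuit heads ablated

# Sufficiency: the circuit alone should recover most of the behavior.
faithfulness = circuit_perf / full_perf
# Necessity: removing the circuit should destroy most of the behavior.
necessity_drop = (full_perf - ablated_perf) / full_perf
```

A faithful, necessary circuit has both ratios near 1; a spuriously correlated component set does not.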
Just because a component activates during a behavior doesn't mean it causes the behavior. Always use causal interventions (patching/ablation) to test.
A single head might implement multiple circuits for different tasks. Test that your circuit is specific to your target behavior.
Circuits found on one dataset may not generalize. Always validate on diverse, out-of-distribution examples.
Setting the pruning threshold too high removes genuinely important components. Validate that your "minimal" circuit still works.
Models often have redundant components (e.g., multiple name mover heads in the IOI circuit). Ablating one head may not reveal its importance if others compensate.
Goal: Not just "this circuit exists" but "this is how the model solves this task, and here's why this architecture makes sense."
This week's exercise provides hands-on experience with circuit discovery:
We will attempt automated circuit discovery for pun processing, synthesizing insights from previous weeks to map the computational pathway from setup to punchline recognition.
Before automated discovery, compile what you already know:
Create a hypothesis: "Pun processing likely involves [specific layers], activated by [specific positions], and uses [specific attention patterns]."
Apply gradient-based attribution to discover which attention heads matter for pun recognition:
Validate the discovered circuit with causal interventions:
Discussion: Does pun understanding have a clean, minimal circuit, or is it distributed across many components? How does this compare to more localized circuits like IOI?
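The gradient-attribution step in this exercise (as in edge attribution patching) scores each component by a first-order approximation of its patching effect: (corrupted activation − clean activation) times the gradient of the metric with respect to that activation. A toy linear sketch, with illustrative weights and activations:

```python
import numpy as np

# Toy task metric, linear in three component activations; the weights
# and activation values are illustrative, not from a real model.
w = np.array([2.0, 0.5, 0.01])

def metric(acts):
    return float(w @ acts)

clean_acts = np.array([1.0, 1.0, 1.0])
corrupt_acts = np.array([0.0, 0.0, 0.0])

grad = w                                      # d(metric)/d(acts), exact here
scores = (corrupt_acts - clean_acts) * grad   # per-component attribution
ranked = np.argsort(-np.abs(scores))          # most important first
```

On a real model the gradient comes from a single backward pass, which is what makes this approximation much cheaper than patching every component individually; the top-ranked components are then the candidates to validate causally.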
Open EAP-IG Notebook in Colab
Open Circuit Tracer Notebook in Colab

Due: Thursday of Week 8
Map the computational pathway for your concept: identify how features and components combine across layers to compute your concept. Build a mechanistic explanation of the algorithm the model uses.
A good circuit explanation should let someone understand not just WHERE the concept is computed, but HOW and WHY the model computes it that way.