So far, we've learned to observe representations (visualization), manipulate them (steering), and test causal importance (patching). Now we put it all together: reverse-engineering complete computational circuits. A circuit is a minimal, faithful subgraph of the network that implements a specific behavior. This week focuses on circuit discovery methodology using three fundamental case studies: induction (pattern-copying), binding (attribute-entity associations), and concept induction (semantic-level patterns).
By the end of this week, you should be able to:
- explain what makes a circuit minimal and faithful
- use path patching to trace information flow between specific components
- distinguish Q-, K-, and V-composition between attention heads
- describe the induction and binding circuits and how they differ
- validate a candidate circuit for necessity, sufficiency, and faithfulness
A circuit is a computational subgraph of a neural network that implements a specific algorithm or behavior. Think of it like reverse-engineering: given a complex machine (the full network), identify the minimal set of components that perform a particular function.
Path patching extends activation patching to trace specific paths through the network. Instead of patching an entire layer, we patch only the pathway from component A to component B.
Consider two runs: clean and corrupted. We want to know: does information flow from component A (e.g., attention head 3.2) to component B (e.g., attention head 7.5)?
In transformers, component A writes to the residual stream and component B reads from it. To patch the path A→B:
1. Run the corrupted input and cache A's output.
2. Run the clean input, but replace A's contribution to B's input with the cached corrupted output, leaving every other path clean.
3. Measure how much the model's output changes.
A high path-patching effect means A causally influences B along that specific path.
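The recipe above can be sketched on a toy two-component model. This is a minimal sketch with random linear "heads," not a real transformer; the names `run`, `W_A`, `W_B`, and `a_out_for_B` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_A, W_B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_out = rng.normal(size=d)

def run(x, a_out_for_B=None):
    """Forward pass; optionally feed B a swapped-in output from A."""
    a_out = x @ W_A                  # A writes to the residual stream
    resid = x + a_out
    # B reads the residual stream; on the patched run, only the A->B
    # path is replaced -- the direct A->output path stays clean
    b_in = resid if a_out_for_B is None else x + a_out_for_B
    resid = resid + b_in @ W_B
    return resid @ w_out, a_out

logit_clean, a_clean = run(np.ones(d))
_, a_corr = run(-np.ones(d))         # corrupted run: cache A's output
logit_patched, _ = run(np.ones(d), a_out_for_B=a_corr)
effect = abs(logit_patched - logit_clean)   # path-patching effect for A->B
```

Patching A's clean output back in reproduces the clean logit exactly, which is a useful sanity check that the patch touches only the A→B path.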
Multi-head attention circuits communicate through composition: one head's output influences another head's computation. There are three main types:
Head A's output affects what Head B pays attention to by modifying B's query vectors.
Example: Binding circuits use Q-composition. An earlier head identifies an entity, and a later head uses this information to determine where to look for attributes.
Head A's output affects what Head B attends to by modifying B's key vectors.
Example: Induction circuits use K-composition. The previous-token head modifies key representations so the induction head can find the right token to copy from.
Head A's output affects what information Head B extracts by modifying B's value vectors.
How to distinguish: Path patch from A to B's Q/K/V inputs separately. Whichever has the highest effect tells you the composition type.
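One way to picture this test: give a single attention head separate residual-stream inputs for its queries, keys, and values, and corrupt one at a time. This is a toy numpy sketch; in a real model you would swap in only head A's contribution, not the whole corrupted stream:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 3                      # model width, sequence length
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def head(resid_q, resid_k, resid_v):
    # one attention head whose Q/K/V inputs can be patched independently
    Q, K, V = resid_q @ W_Q, resid_k @ W_K, resid_v @ W_V
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return attn @ V

clean = rng.normal(size=(n, d))
corr = rng.normal(size=(n, d))
out_clean = head(clean, clean, clean)

# patch the corrupted stream into one input at a time; the input with
# the largest effect names the composition type (Q-, K-, or V-composition)
effects = {
    "Q": np.abs(head(corr, clean, clean) - out_clean).sum(),
    "K": np.abs(head(clean, corr, clean) - out_clean).sum(),
    "V": np.abs(head(clean, clean, corr) - out_clean).sum(),
}
```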
Induction is the simplest and most fundamental circuit: find an earlier occurrence of the current token and copy whatever followed it.
Input: "When Mary and John went to the store, Mary gave a drink to"
Model predicts: "John" (copies the token that followed "Mary" earlier)
The induction circuit requires two attention heads working together:
Head 1: Previous-Token Head (early layer). At each position, it attends to the immediately preceding position and copies that token's identity into the residual stream.
Head 2: Induction Head (later layer). It attends to the position just after an earlier occurrence of the current token and copies that token as the prediction.
The composition is through keys (K-composition): the previous-token head writes each token's identity into the next position's residual stream, so the induction head's keys encode "the token before me was X." The induction head's query for the current token X then matches positions immediately after earlier occurrences of X, and the head copies the token found there.
Necessity (ablation): ablating either the previous-token head or the induction head destroys the copying behavior.
Sufficiency (path patching): patching only the path from the previous-token head into the induction head's keys (together with the induction head's output) restores the behavior.
Minimality: no smaller subgraph reproduces the behavior; the two heads are the minimal circuit.
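The algorithm the two heads implement can be written out directly. This is a pure-Python sketch of the induction rule, not model code:

```python
def induction_predict(tokens):
    """Induction rule: find the most recent earlier occurrence of the
    final token and copy whatever followed it."""
    current = tokens[-1]
    # scan backwards, mirroring the induction head's attention pattern
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]   # the previous-token head supplies this offset
    return None                    # no earlier occurrence: no induction match
```

For example, `induction_predict(list("ABCABCA"))` returns `"B"`: the final `"A"` matches the occurrence at index 3, and the rule copies the `"B"` that followed it.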
Binding is about associating attributes with entities: remembering that "tall" goes with person 1 and "short" goes with person 2.
"The tall person and the short person walked into a room.
The tall person sat down."
Question: Which person sat? → person 1 (the tall one)
Nikhil's work identified a multi-head circuit for maintaining attribute-entity bindings:
Phase 1: Create Bindings (during first mention). Binding heads tag each entity and its attributes with a shared binding vector as the sentence is first read.
Phase 2: Retrieve Bindings (during second mention). Retrieval heads use the repeated attribute's binding vector to attend back to the matching entity.
The composition is through queries (Q-composition): the binding heads' outputs from Phase 1 modify the retrieval heads' query vectors in Phase 2, so that a query formed at the repeated attribute ("tall") attends back to the entity carrying the matching binding vector.
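The two phases can be sketched with explicit binding vectors. This is a toy numpy sketch under the binding-vector hypothesis; the orthogonal tags, their scale, and the dictionary `memory` are illustrative, not the model's actual representation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
# hypothetical binding tags: orthogonal directions, scaled to dominate content
b1 = np.zeros(d); b1[: d // 2] = 4.0   # tag for entity slot 1
b2 = np.zeros(d); b2[d // 2 :] = 4.0   # tag for entity slot 2

content = {name: rng.normal(size=d)
           for name in ("tall", "short", "person1", "person2")}

# Phase 1 (create bindings): an attribute and its entity share the same tag
memory = {
    "tall": content["tall"] + b1, "person1": content["person1"] + b1,
    "short": content["short"] + b2, "person2": content["person2"] + b2,
}

def retrieve_entity(attribute):
    # Phase 2 (retrieve, Q-composition schematically): the query inherits
    # the attribute's tag and matches the entity carrying the same tag
    q = memory[attribute]
    return max(("person1", "person2"), key=lambda e: q @ memory[e])
```

The dot product `q @ memory[e]` is dominated by the shared-tag term, so "tall" retrieves person 1 and "short" retrieves person 2 despite the random content vectors.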
| Aspect | Induction Circuit | Binding Circuit |
|---|---|---|
| Composition Type | K-composition | Q-composition |
| What's Modified | What gets attended to (keys) | Where to attend from (queries) |
| Complexity | 2 heads minimum | Multiple heads (3-5) |
| Task | Pattern copying | Attribute-entity association |
| Information Stored | Token identity (shifted) | Binding vectors |
The Feucht paper discovered that induction operates at two levels: token-level and concept-level.
The standard induction circuit we just studied: copy exact tokens.
"When Mary and John went to the store, Mary gave a drink to"
→ Predicts: "John" (exact token match)
A parallel mechanism that copies semantic patterns, not just tokens.
"When Mary and John went to the store, Mary gave a drink to"
BUT if "John" is edited in the model's weights to "Jonathan":
→ Still predicts: "Jonathan" (semantic association, not token)
The model uses both routes simultaneously:
Route 1: Token Circuit (attention-based)
Route 2: Concept Circuit (MLP-based)
This reveals an important lesson: behaviors can have multiple implementations. Circuit discovery must account for redundant routes: ablating one route can be masked by a backup route that computes the same behavior, so a component can look unimportant even when it normally does the work.
Manually finding circuits is tedious. Automated Circuit Discovery (ACDC) systematically searches for circuits by sweeping over the network's edges and using path patching to prune those that don't matter for the behavior.
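Schematically, ACDC sweeps the edges and drops any edge whose ablation barely moves the output. This is a toy sketch with a hand-built graph, zero-ablation, and made-up edge names; real ACDC patches in corrupted activations and scores the change with KL divergence:

```python
def run(a_in, noise_in, edges):
    # hand-built stand-in for a forward pass with only `edges` present
    b = a_in if ("A", "B") in edges else 0.0
    c = a_in if ("A", "C") in edges else 0.0
    out = 0.0
    if ("B", "out") in edges: out += b
    if ("C", "out") in edges: out += c
    if ("noise", "out") in edges: out += 0.01 * noise_in  # irrelevant edge
    return out

def discover_circuit(a_in, noise_in, all_edges, tau=0.1):
    """Greedy ACDC-style sweep: ablate one edge at a time and discard it
    if the output stays within tau of the full model's output."""
    full = run(a_in, noise_in, all_edges)
    kept = set(all_edges)
    for edge in sorted(all_edges):
        if abs(run(a_in, noise_in, kept - {edge}) - full) < tau:
            kept -= {edge}         # edge is not needed for this behavior
    return kept

ALL_EDGES = {("A", "B"), ("A", "C"), ("B", "out"),
             ("C", "out"), ("noise", "out")}
circuit = discover_circuit(1.0, 1.0, ALL_EDGES)
```

The sweep keeps the four edges that carry the behavior and prunes the noise edge, whose removal shifts the output by only 0.01, below the threshold tau.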
A circuit is faithful if it actually explains how the full model works, not just a spurious correlation.
1. Ablation Test: ablate the circuit's components and confirm the target behavior disappears.
2. Out-of-Distribution Test: check that the circuit predicts the model's behavior on inputs unlike the ones used for discovery.
3. Adversarial Test: construct inputs designed to break the hypothesized algorithm and check whether the circuit and the full model fail in the same way.
4. Intervention Test: edit the circuit's intermediate activations and verify the model's output changes the way the hypothesized algorithm predicts.
5. Necessity and Sufficiency: show the circuit is required for the behavior (ablating it hurts) and enough for it (the circuit alone, with everything else ablated, preserves the behavior).
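The necessity and sufficiency checks are often summarized as two numbers. This is a sketch; the score names and normalization are illustrative conventions, not a standard API:

```python
def circuit_scores(full_score, circuit_only_score, circuit_ablated_score):
    """Summarize faithfulness tests as two ratios in [0, 1].

    circuit_only_score:    task score with everything OUTSIDE the circuit ablated
    circuit_ablated_score: task score with the circuit itself ablated
    """
    sufficiency = circuit_only_score / full_score         # circuit alone suffices?
    necessity = 1.0 - circuit_ablated_score / full_score  # circuit required?
    return sufficiency, necessity
```

A faithful circuit scores high on both; for example, `circuit_scores(0.90, 0.85, 0.09)` gives sufficiency ≈ 0.94 and necessity = 0.90.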
Just because a component activates during a behavior doesn't mean it causes the behavior. Always use causal interventions (patching/ablation) to test.
A single head might implement multiple circuits for different tasks. Test that your circuit is specific to your target behavior.
Circuits found on one dataset may not generalize. Always validate on diverse, out-of-distribution examples.
Setting the pruning threshold too high removes genuinely important components. Validate that your "minimal" circuit still works.
Models often have redundant circuits (like token and concept induction). Ablating one circuit may not reveal it if another compensates.
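The redundancy pitfall is easy to reproduce in miniature. This is a toy sketch with two hand-written routes computing the same function; the route names are illustrative:

```python
def token_route(x):
    return x * 2          # stand-in for the token-level circuit

def concept_route(x):
    return x + x          # redundant route computing the same behavior

def model(x, ablated=()):
    outputs = [route(x) for route in (token_route, concept_route)
               if route not in ablated]
    # any surviving route produces the behavior, so ablating one
    # route alone makes it look causally unimportant
    return outputs[0] if outputs else None
```

Ablating `token_route` alone leaves the behavior intact (`model(3, ablated=(token_route,))` still returns 6), so a single-ablation study would wrongly conclude the token route does not matter; only ablating both routes removes the behavior.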
Goal: Not just "this circuit exists" but "this is how the model solves this task, and here's why this architecture makes sense."
This week's exercise provides hands-on experience with circuit discovery:
We will attempt automated circuit discovery for pun processing, synthesizing insights from previous weeks to map the computational pathway from setup to punchline recognition.
Before automated discovery, compile what you already know:
Create a hypothesis: "Pun processing likely involves [specific layers], activated by [specific positions], and uses [specific attention patterns]."
Apply gradient-based attribution to discover which attention heads matter for pun recognition:
Validate the discovered circuit with causal interventions:
Discussion: Does pun understanding have a clean, minimal circuit, or is it distributed across many components? How does this compare to simpler circuits like induction?
Open EAP-IG Notebook in Colab · Open Circuit Tracer Notebook in Colab

Due: Thursday of Week 8
Map the computational pathway for your concept: identify how features and components combine across layers to compute your concept. Build a mechanistic explanation of the algorithm the model uses.
A good circuit explanation should let someone understand not just WHERE the concept is computed, but HOW and WHY the model computes it that way.