Understanding what a model computes is one thing; understanding how it computes it is another. This week introduces causal intervention techniques that let you test hypotheses about which components are responsible for specific behaviors. Through activation patching, causal tracing, and attribution methods, you'll learn to identify the mechanisms that matter, moving from correlation to causation in interpretability research.
By the end of this week, you should be able to:

- Explain the difference between noise patching and clean (counterfactual) patching, and choose the right strategy for a given question
- Define direct, total, and indirect effects and relate them to interventions on model components
- Apply ROME-style causal tracing to localize where a behavior is computed
- Use gradient-based attribution and average indirect effects (AIE) to prioritize components for patching
- Design counterfactual prompt datasets built from well-matched minimal pairs
Visualization shows us what is represented. Steering shows us that we can change behavior. But neither tells us which specific components are causally responsible for a behavior.
Example: If a model correctly answers "The capital of France is Paris," we might observe:

- attention heads at the final position attending to "France"
- mid-layer MLP activations that correlate with producing "Paris"
- directions in the residual stream from which a probe can read out the correct capital
But which of these are necessary for the correct answer? Which are merely correlated? Causal intervention lets us test this.
Activation patching is the fundamental intervention method: replace activations from one forward pass with activations from another, then measure the effect.
"The capital of France is" → "Paris" ✓
"The capital of Germany is" → "Berlin"
There are two main patching strategies, each answering different questions:
Replace activations with random noise or zeros:
Question answered: Is this component necessary for the behavior?
Interpretation: If performance degrades, the component was contributing.
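The same idea, sketched as noise ablation (again assuming TransformerLens and GPT-2; swap in `torch.zeros_like` for zero ablation):

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The capital of France is")
paris = model.to_tokens(" Paris", prepend_bos=False)[0, 0]

layer, pos = 8, 4                       # illustrative choices
hook_name = utils.get_act_name("resid_pre", layer)

def ablate_hook(resid, hook):
    # Replace one activation with Gaussian noise
    resid[:, pos, :] = torch.randn_like(resid[:, pos, :])
    return resid

clean_logit = model(tokens)[0, -1, paris]
ablated_logit = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, ablate_hook)])[0, -1, paris]
print(f"'Paris' logit drop after ablation: {(clean_logit - ablated_logit).item():.3f}")
```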
Replace activations with those from a different, meaningful prompt:
Question answered: Is this component sufficient to change behavior from corrupted to clean?
Interpretation: If behavior changes, this component causally mediates the difference.
| Use Case | Strategy | Why |
|---|---|---|
| Find necessary components | Noise patching | See what breaks when removed |
| Find sufficient components | Clean patching | See what can transfer behavior |
| Understand factual recall | Clean patching | Isolate subject-specific processing |
| Test robustness | Noise patching | Find critical dependencies |
To reason precisely about interventions, we need formal definitions from causal inference.
The effect of changing one variable while holding others constant:
In neural networks: The difference in output when we intervene to set a component's activation to x₁ vs x₀.
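One standard way to write this uses the do-operator, with Y the model's output and h the component's activation:

Effect = E[Y | do(h = x₁)] - E[Y | do(h = x₀)]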
The overall impact of X on Y, including all paths (direct and indirect):
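In do-notation, with Y the output and X the variable being changed:

TE = E[Y | do(X = x₁)] - E[Y | do(X = x₀)]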
The effect of X on Y that flows through intermediate variable Z:
Example: How much does changing "France" to "Germany" at position 5 affect the output through its effect on layer 10's MLP?
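One standard formalization is the natural indirect effect from causal mediation analysis: keep X at its original value, but give the mediator Z the value it would have taken under the change:

IE = E[Y | do(X = x₀, Z = Z(x₁))] - E[Y | do(X = x₀)]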
The ROME paper uses causal tracing to answer: Where is factual knowledge stored in GPT models?
ROME measures the indirect effect of the subject through each component:
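Concretely (paraphrasing the paper's procedure): corrupt the subject token embeddings with noise, restore a single component's clean activation, and measure how much probability of the correct answer comes back:

IE(component) = P(correct | corrupted subject, component restored) - P(correct | corrupted subject)

A large IE means that restoring that one component is enough to recover much of the fact.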
The entity tracking work (finetuning.baulab.info) studied a different question: How do models track which attributes belong to which entities?
"The tall person and the short person walked into the room.
The tall person sat down."
Question: How does the model remember "tall" is bound to the first person when processing "The tall person sat down"?
| Aspect | ROME (Factual Knowledge) | Entity Tracking (Bindings) |
|---|---|---|
| Storage | MLP layers | Attention heads |
| Layer | Middle layers (5-10) | Various layers, task-dependent |
| What's stored | Long-term facts ("Seattle") | Context-specific bindings ("tall" ↔ person 1) |
| Localization | Highly localized (specific MLPs) | More distributed (multiple heads) |
Lesson: Different types of information are stored in different architectural components. In these studies, long-term facts localized to MLP layers, while in-context bindings localized to attention heads.
Testing every possible component via patching is expensive. Gradient-based attribution uses the model's gradients to estimate causal importance at a fraction of the cost.
Instead of actually patching every activation, approximate the effect using gradients:

attribution = gradient × Δactivation

This gives a first-order approximation of how much patching that activation would change the output.
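A sketch of this estimate (often called attribution patching), written here with plain PyTorch hooks on a HuggingFace GPT-2; the layer index and prompts are illustrative, and the hook assumes a GPT-2-style block whose output tuple carries the hidden state first:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
paris_id = tok(" Paris")["input_ids"][0]

clean = tok("The capital of France is", return_tensors="pt")
corrupt = tok("The capital of Germany is", return_tensors="pt")

layer, saved = 8, {}

def save_hidden(module, inputs, output):
    hidden = output[0]              # GPT-2 blocks return a tuple; element 0 is the hidden state
    if hidden.requires_grad:
        hidden.retain_grad()        # keep .grad for this intermediate tensor
    saved["hidden"] = hidden

handle = model.transformer.h[layer].register_forward_hook(save_hidden)

# Clean run: record the activations we would patch in (no gradients needed).
with torch.no_grad():
    model(**clean)
clean_hidden = saved["hidden"]

# Corrupted run: backpropagate the metric down to the saved hidden state.
metric = model(**corrupt).logits[0, -1, paris_id]
metric.backward()
corrupt_hidden = saved["hidden"]

# attribution ≈ gradient × (clean activation - corrupted activation), per position
attribution = (corrupt_hidden.grad * (clean_hidden - corrupt_hidden.detach())).sum(-1)
print(attribution[0])               # one importance score per token position
handle.remove()
```

One backward pass yields an importance estimate for every position at that layer, instead of one forward pass per patch.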
AIE (the average indirect effect) provides a systematic framework for finding all causally important components.
For each component (layer, head, neuron):
AIE = P(correct | patch) - P(correct | no patch)

Use AIE hierarchically to narrow down: scan whole layers first, then the heads or neurons inside the most promising layers (see the sketch below).
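A minimal sketch of the first stage of such a scan, building a layer-by-position grid of AIE values by patching the residual stream (assuming TransformerLens; prompts and metric are illustrative):

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("The capital of France is")
corrupt = model.to_tokens("The capital of Germany is")
paris = model.to_tokens(" Paris", prepend_bos=False)[0, 0]

_, clean_cache = model.run_with_cache(clean)

def prob_paris(logits):
    return torch.softmax(logits[0, -1], dim=-1)[paris].item()

baseline = prob_paris(model(corrupt))           # P(correct | no patch)

aie = torch.zeros(model.cfg.n_layers, clean.shape[1])
for layer in range(model.cfg.n_layers):
    name = utils.get_act_name("resid_pre", layer)
    for pos in range(clean.shape[1]):
        def patch(resid, hook, pos=pos, name=name):
            resid[:, pos, :] = clean_cache[name][:, pos, :]
            return resid
        logits = model.run_with_hooks(corrupt, fwd_hooks=[(name, patch)])
        aie[layer, pos] = prob_paris(logits) - baseline   # AIE for this (layer, position)

print(aie)   # rows = layers, columns = token positions
```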
Function vectors encode specific computational functions (like "negate" or "compare") and can be extracted through causal intervention.
If a function is represented as a vector, adding/subtracting it should enable/disable that function:
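A toy version of that test, assuming TransformerLens: build a candidate vector from the mean residual-stream difference between prompts that demonstrate a function (antonyms here) and matched prompts that do not, then add it during a fresh forward pass. The layer and prompts are illustrative, and published function-vector work extracts the vector from specific attention heads rather than the raw residual stream, so treat this only as the shape of the experiment:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer = 9                                   # illustrative layer
name = utils.get_act_name("resid_post", layer)

# Prompts that demonstrate the function (antonyms) vs. matched prompts that don't
fn_prompts = ["hot -> cold, big -> small, fast ->", "up -> down, wet -> dry, old ->"]
neutral_prompts = ["hot, cold, big, small, fast,", "up, down, wet, dry, old,"]

def last_resid(prompt):
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache[name][0, -1].detach()

# Candidate "function vector": mean activation difference at the final position
fv = (sum(last_resid(p) for p in fn_prompts) / len(fn_prompts)
      - sum(last_resid(p) for p in neutral_prompts) / len(neutral_prompts))

def add_fv(resid, hook):
    resid[:, -1, :] += fv                   # inject the vector at the final position
    return resid

# Does adding the vector push a bare prompt toward executing the function?
out = model.run_with_hooks(model.to_tokens("tall ->"), fwd_hooks=[(name, add_fv)])
print(model.tokenizer.decode(out[0, -1].argmax().item()))
```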
Some attention heads specialize in computing specific functions. You can identify them by patching their outputs into new contexts (does the function transfer?) and by ablating them (does the function break?).
Effective causal experiments require carefully designed counterfactual pairs.
1. Minimal Pairs: Change only the target variable
Good: "France" / "Germany" (minimal change)
Bad: "France" / "The United States of America" (length differs)
2. Matched Structure: Keep syntax and structure identical
Good: "The capital of France" / "The capital of Germany"
Bad: "France's capital" / "The capital of Germany" (different structure)
3. Clear Causation: The changed variable should clearly cause the output difference
Good: "happy" / "sad" → sentiment changes
Bad: "happy Tuesday" / "sad Wednesday" → multiple changes
4. Sufficient Diversity: Test across varied contexts
Subject Substitution:
The [subject] is located in [object]
Attribute Swapping:
The [adjective] person walked / The [different adjective] person walked
Negation Toggle:
The tower is tall / The tower is not tall
Relation Reversal:
A is greater than B / A is less than B
Test your dataset before running interventions: confirm that each clean/corrupted pair tokenizes to the same length, that the pair differs only at the intended positions, and that the model's behavior actually differs between the two prompts. A quick check is sketched below.
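A small sketch of the length-and-position check, assuming a GPT-2 tokenizer (the pairs are illustrative; the second one deliberately fails):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

pairs = [
    ("The capital of France is", "The capital of Germany is"),
    ("The tower is tall", "The tower is not tall"),   # fails the length check on purpose
]

for clean, corrupt in pairs:
    a, b = tok(clean)["input_ids"], tok(corrupt)["input_ids"]
    if len(a) != len(b):
        print(f"Length mismatch ({len(a)} vs {len(b)}): {clean!r} / {corrupt!r}")
        continue
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    print(f"OK: {clean!r} / {corrupt!r} differ at positions {diff}")
```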
Note on validation: Causal intervention is powerful, but how do we know our interpretations are actually correct? Week 10 covers a comprehensive validation framework including faithfulness testing, sanity checks, and common pitfalls to avoid.
Building on our pun dataset and representation analysis from previous weeks, we will now use causal tracing to determine where in the model pun understanding is actually computed.
Create minimal pairs for causal intervention:
Apply ROME-style causal tracing to locate pun processing:
Key questions:
Use activation patching to test specific hypotheses:
Discussion: How does pun localization compare to factual knowledge localization (ROME)? Are semantic concepts like humor processed similarly to factual associations?
Open Causal Tracing Notebook in Colab

This week's exercise provides hands-on experience with causal intervention:
Due: Thursday of Week 4
Use activation patching and causal intervention techniques to localize where your concept is computed. Move beyond correlation to establish causal relationships between components and concept processing.
These causal findings will guide Week 5's probe training: you'll focus on the layers and positions identified as causally important here.