
Week 5: Causal Localization

Overview

Understanding what a model computes is one thing; understanding how it computes it is another. This week introduces causal intervention techniques that let you test hypotheses about which components are responsible for specific behaviors. Through activation patching, causal tracing, and attribution methods, you'll learn to identify the mechanisms that matter, moving from correlation to causation in interpretability research.

Learning Objectives

By the end of this week, you should be able to:

Explain the difference between correlational and causal evidence about model components
Apply activation patching (both noise and clean patching) to test whether a component is necessary or sufficient for a behavior
Use ROME-style causal tracing and average indirect effect to localize where information is processed
Estimate component importance efficiently with gradient-based attribution patching
Design well-matched counterfactual datasets for causal experiments

Required Readings

Meng et al. (2022). Introduces causal tracing to localize where factual knowledge is stored. The foundational ROME paper.
Todd et al. (2023). Extends localization from facts to functions. Shows task-specific vectors are localized and transferable.
Prakash et al. (2024). Examines how models bind properties to entities and how these binding mechanisms persist and localize through fine-tuning.

Tutorial: From Observation to Causation

1. The Challenge: Correlation vs. Causation

Visualization shows us what is represented. Steering shows us we can change behavior. But neither tells us which specific components are causally responsible for a behavior.

Example: If a model correctly answers "The capital of France is Paris," we might observe:

Attention heads that attend from the final position back to "France"
MLP activations at the subject token that correlate with the correct answer
A direction in the residual stream that distinguishes France-related prompts from others

But which of these are necessary for the correct answer? Which are merely correlated? Causal intervention lets us test this.

2. Activation Patching: The Core Technique

Activation patching is the fundamental intervention method: replace activations from one forward pass with activations from another, then measure the effect.

Basic Setup

  1. Clean run: Run the model on a prompt where it behaves correctly
    "The capital of France is" → "Paris" ✓
  2. Corrupted run: Run on a prompt where behavior differs
    "The capital of Germany is" → "Berlin"
  3. Patch: In a new run on the corrupted prompt, replace specific activations with those from the clean run
  4. Measure: Does the model now produce the clean run's output?
Corrupted: "Germany" → activations → "Berlin"
Clean: "France" → activations → "Paris"

Patched: "Germany" + [patched activations from "France"] → ???

If output → "Paris": patched component was causally important!
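
A minimal sketch of this recipe using PyTorch forward hooks on GPT-2 (via Hugging Face transformers) is shown below. It is an illustration, not the exact code from any of this week's papers: the layer index and position to patch are arbitrary choices, and a real experiment would sweep over layers and positions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

clean = tok("The capital of France is", return_tensors="pt")      # clean run prompt
corrupt = tok("The capital of Germany is", return_tensors="pt")   # corrupted run prompt

LAYER, POS = 8, -1                      # which residual-stream state to patch (illustrative)
block = model.transformer.h[LAYER]

# 1) Clean run: cache this block's output hidden states.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()     # output is a tuple; [0] is the hidden states
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Patched run: rerun the corrupted prompt, overwriting position POS with the clean value.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POS, :] = cache["h"][:, POS, :]
    return (hidden,) + output[1:]
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits[0, -1]
handle.remove()

# 3) Measure: does " Paris" now beat " Berlin" at the next-token position?
paris = tok(" Paris")["input_ids"][0]
berlin = tok(" Berlin")["input_ids"][0]
print("logit(Paris) - logit(Berlin):", (patched_logits[paris] - patched_logits[berlin]).item())

Note that both prompts tokenize to the same number of tokens, so positions line up between the clean and corrupted runs.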

Interpretation

If patching a component restores the clean run's output, that component carries information that is causally relevant to the behavior. If the output is unchanged, the component is either not involved or its contribution is redundant with other components. Partial restoration suggests the behavior is distributed across several components.

3. Noise Patching vs. Clean Patching

There are two main patching strategies, each answering different questions:

Noise Patching (Ablation)

Replace activations with random noise or zeros:

Original: "The capital of France is" → "Paris"
Noise patched: "The capital of France is" + [noise] → ???

Question answered: Is this component necessary for the behavior?
Interpretation: If performance degrades, the component was contributing.
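
A small variant of the patch hook above implements zero (or noise) ablation. This continues the section 2 sketch, reusing model, block, POS, clean, and paris:

# Ablation: overwrite the component's output with zeros (or noise) instead of clean values.
def ablate_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POS, :] = 0.0                                            # zero ablation
    # hidden[:, POS, :] += 0.5 * torch.randn_like(hidden[:, POS, :])   # or: noise ablation
    return (hidden,) + output[1:]

with torch.no_grad():
    baseline_logits = model(**clean).logits[0, -1]                     # unablated reference

handle = block.register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_logits = model(**clean).logits[0, -1]
handle.remove()

print("drop in logit(Paris):", (baseline_logits[paris] - ablated_logits[paris]).item())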

Clean (Counterfactual) Patching

Replace activations with those from a different, meaningful prompt:

Corrupted: "The capital of Germany is" → "Berlin"
Clean: "The capital of France is" → "Paris"
Patched: "Germany" + [France activations] → "Paris"?

Question answered: Is this component sufficient to change behavior from corrupted to clean?
Interpretation: If behavior changes, this component causally mediates the difference.

Choosing Your Strategy

Use Case | Strategy | Why
Find necessary components | Noise patching | See what breaks when removed
Find sufficient components | Clean patching | See what can transfer behavior
Understand factual recall | Clean patching | Isolate subject-specific processing
Test robustness | Noise patching | Find critical dependencies

4. Causal Effects: Formal Definitions

To reason precisely about interventions, we need formal definitions from causal inference.

Average Causal Effect (ACE)

The effect of changing one variable while holding others constant:

ACE = E[Y | do(X=x₁)] - E[Y | do(X=x₀)]

In neural networks: The difference in output when we intervene to set a component's activation to x₁ vs x₀.

Total Effect

The overall impact of X on Y, including all paths (direct and indirect):

X → Y (direct)
X → Z → Y (indirect through Z)

Total Effect = Direct Effect + Indirect Effects

Indirect Effect

The effect of X on Y that flows through intermediate variable Z:

Indirect Effect = E[Y | do(Z = value_when_X=x₁), X=x₀] - E[Y | X=x₀]

Example: How much does changing "France" to "Germany" at position 5 affect the output through its effect on layer 10's MLP?
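
These quantities map directly onto patching measurements. Continuing the section 2 sketch (patched_logits, corrupt, paris, and the illustrative LAYER/POS choice), the indirect effect of the France→Germany substitution through the patched state can be estimated as a probability difference:

import torch.nn.functional as F

with torch.no_grad():
    corrupt_logits = model(**corrupt).logits[0, -1]     # corrupted run, nothing patched

p_patched = F.softmax(patched_logits, dim=-1)[paris]    # corrupted run + clean patch at (LAYER, POS)
p_corrupt = F.softmax(corrupt_logits, dim=-1)[paris]
print("indirect effect through (LAYER, POS):", (p_patched - p_corrupt).item())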

5. ROME: Causal Tracing for Knowledge Localization

The ROME paper uses causal tracing to answer: Where is factual knowledge stored in GPT models?

The Experiment

  1. Clean prompt: "The Space Needle is located in the city of" → "Seattle"
  2. Corrupted prompt: Add noise to all token embeddings
  3. Systematic patching: Restore clean activations one component at a time
  4. Measure: Does the model recover the correct answer?

Key Findings

Causal tracing reveals a consistent processing flow for factual recall:

Subject tokens ("Space Needle") → middle MLP layers (roughly 5-10) at the last subject token retrieve the fact ("located in Seattle") → the final position outputs "Seattle"

The strongest restoration effects concentrate in mid-layer MLPs at the last token of the subject, suggesting that factual associations are stored there.

Methodology: Average Indirect Effect (AIE)

ROME measures the indirect effect of the subject through each component:

  1. Run corrupted prompt (all noise)
  2. Patch clean activations at state S (e.g., layer 8, position "Needle")
  3. Measure: how much does this restore correct output?
  4. Repeat for all states S
  5. States with high AIE are causally important
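
A compact causal-tracing sketch in the same style follows, reusing model and tok from section 2. The noise scale and the hard-coded subject positions are illustrative simplifications of ROME's recipe, and a real run would average over several noise samples and many prompts:

prompt = "The Space Needle is located in the city of"
inputs = tok(prompt, return_tensors="pt")
seattle = tok(" Seattle")["input_ids"][0]
subj_positions = [1, 2]          # token positions covering "Space Needle" (verify against the tokenizer)

# Cache the clean hidden state at every layer.
clean_states = {}
def make_save(layer):
    def hook(module, inp, out):
        clean_states[layer] = out[0].detach()
    return hook
handles = [blk.register_forward_hook(make_save(i)) for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    model(**inputs)
for h in handles:
    h.remove()

# Corrupt the subject by adding noise to its input embeddings.
embeds = model.transformer.wte(inputs["input_ids"])
noised = embeds.clone()
noised[:, subj_positions, :] += 0.5 * torch.randn_like(noised[:, subj_positions, :])

def prob_seattle(restore_hook=None, layer=None):
    handle = model.transformer.h[layer].register_forward_hook(restore_hook) if restore_hook else None
    with torch.no_grad():
        logits = model(inputs_embeds=noised).logits[0, -1]
    if handle:
        handle.remove()
    return torch.softmax(logits, dim=-1)[seattle].item()

baseline = prob_seattle()        # corrupted run, nothing restored

# Restore the clean state at a single (layer, position) and measure recovery.
def make_restore(layer, pos):
    def hook(module, inp, out):
        hidden = out[0].clone()
        hidden[:, pos, :] = clean_states[layer][:, pos, :]
        return (hidden,) + out[1:]
    return hook

effects = {}
for layer in range(len(model.transformer.h)):
    for pos in range(inputs["input_ids"].shape[1]):
        effects[(layer, pos)] = prob_seattle(make_restore(layer, pos), layer) - baseline

best = max(effects, key=effects.get)
print("largest recovery at (layer, position):", best, effects[best])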

6. Entity Tracking and Binding Vectors

The entity tracking work (finetuning.baulab.info) studied a different question: How do models track which attributes belong to which entities?

The Setup

"The tall person and the short person walked into the room.
The tall person sat down."

Question: How does the model remember "tall" is bound to the first person when processing "The tall person sat down"?

Findings: Binding Vectors

Entity-attribute bindings are represented as context-specific activation patterns that are computed and routed largely by attention heads rather than stored in MLP weights. Patching these activations between prompts can change which attribute the model associates with which entity.

Contrast with ROME

Aspect | ROME (Factual Knowledge) | Entity Tracking (Bindings)
Storage | MLP layers | Attention heads
Layer | Middle layers (5-10) | Various layers, task-dependent
What's stored | Long-term facts ("Seattle") | Context-specific bindings ("tall" ↔ person 1)
Localization | Highly localized (specific MLPs) | More distributed (multiple heads)

Lesson: Different types of information are stored in different architectural components. Facts go in MLPs, bindings in attention.

7. Gradient-Based Attribution Patching

Testing every possible component via patching is expensive: each candidate requires its own forward pass. Gradient-based attribution (attribution patching) approximates the effect of patching every activation at once from a single backward pass.

The Idea

Instead of actually patching every activation, approximate the effect using gradients:

Attribution(activation) ≈ gradient(loss w.r.t. activation) × (clean_value - corrupted_value)

This gives a first-order approximation of how much patching that activation would change the output.

Algorithm

  1. Run clean and corrupted forward passes, save all activations
  2. Run corrupted pass again, computing gradients w.r.t. output loss
  3. For each activation: attribution = gradient × Δactivation
  4. Activations with high attribution are predicted to be important
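
A sketch of attribution patching over residual-stream states, reusing model, clean, corrupt, paris, and berlin from section 2. The metric (the Paris-vs-Berlin logit gap) is an illustrative choice; any differentiable measure of the behavior works:

clean_acts, corrupt_acts = {}, {}

def make_cache(store, layer):
    def hook(module, inp, out):
        store[layer] = out[0]
        if store is corrupt_acts:
            out[0].retain_grad()         # keep gradients on the corrupted activations
    return hook

# Clean pass: cache activations (no gradients needed).
handles = [blk.register_forward_hook(make_cache(clean_acts, i))
           for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    model(**clean)
for h in handles:
    h.remove()

# Corrupted pass: cache activations and backpropagate the metric through them.
handles = [blk.register_forward_hook(make_cache(corrupt_acts, i))
           for i, blk in enumerate(model.transformer.h)]
logits = model(**corrupt).logits[0, -1]
(logits[paris] - logits[berlin]).backward()
for h in handles:
    h.remove()

# Attribution ≈ gradient × (clean − corrupted), summed over the hidden dimension:
# one estimated patching effect per (layer, position), from a single backward pass.
attribution = {
    layer: (corrupt_acts[layer].grad * (clean_acts[layer] - corrupt_acts[layer].detach()))
           .sum(-1).squeeze(0)
    for layer in corrupt_acts
}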

Advantages

One clean pass, one corrupted pass, and a single backward pass yield importance estimates for every activation at once, rather than one patched forward pass per component. This makes it feasible to screen all layers, heads, and positions of a large model.

Limitations

The estimate is only a first-order (linear) approximation: it can be misleading when the patched change is large, the computation is strongly nonlinear, or gradients saturate. Treat high-attribution activations as candidates and confirm them with real patches.

8. Average Indirect Effect (AIE) for Systematic Search

AIE provides a systematic framework for finding all causally important components.

The Method

For each component (layer, head, neuron):

  1. Start with corrupted run
  2. Patch only that component with clean activations
  3. Measure effect on output: AIE = P(correct | patch) - P(correct | no patch)
  4. Repeat across many examples
  5. Components with high average AIE are important
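
In code the loop is straightforward once single-example patching is in place. patched_prob and corrupted_prob below are hypothetical helpers that wrap the hook-based patching from section 2 and return P(correct answer):

def average_indirect_effect(dataset, layer, pos):
    # dataset: list of (clean_prompt, corrupted_prompt, answer_token) triples
    effects = []
    for clean_prompt, corrupted_prompt, answer_token in dataset:
        p_patch = patched_prob(clean_prompt, corrupted_prompt, answer_token, layer, pos)
        p_base = corrupted_prob(corrupted_prompt, answer_token)
        effects.append(p_patch - p_base)
    return sum(effects) / len(effects)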

Hierarchical Search

Use AIE hierarchically to narrow down:

  1. Test each layer → find important layers
  2. Test each component in important layers → find important heads/neurons
  3. Test positions × components → find spatiotemporal importance

Interpretation

Components with consistently high average AIE across many examples are causally important for the task; effects near zero indicate components that are uninvolved or redundant. Averaging over examples guards against effects that are idiosyncratic to a single prompt.

9. Function Vectors: Elegant Application of Patching

Function vectors encode specific computational functions (like "negate" or "compare") and can be extracted through causal intervention.

Core Idea

If a function is represented as a vector, adding/subtracting it should enable/disable that function:

Original: "The tower is tall" → "tall"
+ Negation vector: "The tower is tall" → "short"
- Negation vector: (might enhance affirmation)

Finding Function Vectors

  1. Create pairs: Sentences that differ only in the target function
    • "The tower is tall" / "The tower is short" (negation)
    • "Paris is larger than Lyon" / "Lyon is smaller than Paris" (comparison reversal)
  2. Extract activations: Get activation differences at various layers/heads
  3. Find direction: Compute mean difference across pairs
  4. Test causally: Add vector to new examples, verify it performs the function
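
The sketch below follows this recipe in simplified form, reusing model and tok from section 2. Todd et al. construct function vectors from the outputs of specific attention heads; for brevity this version uses the residual stream at a single layer, and the layer, pairs, scaling, and test prompt are all illustrative choices:

FV_LAYER = 9
pairs = [("The tower is tall", "The tower is short"),
         ("The road is wide", "The road is narrow")]

def last_token_state(prompt, layer):
    store = {}
    def hook(module, inp, out):
        store["h"] = out[0][:, -1, :].detach()
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return store["h"]

# Mean activation difference across pairs: a candidate "flip to the antonym" direction.
diffs = [last_token_state(b, FV_LAYER) - last_token_state(a, FV_LAYER) for a, b in pairs]
function_vector = torch.stack(diffs).mean(0)

# Test causally: add the vector at the same layer while running a new prompt.
def add_fv(module, inp, out):
    hidden = out[0].clone()
    hidden[:, -1, :] += function_vector        # may need rescaling in practice
    return (hidden,) + out[1:]

def top_next(prompt, hook=None):
    handle = model.transformer.h[FV_LAYER].register_forward_hook(hook) if hook else None
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    if handle:
        handle.remove()
    return tok.decode(logits.argmax().item())

print("without vector:", top_next("The tower is"))
print("with vector:   ", top_next("The tower is", hook=add_fv))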

Function Vector Attention Heads

Some attention heads specialize in computing specific functions. You can identify them by measuring each head's average indirect effect: patch the head's output from prompts that demonstrate the function into prompts that do not, and check whether the function gets executed. Heads with consistently high effects across many examples and tasks are candidate function-vector heads.

Applications

Summing the outputs of these heads gives a compact function vector that can trigger the behavior zero-shot in new contexts, and comparing function vectors across tasks hints at which operations the model treats as related.

10. Designing Counterfactual Datasets

Effective causal experiments require carefully designed counterfactual pairs.

Principles

1. Minimal Pairs: Change only the target variable

Good: "France" / "Germany" (minimal change)
Bad: "France" / "The United States of America" (length differs)

2. Matched Structure: Keep syntax and structure identical

Good: "The capital of France" / "The capital of Germany"
Bad: "France's capital" / "The capital of Germany" (different structure)

3. Clear Causation: The changed variable should clearly cause the output difference

Good: "happy" / "sad" → sentiment changes
Bad: "happy Tuesday" / "sad Wednesday" → multiple changes

4. Sufficient Diversity: Test across varied contexts

Common Patterns

Subject Substitution:

The [subject] is located in [object]

Attribute Swapping:

The [adjective] person walked / The [different adjective] person walked

Negation Toggle:

The tower is tall / The tower is not tall

Relation Reversal:

A is greater than B / A is less than B

Validation

Test your dataset before running patching experiments:

Confirm the model actually behaves differently on the two members of each pair (different top answers or a clear logit gap)
Check that paired prompts tokenize to the same length so positions align between runs
Spot-check pairs by hand for unintended differences beyond the target variable
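
A quick, self-contained check along these lines (the pair shown, the single-token answers, and the GPT-2 model are illustrative assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

# Each entry: (clean prompt, clean answer, corrupted prompt, corrupted answer).
pairs = [("The capital of France is", " Paris", "The capital of Germany is", " Berlin")]

def answer_logit(prompt, answer):
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    return logits[tok(answer)["input_ids"][0]].item()   # assumes the answer is one token

for clean_p, clean_a, corr_p, corr_a in pairs:
    same_len = len(tok(clean_p)["input_ids"]) == len(tok(corr_p)["input_ids"])
    clean_margin = answer_logit(clean_p, clean_a) - answer_logit(clean_p, corr_a)
    corr_margin = answer_logit(corr_p, corr_a) - answer_logit(corr_p, clean_a)
    print(f"lengths match: {same_len} | clean margin: {clean_margin:.2f} | corrupted margin: {corr_margin:.2f}")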

Putting It All Together: A Research Workflow

  1. Hypothesis: Formulate what you think the model is doing
  2. Dataset: Design counterfactual pairs testing your hypothesis
  3. Baseline: Verify the model shows the target behavior
  4. Coarse search: Use AIE to find important layers/components
  5. Gradient attribution: Narrow down to specific activations
  6. Causal validation: Patch top candidates, measure effects
  7. Interpretation: Build mechanistic story from findings
  8. Generalization: Test on new examples/tasks

Note on validation: Causal intervention is powerful, but how do we know our interpretations are actually correct? Week 10 covers a comprehensive validation framework including faithfulness testing, sanity checks, and common pitfalls to avoid.

In-Class Exercise: Where Is Humor Localized?

Building on our pun dataset and representation analysis from previous weeks, we will now use causal tracing to determine where in the model pun understanding is actually computed.

Part 1: Setting Up Counterfactuals (15 min)

Create minimal pairs for causal intervention:

  1. Select pun pairs: For each pun, create a non-pun version that is structurally identical
    • Pun: "Time flies like an arrow; fruit flies like a banana"
    • Non-pun: "Time passes like an arrow; fruit falls like a stone"
  2. Verify behavior difference: Confirm the model shows different logit patterns for pun vs non-pun
  3. Prepare 10-15 pairs for patching experiments

Part 2: Causal Tracing for Puns (25 min)

Apply ROME-style causal tracing to locate pun processing:

  1. Corrupt baseline: Run the pun with noise added to embeddings
  2. Systematic restoration: For each (layer, position) pair:
    • Restore clean activations only at that location
    • Measure: How much does this restore the "pun signature" in model outputs?
  3. Create heatmap: Plot causal importance across layers × positions

Key questions:

Is pun processing concentrated at particular layers or token positions (for example, the ambiguous word)?
Does restoring mid-layer activations recover the pun signature, the way mid-layer MLPs recover facts in ROME?
Is the effect sharply localized, or spread across many (layer, position) states?

Part 3: Cross-Patching Experiments (20 min)

Use activation patching to test specific hypotheses:

  1. Pun → Non-pun patching:
    • Run the non-pun sentence
    • Patch in activations from the pun version at key locations
    • Does this make the model treat the non-pun as a pun?
  2. Component-specific patching:
    • Patch only MLP outputs vs. only attention outputs
    • Which component type carries the "pun signal"?
  3. Compare to your Week 4 findings:
    • Do the causally important layers match where you found the best pun/non-pun separation?
    • Does your "pun direction" from Week 4 align with the causal structure?

Discussion: How does pun localization compare to factual knowledge localization (ROME)? Are semantic concepts like humor processed similarly to factual associations?

Open Causal Tracing Notebook in Colab

Code Exercise

This week's exercise provides hands-on experience with causal intervention:

Open Exercise in Colab

Project Milestone

Due: Thursday of Week 4

Use activation patching and causal intervention techniques to localize where your concept is computed. Move beyond correlation to establish causal relationships between components and concept processing.

Causal Intervention Experiments

Deliverables:

These causal findings will guide Week 5's probe training: you'll focus on the layers and positions identified as causally important here.