
Week 5: Causal Localization

Overview

Understanding what a model computes is one thing; understanding how it computes it is another. This week introduces causal intervention techniques that let you test hypotheses about which components are responsible for specific behaviors. Through activation patching, causal tracing, and attribution methods, you'll learn to identify the mechanisms that matter, moving from correlation to causation in interpretability research.

Learning Objectives

By the end of this week, you should be able to:

Explain the difference between correlational and causal evidence about model components
Apply activation patching (both noise and clean patching) to test whether a component is necessary or sufficient for a behavior
Use ROME-style causal tracing and average indirect effect to localize where information is processed
Estimate component importance efficiently with gradient-based attribution patching
Design well-matched counterfactual datasets for causal experiments

Required Readings

Meng et al. (2022). Introduces causal tracing to localize where factual knowledge is stored. The foundational ROME paper.
Todd et al. (2023). Extends localization from facts to functions. Shows task-specific vectors are localized and transferable.
Prakash et al. (2024). Examines how models bind properties to entities and how these binding mechanisms persist and localize through fine-tuning.

Tutorial: From Observation to Causation

1. The Challenge: Correlation vs. Causation

Visualization shows us what is represented. Steering shows us we can change behavior. But neither tells us which specific components are causally responsible for a behavior.

Example: If a model correctly answers "The capital of France is Paris," we might observe:

Attention heads that attend from the final position back to "France"
MLP activations at the subject token that correlate with the correct answer
A direction in the residual stream that distinguishes France-related prompts from others

But which of these are necessary for the correct answer? Which are merely correlated? Causal intervention lets us test this.

2. Activation Patching: The Core Technique

Activation patching is the fundamental intervention method: replace activations from one forward pass with activations from another, then measure the effect.

Basic Setup

  1. Clean run: Run the model on a prompt where it behaves correctly
    "The capital of France is" → "Paris" ✓
  2. Corrupted run: Run on a prompt where behavior differs
    "The capital of Germany is" → "Berlin"
  3. Patch: In a new run on the corrupted prompt, replace specific activations with those from the clean run
  4. Measure: Does the model now produce the clean run's output?
Corrupted: "Germany" → activations → "Berlin"
Clean: "France" → activations → "Paris"

Patched: "Germany" + [patched activations from "France"] → ???

If output → "Paris": patched component was causally important!
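
A minimal sketch of this recipe using PyTorch forward hooks on GPT-2 (via Hugging Face transformers) is shown below. It is an illustration, not the exact code from any of this week's papers: the layer index and position to patch are arbitrary choices, and a real experiment would sweep over layers and positions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

clean = tok("The capital of France is", return_tensors="pt")      # clean run prompt
corrupt = tok("The capital of Germany is", return_tensors="pt")   # corrupted run prompt

LAYER, POS = 8, -1                      # which residual-stream state to patch (illustrative)
block = model.transformer.h[LAYER]

# 1) Clean run: cache this block's output hidden states.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()     # output is a tuple; [0] is the hidden states
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Patched run: rerun the corrupted prompt, overwriting position POS with the clean value.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POS, :] = cache["h"][:, POS, :]
    return (hidden,) + output[1:]
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits[0, -1]
handle.remove()

# 3) Measure: does " Paris" now beat " Berlin" at the next-token position?
paris = tok(" Paris")["input_ids"][0]
berlin = tok(" Berlin")["input_ids"][0]
print("logit(Paris) - logit(Berlin):", (patched_logits[paris] - patched_logits[berlin]).item())

Note that both prompts tokenize to the same number of tokens, so positions line up between the clean and corrupted runs.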

Interpretation

If patching a component restores the clean run's output, that component carries information that is causally relevant to the behavior. If the output is unchanged, the component is either not involved or its contribution is redundant with other components. Partial restoration suggests the behavior is distributed across several components.

3. Noise Patching vs. Clean Patching

There are two main patching strategies, each answering different questions:

Noise Patching (Ablation)

Replace activations with random noise or zeros:

Original: "The capital of France is" → "Paris"
Noise patched: "The capital of France is" + [noise] → ???

Question answered: Is this component necessary for the behavior?
Interpretation: If performance degrades, the component was contributing.
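
A small variant of the patch hook above implements zero (or noise) ablation. This continues the section 2 sketch, reusing model, block, POS, clean, and paris:

# Ablation: overwrite the component's output with zeros (or noise) instead of clean values.
def ablate_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POS, :] = 0.0                                            # zero ablation
    # hidden[:, POS, :] += 0.5 * torch.randn_like(hidden[:, POS, :])   # or: noise ablation
    return (hidden,) + output[1:]

with torch.no_grad():
    baseline_logits = model(**clean).logits[0, -1]                     # unablated reference

handle = block.register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_logits = model(**clean).logits[0, -1]
handle.remove()

print("drop in logit(Paris):", (baseline_logits[paris] - ablated_logits[paris]).item())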

Clean (Counterfactual) Patching

Replace activations with those from a different, meaningful prompt:

Corrupted: "The capital of Germany is" → "Berlin"
Clean: "The capital of France is" → "Paris"
Patched: "Germany" + [France activations] → "Paris"?

Question answered: Is this component sufficient to change behavior from corrupted to clean?
Interpretation: If behavior changes, this component causally mediates the difference.

Choosing Your Strategy

Use Case | Strategy | Why
Find necessary components | Noise patching | See what breaks when removed
Find sufficient components | Clean patching | See what can transfer behavior
Understand factual recall | Clean patching | Isolate subject-specific processing
Test robustness | Noise patching | Find critical dependencies

4. Causal Effects: Formal Definitions

To reason precisely about interventions, we need formal definitions from causal inference.

Average Causal Effect (ACE)

The effect of changing one variable while holding others constant:

ACE = E[Y | do(X=x₁)] - E[Y | do(X=x₀)]

In neural networks: The difference in output when we intervene to set a component's activation to x₁ vs x₀.

Total Effect

The overall impact of X on Y, including all paths (direct and indirect):

X → Y (direct)
X → Z → Y (indirect through Z)

Total Effect = Direct Effect + Indirect Effects

Indirect Effect

The effect of X on Y that flows through intermediate variable Z:

Indirect Effect = E[Y | do(Z = value_when_X=x₁), X=x₀] - E[Y | X=x₀]

Example: How much does changing "France" to "Germany" at position 5 affect the output through its effect on layer 10's MLP?
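
These quantities map directly onto patching measurements. Continuing the section 2 sketch (patched_logits, corrupt, paris, and the illustrative LAYER/POS choice), the indirect effect of the France→Germany substitution through the patched state can be estimated as a probability difference:

import torch.nn.functional as F

with torch.no_grad():
    corrupt_logits = model(**corrupt).logits[0, -1]     # corrupted run, nothing patched

p_patched = F.softmax(patched_logits, dim=-1)[paris]    # corrupted run + clean patch at (LAYER, POS)
p_corrupt = F.softmax(corrupt_logits, dim=-1)[paris]
print("indirect effect through (LAYER, POS):", (p_patched - p_corrupt).item())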

5. ROME: Causal Tracing for Knowledge Localization

The ROME paper uses causal tracing to answer: Where is factual knowledge stored in GPT models?

The Experiment

  1. Clean prompt: "The Space Needle is located in the city of" → "Seattle"
  2. Corrupted prompt: Add noise to all token embeddings
  3. Systematic patching: Restore clean activations one component at a time
  4. Measure: Does the model recover the correct answer?

Key Findings

Causal tracing reveals a consistent processing flow for factual recall:

Subject tokens ("Space Needle") → middle MLP layers (roughly 5-10) at the last subject token retrieve the fact ("located in Seattle") → the final position outputs "Seattle"

The strongest restoration effects concentrate in mid-layer MLPs at the last token of the subject, suggesting that factual associations are stored there.

Methodology: Average Indirect Effect (AIE)

ROME measures the indirect effect of the subject through each component:

  1. Run corrupted prompt (all noise)
  2. Patch clean activations at state S (e.g., layer 8, position "Needle")
  3. Measure: how much does this restore correct output?
  4. Repeat for all states S
  5. States with high AIE are causally important
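
A compact causal-tracing sketch in the same style follows, reusing model and tok from section 2. The noise scale and the hard-coded subject positions are illustrative simplifications of ROME's recipe, and a real run would average over several noise samples and many prompts:

prompt = "The Space Needle is located in the city of"
inputs = tok(prompt, return_tensors="pt")
seattle = tok(" Seattle")["input_ids"][0]
subj_positions = [1, 2]          # token positions covering "Space Needle" (verify against the tokenizer)

# Cache the clean hidden state at every layer.
clean_states = {}
def make_save(layer):
    def hook(module, inp, out):
        clean_states[layer] = out[0].detach()
    return hook
handles = [blk.register_forward_hook(make_save(i)) for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    model(**inputs)
for h in handles:
    h.remove()

# Corrupt the subject by adding noise to its input embeddings.
embeds = model.transformer.wte(inputs["input_ids"])
noised = embeds.clone()
noised[:, subj_positions, :] += 0.5 * torch.randn_like(noised[:, subj_positions, :])

def prob_seattle(restore_hook=None, layer=None):
    handle = model.transformer.h[layer].register_forward_hook(restore_hook) if restore_hook else None
    with torch.no_grad():
        logits = model(inputs_embeds=noised).logits[0, -1]
    if handle:
        handle.remove()
    return torch.softmax(logits, dim=-1)[seattle].item()

baseline = prob_seattle()        # corrupted run, nothing restored

# Restore the clean state at a single (layer, position) and measure recovery.
def make_restore(layer, pos):
    def hook(module, inp, out):
        hidden = out[0].clone()
        hidden[:, pos, :] = clean_states[layer][:, pos, :]
        return (hidden,) + out[1:]
    return hook

effects = {}
for layer in range(len(model.transformer.h)):
    for pos in range(inputs["input_ids"].shape[1]):
        effects[(layer, pos)] = prob_seattle(make_restore(layer, pos), layer) - baseline

best = max(effects, key=effects.get)
print("largest recovery at (layer, position):", best, effects[best])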

6. Entity Tracking and Binding Vectors

The entity tracking work (finetuning.baulab.info) studied a different question: How do models track which attributes belong to which entities?

The Setup

"The tall person and the short person walked into the room.
The tall person sat down."

Question: How does the model remember "tall" is bound to the first person when processing "The tall person sat down"?

Findings: Binding Vectors

Entity-attribute bindings are represented as context-specific activation patterns that are computed and routed largely by attention heads rather than stored in MLP weights. Patching these activations between prompts can change which attribute the model associates with which entity.

Contrast with ROME

Aspect | ROME (Factual Knowledge) | Entity Tracking (Bindings)
Storage | MLP layers | Attention heads
Layer | Middle layers (5-10) | Various layers, task-dependent
What's stored | Long-term facts ("Seattle") | Context-specific bindings ("tall" ↔ person 1)
Localization | Highly localized (specific MLPs) | More distributed (multiple heads)

Lesson: Different types of information are stored in different architectural components. Facts go in MLPs, bindings in attention.

7. Gradient-Based Attribution Patching

Testing every possible component via patching is expensive: each candidate requires its own forward pass. Gradient-based attribution (attribution patching) approximates the effect of patching every activation at once from a single backward pass.

The Idea

Instead of actually patching every activation, approximate the effect using gradients:

Attribution(activation) ≈ gradient(loss w.r.t. activation) × (clean_value - corrupted_value)

This gives a first-order approximation of how much patching that activation would change the output.

Algorithm

  1. Run clean and corrupted forward passes, save all activations
  2. Run corrupted pass again, computing gradients w.r.t. output loss
  3. For each activation: attribution = gradient × Δactivation
  4. Activations with high attribution are predicted to be important
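
A sketch of attribution patching over residual-stream states, reusing model, clean, corrupt, paris, and berlin from section 2. The metric (the Paris-vs-Berlin logit gap) is an illustrative choice; any differentiable measure of the behavior works:

clean_acts, corrupt_acts = {}, {}

def make_cache(store, layer):
    def hook(module, inp, out):
        store[layer] = out[0]
        if store is corrupt_acts:
            out[0].retain_grad()         # keep gradients on the corrupted activations
    return hook

# Clean pass: cache activations (no gradients needed).
handles = [blk.register_forward_hook(make_cache(clean_acts, i))
           for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    model(**clean)
for h in handles:
    h.remove()

# Corrupted pass: cache activations and backpropagate the metric through them.
handles = [blk.register_forward_hook(make_cache(corrupt_acts, i))
           for i, blk in enumerate(model.transformer.h)]
logits = model(**corrupt).logits[0, -1]
(logits[paris] - logits[berlin]).backward()
for h in handles:
    h.remove()

# Attribution ≈ gradient × (clean − corrupted), summed over the hidden dimension:
# one estimated patching effect per (layer, position), from a single backward pass.
attribution = {
    layer: (corrupt_acts[layer].grad * (clean_acts[layer] - corrupt_acts[layer].detach()))
           .sum(-1).squeeze(0)
    for layer in corrupt_acts
}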

Advantages

One clean pass, one corrupted pass, and a single backward pass yield importance estimates for every activation at once, rather than one patched forward pass per component. This makes it feasible to screen all layers, heads, and positions of a large model.

Limitations

The estimate is only a first-order (linear) approximation: it can be misleading when the patched change is large, the computation is strongly nonlinear, or gradients saturate. Treat high-attribution activations as candidates and confirm them with real patches.

8. Average Indirect Effect (AIE) for Systematic Search

AIE provides a systematic framework for finding all causally important components.

The Method

For each component (layer, head, neuron):

  1. Start with corrupted run
  2. Patch only that component with clean activations
  3. Measure effect on output: AIE = P(correct | patch) - P(correct | no patch)
  4. Repeat across many examples
  5. Components with high average AIE are important
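
In code the loop is straightforward once single-example patching is in place. patched_prob and corrupted_prob below are hypothetical helpers that wrap the hook-based patching from section 2 and return P(correct answer):

def average_indirect_effect(dataset, layer, pos):
    # dataset: list of (clean_prompt, corrupted_prompt, answer_token) triples
    effects = []
    for clean_prompt, corrupted_prompt, answer_token in dataset:
        p_patch = patched_prob(clean_prompt, corrupted_prompt, answer_token, layer, pos)
        p_base = corrupted_prob(corrupted_prompt, answer_token)
        effects.append(p_patch - p_base)
    return sum(effects) / len(effects)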

Hierarchical Search

Use AIE hierarchically to narrow down:

  1. Test each layer → find important layers
  2. Test each component in important layers → find important heads/neurons
  3. Test positions × components → find spatiotemporal importance

Interpretation

Components with consistently high average AIE across many examples are causally important for the task; effects near zero indicate components that are uninvolved or redundant. Averaging over examples guards against effects that are idiosyncratic to a single prompt.

9. Function Vectors: Elegant Application of Patching

Function vectors encode specific computational functions (like "negate" or "compare") and can be extracted through causal intervention.

Core Idea

If a function is represented as a vector, adding/subtracting it should enable/disable that function:

Original: "The tower is tall" → "tall"
+ Negation vector: "The tower is tall" → "short"
- Negation vector: (might enhance affirmation)

Finding Function Vectors

  1. Create pairs: Sentences that differ only in the target function
    • "The tower is tall" / "The tower is short" (negation)
    • "Paris is larger than Lyon" / "Lyon is smaller than Paris" (comparison reversal)
  2. Extract activations: Get activation differences at various layers/heads
  3. Find direction: Compute mean difference across pairs
  4. Test causally: Add vector to new examples, verify it performs the function
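
The sketch below follows this recipe in simplified form, reusing model and tok from section 2. Todd et al. construct function vectors from the outputs of specific attention heads; for brevity this version uses the residual stream at a single layer, and the layer, pairs, scaling, and test prompt are all illustrative choices:

FV_LAYER = 9
pairs = [("The tower is tall", "The tower is short"),
         ("The road is wide", "The road is narrow")]

def last_token_state(prompt, layer):
    store = {}
    def hook(module, inp, out):
        store["h"] = out[0][:, -1, :].detach()
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return store["h"]

# Mean activation difference across pairs: a candidate "flip to the antonym" direction.
diffs = [last_token_state(b, FV_LAYER) - last_token_state(a, FV_LAYER) for a, b in pairs]
function_vector = torch.stack(diffs).mean(0)

# Test causally: add the vector at the same layer while running a new prompt.
def add_fv(module, inp, out):
    hidden = out[0].clone()
    hidden[:, -1, :] += function_vector        # may need rescaling in practice
    return (hidden,) + out[1:]

def top_next(prompt, hook=None):
    handle = model.transformer.h[FV_LAYER].register_forward_hook(hook) if hook else None
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    if handle:
        handle.remove()
    return tok.decode(logits.argmax().item())

print("without vector:", top_next("The tower is"))
print("with vector:   ", top_next("The tower is", hook=add_fv))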

Function Vector Attention Heads

Some attention heads specialize in computing specific functions. You can identify them by measuring each head's average indirect effect: patch the head's output from prompts that demonstrate the function into prompts that do not, and check whether the function gets executed. Heads with consistently high effects across many examples and tasks are candidate function-vector heads.

Applications

Summing the outputs of these heads gives a compact function vector that can trigger the behavior zero-shot in new contexts, and comparing function vectors across tasks hints at which operations the model treats as related.

10. Designing Counterfactual Datasets

Effective causal experiments require carefully designed counterfactual pairs.

Principles

1. Minimal Pairs: Change only the target variable

Good: "France" / "Germany" (minimal change)
Bad: "France" / "The United States of America" (length differs)

2. Matched Structure: Keep syntax and structure identical

Good: "The capital of France" / "The capital of Germany"
Bad: "France's capital" / "The capital of Germany" (different structure)

3. Clear Causation: The changed variable should clearly cause the output difference

Good: "happy" / "sad" → sentiment changes
Bad: "happy Tuesday" / "sad Wednesday" → multiple changes

4. Sufficient Diversity: Test across varied contexts

Common Patterns

Subject Substitution:

The [subject] is located in [object]

Attribute Swapping:

The [adjective] person walked / The [different adjective] person walked

Negation Toggle:

The tower is tall / The tower is not tall

Relation Reversal:

A is greater than B / A is less than B

Validation

Test your dataset before running patching experiments:

Confirm the model actually behaves differently on the two members of each pair (different top answers or a clear logit gap)
Check that paired prompts tokenize to the same length so positions align between runs
Spot-check pairs by hand for unintended differences beyond the target variable
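
A quick, self-contained check along these lines (the pair shown, the single-token answers, and the GPT-2 model are illustrative assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

# Each entry: (clean prompt, clean answer, corrupted prompt, corrupted answer).
pairs = [("The capital of France is", " Paris", "The capital of Germany is", " Berlin")]

def answer_logit(prompt, answer):
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    return logits[tok(answer)["input_ids"][0]].item()   # assumes the answer is one token

for clean_p, clean_a, corr_p, corr_a in pairs:
    same_len = len(tok(clean_p)["input_ids"]) == len(tok(corr_p)["input_ids"])
    clean_margin = answer_logit(clean_p, clean_a) - answer_logit(clean_p, corr_a)
    corr_margin = answer_logit(corr_p, corr_a) - answer_logit(corr_p, clean_a)
    print(f"lengths match: {same_len} | clean margin: {clean_margin:.2f} | corrupted margin: {corr_margin:.2f}")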

Putting It All Together: A Research Workflow

  1. Hypothesis: Formulate what you think the model is doing
  2. Dataset: Design counterfactual pairs testing your hypothesis
  3. Baseline: Verify the model shows the target behavior
  4. Coarse search: Use AIE to find important layers/components
  5. Gradient attribution: Narrow down to specific activations
  6. Causal validation: Patch top candidates, measure effects
  7. Interpretation: Build mechanistic story from findings
  8. Generalization: Test on new examples/tasks

Note on validation: Causal intervention is powerful, but how do we know our interpretations are actually correct? Week 10 covers a comprehensive validation framework including faithfulness testing, sanity checks, and common pitfalls to avoid.

In-Class Exercise: Where Is Humor Localized?

Building on our pun dataset and representation analysis from previous weeks, we will now use causal tracing to determine where in the model pun understanding is actually computed.

Part 1: Setting Up Counterfactuals (15 min)

Create minimal pairs for causal intervention:

  1. Select pun pairs: For each pun, create a non-pun version that is structurally identical
    • Pun: "Time flies like an arrow; fruit flies like a banana"
    • Non-pun: "Time passes like an arrow; fruit falls like a stone"
  2. Verify behavior difference: Confirm the model shows different logit patterns for pun vs non-pun
  3. Prepare 10-15 pairs for patching experiments

Part 2: Causal Tracing for Puns (25 min)

Apply ROME-style causal tracing to locate pun processing:

  1. Corrupt baseline: Run the pun with noise added to embeddings
  2. Systematic restoration: For each (layer, position) pair:
    • Restore clean activations only at that location
    • Measure: How much does this restore the "pun signature" in model outputs?
  3. Create heatmap: Plot causal importance across layers × positions

Key questions:

Is pun processing concentrated at particular layers or token positions (for example, the ambiguous word)?
Does restoring mid-layer activations recover the pun signature, the way mid-layer MLPs recover facts in ROME?
Is the effect sharply localized, or spread across many (layer, position) states?

Part 3: Cross-Patching Experiments (20 min)

Use activation patching to test specific hypotheses:

  1. Pun → Non-pun patching:
    • Run the non-pun sentence
    • Patch in activations from the pun version at key locations
    • Does this make the model treat the non-pun as a pun?
  2. Component-specific patching:
    • Patch only MLP outputs vs. only attention outputs
    • Which component type carries the "pun signal"?
  3. Compare to your Week 4 findings:
    • Do the causally important layers match where you found the best pun/non-pun separation?
    • Does your "pun direction" from Week 4 align with the causal structure?

Discussion: How does pun localization compare to factual knowledge localization (ROME)? Are semantic concepts like humor processed similarly to factual associations?

Open Causal Tracing Notebook in Colab

Code Exercise

This week's exercise provides hands-on experience with causal intervention:

Open Exercise in Colab

Project Milestone

Due: Thursday of Week 4

Use activation patching and causal intervention techniques to localize where your concept is computed. Move beyond correlation to establish causal relationships between components and concept processing.

Causal Intervention Experiments

Deliverables:

These causal findings will guide Week 5's probe training: you'll focus on the layers and positions identified as causally important here.