Week 7: Attribution

Learning Objectives

Required Readings

Supplementary Readings

1. What is Input Attribution?

The Core Question

Input attribution answers: "Which parts of the input were most important for the model's output?"

For an LLM generating text, attribution identifies which input tokens influenced each output token. This helps us:

Attribution vs Other Interpretability Methods

Method | Question Answered | Granularity
------ | ----------------- | -----------
Attribution (this week) | Which inputs matter? | Input tokens
Circuits (Week 5) | Which components compute the output? | Attention heads, MLPs
Probes (Week 6) | What information is encoded? | Layer activations
SAEs (Week 7) | What features are represented? | Sparse feature directions
Causal Validation (Week 8) | Are findings causal? | Variable-level

Key insight: Attribution is input-level explanation. It tells you what the model used, not how the model processed it internally. Combine attribution with circuit analysis (Week 5) to get the full picture.

2. Gradient-Based Attribution Methods

Gradient-based methods use the model's gradients to measure input importance. Intuitively: if changing an input token would change the output a lot (high gradient), that token is important.

2.1 Saliency Maps (Vanilla Gradients)

Method: Saliency Maps

Idea: The gradient magnitude indicates importance.

Formula: \( \text{Saliency}(x_i) = \left| \frac{\partial f(x)}{\partial x_i} \right| \)

Pros: Simple, fast, easy to implement

Cons: Suffers from gradient saturation (near-zero gradients in deep networks)

Example:

Input: "The cat sat on the mat"

Output: "because" (predicted next token)

Saliency scores: [0.02, 0.35, 0.15, 0.08, 0.12, 0.40]

Interpretation: "cat" and "mat" have highest gradients → most influential for predicting "because"

Problem: Gradient Saturation

Deep networks with ReLU/sigmoid activations often have near-zero gradients even when inputs are important. This happens because the function flattens (saturates) in certain regions.

Solution: Use Integrated Gradients or other methods that accumulate gradients along a path.

2.2 Input × Gradient

Method: Input × Gradient

Idea: Weight gradients by input magnitude to get better signal.

Formula: \( \text{Attribution}(x_i) = x_i \cdot \frac{\partial f(x)}{\partial x_i} \)

Pros: Better handles zero-input cases than vanilla gradients

Cons: Still affected by saturation
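Because the two methods above differ only in how the gradient is post-processed, they can share one implementation. Below is a minimal sketch for a Hugging Face causal LM (the GPT-2 checkpoint and example sentence are illustrative): gradients are taken with respect to the input embeddings, and per-token scores are obtained by aggregating over the embedding dimension (the L2 norm for saliency, the dot product with the embedding for Input × Gradient).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The cat sat on the mat", return_tensors="pt").input_ids

# Gradients are taken w.r.t. the continuous input embeddings, not the discrete token ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits
target_id = logits[0, -1].argmax()        # attribute the model's top next-token prediction
logits[0, -1, target_id].backward()

grads = embeds.grad[0]                                # (seq_len, hidden_dim)
saliency = grads.norm(dim=-1)                         # vanilla gradient magnitude per token
input_x_grad = (embeds[0] * grads).sum(dim=-1)        # Input × Gradient per token

for token, s, ixg in zip(tok.convert_ids_to_tokens(ids[0].tolist()), saliency, input_x_grad):
    print(f"{token:>10}  saliency={s.item():.3f}  input_x_grad={ixg.item():.3f}")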

2.3 Integrated Gradients (IG)

Method: Integrated Gradients

Idea: Accumulate gradients along a straight line from a baseline to the input.

Formula: \[ \text{IG}(x_i) = (x_i - x_i') \int_{\alpha=0}^{1} \frac{\partial f(x' + \alpha \cdot (x - x'))}{\partial x_i} \, d\alpha \] where \( x' \) is a baseline input (e.g., all zeros or padding tokens).

Pros: Theoretically sound (satisfies completeness axiom), mitigates saturation

Cons: Computationally expensive, baseline-dependent, assumes linear path

Why Integrated Gradients Works

By integrating gradients along a path, IG avoids the saturation problem of vanilla gradients. Even if the gradient is zero at the input, IG captures importance by looking at gradients throughout the interpolation.

Completeness property: The sum of attributions equals the difference in model output between input and baseline: \[ \sum_i \text{IG}(x_i) = f(x) - f(x') \]
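Below is a minimal from-scratch sketch of IG with a zero-embedding baseline and a Riemann-sum approximation of the integral, including the completeness check above. In practice you would use a library such as Captum or Inseq (Section 5); the model, the number of steps, and the choice of baseline here are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The cat sat on the mat", return_tensors="pt").input_ids
x = model.get_input_embeddings()(ids).detach()   # input embeddings
baseline = torch.zeros_like(x)                   # zero-embedding baseline (one choice; see 2.4)

with torch.no_grad():
    target_id = model(inputs_embeds=x).logits[0, -1].argmax()   # fix the target token up front

def f(embeds):
    return model(inputs_embeds=embeds).logits[0, -1, target_id]

# Approximate the path integral with an m-step Riemann sum
m = 64
total_grad = torch.zeros_like(x)
for k in range(1, m + 1):
    point = (baseline + (k / m) * (x - baseline)).requires_grad_(True)
    f(point).backward()
    total_grad += point.grad
ig = (x - baseline) * total_grad / m             # per-dimension attributions
token_scores = ig[0].sum(dim=-1)                 # one score per input token

# Completeness check: attributions should sum (approximately) to f(x) - f(baseline)
with torch.no_grad():
    print(token_scores.sum().item(), (f(x) - f(baseline)).item())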

2.4 The Baseline Selection Problem

A critical challenge for IG: what baseline should we use?

Baseline choices for LLMs:

Impact: Baseline choice can dramatically change attribution scores. Always justify your baseline!

2.5 Uniform Discretized Integrated Gradients (UDIG)

Method: UDIG (2024)

Problem with standard IG: Linear interpolation through embedding space doesn't respect the discrete, linguistic nature of words. Intermediate points may not correspond to real words.

Solution: Use a nonlinear path that stays closer to actual word embeddings.

Result: Better performance on NLP tasks (sentiment, QA) compared to standard IG.

When to use: For language models where discrete token structure matters.

2.6 Other Gradient Methods

3. Perturbation-Based Attribution Methods

Perturbation methods modify inputs and observe how outputs change. No gradients needed—purely empirical.

3.1 Occlusion / Ablation

Method: Token Ablation

Idea: Remove (or mask) each token and measure output change.

Procedure:

  1. Run model on full input → get output probability \( p \)
  2. For each token \( i \):
    • Remove or mask token \( i \)
    • Run model → get new probability \( p_i \)
    • Attribution \( = p - p_i \) (drop in probability)

Pros: Intuitive, model-agnostic, no gradient computation

Cons: Computationally expensive (\( O(n) \) forward passes for \( n \) tokens), may create unnatural inputs
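A minimal sketch of this procedure for a causal LM, deleting one token at a time and measuring the drop in the probability of the model's original top prediction (masking instead of deleting is a one-line change). The model and example text are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The cat sat on the mat", return_tensors="pt").input_ids[0]

def next_token_prob(token_ids, target_id):
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0)).logits[0, -1]
    return torch.softmax(logits, dim=-1)[target_id].item()

# Probability of the model's top prediction on the full input
with torch.no_grad():
    target_id = model(ids.unsqueeze(0)).logits[0, -1].argmax()
p_full = next_token_prob(ids, target_id)

# One forward pass per token: attribution = drop in probability when that token is removed
for i in range(len(ids)):
    ablated = torch.cat([ids[:i], ids[i + 1:]])
    drop = p_full - next_token_prob(ablated, target_id)
    print(f"{tok.decode(ids[i].item()):>8}  attribution = {drop:+.4f}")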

3.2 LIME (Local Interpretable Model-Agnostic Explanations)

Method: LIME

Idea: Fit a simple linear model locally around the input to approximate the model's behavior.

Procedure:

  1. Generate perturbed inputs by randomly masking tokens
  2. Run model on perturbed inputs to get outputs
  3. Fit a weighted linear model \( g \) such that \( g(x') \approx f(x') \) for perturbed inputs \( x' \) in the neighborhood of \( x \)
  4. Linear coefficients = token importance

Pros: Model-agnostic, interpretable coefficients

Cons: Requires many model calls, local approximation may be inaccurate, random sampling variability
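A LIME-style sketch of this procedure using a weighted ridge regression as the local surrogate (libraries such as lime package this up, but the logic is easy to spell out). The choice of mask token, number of samples, and kernel weighting below are arbitrary illustrative choices, not library defaults.

import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The cat sat on the mat", return_tensors="pt").input_ids[0]
n = len(ids)
mask_id = tok.eos_token_id               # GPT-2 has no [MASK] token; reuse EOS as a crude mask

with torch.no_grad():
    target_id = model(ids.unsqueeze(0)).logits[0, -1].argmax()

def prob_with_mask(keep):                # keep: 0/1 vector, 0 = token replaced by mask_id
    masked = torch.where(torch.tensor(keep, dtype=torch.bool), ids, torch.tensor(mask_id))
    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0, -1]
    return torch.softmax(logits, dim=-1)[target_id].item()

# Steps 1-2: sample perturbations and query the model
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(200, n))                 # random keep/drop patterns
y = np.array([prob_with_mask(z) for z in Z])
weights = np.exp(-((n - Z.sum(axis=1)) ** 2) / (2 * (n / 2) ** 2))  # closer to original input -> higher weight

# Steps 3-4: fit a weighted linear surrogate; coefficients are token importances
surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
for token, coef in zip(tok.convert_ids_to_tokens(ids.tolist()), surrogate.coef_):
    print(f"{token:>10}  importance = {coef:+.4f}")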

3.3 SHAP (Shapley Additive Explanations)

Method: SHAP

Idea: Use game-theoretic Shapley values to assign fair credit to each token.

Formula: For each token \( i \), compute contribution by averaging over all possible subsets: \[ \phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! (|N| - |S| - 1)!}{|N|!} [f(S \cup \{i\}) - f(S)] \]

Pros: Theoretically grounded (unique solution satisfying fairness axioms), consistent

Cons: Exponentially expensive to compute exactly (approximations needed), requires defining "coalition" semantics for tokens
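The formula can be implemented directly for small inputs. The sketch below enumerates every subset, which costs O(2^n) model calls and is only feasible for a handful of tokens; libraries such as shap rely on sampling-based approximations instead. Here value_fn is any set function of your choosing, e.g. the masked-probability function from the LIME sketch above applied to the set of kept tokens.

from itertools import combinations
from math import factorial

def shapley_values(n, value_fn):
    """Exact Shapley values for n tokens.

    value_fn maps a frozenset of kept token indices to a scalar model output
    (e.g. the probability of the target token with all other tokens masked).
    Cost is O(2^n) calls to value_fn, so this is only for tiny inputs.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value_fn(frozenset(S) | {i}) - value_fn(frozenset(S)))
    return phi

# Toy example: a "model" whose output is just the sum of the kept token indices
print(shapley_values(3, lambda kept: float(sum(kept))))   # -> [0.0, 1.0, 2.0]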

Critical Limitation: Additivity Assumption

Recent research (2024) argues that, because of attention and LayerNorm, transformers cannot be faithfully represented by additive models. This casts doubt on the applicability of LIME and SHAP to transformers, since both assume local additivity.

Implication: Use LIME/SHAP cautiously for transformers. Validate with other methods.

3.4 ReAGent (Replace with Alternatives)

Method: ReAGent

Idea: Replace each token with plausible alternatives (e.g., from a masked LM) rather than just removing it.

Advantage: More natural perturbations than simple masking.

When to use: When you want to measure importance without creating out-of-distribution inputs.

4. Attention-Based Attribution

Can we use attention weights as attribution? The answer is complicated.

4.1 Raw Attention Weights

Problem: Attention is Not Explanation

In transformers, information from different tokens gets increasingly mixed across layers through attention and residual connections. By the final layer, it's unclear which input tokens contributed to a representation.

Raw attention weights show what the model attends to in a single layer, but they don't track information flow from input to output across the full network.

When raw attention is useful:

4.2 Attention Rollout

Method: Attention Rollout (Abnar & Zuidema, 2020)

Idea: Propagate attention weights through layers, accounting for residual connections.

Assumption: Token identities are linearly combined based on attention weights.

Formula: Roll out attention from layer \( \ell \) to input: \[ \tilde{A}^{(\ell)} = A^{(\ell)} \cdot \tilde{A}^{(\ell-1)} \] where \( A^{(\ell)} \) is attention at layer \( \ell \), adjusted for residual connections.

Pros: Accounts for multi-layer information flow

Cons: Linear approximation may be inaccurate for nonlinear models
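A minimal sketch of rollout, assuming you already have a list of per-layer attention matrices (with Hugging Face models, model(ids, output_attentions=True).attentions returns one (batch, heads, seq, seq) tensor per layer; pass the head dimension of each). Following the commonly used recipe, the residual connection is approximated by averaging each attention matrix with the identity before composing layers; that 0.5/0.5 split is an implementation choice.

import torch

def attention_rollout(attentions):
    """attentions: list of (heads, seq, seq) attention matrices, one per layer, already softmaxed."""
    seq_len = attentions[0].shape[-1]
    rollout = torch.eye(seq_len)
    for A in attentions:
        A = A.mean(dim=0)                          # average over heads
        A = 0.5 * A + 0.5 * torch.eye(seq_len)     # account for the residual connection
        A = A / A.sum(dim=-1, keepdim=True)        # re-normalize rows
        rollout = A @ rollout                      # compose with the layers below
    return rollout     # rollout[i, j] ~ how much input token j feeds into position i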

4.3 Attention Flow

Method: Attention Flow (Abnar & Zuidema, 2020)

Idea: Model the attention graph as a flow network and compute maximum flow from each input token to the output.

Algorithm: Use max-flow algorithms (e.g., Ford-Fulkerson) with attention weights as edge capacities.

Pros: Captures multi-path information flow

Cons: Computationally more expensive than rollout
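A sketch of attention flow using networkx for the max-flow computation. It builds a layered graph whose edge capacities are the (residual-adjusted) attention weights and computes, for each input token, the maximum flow to a chosen output position at the final layer. The residual adjustment mirrors the rollout sketch above and is an implementation choice, not part of the max-flow algorithm itself.

import networkx as nx
import torch

def attention_flow(attentions, output_pos):
    """attentions: list of (seq, seq) head-averaged attention matrices, one per layer."""
    num_layers, seq_len = len(attentions), attentions[0].shape[-1]
    G = nx.DiGraph()
    for layer, A in enumerate(attentions):
        A = 0.5 * A + 0.5 * torch.eye(seq_len)     # residual connection, as in rollout
        A = A / A.sum(dim=-1, keepdim=True)
        for i in range(seq_len):                   # edge: position j at this layer -> position i above
            for j in range(seq_len):
                if A[i, j] > 0:
                    G.add_edge((layer, j), (layer + 1, i), capacity=float(A[i, j]))
    sink = (num_layers, output_pos)
    # Max flow from each input token (layer 0) to the output position at the top layer
    return [nx.maximum_flow_value(G, (0, j), sink) for j in range(seq_len)]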

Attention Rollout vs Attention Flow

Aspect | Attention Rollout | Attention Flow
------ | ----------------- | --------------
Assumption | Linear mixing of identities | Flow network
Computation | Matrix multiplication (fast) | Max-flow algorithm (slower)
Multi-path | Averages paths | Finds bottlenecks
Use case | Quick approximate attribution | Detailed flow analysis

Key finding: Both methods correlate better with ablation and gradient-based importance measures than raw attention weights do.

5. The Inseq Library: Practical Tool

Inseq is a Python library that makes attribution analysis accessible for sequence generation models.

5.1 What Inseq Provides

5.2 Example Usage

import inseq

# Load model
model = inseq.load_model("gpt2", "integrated_gradients")

# Run attribution
result = model.attribute(
    "The cat sat on the",
    generation_args={"max_new_tokens": 5},  # generate 5 tokens, then attribute each one
    step_scores=["probability"]
)

# Visualize
result.show()

5.3 Key Pedagogical Examples from Inseq

Example 1: Gender Bias in Machine Translation

Task: Translate "The doctor asked the nurse to help" from English to gendered language (e.g., Italian).

Question: Does the model assign gendered pronouns based on stereotypes?

Attribution reveals: Which input words drive gendered predictions ("doctor" vs "nurse").

Example 2: Factual Knowledge Localization in GPT-2

Input: "The capital of France is"

Output: "Paris"

Attribution reveals: The model relies heavily on "France" (expected) but may also draw on "capital", showing that it recognizes the prompt structure.

6. Comparing Attribution Methods

Different methods can give different results. How do we know which to trust?

6.1 Validation Techniques

1. Perturbation Test (Ground Truth)

Remove the tokens with high attribution scores. The output should change more than it does when low-attribution tokens are removed.

Metric: Correlation between attribution scores and output change when tokens are removed.
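A small helper for this metric, assuming you already have one attribution score per token and the per-token ablation drops from the sketch in Section 3.1. Spearman rank correlation is used here, though any rank-agreement measure would do.

from scipy.stats import spearmanr

def perturbation_test(attribution_scores, ablation_drops):
    """Rank-correlate attribution scores with the output drop caused by removing each token.

    attribution_scores: one score per token from the method under test.
    ablation_drops: p(full input) - p(input with token i removed), as in Section 3.1.
    """
    rho, p_value = spearmanr(attribution_scores, ablation_drops)
    return rho, p_value

# Toy example: high-attribution tokens should cause the largest drops when removed
print(perturbation_test([0.40, 0.35, 0.02, 0.08], [0.30, 0.28, 0.01, 0.05]))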

2. Sanity Checks

3. Cross-Method Agreement

Check if multiple methods (e.g., IG, LIME, ablation) agree on which tokens are important. High agreement increases confidence.

6.2 Method Comparison Summary

Method | Speed | Accuracy | Interpretability | LLM-Specific Issues
------ | ----- | -------- | ---------------- | -------------------
Saliency | ⚡ Fast | ⚠️ Saturation | ✓ Clear | Unreliable in deep networks
Integrated Gradients | ⚠️ Slow | ✓✓ Good | ✓ Clear | Baseline-dependent, linear path
UDIG | ⚠️ Slow | ✓✓✓ Best | ✓ Clear | Designed for discrete inputs
Ablation | ⚠️⚠️ Very slow | ✓✓✓ Ground truth | ✓✓ Very clear | Unnatural inputs (gaps)
LIME | ⚠️⚠️ Very slow | ⚠️ Variable | ✓✓ Linear model | Additivity assumption violated
SHAP | ⚠️⚠️⚠️ Extremely slow | ✓ Theoretically sound | ✓✓ Game-theoretic | Additivity assumption violated
Attention Rollout | ⚡⚡ Very fast | ⚠️ Approximate | ⚠️ Hard to interpret | Linear approximation
Attention Flow | ⚡ Fast | ✓ Better than rollout | ⚠️ Hard to interpret | Flow network assumption

6.3 Recommendations

7. Technical Challenges and Solutions

7.1 Gradient Saturation

Problem: Gradients near zero despite input importance.

Solutions:

7.2 Computational Cost

Problem: IG requires ~50-300 forward passes; SHAP/LIME even more.

Solutions:

7.3 Long Sequences

Problem: Attributing 1000+ token inputs is expensive and overwhelming.

Solutions:

7.4 Discrete vs Continuous Inputs

Problem: Text is discrete, but IG assumes continuous interpolation.

Solutions:

8. Applications to Your Research

8.1 Debugging Model Predictions

Scenario: Your model incorrectly predicts subject-verb agreement.

Input: "The key to the cabinets are" (incorrect, should be "is")

Use attribution to investigate:

  1. Run IG to see which input tokens influenced the "are" prediction
  2. If attribution is high on "cabinets" (the distractor), the model is attending to the wrong noun
  3. This guides the intervention: strengthen subject tracking (relate to Week 5 circuits)

8.2 Analyzing Your Concept

For your project concept, attribution helps answer:

8.3 Guiding Circuit Discovery

Workflow combining attribution (this week) with circuit discovery (Week 5):

  1. Attribution (this week): Identify which input tokens matter for your concept
  2. Attention analysis: Which heads attend to those important tokens?
  3. Path patching (Week 5): Test if those heads causally compute the concept
  4. Validation (Week 8): Use IIA to confirm the circuit

9. Integration with Previous Weeks

Connecting Attribution to Other Methods

• Week 5 (Circuits): Use attribution to identify which input tokens activate your circuit. Example: if a circuit computes induction, attribution should show high scores on the repeated token.
• Week 6 (Probes): Compare probe predictions with attribution. If a probe detects a feature, attribution should show which inputs provided that information. Example: the probe detects "plural subject" → attribution should highlight the plural noun.
• Week 7 (SAEs): For an SAE feature representing your concept, use attribution to see which inputs activate it. Example: a "politeness feature" activates → attribution reveals which politeness markers (e.g., "please," "thank you") trigger it.
• Week 8 (Causal Validation): Attribution ≠ causation! Validate high-attribution tokens with interventions. Example: attribution says "France" is important → change "France" to "Italy" and verify the output changes to "Rome."

10. Limitations and Skepticism

Critical Limitations to Remember

When Attribution Fails

11. Best Practices for Your Research

Checklist for Using Attribution in Your Project

Part 2: Skepticism and Attribution Validation (50% of content)

Critical Reality Check

Attribution methods produce compelling visualizations. Saliency maps highlight features. Attention weights show where the model "looks." But do these explanations actually reflect how the model makes decisions? This section examines evidence that attribution methods can be misleading—and how to validate them rigorously.

1. Sanity Checks for Saliency Maps (Adebayo et al., 2018)

The Problem: Visual Plausibility ≠ Correctness

Saliency maps often highlight seemingly meaningful regions—faces in images, important words in text. But are these explanations faithful to the model's computation?

Adebayo et al. (2018) proposed simple sanity checks to test whether attribution methods are actually sensitive to the model and data they're supposed to explain.

Sanity Check 1: Model Parameter Randomization

Test: Randomly re-initialize the model's weights.

Expected behavior: A good explanation method should produce completely different explanations for a trained model vs. a random model.

Why it matters: If explanations look the same for both, the method is just detecting network architecture or input patterns, not what the model learned.

Example Application to LLMs:

import numpy as np
from transformers import GPT2Config, GPT2LMHeadModel

# Test your attribution method (compute_attribution is your own wrapper around any method from Part 1)
text = "The capital of France is"
trained_model = GPT2LMHeadModel.from_pretrained("gpt2")
random_model = GPT2LMHeadModel(GPT2Config())  # Randomly initialized weights, same architecture

attribution_trained = compute_attribution(trained_model, text)
attribution_random = compute_attribution(random_model, text)

# They should be substantially different!
assert not np.allclose(attribution_trained, attribution_random)

Sanity Check 2: Data Randomization

Test: Train the model on data with random labels.

Expected behavior: Explanations should look different for a model trained on meaningful data vs. random noise.

Why it matters: A model that memorized noise has fundamentally different internal mechanisms than one that learned patterns.

Key Findings from Adebayo et al.

Implication for Your Project:

Before trusting any attribution visualization, run these sanity checks. If your method fails, you're seeing artifacts of the attribution algorithm, not insights into your model.

2. Attention Is Not (Necessarily) Explanation (Jain & Wallace, 2019)

The Intuition vs. The Evidence

Attention mechanisms are everywhere in modern NLP. It's tempting to interpret attention weights as showing which input tokens are "important" for predictions. But Jain & Wallace (2019) systematically tested this assumption.

Three Tests of Attention as Explanation

Test 1: Correlation with Feature Importance

Compare attention weights with gradient-based measures of actual feature importance.

Result: Often uncorrelated. High attention ≠ high importance for prediction.

Test 2: Counterfactual Attention Distributions

Can you construct different attention patterns that produce identical outputs?

Result: Yes! Multiple contradictory attention distributions often yield the same prediction.

Implication: If multiple explanations are equally valid, which one is "true"?

Test 3: Adversarial Attention

Optimize attention weights to be maximally different while keeping outputs unchanged.

Result: Possible for many tasks. Attention can be manipulated without affecting predictions.

When Can You Trust Attention?

Use Case | Trustworthy? | Reason
-------- | ------------ | ------
Single-layer analysis | ✓ Moderate | Shows what that layer attends to (but not why)
Debugging attention patterns | ✓ Yes | Useful for finding bugs (attending to padding, etc.)
Input importance for output | ✗ No | Use attribution methods instead
Explanation for end users | ✗ Risky | May be unfaithful; validate first

Better Alternatives

3. The ROAR Benchmark: Quantifying Attribution Quality (Hooker et al., 2019)

The Challenge

We have many attribution methods (gradients, IG, SHAP, LIME, attention). How do we know which ones actually identify important features? Visual inspection is subjective and prone to confirmation bias.

ROAR: RemOve And Retrain

Hooker et al. (2019) proposed a quantitative benchmark:

The ROAR Protocol

  1. Train a model to convergence
  2. Use an attribution method to identify most important features
  3. Remove those features from the training data
  4. Retrain the model from scratch without those features
  5. Measure the drop in performance

Logic: If the method truly identifies important features, removing them should hurt performance more than removing random features.

Shocking Results

Conclusion: Just because an attribution method produces a plausible-looking heatmap doesn't mean it's identifying features that actually matter.

Adapting ROAR for LLMs

For text models, adapt ROAR as follows:

# ROAR for text attribution (protocol sketch, not runnable as-is)
# 1. Compute attribution scores for your dataset
# 2. Identify the top-k most important tokens per example
# 3. Mask or remove those tokens from the training data
# 4. Fine-tune (or retrain) the model without the important tokens
# 5. Measure the drop in performance

# Compare against a random-removal control
drop_with_important_removed = baseline_acc - roar_acc
drop_with_random_removed = baseline_acc - random_acc

# A good attribution method removes features the model actually relies on:
assert drop_with_important_removed > drop_with_random_removed

Cheaper Alternative: Perturbation Test

Full ROAR requires retraining (expensive). A faster proxy is the perturbation test from Section 6.1: mask the top-attributed tokens at inference time only and check that the output degrades more than when masking the same number of random tokens. This does not control for distribution shift the way retraining does, but it is cheap enough to run on every experiment.

4. Validation Checklist for Attribution Methods

Before trusting attribution results, verify:

Mandatory Checks

  1. ✓ Sanity checks (Adebayo):
    • Random model test: attributions differ from trained model
    • Random label test: attributions differ from properly trained model
  2. ✓ Perturbation validation:
    • Removing high-attribution features hurts performance
    • More than removing random features
    • Effect size is substantial
  3. ✓ Method agreement:
    • Compare at least 2-3 independent attribution methods
    • If they disagree, investigate why
    • Don't cherry-pick the method that gives desired results
  4. ✓ Baseline comparison:
    • Beats random feature selection
    • Beats frequency-based heuristics
    • Compare simple (gradients) vs complex (IG, SHAP) methods
  5. ✓ Completeness test (for IG):
    • Sum of attributions ≈ f(x) - f(baseline)
    • If not, something is wrong with implementation or baseline

5. Common Pitfalls and How to Avoid Them

Pitfall 1: Baseline Dependence (Integrated Gradients)

Problem: IG results can change dramatically with baseline choice.

Solution:

Pitfall 2: Additivity Assumptions (LIME, SHAP)

Problem: Transformers violate local additivity due to attention and LayerNorm.

Solution:

Pitfall 3: Out-of-Distribution Perturbations

Problem: Removing tokens creates unnatural inputs. Model behavior may not reflect normal operation.

Solution:

Pitfall 4: Confirmation Bias

Problem: Seeing what you expect in attribution maps.

Solution:

6. Best Practices for Rigorous Attribution Research

Do's

Don'ts

7. Integration with Your Research Project

Validating Your Concept Attribution

When attributing input importance for your concept:

Week 6 Validation Protocol

  1. Choose methods: Pick 2-3 attribution methods (e.g., IG, gradients, ablation)
  2. Run sanity checks:
    • Test on random model
    • Verify completeness (for IG)
  3. Compute attributions: For your concept-relevant predictions
  4. Validate:
    • Do methods agree on top features?
    • Perturbation test: removing features changes output
    • Compare to random baseline
  5. Causal integration:
    • Combine with Week 4 patching results
    • Do high-attribution tokens align with causally important components?
  6. Report honestly:
    • Show agreement and disagreement between methods
    • Document failure cases
    • Acknowledge limitations

8. Looking Ahead: More Skepticism in Week 10

This week introduced validation for attribution methods. But interpretability illusions go deeper:

The validation framework from Week 4, combined with attribution skepticism from this week, prepares you to critically evaluate any interpretability claim.

Summary: Attribution + Skepticism

Part 1 Takeaways (Attribution Methods)

Part 2 Takeaways (Skepticism)

For Your Project

References for Part 2 (Skepticism)

Core Papers

Nuance and Follow-ups

12. Summary and Next Steps

Key Takeaways

For Your Research Project

  1. Use attribution to identify which input features activate your concept
  2. Combine with circuit analysis (Week 5) to understand how those inputs are processed
  3. Validate findings with causal interventions (Week 8)
  4. Report attribution results in your paper with appropriate caveats

Looking Ahead

Week 10: Skepticism and interpretability illusions—when interpretability methods mislead us, and how to be rigorous.

References & Resources

Core Papers

Tools & Libraries

Supplementary Reading

In-Class Exercise: Which Words Make a Pun Punny?

Using attribution methods, we will identify which input tokens are most important for the model's "pun recognition"—helping us understand what linguistic cues trigger humor processing.

Part 1: Attribution Setup (15 min)

Prepare for attribution analysis on your pun dataset:

  1. Select examples: Choose 10 puns where the model shows clear pun recognition (from your probe analysis)
  2. Define target: Use your pun probe's output as the attribution target
    • Which inputs increase the "pun" score?
  3. Choose methods: We will compare Integrated Gradients, Input×Gradient, and Attention Rollout

Part 2: Compute and Compare Attributions (25 min)

Apply multiple attribution methods and analyze agreement:

  1. Run integrated gradients: Compute token-level attributions for each pun
  2. Visualize heatmaps: Which words have highest attribution?
    • Is it the punchline word (double meaning)?
    • Is it the setup words that create the context?
    • Do both parts contribute?
  3. Compare methods:
    • Do IG and Input×Gradient agree on important tokens?
    • How does Attention Rollout compare?
    • Compute correlation between method rankings
  4. Aggregate patterns:
    • Across all puns, what types of tokens are important?
    • Are punchlines always critical, or does context matter more?

Part 3: Validate with Interventions (20 min)

Test whether attribution accurately identifies important tokens:

  1. Ablate high-attribution tokens:
    • Remove or mask the top-3 highest-attribution tokens
    • Does the pun probe score drop significantly?
  2. Ablate low-attribution tokens:
    • Remove tokens with low attribution
    • Pun recognition should remain relatively stable
  3. Compute correlation:
    • Plot: attribution score vs. effect of ablating that token
    • If correlation is low, attribution may not be faithful

Discussion: What have you learned about how the model "sees" puns? Are the important tokens what you would expect, or are there surprises?

Open Pun Attribution Notebook in Colab

Project Milestone

Due: Thursday of Week 6

Apply attribution methods to identify which input tokens matter most for your concept. Use sanity checks to validate that your attribution methods are meaningful and not artifacts.

Attribution Analysis

Deliverables:

Attribution methods can be unreliable—always validate with sanity checks and interventions before drawing conclusions about which inputs matter.