Week 7: Attribution
Learning Objectives
- Understand what input attribution is and when to use it vs other interpretability methods
- Master gradient-based attribution methods (saliency, integrated gradients, DeepLIFT)
- Apply perturbation-based methods (LIME, SHAP, ablation) to LLMs
- Use attention-based attribution (attention rollout, attention flow)
- Work effectively with the Inseq library for sequence generation models
- Critically evaluate attribution methods and understand their limitations
- Address technical challenges (baseline selection, gradient saturation, computational cost)
- Apply attribution to debug model predictions and analyze your project concepts
- Integrate attribution with previous weeks' methods (circuits, probes, SAEs, causal validation)
Required Readings
Supplementary Readings
1. What is Input Attribution?
The Core Question
Input attribution answers: "Which parts of the input were most important for the model's
output?"
For an LLM generating text, attribution identifies which input tokens influenced each output token. This helps us:
- Debug predictions: Why did the model generate this specific word?
- Detect bias: Is the model relying on demographic features inappropriately?
- Validate understanding: Does the model attend to the right context?
- Guide interventions: Which inputs should we modify to change behavior?
Attribution vs Other Interpretability Methods
| Method | Question Answered | Granularity |
| --- | --- | --- |
| Attribution (This week) | Which inputs matter? | Input tokens |
| Circuits (Week 5) | Which components compute the output? | Attention heads, MLPs |
| Probes (Week 6) | What information is encoded? | Layer activations |
| SAEs (Week 7) | What features are represented? | Sparse feature directions |
| Causal Validation (Week 8) | Are findings causal? | Variable-level |
Key insight: Attribution is input-level explanation. It tells you what the model used, not how that information was processed internally. Combine attribution with circuit analysis (Week 5) to get the full picture.
2. Gradient-Based Attribution Methods
Gradient-based methods use the model's gradients to measure input importance. Intuitively: if changing an input
token would change the output a lot (high gradient), that token is important.
2.1 Saliency Maps (Vanilla Gradients)
Method: Saliency Maps
Idea: The gradient magnitude indicates importance.
Formula: \( \text{Saliency}(x_i) = \left| \frac{\partial f(x)}{\partial x_i} \right| \)
Pros: Simple, fast, easy to implement
Cons: Suffers from gradient saturation (near-zero gradients in deep networks)
Example:
Input: "The cat sat on the mat"
Output: "because" (predicted next token)
Saliency scores: [0.02, 0.35, 0.15, 0.08, 0.40, 0.12]
Interpretation: "cat" and "mat" have highest gradients → most influential for predicting "because"
Problem: Gradient Saturation
Deep networks with ReLU/sigmoid activations often have near-zero gradients even when inputs are important. This
happens because the function flattens (saturates) in certain regions.
Solution: Use Integrated Gradients or other methods that accumulate gradients along a path.
2.2 Input × Gradient
Method: Input × Gradient
Idea: Weight gradients by input magnitude to get better signal.
Formula: \( \text{Attribution}(x_i) = x_i \cdot \frac{\partial f(x)}{\partial x_i} \)
Pros: Better handles zero-input cases than vanilla gradients
Cons: Still affected by saturation
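Relative to the saliency sketch above, only the final reduction changes: multiply the gradient by the input embedding before summing, which yields a signed score per token (same assumptions as before).

```python
# Reusing `embeds` and `embeds.grad` from the saliency sketch above:
input_x_gradient = (embeds * embeds.grad).sum(dim=-1).squeeze(0)  # signed score per input token
```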
2.3 Integrated Gradients (IG)
Method: Integrated Gradients
Idea: Accumulate gradients along a straight line from a baseline to the input.
Formula:
\[
\text{IG}(x_i) = (x_i - x_i') \int_{\alpha=0}^{1} \frac{\partial f(x' + \alpha \cdot (x - x'))}{\partial x_i}
\, d\alpha
\]
where \( x' \) is a baseline input (e.g., all zeros or padding tokens).
Pros: Theoretically sound (satisfies completeness axiom), mitigates saturation
Cons: Computationally expensive, baseline-dependent, assumes linear path
Why Integrated Gradients Works
By integrating gradients along a path, IG avoids the saturation problem of vanilla gradients. Even if the
gradient is zero at the input, IG captures importance by looking at gradients throughout the interpolation.
Completeness property: The sum of attributions equals the difference in model output between
input and baseline:
\[
\sum_i \text{IG}(x_i) = f(x) - f(x')
\]
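A sketch of IG on input embeddings with a Riemann-sum approximation of the path integral, assuming GPT-2 and a zero-embedding baseline (Inseq and Captum provide tuned implementations). The final print is a rough completeness check: the attributions should approximately sum to the logit difference between input and baseline.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Integrated Gradients sketch: zero-embedding baseline, Riemann-sum approximation.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
x = model.transformer.wte(ids).detach()              # input embeddings
baseline = torch.zeros_like(x)                       # x': zero-embedding baseline
target_id = tokenizer(" Paris", add_special_tokens=False).input_ids[0]

n_steps = 32
total_grads = torch.zeros_like(x)
for alpha in torch.linspace(1.0 / n_steps, 1.0, n_steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    logit = model(inputs_embeds=point).logits[0, -1, target_id]
    grad, = torch.autograd.grad(logit, point)
    total_grads += grad

ig = (x - baseline) * total_grads / n_steps          # (x - x') * average gradient
token_attr = ig.sum(dim=-1).squeeze(0)               # one score per input token

# Rough completeness check: sum of attributions ~ f(x) - f(x')
with torch.no_grad():
    fx = model(inputs_embeds=x).logits[0, -1, target_id]
    fb = model(inputs_embeds=baseline).logits[0, -1, target_id]
print(token_attr.sum().item(), (fx - fb).item())
```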
2.4 The Baseline Selection Problem
A critical challenge for IG: what baseline should we use?
Baseline choices for LLMs:
- Zero embeddings: Common but may not be semantically meaningful
- Padding tokens: Model-specific, represents "no information"
- Mask tokens: For masked language models (BERT-style)
- Average embeddings: Represents "typical" input
- Random text: Contrast against unrelated content
Impact: Baseline choice can dramatically change attribution scores. Always justify your baseline!
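A short sketch of how three of these baselines can be constructed in embedding space for GPT-2 (names are illustrative); any of them can be plugged into the IG sketch above as \( x' \):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Candidate IG baselines in embedding space (substitute for x' in the IG sketch above).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
wte = model.transformer.wte

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
x = wte(ids).detach()

zero_baseline = torch.zeros_like(x)                          # all-zero embeddings
pad_id = tokenizer.eos_token_id                              # GPT-2 has no pad token; EOS is a common stand-in
pad_baseline = wte(torch.full_like(ids, pad_id)).detach()    # "no information" tokens
avg_baseline = wte.weight.mean(dim=0).expand_as(x).detach()  # average over the vocabulary
```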
2.5 Uniform Discretized Integrated Gradients (UDIG)
Method: UDIG (2024)
Problem with standard IG: Linear interpolation through embedding space doesn't respect the
discrete, linguistic nature of words. Intermediate points may not correspond to real words.
Solution: Use a nonlinear path that stays closer to actual word embeddings.
Result: Better performance on NLP tasks (sentiment, QA) compared to standard IG.
When to use: For language models where discrete token structure matters.
2.6 Other Gradient Methods
- DeepLIFT: Compares activations to reference activations, decomposes prediction differences
- GradientSHAP: Combines gradients with Shapley value sampling
- Layer-wise methods: LayerIntegratedGradients, LayerGradientXActivation for internal layers
3. Perturbation-Based Attribution Methods
Perturbation methods modify inputs and observe how outputs change. No gradients needed—purely
empirical.
3.1 Occlusion / Ablation
Method: Token Ablation
Idea: Remove (or mask) each token and measure output change.
Procedure:
- Run model on full input → get output probability \( p \)
- For each token \( i \):
- Remove or mask token \( i \)
- Run model → get new probability \( p_i \)
- Attribution \( = p - p_i \) (drop in probability)
Pros: Intuitive, model-agnostic, no gradient computation
Cons: Computationally expensive (\( O(n) \) forward passes for \( n \) tokens), may create
unnatural inputs
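A sketch of the ablation loop for GPT-2 (assumed here; any HuggingFace causal LM works the same way). Each token is deleted in turn and the drop in probability of the originally predicted next token is recorded; note that deleting tokens shifts positions and creates out-of-distribution inputs, as noted above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Token-ablation sketch: remove each token and measure the drop in probability
# of the originally predicted next token (O(n) forward passes).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]

with torch.no_grad():
    probs = model(ids.unsqueeze(0)).logits[0, -1].softmax(-1)
target_id = probs.argmax()
p_full = probs[target_id].item()

scores = []
for i in range(len(ids)):
    ablated = torch.cat([ids[:i], ids[i + 1:]]).unsqueeze(0)   # drop token i
    with torch.no_grad():
        p_i = model(ablated).logits[0, -1].softmax(-1)[target_id].item()
    scores.append(p_full - p_i)                                # attribution = p - p_i

for tok, s in zip(tokenizer.convert_ids_to_tokens(ids.tolist()), scores):
    print(f"{tok:>10s}  {s:+.4f}")
```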
3.2 LIME (Local Interpretable Model-Agnostic Explanations)
Method: LIME
Idea: Fit a simple linear model locally around the input to approximate the model's
behavior.
Procedure:
- Generate perturbed inputs by randomly masking tokens
- Run model on perturbed inputs to get outputs
- Fit a linear surrogate \( g \) so that \( g(z') \approx f(z) \) for perturbed samples \( z \) in the neighborhood of \( x \), weighted by their proximity to \( x \)
- Linear coefficients = token importance
Pros: Model-agnostic, interpretable coefficients
Cons: Requires many model calls, local approximation may be inaccurate, random sampling
variability
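A LIME-style sketch using a ridge-regression surrogate over random binary masks, assuming GPT-2 and scikit-learn; the mask token and the proximity weighting are simplifications, and the lime package's LimeTextExplainer is the standard implementation.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# LIME-style sketch: random binary masks over tokens, proximity-weighted linear surrogate.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]
n = len(ids)
mask_id = tokenizer.eos_token_id            # GPT-2 has no mask token; EOS as a crude stand-in

with torch.no_grad():
    target_id = model(ids.unsqueeze(0)).logits[0, -1].argmax()

rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(200, n))   # 1 = keep token, 0 = replace with mask_id
masks[0] = 1                                # include the unperturbed input
ys, weights = [], []
for m in masks:
    perturbed = torch.tensor([int(t) if keep else mask_id for t, keep in zip(ids, m)])
    with torch.no_grad():
        p = model(perturbed.unsqueeze(0)).logits[0, -1].softmax(-1)[target_id].item()
    ys.append(p)
    weights.append(np.exp(-(n - m.sum()) / n))   # simple proximity weight, not LIME's exact kernel

surrogate = Ridge(alpha=1.0).fit(masks, ys, sample_weight=weights)
for tok, coef in zip(tokenizer.convert_ids_to_tokens(ids.tolist()), surrogate.coef_):
    print(f"{tok:>10s}  {coef:+.4f}")
```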
3.3 SHAP (Shapley Additive Explanations)
Method: SHAP
Idea: Use game-theoretic Shapley values to assign fair credit to each token.
Formula: For each token \( i \), compute contribution by averaging over all possible subsets:
\[
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! (|N| - |S| - 1)!}{|N|!} [f(S \cup \{i\}) - f(S)]
\]
Pros: Theoretically grounded (unique solution satisfying fairness axioms), consistent
Cons: Exponentially expensive to compute exactly (approximations needed), requires defining
"coalition" semantics for tokens
Critical Limitation: Additivity Assumption
Recent research (2024) shows that transformers cannot represent additive models due to their architecture (attention, LayerNorm). This casts doubt on the applicability of LIME and SHAP, which assume local additivity.
Implication: Use LIME/SHAP cautiously for transformers. Validate with other methods.
3.4 ReAGent (Replace with Alternatives)
Method: ReAGent
Idea: Replace each token with plausible alternatives (e.g., from a masked LM) rather than just
removing it.
Advantage: More natural perturbations than simple masking.
When to use: When you want to measure importance without creating out-of-distribution inputs.
4. Attention-Based Attribution
Can we use attention weights as attribution? The answer is complicated.
4.1 Raw Attention Weights
Problem: Attention is Not Explanation
In transformers, information from different tokens gets increasingly mixed across layers through
attention and residual connections. By the final layer, it's unclear which input tokens contributed to a
representation.
Raw attention weights show what the model attends to in a single layer, but they don't track information
flow from input to output across the full network.
When raw attention is useful:
- Single-layer analysis (e.g., "does this head attend to previous token?")
- Qualitative exploration
- Sanity checks (e.g., "is the model attending to padding?")
4.2 Attention Rollout
Method: Attention Rollout (Abnar & Zuidema, 2020)
Idea: Propagate attention weights through layers, accounting for residual connections.
Assumption: Token identities are linearly combined based on attention weights.
Formula: Roll out attention from layer \( \ell \) to input:
\[
\tilde{A}^{(\ell)} = A^{(\ell)} \cdot \tilde{A}^{(\ell-1)}
\]
where \( A^{(\ell)} \) is attention at layer \( \ell \), adjusted for residual connections.
Pros: Accounts for multi-layer information flow
Cons: Linear approximation may be inaccurate for nonlinear models
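A sketch of attention rollout for GPT-2: attention matrices are averaged over heads, mixed with the identity to account for the residual connection, row-renormalized, and multiplied layer by layer (the 0.5/0.5 residual weighting is the common convention, assumed here).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Attention-rollout sketch (Abnar & Zuidema, 2020): average heads, add the identity
# for the residual stream, renormalize rows, and multiply across layers.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids
with torch.no_grad():
    attentions = model(ids, output_attentions=True).attentions  # tuple of (1, heads, seq, seq)

rollout = torch.eye(ids.shape[1])
for layer_attn in attentions:
    a = layer_attn[0].mean(dim=0)                    # average over heads -> (seq, seq)
    a = 0.5 * a + 0.5 * torch.eye(a.shape[0])        # account for the residual connection
    a = a / a.sum(dim=-1, keepdim=True)              # renormalize rows
    rollout = a @ rollout                            # accumulate A^(L) ... A^(1)

# Row i of `rollout`: how much each input token contributes to position i's representation.
last_token_attr = rollout[-1]
for tok, score in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), last_token_attr.tolist()):
    print(f"{tok:>10s}  {score:.4f}")
```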
4.3 Attention Flow
Method: Attention Flow (Abnar & Zuidema, 2020)
Idea: Model the attention graph as a flow network and compute maximum flow from each
input token to the output.
Algorithm: Use max-flow algorithms (e.g., Ford-Fulkerson) with attention weights as edge
capacities.
Pros: Captures multi-path information flow
Cons: Computationally more expensive than rollout
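A sketch of attention flow using networkx (an added dependency, assumed here): the same residual-adjusted attention matrices become edge capacities in a layered graph, and max-flow is computed from each input token to the final position at the top layer.

```python
import networkx as nx
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Attention-flow sketch: residual-adjusted attention as edge capacities,
# max-flow from each input token to the last position at the top layer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids
with torch.no_grad():
    attns = model(ids, output_attentions=True).attentions
seq = ids.shape[1]

G = nx.DiGraph()
for l, layer_attn in enumerate(attns):
    a = layer_attn[0].mean(dim=0)                        # (seq, seq), rows = queries
    a = 0.5 * a + 0.5 * torch.eye(seq)                   # residual connection
    for i in range(seq):
        for j in range(seq):
            if a[i, j] > 0:
                G.add_edge((l, j), (l + 1, i), capacity=float(a[i, j]))

sink = (len(attns), seq - 1)                             # last position, top layer
flows = [nx.maximum_flow_value(G, (0, j), sink) for j in range(seq)]
for tok, f in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), flows):
    print(f"{tok:>10s}  {f:.4f}")
```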
Attention Rollout vs Attention Flow
| Aspect | Attention Rollout | Attention Flow |
| --- | --- | --- |
| Assumption | Linear mixing of identities | Flow network |
| Computation | Matrix multiplication (fast) | Max-flow algorithm (slower) |
| Multi-path | Averages paths | Finds bottlenecks |
| Use case | Quick approximate attribution | Detailed flow analysis |
Key finding: Both methods correlate better with ablation and gradient-based methods than raw
attention.
5. The Inseq Library: Practical Tool
Inseq is a Python library that makes attribution analysis
accessible for sequence generation models.
5.1 What Inseq Provides
- Unified API for gradient, perturbation, and attention methods
- Model support: GPT-2, GPT-NeoX, Llama, BLOOM, T5, BART, Marian MT (via HuggingFace)
- Methods supported:
- Gradient-based: Saliency, Input×Gradient, Integrated Gradients, DeepLIFT, GradientSHAP, UDIG
- Attention-based: Attention weights, value zeroing
- Perturbation-based: Occlusion, LIME, ReAGent
- High-level interface:
model.attribute(input_text, target_text, method="integrated_gradients")
- Visualization tools for displaying attributions
- Step-level attribution for generation (attribute each output token separately)
5.2 Example Usage
```python
import inseq

# Load a model together with an attribution method
model = inseq.load_model("gpt2", "integrated_gradients")

# Run attribution on the generated continuation
result = model.attribute(
    "The cat sat on the",
    n_steps=5,                      # forwarded to Integrated Gradients (interpolation steps)
    step_scores=["probability"]     # also record each generated token's probability
)

# Visualize token-level attributions
result.show()
```
5.3 Key Pedagogical Examples from Inseq
Example 1: Gender Bias in Machine Translation
Task: Translate "The doctor asked the nurse to help" from English to gendered language (e.g., Italian).
Question: Does the model assign gendered pronouns based on stereotypes?
Attribution reveals: Which input words drive gendered predictions ("doctor" vs "nurse").
Example 2: Factual Knowledge Localization in GPT-2
Input: "The capital of France is"
Output: "Paris"
Attribution reveals: The model relies heavily on "France" (expected) but may also draw strongly on "capital", showing it recognizes the prompt structure.
6. Comparing Attribution Methods
Different methods can give different results. How do we know which to trust?
6.1 Validation Techniques
1. Perturbation Test (Ground Truth)
Remove tokens with high attribution scores. The output should change more than removing low-attribution tokens.
Metric: Correlation between attribution scores and the output change when tokens are removed (see the code sketch at the end of this subsection).
2. Sanity Checks
- Random model test: Attributions should differ between trained and random models
- Random label test: Attributions should differ for different target labels
- Completeness test: Sum of attributions should match output difference (for IG)
3. Cross-Method Agreement
Check if multiple methods (e.g., IG, LIME, ablation) agree on which tokens are important. High agreement
increases confidence.
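A sketch of the perturbation test from technique 1: compute the drop in target probability when each token is deleted, then report the Spearman rank correlation with your attribution scores. The attribution list below is a hypothetical placeholder; substitute the output of any method from Sections 2-4.

```python
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Perturbation test sketch: do high-attribution tokens cause larger output drops
# when removed? Reports Spearman correlation between the two rankings.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def ablation_drops(ids, target_id):
    """Drop in target probability when each token is deleted (ground-truth proxy)."""
    with torch.no_grad():
        p_full = model(ids.unsqueeze(0)).logits[0, -1].softmax(-1)[target_id].item()
    drops = []
    for i in range(len(ids)):
        ablated = torch.cat([ids[:i], ids[i + 1:]]).unsqueeze(0)
        with torch.no_grad():
            p_i = model(ablated).logits[0, -1].softmax(-1)[target_id].item()
        drops.append(p_full - p_i)
    return drops

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]
with torch.no_grad():
    target_id = model(ids.unsqueeze(0)).logits[0, -1].argmax()

# Hypothetical per-token scores; replace with the output of your attribution method.
attributions = [0.02, 0.35, 0.15, 0.08, 0.40, 0.12]
rho, pval = spearmanr(attributions, ablation_drops(ids, target_id))
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```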
6.2 Method Comparison Summary
| Method | Speed | Accuracy | Interpretability | LLM-Specific Issues |
| --- | --- | --- | --- | --- |
| Saliency | ⚡ Fast | ⚠️ Saturation | ✓ Clear | Unreliable in deep networks |
| Integrated Gradients | ⚠️ Slow | ✓✓ Good | ✓ Clear | Baseline-dependent, linear path |
| UDIG | ⚠️ Slow | ✓✓✓ Best | ✓ Clear | Designed for discrete inputs |
| Ablation | ⚠️⚠️ Very Slow | ✓✓✓ Ground truth | ✓✓ Very clear | Unnatural inputs (gaps) |
| LIME | ⚠️⚠️ Very Slow | ⚠️ Variable | ✓✓ Linear model | Additivity assumption violated |
| SHAP | ⚠️⚠️⚠️ Extremely Slow | ✓ Theoretically sound | ✓✓ Game-theoretic | Additivity assumption violated |
| Attention Rollout | ⚡⚡ Very Fast | ⚠️ Approximate | ⚠️ Hard to interpret | Linear approximation |
| Attention Flow | ⚡ Fast | ✓ Better than rollout | ⚠️ Hard to interpret | Flow network assumption |
6.3 Recommendations
- For quick exploration: Saliency or Attention Rollout
- For reliable attribution: Integrated Gradients (or UDIG for NLP)
- For validation: Ablation (ground truth)
- For publication: Use multiple methods, report agreement
- For transformers: Prefer gradient-based over LIME/SHAP (additivity issues)
7. Technical Challenges and Solutions
7.1 Gradient Saturation
Problem: Gradients near zero despite input importance.
Solutions:
- Use Integrated Gradients (accumulates over path)
- Try Input×Gradient (scales by input magnitude)
- For circuit discovery: Use EAP-GP (adaptive path avoids saturated regions)
7.2 Computational Cost
Problem: IG requires ~50-300 forward passes; SHAP/LIME even more.
Solutions:
- Use fewer interpolation steps (trade accuracy for speed)
- Attribute only critical tokens (e.g., content words, not punctuation)
- Use faster methods (saliency, attention) for exploration, then validate with IG
- Parallelize across GPUs
7.3 Long Sequences
Problem: Attributing 1000+ token inputs is expensive and overwhelming.
Solutions:
- Aggregate attributions (e.g., by sentence, by entity)
- Focus on specific output tokens of interest
- Use sliding windows for local context
7.4 Discrete vs Continuous Inputs
Problem: Text is discrete, but IG assumes continuous interpolation.
Solutions:
- Use UDIG (nonlinear path respecting discrete structure)
- Apply IG to embeddings (continuous) but interpret results carefully
- Consider perturbation methods (naturally discrete)
8. Applications to Your Research
8.1 Debugging Model Predictions
Scenario: Your model incorrectly predicts subject-verb agreement.
Input: "The key to the cabinets are" (incorrect, should be "is")
Use attribution to investigate:
- Run IG to see which input tokens influenced "are" prediction
- If attribution is high on "cabinets" (distractor), model is attending to wrong noun
- Guides intervention: Need to strengthen subject tracking (relate to Week 5 circuits)
8.2 Analyzing Your Concept
For your project concept, attribution helps answer:
- Which input features trigger your concept? (e.g., for "politeness," which words are most
important?)
- Are there spurious correlations? (e.g., does "musical key" attribution rely on irrelevant
cues?)
- How does context matter? (e.g., does the concept depend on distant words?)
8.3 Guiding Circuit Discovery
Workflow combining Week 5 and Week 9:
- Attribution (Week 9): Identify which input tokens matter for your concept
- Attention analysis: Which heads attend to those important tokens?
- Path patching (Week 5): Test if those heads causally compute the concept
- Validation (Week 8): Use IIA to confirm the circuit
9. Integration with Previous Weeks
Connecting Attribution to Other Methods
| Previous Week | How Attribution Helps |
| --- | --- |
| Week 5: Circuits | Use attribution to identify which input tokens activate your circuit. Example: if a circuit computes induction, attribution should show high scores on the repeated token. |
| Week 6: Probes | Compare probe predictions with attribution. If a probe detects a feature, attribution should show which inputs provided that information. Example: a probe detects "plural subject" → attribution should highlight the plural noun. |
| Week 7: SAEs | For an SAE feature representing your concept, use attribution to see which inputs activate it. Example: a "politeness feature" activates → attribution reveals which politeness markers (e.g., "please," "thank you") trigger it. |
| Week 8: Causal Validation | Attribution ≠ causation! Validate high-attribution tokens with interventions. Example: attribution says "France" is important → change "France" to "Italy" and verify the output changes to "Rome." |
10. Limitations and Skepticism
Critical Limitations to Remember
- Correlation ≠ Causation: High attribution doesn't prove a token is necessary.
Always validate with interventions (Week 8).
- Additivity assumptions: LIME and SHAP assume local additivity, which transformers violate.
Results may be misleading.
- Baseline dependence: IG results change with baseline choice. Report your baseline and justify
it.
- Attention ≠ Explanation: Raw attention weights are insufficient for full attributions due to
information mixing.
- Out-of-distribution perturbations: Removing tokens creates unnatural inputs. Model behavior
on these may not reflect normal operation.
- Single-path assumptions: Standard IG uses a straight line, which may not capture the true
path through representation space.
When Attribution Fails
- Highly distributed concepts: If your concept depends on complex interactions of many tokens,
attribution may not isolate individual contributions clearly.
- Nonlinear interactions: Token A + Token B together matter, but individually they don't. Linear
attribution methods will underestimate their importance.
- Internal reasoning: If the model performs multi-step reasoning, input attribution alone won't
reveal the internal steps (need circuit analysis).
11. Best Practices for Your Research
Checklist for Using Attribution in Your Project
- ✓ Use multiple methods and report agreement
- ✓ Validate with ablation (ground truth test)
- ✓ Run sanity checks (random model, random label)
- ✓ Justify baseline choice (for IG)
- ✓ Report computational costs
- ✓ Visualize attributions clearly
- ✓ Combine with causal validation (Week 8 IIA)
- ✓ Aggregate for long sequences
- ✓ Don't over-interpret (attribution shows correlation, not causation)
- ✓ Compare to human intuition (does attribution make sense?)
Part 2: Skepticism and Attribution Validation (50% of content)
Critical Reality Check
Attribution methods produce compelling visualizations. Saliency maps highlight features. Attention weights show where
the model "looks." But do these explanations actually reflect how the model makes decisions? This section examines
evidence that attribution methods can be misleading—and how to validate them rigorously.
1. Sanity Checks for Saliency Maps (Adebayo et al., 2018)
The Problem: Visual Plausibility ≠ Correctness
Saliency maps often highlight seemingly meaningful regions—faces in images, important words in text. But are these
explanations faithful to the model's computation?
Adebayo et al. (2018) proposed simple sanity checks to test whether attribution methods are
actually sensitive to the model and data they're supposed to explain.
Sanity Check 1: Model Parameter Randomization
Test: Randomly re-initialize the model's weights.
Expected behavior: A good explanation method should produce completely different
explanations for a trained model vs. a random model.
Why it matters: If explanations look the same for both, the method is just detecting network
architecture or input patterns, not what the model learned.
Example Application to LLMs:
```python
import numpy as np
from transformers import GPT2Config, GPT2LMHeadModel

# Compare attributions from a trained model and a randomly initialized one
trained_model = GPT2LMHeadModel.from_pretrained("gpt2")
random_model = GPT2LMHeadModel(GPT2Config())   # same architecture, random weights

# `text` is your input; compute_attribution stands for whatever method you are testing
attribution_trained = compute_attribution(trained_model, text)
attribution_random = compute_attribution(random_model, text)

# They should be substantially different!
assert not np.allclose(attribution_trained, attribution_random)
```
Sanity Check 2: Data Randomization
Test: Train the model on data with random labels.
Expected behavior: Explanations should look different for a model trained on meaningful data vs.
random noise.
Why it matters: A model that memorized noise has fundamentally different internal mechanisms than
one that learned patterns.
Key Findings from Adebayo et al.
- Guided Backpropagation FAILS: Produces similar-looking visualizations for trained and random
models
- Guided GradCAM FAILS: Primarily does edge detection, regardless of what the model learned
- Simple Gradients PASS: Consistently distinguish trained from random models
- Integrated Gradients PASS: Changes appropriately with model changes
Implication for Your Project:
Before trusting any attribution visualization, run these sanity checks. If your method fails, you're seeing artifacts
of the attribution algorithm, not insights into your model.
2. Attention Is Not (Necessarily) Explanation (Jain & Wallace, 2019)
The Intuition vs. The Evidence
Attention mechanisms are everywhere in modern NLP. It's tempting to interpret attention weights as showing which input
tokens are "important" for predictions. But Jain & Wallace (2019) systematically tested this
assumption.
Three Tests of Attention as Explanation
Test 1: Correlation with Feature Importance
Compare attention weights with gradient-based measures of actual feature importance.
Result: Often uncorrelated. High attention ≠ high importance for prediction.
Test 2: Counterfactual Attention Distributions
Can you construct different attention patterns that produce identical outputs?
Result: Yes! Multiple contradictory attention distributions often yield the same prediction.
Implication: If multiple explanations are equally valid, which one is "true"?
Test 3: Adversarial Attention
Optimize attention weights to be maximally different while keeping outputs unchanged.
Result: Possible for many tasks. Attention can be manipulated without affecting predictions.
When Can You Trust Attention?
| Use Case | Trustworthy? | Reason |
| --- | --- | --- |
| Single-layer analysis | ✓ Moderate | Shows what that layer attends to (but not why) |
| Debugging attention patterns | ✓ Yes | Useful for finding bugs (attending to padding, etc.) |
| Input importance for output | ✗ No | Use attribution methods instead |
| Explanation for end users | ✗ Risky | May be unfaithful; validate first |
Better Alternatives
- Attention Rollout: Propagate attention through layers (see Week 6 Part 1)
- Attention Flow: Model as flow network
- Integrated Gradients: More faithful to actual importance
- Causal Interventions: Test by ablating attended tokens
3. The ROAR Benchmark: Quantifying Attribution Quality (Hooker et al., 2019)
The Challenge
We have many attribution methods (gradients, IG, SHAP, LIME, attention). How do we know which ones actually identify
important features? Visual inspection is subjective and prone to confirmation bias.
ROAR: RemOve And Retrain
Hooker et al. (2019) proposed a quantitative benchmark:
The ROAR Protocol
- Train a model to convergence
- Use an attribution method to identify most important features
- Remove those features from the training data
- Retrain the model from scratch without those features
- Measure the drop in performance
Logic: If the method truly identifies important features, removing them should hurt performance
more than removing random features.
Shocking Results
- Many popular methods perform no better than random feature selection
- Even removing 90% of pixels (chosen by importance) still allows ~64% accuracy on ImageNet
- Only ensemble methods (VarGrad, SmoothGrad-Squared) consistently beat random
Conclusion: Just because an attribution method produces a plausible-looking heatmap doesn't mean
it's identifying features that actually matter.
Adapting ROAR for LLMs
For text models, adapt ROAR as follows:
```python
# ROAR adapted for text attribution (protocol sketch):
# 1. Compute attribution scores for your dataset
# 2. Identify the top-k most important tokens per example
# 3. Mask or remove those tokens from the training data
# 4. Fine-tune (or retrain) the model without those tokens
# 5. Measure the performance drop

# Compare against removing random tokens
drop_with_important_removed = baseline_acc - roar_acc
drop_with_random_removed = baseline_acc - random_acc

# A good attribution method removes more useful information than random removal:
assert drop_with_important_removed > drop_with_random_removed
```
Cheaper Alternative: Perturbation Test
Full ROAR requires retraining (expensive). A faster proxy:
- Remove high-attribution tokens from test inputs
- Measure output change
- Compare to removing random tokens
- Not as rigorous as ROAR, but much faster
4. Validation Checklist for Attribution Methods
Before trusting attribution results, verify:
Mandatory Checks
- ✓ Sanity checks (Adebayo):
- Random model test: attributions for a randomly initialized model differ from those for the trained model
- Random label test: attributions for a model trained on random labels differ from those for the properly trained model
- ✓ Perturbation validation:
- Removing high-attribution features hurts performance
- More than removing random features
- Effect size is substantial
- ✓ Method agreement:
- Compare at least 2-3 independent attribution methods
- If they disagree, investigate why
- Don't cherry-pick the method that gives desired results
- ✓ Baseline comparison:
- Beats random feature selection
- Beats frequency-based heuristics
- Compare simple (gradients) vs complex (IG, SHAP) methods
- ✓ Completeness test (for IG):
- Sum of attributions ≈ f(x) - f(baseline)
- If not, something is wrong with implementation or baseline
5. Common Pitfalls and How to Avoid Them
Pitfall 1: Baseline Dependence (Integrated Gradients)
Problem: IG results can change dramatically with baseline choice.
Solution:
- Test multiple baselines (zero embeddings, padding, average embeddings)
- Report which baseline you used and why
- If results flip with baseline change, interpretation is fragile
Pitfall 2: Additivity Assumptions (LIME, SHAP)
Problem: Transformers violate local additivity due to attention and LayerNorm.
Solution:
- Use LIME/SHAP cautiously for transformers
- Prefer gradient-based methods
- Validate LIME/SHAP results with ablation
Pitfall 3: Out-of-Distribution Perturbations
Problem: Removing tokens creates unnatural inputs. Model behavior may not reflect normal operation.
Solution:
- Use ReAGent (replace with alternatives) instead of masking
- Consider in-distribution perturbations
- Report that perturbations are artificial
Pitfall 4: Confirmation Bias
Problem: Seeing what you expect in attribution maps.
Solution:
- Pre-register predictions before running attribution
- Show examples where attribution gives unexpected results
- Have collaborators blind-evaluate attributions
6. Best Practices for Rigorous Attribution Research
Do's
- ✓ Run sanity checks before trusting any method
- ✓ Use multiple independent attribution methods
- ✓ Validate with perturbation or ROAR tests
- ✓ Report baseline choice and justify it
- ✓ Show when attribution fails or gives unexpected results
- ✓ Combine attribution with causal intervention (Week 4)
- ✓ Report computational cost
Don'ts
- ✗ Trust visualizations without validation
- ✗ Use only one attribution method
- ✗ Rely solely on attention weights
- ✗ Cherry-pick examples that look good
- ✗ Ignore baseline dependence
- ✗ Skip sanity checks to save time
- ✗ Over-interpret correlation as causation
7. Integration with Your Research Project
Validating Your Concept Attribution
When attributing input importance for your concept:
Week 6 Validation Protocol
- Choose methods: Pick 2-3 attribution methods (e.g., IG, gradients, ablation)
- Run sanity checks:
- Test on random model
- Verify completeness (for IG)
- Compute attributions: For your concept-relevant predictions
- Validate:
- Do methods agree on top features?
- Perturbation test: removing features changes output
- Compare to random baseline
- Causal integration:
- Combine with Week 4 patching results
- Do high-attribution tokens align with causally important components?
- Report honestly:
- Show agreement and disagreement between methods
- Document failure cases
- Acknowledge limitations
8. Looking Ahead: More Skepticism in Week 10
This week introduced validation for attribution methods. But interpretability illusions go deeper:
- Week 7 (SAEs): Can sparse features be adversarially fragile?
- Week 8 (Circuits): How sensitive are circuit findings to methodological choices?
- Week 10 (Full Skepticism): Comprehensive study of when interpretability methods fail
The validation framework from Week 4, combined with attribution skepticism from this week, prepares you to critically
evaluate any interpretability claim.
Summary: Attribution + Skepticism
Part 1 Takeaways (Attribution Methods)
- Input attribution identifies which inputs matter for outputs
- Gradient-based methods (IG, UDIG) are reliable when validated
- Perturbation methods (ablation) provide ground truth
- Attention-based methods are fast but approximate
- Inseq library makes attribution accessible
Part 2 Takeaways (Skepticism)
- Visual plausibility ≠ correctness (Adebayo sanity checks)
- Attention weights are not reliable explanations (Jain & Wallace)
- Many methods perform no better than random (ROAR benchmark)
- Always validate with multiple independent tests
- Combine attribution with causal validation
For Your Project
- Use attribution to identify important inputs for your concept
- Run full validation protocol (sanity checks + perturbation + agreement)
- Integrate with Week 4 causal intervention results
- Report failures and limitations honestly
- Build toward rigorous, validated interpretability
References for Part 2 (Skepticism)
Core Papers
- Adebayo et al. (2018): "Sanity Checks for Saliency Maps." NeurIPS. arXiv:1810.03292
- Jain & Wallace (2019): "Attention is not Explanation." NAACL. arXiv:1902.10186
- Hooker et al. (2019): "A Benchmark for Interpretability Methods in Deep Neural Networks."
NeurIPS. arXiv:1806.10758
Nuance and Follow-ups
- Wiegreffe & Pinter (2019). "Attention is not not Explanation." EMNLP. arXiv:1908.04626 [Defense of attention under certain conditions]
- Jacovi & Goldberg (2020). "Towards Faithfully Interpretable NLP Systems." ACL. [Framework for faithfulness]
12. Summary and Next Steps
Key Takeaways
- Input attribution identifies which inputs matter for outputs—complementary to circuit/probe/SAE
analysis
- Gradient-based methods (IG, UDIG) are reliable for LLMs when used carefully
- Perturbation-based methods (ablation) provide ground truth but are computationally expensive
- Attention-based methods (rollout, flow) are fast but approximate
- Inseq library makes attribution accessible for sequence generation models
- Always validate attributions with multiple methods and causal interventions
- Critical limitations: Baseline dependence, additivity assumptions, correlation ≠ causation
For Your Research Project
- Use attribution to identify which input features activate your concept
- Combine with circuit analysis (Week 5) to understand how those inputs are processed
- Validate findings with causal interventions (Week 8)
- Report attribution results in your paper with appropriate caveats
Looking Ahead
Week 10: Skepticism and interpretability illusions—when interpretability methods mislead us, and
how to be rigorous.
References & Resources
Core Papers
- Inseq Library: Sarti et al. (2023). "Inseq: An Interpretability Toolkit for Sequence
Generation Models." arXiv:2302.13942
- Integrated Gradients: Sundararajan et al. (2017). "Axiomatic Attribution for Deep Networks."
ICML. arXiv:1703.01365
- UDIG: "Uniform Discretized Integrated Gradients" (2024). arXiv:2412.03886
- Attention Flow: Abnar & Zuidema (2020). "Quantifying Attention Flow in Transformers." ACL. ACL Anthology
- LIME: Ribeiro et al. (2016). "Why Should I Trust You?" KDD. arXiv:1602.04938
- SHAP: Lundberg & Lee (2017). "A Unified Approach to Interpreting Model Predictions." NIPS. arXiv:1705.07874
- Baseline Selection: Sturmfels et al. (2020). "Visualizing the Impact of Feature Attribution
Baselines." Distill. distill.pub
Tools & Libraries
Supplementary Reading
- Attention is not Explanation: Jain & Wallace (2019). arXiv:1902.10186
- Attention is not not Explanation: Wiegreffe & Pinter (2019). arXiv:1908.04626
- Sanity Checks for Saliency Maps: Adebayo et al. (2018). NeurIPS. arXiv:1810.03292
- Transformers Can't Represent Additive Models: Recent research on LIME/SHAP limitations for transformers (2024).
In-Class Exercise: Which Words Make a Pun Punny?
Using attribution methods, we will identify which input tokens are most important for the model's
"pun recognition"—helping us understand what linguistic cues trigger humor processing.
Part 1: Attribution Setup (15 min)
Prepare for attribution analysis on your pun dataset:
- Select examples: Choose 10 puns where the model shows clear pun recognition (from your probe analysis)
- Define target: Use your pun probe's output as the attribution target
- Which inputs increase the "pun" score?
- Choose methods: We will compare Integrated Gradients, Input×Gradient, and Attention Rollout
Part 2: Compute and Compare Attributions (25 min)
Apply multiple attribution methods and analyze agreement:
- Run integrated gradients: Compute token-level attributions for each pun
- Visualize heatmaps: Which words have highest attribution?
- Is it the punchline word (double meaning)?
- Is it the setup words that create the context?
- Do both parts contribute?
- Compare methods:
- Do IG and Input×Gradient agree on important tokens?
- How does Attention Rollout compare?
- Compute correlation between method rankings
- Aggregate patterns:
- Across all puns, what types of tokens are important?
- Are punchlines always critical, or does context matter more?
Part 3: Validate with Interventions (20 min)
Test whether attribution accurately identifies important tokens:
- Ablate high-attribution tokens:
- Remove or mask the top-3 highest-attribution tokens
- Does the pun probe score drop significantly?
- Ablate low-attribution tokens:
- Remove tokens with low attribution
- Pun recognition should remain relatively stable
- Compute correlation:
- Plot: attribution score vs. effect of ablating that token
- If correlation is low, attribution may not be faithful
Discussion: What have you learned about how the model "sees" puns?
Are the important tokens what you would expect, or are there surprises?
Open Pun Attribution Notebook in Colab
Project Milestone
Due: Thursday of Week 6
Apply attribution methods to identify which input tokens matter most for your concept.
Use sanity checks to validate that your attribution methods are meaningful and not artifacts.
Attribution Analysis
- Apply multiple attribution methods:
- Gradient-based: Input gradients, integrated gradients, or gradient × input
- Perturbation-based: Ablation or occlusion
- Attention-based: Attention rollout or attention flow (with caution)
- Identify important tokens:
- Which input tokens have highest attribution scores for your concept?
- Do different methods agree on important tokens?
- Visualize attributions across multiple examples
- Run sanity checks:
- Random model baseline: do attributions change substantially when weights are randomized?
- Random input baseline: do attributions change appropriately with random inputs?
- Cascading randomization: test layers systematically
- Validate with interventions:
- Ablate high-attribution tokens: does model behavior change as predicted?
- Ablate low-attribution tokens: behavior should remain stable
- Compare attribution rankings with intervention effect sizes
Deliverables:
- Attribution visualizations:
- Heatmaps showing token attributions for 10-15 examples
- Comparison across different attribution methods
- Summary of which tokens are consistently important
- Sanity check results:
- Results from all sanity check tests
- Assessment: do your attributions pass sanity checks?
- Validation results:
- Correlation between attribution scores and ablation effects
- Examples where attribution predictions match/mismatch interventions
- Interpretation:
- Which tokens matter most for your concept?
- Are attributions trustworthy for your concept?
- What does this reveal about how models process your concept?
- Code: Notebook with attribution methods, sanity checks, and validation
Attribution methods can be unreliable—always validate with sanity checks and interventions before
drawing conclusions about which inputs matter.