Week 7: Attribution
Learning Objectives
- Understand what input attribution is and when to use it vs other interpretability methods
- Master gradient-based attribution methods (saliency, integrated gradients, DeepLIFT)
- Apply perturbation-based methods (LIME, SHAP, ablation) to LLMs
- Use attention-based attribution (attention rollout, attention flow)
- Work effectively with the Inseq library for sequence generation models
- Critically evaluate attribution methods and understand their limitations
- Address technical challenges (baseline selection, gradient saturation, computational cost)
- Apply attribution to debug model predictions and analyze your project concepts
- Integrate attribution with previous weeks' methods (circuits, probes, SAEs, causal validation)
Required Readings
Supplementary Readings
1. What is Input Attribution?
The Core Question
Input attribution answers: "Which parts of the input were most important for the model's
output?"
For an LLM generating text, attribution identifies which input tokens influenced each output token. This helps us:
- Debug predictions: Why did the model generate this specific word?
- Detect bias: Is the model relying on demographic features inappropriately?
- Validate understanding: Does the model attend to the right context?
- Guide interventions: Which inputs should we modify to change behavior?
Attribution vs Other Interpretability Methods
| Method | Question Answered | Granularity |
| --- | --- | --- |
| Attribution (This week) | Which inputs matter? | Input tokens |
| Circuits (Week 5) | Which components compute the output? | Attention heads, MLPs |
| Probes (Week 6) | What information is encoded? | Layer activations |
| SAEs (Week 7) | What features are represented? | Sparse feature directions |
| Causal Validation (Week 8) | Are findings causal? | Variable-level |
Key insight: Attribution is input-level explanation. It tells you what the model used, not how that information was processed internally. Combine attribution with circuit analysis (Week 5) to get the full picture.
2. Gradient-Based Attribution Methods
Gradient-based methods use the model's gradients to measure input importance. Intuitively: if changing an input
token would change the output a lot (high gradient), that token is important.
2.1 Saliency Maps (Vanilla Gradients)
Method: Saliency Maps
Idea: The gradient magnitude indicates importance.
Formula: \( \text{Saliency}(x_i) = \left| \frac{\partial f(x)}{\partial x_i} \right| \)
Pros: Simple, fast, easy to implement
Cons: Suffers from gradient saturation (near-zero gradients in deep networks)
Example:
Input: "The cat sat on the mat"
Output: "because" (predicted next token)
Saliency scores: [0.02, 0.35, 0.15, 0.08, 0.40, 0.12]
Interpretation: "cat" and "mat" have highest gradients → most influential for predicting "because"
Problem: Gradient Saturation
Deep networks with ReLU/sigmoid activations often have near-zero gradients even when inputs are important. This
happens because the function flattens (saturates) in certain regions.
Solution: Use Integrated Gradients or other methods that accumulate gradients along a path.
2.2 Input × Gradient
Method: Input × Gradient
Idea: Weight gradients by input magnitude to get better signal.
Formula: \( \text{Attribution}(x_i) = x_i \cdot \frac{\partial f(x)}{\partial x_i} \)
Pros: Better handles zero-input cases than vanilla gradients
Cons: Still affected by saturation
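Relative to the saliency sketch above, only the final reduction changes: multiply the gradient by the input embedding before summing, which yields a signed score per token (same assumptions as before).

```python
# Reusing `embeds` and `embeds.grad` from the saliency sketch above:
input_x_gradient = (embeds * embeds.grad).sum(dim=-1).squeeze(0)  # signed score per input token
```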
2.3 Integrated Gradients (IG)
Method: Integrated Gradients
Idea: Accumulate gradients along a straight line from a baseline to the input.
Formula:
\[
\text{IG}(x_i) = (x_i - x_i') \int_{\alpha=0}^{1} \frac{\partial f(x' + \alpha \cdot (x - x'))}{\partial x_i}
\, d\alpha
\]
where \( x' \) is a baseline input (e.g., all zeros or padding tokens).
Pros: Theoretically sound (satisfies completeness axiom), mitigates saturation
Cons: Computationally expensive, baseline-dependent, assumes linear path
Why Integrated Gradients Works
By integrating gradients along a path, IG avoids the saturation problem of vanilla gradients. Even if the
gradient is zero at the input, IG captures importance by looking at gradients throughout the interpolation.
Completeness property: The sum of attributions equals the difference in model output between
input and baseline:
\[
\sum_i \text{IG}(x_i) = f(x) - f(x')
\]
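A sketch of IG on input embeddings with a Riemann-sum approximation of the path integral, assuming GPT-2 and a zero-embedding baseline (Inseq and Captum provide tuned implementations). The final print is a rough completeness check: the attributions should approximately sum to the logit difference between input and baseline.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Integrated Gradients sketch: zero-embedding baseline, Riemann-sum approximation.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
x = model.transformer.wte(ids).detach()              # input embeddings
baseline = torch.zeros_like(x)                       # x': zero-embedding baseline
target_id = tokenizer(" Paris", add_special_tokens=False).input_ids[0]

n_steps = 32
total_grads = torch.zeros_like(x)
for alpha in torch.linspace(1.0 / n_steps, 1.0, n_steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    logit = model(inputs_embeds=point).logits[0, -1, target_id]
    grad, = torch.autograd.grad(logit, point)
    total_grads += grad

ig = (x - baseline) * total_grads / n_steps          # (x - x') * average gradient
token_attr = ig.sum(dim=-1).squeeze(0)               # one score per input token

# Rough completeness check: sum of attributions ~ f(x) - f(x')
with torch.no_grad():
    fx = model(inputs_embeds=x).logits[0, -1, target_id]
    fb = model(inputs_embeds=baseline).logits[0, -1, target_id]
print(token_attr.sum().item(), (fx - fb).item())
```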
2.4 The Baseline Selection Problem
A critical challenge for IG: what baseline should we use?
Baseline choices for LLMs:
- Zero embeddings: Common but may not be semantically meaningful
- Padding tokens: Model-specific, represents "no information"
- Mask tokens: For masked language models (BERT-style)
- Average embeddings: Represents "typical" input
- Random text: Contrast against unrelated content
Impact: Baseline choice can dramatically change attribution scores. Always justify your baseline!
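A short sketch of how three of these baselines can be constructed in embedding space for GPT-2 (names are illustrative); any of them can be plugged into the IG sketch above as \( x' \):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Candidate IG baselines in embedding space (substitute for x' in the IG sketch above).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
wte = model.transformer.wte

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
x = wte(ids).detach()

zero_baseline = torch.zeros_like(x)                          # all-zero embeddings
pad_id = tokenizer.eos_token_id                              # GPT-2 has no pad token; EOS is a common stand-in
pad_baseline = wte(torch.full_like(ids, pad_id)).detach()    # "no information" tokens
avg_baseline = wte.weight.mean(dim=0).expand_as(x).detach()  # average over the vocabulary
```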
2.5 Uniform Discretized Integrated Gradients (UDIG)
Method: UDIG (2024)
Problem with standard IG: Linear interpolation through embedding space doesn't respect the
discrete, linguistic nature of words. Intermediate points may not correspond to real words.
Solution: Use a nonlinear path that stays closer to actual word embeddings.
Result: Better performance on NLP tasks (sentiment, QA) compared to standard IG.
When to use: For language models where discrete token structure matters.
2.6 Other Gradient Methods
- DeepLIFT: Compares activations to reference activations, decomposes prediction differences
- GradientSHAP: Combines gradients with Shapley value sampling
- Layer-wise methods: LayerIntegratedGradients, LayerGradientXActivation for internal layers
3. Perturbation-Based Attribution Methods
Perturbation methods modify inputs and observe how outputs change. No gradients needed—purely
empirical.
3.1 Occlusion / Ablation
Method: Token Ablation
Idea: Remove (or mask) each token and measure output change.
Procedure:
- Run model on full input → get output probability \( p \)
- For each token \( i \):
- Remove or mask token \( i \)
- Run model → get new probability \( p_i \)
- Attribution \( = p - p_i \) (drop in probability)
Pros: Intuitive, model-agnostic, no gradient computation
Cons: Computationally expensive (\( O(n) \) forward passes for \( n \) tokens), may create
unnatural inputs
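A sketch of the ablation loop for GPT-2 (assumed here; any HuggingFace causal LM works the same way). Each token is deleted in turn and the drop in probability of the originally predicted next token is recorded; note that deleting tokens shifts positions and creates out-of-distribution inputs, as noted above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Token-ablation sketch: remove each token and measure the drop in probability
# of the originally predicted next token (O(n) forward passes).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]

with torch.no_grad():
    probs = model(ids.unsqueeze(0)).logits[0, -1].softmax(-1)
target_id = probs.argmax()
p_full = probs[target_id].item()

scores = []
for i in range(len(ids)):
    ablated = torch.cat([ids[:i], ids[i + 1:]]).unsqueeze(0)   # drop token i
    with torch.no_grad():
        p_i = model(ablated).logits[0, -1].softmax(-1)[target_id].item()
    scores.append(p_full - p_i)                                # attribution = p - p_i

for tok, s in zip(tokenizer.convert_ids_to_tokens(ids.tolist()), scores):
    print(f"{tok:>10s}  {s:+.4f}")
```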
3.2 LIME (Local Interpretable Model-Agnostic Explanations)
Method: LIME
Idea: Fit a simple linear model locally around the input to approximate the model's
behavior.
Procedure:
- Generate perturbed inputs by randomly masking tokens
- Run model on perturbed inputs to get outputs
- Fit a linear surrogate \( g \) so that \( g(z') \approx f(z) \) for perturbed samples \( z \) in the neighborhood of \( x \), weighted by their proximity to \( x \)
- Linear coefficients = token importance
Pros: Model-agnostic, interpretable coefficients
Cons: Requires many model calls, local approximation may be inaccurate, random sampling
variability
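A LIME-style sketch using a ridge-regression surrogate over random binary masks, assuming GPT-2 and scikit-learn; the mask token and the proximity weighting are simplifications, and the lime package's LimeTextExplainer is the standard implementation.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# LIME-style sketch: random binary masks over tokens, proximity-weighted linear surrogate.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]
n = len(ids)
mask_id = tokenizer.eos_token_id            # GPT-2 has no mask token; EOS as a crude stand-in

with torch.no_grad():
    target_id = model(ids.unsqueeze(0)).logits[0, -1].argmax()

rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(200, n))   # 1 = keep token, 0 = replace with mask_id
masks[0] = 1                                # include the unperturbed input
ys, weights = [], []
for m in masks:
    perturbed = torch.tensor([int(t) if keep else mask_id for t, keep in zip(ids, m)])
    with torch.no_grad():
        p = model(perturbed.unsqueeze(0)).logits[0, -1].softmax(-1)[target_id].item()
    ys.append(p)
    weights.append(np.exp(-(n - m.sum()) / n))   # simple proximity weight, not LIME's exact kernel

surrogate = Ridge(alpha=1.0).fit(masks, ys, sample_weight=weights)
for tok, coef in zip(tokenizer.convert_ids_to_tokens(ids.tolist()), surrogate.coef_):
    print(f"{tok:>10s}  {coef:+.4f}")
```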
3.3 SHAP (Shapley Additive Explanations)
Method: SHAP
Idea: Use game-theoretic Shapley values to assign fair credit to each token.
Formula: For each token \( i \), compute contribution by averaging over all possible subsets:
\[
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! (|N| - |S| - 1)!}{|N|!} [f(S \cup \{i\}) - f(S)]
\]
Pros: Theoretically grounded (unique solution satisfying fairness axioms), consistent
Cons: Exponentially expensive to compute exactly (approximations needed), requires defining
"coalition" semantics for tokens
Critical Limitation: Additivity Assumption
Recent research (2024) shows that transformers cannot represent additive models due to their architecture (attention, LayerNorm). This casts doubt on the applicability of LIME and SHAP, which assume local additivity.
Implication: Use LIME/SHAP cautiously for transformers. Validate with other methods.
3.4 ReAGent (Replace with Alternatives)
Method: ReAGent
Idea: Replace each token with plausible alternatives (e.g., from a masked LM) rather than just
removing it.
Advantage: More natural perturbations than simple masking.
When to use: When you want to measure importance without creating out-of-distribution inputs.
4. Attention-Based Attribution
Can we use attention weights as attribution? The answer is complicated.
4.1 Raw Attention Weights
Problem: Attention is Not Explanation
In transformers, information from different tokens gets increasingly mixed across layers through
attention and residual connections. By the final layer, it's unclear which input tokens contributed to a
representation.
Raw attention weights show what the model attends to in a single layer, but they don't track information
flow from input to output across the full network.
When raw attention is useful:
- Single-layer analysis (e.g., "does this head attend to previous token?")
- Qualitative exploration
- Sanity checks (e.g., "is the model attending to padding?")
4.2 Attention Rollout
Method: Attention Rollout (Abnar & Zuidema, 2020)
Idea: Propagate attention weights through layers, accounting for residual connections.
Assumption: Token identities are linearly combined based on attention weights.
Formula: Roll out attention from layer \( \ell \) to input:
\[
\tilde{A}^{(\ell)} = A^{(\ell)} \cdot \tilde{A}^{(\ell-1)}
\]
where \( A^{(\ell)} \) is attention at layer \( \ell \), adjusted for residual connections.
Pros: Accounts for multi-layer information flow
Cons: Linear approximation may be inaccurate for nonlinear models
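A sketch of attention rollout for GPT-2: attention matrices are averaged over heads, mixed with the identity to account for the residual connection, row-renormalized, and multiplied layer by layer (the 0.5/0.5 residual weighting is the common convention, assumed here).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Attention-rollout sketch (Abnar & Zuidema, 2020): average heads, add the identity
# for the residual stream, renormalize rows, and multiply across layers.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids
with torch.no_grad():
    attentions = model(ids, output_attentions=True).attentions  # tuple of (1, heads, seq, seq)

rollout = torch.eye(ids.shape[1])
for layer_attn in attentions:
    a = layer_attn[0].mean(dim=0)                    # average over heads -> (seq, seq)
    a = 0.5 * a + 0.5 * torch.eye(a.shape[0])        # account for the residual connection
    a = a / a.sum(dim=-1, keepdim=True)              # renormalize rows
    rollout = a @ rollout                            # accumulate A^(L) ... A^(1)

# Row i of `rollout`: how much each input token contributes to position i's representation.
last_token_attr = rollout[-1]
for tok, score in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), last_token_attr.tolist()):
    print(f"{tok:>10s}  {score:.4f}")
```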
4.3 Attention Flow
Method: Attention Flow (Abnar & Zuidema, 2020)
Idea: Model the attention graph as a flow network and compute maximum flow from each
input token to the output.
Algorithm: Use max-flow algorithms (e.g., Ford-Fulkerson) with attention weights as edge
capacities.
Pros: Captures multi-path information flow
Cons: Computationally more expensive than rollout
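A sketch of attention flow using networkx (an added dependency, assumed here): the same residual-adjusted attention matrices become edge capacities in a layered graph, and max-flow is computed from each input token to the final position at the top layer.

```python
import networkx as nx
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Attention-flow sketch: residual-adjusted attention as edge capacities,
# max-flow from each input token to the last position at the top layer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids
with torch.no_grad():
    attns = model(ids, output_attentions=True).attentions
seq = ids.shape[1]

G = nx.DiGraph()
for l, layer_attn in enumerate(attns):
    a = layer_attn[0].mean(dim=0)                        # (seq, seq), rows = queries
    a = 0.5 * a + 0.5 * torch.eye(seq)                   # residual connection
    for i in range(seq):
        for j in range(seq):
            if a[i, j] > 0:
                G.add_edge((l, j), (l + 1, i), capacity=float(a[i, j]))

sink = (len(attns), seq - 1)                             # last position, top layer
flows = [nx.maximum_flow_value(G, (0, j), sink) for j in range(seq)]
for tok, f in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), flows):
    print(f"{tok:>10s}  {f:.4f}")
```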
Attention Rollout vs Attention Flow
| Aspect | Attention Rollout | Attention Flow |
| --- | --- | --- |
| Assumption | Linear mixing of identities | Flow network |
| Computation | Matrix multiplication (fast) | Max-flow algorithm (slower) |
| Multi-path | Averages paths | Finds bottlenecks |
| Use case | Quick approximate attribution | Detailed flow analysis |
Key finding: Both methods correlate better with ablation and gradient-based methods than raw
attention.
5. The Inseq Library: Practical Tool
Inseq is a Python library that makes attribution analysis
accessible for sequence generation models.
5.1 What Inseq Provides
- Unified API for gradient, perturbation, and attention methods
- Model support: GPT-2, GPT-NeoX, Llama, BLOOM, T5, BART, Marian MT (via HuggingFace)
- Methods supported:
- Gradient-based: Saliency, Input×Gradient, Integrated Gradients, DeepLIFT, GradientSHAP, UDIG
- Attention-based: Attention weights, value zeroing
- Perturbation-based: Occlusion, LIME, ReAGent
- High-level interface:
model.attribute(input_text, target_text, method="integrated_gradients")
- Visualization tools for displaying attributions
- Step-level attribution for generation (attribute each output token separately)
5.2 Example Usage
```python
import inseq

# Load a model together with an attribution method
model = inseq.load_model("gpt2", "integrated_gradients")

# Run attribution on the generated continuation
result = model.attribute(
    "The cat sat on the",
    n_steps=5,                      # forwarded to Integrated Gradients (interpolation steps)
    step_scores=["probability"]     # also record each generated token's probability
)

# Visualize token-level attributions
result.show()
```
5.3 Key Pedagogical Examples from Inseq
Example 1: Gender Bias in Machine Translation
Task: Translate "The doctor asked the nurse to help" from English to gendered language (e.g., Italian).
Question: Does the model assign gendered pronouns based on stereotypes?
Attribution reveals: Which input words drive gendered predictions ("doctor" vs "nurse").
Example 2: Factual Knowledge Localization in GPT-2
Input: "The capital of France is"
Output: "Paris"
Attribution reveals: The model relies heavily on "France" (expected) but may also draw strongly on "capital", showing it recognizes the prompt structure.
6. Comparing Attribution Methods
Different methods can give different results. How do we know which to trust?
6.1 Validation Techniques
1. Perturbation Test (Ground Truth)
Remove tokens with high attribution scores. The output should change more than removing low-attribution tokens.
Metric: Correlation between attribution scores and the output change when tokens are removed (see the code sketch at the end of this subsection).
2. Sanity Checks
- Random model test: Attributions should differ between trained and random models
- Random label test: Attributions should differ for different target labels
- Completeness test: Sum of attributions should match output difference (for IG)
3. Cross-Method Agreement
Check if multiple methods (e.g., IG, LIME, ablation) agree on which tokens are important. High agreement
increases confidence.
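A sketch of the perturbation test from technique 1: compute the drop in target probability when each token is deleted, then report the Spearman rank correlation with your attribution scores. The attribution list below is a hypothetical placeholder; substitute the output of any method from Sections 2-4.

```python
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Perturbation test sketch: do high-attribution tokens cause larger output drops
# when removed? Reports Spearman correlation between the two rankings.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def ablation_drops(ids, target_id):
    """Drop in target probability when each token is deleted (ground-truth proxy)."""
    with torch.no_grad():
        p_full = model(ids.unsqueeze(0)).logits[0, -1].softmax(-1)[target_id].item()
    drops = []
    for i in range(len(ids)):
        ablated = torch.cat([ids[:i], ids[i + 1:]]).unsqueeze(0)
        with torch.no_grad():
            p_i = model(ablated).logits[0, -1].softmax(-1)[target_id].item()
        drops.append(p_full - p_i)
    return drops

ids = tokenizer("The cat sat on the mat", return_tensors="pt").input_ids[0]
with torch.no_grad():
    target_id = model(ids.unsqueeze(0)).logits[0, -1].argmax()

# Hypothetical per-token scores; replace with the output of your attribution method.
attributions = [0.02, 0.35, 0.15, 0.08, 0.40, 0.12]
rho, pval = spearmanr(attributions, ablation_drops(ids, target_id))
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```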
6.2 Method Comparison Summary
| Method | Speed | Accuracy | Interpretability | LLM-Specific Issues |
| --- | --- | --- | --- | --- |
| Saliency | ⚡ Fast | ⚠️ Saturation | ✓ Clear | Unreliable in deep networks |
| Integrated Gradients | ⚠️ Slow | ✓✓ Good | ✓ Clear | Baseline-dependent, linear path |
| UDIG | ⚠️ Slow | ✓✓✓ Best | ✓ Clear | Designed for discrete inputs |
| Ablation | ⚠️⚠️ Very Slow | ✓✓✓ Ground truth | ✓✓ Very clear | Unnatural inputs (gaps) |
| LIME | ⚠️⚠️ Very Slow | ⚠️ Variable | ✓✓ Linear model | Additivity assumption violated |
| SHAP | ⚠️⚠️⚠️ Extremely Slow | ✓ Theoretically sound | ✓✓ Game-theoretic | Additivity assumption violated |
| Attention Rollout | ⚡⚡ Very Fast | ⚠️ Approximate | ⚠️ Hard to interpret | Linear approximation |
| Attention Flow | ⚡ Fast | ✓ Better than rollout | ⚠️ Hard to interpret | Flow network assumption |
6.3 Recommendations
- For quick exploration: Saliency or Attention Rollout
- For reliable attribution: Integrated Gradients (or UDIG for NLP)
- For validation: Ablation (ground truth)
- For publication: Use multiple methods, report agreement
- For transformers: Prefer gradient-based over LIME/SHAP (additivity issues)
7. Technical Challenges and Solutions
7.1 Gradient Saturation
Problem: Gradients near zero despite input importance.
Solutions:
- Use Integrated Gradients (accumulates over path)
- Try Input×Gradient (scales by input magnitude)
- For circuit discovery: Use EAP-GP (adaptive path avoids saturated regions)
7.2 Computational Cost
Problem: IG requires ~50-300 forward passes; SHAP/LIME even more.
Solutions:
- Use fewer interpolation steps (trade accuracy for speed)
- Attribute only critical tokens (e.g., content words, not punctuation)
- Use faster methods (saliency, attention) for exploration, then validate with IG
- Parallelize across GPUs
7.3 Long Sequences
Problem: Attributing 1000+ token inputs is expensive and overwhelming.
Solutions:
- Aggregate attributions (e.g., by sentence, by entity)
- Focus on specific output tokens of interest
- Use sliding windows for local context
7.4 Discrete vs Continuous Inputs
Problem: Text is discrete, but IG assumes continuous interpolation.
Solutions:
- Use UDIG (nonlinear path respecting discrete structure)
- Apply IG to embeddings (continuous) but interpret results carefully
- Consider perturbation methods (naturally discrete)
8. Applications to Your Research
8.1 Debugging Model Predictions
Scenario: Your model incorrectly predicts subject-verb agreement.
Input: "The key to the cabinets are" (incorrect, should be "is")
Use attribution to investigate:
- Run IG to see which input tokens influenced "are" prediction
- If attribution is high on "cabinets" (distractor), model is attending to wrong noun
- Guides intervention: Need to strengthen subject tracking (relate to Week 5 circuits)
8.2 Analyzing Your Concept
For your project concept, attribution helps answer:
- Which input features trigger your concept? (e.g., for "politeness," which words are most
important?)
- Are there spurious correlations? (e.g., does "musical key" attribution rely on irrelevant
cues?)
- How does context matter? (e.g., does the concept depend on distant words?)
8.3 Guiding Circuit Discovery
Workflow combining Week 5 and Week 9:
- Attribution (Week 9): Identify which input tokens matter for your concept
- Attention analysis: Which heads attend to those important tokens?
- Path patching (Week 5): Test if those heads causally compute the concept
- Validation (Week 8): Use IIA to confirm the circuit
9. Integration with Previous Weeks
Connecting Attribution to Other Methods
| Previous Week | How Attribution Helps |
| --- | --- |
| Week 5: Circuits | Use attribution to identify which input tokens activate your circuit. Example: if a circuit computes induction, attribution should show high scores on the repeated token. |
| Week 6: Probes | Compare probe predictions with attribution. If a probe detects a feature, attribution should show which inputs provided that information. Example: a probe detects "plural subject" → attribution should highlight the plural noun. |
| Week 7: SAEs | For an SAE feature representing your concept, use attribution to see which inputs activate it. Example: a "politeness feature" activates → attribution reveals which politeness markers (e.g., "please," "thank you") trigger it. |
| Week 8: Causal Validation | Attribution ≠ causation! Validate high-attribution tokens with interventions. Example: attribution says "France" is important → change "France" to "Italy" and verify the output changes to "Rome." |
10. Limitations and Skepticism
Critical Limitations to Remember
- Correlation ≠ Causation: High attribution doesn't prove a token is necessary.
Always validate with interventions (Week 8).
- Additivity assumptions: LIME and SHAP assume local additivity, which transformers violate.
Results may be misleading.
- Baseline dependence: IG results change with baseline choice. Report your baseline and justify
it.
- Attention ≠ Explanation: Raw attention weights are insufficient for full attributions due to
information mixing.
- Out-of-distribution perturbations: Removing tokens creates unnatural inputs. Model behavior
on these may not reflect normal operation.
- Single-path assumptions: Standard IG uses a straight line, which may not capture the true
path through representation space.
When Attribution Fails
- Highly distributed concepts: If your concept depends on complex interactions of many tokens,
attribution may not isolate individual contributions clearly.
- Nonlinear interactions: Token A + Token B together matter, but individually they don't. Linear
attribution methods will underestimate their importance.
- Internal reasoning: If the model performs multi-step reasoning, input attribution alone won't
reveal the internal steps (need circuit analysis).
11. Best Practices for Your Research
Checklist for Using Attribution in Your Project
- ✓ Use multiple methods and report agreement
- ✓ Validate with ablation (ground truth test)
- ✓ Run sanity checks (random model, random label)
- ✓ Justify baseline choice (for IG)
- ✓ Report computational costs
- ✓ Visualize attributions clearly
- ✓ Combine with causal validation (Week 8 IIA)
- ✓ Aggregate for long sequences
- ✓ Don't over-interpret (attribution shows correlation, not causation)
- ✓ Compare to human intuition (does attribution make sense?)
Part 2: Skepticism and Attribution Validation (50% of content)
Critical Reality Check
Attribution methods produce compelling visualizations. Saliency maps highlight features. Attention weights show where
the model "looks." But do these explanations actually reflect how the model makes decisions? This section examines
evidence that attribution methods can be misleading—and how to validate them rigorously.
1. Sanity Checks for Saliency Maps (Adebayo et al., 2018)
The Problem: Visual Plausibility ≠ Correctness
Saliency maps often highlight seemingly meaningful regions—faces in images, important words in text. But are these
explanations faithful to the model's computation?
Adebayo et al. (2018) proposed simple sanity checks to test whether attribution methods are
actually sensitive to the model and data they're supposed to explain.
Sanity Check 1: Model Parameter Randomization
Test: Randomly re-initialize the model's weights.
Expected behavior: A good explanation method should produce completely different
explanations for a trained model vs. a random model.
Why it matters: If explanations look the same for both, the method is just detecting network
architecture or input patterns, not what the model learned.
Example Application to LLMs:
```python
import numpy as np
from transformers import GPT2Config, GPT2LMHeadModel

# Compare attributions from a trained model and a randomly initialized one
trained_model = GPT2LMHeadModel.from_pretrained("gpt2")
random_model = GPT2LMHeadModel(GPT2Config())   # same architecture, random weights

# `text` is your input; compute_attribution stands for whatever method you are testing
attribution_trained = compute_attribution(trained_model, text)
attribution_random = compute_attribution(random_model, text)

# They should be substantially different!
assert not np.allclose(attribution_trained, attribution_random)
```
Sanity Check 2: Data Randomization
Test: Train the model on data with random labels.
Expected behavior: Explanations should look different for a model trained on meaningful data vs.
random noise.
Why it matters: A model that memorized noise has fundamentally different internal mechanisms than
one that learned patterns.
Key Findings from Adebayo et al.
- Guided Backpropagation FAILS: Produces similar-looking visualizations for trained and random
models
- Guided GradCAM FAILS: Primarily does edge detection, regardless of what the model learned
- Simple Gradients PASS: Consistently distinguish trained from random models
- Integrated Gradients PASS: Changes appropriately with model changes
Implication for Your Project:
Before trusting any attribution visualization, run these sanity checks. If your method fails, you're seeing artifacts
of the attribution algorithm, not insights into your model.
2. Attention Is Not (Necessarily) Explanation (Jain & Wallace, 2019)
The Intuition vs. The Evidence
Attention mechanisms are everywhere in modern NLP. It's tempting to interpret attention weights as showing which input
tokens are "important" for predictions. But Jain & Wallace (2019) systematically tested this
assumption.
Three Tests of Attention as Explanation
Test 1: Correlation with Feature Importance
Compare attention weights with gradient-based measures of actual feature importance.
Result: Often uncorrelated. High attention ≠ high importance for prediction.
Test 2: Counterfactual Attention Distributions
Can you construct different attention patterns that produce identical outputs?
Result: Yes! Multiple contradictory attention distributions often yield the same prediction.
Implication: If multiple explanations are equally valid, which one is "true"?
Test 3: Adversarial Attention
Optimize attention weights to be maximally different while keeping outputs unchanged.
Result: Possible for many tasks. Attention can be manipulated without affecting predictions.
When Can You Trust Attention?
| Use Case | Trustworthy? | Reason |
| --- | --- | --- |
| Single-layer analysis | ✓ Moderate | Shows what that layer attends to (but not why) |
| Debugging attention patterns | ✓ Yes | Useful for finding bugs (attending to padding, etc.) |
| Input importance for output | ✗ No | Use attribution methods instead |
| Explanation for end users | ✗ Risky | May be unfaithful; validate first |
Better Alternatives
- Attention Rollout: Propagate attention through layers (see Week 6 Part 1)
- Attention Flow: Model as flow network
- Integrated Gradients: More faithful to actual importance
- Causal Interventions: Test by ablating attended tokens
3. The ROAR Benchmark: Quantifying Attribution Quality (Hooker et al., 2019)
The Challenge
We have many attribution methods (gradients, IG, SHAP, LIME, attention). How do we know which ones actually identify
important features? Visual inspection is subjective and prone to confirmation bias.
ROAR: RemOve And Retrain
Hooker et al. (2019) proposed a quantitative benchmark:
The ROAR Protocol
- Train a model to convergence
- Use an attribution method to identify most important features
- Remove those features from the training data
- Retrain the model from scratch without those features
- Measure the drop in performance
Logic: If the method truly identifies important features, removing them should hurt performance
more than removing random features.
Shocking Results
- Many popular methods perform no better than random feature selection
- Even removing 90% of pixels (chosen by importance) still allows ~64% accuracy on ImageNet
- Only ensemble methods (VarGrad, SmoothGrad-Squared) consistently beat random
Conclusion: Just because an attribution method produces a plausible-looking heatmap doesn't mean
it's identifying features that actually matter.
Adapting ROAR for LLMs
For text models, adapt ROAR as follows:
```python
# ROAR adapted for text attribution (protocol sketch):
# 1. Compute attribution scores for your dataset
# 2. Identify the top-k most important tokens per example
# 3. Mask or remove those tokens from the training data
# 4. Fine-tune (or retrain) the model without those tokens
# 5. Measure the performance drop

# Compare against removing random tokens
drop_with_important_removed = baseline_acc - roar_acc
drop_with_random_removed = baseline_acc - random_acc

# A good attribution method removes more useful information than random removal:
assert drop_with_important_removed > drop_with_random_removed
```
Cheaper Alternative: Perturbation Test
Full ROAR requires retraining (expensive). A faster proxy:
- Remove high-attribution tokens from test inputs
- Measure output change
- Compare to removing random tokens
- Not as rigorous as ROAR, but much faster
4. Validation Checklist for Attribution Methods
Before trusting attribution results, verify:
Mandatory Checks
- ✓ Sanity checks (Adebayo):
- Random model test: attributions for a randomly initialized model differ from those for the trained model
- Random label test: attributions for a model trained on random labels differ from those for the properly trained model
- ✓ Perturbation validation:
- Removing high-attribution features hurts performance
- More than removing random features
- Effect size is substantial
- ✓ Method agreement:
- Compare at least 2-3 independent attribution methods
- If they disagree, investigate why
- Don't cherry-pick the method that gives desired results
- ✓ Baseline comparison:
- Beats random feature selection
- Beats frequency-based heuristics
- Compare simple (gradients) vs complex (IG, SHAP) methods
- ✓ Completeness test (for IG):
- Sum of attributions ≈ f(x) - f(baseline)
- If not, something is wrong with implementation or baseline
5. Common Pitfalls and How to Avoid Them
Pitfall 1: Baseline Dependence (Integrated Gradients)
Problem: IG results can change dramatically with baseline choice.
Solution:
- Test multiple baselines (zero embeddings, padding, average embeddings)
- Report which baseline you used and why
- If results flip with baseline change, interpretation is fragile
Pitfall 2: Additivity Assumptions (LIME, SHAP)
Problem: Transformers violate local additivity due to attention and LayerNorm.
Solution:
- Use LIME/SHAP cautiously for transformers
- Prefer gradient-based methods
- Validate LIME/SHAP results with ablation
Pitfall 3: Out-of-Distribution Perturbations
Problem: Removing tokens creates unnatural inputs. Model behavior may not reflect normal operation.
Solution:
- Use ReAGent (replace with alternatives) instead of masking
- Consider in-distribution perturbations
- Report that perturbations are artificial
Pitfall 4: Confirmation Bias
Problem: Seeing what you expect in attribution maps.
Solution:
- Pre-register predictions before running attribution
- Show examples where attribution gives unexpected results
- Have collaborators blind-evaluate attributions
6. Best Practices for Rigorous Attribution Research
Do's
- ✓ Run sanity checks before trusting any method
- ✓ Use multiple independent attribution methods
- ✓ Validate with perturbation or ROAR tests
- ✓ Report baseline choice and justify it
- ✓ Show when attribution fails or gives unexpected results
- ✓ Combine attribution with causal intervention (Week 4)
- ✓ Report computational cost
Don'ts
- ✗ Trust visualizations without validation
- ✗ Use only one attribution method
- ✗ Rely solely on attention weights
- ✗ Cherry-pick examples that look good
- ✗ Ignore baseline dependence
- ✗ Skip sanity checks to save time
- ✗ Over-interpret correlation as causation
7. Integration with Your Research Project
Validating Your Concept Attribution
When attributing input importance for your concept:
Week 6 Validation Protocol
- Choose methods: Pick 2-3 attribution methods (e.g., IG, gradients, ablation)
- Run sanity checks:
- Test on random model
- Verify completeness (for IG)
- Compute attributions: For your concept-relevant predictions
- Validate:
- Do methods agree on top features?
- Perturbation test: removing features changes output
- Compare to random baseline
- Causal integration:
- Combine with Week 4 patching results
- Do high-attribution tokens align with causally important components?
- Report honestly:
- Show agreement and disagreement between methods
- Document failure cases
- Acknowledge limitations
8. Looking Ahead: More Skepticism in Week 10
This week introduced validation for attribution methods. But interpretability illusions go deeper:
- Week 7 (SAEs): Can sparse features be adversarially fragile?
- Week 8 (Circuits): How sensitive are circuit findings to methodological choices?
- Week 10 (Full Skepticism): Comprehensive study of when interpretability methods fail
The validation framework from Week 4, combined with attribution skepticism from this week, prepares you to critically
evaluate any interpretability claim.
Summary: Attribution + Skepticism
Part 1 Takeaways (Attribution Methods)
- Input attribution identifies which inputs matter for outputs
- Gradient-based methods (IG, UDIG) are reliable when validated
- Perturbation methods (ablation) provide ground truth
- Attention-based methods are fast but approximate
- Inseq library makes attribution accessible
Part 2 Takeaways (Skepticism)
- Visual plausibility ≠ correctness (Adebayo sanity checks)
- Attention weights are not reliable explanations (Jain & Wallace)
- Many methods perform no better than random (ROAR benchmark)
- Always validate with multiple independent tests
- Combine attribution with causal validation
For Your Project
- Use attribution to identify important inputs for your concept
- Run full validation protocol (sanity checks + perturbation + agreement)
- Integrate with Week 4 causal intervention results
- Report failures and limitations honestly
- Build toward rigorous, validated interpretability
References for Part 2 (Skepticism)
Core Papers
- Adebayo et al. (2018): "Sanity Checks for Saliency Maps." NeurIPS. arXiv:1810.03292
- Jain & Wallace (2019): "Attention is not Explanation." NAACL. arXiv:1902.10186
- Hooker et al. (2019): "A Benchmark for Interpretability Methods in Deep Neural Networks."
NeurIPS. arXiv:1806.10758
Nuance and Follow-ups
- Wiegreffe & Pinter (2019). "Attention is not not Explanation." EMNLP. arXiv:1908.04626 [Defense of attention under certain conditions]
- Jacovi & Goldberg (2020). "Towards Faithfully Interpretable NLP Systems." ACL. [Framework for faithfulness]
12. Summary and Next Steps
Key Takeaways
- Input attribution identifies which inputs matter for outputs—complementary to circuit/probe/SAE
analysis
- Gradient-based methods (IG, UDIG) are reliable for LLMs when used carefully
- Perturbation-based methods (ablation) provide ground truth but are computationally expensive
- Attention-based methods (rollout, flow) are fast but approximate
- Inseq library makes attribution accessible for sequence generation models
- Always validate attributions with multiple methods and causal interventions
- Critical limitations: Baseline dependence, additivity assumptions, correlation ≠ causation
For Your Research Project
- Use attribution to identify which input features activate your concept
- Combine with circuit analysis (Week 5) to understand how those inputs are processed
- Validate findings with causal interventions (Week 8)
- Report attribution results in your paper with appropriate caveats
Looking Ahead
Week 10: Skepticism and interpretability illusions—when interpretability methods mislead us, and
how to be rigorous.
References & Resources
Core Papers
- Inseq Library: Sarti et al. (2023). "Inseq: An Interpretability Toolkit for Sequence
Generation Models." arXiv:2302.13942
- Integrated Gradients: Sundararajan et al. (2017). "Axiomatic Attribution for Deep Networks."
ICML. arXiv:1703.01365
- UDIG: "Uniform Discretized Integrated Gradients" (2024). arXiv:2412.03886
- Attention Flow: Abnar & Zuidema (2020). "Quantifying Attention Flow in Transformers." ACL. ACL Anthology
- LIME: Ribeiro et al. (2016). "Why Should I Trust You?" KDD. arXiv:1602.04938
- SHAP: Lundberg & Lee (2017). "A Unified Approach to Interpreting Model Predictions." NIPS. arXiv:1705.07874
- Baseline Selection: Sturmfels et al. (2020). "Visualizing the Impact of Feature Attribution
Baselines." Distill. distill.pub
Tools & Libraries
Supplementary Reading
- Attention is not Explanation: Jain & Wallace (2019). arXiv:1902.10186
- Attention is not not Explanation: Wiegreffe & Pinter (2019). arXiv:1908.04626
- Sanity Checks for Saliency Maps: Adebayo et al. (2018). NeurIPS. arXiv:1810.03292
- Transformers Can't Represent Additive Models: Recent research on LIME/SHAP limitations for transformers (2024).
In-Class Exercise: Which Words Make a Pun Punny?
Using attribution methods, we will identify which input tokens are most important for the model's
"pun recognition"—helping us understand what linguistic cues trigger humor processing.
Part 1: Attribution Setup (15 min)
Prepare for attribution analysis on your pun dataset:
- Select examples: Choose 10 puns where the model shows clear pun recognition (from your probe analysis)
- Define target: Use your pun probe's output as the attribution target
- Which inputs increase the "pun" score?
- Choose methods: We will compare Integrated Gradients, Input×Gradient, and Attention Rollout
Part 2: Compute and Compare Attributions (25 min)
Apply multiple attribution methods and analyze agreement:
- Run integrated gradients: Compute token-level attributions for each pun
- Visualize heatmaps: Which words have highest attribution?
- Is it the punchline word (double meaning)?
- Is it the setup words that create the context?
- Do both parts contribute?
- Compare methods:
- Do IG and Input×Gradient agree on important tokens?
- How does Attention Rollout compare?
- Compute correlation between method rankings
- Aggregate patterns:
- Across all puns, what types of tokens are important?
- Are punchlines always critical, or does context matter more?
Part 3: Validate with Interventions (20 min)
Test whether attribution accurately identifies important tokens:
- Ablate high-attribution tokens:
- Remove or mask the top-3 highest-attribution tokens
- Does the pun probe score drop significantly?
- Ablate low-attribution tokens:
- Remove tokens with low attribution
- Pun recognition should remain relatively stable
- Compute correlation:
- Plot: attribution score vs. effect of ablating that token
- If correlation is low, attribution may not be faithful
Discussion: What have you learned about how the model "sees" puns?
Are the important tokens what you would expect, or are there surprises?
Open Pun Attribution Notebook in Colab
Project Milestone
Due: Thursday of Week 6
Apply attribution methods to identify which input tokens matter most for your concept.
Use sanity checks to validate that your attribution methods are meaningful and not artifacts.
Attribution Analysis
- Apply multiple attribution methods:
- Gradient-based: Input gradients, integrated gradients, or gradient × input
- Perturbation-based: Ablation or occlusion
- Attention-based: Attention rollout or attention flow (with caution)
- Identify important tokens:
- Which input tokens have highest attribution scores for your concept?
- Do different methods agree on important tokens?
- Visualize attributions across multiple examples
- Run sanity checks:
- Random model baseline: do attributions change substantially when weights are randomized?
- Random input baseline: do attributions change appropriately with random inputs?
- Cascading randomization: test layers systematically
- Validate with interventions:
- Ablate high-attribution tokens: does model behavior change as predicted?
- Ablate low-attribution tokens: behavior should remain stable
- Compare attribution rankings with intervention effect sizes
Deliverables:
- Attribution visualizations:
- Heatmaps showing token attributions for 10-15 examples
- Comparison across different attribution methods
- Summary of which tokens are consistently important
- Sanity check results:
- Results from all sanity check tests
- Assessment: do your attributions pass sanity checks?
- Validation results:
- Correlation between attribution scores and ablation effects
- Examples where attribution predictions match/mismatch interventions
- Interpretation:
- Which tokens matter most for your concept?
- Are attributions trustworthy for your concept?
- What does this reveal about how models process your concept?
- Code: Notebook with attribution methods, sanity checks, and validation
Attribution methods can be unreliable—always validate with sanity checks and interventions before
drawing conclusions about which inputs matter.