Week 10: Human Understanding & Self-Description
Learning Objectives
- Understand the human-AI knowledge gap and why it matters
- Distinguish active vs passive knowledge transfer from superhuman AI
- Apply teachability and novelty criteria to evaluate discovered concepts
- Design human studies to validate interpretability findings
- Implement concept discovery methods (convex optimization)
- Measure human improvement from AI exposure
- Build concept taxonomies and organize your findings
- Validate your project's interpretability claims
- Assess knowledge transfer potential of your concept
Required Readings
- Shin et al. (2023): "Superhuman Artificial Intelligence Can Improve Human Decision Making by Increasing Novelty." PNAS.
- Schut et al. (2023/2025): "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero." PNAS.
Supplementary Readings
- See the Supplementary Reading list under References & Resources.
1. The Human-AI Knowledge Gap
What Is the Knowledge Gap?
Modern AI systems achieve superhuman performance on many tasks: AlphaZero plays chess far beyond the level of any human champion,
AlphaGo defeated the world's best Go players, and large language models process text at scales no human can match.
But a fundamental question remains: Do these AI systems know something we don't?
The Knowledge Gap Problem:
AI systems learn from vast amounts of self-play or data, exploring strategy spaces far beyond human experience.
They may discover novel concepts, strategies, or patterns that humans haven't recognized. If
these discoveries remain hidden inside neural networks, we lose the opportunity to:
- Improve human expertise in the domain
- Validate that AI reasoning aligns with human values
- Understand failure modes and limitations
- Transfer insights across domains
Two Modes of Learning from Superhuman AI
This week explores two complementary approaches to bridging the knowledge gap:
| Aspect | Passive Learning (Shin et al.) | Active Teaching (Schut et al.) |
|---|---|---|
| Mechanism | Humans observe AI play and spontaneously adopt novel strategies | Explicitly extract concepts from AI, filter, and teach humans |
| Scale | Population-level (all players improve) | Individual-level (train specific experts) |
| Evidence | 5.8M Go moves over 71 years | 4 grandmasters, controlled study |
| Control | Observational (AI exists in environment) | Interventional (deliberate concept transfer) |
| Novelty role | Novel moves correlate with improvement | Novelty is a filter criterion |
2. Passive Learning: The Go Revolution (Shin et al., 2023)
Paper: "Superhuman Artificial Intelligence Can Improve Human Decision Making by Increasing Novelty"
Authors: Minkyu Shin, Jin Kim, Bas van Opheusden, Tom Griffiths
Published: PNAS, 2023
Key contribution: Quantitative evidence that superhuman AI improves human decision-making at scale
The Natural Experiment
In 2016, AlphaGo defeated Lee Sedol, the world champion Go player. This marked a turning point:
AI became superhuman in a game with more possible board states than atoms in the universe.
Shin et al. asked: Did human players improve after witnessing superhuman AI?
Methodology
Data:
- 5.8 million moves by professional Go players
- 71 years of gameplay (1950-2021)
- Before vs After AlphaGo comparison
Evaluation:
- Used a superhuman AI to evaluate move quality
- Compared win rates of actual moves vs AI-suggested alternatives
- Generated 58 billion counterfactual game patterns
Novelty metric:
- Classified moves as novel (never seen before in professional play) or traditional
- Tracked novelty rates and quality over time
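As a rough illustration, here is a minimal sketch of one way to compute a novelty rate of this kind, assuming moves are recorded as (position, move) pairs; the toy records and the encoding are hypothetical, not the paper's actual pipeline.

```python
from collections import defaultdict

def novelty_rates(games):
    """Fraction of moves per year never seen before in earlier professional play.

    `games` is assumed to be an iterable of (year, [(position, move), ...]) records
    in chronological order (a simplified stand-in for the real Go archives).
    """
    seen = set()                       # all (position, move) pairs observed so far
    novel = defaultdict(int)
    total = defaultdict(int)
    for year, moves in games:
        for pos, move in moves:
            total[year] += 1
            if (pos, move) not in seen:
                novel[year] += 1       # first appearance of this move in this position
                seen.add((pos, move))
    return {year: novel[year] / total[year] for year in total}

# Toy usage with made-up records
games = [(2015, [("p1", "a"), ("p2", "b")]),
         (2017, [("p1", "a"), ("p1", "c"), ("p3", "d")])]
print(novelty_rates(games))            # {2015: 1.0, 2017: 0.666...}
```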
Key Findings
1. Humans improved significantly after AlphaGo
Human decision quality increased measurably after 2016. Players made moves closer to what superhuman AI would
choose.
2. Novel moves drove improvement
- Novel moves occurred more frequently after AlphaGo
- Novel moves became associated with higher decision quality
- Before AlphaGo: novelty ≠ better; After AlphaGo: novelty = better
3. Breaking from tradition works
Superhuman AI prompted players to explore beyond traditional strategies. This wasn't just
imitation—players discovered their own novel approaches, inspired by AI's creativity.
Example: Move 37
The most famous move in AI history:
In Game 2 of the AlphaGo vs Lee Sedol match, AlphaGo played Move 37—placing a stone on the 5th
line from the edge. Human commentators were shocked: "I thought it was a mistake."
This move violated centuries of Go wisdom. Yet it was exceptionally strong. Move 37 demonstrated
that AI had explored beyond human knowledge.
After witnessing this, professional players began experimenting with similar unconventional moves, many of which
proved effective.
Implications
- Superhuman AI can teach without explicit instruction: Simply existing in the environment
inspires humans to explore
- Novelty is learnable: Humans can recognize and adopt AI-discovered innovations
- Traditional knowledge isn't optimal: Conventions may persist due to lack of exploration, not
superiority
- Population-level effects: Impact extends beyond direct AI users to entire communities
3. Active Teaching: Extracting Chess Concepts (Schut et al., 2023/2025)
Paper: "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero"
Authors: Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim
Published: arXiv 2023, PNAS 2025
Key contribution: Systematic method to extract, filter, and teach novel concepts from superhuman
AI
The Challenge
Unlike Shin et al.'s observational study, Schut et al. sought to actively extract knowledge from
AlphaZero and transfer it to human experts through deliberate training.
Key questions:
- What concepts has AlphaZero learned that humans haven't?
- Can we extract these concepts in human-understandable form?
- Are they teachable to even the world's best players?
Methodology: Four-Stage Pipeline
Stage 1: Concept Discovery
Use convex optimization to find concept vectors in AlphaZero's latent space.
How it works:
- Collect chess positions where a concept might be relevant
- Extract AlphaZero's internal representations (activations) for these positions
- Find a direction in activation space that separates concept-present from concept-absent
positions
- This direction is the Concept Activation Vector (CAV)
Technical detail:
Train a linear classifier to distinguish concept examples from random counterexamples. The vector orthogonal to
the decision boundary is the CAV.
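A minimal sketch of this recipe with scikit-learn, using synthetic stand-ins for the network activations; Schut et al. use a convex optimization formulation tailored to AlphaZero rather than plain logistic regression, so treat this as the simplest member of the family.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for network activations: 100 concept-present and 100 random positions,
# each a 256-dimensional hidden-layer activation vector.
acts_concept = rng.normal(loc=0.5, size=(100, 256))
acts_random = rng.normal(loc=0.0, size=(100, 256))

X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 100 + [0] * 100)

# Linear classifier separating concept examples from random counterexamples.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the (normalized) normal vector of the decision boundary.
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Concept score of a new position = projection of its activation onto the CAV.
new_activation = rng.normal(size=256)
print("concept score:", float(new_activation @ cav))
```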
Stage 2: Filtering by Teachability
Not all extracted concepts are learnable by humans or other agents. Teachability test:
Definition: A concept is teachable if another AI agent (student) can learn it from
examples.
Procedure:
- Generate prototype positions that strongly activate the concept vector
- Train a student AI on these prototypes
- Test if the student can generalize the concept to new positions
- Measure improvement: Does the student now select the same moves as AlphaZero?
Why this matters: If even an AI student can't learn the concept, a human likely can't either.
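A functionally-grounded proxy for this teachability test, sketched with a small scikit-learn "student" trained on synthetic prototype activations; the paper's actual student is an AlphaZero-style agent evaluated on move agreement, not a classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Synthetic stand-ins: activations of prototype positions that strongly express
# the concept (label 1) and of matched control positions (label 0).
prototypes = rng.normal(loc=1.0, size=(200, 64))
controls = rng.normal(loc=0.0, size=(200, 64))
X = np.vstack([prototypes, controls])
y = np.array([1] * 200 + [0] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# "Student" model trained only on the prototype examples.
student = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
student.fit(X_train, y_train)

# Teachability proxy: does the student generalize the concept to unseen positions?
print("held-out accuracy:", student.score(X_test, y_test))
```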
Stage 3: Filtering by Novelty
A teachable concept might still be something humans already know. Novelty test:
Definition: A concept is novel if it's not present in human chess games.
Procedure:
- Collect a large database of human chess games
- Test if the concept vector activates on human games
- Low activation → concept is novel to humans
- High activation → concept already exists in human play
Why this matters: We want to teach humans something new, not rediscover known
principles.
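A minimal sketch of the novelty filter under the same synthetic setup: compare how strongly the concept vector activates on AlphaZero positions versus a human game database. All data and the threshold below are illustrative choices, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed inputs: a unit-norm concept vector (CAV) plus activation matrices for
# AlphaZero positions and for a human game database (all synthetic here).
cav = rng.normal(size=64)
cav /= np.linalg.norm(cav)
acts_alphazero = rng.normal(size=(500, 64)) + 2.0 * cav   # concept strongly present
acts_human = rng.normal(size=(500, 64))                   # concept largely absent

score_az = acts_alphazero @ cav        # per-position concept activation
score_human = acts_human @ cav

# Heuristic check: the concept fires in AlphaZero play but rarely in human games.
threshold = np.percentile(score_az, 50)
human_rate = float((score_human > threshold).mean())
print(f"human positions above the AlphaZero median activation: {human_rate:.1%}")
# A low rate is evidence (by this proxy) that the concept is novel to human play.
```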
Stage 4: Human Validation Study
The ultimate test: Can world chess champions learn these concepts?
Study design:
Participants: 4 top chess grandmasters (all former or current world champions)
Procedure:
- Pre-test: Grandmasters solve concept prototype positions (no instruction)
- Learning phase: Grandmasters study the positions with explanations
- Post-test: Grandmasters solve new positions involving the same concepts
Measurement: Improvement from pre-test to post-test
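A minimal sketch of how such a pre/post improvement could be analyzed, using a nonparametric paired test; the accuracy values are hypothetical, not the study's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical pre/post accuracies for four participants (fraction of concept
# positions solved correctly); with n = 4 this is illustrative only.
pre = np.array([0.40, 0.35, 0.45, 0.50])
post = np.array([0.65, 0.55, 0.70, 0.60])

improvement = post - pre
stat, p = wilcoxon(post, pre, alternative="greater")   # nonparametric paired test
print("mean improvement:", improvement.mean())
print("one-sided Wilcoxon p-value:", p)
# With samples this small, report per-participant deltas and effect sizes,
# not just a p-value.
```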
Key Findings
All four grandmasters improved after the learning phase.
This demonstrates that:
- AlphaZero encodes knowledge beyond existing human understanding
- This knowledge is not beyond human grasp—it can be learned
- Even elite experts can learn from AI
Types of concepts discovered:
- Quiet moves to provoke long-term weaknesses (counterintuitive to classical principles)
- Strategic queen sacrifices in unexpected contexts
- Positional imbalances that traditional evaluation undervalues
Example Concept
Concept: "Provoke Weakness Through Quiet Moves"
Classical chess principle: Active, forcing moves (checks, captures) are strong.
AlphaZero's insight: Sometimes a quiet, non-forcing move creates zugzwang—opponent must
worsen their position.
Why it's novel: Human games rarely show this pattern (low activation on human database).
Why it's teachable: Student AI improved 15% on test positions after training.
Human validation: Grandmasters improved from 40% to 65% accuracy on these positions after
studying examples.
4. Comparing the Two Approaches
| Dimension | Shin et al. (Passive) | Schut et al. (Active) |
|---|---|---|
| Research Question | Do humans improve from AI exposure? | Can we extract and teach AI concepts? |
| Scale | Population (thousands of players) | Individual (4 grandmasters) |
| Evidence Type | Observational (71 years) | Experimental (controlled study) |
| Novelty Role | Outcome (novel moves = better) | Filter (select novel concepts) |
| Teaching Method | Implicit (watch AI play) | Explicit (study concept prototypes) |
| Interpretability | Black box (don't know what changed) | White box (extract specific concepts) |
| Measurement | Move quality via AI evaluation | Accuracy on concept test positions |
| Strength | Ecological validity (real-world effect) | Causal clarity (know what's taught) |
Complementary Insights
Together, these papers show that superhuman AI can improve humans through:
- Inspiration (Shin): Seeing AI play prompts exploration of novel strategies
- Instruction (Schut): Explicitly teaching extracted concepts
Both require novelty—humans learn by going beyond traditional knowledge.
5. Designing Human Studies for Interpretability
How do you validate that your interpretability findings are meaningful? Human studies provide the gold standard.
Three Types of Evaluation (Doshi-Velez & Kim, 2017)
| Type | When to Use | Example |
|---|---|---|
| Application-Grounded | Real-world task with domain experts | Schut's grandmaster study |
| Human-Grounded | Simplified task with lay users | Show concept examples, ask "Does this make sense?" |
| Functionally-Grounded | No humans, use proxy metrics | Teachability test (AI student learns) |
Key Considerations for Study Design
1. Who are your participants?
- Domain experts (like grandmasters): High validity, but expensive and small sample
- Lay users: Larger sample, but may lack necessary background
- Trade-off: Schut used 4 experts; Shin used population-level data
2. What do you measure?
- Performance: Can they solve tasks better? (Schut: accuracy on test positions)
- Understanding: Can they explain the concept? (Free-response questions)
- Trust: Do they believe the explanation? (Likert scales)
- Warning: Trust ≠ utility (Bansal et al., Week 10)
3. Control conditions?
- Pre-post design: Measure before and after learning (Schut)
- Comparison group: Concept group vs baseline explanation
- Temporal controls: Before vs after AI release (Shin)
4. Avoiding pitfalls (from Week 10):
- Don't cherry-pick examples that support your hypothesis
- Use multiple evaluation metrics (accuracy, confidence, explanation quality)
- Test robustness: Do concepts transfer to new contexts?
- Report negative results: Which concepts weren't learnable?
6. Building a Concept Atlas
As you discover concepts in your project, you need to organize them systematically. A concept
atlas maps the landscape of what your model knows.
Why Build an Atlas?
- Locate your concept: How does your project concept relate to known concepts?
- Avoid redundancy: Is your "novel" concept actually a rediscovery?
- Find relationships: Do concepts compose, interfere, or act as prerequisites for one another?
- Assess coverage: What fraction of model behavior do identified concepts explain?
Organizing Dimensions
1. Abstraction Level
- Low-level: "Edges," "colors," "individual words"
- Mid-level: "Object parts," "syntactic roles," "local patterns"
- High-level: "Objects," "semantic concepts," "strategic plans"
2. Domain Specificity
- Universal: Position, number, negation (across tasks)
- Domain-specific: "Zugzwang" (chess), "ko threat" (Go)
3. Complexity
- Atomic: Single feature
- Compositional: Combination of concepts (e.g., "quiet move that provokes weakness")
4. Novelty (from Schut)
- Known: Present in human data
- Novel: Absent from human data
5. Teachability (from Schut)
- Teachable: Student AI or humans can learn
- Unteachable: Too complex or ill-defined
Building Your Project Atlas
Step 1: Inventory
List all concepts you've encountered in Weeks 1-10:
- From Week 5 (circuits): Which operations did you find?
- From Week 6 (probes): What information is encoded where?
- From Week 7 (SAEs): What features did the autoencoder discover?
- From Week 9 (attribution): Which inputs matter for your concept?
Step 2: Characterize
For each concept, assess:
- Abstraction level
- Domain specificity
- Novelty (is it in your training data?)
- Teachability (could a human learn it?)
Step 3: Map Relationships
- Prerequisite: Does concept A require understanding concept B?
- Composition: Is concept C = A + B?
- Interference: Do A and B conflict?
Step 4: Assess Coverage
- What fraction of model behavior do your concepts explain?
- Where are the gaps?
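One lightweight way to record atlas entries is a small data structure covering the dimensions above; the field names and the example entry below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Abstraction(Enum):
    LOW = "low"        # edges, colors, individual words
    MID = "mid"        # object parts, syntactic roles, local patterns
    HIGH = "high"      # objects, semantic concepts, strategic plans

@dataclass
class ConceptEntry:
    """One row of a project concept atlas; field names are illustrative."""
    name: str
    abstraction: Abstraction
    domain_specific: bool
    novel: bool                                          # absent from human/training data?
    teachable: bool                                      # did a student model or human learn it?
    evidence: list[str] = field(default_factory=list)    # e.g. "layer-12 probe, acc 0.91"
    prerequisites: list[str] = field(default_factory=list)
    composed_of: list[str] = field(default_factory=list)

# Hypothetical example entry
atlas = [
    ConceptEntry("quiet-move-provokes-weakness", Abstraction.HIGH,
                 domain_specific=True, novel=True, teachable=True,
                 evidence=["concept vector at a late layer", "student accuracy +15%"]),
]
print(atlas[0].name, atlas[0].novel, atlas[0].teachable)
```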
7. Applying to Your Research Project
Validating Your Interpretability Claims
By Week 11, you should have discovered something about how your model represents your concept. Now validate it:
Checklist for Rigorous Validation:
✓ Sanity Checks (Week 10)
- Did you run model/data randomization tests?
- Is your interpretation robust to methodological choices?
✓ Multiple Methods (Weeks 5-9)
- Did circuits, probes, SAEs, and attribution agree?
- If not, why? (Disagreement can be informative)
✓ Causal Validation (Week 8)
- Did IIA confirm that your identified components are causal?
- Can you intervene to change the concept?
✓ Novelty Assessment (Schut)
- Is your concept in the training data?
- If yes: rediscovery (still valid!)
- If no: potential superhuman insight
✓ Teachability Assessment (Schut)
- Can you explain the concept to a colleague?
- Can they use it to improve their understanding?
- If you trained a simple model on concept examples, does it generalize?
✓ Human Study (if feasible)
- Even a small study (n=3-5) is valuable
- Pre-post design or comparison with baseline
- Measure performance, not just trust
Knowledge Transfer Potential
Ask: Could a domain expert learn from your finding?
Example: Musical Key Representation
Finding: LLM uses specific attention heads to track musical key.
Novelty check: Is key tracking in music theory textbooks? (Yes → not novel)
Teachability check: Could a musician learn which contexts activate key tracking? (Possibly)
Knowledge transfer value: Moderate—validates human understanding but doesn't extend it.
Example: Protein Stability Representations
Finding: LLM predicts protein mutations that improve stability using a novel attention pattern.
Novelty check: Pattern not present in biochemistry literature? (Novel)
Teachability check: Can a biologist learn to recognize this pattern? (Test needed)
Knowledge transfer value: High—could guide wet-lab experiments.
8. Limitations and Open Questions
Limitations of Current Approaches
From Shin et al.:
- Can't identify which concepts humans learned (black box improvement)
- Correlation, not causation (did AlphaGo cause the improvement or just coincide?)
- Domain-specific (Go); does the effect generalize to other domains?
From Schut et al.:
- Linear concept assumption (CAVs are linear directions)
- Small sample (4 grandmasters, but they're world champions)
- Chess-specific; is the method tractable for more complex, less structured domains?
- Doesn't extract all concepts, just some filtered subset
Open Research Questions
- Completeness: How do we know we've found all the important concepts?
- Compositionality: How do simple concepts combine into complex strategies?
- Transfer across models: Are concepts universal or model-specific?
- Long-term retention: Do humans retain AI-taught concepts over months/years?
- Misalignment detection: What if AI discovers harmful concepts?
- Scaling: Can these methods work for domains more complex than chess/Go?
9. Best Practices for Your Paper
Making Rigorous Claims
✓ DO:
- Use multiple validation methods (circuits + probes + SAEs + IIA)
- Report which concepts were teachable and which weren't
- Assess novelty (compare with training data or domain literature)
- Design human studies, even small ones
- Show when your interpretation fails (edge cases)
- Compare your concepts to known domain knowledge
✗ DON'T:
- Claim "superhuman" insight without novelty evidence
- Cherry-pick examples that support your interpretation
- Rely on single method (e.g., only probes)
- Assume teachability without testing
- Over-generalize from one model or task
- Ignore Week 10's skepticism lessons
10. Summary and Integration
Key Takeaways
- The knowledge gap is real: AI systems learn beyond human knowledge (Shin: novelty improves
humans; Schut: extractable novel concepts)
- Two pathways: Passive (observation) and active (explicit teaching) knowledge transfer
- Novelty + teachability: Framework for filtering valuable concepts
- Human studies matter: Validate interpretability claims with human performance
- Concept atlases: Organize findings systematically
- Integration: Combine methods from Weeks 1-10 for rigorous validation
Your Research Workflow
- Weeks 1-9: Discover concepts using multiple methods
- Week 10: Apply skepticism—run sanity checks
- Week 11:
- Assess novelty (is it in training data?)
- Assess teachability (can humans/students learn it?)
- Design human study (even small)
- Build concept atlas (organize findings)
- Validate with multiple methods
- Week 12: Present findings with rigorous evidence
Looking Ahead
The field of interpretability is moving toward actionable insights—not just understanding what
models do, but using that understanding to improve human knowledge, align AI systems, and enable human-AI
collaboration.
Your project contributes to this by characterizing how LLMs represent non-CS concepts—a crucial step toward AI
systems that can truly collaborate with domain experts across all fields.
References & Resources
Core Papers
- Shin et al. (2023): "Superhuman Artificial Intelligence Can Improve Human Decision Making by
Increasing Novelty." PNAS, 120(12), e2214840120.
- Schut et al. (2023/2025): "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer
in AlphaZero." PNAS, 122, e2406675122. arXiv:2310.16410
Related Work
- AlphaGo: Silver et al. (2016). "Mastering the game of Go with deep neural networks and tree
search." Nature, 529, 484-489.
- AlphaZero: Silver et al. (2018). "A general reinforcement learning algorithm that masters
chess, shogi, and Go through self-play." Science, 362(6419), 1140-1144.
- TCAV: Kim et al. (2018). "Interpretability Beyond Feature Attribution: Quantitative Testing
with Concept Activation Vectors." ICML.
- Doshi-Velez & Kim (2017): "Towards A Rigorous Science of Interpretable Machine Learning." arXiv:1702.08608
Supplementary Reading
- Atanasova et al. (2020). "A Diagnostic Study of Explainability Techniques for Text Classification."
EMNLP.
- Bansal et al. (2021). "Does the Whole Exceed its Parts?" CHI.
- AlphaStar case study: DeepMind blog on discovering novel StarCraft strategies
- Jacovi & Goldberg (2020). "Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?"
Part 2: Validation Framework for Interpretability
The Need for Rigorous Validation
By now you have applied many interpretability methods: causal tracing, probing, attribution, circuit discovery.
But how do we know our interpretations are actually correct? This section establishes a comprehensive validation
framework to ensure your findings are robust and faithful.
1. Levels of Validation (Doshi-Velez & Kim, 2017)
Doshi-Velez and Kim (2017) propose three levels of evaluation for interpretability methods:
Application-Grounded Evaluation
Test interpretations with real users performing real tasks in the actual application domain.
- Example: Have doctors use model explanations to make diagnoses, measure accuracy improvement
- Pros: Highest ecological validity, tests real-world utility
- Cons: Expensive, requires domain experts, slow iteration
- When to use: Final validation before deployment, high-stakes applications
Human-Grounded Evaluation
Test with lay users on simplified tasks that capture the essence of the real application.
- Example: Show users attention visualizations, ask if they make sense
- Pros: Faster than application-grounded, can use larger samples
- Cons: May not generalize to real application, subjective judgments
- When to use: Iterative development, when domain experts are unavailable
Functionally-Grounded Evaluation
Test interpretations using quantitative metrics without human subjects.
- Example: Measure if high-importance features (by attribution) actually change outputs when intervened on
- Pros: Fast, reproducible, scalable, no human subjects needed
- Cons: May miss aspects that matter to humans
- When to use: Initial validation, debugging, comparing methods
2. Faithfulness: The Core Requirement (Jacovi & Goldberg, 2020)
Jacovi and Goldberg (2020) argue that faithfulness is the fundamental property interpretations must satisfy:
Faithfulness: An explanation is faithful if it accurately represents the model's true decision-making process.
Why Faithfulness Matters
An explanation can be:
- Plausible but unfaithful: Makes intuitive sense but does not reflect what the model actually does
- Faithful but implausible: Accurately describes the model but is hard to understand
Goal: Faithful AND plausible explanations. But when forced to choose, faithfulness must come first.
Testing Faithfulness
The causal intervention methods you learned in Week 5 are key faithfulness tests:
- Forward simulation: If explanation says X causes Y, intervening on X should change Y
- Backward verification: If explanation highlights component C, ablating C should break the behavior
- Sufficiency test: Patching only the explained components should recover the behavior
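A minimal sketch of the sufficiency test on a toy PyTorch network: record the activations of the units your interpretation singles out on a "clean" input, patch only those units into a run on a different input, and measure how much of the clean output is recovered. The model, inputs, and unit indices are all placeholders, not a specific experimental setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for the model under study.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
explained_units = [3, 7]               # hidden units the interpretation claims do the work

clean = torch.randn(1, 8)
corrupted = torch.randn(1, 8)

def hidden(x):
    """Activations after the first layer (the site of the claimed mechanism)."""
    return torch.relu(model[0](x))

def run_with_patch(x, units, donor_hidden):
    """Run x, but splice in the donor's activations at the explained units only."""
    h = hidden(x).clone()
    h[:, units] = donor_hidden[:, units]
    return model[2](h)

with torch.no_grad():
    h_clean = hidden(clean)
    out_clean = model(clean)
    out_corrupted = model(corrupted)
    out_patched = run_with_patch(corrupted, explained_units, h_clean)

# Sufficiency: patching only the explained units should move the corrupted output
# most of the way toward the clean output.
recovered = (out_patched - out_corrupted) / (out_clean - out_corrupted + 1e-8)
print("fraction of clean behavior recovered:", float(recovered))
```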
3. Multi-Method Validation
No single interpretability method is perfect. Robust findings require convergent evidence from multiple independent methods.
Method Triangulation
| Method Category | Example Methods | What It Tests |
|---|---|---|
| Causal Intervention | Patching, ablation, steering | Which components are necessary/sufficient |
| Attribution | Gradients, IG, attention | Which inputs are important |
| Probing | Linear probes, logistic regression | What information is encoded |
| Feature Discovery | SAEs, dictionary learning | What features are represented |
| Behavioral Testing | Adversarial examples, edge cases | When the interpretation fails |
Validation strategy: Use at least 3 independent methods. If they agree, confidence increases. If they disagree, investigate why.
4. Sanity Checks: Catching Illusions
Before trusting any interpretation, run basic sanity checks.
Model Randomization Test
Test: Does your interpretation change when applied to a randomly initialized model?
interpretation_trained = analyze(trained_model)
interpretation_random = analyze(random_model)
assert interpretation_trained != interpretation_random
Why it matters: If interpretations look the same for trained and random models, your method might just be detecting network architecture, not learned behavior.
Data Randomization Test
Test: Train a model on data with random labels. Does your interpretation change?
- A model trained on random labels should have different (probably nonsensical) internal mechanisms
- If interpretation looks the same, it is not capturing what the model learned
Ablation Completeness Test
Test: If you ablate all "important" components identified by your method, does the behavior break?
important_components = find_important_components(model)
ablated_performance = test(model, ablate=important_components)
assert ablated_performance < baseline_performance  # should drop sharply, not marginally
Sign Test (for attribution methods)
Test: Do positive attributions actually help the predicted class?
- Remove features with high positive attribution: performance should drop
- Remove features with high negative attribution: performance might improve
- If removing "important" features does not change output, attribution is unfaithful
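A minimal sketch of the sign test for a linear scorer, where gradient-times-input attribution is exact; for a real model you would ablate tokens or features and re-run the forward pass. Everything here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear scorer standing in for the model; for a linear model,
# gradient-times-input attribution is exactly weight * input.
w = rng.normal(size=20)
x = rng.normal(size=20)
score = float(w @ x)
attribution = w * x

# Remove (zero out) the 5 features with the largest positive attribution.
top_positive = np.argsort(attribution)[-5:]
x_removed = x.copy()
x_removed[top_positive] = 0.0
score_removed = float(w @ x_removed)

print(f"score: {score:.3f} -> after removing top-positive features: {score_removed:.3f}")
# If the score does not drop, the attribution is not faithful to this model.
```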
5. Baseline Comparisons
Always compare your interpretability findings against appropriate baselines:
Random Baseline
- For component importance: Compare AIE of identified components vs. random components
- For attribution: Compare removal of high-attribution features vs. random features
- Minimum bar: Your method must beat random
Frequency Baseline
- For text: High-attribution words might just be common words
- Test: Does attribution correlate with word frequency? If yes, consider TF-IDF weighting
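A quick way to run this check is to correlate per-word attribution scores with corpus frequency; the data below is synthetic, with the correlation deliberately built in for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)

# Hypothetical per-word scores: attribution magnitude and corpus frequency.
attribution = rng.random(200)
frequency = 0.7 * attribution + 0.3 * rng.random(200)

rho, p = spearmanr(attribution, frequency)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
# A strong correlation suggests the attributions partly track word frequency;
# re-check the finding with TF-IDF weighting or frequency-matched controls.
```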
6. Counterfactual Testing
The strongest validation: predict what will happen, then intervene and verify.
The Counterfactual Validation Loop
- Interpret: "Component C computes function F for concept X"
- Predict: "If I intervene on C, behavior related to X should change in way Y"
- Intervene: Actually modify C (patch, ablate, steer)
- Measure: Did Y happen?
- Iterate: If not, revise interpretation
7. Validation Checklist for Your Research
Before claiming you have discovered something about how a model works, verify:
Validation Checklist
- Sanity checks passed:
- Random model gives different interpretation
- Random labels give different interpretation
- Ablating "important" components breaks behavior
- Causal validation:
- Intervention on identified components changes behavior as predicted
- Effect size is substantial (not just statistically significant)
- Results replicate across multiple examples
- Multi-method agreement:
- At least 3 independent methods point to same components/features
- If methods disagree, you understand why
- Baseline comparisons:
- Performance beats random baseline
- Results are not explained by simple heuristics (frequency, position)
- Generalization:
- Findings hold on held-out test set
- Findings transfer to related tasks/prompts
- You have characterized when the interpretation fails
- Negative results reported:
- You have documented what does not work
- You have shown edge cases where interpretation breaks
8. Common Pitfalls
Watch out for these validation failures:
- Confirmation bias: Finding what you expect to find, ignoring contradictory evidence
- Cherry-picking: Showing only examples that support your interpretation
- P-hacking: Testing many hypotheses, reporting only significant ones without correction
- Overfitting: Interpretation works on training data but not test data
- Confounds: Attributed importance is actually due to correlated features
- Visualization artifacts: patterns in a visualization that do not reflect the model's actual computation
The solution: Rigorous application of this validation framework throughout your research.
In-Class Exercise: Decoding Pun Representations with Patchscopes
In this final pun exercise, we use Patchscopes—the model's own language generation—to decode and
describe what information is encoded in pun representations. This helps us understand what the model
"sees" when processing humor.
Part 1: Patchscopes Setup (15 min)
Understand the Patchscopes technique:
- Review the method:
- Patchscopes "patches" a hidden state from one context into a different prompt
- The model then generates text describing that hidden state
- Example: patch the representation of "Time flies like an arrow" into a prompt like "The following text is about: "
- Select examples:
- Choose 5 puns where your probe shows high pun-recognition
- Choose 5 similar non-puns for comparison
- Design prompts: Create 2-3 different "decoder" prompts for Patchscopes
- "This sentence is: "
- "The hidden meaning is: "
- "This is funny because: "
Part 2: Decoding Pun Representations (25 min)
Apply Patchscopes to understand what the model encodes about puns (a minimal code sketch follows this list):
- Extract representations:
- Run each pun/non-pun through the model
- Extract the hidden state at the punchline position
- Focus on the layer with best pun probe accuracy (from Week 6)
- Apply Patchscopes:
- Patch the hidden state into each decoder prompt
- Generate 3-5 tokens of completion
- Record the model's "interpretation" of each representation
- Compare pun vs non-pun:
- Do pun representations produce different descriptions than non-puns?
- Does the model mention humor, wordplay, or double meanings?
- Are the descriptions meaningful or random?
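Below is a minimal sketch of the patching step using Hugging Face transformers with GPT-2 as a stand-in; the model, layer index, and prompts are placeholders to adapt to your course model and NDIF setup, and hook-based patching is only one of several ways to implement Patchscopes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                          # small stand-in; substitute your course model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6                                    # e.g. the layer with the best pun-probe accuracy

# 1) Source pass: grab the hidden state at the punchline (last) token.
source = "Time flies like an arrow; fruit flies like a banana"
src = tok(source, return_tensors="pt")
with torch.no_grad():
    out = model(**src, output_hidden_states=True)
h_src = out.hidden_states[LAYER][0, -1]      # residual stream after block LAYER-1

# 2) Target pass: patch h_src into the last position of a decoder prompt.
target = "The following text is about:"
tgt = tok(target, return_tensors="pt")
patch_pos = tgt["input_ids"].shape[1] - 1

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > 1:                  # only on the prompt (prefill) pass
        hidden[:, patch_pos] = h_src
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(patch_hook)
try:
    with torch.no_grad():
        gen = model.generate(**tgt, max_new_tokens=8, do_sample=False)
finally:
    handle.remove()

print(tok.decode(gen[0][tgt["input_ids"].shape[1]:]))   # the model's "description"
```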
Part 3: Probing Specific Features (20 min)
Test whether the model can articulate specific aspects of puns:
- Test the double meaning:
- For a pun like "Time flies like an arrow; fruit flies like a banana"
- Use prompt: "The word 'flies' here means: "
- Does the patched representation produce both meanings?
- Test humor awareness:
- Use prompt: "This sentence is [humorous/serious]: "
- Does the model correctly classify puns as humorous?
- Compare layers:
- Try Patchscopes at early, middle, and late layers
- At which layer does the model best understand the pun?
- Does this match your causal tracing results from Week 5?
Discussion: Can the model articulate its own understanding of puns?
What does this tell us about whether humor understanding is explicit or implicit in the model?
Open Neologism Training Notebook in Colab
Note: Requires NDIF access for session-based training.
Project Milestone
Due: Thursday of Week 11
Design and conduct a human validation study to test whether your interpretability findings
help humans understand or predict model behavior related to your concept.
Human Validation Study
- Design study:
- What is your research question? (e.g., "Do explanations help humans predict model errors?")
- What will you show participants? (explanations, examples, circuit diagrams)
- What will you ask them to do? (predict behavior, identify errors, categorize examples)
- How will you measure understanding? (accuracy, confidence, response time)
- Create materials:
- Control condition: humans with no explanation
- Experimental condition: humans with your interpretability findings
- Test cases: mix of typical and edge cases
- Instructions and training examples
- Recruit participants:
- Target: 10-20 participants minimum
- Consider expertise level: domain experts vs novices
- Randomize assignment to conditions
- Run study and analyze:
- Collect responses
- Compare performance: explanation vs no-explanation
- Statistical significance testing (see the sketch after this list)
- Qualitative feedback: what was helpful? confusing?
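A minimal sketch of the condition comparison, with hypothetical accuracy data and a nonparametric test; swap in your actual per-participant scores and whatever analysis you planned in the design document.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-participant accuracies (fraction of test cases answered correctly).
control = np.array([0.55, 0.60, 0.50, 0.65, 0.58, 0.62, 0.57, 0.52, 0.61, 0.59])
explained = np.array([0.70, 0.66, 0.72, 0.68, 0.75, 0.64, 0.71, 0.69, 0.73, 0.67])

stat, p = mannwhitneyu(explained, control, alternative="greater")
gain = explained.mean() - control.mean()
print(f"mean accuracy gain: {gain:.2f}, one-sided Mann-Whitney U p = {p:.3g}")
# Pair the test with confidence intervals and the qualitative feedback; a single
# p-value from 10 participants per condition is weak evidence on its own.
```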
Deliverables:
- Study design document:
- Research question and hypotheses
- Methods: participants, materials, procedure
- Planned analyses
- Results:
- Quantitative: performance comparison with statistical tests
- Qualitative: participant feedback and observations
- Visualization: accuracy, confidence, or other metrics by condition
- Interpretation:
- Do your explanations help humans?
- What aspects are most/least helpful?
- What does this reveal about the quality of your interpretability work?
- Materials: Study materials, data, and analysis code
The ultimate test of interpretability: do your findings help humans understand the model?
Even small, well-designed studies can provide valuable validation.