Week 10: Human Understanding & Self-Description

Learning Objectives

Required Readings

Supplementary Readings

1. The Human-AI Knowledge Gap

What Is the Knowledge Gap?

Modern AI systems achieve superhuman performance on many tasks: AlphaZero plays chess far beyond the level of any human champion, AlphaGo defeated the world's best Go players, and large language models process text at scales no human can match.

But a fundamental question remains: Do these AI systems know something we don't?

The Knowledge Gap Problem:

AI systems learn from vast amounts of self-play or data, exploring strategy spaces far beyond human experience. They may discover novel concepts, strategies, or patterns that humans haven't recognized. If these discoveries remain hidden inside neural networks, we lose the opportunity to:

Two Modes of Learning from Superhuman AI

This week explores two complementary approaches to bridging the knowledge gap:

| Aspect | Passive Learning (Shin et al.) | Active Teaching (Schut et al.) |
| --- | --- | --- |
| Mechanism | Humans observe AI play and spontaneously adopt novel strategies | Explicitly extract concepts from AI, filter, and teach humans |
| Scale | Population-level (all players improve) | Individual-level (train specific experts) |
| Evidence | 5.8M Go moves over 71 years | 4 grandmasters, controlled study |
| Control | Observational (AI exists in environment) | Interventional (deliberate concept transfer) |
| Novelty role | Novel moves correlate with improvement | Novelty is a filter criterion |

2. Passive Learning: The Go Revolution (Shin et al., 2023)

Paper: "Superhuman Artificial Intelligence Can Improve Human Decision Making by Increasing Novelty"

Authors: Minkyu Shin, Jin Kim, Bas van Opheusden, Tom Griffiths

Published: PNAS, 2023

Key contribution: Quantitative evidence that superhuman AI improves human decision-making at scale

The Natural Experiment

In 2016, AlphaGo defeated Lee Sedol, one of the strongest Go players in the world. This marked a turning point: AI became superhuman in a game with more legal board positions than there are atoms in the observable universe.

Shin et al. asked: Did human players improve after witnessing superhuman AI?

Methodology

Data:

Evaluation:

Novelty metric:

Key Findings

1. Humans improved significantly after AlphaGo

Human decision quality increased measurably after 2016. Players made moves closer to what superhuman AI would choose.

2. Novel moves drove improvement

3. Breaking from tradition works

Superhuman AI prompted players to explore beyond traditional strategies. This wasn't just imitation—players discovered their own novel approaches, inspired by AI's creativity.

Example: Move 37

The most famous move in AI history:

In Game 2 of the AlphaGo vs Lee Sedol match, AlphaGo played Move 37—placing a stone on the 5th line from the edge. Human commentators were shocked: "I thought it was a mistake."

This move violated centuries of Go wisdom. Yet it was exceptionally strong. Move 37 demonstrated that AI had explored beyond human knowledge.

After witnessing this, professional players began experimenting with similar unconventional moves, many of which proved effective.

Implications

3. Active Teaching: Extracting Chess Concepts (Schut et al., 2023/2025)

Paper: "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero"

Authors: Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim

Published: arXiv 2023, PNAS 2025

Key contribution: Systematic method to extract, filter, and teach novel concepts from superhuman AI

The Challenge

Unlike Shin et al.'s observational study, Schut et al. sought to actively extract knowledge from AlphaZero and transfer it to human experts through deliberate training.

Key questions:

Methodology: Four-Stage Pipeline

Stage 1: Concept Discovery

Use convex optimization to find concept vectors in AlphaZero's latent space.

How it works:

  1. Collect chess positions where a concept might be relevant
  2. Extract AlphaZero's internal representations (activations) for these positions
  3. Find a direction in activation space that separates concept-present from concept-absent positions
  4. This direction is the Concept Activation Vector (CAV)

Technical detail:

Train a linear classifier to distinguish concept examples from random counterexamples. The vector orthogonal to the decision boundary is the CAV.
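
A minimal sketch of this step, assuming you have already collected activation matrices from one layer (the array names and shapes are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Fit a linear classifier and return the unit-norm concept direction."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]                      # normal to the decision boundary
    return cav / np.linalg.norm(cav)

def concept_score(activation: np.ndarray, cav: np.ndarray) -> float:
    """How strongly a single position's activation expresses the concept."""
    return float(activation @ cav)
```

Schut et al. formulate the discovery step as a more constrained convex optimization problem, but the output plays the same role: a direction in activation space that separates concept-present from concept-absent positions.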

Stage 2: Filtering by Teachability

Not all extracted concepts are learnable by humans or other agents. Teachability test:

Definition: A concept is teachable if another AI agent (student) can learn it from examples.

Procedure:

  1. Generate prototype positions that strongly activate the concept vector
  2. Train a student AI on these prototypes
  3. Test if the student can generalize the concept to new positions
  4. Measure improvement: Does the student now select the same moves as AlphaZero?

Why this matters: If even an AI student can't learn the concept, a human likely can't either.
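
The logic of the check can be summarized in code. Everything below is a hypothetical interface, not the paper's API: `student.best_move`, `student.finetune`, and the `min_gain` threshold are illustrative.

```python
def move_agreement(student, teacher, positions) -> float:
    """Fraction of positions where the student selects the teacher's move."""
    matches = sum(student.best_move(p) == teacher.best_move(p) for p in positions)
    return matches / len(positions)

def is_teachable(student, teacher, prototypes, held_out, min_gain=0.05) -> bool:
    before = move_agreement(student, teacher, held_out)
    student.finetune(prototypes)            # train the student on prototype positions
    after = move_agreement(student, teacher, held_out)
    return (after - before) >= min_gain     # teachable = agreement improves on unseen positions
```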

Stage 3: Filtering by Novelty

A teachable concept might still be something humans already know. Novelty test:

Definition: A concept is novel if it's not present in human chess games.

Procedure:

  1. Collect a large database of human chess games
  2. Test if the concept vector activates on human games
  3. Low activation → concept is novel to humans
  4. High activation → concept already exists in human play

Why this matters: We want to teach humans something new, not rediscover known principles.
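
Continuing the earlier sketch (reusing `compute_cav`), a rough novelty filter might look like this; `human_acts` is a hypothetical array of activations for positions sampled from human games, and the threshold is illustrative:

```python
import numpy as np

def novelty_score(human_acts: np.ndarray, cav: np.ndarray, threshold: float = 0.0) -> float:
    """1.0 = the concept essentially never fires on human play; 0.0 = it always does."""
    scores = human_acts @ cav               # concept activation for each human-game position
    return 1.0 - float(np.mean(scores > threshold))
```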

Stage 4: Human Validation Study

The ultimate test: Can world chess champions learn these concepts?

Study design:

Participants: 4 top chess grandmasters (all former or current world champions)

Procedure:

  1. Pre-test: Grandmasters solve concept prototype positions (no instruction)
  2. Learning phase: Grandmasters study the positions with explanations
  3. Post-test: Grandmasters solve new positions involving the same concepts

Measurement: Improvement from pre-test to post-test

Key Findings

All four grandmasters improved after the learning phase.

This demonstrates that:

Types of concepts discovered:

Example Concept

Concept: "Provoke Weakness Through Quiet Moves"

Classical chess principle: Active, forcing moves (checks, captures) are strong.

AlphaZero's insight: Sometimes a quiet, non-forcing move creates zugzwang—opponent must worsen their position.

Why it's novel: Human games rarely show this pattern (low activation on human database).

Why it's teachable: Student AI improved 15% on test positions after training.

Human validation: Grandmasters improved from 40% to 65% accuracy on these positions after studying examples.

4. Comparing the Two Approaches

| Dimension | Shin et al. (Passive) | Schut et al. (Active) |
| --- | --- | --- |
| Research Question | Do humans improve from AI exposure? | Can we extract and teach AI concepts? |
| Scale | Population (thousands of players) | Individual (4 grandmasters) |
| Evidence Type | Observational (71 years) | Experimental (controlled study) |
| Novelty Role | Outcome (novel moves = better) | Filter (select novel concepts) |
| Teaching Method | Implicit (watch AI play) | Explicit (study concept prototypes) |
| Interpretability | Black box (don't know what changed) | White box (extract specific concepts) |
| Measurement | Move quality via AI evaluation | Accuracy on concept test positions |
| Strength | Ecological validity (real-world effect) | Causal clarity (know what's taught) |

Complementary Insights

Together, these papers show that superhuman AI can improve humans through:

Both require novelty—humans learn by going beyond traditional knowledge.

5. Designing Human Studies for Interpretability

How do you validate that your interpretability findings are meaningful? Human studies provide the gold standard.

Three Types of Evaluation (Doshi-Velez & Kim, 2017)

| Type | When to Use | Example |
| --- | --- | --- |
| Application-Grounded | Real-world task with domain experts | Schut's grandmaster study |
| Human-Grounded | Simplified task with lay users | Show concept examples, ask "Does this make sense?" |
| Functionally-Grounded | No humans, use proxy metrics | Teachability test (AI student learns) |

Key Considerations for Study Design

1. Who are your participants?

2. What do you measure?

3. Control conditions?

4. Avoiding pitfalls (from Week 10):

6. Building a Concept Atlas

As you discover concepts in your project, you need to organize them systematically. A concept atlas maps the landscape of what your model knows.

Why Build an Atlas?

Organizing Dimensions

1. Abstraction Level

2. Domain Specificity

3. Complexity

4. Novelty (from Schut)

5. Teachability (from Schut)

Building Your Project Atlas

Step 1: Inventory

List all concepts you've encountered in Weeks 1-10:

Step 2: Characterize

For each concept, assess:

Step 3: Map Relationships

Step 4: Assess Coverage

7. Applying to Your Research Project

Validating Your Interpretability Claims

By Week 11, you should have discovered something about how your model represents your concept. Now validate it:

Checklist for Rigorous Validation:

✓ Sanity Checks (Week 10)

✓ Multiple Methods (Weeks 5-9)

✓ Causal Validation (Week 8)

✓ Novelty Assessment (Schut)

✓ Teachability Assessment (Schut)

✓ Human Study (if feasible)

Knowledge Transfer Potential

Ask: Could a domain expert learn from your finding?

Example: Musical Key Representation

Finding: LLM uses specific attention heads to track musical key.

Novelty check: Is key tracking in music theory textbooks? (Yes → not novel)

Teachability check: Could a musician learn which contexts activate key tracking? (Possibly)

Knowledge transfer value: Moderate—validates human understanding but doesn't extend it.


Example: Protein Stability Representations

Finding: LLM predicts protein mutations that improve stability using a novel attention pattern.

Novelty check: Is the pattern described in the biochemistry literature? (No → novel)

Teachability check: Can a biologist learn to recognize this pattern? (Test needed)

Knowledge transfer value: High—could guide wet-lab experiments.

8. Limitations and Open Questions

Limitations of Current Approaches

From Shin et al.:

From Schut et al.:

Open Research Questions

9. Best Practices for Your Paper

Making Rigorous Claims

✓ DO:

✗ DON'T:

10. Summary and Integration

Key Takeaways

Your Research Workflow

  1. Weeks 1-9: Discover concepts using multiple methods
  2. Week 10: Apply skepticism—run sanity checks
  3. Week 11: Validate findings with a human study (project milestone)
  4. Week 12: Present findings with rigorous evidence

Looking Ahead

The field of interpretability is moving toward actionable insights—not just understanding what models do, but using that understanding to improve human knowledge, align AI systems, and enable human-AI collaboration.

Your project contributes to this by characterizing how LLMs represent non-CS concepts—a crucial step toward AI systems that can truly collaborate with domain experts across all fields.

References & Resources

Core Papers

Related Work

Supplementary Reading

Part 2: Validation Framework for Interpretability

The Need for Rigorous Validation

By now you have applied many interpretability methods: causal tracing, probing, attribution, circuit discovery. But how do we know our interpretations are actually correct? This section establishes a comprehensive validation framework to ensure your findings are robust and faithful.

1. Levels of Validation (Doshi-Velez & Kim, 2017)

Doshi-Velez and Kim (2017) propose three levels of evaluation for interpretability methods:

Application-Grounded Evaluation

Test interpretations with real users performing real tasks in the actual application domain.

Human-Grounded Evaluation

Test with lay users on simplified tasks that capture the essence of the real application.

Functionally-Grounded Evaluation

Test interpretations using quantitative metrics without human subjects.

2. Faithfulness: The Core Requirement (Jacovi & Goldberg, 2020)

Jacovi and Goldberg (2020) argue that faithfulness is the fundamental property interpretations must satisfy:

Faithfulness: An explanation is faithful if it accurately represents the model's true decision-making process.

Why Faithfulness Matters

An explanation can be plausible (convincing to a human) without being faithful to the model's actual computation, or faithful without being plausible.

Goal: Faithful AND plausible explanations. But when forced to choose, faithfulness must come first.

Testing Faithfulness

The causal intervention methods you learned in Week 5 are key faithfulness tests:

  1. Forward simulation: If explanation says X causes Y, intervening on X should change Y
  2. Backward verification: If explanation highlights component C, ablating C should break the behavior
  3. Sufficiency test: Patching only the explained components should recover the behavior
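
As a concrete illustration of test 2, here is a minimal zero-ablation check using a GPT-2-style HuggingFace model; the model, layer, prompt, and target token are illustrative stand-ins for whatever component and behavior your explanation actually names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 7                                   # suppose your explanation credits this layer's MLP
prompt = "The capital of France is"
target_id = tok(" Paris")["input_ids"][0]
ids = tok(prompt, return_tensors="pt")

def target_logit() -> float:
    with torch.no_grad():
        return model(**ids).logits[0, -1, target_id].item()

before = target_logit()

# Backward verification: zero out the explained component and re-measure.
hook = model.transformer.h[LAYER].mlp.register_forward_hook(
    lambda module, inputs, output: torch.zeros_like(output)
)
after = target_logit()
hook.remove()

print(f"target logit before: {before:.2f}, after ablation: {after:.2f}")
# A faithful explanation predicts a clear drop; little change is evidence against it.
```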

3. Multi-Method Validation

No single interpretability method is perfect. Robust findings require convergent evidence from multiple independent methods.

Method Triangulation

| Method Category | Example Methods | What It Tests |
| --- | --- | --- |
| Causal Intervention | Patching, ablation, steering | Which components are necessary/sufficient |
| Attribution | Gradients, IG, attention | Which inputs are important |
| Probing | Linear probes, logistic regression | What information is encoded |
| Feature Discovery | SAEs, dictionary learning | What features are represented |
| Behavioral Testing | Adversarial examples, edge cases | When does the interpretation fail |

Validation strategy: Use at least 3 independent methods. If they agree, confidence increases. If they disagree, investigate why.

4. Sanity Checks: Catching Illusions

Before trusting any interpretation, run basic sanity checks.

Model Randomization Test

Test: Does your interpretation change when applied to a randomly initialized model?

# `analyze` stands for your full interpretability pipeline, applied unchanged
# to a trained model and to a randomly initialized copy of the same architecture.
interpretation_trained = analyze(trained_model)
interpretation_random = analyze(random_model)

# The two interpretations should differ substantially.
assert interpretation_trained != interpretation_random

Why it matters: If interpretations look the same for trained and random models, your method might just be detecting network architecture, not learned behavior.
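
One way to make the `!=` check above concrete is to compare the two interpretations quantitatively, for example by rank-correlating attribution maps. The sketch below assumes `attr_trained` and `attr_random` are same-shape arrays produced by your attribution method on identical inputs (hypothetical names):

```python
import numpy as np
from scipy.stats import spearmanr

def interpretation_similarity(attr_a: np.ndarray, attr_b: np.ndarray) -> float:
    """Spearman rank correlation between two attribution maps."""
    rho, _ = spearmanr(attr_a.ravel(), attr_b.ravel())
    return float(rho)

# Example usage (attr_trained / attr_random computed elsewhere):
# rho = interpretation_similarity(attr_trained, attr_random)
# A rho near 1 suggests the method reflects architecture or input statistics,
# not what the model learned.
```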

Data Randomization Test

Test: Train a model on data with random labels. Does your interpretation change?

Ablation Completeness Test

Test: If you ablate all "important" components identified by your method, does the behavior break?

# Ablate every component your method flags as important, then re-test.
important_components = find_important_components(model)
ablated_performance = test(model, ablate=important_components)

# Require a large drop, not a marginal one (0.5 is an illustrative threshold).
assert ablated_performance < 0.5 * baseline_performance

Sign Test (for attribution methods)

Test: Do positive attributions actually help the predicted class?
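
A hedged sketch of one such check: deleting the tokens your method marks as positive evidence should lower the predicted-class score. `score_fn` and `attributions` are hypothetical names here: `score_fn` maps a token-id list to the predicted-class probability, and `attributions` is aligned with those ids.

```python
import numpy as np

def sign_test(score_fn, token_ids, attributions, k: int = 3) -> float:
    """Score drop from removing the k most positively attributed tokens."""
    base = score_fn(token_ids)
    top_pos = set(np.argsort(attributions)[-k:])           # indices of most positive tokens
    reduced = [t for i, t in enumerate(token_ids) if i not in top_pos]
    return base - score_fn(reduced)                        # should be clearly positive
```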

5. Baseline Comparisons

Always compare your interpretability findings against appropriate baselines:

Random Baseline

Frequency Baseline
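
Both baselines can be computed in a few lines. Here is a minimal sketch for a probing experiment, where `X` and `y` are the probe's features and labels; the probe choice and the 5-fold cross-validation are illustrative:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def baseline_report(X, y, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)
    probe = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    frequency = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
    shuffled = cross_val_score(LogisticRegression(max_iter=1000), X, rng.permutation(y), cv=5).mean()
    return {"probe": probe, "frequency_baseline": frequency, "shuffled_label_baseline": shuffled}

# The probe result only means something if it clearly beats both baselines.
```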

6. Counterfactual Testing

The strongest validation: predict what will happen, then intervene and verify.

The Counterfactual Validation Loop

  1. Interpret: "Component C computes function F for concept X"
  2. Predict: "If I intervene on C, behavior related to X should change in way Y"
  3. Intervene: Actually modify C (patch, ablate, steer)
  4. Measure: Did Y happen?
  5. Iterate: If not, revise interpretation
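
A hedged sketch of one pass through this loop, using a steering-style intervention on a GPT-2-style model. The layer, scale, prompt, target token, and the random stand-in direction are all illustrative; in your project the direction would come from a probe or concept vector you discovered earlier.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, ALPHA = 6, 8.0
direction = torch.randn(model.config.n_embd)   # stand-in for your concept direction
direction = direction / direction.norm()

ids = tok("The joke was", return_tensors="pt")
target_id = tok(" funny")["input_ids"][0]      # prediction: steering should raise this logit

def target_logit() -> float:
    with torch.no_grad():
        return model(**ids).logits[0, -1, target_id].item()

before = target_logit()                        # steps 1-2: interpret and predict
hook = model.transformer.h[LAYER].register_forward_hook(
    lambda m, i, o: (o[0] + ALPHA * direction,) + o[1:]   # step 3: intervene (add the direction)
)
after = target_logit()                         # step 4: measure
hook.remove()

print(f"target logit: {before:.2f} -> {after:.2f}")
# Step 5: if the change does not match the prediction, revise the interpretation.
```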

7. Validation Checklist for Your Research

Before claiming you have discovered something about how a model works, verify:

Validation Checklist

  1. Sanity checks passed:
    • Random model gives different interpretation
    • Random labels give different interpretation
    • Ablating "important" components breaks behavior
  2. Causal validation:
    • Intervention on identified components changes behavior as predicted
    • Effect size is substantial (not just statistically significant)
    • Results replicate across multiple examples
  3. Multi-method agreement:
    • At least 3 independent methods point to same components/features
    • If methods disagree, you understand why
  4. Baseline comparisons:
    • Performance beats random baseline
    • Results are not explained by simple heuristics (frequency, position)
  5. Generalization:
    • Findings hold on held-out test set
    • Findings transfer to related tasks/prompts
    • You have characterized when the interpretation fails
  6. Negative results reported:
    • You have documented what does not work
    • You have shown edge cases where interpretation breaks

8. Common Pitfalls

Watch out for these validation failures:

The solution: Rigorous application of this validation framework throughout your research.

In-Class Exercise: Decoding Pun Representations with Patchscopes

In this final pun exercise, we use Patchscopes—the model's own language generation—to decode and describe what information is encoded in pun representations. This helps us understand what the model "sees" when processing humor.

Part 1: Patchscopes Setup (15 min)

Understand the Patchscopes technique:

  1. Review the method:
    • Patchscopes "patches" a hidden state from one context into a different prompt
    • The model then generates text describing that hidden state
    • Example: patch the representation of "Time flies like an arrow" into a prompt like "The following text is about: "
  2. Select examples:
    • Choose 5 puns where your probe shows high pun-recognition
    • Choose 5 similar non-puns for comparison
  3. Design prompts: Create 2-3 different "decoder" prompts for Patchscopes
    • "This sentence is: "
    • "The hidden meaning is: "
    • "This is funny because: "

Part 2: Decoding Pun Representations (25 min)

Apply Patchscopes to understand what the model encodes about puns:

  1. Extract representations:
    • Run each pun/non-pun through the model
    • Extract the hidden state at the punchline position
    • Focus on the layer with best pun probe accuracy (from Week 6)
  2. Apply Patchscopes (see the sketch after this list):
    • Patch the hidden state into each decoder prompt
    • Generate 3-5 tokens of completion
    • Record the model's "interpretation" of each representation
  3. Compare pun vs non-pun:
    • Do pun representations produce different descriptions than non-puns?
    • Does the model mention humor, wordplay, or double meanings?
    • Are the descriptions meaningful or random?
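
A minimal sketch of steps 1 and 2, assuming a GPT-2-style HuggingFace model; the model name, layer, decoder prompt, and placeholder token are illustrative and should be replaced with your project's model and the layer your pun probe identified:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def patchscope(source_text, layer, decoder_prompt="This sentence is about:", max_new_tokens=5):
    # 1. Extract the hidden state at the last (punchline) token of the source.
    src = tok(source_text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**src, output_hidden_states=True).hidden_states
    h_src = hs[layer + 1][0, -1, :].clone()       # hidden_states[L+1] = residual stream after block L

    # 2. Patch it over a placeholder token appended to the decoder prompt.
    prompt = tok(decoder_prompt + " x", return_tensors="pt")   # ' x' is the placeholder we overwrite
    patch_pos = prompt["input_ids"].shape[1] - 1

    def hook(module, inputs, output):
        out = output[0]
        if out.shape[1] > patch_pos:              # patch only the full (prefill) pass
            out[0, patch_pos, :] = h_src
        return (out,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        gen = model.generate(**prompt, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    finally:
        handle.remove()
    return tok.decode(gen[0][prompt["input_ids"].shape[1]:])

# Compare what the model "says" about a pun vs. a literal control sentence.
print(patchscope("Time flies like an arrow; fruit flies like a banana", layer=8))
print(patchscope("Time passes quickly when you are busy.", layer=8))
```

Run the function over your 5 puns and 5 non-puns with each decoder prompt, then tabulate the generated descriptions for the comparison in step 3.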

Part 3: Probing Specific Features (20 min)

Test whether the model can articulate specific aspects of puns:

  1. Test the double meaning:
    • For a pun like "Time flies like an arrow; fruit flies like a banana"
    • Use prompt: "The word 'flies' here means: "
    • Does the patched representation produce both meanings?
  2. Test humor awareness:
    • Use prompt: "This sentence is [humorous/serious]: "
    • Does the model correctly classify puns as humorous?
  3. Compare layers:
    • Try Patchscopes at early, middle, and late layers
    • At which layer does the model best understand the pun?
    • Does this match your causal tracing results from Week 5?
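
Reusing the patchscope() sketch from Part 2 (same assumptions), a layer sweep might look like:

```python
pun = "Time flies like an arrow; fruit flies like a banana"
for layer in (2, 6, 10):      # early / middle / late for a 12-layer model (illustrative)
    print(layer, patchscope(pun, layer=layer, decoder_prompt="The word 'flies' here means:"))
```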

Discussion: Can the model articulate its own understanding of puns? What does this tell us about whether humor understanding is explicit or implicit in the model?

Open Neologism Training Notebook in Colab

Note: Requires NDIF access for session-based training.

Project Milestone

Due: Thursday of Week 11

Design and conduct a human validation study to test whether your interpretability findings help humans understand or predict model behavior related to your concept.

Human Validation Study

Deliverables:

The ultimate test of interpretability: do your findings help humans understand the model? Even small, well-designed studies can provide valuable validation.