Week 10: Human Understanding & Self-Description
Learning Objectives
- Understand the human-AI knowledge gap and why it matters
- Distinguish active vs passive knowledge transfer from superhuman AI
- Apply teachability and novelty criteria to evaluate discovered concepts
- Design human studies to validate interpretability findings
- Implement concept discovery methods (convex optimization)
- Measure human improvement from AI exposure
- Build concept taxonomies and organize your findings
- Validate your project's interpretability claims
- Assess knowledge transfer potential of your concept
Required Readings
- Shin et al. (2023): "Superhuman Artificial Intelligence Can Improve Human Decision Making by Increasing Novelty." PNAS.
- Schut et al. (2023/2025): "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero." PNAS.
Supplementary Readings
- See the Supplementary Reading list under References & Resources.
1. The Human-AI Knowledge Gap
What Is the Knowledge Gap?
Modern AI systems achieve superhuman performance on many tasks: AlphaZero plays chess far beyond the level of any human champion,
AlphaGo defeated the world's best Go players, and large language models process text at scales no human can match.
But a fundamental question remains: Do these AI systems know something we don't?
The Knowledge Gap Problem:
AI systems learn from vast amounts of self-play or data, exploring strategy spaces far beyond human experience.
They may discover novel concepts, strategies, or patterns that humans haven't recognized. If
these discoveries remain hidden inside neural networks, we lose the opportunity to:
- Improve human expertise in the domain
- Validate that AI reasoning aligns with human values
- Understand failure modes and limitations
- Transfer insights across domains
Two Modes of Learning from Superhuman AI
This week explores two complementary approaches to bridging the knowledge gap:
| Aspect | Passive Learning (Shin et al.) | Active Teaching (Schut et al.) |
|---|---|---|
| Mechanism | Humans observe AI play and spontaneously adopt novel strategies | Explicitly extract concepts from AI, filter, and teach humans |
| Scale | Population-level (all players improve) | Individual-level (train specific experts) |
| Evidence | 5.8M Go moves over 71 years | 4 grandmasters, controlled study |
| Control | Observational (AI exists in environment) | Interventional (deliberate concept transfer) |
| Novelty role | Novel moves correlate with improvement | Novelty is a filter criterion |
2. Passive Learning: The Go Revolution (Shin et al., 2023)
Paper: "Superhuman Artificial Intelligence Can Improve Human Decision Making by Increasing Novelty"
Authors: Minkyu Shin, Jin Kim, Bas van Opheusden, Tom Griffiths
Published: PNAS, 2023
Key contribution: Quantitative evidence that superhuman AI improves human decision-making at scale
The Natural Experiment
In 2016, AlphaGo defeated Lee Sedol, the world champion Go player. This marked a turning point:
AI became superhuman in a game with more possible board states than atoms in the universe.
Shin et al. asked: Did human players improve after witnessing superhuman AI?
Methodology
Data:
- 5.8 million moves by professional Go players
- 71 years of gameplay (1950-2021)
- Before vs After AlphaGo comparison
Evaluation:
- Used a superhuman AI to evaluate move quality
- Compared win rates of actual moves vs AI-suggested alternatives
- Generated 58 billion counterfactual game patterns
Novelty metric:
- Classified moves as novel (never seen before in professional play) or traditional
- Tracked novelty rates and quality over time
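As a rough illustration, here is a minimal sketch of one way to compute a novelty rate of this kind, assuming moves are recorded as (position, move) pairs; the toy records and the encoding are hypothetical, not the paper's actual pipeline.

```python
from collections import defaultdict

def novelty_rates(games):
    """Fraction of moves per year never seen before in earlier professional play.

    `games` is assumed to be an iterable of (year, [(position, move), ...]) records
    in chronological order (a simplified stand-in for the real Go archives).
    """
    seen = set()                       # all (position, move) pairs observed so far
    novel = defaultdict(int)
    total = defaultdict(int)
    for year, moves in games:
        for pos, move in moves:
            total[year] += 1
            if (pos, move) not in seen:
                novel[year] += 1       # first appearance of this move in this position
                seen.add((pos, move))
    return {year: novel[year] / total[year] for year in total}

# Toy usage with made-up records
games = [(2015, [("p1", "a"), ("p2", "b")]),
         (2017, [("p1", "a"), ("p1", "c"), ("p3", "d")])]
print(novelty_rates(games))            # {2015: 1.0, 2017: 0.666...}
```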
Key Findings
1. Humans improved significantly after AlphaGo
Human decision quality increased measurably after 2016. Players made moves closer to what superhuman AI would
choose.
2. Novel moves drove improvement
- Novel moves occurred more frequently after AlphaGo
- Novel moves became associated with higher decision quality
- Before AlphaGo: novelty ≠ better; After AlphaGo: novelty = better
3. Breaking from tradition works
Superhuman AI prompted players to explore beyond traditional strategies. This wasn't just
imitation—players discovered their own novel approaches, inspired by AI's creativity.
Example: Move 37
The most famous move in AI history:
In Game 2 of the AlphaGo vs Lee Sedol match, AlphaGo played Move 37—placing a stone on the 5th
line from the edge. Human commentators were shocked: "I thought it was a mistake."
This move violated centuries of Go wisdom. Yet it was exceptionally strong. Move 37 demonstrated
that AI had explored beyond human knowledge.
After witnessing this, professional players began experimenting with similar unconventional moves, many of which
proved effective.
Implications
- Superhuman AI can teach without explicit instruction: Simply existing in the environment
inspires humans to explore
- Novelty is learnable: Humans can recognize and adopt AI-discovered innovations
- Traditional knowledge isn't optimal: Conventions may persist due to lack of exploration, not
superiority
- Population-level effects: Impact extends beyond direct AI users to entire communities
3. Active Teaching: Extracting Chess Concepts (Schut et al., 2023/2025)
Paper: "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero"
Authors: Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim
Published: arXiv 2023, PNAS 2025
Key contribution: Systematic method to extract, filter, and teach novel concepts from superhuman
AI
The Challenge
Unlike Shin et al.'s observational study, Schut et al. sought to actively extract knowledge from
AlphaZero and transfer it to human experts through deliberate training.
Key questions:
- What concepts has AlphaZero learned that humans haven't?
- Can we extract these concepts in human-understandable form?
- Are they teachable to even the world's best players?
Methodology: Four-Stage Pipeline
Stage 1: Concept Discovery
Use convex optimization to find concept vectors in AlphaZero's latent space.
How it works:
- Collect chess positions where a concept might be relevant
- Extract AlphaZero's internal representations (activations) for these positions
- Find a direction in activation space that separates concept-present from concept-absent
positions
- This direction is the Concept Activation Vector (CAV)
Technical detail:
Train a linear classifier to distinguish concept examples from random counterexamples. The vector orthogonal to
the decision boundary is the CAV.
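A minimal sketch of this recipe with scikit-learn, using synthetic stand-ins for the network activations; Schut et al. use a convex optimization formulation tailored to AlphaZero rather than plain logistic regression, so treat this as the simplest member of the family.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for network activations: 100 concept-present and 100 random positions,
# each a 256-dimensional hidden-layer activation vector.
acts_concept = rng.normal(loc=0.5, size=(100, 256))
acts_random = rng.normal(loc=0.0, size=(100, 256))

X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 100 + [0] * 100)

# Linear classifier separating concept examples from random counterexamples.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the (normalized) normal vector of the decision boundary.
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Concept score of a new position = projection of its activation onto the CAV.
new_activation = rng.normal(size=256)
print("concept score:", float(new_activation @ cav))
```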
Stage 2: Filtering by Teachability
Not all extracted concepts are learnable by humans or other agents. Teachability test:
Definition: A concept is teachable if another AI agent (student) can learn it from
examples.
Procedure:
- Generate prototype positions that strongly activate the concept vector
- Train a student AI on these prototypes
- Test if the student can generalize the concept to new positions
- Measure improvement: Does the student now select the same moves as AlphaZero?
Why this matters: If even an AI student can't learn the concept, a human likely can't either.
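A functionally-grounded proxy for this teachability test, sketched with a small scikit-learn "student" trained on synthetic prototype activations; the paper's actual student is an AlphaZero-style agent evaluated on move agreement, not a classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Synthetic stand-ins: activations of prototype positions that strongly express
# the concept (label 1) and of matched control positions (label 0).
prototypes = rng.normal(loc=1.0, size=(200, 64))
controls = rng.normal(loc=0.0, size=(200, 64))
X = np.vstack([prototypes, controls])
y = np.array([1] * 200 + [0] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# "Student" model trained only on the prototype examples.
student = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
student.fit(X_train, y_train)

# Teachability proxy: does the student generalize the concept to unseen positions?
print("held-out accuracy:", student.score(X_test, y_test))
```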
Stage 3: Filtering by Novelty
A teachable concept might still be something humans already know. Novelty test:
Definition: A concept is novel if it's not present in human chess games.
Procedure:
- Collect a large database of human chess games
- Test if the concept vector activates on human games
- Low activation → concept is novel to humans
- High activation → concept already exists in human play
Why this matters: We want to teach humans something new, not rediscover known
principles.
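A minimal sketch of the novelty filter under the same synthetic setup: compare how strongly the concept vector activates on AlphaZero positions versus a human game database. All data and the threshold below are illustrative choices, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed inputs: a unit-norm concept vector (CAV) plus activation matrices for
# AlphaZero positions and for a human game database (all synthetic here).
cav = rng.normal(size=64)
cav /= np.linalg.norm(cav)
acts_alphazero = rng.normal(size=(500, 64)) + 2.0 * cav   # concept strongly present
acts_human = rng.normal(size=(500, 64))                   # concept largely absent

score_az = acts_alphazero @ cav        # per-position concept activation
score_human = acts_human @ cav

# Heuristic check: the concept fires in AlphaZero play but rarely in human games.
threshold = np.percentile(score_az, 50)
human_rate = float((score_human > threshold).mean())
print(f"human positions above the AlphaZero median activation: {human_rate:.1%}")
# A low rate is evidence (by this proxy) that the concept is novel to human play.
```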
Stage 4: Human Validation Study
The ultimate test: Can world chess champions learn these concepts?
Study design:
Participants: 4 top chess grandmasters (all former or current world champions)
Procedure:
- Pre-test: Grandmasters solve concept prototype positions (no instruction)
- Learning phase: Grandmasters study the positions with explanations
- Post-test: Grandmasters solve new positions involving the same concepts
Measurement: Improvement from pre-test to post-test
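A minimal sketch of how such a pre/post improvement could be analyzed, using a nonparametric paired test; the accuracy values are hypothetical, not the study's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical pre/post accuracies for four participants (fraction of concept
# positions solved correctly); with n = 4 this is illustrative only.
pre = np.array([0.40, 0.35, 0.45, 0.50])
post = np.array([0.65, 0.55, 0.70, 0.60])

improvement = post - pre
stat, p = wilcoxon(post, pre, alternative="greater")   # nonparametric paired test
print("mean improvement:", improvement.mean())
print("one-sided Wilcoxon p-value:", p)
# With samples this small, report per-participant deltas and effect sizes,
# not just a p-value.
```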
Key Findings
All four grandmasters improved after the learning phase.
This demonstrates that:
- AlphaZero encodes knowledge beyond existing human understanding
- This knowledge is not beyond human grasp—it can be learned
- Even elite experts can learn from AI
Types of concepts discovered:
- Quiet moves to provoke long-term weaknesses (counterintuitive to classical principles)
- Strategic queen sacrifices in unexpected contexts
- Positional imbalances that traditional evaluation undervalues
Example Concept
Concept: "Provoke Weakness Through Quiet Moves"
Classical chess principle: Active, forcing moves (checks, captures) are strong.
AlphaZero's insight: Sometimes a quiet, non-forcing move creates zugzwang—opponent must
worsen their position.
Why it's novel: Human games rarely show this pattern (low activation on human database).
Why it's teachable: Student AI improved 15% on test positions after training.
Human validation: Grandmasters improved from 40% to 65% accuracy on these positions after
studying examples.
4. Comparing the Two Approaches
| Dimension | Shin et al. (Passive) | Schut et al. (Active) |
|---|---|---|
| Research Question | Do humans improve from AI exposure? | Can we extract and teach AI concepts? |
| Scale | Population (thousands of players) | Individual (4 grandmasters) |
| Evidence Type | Observational (71 years) | Experimental (controlled study) |
| Novelty Role | Outcome (novel moves = better) | Filter (select novel concepts) |
| Teaching Method | Implicit (watch AI play) | Explicit (study concept prototypes) |
| Interpretability | Black box (don't know what changed) | White box (extract specific concepts) |
| Measurement | Move quality via AI evaluation | Accuracy on concept test positions |
| Strength | Ecological validity (real-world effect) | Causal clarity (know what's taught) |
Complementary Insights
Together, these papers show that superhuman AI can improve humans through:
- Inspiration (Shin): Seeing AI play prompts exploration of novel strategies
- Instruction (Schut): Explicitly teaching extracted concepts
Both require novelty—humans learn by going beyond traditional knowledge.
5. Designing Human Studies for Interpretability
How do you validate that your interpretability findings are meaningful? Human studies provide the gold standard.
Three Types of Evaluation (Doshi-Velez & Kim, 2017)
| Type | When to Use | Example |
|---|---|---|
| Application-Grounded | Real-world task with domain experts | Schut's grandmaster study |
| Human-Grounded | Simplified task with lay users | Show concept examples, ask "Does this make sense?" |
| Functionally-Grounded | No humans, use proxy metrics | Teachability test (AI student learns) |
Key Considerations for Study Design
1. Who are your participants?
- Domain experts (like grandmasters): High validity, but expensive and small sample
- Lay users: Larger sample, but may lack necessary background
- Trade-off: Schut used 4 experts; Shin used population-level data
2. What do you measure?
- Performance: Can they solve tasks better? (Schut: accuracy on test positions)
- Understanding: Can they explain the concept? (Free-response questions)
- Trust: Do they believe the explanation? (Likert scales)
- Warning: Trust ≠ utility (Bansal et al., Week 10)
3. Control conditions?
- Pre-post design: Measure before and after learning (Schut)
- Comparison group: Concept group vs baseline explanation
- Temporal controls: Before vs after AI release (Shin)
4. Avoiding pitfalls (from Week 10):
- Don't cherry-pick examples that support your hypothesis
- Use multiple evaluation metrics (accuracy, confidence, explanation quality)
- Test robustness: Do concepts transfer to new contexts?
- Report negative results: Which concepts weren't learnable?
6. Building a Concept Atlas
As you discover concepts in your project, you need to organize them systematically. A concept
atlas maps the landscape of what your model knows.
Why Build an Atlas?
- Locate your concept: How does your project concept relate to known concepts?
- Avoid redundancy: Is your "novel" concept actually a rediscovery?
- Find relationships: Do concepts compose, interfere, or act as prerequisites for one another?
- Assess coverage: What fraction of model behavior do identified concepts explain?
Organizing Dimensions
1. Abstraction Level
- Low-level: "Edges," "colors," "individual words"
- Mid-level: "Object parts," "syntactic roles," "local patterns"
- High-level: "Objects," "semantic concepts," "strategic plans"
2. Domain Specificity
- Universal: Position, number, negation (across tasks)
- Domain-specific: "Zugzwang" (chess), "ko threat" (Go)
3. Complexity
- Atomic: Single feature
- Compositional: Combination of concepts (e.g., "quiet move that provokes weakness")
4. Novelty (from Schut)
- Known: Present in human data
- Novel: Absent from human data
5. Teachability (from Schut)
- Teachable: Student AI or humans can learn
- Unteachable: Too complex or ill-defined
Building Your Project Atlas
Step 1: Inventory
List all concepts you've encountered in Weeks 1-10:
- From Week 5 (circuits): Which operations did you find?
- From Week 6 (probes): What information is encoded where?
- From Week 7 (SAEs): What features did the autoencoder discover?
- From Week 9 (attribution): Which inputs matter for your concept?
Step 2: Characterize
For each concept, assess:
- Abstraction level
- Domain specificity
- Novelty (is it in your training data?)
- Teachability (could a human learn it?)
Step 3: Map Relationships
- Prerequisite: Does concept A require understanding concept B?
- Composition: Is concept C = A + B?
- Interference: Do A and B conflict?
Step 4: Assess Coverage
- What fraction of model behavior do your concepts explain?
- Where are the gaps?
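One lightweight way to record atlas entries is a small data structure covering the dimensions above; the field names and the example entry below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Abstraction(Enum):
    LOW = "low"        # edges, colors, individual words
    MID = "mid"        # object parts, syntactic roles, local patterns
    HIGH = "high"      # objects, semantic concepts, strategic plans

@dataclass
class ConceptEntry:
    """One row of a project concept atlas; field names are illustrative."""
    name: str
    abstraction: Abstraction
    domain_specific: bool
    novel: bool                                          # absent from human/training data?
    teachable: bool                                      # did a student model or human learn it?
    evidence: list[str] = field(default_factory=list)    # e.g. "layer-12 probe, acc 0.91"
    prerequisites: list[str] = field(default_factory=list)
    composed_of: list[str] = field(default_factory=list)

# Hypothetical example entry
atlas = [
    ConceptEntry("quiet-move-provokes-weakness", Abstraction.HIGH,
                 domain_specific=True, novel=True, teachable=True,
                 evidence=["concept vector at a late layer", "student accuracy +15%"]),
]
print(atlas[0].name, atlas[0].novel, atlas[0].teachable)
```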
7. Applying to Your Research Project
Validating Your Interpretability Claims
By Week 11, you should have discovered something about how your model represents your concept. Now validate it:
Checklist for Rigorous Validation:
✓ Sanity Checks (Week 10)
- Did you run model/data randomization tests?
- Is your interpretation robust to methodological choices?
✓ Multiple Methods (Weeks 5-9)
- Did circuits, probes, SAEs, and attribution agree?
- If not, why? (Disagreement can be informative)
✓ Causal Validation (Week 8)
- Did IIA confirm that your identified components are causal?
- Can you intervene to change the concept?
✓ Novelty Assessment (Schut)
- Is your concept in the training data?
- If yes: rediscovery (still valid!)
- If no: potential superhuman insight
✓ Teachability Assessment (Schut)
- Can you explain the concept to a colleague?
- Can they use it to improve their understanding?
- If you trained a simple model on concept examples, does it generalize?
✓ Human Study (if feasible)
- Even a small study (n=3-5) is valuable
- Pre-post design or comparison with baseline
- Measure performance, not just trust
Knowledge Transfer Potential
Ask: Could a domain expert learn from your finding?
Example: Musical Key Representation
Finding: LLM uses specific attention heads to track musical key.
Novelty check: Is key tracking in music theory textbooks? (Yes → not novel)
Teachability check: Could a musician learn which contexts activate key tracking? (Possibly)
Knowledge transfer value: Moderate—validates human understanding but doesn't extend it.
Example: Protein Stability Representations
Finding: LLM predicts protein mutations that improve stability using a novel attention pattern.
Novelty check: Pattern not present in biochemistry literature? (Novel)
Teachability check: Can a biologist learn to recognize this pattern? (Test needed)
Knowledge transfer value: High—could guide wet-lab experiments.
8. Limitations and Open Questions
Limitations of Current Approaches
From Shin et al.:
- Can't identify which concepts humans learned (black box improvement)
- Correlation, not causation (did AlphaGo cause the improvement or just coincide?)
- Domain-specific (Go); does the effect generalize to other domains?
From Schut et al.:
- Linear concept assumption (CAVs are linear directions)
- Small sample (4 grandmasters, but they're world champions)
- Chess-specific; is the method tractable for more complex, less structured domains?
- Doesn't extract all concepts, just some filtered subset
Open Research Questions
- Completeness: How do we know we've found all the important concepts?
- Compositionality: How do simple concepts combine into complex strategies?
- Transfer across models: Are concepts universal or model-specific?
- Long-term retention: Do humans retain AI-taught concepts over months/years?
- Misalignment detection: What if AI discovers harmful concepts?
- Scaling: Can these methods work for domains more complex than chess/Go?
9. Best Practices for Your Paper
Making Rigorous Claims
✓ DO:
- Use multiple validation methods (circuits + probes + SAEs + IIA)
- Report which concepts were teachable and which weren't
- Assess novelty (compare with training data or domain literature)
- Design human studies, even small ones
- Show when your interpretation fails (edge cases)
- Compare your concepts to known domain knowledge
✗ DON'T:
- Claim "superhuman" insight without novelty evidence
- Cherry-pick examples that support your interpretation
- Rely on single method (e.g., only probes)
- Assume teachability without testing
- Over-generalize from one model or task
- Ignore Week 10's skepticism lessons
10. Summary and Integration
Key Takeaways
- The knowledge gap is real: AI systems learn beyond human knowledge (Shin: novelty improves
humans; Schut: extractable novel concepts)
- Two pathways: Passive (observation) and active (explicit teaching) knowledge transfer
- Novelty + teachability: Framework for filtering valuable concepts
- Human studies matter: Validate interpretability claims with human performance
- Concept atlases: Organize findings systematically
- Integration: Combine methods from Weeks 1-10 for rigorous validation
Your Research Workflow
- Weeks 1-9: Discover concepts using multiple methods
- Week 10: Apply skepticism—run sanity checks
- Week 11:
- Assess novelty (is it in training data?)
- Assess teachability (can humans/students learn it?)
- Design human study (even small)
- Build concept atlas (organize findings)
- Validate with multiple methods
- Week 12: Present findings with rigorous evidence
Looking Ahead
The field of interpretability is moving toward actionable insights—not just understanding what
models do, but using that understanding to improve human knowledge, align AI systems, and enable human-AI
collaboration.
Your project contributes to this by characterizing how LLMs represent non-CS concepts—a crucial step toward AI
systems that can truly collaborate with domain experts across all fields.
References & Resources
Core Papers
- Shin et al. (2023): "Superhuman Artificial Intelligence Can Improve Human Decision Making by
Increasing Novelty." PNAS, 120(12), e2214840120.
- Schut et al. (2023/2025): "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer
in AlphaZero." PNAS, 122, e2406675122. arXiv:2310.16410
Related Work
- AlphaGo: Silver et al. (2016). "Mastering the game of Go with deep neural networks and tree
search." Nature, 529, 484-489.
- AlphaZero: Silver et al. (2018). "A general reinforcement learning algorithm that masters
chess, shogi, and Go through self-play." Science, 362(6419), 1140-1144.
- TCAV: Kim et al. (2018). "Interpretability Beyond Feature Attribution: Quantitative Testing
with Concept Activation Vectors." ICML.
- Doshi-Velez & Kim (2017): "Towards A Rigorous Science of Interpretable Machine Learning." arXiv:1702.08608
Supplementary Reading
- Atanasova et al. (2020). "A Diagnostic Study of Explainability Techniques for Text Classification."
EMNLP.
- Bansal et al. (2021). "Does the Whole Exceed its Parts?" CHI.
- AlphaStar case study: DeepMind blog on discovering novel StarCraft strategies
- Jacovi & Goldberg (2020). "Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?"
Part 2: Validation Framework for Interpretability
The Need for Rigorous Validation
By now you have applied many interpretability methods: causal tracing, probing, attribution, circuit discovery.
But how do we know our interpretations are actually correct? This section establishes a comprehensive validation
framework to ensure your findings are robust and faithful.
1. Levels of Validation (Doshi-Velez & Kim, 2017)
Doshi-Velez and Kim (2017) propose three levels of evaluation for interpretability methods:
Application-Grounded Evaluation
Test interpretations with real users performing real tasks in the actual application domain.
- Example: Have doctors use model explanations to make diagnoses, measure accuracy improvement
- Pros: Highest ecological validity, tests real-world utility
- Cons: Expensive, requires domain experts, slow iteration
- When to use: Final validation before deployment, high-stakes applications
Human-Grounded Evaluation
Test with lay users on simplified tasks that capture the essence of the real application.
- Example: Show users attention visualizations, ask if they make sense
- Pros: Faster than application-grounded, can use larger samples
- Cons: May not generalize to real application, subjective judgments
- When to use: Iterative development, when domain experts are unavailable
Functionally-Grounded Evaluation
Test interpretations using quantitative metrics without human subjects.
- Example: Measure if high-importance features (by attribution) actually change outputs when intervened on
- Pros: Fast, reproducible, scalable, no human subjects needed
- Cons: May miss aspects that matter to humans
- When to use: Initial validation, debugging, comparing methods
2. Faithfulness: The Core Requirement (Jacovi & Goldberg, 2020)
Jacovi and Goldberg (2020) argue that faithfulness is the fundamental property interpretations must satisfy:
Faithfulness: An explanation is faithful if it accurately represents the model's true decision-making process.
Why Faithfulness Matters
An explanation can be:
- Plausible but unfaithful: Makes intuitive sense but does not reflect what the model actually does
- Faithful but implausible: Accurately describes the model but is hard to understand
Goal: Faithful AND plausible explanations. But when forced to choose, faithfulness must come first.
Testing Faithfulness
The causal intervention methods you learned in Week 5 are key faithfulness tests:
- Forward simulation: If explanation says X causes Y, intervening on X should change Y
- Backward verification: If explanation highlights component C, ablating C should break the behavior
- Sufficiency test: Patching only the explained components should recover the behavior
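A minimal sketch of the sufficiency test on a toy PyTorch network: record the activations of the units your interpretation singles out on a "clean" input, patch only those units into a run on a different input, and measure how much of the clean output is recovered. The model, inputs, and unit indices are all placeholders, not a specific experimental setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for the model under study.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
explained_units = [3, 7]               # hidden units the interpretation claims do the work

clean = torch.randn(1, 8)
corrupted = torch.randn(1, 8)

def hidden(x):
    """Activations after the first layer (the site of the claimed mechanism)."""
    return torch.relu(model[0](x))

def run_with_patch(x, units, donor_hidden):
    """Run x, but splice in the donor's activations at the explained units only."""
    h = hidden(x).clone()
    h[:, units] = donor_hidden[:, units]
    return model[2](h)

with torch.no_grad():
    h_clean = hidden(clean)
    out_clean = model(clean)
    out_corrupted = model(corrupted)
    out_patched = run_with_patch(corrupted, explained_units, h_clean)

# Sufficiency: patching only the explained units should move the corrupted output
# most of the way toward the clean output.
recovered = (out_patched - out_corrupted) / (out_clean - out_corrupted + 1e-8)
print("fraction of clean behavior recovered:", float(recovered))
```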
3. Multi-Method Validation
No single interpretability method is perfect. Robust findings require convergent evidence from multiple independent methods.
Method Triangulation
| Method Category | Example Methods | What It Tests |
|---|---|---|
| Causal Intervention | Patching, ablation, steering | Which components are necessary/sufficient |
| Attribution | Gradients, IG, attention | Which inputs are important |
| Probing | Linear probes, logistic regression | What information is encoded |
| Feature Discovery | SAEs, dictionary learning | What features are represented |
| Behavioral Testing | Adversarial examples, edge cases | When the interpretation fails |
Validation strategy: Use at least 3 independent methods. If they agree, confidence increases. If they disagree, investigate why.
4. Sanity Checks: Catching Illusions
Before trusting any interpretation, run basic sanity checks.
Model Randomization Test
Test: Does your interpretation change when applied to a randomly initialized model?
interpretation_trained = analyze(trained_model)
interpretation_random = analyze(random_model)
assert interpretation_trained != interpretation_random
Why it matters: If interpretations look the same for trained and random models, your method might just be detecting network architecture, not learned behavior.
Data Randomization Test
Test: Train a model on data with random labels. Does your interpretation change?
- A model trained on random labels should have different (probably nonsensical) internal mechanisms
- If interpretation looks the same, it is not capturing what the model learned
Ablation Completeness Test
Test: If you ablate all "important" components identified by your method, does the behavior break?
important_components = find_important_components(model)
ablated_performance = test(model, ablate=important_components)
assert ablated_performance < baseline_performance  # should drop sharply, not marginally
Sign Test (for attribution methods)
Test: Do positive attributions actually help the predicted class?
- Remove features with high positive attribution: performance should drop
- Remove features with high negative attribution: performance might improve
- If removing "important" features does not change output, attribution is unfaithful
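A minimal sketch of the sign test for a linear scorer, where gradient-times-input attribution is exact; for a real model you would ablate tokens or features and re-run the forward pass. Everything here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linear scorer standing in for the model; for a linear model,
# gradient-times-input attribution is exactly weight * input.
w = rng.normal(size=20)
x = rng.normal(size=20)
score = float(w @ x)
attribution = w * x

# Remove (zero out) the 5 features with the largest positive attribution.
top_positive = np.argsort(attribution)[-5:]
x_removed = x.copy()
x_removed[top_positive] = 0.0
score_removed = float(w @ x_removed)

print(f"score: {score:.3f} -> after removing top-positive features: {score_removed:.3f}")
# If the score does not drop, the attribution is not faithful to this model.
```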
5. Baseline Comparisons
Always compare your interpretability findings against appropriate baselines:
Random Baseline
- For component importance: Compare AIE of identified components vs. random components
- For attribution: Compare removal of high-attribution features vs. random features
- Minimum bar: Your method must beat random
Frequency Baseline
- For text: High-attribution words might just be common words
- Test: Does attribution correlate with word frequency? If yes, consider TF-IDF weighting
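A quick way to run this check is to correlate per-word attribution scores with corpus frequency; the data below is synthetic, with the correlation deliberately built in for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)

# Hypothetical per-word scores: attribution magnitude and corpus frequency.
attribution = rng.random(200)
frequency = 0.7 * attribution + 0.3 * rng.random(200)

rho, p = spearmanr(attribution, frequency)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
# A strong correlation suggests the attributions partly track word frequency;
# re-check the finding with TF-IDF weighting or frequency-matched controls.
```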
6. Counterfactual Testing
The strongest validation: predict what will happen, then intervene and verify.
The Counterfactual Validation Loop
- Interpret: "Component C computes function F for concept X"
- Predict: "If I intervene on C, behavior related to X should change in way Y"
- Intervene: Actually modify C (patch, ablate, steer)
- Measure: Did Y happen?
- Iterate: If not, revise interpretation
7. Validation Checklist for Your Research
Before claiming you have discovered something about how a model works, verify:
Validation Checklist
- Sanity checks passed:
- Random model gives different interpretation
- Random labels give different interpretation
- Ablating "important" components breaks behavior
- Causal validation:
- Intervention on identified components changes behavior as predicted
- Effect size is substantial (not just statistically significant)
- Results replicate across multiple examples
- Multi-method agreement:
- At least 3 independent methods point to same components/features
- If methods disagree, you understand why
- Baseline comparisons:
- Performance beats random baseline
- Results are not explained by simple heuristics (frequency, position)
- Generalization:
- Findings hold on held-out test set
- Findings transfer to related tasks/prompts
- You have characterized when the interpretation fails
- Negative results reported:
- You have documented what does not work
- You have shown edge cases where interpretation breaks
8. Common Pitfalls
Watch out for these validation failures:
- Confirmation bias: Finding what you expect to find, ignoring contradictory evidence
- Cherry-picking: Showing only examples that support your interpretation
- P-hacking: Testing many hypotheses, reporting only significant ones without correction
- Overfitting: Interpretation works on training data but not test data
- Confounds: Attributed importance is actually due to correlated features
- Visualization artifacts: patterns in a visualization that do not reflect the model's actual computation
The solution: Rigorous application of this validation framework throughout your research.
In-Class Exercise: Decoding Pun Representations with Patchscopes
In this final pun exercise, we use Patchscopes—the model's own language generation—to decode and
describe what information is encoded in pun representations. This helps us understand what the model
"sees" when processing humor.
Part 1: Patchscopes Setup (15 min)
Understand the Patchscopes technique:
- Review the method:
- Patchscopes "patches" a hidden state from one context into a different prompt
- The model then generates text describing that hidden state
- Example: patch the representation of "Time flies like an arrow" into a prompt like "The following text is about: "
- Select examples:
- Choose 5 puns where your probe shows high pun-recognition
- Choose 5 similar non-puns for comparison
- Design prompts: Create 2-3 different "decoder" prompts for Patchscopes
- "This sentence is: "
- "The hidden meaning is: "
- "This is funny because: "
Part 2: Decoding Pun Representations (25 min)
Apply Patchscopes to understand what the model encodes about puns (a minimal code sketch follows this list):
- Extract representations:
- Run each pun/non-pun through the model
- Extract the hidden state at the punchline position
- Focus on the layer with best pun probe accuracy (from Week 6)
- Apply Patchscopes:
- Patch the hidden state into each decoder prompt
- Generate 3-5 tokens of completion
- Record the model's "interpretation" of each representation
- Compare pun vs non-pun:
- Do pun representations produce different descriptions than non-puns?
- Does the model mention humor, wordplay, or double meanings?
- Are the descriptions meaningful or random?
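Below is a minimal sketch of the patching step using Hugging Face transformers with GPT-2 as a stand-in; the model, layer index, and prompts are placeholders to adapt to your course model and NDIF setup, and hook-based patching is only one of several ways to implement Patchscopes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                          # small stand-in; substitute your course model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6                                    # e.g. the layer with the best pun-probe accuracy

# 1) Source pass: grab the hidden state at the punchline (last) token.
source = "Time flies like an arrow; fruit flies like a banana"
src = tok(source, return_tensors="pt")
with torch.no_grad():
    out = model(**src, output_hidden_states=True)
h_src = out.hidden_states[LAYER][0, -1]      # residual stream after block LAYER-1

# 2) Target pass: patch h_src into the last position of a decoder prompt.
target = "The following text is about:"
tgt = tok(target, return_tensors="pt")
patch_pos = tgt["input_ids"].shape[1] - 1

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > 1:                  # only on the prompt (prefill) pass
        hidden[:, patch_pos] = h_src
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(patch_hook)
try:
    with torch.no_grad():
        gen = model.generate(**tgt, max_new_tokens=8, do_sample=False)
finally:
    handle.remove()

print(tok.decode(gen[0][tgt["input_ids"].shape[1]:]))   # the model's "description"
```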
Part 3: Probing Specific Features (20 min)
Test whether the model can articulate specific aspects of puns:
- Test the double meaning:
- For a pun like "Time flies like an arrow; fruit flies like a banana"
- Use prompt: "The word 'flies' here means: "
- Does the patched representation produce both meanings?
- Test humor awareness:
- Use prompt: "This sentence is [humorous/serious]: "
- Does the model correctly classify puns as humorous?
- Compare layers:
- Try Patchscopes at early, middle, and late layers
- At which layer does the model best understand the pun?
- Does this match your causal tracing results from Week 5?
Discussion: Can the model articulate its own understanding of puns?
What does this tell us about whether humor understanding is explicit or implicit in the model?
Open Neologism Training Notebook in Colab
Note: Requires NDIF access for session-based training.
Project Milestone
Due: Thursday of Week 11
Design and conduct a human validation study to test whether your interpretability findings
help humans understand or predict model behavior related to your concept.
Human Validation Study
- Design study:
- What is your research question? (e.g., "Do explanations help humans predict model errors?")
- What will you show participants? (explanations, examples, circuit diagrams)
- What will you ask them to do? (predict behavior, identify errors, categorize examples)
- How will you measure understanding? (accuracy, confidence, response time)
- Create materials:
- Control condition: humans with no explanation
- Experimental condition: humans with your interpretability findings
- Test cases: mix of typical and edge cases
- Instructions and training examples
- Recruit participants:
- Target: 10-20 participants minimum
- Consider expertise level: domain experts vs novices
- Randomize assignment to conditions
- Run study and analyze:
- Collect responses
- Compare performance: explanation vs no-explanation
- Statistical significance testing (see the sketch after this list)
- Qualitative feedback: what was helpful? confusing?
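A minimal sketch of the condition comparison, with hypothetical accuracy data and a nonparametric test; swap in your actual per-participant scores and whatever analysis you planned in the design document.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-participant accuracies (fraction of test cases answered correctly).
control = np.array([0.55, 0.60, 0.50, 0.65, 0.58, 0.62, 0.57, 0.52, 0.61, 0.59])
explained = np.array([0.70, 0.66, 0.72, 0.68, 0.75, 0.64, 0.71, 0.69, 0.73, 0.67])

stat, p = mannwhitneyu(explained, control, alternative="greater")
gain = explained.mean() - control.mean()
print(f"mean accuracy gain: {gain:.2f}, one-sided Mann-Whitney U p = {p:.3g}")
# Pair the test with confidence intervals and the qualitative feedback; a single
# p-value from 10 participants per condition is weak evidence on its own.
```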
Deliverables:
- Study design document:
- Research question and hypotheses
- Methods: participants, materials, procedure
- Planned analyses
- Results:
- Quantitative: performance comparison with statistical tests
- Qualitative: participant feedback and observations
- Visualization: accuracy, confidence, or other metrics by condition
- Interpretation:
- Do your explanations help humans?
- What aspects are most/least helpful?
- What does this reveal about the quality of your interpretability work?
- Materials: Study materials, data, and analysis code
The ultimate test of interpretability: do your findings help humans understand the model?
Even small, well-designed studies can provide valuable validation.