A tutorial based on Perez et al. (2022), "Discovering Language Model Behaviors with Model-Written Evaluations"
Before we can study how a concept is represented inside a neural network, we need examples that reliably elicit that concept. Creating evaluation datasets by hand is slow. Perez et al. demonstrate that language models can generate evaluation data at scale, and that this approach can uncover behaviors humans might not anticipate.
This tutorial explains the core methodology and adapts it for interpretability research, using pun understanding as a running example.
The paper's approach has four stages:
1. Prompt a capable LLM to generate candidate evaluation items.
2. Filter the raw output with automated quality checks.
3. Validate a sample through human review.
4. Run the evaluation on target models and analyze the results.
The key insight: LLMs can generate diverse, creative test cases faster than humans, while humans remain in the loop for quality control and analysis.
Many interpretability methods become dramatically easier when the model's critical decision is concentrated at a single, predictable token position. When you know exactly where the decision happens, you can extract activations at that position, compare logits over candidate answers, patch activations at a known location, and trace which earlier tokens influence it.
Two formats naturally provide this localization: cloze-style prompts (fill-in-the-blank) and multiple-choice questions (MCQ), where the model produces a single letter. Free-form generation is harder to analyze because the "decision" is distributed across multiple tokens.
The model completes a sentence with a single word or short phrase. The blank is the decision point.
Strengths: Decision localized to one token. Enables direct probability comparison over candidate completions. Natural fit for logit lens and activation extraction.
Weaknesses: Requires careful prompt design so the blank is unambiguous. Some concepts do not reduce naturally to single-token decisions.
Design principle: The cloze prompt should be constructed so that (1) the correct answer is a single token or very short phrase, and (2) incorrect answers are also plausible completions. This lets you compare P(correct) vs. P(incorrect) at the decision position.
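To make this concrete, here is a minimal sketch of that comparison with a Hugging Face causal LM; the model choice ("gpt2") and the helper name `candidate_logprobs` are illustrative assumptions, not part of the paper's setup. Because the blank falls at the end of the prompt, the whole comparison is a single next-token readout.

```python
# A minimal sketch (not the paper's code): comparing P(correct) vs. P(incorrect)
# at the blank. "gpt2" and candidate_logprobs are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def candidate_logprobs(prompt: str, candidates: list[str]) -> dict[str, float]:
    """Log-probability of each single-token candidate as the next token after `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                   # shape: (1, seq_len, vocab)
    next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)
    scores = {}
    for cand in candidates:
        ids = tokenizer.encode(" " + cand, add_special_tokens=False)
        if len(ids) == 1:                                 # keep single-token candidates only
            scores[cand] = next_token_logprobs[ids[0]].item()
    return scores

# The prompt is written so the blank falls at the very end (the decision position).
prompt = ('The joke "I used to be a banker but I lost interest" is a pun because '
          'the word "interest" refers to both financial returns and')
print(candidate_logprobs(prompt, ["curiosity", "money"]))
```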
The model chooses among labeled options (A, B, C, D). This enables precise measurement and easy automation.
Strengths: Decision localized to a single token (the letter). Unambiguous evaluation. Enables comparison of answer probabilities across options. Easy to automate.
Weaknesses: May not reflect real-world usage. The correct answer might be guessable from surface features. Format itself may activate "test-taking" behaviors distinct from natural understanding.
For interpretability: MCQ is excellent because you can extract activations at the final position and examine the logit distribution over {A, B, C, D}. You can also study what information flows to that position from earlier in the prompt.
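A rough sketch of that readout, again assuming an illustrative model ("gpt2") and prompt layout: the answer letter is the next token after "Answer:", so both the letter logits and the activations for later probing come from the final position.

```python
# A minimal sketch (illustrative model and prompt layout): letter logits and the
# decision-position activations for an MCQ item.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

mcq_prompt = (
    'In the pun "I used to be a banker, but I lost interest," which word carries '
    "the double meaning?\n"
    "A) banker\nB) lost\nC) interest\nD) used\nAnswer:"
)

inputs = tokenizer(mcq_prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decision position = the final prompt token; the answer letter is predicted here.
final_logits = out.logits[0, -1]
letter_ids = {letter: tokenizer.encode(" " + letter, add_special_tokens=False)[0]
              for letter in "ABCD"}
letter_probs = torch.softmax(
    torch.stack([final_logits[i] for i in letter_ids.values()]), dim=-1
)
print(dict(zip(letter_ids, letter_probs.tolist())))

# One residual-stream vector per layer at the same position, ready for probing.
decision_activations = [h[0, -1].clone() for h in out.hidden_states]
```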
Zero-shot: The model receives only a question or prompt, with no examples. Tests baseline capabilities and default behaviors.
In-context learning (ICL): Provide examples in the prompt to demonstrate the task. This activates task-relevant circuits and can elicit capabilities the model has but does not spontaneously display.
For interpretability: ICL is valuable precisely because it changes internal representations. You can compare activations with and without in-context examples to see how demonstrations reshape the model's processing. Use cloze or MCQ format for the test case to maintain token localization.
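One way to run that comparison, sketched here with an illustrative model and two hand-written demonstrations: score the same test item zero-shot and with demonstrations prepended, then compare the decision-position hidden states layer by layer.

```python
# A minimal sketch (illustrative model and demonstrations): how much do in-context
# examples move the representation at the decision position?
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

test_item = ('"I used to be a banker, but I lost interest." '
             "This is a pun. Yes or No? Answer:")
demos = (
    '"Why do bees have sticky hair? Because they use honeycombs." '
    "This is a pun. Yes or No? Answer: Yes\n"
    '"I bought a new car last week." This is a pun. Yes or No? Answer: No\n'
)

def decision_hidden_states(prompt: str) -> list[torch.Tensor]:
    """Hidden state at the final (decision) position, one vector per layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return [h[0, -1] for h in out.hidden_states]

zero_shot = decision_hidden_states(test_item)
few_shot = decision_hidden_states(demos + test_item)

# Per-layer cosine similarity: where do the demonstrations reshape processing?
for layer, (z, f) in enumerate(zip(zero_shot, few_shot)):
    print(layer, round(F.cosine_similarity(z, f, dim=0).item(), 3))
```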
Suppose we want to study how language models represent puns: wordplay that exploits multiple meanings or similar sounds. Before we can localize "pun circuits" or probe for pun-related features, we need evaluation data that reliably triggers pun processing.
We want to evaluate whether a model:
- Recognizes puns (distinguishes them from superficially similar non-puns)
- Identifies which word carries the double meaning
- Can explain the two meanings the wordplay exploits
We prompt a capable model to generate evaluation data. Here are templates for each format:
Generate cloze-style questions that test pun understanding. Each question should have a blank that can be filled with a single word or short phrase. Provide the correct answer and one plausible incorrect answer.

Format:
Prompt: [sentence with ___ blank]
Correct: [answer]
Incorrect: [plausible wrong answer]

Examples:
Prompt: The joke "I used to be a banker but I lost interest" is a pun because the word "interest" refers to both financial returns and ___.
Correct: curiosity
Incorrect: money

Prompt: "Time flies like an arrow; fruit flies like a banana" is a pun because "flies" shifts from being a ___ to being a noun.
Correct: verb
Incorrect: metaphor

Generate 20 more cloze-style pun evaluation items with varied formats.
Generate pun recognition tasks in cloze format. The model must complete the sentence with "Yes" or "No".

Format:
Prompt: [statement]. This is a pun. Yes or No? Answer:
Label: [Yes/No]

Examples:
Prompt: "I used to be a banker, but I lost interest." This is a pun. Yes or No? Answer:
Label: Yes

Prompt: "I used to be a banker, but I changed careers." This is a pun. Yes or No? Answer:
Label: No

Generate 30 examples, balanced between puns and non-puns.
Generate multiple choice questions that test understanding of puns. Each question should have one correct answer and three plausible distractors.

Format:
Question: [question text]
A) [option]
B) [option]
C) [option]
D) [option]
Correct: [letter]

Example:
Question: In the pun "I used to be a banker, but I lost interest," which word carries the double meaning?
A) banker
B) lost
C) interest
D) used
Correct: C

Generate 15 more MCQ items testing pun recognition and explanation.
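Once a generator has produced text in this format, the raw output has to be parsed into structured records before filtering. Below is a minimal parsing sketch for the MCQ template, assuming the generator kept the layout above; items that fail to parse simply do not match.

```python
# A minimal parsing sketch for the MCQ template above. Assumes the generator kept the
# "Question: / A)-D) / Correct:" layout; non-conforming items simply fail to match.
import re

MCQ_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*"
    r"A\)\s*(?P<A>.+?)\s*B\)\s*(?P<B>.+?)\s*C\)\s*(?P<C>.+?)\s*D\)\s*(?P<D>.+?)\s*"
    r"Correct:\s*(?P<correct>[ABCD])",
    re.DOTALL,
)

def parse_mcq_items(raw_text: str) -> list[dict]:
    """Turn raw generated text into structured MCQ records."""
    items = []
    for match in MCQ_PATTERN.finditer(raw_text):
        fields = match.groupdict()
        items.append({
            "question": fields["question"].strip(),
            "options": {k: fields[k].strip() for k in "ABCD"},
            "correct": fields["correct"],
        })
    return items
```

Comparing the number of parsed items against the number requested is itself a cheap signal of how well the generator followed the format.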
The paper emphasizes that raw LLM output requires filtering. Apply these checks:
Automated filters: remove duplicates, verify that answers are single tokens where possible, and check that labels (e.g., Yes/No) are balanced (sketched below).
Human review: manually inspect a random sample to confirm the labels are correct and the wordplay actually works.
Diversity checks: examine the distribution of topics and pun mechanisms to catch repeated patterns.
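Here is a minimal sketch of the automated filters, plus two quick diagnostics that feed the human-review and diversity passes. The item fields ("prompt", "answer", "topic") are illustrative assumptions about how the dataset is stored.

```python
# A minimal sketch of the automated filters, with quick diagnostics for the
# human-review and diversity passes. Item fields are illustrative assumptions.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def filter_items(items: list[dict]) -> list[dict]:
    kept, seen = [], set()
    for item in items:
        key = " ".join(item["prompt"].lower().split())    # normalize case/whitespace
        if key in seen:
            continue                                      # drop exact duplicates
        seen.add(key)
        # Keep only items whose answer is a single token, so the decision stays localized.
        if len(tokenizer.encode(" " + item["answer"], add_special_tokens=False)) != 1:
            continue
        kept.append(item)
    # Diagnostics: label balance and topic spread guide the human and diversity review.
    print("label balance:", Counter(i["answer"] for i in kept))
    print("topic spread:", Counter(i.get("topic", "unknown") for i in kept))
    return kept
```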
With filtered data, evaluate your target models. For interpretability work, you likely want more than accuracy: record the probability the model assigns to the correct versus incorrect answer at the decision token, and extract activations at that position for later probing, patching, and logit-lens analysis.
Problem: LLMs sometimes generate examples that are ambiguous, incorrect, or too easy.
Solution: Always manually review a sample. For puns, check that the wordplay actually works; LLMs sometimes generate "puns" where the double meaning does not quite land.
Problem: LLMs tend to repeat patterns. You might get 50 puns about "bank" and "interest."
Solution: Explicitly prompt for diversity: "Generate puns about different topics: food, animals, professions, science, sports..." Use multiple generation runs with different temperatures.
Problem: The model being evaluated may have seen these exact puns during training.
Solution: Generate novel puns rather than using famous ones. Include a novelty check: can the model explain puns it has likely never seen?
Problem: Models might detect puns from surface features (sentence length, punctuation) rather than understanding wordplay.
Solution: Ensure non-pun examples have similar surface features. Include "almost-puns" that have the structure but lack the double meaning.
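A quick way to audit this, assuming the same illustrative item format as above: compare a simple surface statistic (here, word count) between the pun and non-pun classes and rebalance if they diverge.

```python
# A quick surface-feature audit, assuming the same illustrative item format as above.
# The two-word threshold mentioned below is an arbitrary example, not a standard.
from statistics import mean

def length_gap(items: list[dict]) -> float:
    """Difference in mean word count between pun and non-pun prompts."""
    pun = [len(i["prompt"].split()) for i in items if i["answer"] == "Yes"]
    non_pun = [len(i["prompt"].split()) for i in items if i["answer"] == "No"]
    return mean(pun) - mean(non_pun)

items = [
    {"prompt": "I used to be a banker but I lost interest.", "answer": "Yes"},
    {"prompt": "I used to be a banker but I lost patience.", "answer": "No"},
]
# If the gap is large (say, more than two words), rewrite or rebalance the non-pun
# items so that length alone cannot separate the classes.
print(length_gap(items))
```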
For most interpretability methods, cloze and MCQ formats are strictly superior to free-form generation:
| Method | Cloze/MCQ | Free-form |
|---|---|---|
| Logit lens | Extract at decision token | Unclear where to extract |
| Activation patching | Patch at known position | Must search across positions |
| Probing | Clean single-position signal | Aggregation required |
| Attention analysis | Clear "what influences this token" | Diffuse across sequence |
| Causal tracing | Intervene at decision point | Multiple intervention points |
A minimal pair differs in exactly one aspect. For cloze-style interpretability work, this is especially powerful:
Pun condition: "I used to be a banker but I lost interest." This is a pun. Yes or No? Answer: [Yes]

Non-pun condition (minimal change): "I used to be a banker but I lost patience." This is a pun. Yes or No? Answer: [No]
Both prompts are nearly identical up to the decision token. Differences in activations at the final position can be attributed to the pun/non-pun distinction rather than confounding surface features.
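A small check along these lines, assuming an illustrative tokenizer choice ("gpt2"; in practice use the tokenizer of the model you plan to analyze): verify that the pair differs at exactly one token position and record where.

```python
# A minimal check that a pair is token-level minimal. The tokenizer ("gpt2") is an
# illustrative choice; use the tokenizer of the model you plan to analyze.
from typing import Optional
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def minimal_pair_diff(pun_prompt: str, non_pun_prompt: str) -> Optional[int]:
    """Return the single differing token position, or None if the pair is not minimal."""
    a = tokenizer.encode(pun_prompt)
    b = tokenizer.encode(non_pun_prompt)
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None

pun = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:'
non_pun = '"I used to be a banker but I lost patience." This is a pun. Yes or No? Answer:'
# If this prints None, the word-level pair is not token-level minimal and may need rewording.
print(minimal_pair_diff(pun, non_pun))
```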
Interpretability experiments often require hundreds or thousands of examples: probing classifiers need train/validation/test splits, and minimal-pair analyses need enough pairs for statistical power. Model-written evaluation scales well here: generate 500+ examples, then filter down to 200+ high-quality ones.
Some puns are obvious; others are subtle. For interpretability, it helps to have items spanning a range of difficulty, from obvious puns to subtle ones, along with "almost-puns" that share the structure but lack the double meaning.
Here is a concrete workflow for creating a pun evaluation dataset optimized for interpretability:
1. GENERATION PHASE
- Generate 100 cloze-style pun recognition items (Yes/No format)
- Generate 100 cloze-style explanation items (fill in the double meaning)
- Generate 100 minimal pairs (pun + matched non-pun)
- Generate 50 MCQ items testing mechanism understanding
- Generate 50 MCQ items testing wordplay identification
2. FILTERING PHASE
- Remove duplicates (expect ~10-15% reduction)
- Verify cloze answers are single tokens where possible
- Check minimal pairs are actually minimal (edit distance)
- Human review of 100 random samples
- Diversity analysis (topic distribution, mechanism types)
3. VALIDATION PHASE
- Test on held-out human annotators
- Pilot evaluation on 2-3 models of different sizes
- Verify token localization
- Iterate on generation prompts if needed
4. DEPLOYMENT PHASE
- Split into train/validation/test (for probing)
- For each example, record (see the record sketch after this workflow):
- Full prompt text
- Decision token position
- Correct answer token(s)
- Incorrect answer token(s) for comparison
- Minimal pair ID (if applicable)
- Extract activations at decision positions during evaluation
- Run interpretability analyses (probing, patching, logit lens)
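The per-example record from the deployment phase might look like the following sketch; the class name, field names, and the placeholder position index are illustrative choices, not a fixed schema.

```python
# A sketch of the per-example record from the deployment phase. Class and field names
# are illustrative; the decision_position value below is a placeholder, not computed.
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class PunEvalRecord:
    prompt: str                            # full prompt text
    decision_position: int                 # token index of the decision point
    correct_tokens: list[str]              # correct answer token(s)
    incorrect_tokens: list[str]            # incorrect answer token(s) for comparison
    minimal_pair_id: Optional[str] = None  # links pun / non-pun counterparts

record = PunEvalRecord(
    prompt='"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:',
    decision_position=24,                  # placeholder; compute with your tokenizer
    correct_tokens=[" Yes"],
    incorrect_tokens=[" No"],
    minimal_pair_id="pair_001",
)
print(json.dumps(asdict(record), indent=2))
```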
The Perez et al. methodology is a meta-technique: using AI capabilities to accelerate AI research. For interpretability researchers, it offers a path from "I want to study concept X" to "I have 500 high-quality examples of concept X" in hours rather than weeks.
Perez, E., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv preprint arXiv:2212.09251.