
Model-Written Evaluations for Interpretability Research

A tutorial based on Perez et al. (2022), "Discovering Language Model Behaviors with Model-Written Evaluations"

Overview

Before we can study how a concept is represented inside a neural network, we need examples that reliably elicit that concept. Creating evaluation datasets by hand is slow. Perez et al. demonstrate that language models can generate evaluation data at scale, and that this approach can uncover behaviors humans might not anticipate.

This tutorial explains the core methodology and adapts it for interpretability research, using pun understanding as a running example.

The Core Method

The paper's approach has four stages:

  1. Specify the behavior you want to evaluate in a clear prompt
  2. Generate examples by prompting a capable LLM with instructions and a few seed examples
  3. Filter for quality using automated checks and human review
  4. Run evaluations on target models and analyze results

The key insight: LLMs can generate diverse, creative test cases faster than humans, while humans remain in the loop for quality control and analysis.

Evaluation Formats

Why Token Localization Matters for Interpretability

Many interpretability methods become dramatically easier when the model's critical decision is concentrated at a single, predictable token position. When you know exactly where the decision happens, you can read out the logit distribution at that position, extract activations there for probing, and patch or intervene at a known location instead of searching across the sequence.

Two formats naturally provide this localization: cloze-style prompts (fill-in-the-blank) and MCQ (where the model produces a single letter). Free-form generation is harder to analyze because the "decision" is distributed across multiple tokens.

Cloze-Style Evaluation

The model completes a sentence with a single word or short phrase. The blank is the decision point.

Strengths: Decision localized to one token. Enables direct probability comparison over candidate completions. Natural fit for logit lens and activation extraction.

Weaknesses: Requires careful prompt design so the blank is unambiguous. Some concepts do not reduce naturally to single-token decisions.

Design principle: The cloze prompt should be constructed so that (1) the correct answer is a single token or very short phrase, and (2) incorrect answers are also plausible completions. This lets you compare P(correct) vs. P(incorrect) at the decision position.
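
As a concrete illustration, here is a minimal sketch of that probability comparison, assuming a HuggingFace causal LM ("gpt2" is only a placeholder model name) and the banker/interest item from the generation templates later in this tutorial. Multi-token answers are a known complication; the sketch scores only the first sub-token.

# Sketch: compare P(correct) vs. P(incorrect) at the cloze decision position.
# Assumes a HuggingFace causal LM; "gpt2" is only a placeholder model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ('The joke "I used to be a banker but I lost interest" is a pun '
          'because the word "interest" refers to both financial returns and')
candidates = {"correct": " curiosity", "incorrect": " money"}

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)     # distribution at the decision position

for label, completion in candidates.items():
    # Only the first sub-token is scored; prefer answers that are single tokens.
    token_id = tokenizer(completion, add_special_tokens=False).input_ids[0]
    print(f"{label}: P({completion!r}) = {probs[token_id].item():.4f}")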

Multiple Choice Questions (MCQ)

The model chooses among labeled options (A, B, C, D). This enables precise measurement and easy automation.

Strengths: Decision localized to a single token (the letter). Unambiguous evaluation. Enables comparison of answer probabilities across options. Easy to automate.

Weaknesses: May not reflect real-world usage. The correct answer might be guessable from surface features. Format itself may activate "test-taking" behaviors distinct from natural understanding.

For interpretability: MCQ is excellent because you can extract activations at the final position and examine the logit distribution over {A, B, C, D}. You can also study what information flows to that position from earlier in the prompt.
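
A minimal sketch of that readout, under the same HuggingFace assumptions (placeholder model name) and using the MCQ item shown later in this tutorial:

# Sketch: read the logit distribution over answer letters at the final position
# and keep per-layer activations there for later analysis. "gpt2" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ('In the pun "I used to be a banker, but I lost interest," '
          "which word carries the double meaning?\n"
          "A) banker\nB) lost\nC) interest\nD) used\nAnswer:")

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_logits = out.logits[0, -1]                 # decision position
letter_ids = [tokenizer(" " + c, add_special_tokens=False).input_ids[0] for c in "ABCD"]
letter_probs = torch.softmax(final_logits[letter_ids], dim=-1)
print(dict(zip("ABCD", letter_probs.tolist())))

# One activation vector per layer at the decision position, ready for probing.
decision_activations = [h[0, -1] for h in out.hidden_states]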

Zero-Shot vs. In-Context Learning

Zero-shot: The model receives only a question or prompt, with no examples. Tests baseline capabilities and default behaviors.

In-context learning (ICL): Provide examples in the prompt to demonstrate the task. This activates task-relevant circuits and can elicit capabilities the model has but does not spontaneously display.

For interpretability: ICL is valuable precisely because it changes internal representations. You can compare activations with and without in-context examples to see how demonstrations reshape the model's processing. Use cloze or MCQ format for the test case to maintain token localization.
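
One way to make that comparison concrete is sketched below, under the same HuggingFace assumptions; the model name and the demonstration items are illustrative placeholders.

# Sketch: compare hidden states at the decision position with and without
# in-context demonstrations. Model name and demonstration items are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

test_item = ('"I used to be a banker, but I lost interest." '
             "This is a pun. Yes or No? Answer:")
demos = ('"Why do bees have sticky hair? Because they use honeycombs." '
         "This is a pun. Yes or No? Answer: Yes\n"
         '"The weather was cold and rainy all week." '
         "This is a pun. Yes or No? Answer: No\n")

def decision_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # One vector per layer at the final (decision) position.
    return torch.stack([h[0, -1] for h in out.hidden_states])

zero_shot = decision_states(test_item)
with_icl = decision_states(demos + test_item)

# Per-layer cosine similarity: where do the demonstrations reshape the representation?
cosines = torch.nn.functional.cosine_similarity(zero_shot, with_icl, dim=-1)
for layer, c in enumerate(cosines.tolist()):
    print(f"layer {layer:2d}: cosine similarity = {c:.3f}")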

Case Study: Evaluating Pun Understanding

Suppose we want to study how language models represent puns: wordplay that exploits multiple meanings or similar sounds. Before we can localize "pun circuits" or probe for pun-related features, we need evaluation data that reliably triggers pun processing.

Step 1: Specify the Behavior

We want to evaluate whether a model:

  1. Recognizes whether a given sentence contains a pun
  2. Identifies which word carries the double meaning
  3. Explains the mechanism of the wordplay (the two meanings or similar sounds being exploited)

Step 2: Generate Examples

We prompt a capable model to generate evaluation data. Here are templates for each format:

Cloze-Style Generation Prompt

Generate cloze-style questions that test pun understanding. Each question
should have a blank that can be filled with a single word or short phrase.
Provide the correct answer and one plausible incorrect answer.

Format:
Prompt: [sentence with ___ blank]
Correct: [answer]
Incorrect: [plausible wrong answer]

Examples:

Prompt: The joke "I used to be a banker but I lost interest" is a pun
because the word "interest" refers to both financial returns and ___.
Correct: curiosity
Incorrect: money

Prompt: "Time flies like an arrow; fruit flies like a banana" is a pun
because "flies" shifts from being a ___ to being a noun.
Correct: verb
Incorrect: metaphor

Generate 20 more cloze-style pun evaluation items with varied formats.

Zero-Shot Cloze Generation Prompt

Generate pun recognition tasks in cloze format. The model must complete
the sentence with "Yes" or "No".

Format:
Prompt: [statement]. This is a pun. Yes or No? Answer:
Label: [Yes/No]

Examples:

Prompt: "I used to be a banker, but I lost interest." This is a pun.
Yes or No? Answer:
Label: Yes

Prompt: "I used to be a banker, but I changed careers." This is a pun.
Yes or No? Answer:
Label: No

Generate 30 examples, balanced between puns and non-puns.

MCQ Generation Prompt

Generate multiple choice questions that test understanding of puns.
Each question should have one correct answer and three plausible distractors.

Format:
Question: [question text]
A) [option]
B) [option]
C) [option]
D) [option]
Correct: [letter]

Example:

Question: In the pun "I used to be a banker, but I lost interest,"
which word carries the double meaning?
A) banker
B) lost
C) interest
D) used
Correct: C

Generate 15 more MCQ items testing pun recognition and explanation.
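
One possible way to run these templates through a generator model is sketched below using the OpenAI Python client; the model name, the inlined template stub, and the line-based parser are illustrative assumptions, not the paper's exact setup.

# Sketch: send a generation template to a generator model and parse the output.
# Uses the OpenAI Python client; the model name, template stub, and line-based
# parser are illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paste the full MCQ generation template from above here.
template = "Generate multiple choice questions that test understanding of puns. ..."

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder generator model
    messages=[{"role": "user", "content": template}],
    temperature=1.0,  # higher temperature encourages diversity (see pitfalls below)
)
raw = response.choices[0].message.content

# Minimal parser: split on blank lines and keep blocks that look complete.
items = []
for block in raw.split("\n\n"):
    lines = block.strip().splitlines()
    if any(l.startswith("Question:") for l in lines) and any(l.startswith("Correct:") for l in lines):
        items.append(block.strip())

print(f"Parsed {len(items)} candidate MCQ items")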

Step 3: Filter for Quality

The paper emphasizes that raw LLM output requires filtering. Apply these checks:

Automated filters (a sketch follows this list):

  - Remove exact and near-duplicate items
  - Verify that cloze answers are single tokens (or very short phrases) in the target model's tokenizer
  - Check that each item follows the requested format (prompt, correct answer, incorrect answer or label)

Human review:

  - Confirm on a random sample that the wordplay actually works and the labeled answer is correct
  - Flag ambiguous items where more than one completion is defensible

Diversity checks:

  - Inspect the topic distribution (food, animals, professions, science, sports, ...)
  - Inspect the mechanism distribution (double meanings, similar sounds, part-of-speech shifts)
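
A sketch of the automated filters is below. It assumes each generated item has already been parsed into a dict with "prompt", "correct", and "incorrect" fields; those field names are an assumption about your parser, not a requirement.

# Sketch of automated filters: near-duplicate removal, a single-token answer
# check, and a minimal-pair distance check. Item fields ("prompt", "correct")
# are assumptions about how the generated text was parsed.
from difflib import SequenceMatcher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # use your target model's tokenizer

def is_single_token(answer: str) -> bool:
    # The leading space matters for BPE tokenizers.
    return len(tokenizer(" " + answer.strip(), add_special_tokens=False).input_ids) == 1

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

def is_minimal_pair(pun: str, non_pun: str, max_word_diff: int = 2) -> bool:
    diffs = sum(a != b for a, b in zip(pun.split(), non_pun.split()))
    diffs += abs(len(pun.split()) - len(non_pun.split()))
    return diffs <= max_word_diff

def filter_items(items):
    kept = []
    for item in items:
        if not is_single_token(item["correct"]):
            continue  # keep the decision localized to one token
        if any(near_duplicate(item["prompt"], k["prompt"]) for k in kept):
            continue  # drop near-duplicates
        kept.append(item)
    return kept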

Step 4: Run Evaluations

With filtered data, evaluate your target models. For interpretability work, you likely want more than accuracy (see the sketch after this list):

  - The probability assigned to the correct vs. incorrect answer at the decision token
  - The full logit distribution over candidate answers (e.g., {A, B, C, D} or {Yes, No})
  - Activations extracted at the decision position, saved for later probing and patching
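
A sketch of such an evaluation loop for the Yes/No cloze format, under the same HuggingFace assumptions; the two inline items are placeholders for your filtered dataset.

# Sketch of an evaluation loop that records probabilities and activations,
# not just accuracy. The two inline items are placeholders for the filtered
# dataset; "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = [
    {"prompt": '"I used to be a banker, but I lost interest." This is a pun. Yes or No? Answer:', "label": "Yes"},
    {"prompt": '"I used to be a banker, but I changed careers." This is a pun. Yes or No? Answer:', "label": "No"},
]
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]

records = []
for item in dataset:
    inputs = tokenizer(item["prompt"], return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    logits = out.logits[0, -1]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=-1).tolist()
    records.append({
        "label": item["label"],
        "pred": "Yes" if p_yes > p_no else "No",
        "p_yes": p_yes,
        "p_no": p_no,
        # Residual-stream activations at the decision position, kept for probing.
        "activations": torch.stack([h[0, -1] for h in out.hidden_states]).cpu(),
    })

accuracy = sum(r["pred"] == r["label"] for r in records) / len(records)
print(f"accuracy = {accuracy:.3f}")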

Pitfalls and How to Avoid Them

1. Generated Examples May Be Low Quality

Problem: LLMs sometimes generate examples that are ambiguous, incorrect, or too easy.

Solution: Always manually review a sample. For puns, check that the wordplay actually works; LLMs sometimes generate "puns" where the double meaning does not quite land.

2. Lack of Diversity

Problem: LLMs tend to repeat patterns. You might get 50 puns about "bank" and "interest."

Solution: Explicitly prompt for diversity: "Generate puns about different topics: food, animals, professions, science, sports..." Use multiple generation runs with different temperatures.

3. Memorization and Data Contamination

Problem: The model being evaluated may have seen these exact puns during training.

Solution: Generate novel puns rather than using famous ones. Include a novelty check: can the model explain puns it has likely never seen?

4. Surface Feature Shortcuts

Problem: Models might detect puns from surface features (sentence length, punctuation) rather than understanding wordplay.

Solution: Ensure non-pun examples have similar surface features. Include "almost-puns" that have the structure but lack the double meaning.
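
A quick way to check for such shortcuts is to compare simple surface statistics across the two classes. The sketch below uses word count, character count, and comma frequency as illustrative features, and the two lists are placeholders for your generated items.

# Sketch of a surface-feature check: compare simple statistics between the pun
# and non-pun sets so length or punctuation cannot serve as a shortcut.
# The two lists are placeholders for your generated items.
from statistics import mean

def surface_stats(sentences):
    return {
        "mean_words": mean(len(s.split()) for s in sentences),
        "mean_chars": mean(len(s) for s in sentences),
        "frac_with_comma": mean("," in s for s in sentences),
    }

puns = ["I used to be a banker but I lost interest."]
non_puns = ["I used to be a banker but I lost patience."]

print("puns:    ", surface_stats(puns))
print("non-puns:", surface_stats(non_puns))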

Advice for Interpretability Applications

Prioritize Token-Localized Formats

For most interpretability methods, cloze and MCQ formats are far easier to work with than free-form generation; the table below summarizes why, and a short activation-patching sketch follows it:

Method               | Cloze/MCQ                           | Free-form
---------------------|-------------------------------------|------------------------------
Logit lens           | Extract at decision token           | Unclear where to extract
Activation patching  | Patch at known position             | Must search across positions
Probing              | Clean single-position signal        | Aggregation required
Attention analysis   | Clear "what influences this token"  | Diffuse across sequence
Causal tracing       | Intervene at decision point         | Multiple intervention points
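
To make the "patch at known position" row concrete, here is a sketch of activation patching on a pun/non-pun minimal pair. It assumes a GPT-2-style layer layout (model.transformer.h), and the chosen layer is arbitrary; in practice you would sweep layers and positions.

# Sketch of activation patching at a known position: overwrite the pun run's
# residual stream at the decision token with the non-pun run's activation and
# watch P(" Yes") change. Assumes a GPT-2-style layer layout (model.transformer.h);
# the chosen layer is arbitrary and would be swept in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

template = '"{s}" This is a pun. Yes or No? Answer:'
pun_inputs = tokenizer(template.format(s="I used to be a banker but I lost interest."), return_tensors="pt")
clean_inputs = tokenizer(template.format(s="I used to be a banker but I lost patience."), return_tensors="pt")
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]

LAYER, POS = 6, -1  # layer to patch and the decision position (last token)

# 1) Cache the non-pun activation at the decision position (block LAYER's output).
with torch.no_grad():
    cached = model(**clean_inputs, output_hidden_states=True).hidden_states[LAYER + 1][0, POS]

# 2) Patch it into the pun run via a forward hook on the corresponding block.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[0, POS] = cached
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**pun_inputs).logits[0, -1]
handle.remove()

with torch.no_grad():
    base_logits = model(**pun_inputs).logits[0, -1]

print("P(Yes) original:", torch.softmax(base_logits, dim=-1)[yes_id].item())
print("P(Yes) patched: ", torch.softmax(patched_logits, dim=-1)[yes_id].item())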

Construct Minimal Pairs

A minimal pair differs in exactly one aspect. For cloze-style interpretability work, this is especially powerful:

Pun condition:
"I used to be a banker but I lost interest."
This is a pun. Yes or No? Answer: [Yes]

Non-pun condition (minimal change):
"I used to be a banker but I lost patience."
This is a pun. Yes or No? Answer: [No]

Both prompts are nearly identical up to the decision token. Differences in activations at the final position can be attributed to the pun/non-pun distinction rather than confounding surface features.
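
A sketch of that comparison: compute per-layer activation differences at the decision position for the minimal pair above (same HuggingFace assumptions as before; the divergence pattern across layers, not the absolute numbers, is what you would inspect).

# Sketch: per-layer activation differences at the decision position for the
# minimal pair above. "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

template = '"{s}" This is a pun. Yes or No? Answer:'
pun = template.format(s="I used to be a banker but I lost interest.")
non_pun = template.format(s="I used to be a banker but I lost patience.")

def decision_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # The decision position is the final token in both conditions.
    return torch.stack([h[0, -1] for h in out.hidden_states])

diff = (decision_states(pun) - decision_states(non_pun)).norm(dim=-1)
for layer, d in enumerate(diff.tolist()):
    print(f"layer {layer:2d}: ||pun - non-pun|| = {d:.2f}")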

Generate Enough Examples for Statistical Power

Interpretability experiments often require hundreds or thousands of examples: probing classifiers need separate train/validation/test splits, patching and activation-difference analyses average over many minimal pairs, and any per-condition comparison needs enough items for statistical power.

Model-written evaluation scales well here. Generate 500+ examples, filter to 200+ high-quality ones.

Create Difficulty Gradations

Some puns are obvious; others are subtle. For interpretability, it helps to have:

  - Easy items where the double meaning is common and clearly signaled
  - Medium items with less familiar wordplay
  - Hard items where the double meaning is subtle or depends on pronunciation

Comparing where easy and hard items diverge inside the model can indicate which components carry the pun-relevant signal.

Example: A Complete Pun Evaluation Pipeline

Here is a concrete workflow for creating a pun evaluation dataset optimized for interpretability; a sketch of the per-example record follows the pipeline:

1. GENERATION PHASE
   - Generate 100 cloze-style pun recognition items (Yes/No format)
   - Generate 100 cloze-style explanation items (fill in the double meaning)
   - Generate 100 minimal pairs (pun + matched non-pun)
   - Generate 50 MCQ items testing mechanism understanding
   - Generate 50 MCQ items testing wordplay identification

2. FILTERING PHASE
   - Remove duplicates (expect ~10-15% reduction)
   - Verify cloze answers are single tokens where possible
   - Check minimal pairs are actually minimal (edit distance)
   - Human review of 100 random samples
   - Diversity analysis (topic distribution, mechanism types)

3. VALIDATION PHASE
   - Test on held-out human annotators
   - Pilot evaluation on 2-3 models of different sizes
   - Verify token localization
   - Iterate on generation prompts if needed

4. DEPLOYMENT PHASE
   - Split into train/validation/test (for probing)
   - For each example, record:
     - Full prompt text
     - Decision token position
     - Correct answer token(s)
     - Incorrect answer token(s) for comparison
     - Minimal pair ID (if applicable)
   - Extract activations at decision positions during evaluation
   - Run interpretability analyses (probing, patching, logit lens)
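
To make the deployment record concrete, one possible representation is sketched below; the dataclass and its field names mirror the list above but are otherwise an illustrative choice.

# Sketch of the per-example record from the deployment phase. The dataclass and
# its field names mirror the list above; the representation itself is just one
# reasonable choice.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PunEvalExample:
    prompt: str                            # full prompt text
    decision_token_position: int           # index of the decision token (often -1, the last position)
    correct_tokens: list[str]              # correct answer token(s)
    incorrect_tokens: list[str]            # incorrect answer token(s) for comparison
    minimal_pair_id: Optional[str] = None  # links a pun item to its matched non-pun item

example = PunEvalExample(
    prompt='"I used to be a banker, but I lost interest." This is a pun. Yes or No? Answer:',
    decision_token_position=-1,
    correct_tokens=[" Yes"],
    incorrect_tokens=[" No"],
    minimal_pair_id="banker-001",
)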

Key Takeaways

  1. LLMs can generate evaluation data at scale, dramatically reducing the bottleneck of dataset creation.
  2. Token localization is critical for interpretability: design cloze and MCQ formats so the model's decision concentrates at a single, predictable position.
  3. Human oversight remains essential: filter for quality, check for diversity, and validate that examples actually test what you intend.
  4. Construct minimal pairs: examples that differ only in the presence or absence of the target concept provide the cleanest signal for comparing activations.
  5. Multiple evaluation formats (zero-shot, ICL, MCQ) provide complementary signals and help identify format-specific artifacts.
  6. Watch for pitfalls: memorization, surface shortcuts, and lack of diversity can all undermine your evaluation.

The Perez et al. methodology is a meta-technique: using AI capabilities to accelerate AI research. For interpretability researchers, it offers a path from "I want to study concept X" to "I have 500 high-quality examples of concept X" in hours rather than weeks.

References

Perez, E., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv preprint arXiv:2212.09251.