
Model-Written Evaluations for Interpretability Research

A tutorial based on Perez et al. (2022), "Discovering Language Model Behaviors with Model-Written Evaluations"

Overview

Before we can study how a concept is represented inside a neural network, we need examples that reliably elicit that concept. Creating evaluation datasets by hand is slow. Perez et al. demonstrate that language models can generate evaluation data at scale, and that this approach can uncover behaviors humans might not anticipate.

This tutorial explains the core methodology and adapts it for interpretability research, using pun understanding as a running example.

The Core Method

The paper's approach has four stages:

  1. Specify the behavior you want to evaluate in a clear prompt
  2. Generate examples by prompting a capable LLM with instructions and a few seed examples
  3. Filter for quality using automated checks and human review
  4. Run evaluations on target models and analyze results

The key insight: LLMs can generate diverse, creative test cases faster than humans, while humans remain in the loop for quality control and analysis.

Evaluation Formats

Why Token Localization Matters for Interpretability

Many interpretability methods become dramatically easier when the model's critical decision is concentrated at a single, predictable token position. When you know exactly where the decision happens, you can read out the logit distribution at that position, extract activations there for probing, and patch or intervene at a known location instead of searching across the sequence.

Two formats naturally provide this localization: cloze-style prompts (fill-in-the-blank) and MCQ (where the model produces a single letter). Free-form generation is harder to analyze because the "decision" is distributed across multiple tokens.

Cloze-Style Evaluation

The model completes a sentence with a single word or short phrase. The blank is the decision point.

Strengths: Decision localized to one token. Enables direct probability comparison over candidate completions. Natural fit for logit lens and activation extraction.

Weaknesses: Requires careful prompt design so the blank is unambiguous. Some concepts do not reduce naturally to single-token decisions.

Design principle: The cloze prompt should be constructed so that (1) the correct answer is a single token or very short phrase, and (2) incorrect answers are also plausible completions. This lets you compare P(correct) vs. P(incorrect) at the decision position.
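
As a concrete illustration, here is a minimal sketch of that probability comparison, assuming a HuggingFace causal LM ("gpt2" is only a placeholder model name) and the banker/interest item from the generation templates later in this tutorial. Multi-token answers are a known complication; the sketch scores only the first sub-token.

# Sketch: compare P(correct) vs. P(incorrect) at the cloze decision position.
# Assumes a HuggingFace causal LM; "gpt2" is only a placeholder model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ('The joke "I used to be a banker but I lost interest" is a pun '
          'because the word "interest" refers to both financial returns and')
candidates = {"correct": " curiosity", "incorrect": " money"}

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)     # distribution at the decision position

for label, completion in candidates.items():
    # Only the first sub-token is scored; prefer answers that are single tokens.
    token_id = tokenizer(completion, add_special_tokens=False).input_ids[0]
    print(f"{label}: P({completion!r}) = {probs[token_id].item():.4f}")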

Multiple Choice Questions (MCQ)

The model chooses among labeled options (A, B, C, D). This enables precise measurement and easy automation.

Strengths: Decision localized to a single token (the letter). Unambiguous evaluation. Enables comparison of answer probabilities across options. Easy to automate.

Weaknesses: May not reflect real-world usage. The correct answer might be guessable from surface features. Format itself may activate "test-taking" behaviors distinct from natural understanding.

For interpretability: MCQ is excellent because you can extract activations at the final position and examine the logit distribution over {A, B, C, D}. You can also study what information flows to that position from earlier in the prompt.
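
A minimal sketch of that readout, under the same HuggingFace assumptions (placeholder model name) and using the MCQ item shown later in this tutorial:

# Sketch: read the logit distribution over answer letters at the final position
# and keep per-layer activations there for later analysis. "gpt2" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ('In the pun "I used to be a banker, but I lost interest," '
          "which word carries the double meaning?\n"
          "A) banker\nB) lost\nC) interest\nD) used\nAnswer:")

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_logits = out.logits[0, -1]                 # decision position
letter_ids = [tokenizer(" " + c, add_special_tokens=False).input_ids[0] for c in "ABCD"]
letter_probs = torch.softmax(final_logits[letter_ids], dim=-1)
print(dict(zip("ABCD", letter_probs.tolist())))

# One activation vector per layer at the decision position, ready for probing.
decision_activations = [h[0, -1] for h in out.hidden_states]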

Zero-Shot vs. In-Context Learning

Zero-shot: The model receives only a question or prompt, with no examples. Tests baseline capabilities and default behaviors.

In-context learning (ICL): Provide examples in the prompt to demonstrate the task. This activates task-relevant circuits and can elicit capabilities the model has but does not spontaneously display.

For interpretability: ICL is valuable precisely because it changes internal representations. You can compare activations with and without in-context examples to see how demonstrations reshape the model's processing. Use cloze or MCQ format for the test case to maintain token localization.
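
One way to make that comparison concrete is sketched below, under the same HuggingFace assumptions; the model name and the demonstration items are illustrative placeholders.

# Sketch: compare hidden states at the decision position with and without
# in-context demonstrations. Model name and demonstration items are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

test_item = ('"I used to be a banker, but I lost interest." '
             "This is a pun. Yes or No? Answer:")
demos = ('"Why do bees have sticky hair? Because they use honeycombs." '
         "This is a pun. Yes or No? Answer: Yes\n"
         '"The weather was cold and rainy all week." '
         "This is a pun. Yes or No? Answer: No\n")

def decision_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # One vector per layer at the final (decision) position.
    return torch.stack([h[0, -1] for h in out.hidden_states])

zero_shot = decision_states(test_item)
with_icl = decision_states(demos + test_item)

# Per-layer cosine similarity: where do the demonstrations reshape the representation?
cosines = torch.nn.functional.cosine_similarity(zero_shot, with_icl, dim=-1)
for layer, c in enumerate(cosines.tolist()):
    print(f"layer {layer:2d}: cosine similarity = {c:.3f}")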

Case Study: Evaluating Pun Understanding

Suppose we want to study how language models represent puns: wordplay that exploits multiple meanings or similar sounds. Before we can localize "pun circuits" or probe for pun-related features, we need evaluation data that reliably triggers pun processing.

Step 1: Specify the Behavior

We want to evaluate whether a model:

  1. Recognizes whether a given sentence contains a pun
  2. Identifies which word carries the double meaning
  3. Explains the mechanism of the wordplay (the two meanings or similar sounds being exploited)

Step 2: Generate Examples

We prompt a capable model to generate evaluation data. Here are templates for each format:

Cloze-Style Generation Prompt

Generate cloze-style questions that test pun understanding. Each question
should have a blank that can be filled with a single word or short phrase.
Provide the correct answer and one plausible incorrect answer.

Format:
Prompt: [sentence with ___ blank]
Correct: [answer]
Incorrect: [plausible wrong answer]

Examples:

Prompt: The joke "I used to be a banker but I lost interest" is a pun
because the word "interest" refers to both financial returns and ___.
Correct: curiosity
Incorrect: money

Prompt: "Time flies like an arrow; fruit flies like a banana" is a pun
because "flies" shifts from being a ___ to being a noun.
Correct: verb
Incorrect: metaphor

Generate 20 more cloze-style pun evaluation items with varied formats.

Zero-Shot Cloze Generation Prompt

Generate pun recognition tasks in cloze format. The model must complete
the sentence with "Yes" or "No".

Format:
Prompt: [statement]. This is a pun. Yes or No? Answer:
Label: [Yes/No]

Examples:

Prompt: "I used to be a banker, but I lost interest." This is a pun.
Yes or No? Answer:
Label: Yes

Prompt: "I used to be a banker, but I changed careers." This is a pun.
Yes or No? Answer:
Label: No

Generate 30 examples, balanced between puns and non-puns.

MCQ Generation Prompt

Generate multiple choice questions that test understanding of puns.
Each question should have one correct answer and three plausible distractors.

Format:
Question: [question text]
A) [option]
B) [option]
C) [option]
D) [option]
Correct: [letter]

Example:

Question: In the pun "I used to be a banker, but I lost interest,"
which word carries the double meaning?
A) banker
B) lost
C) interest
D) used
Correct: C

Generate 15 more MCQ items testing pun recognition and explanation.
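
One possible way to run these templates through a generator model is sketched below using the OpenAI Python client; the model name, the inlined template stub, and the line-based parser are illustrative assumptions, not the paper's exact setup.

# Sketch: send a generation template to a generator model and parse the output.
# Uses the OpenAI Python client; the model name, template stub, and line-based
# parser are illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paste the full MCQ generation template from above here.
template = "Generate multiple choice questions that test understanding of puns. ..."

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder generator model
    messages=[{"role": "user", "content": template}],
    temperature=1.0,  # higher temperature encourages diversity (see pitfalls below)
)
raw = response.choices[0].message.content

# Minimal parser: split on blank lines and keep blocks that look complete.
items = []
for block in raw.split("\n\n"):
    lines = block.strip().splitlines()
    if any(l.startswith("Question:") for l in lines) and any(l.startswith("Correct:") for l in lines):
        items.append(block.strip())

print(f"Parsed {len(items)} candidate MCQ items")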

Step 3: Filter for Quality

The paper emphasizes that raw LLM output requires filtering. Apply these checks:

Automated filters (a sketch follows this list):

  - Remove exact and near-duplicate items
  - Verify that cloze answers are single tokens (or very short phrases) in the target model's tokenizer
  - Check that each item follows the requested format (prompt, correct answer, incorrect answer or label)

Human review:

  - Confirm on a random sample that the wordplay actually works and the labeled answer is correct
  - Flag ambiguous items where more than one completion is defensible

Diversity checks:

  - Inspect the topic distribution (food, animals, professions, science, sports, ...)
  - Inspect the mechanism distribution (double meanings, similar sounds, part-of-speech shifts)
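
A sketch of the automated filters is below. It assumes each generated item has already been parsed into a dict with "prompt", "correct", and "incorrect" fields; those field names are an assumption about your parser, not a requirement.

# Sketch of automated filters: near-duplicate removal, a single-token answer
# check, and a minimal-pair distance check. Item fields ("prompt", "correct")
# are assumptions about how the generated text was parsed.
from difflib import SequenceMatcher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # use your target model's tokenizer

def is_single_token(answer: str) -> bool:
    # The leading space matters for BPE tokenizers.
    return len(tokenizer(" " + answer.strip(), add_special_tokens=False).input_ids) == 1

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

def is_minimal_pair(pun: str, non_pun: str, max_word_diff: int = 2) -> bool:
    diffs = sum(a != b for a, b in zip(pun.split(), non_pun.split()))
    diffs += abs(len(pun.split()) - len(non_pun.split()))
    return diffs <= max_word_diff

def filter_items(items):
    kept = []
    for item in items:
        if not is_single_token(item["correct"]):
            continue  # keep the decision localized to one token
        if any(near_duplicate(item["prompt"], k["prompt"]) for k in kept):
            continue  # drop near-duplicates
        kept.append(item)
    return kept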

Step 4: Run Evaluations

With filtered data, evaluate your target models. For interpretability work, you likely want more than accuracy (see the sketch after this list):

  - The probability assigned to the correct vs. incorrect answer at the decision token
  - The full logit distribution over candidate answers (e.g., {A, B, C, D} or {Yes, No})
  - Activations extracted at the decision position, saved for later probing and patching
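
A sketch of such an evaluation loop for the Yes/No cloze format, under the same HuggingFace assumptions; the two inline items are placeholders for your filtered dataset.

# Sketch of an evaluation loop that records probabilities and activations,
# not just accuracy. The two inline items are placeholders for the filtered
# dataset; "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = [
    {"prompt": '"I used to be a banker, but I lost interest." This is a pun. Yes or No? Answer:', "label": "Yes"},
    {"prompt": '"I used to be a banker, but I changed careers." This is a pun. Yes or No? Answer:', "label": "No"},
]
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]

records = []
for item in dataset:
    inputs = tokenizer(item["prompt"], return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    logits = out.logits[0, -1]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=-1).tolist()
    records.append({
        "label": item["label"],
        "pred": "Yes" if p_yes > p_no else "No",
        "p_yes": p_yes,
        "p_no": p_no,
        # Residual-stream activations at the decision position, kept for probing.
        "activations": torch.stack([h[0, -1] for h in out.hidden_states]).cpu(),
    })

accuracy = sum(r["pred"] == r["label"] for r in records) / len(records)
print(f"accuracy = {accuracy:.3f}")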

Pitfalls and How to Avoid Them

1. Generated Examples May Be Low Quality

Problem: LLMs sometimes generate examples that are ambiguous, incorrect, or too easy.

Solution: Always manually review a sample. For puns, check that the wordplay actually works; LLMs sometimes generate "puns" where the double meaning does not quite land.

2. Lack of Diversity

Problem: LLMs tend to repeat patterns. You might get 50 puns about "bank" and "interest."

Solution: Explicitly prompt for diversity: "Generate puns about different topics: food, animals, professions, science, sports..." Use multiple generation runs with different temperatures.

3. Memorization and Data Contamination

Problem: The model being evaluated may have seen these exact puns during training.

Solution: Generate novel puns rather than using famous ones. Include a novelty check: can the model explain puns it has likely never seen?

4. Surface Feature Shortcuts

Problem: Models might detect puns from surface features (sentence length, punctuation) rather than understanding wordplay.

Solution: Ensure non-pun examples have similar surface features. Include "almost-puns" that have the structure but lack the double meaning.
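
A quick way to check for such shortcuts is to compare simple surface statistics across the two classes. The sketch below uses word count, character count, and comma frequency as illustrative features, and the two lists are placeholders for your generated items.

# Sketch of a surface-feature check: compare simple statistics between the pun
# and non-pun sets so length or punctuation cannot serve as a shortcut.
# The two lists are placeholders for your generated items.
from statistics import mean

def surface_stats(sentences):
    return {
        "mean_words": mean(len(s.split()) for s in sentences),
        "mean_chars": mean(len(s) for s in sentences),
        "frac_with_comma": mean("," in s for s in sentences),
    }

puns = ["I used to be a banker but I lost interest."]
non_puns = ["I used to be a banker but I lost patience."]

print("puns:    ", surface_stats(puns))
print("non-puns:", surface_stats(non_puns))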

Advice for Interpretability Applications

Prioritize Token-Localized Formats

For most interpretability methods, cloze and MCQ formats are far easier to work with than free-form generation; the table below summarizes why, and a short activation-patching sketch follows it:

Method               | Cloze/MCQ                           | Free-form
---------------------|-------------------------------------|------------------------------
Logit lens           | Extract at decision token           | Unclear where to extract
Activation patching  | Patch at known position             | Must search across positions
Probing              | Clean single-position signal        | Aggregation required
Attention analysis   | Clear "what influences this token"  | Diffuse across sequence
Causal tracing       | Intervene at decision point         | Multiple intervention points
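
To make the "patch at known position" row concrete, here is a sketch of activation patching on a pun/non-pun minimal pair. It assumes a GPT-2-style layer layout (model.transformer.h), and the chosen layer is arbitrary; in practice you would sweep layers and positions.

# Sketch of activation patching at a known position: overwrite the pun run's
# residual stream at the decision token with the non-pun run's activation and
# watch P(" Yes") change. Assumes a GPT-2-style layer layout (model.transformer.h);
# the chosen layer is arbitrary and would be swept in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

template = '"{s}" This is a pun. Yes or No? Answer:'
pun_inputs = tokenizer(template.format(s="I used to be a banker but I lost interest."), return_tensors="pt")
clean_inputs = tokenizer(template.format(s="I used to be a banker but I lost patience."), return_tensors="pt")
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]

LAYER, POS = 6, -1  # layer to patch and the decision position (last token)

# 1) Cache the non-pun activation at the decision position (block LAYER's output).
with torch.no_grad():
    cached = model(**clean_inputs, output_hidden_states=True).hidden_states[LAYER + 1][0, POS]

# 2) Patch it into the pun run via a forward hook on the corresponding block.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[0, POS] = cached
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**pun_inputs).logits[0, -1]
handle.remove()

with torch.no_grad():
    base_logits = model(**pun_inputs).logits[0, -1]

print("P(Yes) original:", torch.softmax(base_logits, dim=-1)[yes_id].item())
print("P(Yes) patched: ", torch.softmax(patched_logits, dim=-1)[yes_id].item())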

Construct Minimal Pairs

A minimal pair differs in exactly one aspect. For cloze-style interpretability work, this is especially powerful:

Pun condition:
"I used to be a banker but I lost interest."
This is a pun. Yes or No? Answer: [Yes]

Non-pun condition (minimal change):
"I used to be a banker but I lost patience."
This is a pun. Yes or No? Answer: [No]

Both prompts are nearly identical up to the decision token. Differences in activations at the final position can be attributed to the pun/non-pun distinction rather than confounding surface features.
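
A sketch of that comparison: compute per-layer activation differences at the decision position for the minimal pair above (same HuggingFace assumptions as before; the divergence pattern across layers, not the absolute numbers, is what you would inspect).

# Sketch: per-layer activation differences at the decision position for the
# minimal pair above. "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

template = '"{s}" This is a pun. Yes or No? Answer:'
pun = template.format(s="I used to be a banker but I lost interest.")
non_pun = template.format(s="I used to be a banker but I lost patience.")

def decision_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # The decision position is the final token in both conditions.
    return torch.stack([h[0, -1] for h in out.hidden_states])

diff = (decision_states(pun) - decision_states(non_pun)).norm(dim=-1)
for layer, d in enumerate(diff.tolist()):
    print(f"layer {layer:2d}: ||pun - non-pun|| = {d:.2f}")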

Generate Enough Examples for Statistical Power

Interpretability experiments often require hundreds or thousands of examples: probing classifiers need separate train/validation/test splits, patching and activation-difference analyses average over many minimal pairs, and any per-condition comparison needs enough items for statistical power.

Model-written evaluation scales well here. Generate 500+ examples, filter to 200+ high-quality ones.

Create Difficulty Gradations

Some puns are obvious; others are subtle. For interpretability, it helps to have:

  - Easy items where the double meaning is common and clearly signaled
  - Medium items with less familiar wordplay
  - Hard items where the double meaning is subtle or depends on pronunciation

Comparing where easy and hard items diverge inside the model can indicate which components carry the pun-relevant signal.

Example: A Complete Pun Evaluation Pipeline

Here is a concrete workflow for creating a pun evaluation dataset optimized for interpretability; a sketch of the per-example record follows the pipeline:

1. GENERATION PHASE
   - Generate 100 cloze-style pun recognition items (Yes/No format)
   - Generate 100 cloze-style explanation items (fill in the double meaning)
   - Generate 100 minimal pairs (pun + matched non-pun)
   - Generate 50 MCQ items testing mechanism understanding
   - Generate 50 MCQ items testing wordplay identification

2. FILTERING PHASE
   - Remove duplicates (expect ~10-15% reduction)
   - Verify cloze answers are single tokens where possible
   - Check minimal pairs are actually minimal (edit distance)
   - Human review of 100 random samples
   - Diversity analysis (topic distribution, mechanism types)

3. VALIDATION PHASE
   - Test on held-out human annotators
   - Pilot evaluation on 2-3 models of different sizes
   - Verify token localization
   - Iterate on generation prompts if needed

4. DEPLOYMENT PHASE
   - Split into train/validation/test (for probing)
   - For each example, record:
     - Full prompt text
     - Decision token position
     - Correct answer token(s)
     - Incorrect answer token(s) for comparison
     - Minimal pair ID (if applicable)
   - Extract activations at decision positions during evaluation
   - Run interpretability analyses (probing, patching, logit lens)
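
To make the deployment record concrete, one possible representation is sketched below; the dataclass and its field names mirror the list above but are otherwise an illustrative choice.

# Sketch of the per-example record from the deployment phase. The dataclass and
# its field names mirror the list above; the representation itself is just one
# reasonable choice.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PunEvalExample:
    prompt: str                            # full prompt text
    decision_token_position: int           # index of the decision token (often -1, the last position)
    correct_tokens: list[str]              # correct answer token(s)
    incorrect_tokens: list[str]            # incorrect answer token(s) for comparison
    minimal_pair_id: Optional[str] = None  # links a pun item to its matched non-pun item

example = PunEvalExample(
    prompt='"I used to be a banker, but I lost interest." This is a pun. Yes or No? Answer:',
    decision_token_position=-1,
    correct_tokens=[" Yes"],
    incorrect_tokens=[" No"],
    minimal_pair_id="banker-001",
)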

Key Takeaways

  1. LLMs can generate evaluation data at scale, dramatically reducing the bottleneck of dataset creation.
  2. Token localization is critical for interpretability: design cloze and MCQ formats so the model's decision concentrates at a single, predictable position.
  3. Human oversight remains essential: filter for quality, check for diversity, and validate that examples actually test what you intend.
  4. Construct minimal pairs: examples that differ only in the presence or absence of the target concept provide the cleanest signal for comparing activations.
  5. Multiple evaluation formats (zero-shot, ICL, MCQ) provide complementary signals and help identify format-specific artifacts.
  6. Watch for pitfalls: memorization, surface shortcuts, and lack of diversity can all undermine your evaluation.

The Perez et al. methodology is a meta-technique: using AI capabilities to accelerate AI research. For interpretability researchers, it offers a path from "I want to study concept X" to "I have 500 high-quality examples of concept X" in hours rather than weeks.

References

Perez, E., et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv preprint arXiv:2212.09251.