Before we can understand what's happening inside a large language model, we need to know how to measure what it does. This week introduces the fundamental methods for evaluating LLM behavior, from basic probability outputs to sophisticated benchmarking strategies. You'll learn how to design effective prompts, interpret model outputs, and assess whether a model has truly learned a concept or simply memorized training data.
By the end of this week, you should be able to:
At its core, a large language model is a function that takes a sequence of tokens as input and produces a
probability distribution over a single next token. Given a sequence of tokens x₁, x₂, ..., xₙ, the model
computes:
P(xₙ₊₁ | x₁, x₂, ..., xₙ)
This is a distribution over the entire vocabulary, typically 50,000 to 100,000+ tokens. The model doesn't directly generate paragraphs or essays; it generates probabilities for one token at a time. Understanding this distinction is crucial for interpretability research.
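To make this concrete, here is a minimal sketch of reading that distribution off a model, assuming the Hugging Face transformers library with GPT-2 as a stand-in small model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, seq_len, vocab_size)

# The distribution P(x_{n+1} | x_1, ..., x_n) lives at the final position
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")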
The one-token-at-a-time nature of autoregressive models is a gift for interpretability. When a model makes a single prediction, we know exactly where and when the decision happens:
This is why cloze-style evaluation ("The capital of France is ___") is so powerful for mechanistic interpretability: the model's entire reasoning must culminate in a single token prediction at a known position.
But not everything reduces to single-token predictions. Sometimes we need to evaluate:
When the model generates 100 tokens, the "decision" is distributed across all of them. There's no single position where we can say "this is where the model decided to be helpful/harmful/creative." This makes interpretability much harder.
To generate longer sequences, we sample from the model autoregressively, one token at a time:
Each strategy affects the diversity, coherence, and randomness of generated text. For benchmarking, greedy decoding or low-temperature sampling provides deterministic, reproducible results.
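As a sketch of what that loop looks like, assuming a transformers model and tokenizer (GPT-2 again as a placeholder; EOS handling and batching omitted), note that greedy decoding is simply the temperature-zero case:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sample_next_token(logits, temperature=0.0):
    if temperature == 0.0:
        return int(torch.argmax(logits))                  # greedy: deterministic, reproducible
    probs = torch.softmax(logits / temperature, dim=-1)   # higher temperature -> more randomness
    return int(torch.multinomial(probs, num_samples=1))

def generate(prompt, max_new_tokens=20, temperature=0.0):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            next_logits = model(ids).logits[0, -1]        # distribution over the next token only
        next_id = sample_next_token(next_logits, temperature)
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
    return tokenizer.decode(ids[0])

print(generate("The capital of France is", temperature=0.0))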
To make interpretability tractable, we want to reduce complex tasks to single-token predictions. Three strategies:
Format the task as text completion, measuring the probability of specific completions:
"The capital of France is ___"
We can evaluate whether "Paris" has higher probability than alternatives. The decision happens at a known position.
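A minimal scoring sketch under the same transformers/GPT-2 assumptions; only the first token of each candidate is scored, and the leading space matters because that is how the tokenizer represents mid-sentence words:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

with torch.no_grad():
    logits = model(**tokenizer("The capital of France is", return_tensors="pt")).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

for city in [" Paris", " London", " Berlin"]:      # leading space to match the tokenizer
    token_id = tokenizer.encode(city)[0]           # score the candidate's first token
    print(city, f"{probs[token_id].item():.4f}")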
Force selection from predefined options by comparing token probabilities for "A", "B", "C", "D":
Which is the capital of France?
A) London B) Paris C) Berlin D) Madrid
Answer:
The model's prediction at the final position tells us everything. We can extract activations here and compare logits across options.
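A sketch of that comparison, again assuming transformers with GPT-2 as a placeholder model (a small model may well get the answer wrong; the point is the mechanics):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Which is the capital of France?\n"
    "A) London B) Paris C) Berlin D) Madrid\n"
    "Answer:"
)
with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]

# Compare only the four option letters (leading space to match the tokenizer)
option_ids = {opt: tokenizer.encode(" " + opt)[0] for opt in "ABCD"}
scores = {opt: logits[tid].item() for opt, tid in option_ids.items()}
print(max(scores, key=scores.get), scores)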
Provide examples in the prompt to demonstrate the task, then use cloze format for the test case:
Input: The movie was terrible.
Sentiment: Negative
Input: I loved every minute!
Sentiment: Positive
Input: It was okay, nothing special.
Sentiment:
ICL is valuable for interpretability because you can compare activations with and without the in-context examples to see how they reshape processing.
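One way to see that reshaping at the behavioral level is to score the same label tokens with and without the examples; a sketch under the same transformers/GPT-2 assumptions (only the first token of each label is scored):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

few_shot = (
    "Input: The movie was terrible.\nSentiment: Negative\n"
    "Input: I loved every minute!\nSentiment: Positive\n"
)
test = "Input: It was okay, nothing special.\nSentiment:"

def label_probs(prompt):
    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return {lab: probs[tokenizer.encode(" " + lab)[0]].item() for lab in ("Positive", "Negative")}

print("zero-shot:", label_probs(test))
print("few-shot: ", label_probs(few_shot + test))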
Some capabilities genuinely require evaluating longer outputs: writing quality, reasoning coherence, helpfulness. When you must evaluate runs of text rather than single tokens, LLM-as-judge provides an automated approach:
MT-Bench demonstrated that LLM judges correlate well with human judgments on many tasks, but you should still validate a subset against human evaluation.
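A skeletal judge loop might look like the sketch below. The call_llm function is a placeholder for whatever chat API you actually use (it is not a real library call), and the 1-10 helpfulness rubric is only an illustration:

JUDGE_TEMPLATE = """You are grading a model response.
Question: {question}
Response: {response}
Rate the response's helpfulness from 1 (useless) to 10 (excellent).
Reply with only the number."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def judge(question: str, response: str) -> int:
    reply = call_llm(JUDGE_TEMPLATE.format(question=question, response=response))
    return int(reply.strip().split()[0])   # parse defensively; judges sometimes add extra text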
The bottleneck in interpretability research is often data: you need hundreds or thousands of examples that cleanly test your concept. Model-written evaluations (Perez et al. 2022) offer a solution: use LLMs to generate evaluation datasets.
The core workflow:
For interpretability, design your generation prompts to produce cloze-format or MCQ examples. This ensures the resulting dataset has the token-localization properties you need.
See the supplementary essay for detailed prompts and a complete pipeline example using puns.
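Whatever generation prompt you use, plan on a post-processing pass before trusting the data. A sketch, assuming each generated example arrives as a dict with "text" and "label" fields (the field names are illustrative):

def clean_generated_examples(raw_examples):
    # Drop malformed labels and near-verbatim duplicates from model-written data
    seen, kept = set(), []
    for ex in raw_examples:
        text = ex["text"].strip()
        if ex["label"] not in {"Yes", "No"}:
            continue
        if text.lower() in seen:
            continue
        seen.add(text.lower())
        kept.append({"text": text, "label": ex["label"]})
    return kept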
Different tasks require different metrics:
A model scoring 85% and one scoring 83% might not be meaningfully different. Quantify the uncertainty:
Rule of thumb: differences smaller than 2-3% on typical benchmark sizes often aren't statistically significant.
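A quick way to quantify this is a normal-approximation confidence interval on accuracy (a bootstrap over examples works too). The numbers below correspond to the 85% vs 83% comparison on a hypothetical 500-example benchmark:

import math

def accuracy_ci(correct, total, z=1.96):
    # Normal-approximation 95% confidence interval for benchmark accuracy
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

print(accuracy_ci(425, 500))   # roughly (0.82, 0.88)
print(accuracy_ci(415, 500))   # roughly (0.80, 0.86) -- the intervals overlap heavily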
A fundamental question: Did the model learn the concept, or did it memorize training examples?
If test examples appeared in training data, the model might simply recall them rather than demonstrate true understanding. This is especially problematic because:
Beyond contamination, test for genuine understanding:
The Holistic Evaluation of Language Models (HELM) benchmark exemplifies comprehensive evaluation:
Effective benchmarking requires:
For your research project, the quality of your findings depends heavily on your dataset construction. This section covers principles and strategies for building datasets that enable rigorous concept characterization.
Before constructing a dataset, you must operationalize your concept—make it measurable.
Exercise: For your chosen concept, write down:
The gold standard for concept research: pairs of examples that differ only in your target concept.
Positive: "Could you please pass the salt?"
Negative: "Pass the salt."
What changed: politeness marker ("Could you please")
What stayed the same: meaning, length, topic
For binary concepts (polite/rude, factual/fictional), aim for 50-50 split unless you have specific research reasons.
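A simple way to store contrastive pairs is as a list of dicts carrying the metadata you will need for balance checks; the field names below are just one possible scheme:

from collections import Counter

pairs = [
    {"positive": "Could you please pass the salt?", "negative": "Pass the salt.",
     "concept": "politeness", "topic": "requests"},
    # ... more pairs, spanning the dimensions in the table below ...
]

print(Counter(p["topic"] for p in pairs))   # is any topic over-represented?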
Ensure diversity across relevant dimensions:
| Dimension | Why It Matters | Example |
|---|---|---|
| Topic | Prevent concept from being confounded with domain | Politeness in: requests, apologies, greetings, refusals |
| Length | Avoid "polite = longer" spurious correlation | 5-10 tokens, 10-20 tokens, 20-40 tokens (balanced) |
| Syntactic structure | Ensure concept isn't just syntax | Questions, statements, commands |
| Formality | Separate from related concepts | Formal-polite, informal-polite, formal-rude, informal-rude |
Bad dataset:
Positive (polite): "Thank you so much for your help! I really appreciate it."
Negative (rude): "Whatever."
Problem: Length, sentiment, and informativeness all vary—not just politeness!
Better dataset:
Positive (polite): "Thanks for your help."
Negative (direct): "Got it, bye."
Better: Similar length, neutral tone, only politeness marker differs
Models might learn "if text contains 'please' → polite" without understanding context.
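A cheap sanity check is to run a trivial keyword baseline over your dataset: if it scores close to your model, the task may be solvable from surface cues alone. A sketch, assuming examples stored as dicts with "text" and "label" fields:

dataset = [
    {"text": "Could you please pass the salt?", "label": "polite"},
    {"text": "Pass the salt.", "label": "rude"},
    # ... the rest of your dataset ...
]

def keyword_baseline(text):
    # Does the single cue "please" already solve the task?
    return "polite" if "please" in text.lower() else "rude"

accuracy = sum(keyword_baseline(ex["text"]) == ex["label"] for ex in dataset) / len(dataset)
print(f"keyword-baseline accuracy: {accuracy:.2f}")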
If your test examples appeared in the model's training data, you're testing memorization, not understanding.
How much data do you need? It depends on your research method:
| Method (Week) | Minimum | Recommended | Purpose |
|---|---|---|---|
| Benchmarking (Week 2) | 100 | 500-1000 | Test set for measuring accuracy |
| Steering (Week 1) | 10-20 pairs | 50-100 pairs | Contrastive pairs for activation difference |
| Probing (Week 5) | 500 | 2000-5000 | Training set for linear classifier |
| Visualization (Week 3) | 100 | 500-1000 | Examples to visualize in 2D/3D |
Your dataset is a research artifact. Document it properly:
Week 2 Goal: Create a small, high-quality evaluation dataset (100-200 examples) for your concept.
Deliverable: A JSON/CSV file with examples, labels, and metadata (topic, length, etc.) ready for benchmarking.
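One possible layout for that file, written as JSON from Python (the exact fields are up to you; these are illustrative):

import json

examples = [
    {"id": 1, "text": "Could you please review my draft?", "label": "polite", "topic": "requests"},
    {"id": 2, "text": "Review my draft.", "label": "direct", "topic": "requests"},
]
for ex in examples:
    ex["num_words"] = len(ex["text"].split())   # crude length metadata for balance checks

with open("concept_eval.json", "w") as f:
    json.dump(examples, f, indent=2)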
Building on the pun theme from Week 1, we'll create a complete evaluation pipeline: generate data, test multiple evaluation methods, and apply logit lens to see where pun understanding emerges.
Work in groups to craft a prompt for Claude that generates high-quality pun evaluation examples. Your prompt should produce examples in cloze format so that the resulting dataset stays tractable for interpretability.
Generate pun recognition examples in cloze format. For each example:
1. Provide a sentence that may or may not be a pun
2. Format: "[sentence]" This is a pun. Yes or No? Answer:
3. Provide the correct label (Yes/No)
4. For puns, briefly note the wordplay mechanism

Generate 20 examples:
- 10 puns (varied mechanisms: homophones, polysemy, syntactic)
- 10 non-puns (similar structure but no wordplay)
- Vary topics: professions, food, animals, science, everyday life
Discussion questions:
Using the NDIF workbench, we'll compare different evaluation approaches on the same pun examples:
Compare P("Yes") vs P("No") at the answer position:
prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:' # Extract logits at final position # Compare logit["Yes"] vs logit["No"] # Classify based on which is higher
Provide 3-4 examples before the test case:
"The bicycle was two-tired." Pun? Yes "The bicycle was broken." Pun? No "I lost interest in banking." Pun? Yes "I lost patience with banking." Pun? No "Time flies like an arrow." Pun?
Why is "I lost interest" a pun when said by a former banker? A) It references losing money B) "Interest" means both curiosity and financial returns C) Banking is boring D) It's not actually a pun Answer:
Compare: Which method gives the clearest signal? Which is easiest to interpret mechanistically?
The key question: At which layer does the model "get" the pun?
For a set of pun/non-pun minimal pairs, we'll:
from nnsight import LanguageModel

model = LanguageModel("meta-llama/Llama-3.2-1B", device_map="auto")
pun_prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:'
non_pun_prompt = '"I used to be a banker but I lost patience." This is a pun. Yes or No? Answer:'
# Token ids for " Yes" and " No" (note the leading space)
yes_id = model.tokenizer.encode(" Yes", add_special_tokens=False)[0]
no_id = model.tokenizer.encode(" No", add_special_tokens=False)[0]

# Logit lens: decode the final-position residual stream after each layer, tracking logit("Yes") - logit("No");
# repeat for non_pun_prompt, plot both curves by layer, and look for the point where they diverge.
with model.trace(pun_prompt):
    pun_score = []
    for layer in model.model.layers:
        logits = model.lm_head(model.model.norm(layer.output[0][:, -1, :]))
        pun_score.append((logits[0, yes_id] - logits[0, no_id]).save())
This week's exercise will give you hands-on experience with the core concepts:
Due: Thursday of Week 2
Now that you've identified a promising concept through steering, it's time to formalize it with a systematic benchmark and test it across multiple models to find the best target for deep investigation.
Tip: Don't overthink the "perfect" dataset at this stage. You'll refine it throughout the semester. Focus on getting a solid baseline that lets you move forward with confidence.