Week 3: Evaluation Methodology

Overview

Before we can understand what's happening inside a large language model, we need to know how to measure what it does. This week introduces the fundamental methods for evaluating LLM behavior, from basic probability outputs to sophisticated benchmarking strategies. You'll learn how to design effective prompts, interpret model outputs, and assess whether a model has truly learned a concept or simply memorized training data.

Learning Objectives

By the end of this week, you should be able to:

Required Readings

Perez et al. (2022). Using LLMs to generate evaluation datasets at scale.
Zheng et al. (2023). When and how to use LLMs as evaluators for open-ended outputs.
Petroni et al. (2019). The LAMA benchmark: probing factual knowledge via cloze-style tasks.

Supplementary Materials

Tutorial essay on using LLMs to generate evaluation datasets, with pun understanding as a running example.
Brown et al. (2020). Introduced the few-shot evaluation paradigm that now dominates the field.

Tutorial: Why Single-Token Predictions Matter for Interpretability

1. The Autoregressive Language Model

At its core, a large language model is a function that takes a sequence of tokens as input and produces a probability distribution over the next token. Given a sequence of tokens x₁, x₂, ..., xₙ, the model computes:

P(xₙ₊₁ | x₁, x₂, ..., xₙ)

This is a distribution over the entire vocabulary, typically 50,000 to 100,000+ tokens. The model doesn't directly generate paragraphs or essays; it generates probabilities for one token at a time. Understanding this distinction is crucial for interpretability research.
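To make this concrete, here is a minimal sketch of reading that distribution off a model, assuming the Hugging Face transformers library and GPT-2 as an arbitrary small example model (not a course requirement):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is just an illustrative choice; any causal LM exposes the same interface.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: [batch, seq_len, vocab_size]

# P(x_{n+1} | x_1, ..., x_n): a distribution over the whole vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx)!r}: {p.item():.3f}")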

2. Why Single-Token Predictions Are Easy to Study

The one-token-at-a-time nature of autoregressive models is a gift for interpretability. When a model makes a single prediction, we know exactly where and when the decision happens:

This is why cloze-style evaluation ("The capital of France is ___") is so powerful for mechanistic interpretability: the model's entire reasoning must culminate in a single token prediction at a known position.

3. The Challenge of Evaluating Longer Outputs

But not everything reduces to single-token predictions. Sometimes we need to evaluate:

When the model generates 100 tokens, the "decision" is distributed across all of them. There's no single position where we can say "this is where the model decided to be helpful/harmful/creative." This makes interpretability much harder.

4. Autoregressive Sampling (How Long Outputs Happen)

To generate longer sequences, we sample from the model autoregressively, one token at a time. Common decoding strategies include greedy decoding (always take the highest-probability token), temperature sampling, top-k sampling, and nucleus (top-p) sampling. Each affects the diversity, coherence, and randomness of the generated text. For benchmarking, greedy decoding gives deterministic, reproducible results; low-temperature sampling comes close if you fix the random seed.
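A short sketch of greedy versus sampled decoding using transformers' generate API (GPT-2 again used purely as an illustrative model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Once upon a time,", return_tensors="pt")

# Greedy decoding: always take the argmax token, so the output is deterministic.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Temperature + nucleus sampling: stochastic, but reproducible with a fixed seed.
torch.manual_seed(0)
sampled = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                         temperature=0.7, top_p=0.9)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))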

5. Prompting Strategies for Single-Token Evaluation

To make interpretability tractable, we want to reduce complex tasks to single-token predictions. Three strategies:

Cloze Prompts (Best for Interpretability)

Format the task as text completion, measuring the probability of specific completions:

"The capital of France is ___"

We can evaluate whether "Paris" has higher probability than alternatives. The decision happens at a known position.
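Concretely, a cloze evaluation just compares next-token probabilities for a handful of candidate completions. A sketch, again assuming transformers and GPT-2; note that only the first subword of each candidate is scored:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
candidates = [" Paris", " London", " Berlin", " Madrid"]   # leading space matters

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    log_probs = torch.log_softmax(model(**inputs).logits[0, -1], dim=-1)

for cand in candidates:
    first_id = tok.encode(cand)[0]    # score the first subword of each candidate
    print(f"{cand!r}: log P = {log_probs[first_id].item():.2f}")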

Multiple Choice Questions

Force selection from predefined options by comparing token probabilities for "A", "B", "C", "D":

Which is the capital of France?
A) London B) Paris C) Berlin D) Madrid
Answer:

The model's prediction at the final position tells us everything. We can extract activations here and compare logits across options.

In-Context Learning (ICL)

Provide examples in the prompt to demonstrate the task, then use cloze format for the test case:

Input: The movie was terrible.
Sentiment: Negative

Input: I loved every minute!
Sentiment: Positive

Input: It was okay, nothing special.
Sentiment:

ICL is valuable for interpretability because you can compare activations with and without the in-context examples to see how they reshape processing.

6. LLM-as-Judge: When Single Tokens Are Not Enough

Some capabilities genuinely require evaluating longer outputs: writing quality, reasoning coherence, helpfulness. When you must evaluate runs of text rather than single tokens, LLM-as-judge provides an automated approach:

MT-Bench (Zheng et al. 2023) demonstrated that LLM judges correlate well with human judgments for many tasks; even so, always validate a subset of judgments against human evaluation.

For interpretability research: Try to convert free-form tasks to single-token formats when possible. Instead of "Explain why this is a pun" (hard to interpret), use "This is a pun because the word ___ has two meanings" (cloze format, tractable).

7. Creating Evaluation Datasets at Scale

The bottleneck in interpretability research is often data: you need hundreds or thousands of examples that cleanly test your concept. Model-written evaluations (Perez et al. 2022) offer a solution: use LLMs to generate evaluation datasets.

The core workflow:

  1. Specify the behavior you want to evaluate in a clear prompt
  2. Generate examples by prompting a capable LLM with instructions and seed examples
  3. Filter for quality using automated checks and human review
  4. Run evaluations on target models

For interpretability, design your generation prompts to produce cloze-format or MCQ examples. This ensures the resulting dataset has the token-localization properties you need.

Example: To study pun understanding, prompt Claude to generate "This is a pun: Yes or No? Answer:" format examples. Generate 500+ raw examples, filter to 200+ high-quality ones, and you have a dataset suitable for probing, patching, and logit lens analysis.
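A sketch of the automated-filtering step (step 3 above) for examples in this format; the field names, regex, and checks are illustrative rather than a fixed pipeline:

import re

# Hypothetical raw generations; in practice these come from prompting Claude as described above.
raw_examples = [
    {"text": '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:', "label": "Yes"},
    {"text": "Malformed example missing the required suffix", "label": "Yes"},
    {"text": '"The bicycle was broken." This is a pun. Yes or No? Answer:', "label": "No"},
]

CLOZE_FORMAT = re.compile(r'^".+" This is a pun\. Yes or No\? Answer:$')

def keep(example, seen):
    """Automated checks: required cloze format, valid label, no duplicates."""
    ok = (CLOZE_FORMAT.match(example["text"])
          and example["label"] in {"Yes", "No"}
          and example["text"] not in seen)
    seen.add(example["text"])
    return bool(ok)

seen = set()
filtered = [ex for ex in raw_examples if keep(ex, seen)]
print(f"Kept {len(filtered)} of {len(raw_examples)}; human review comes next.")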

See the supplementary essay for detailed prompts and a complete pipeline example using puns.

8. Evaluation Metrics

Different tasks require different metrics: accuracy for multiple-choice and Yes/No cloze tasks, exact match or token-level F1 for short free-form answers, and human or LLM-judge ratings for open-ended generation.
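As a small illustration, here is how two of these might be computed for the single-token formats above (a sketch; the metric choice should follow your task):

def accuracy(preds, golds):
    """Fraction of exact label matches -- natural for MCQ and Yes/No cloze tasks."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def exact_match(pred_text, gold_text):
    """Strict string equality after light normalization -- common for short cloze answers."""
    norm = lambda s: s.strip().lower()
    return float(norm(pred_text) == norm(gold_text))

print(accuracy(["Yes", "No", "Yes"], ["Yes", "No", "No"]))   # 0.666...
print(exact_match(" Paris ", "paris"))                        # 1.0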

9. Statistical Significance

A model scoring 85% and another scoring 83% might not be meaningfully different. Quantify the uncertainty before drawing conclusions.

Rule of thumb: differences smaller than 2-3% on typical benchmark sizes often aren't statistically significant.
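One common way to quantify that uncertainty is a bootstrap confidence interval over per-example correctness. A sketch in pure Python; the 85% vs. 83% scores below are the hypothetical numbers from above:

import random

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for accuracy, given per-example 0/1 correctness."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return accs[int(alpha / 2 * n_boot)], accs[int((1 - alpha / 2) * n_boot)]

model_a = [1] * 425 + [0] * 75   # 85% accuracy on a 500-example benchmark
model_b = [1] * 415 + [0] * 85   # 83% accuracy on the same benchmark
print(bootstrap_ci(model_a))      # roughly (0.82, 0.88) -- the intervals overlap
print(bootstrap_ci(model_b))      # roughly (0.80, 0.86)

When both models are run on the same examples, a paired bootstrap over per-example differences (resampling the same indices for both) is the more powerful comparison.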

10. Memorization vs. Generalization

A fundamental question: Did the model learn the concept, or did it memorize training examples?

Training Data Contamination

If test examples appeared in training data, the model might simply recall them rather than demonstrate true understanding. This is especially problematic because:

Detecting and Mitigating Contamination

Generalization Tests

Beyond contamination, test for genuine understanding:

11. HELM: Holistic Evaluation

The Holistic Evaluation of Language Models (HELM) benchmark exemplifies comprehensive evaluation:

Key Principles

Metrics Beyond Accuracy

Putting It All Together

Effective benchmarking requires:

  1. Understanding model fundamentals (token probabilities, sampling)
  2. Choosing appropriate prompting strategies for your concept
  3. Designing tasks with evaluable outputs
  4. Selecting metrics that match your research questions
  5. Quantifying uncertainty and statistical significance
  6. Ruling out memorization through contamination checks and generalization tests
  7. Following established best practices (HELM) for reproducible, fair evaluation

Dataset Construction for Concept Research

For your research project, the quality of your findings depends heavily on your dataset construction. This section covers principles and strategies for building datasets that enable rigorous concept characterization.

1. Defining Your Concept Operationally

Before constructing a dataset, you must operationalize your concept—make it measurable.

From Abstract to Concrete

Exercise: For your chosen concept, write down:

  1. 3-5 concrete features that characterize it
  2. 10 clear positive examples
  3. 10 clear negative examples (concept absent)
  4. 5 edge cases (ambiguous or borderline)

2. Contrastive Pair Construction

The gold standard for concept research: pairs of examples that differ only in your target concept.

Minimal Pairs Strategy

Positive: "Could you please pass the salt?"
Negative: "Pass the salt."

What changed: politeness marker ("Could you please")
What stayed the same: meaning, length, topic

Controlled Variation

Methods for Generating Pairs

  1. Template-based: "[GREETING], could you [ACTION]?" vs "[ACTION] now" (see the sketch after this list)
  2. LLM-assisted: Prompt GPT-4 to paraphrase with/without concept
  3. Human-created: Writers produce matched pairs (expensive but high quality)
  4. Found pairs: Mine existing data for natural contrasts (Reddit posts with different tones)
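A sketch of the template-based option with illustrative slot fillers; the templates and field names are placeholders, not a prescribed scheme:

import random

ACTIONS = ["pass the salt", "close the window", "send me the report", "call me back"]
GREETINGS = ["Hi", "Hello", "Excuse me"]

def make_pair(action, greeting):
    positive = f"{greeting}, could you please {action}?"   # politeness markers present
    negative = f"{action.capitalize()} now."               # bare imperative, concept absent
    return {"positive": positive, "negative": negative, "action": action}

random.seed(0)
pairs = [make_pair(action, random.choice(GREETINGS)) for action in ACTIONS]
for p in pairs:
    print(p["positive"], "|", p["negative"])

Matching on the [ACTION] slot keeps topic and content constant, so each pair differs mainly in the politeness markers; length still differs slightly, which the stratification checks below should catch.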

3. Balancing and Sampling Strategies

Class Balance

For binary concepts (polite/rude, factual/fictional), aim for 50-50 split unless you have specific research reasons.

Stratified Sampling

Ensure diversity across relevant dimensions:

Dimension | Why It Matters | Example
Topic | Prevent concept from being confounded with domain | Politeness in: requests, apologies, greetings, refusals
Length | Avoid "polite = longer" spurious correlation | 5-10 tokens, 10-20 tokens, 20-40 tokens (balanced)
Syntactic structure | Ensure concept isn't just syntax | Questions, statements, commands
Formality | Separate from related concepts | Formal-polite, informal-polite, formal-rude, informal-rude
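A quick balance check over example metadata (a sketch; the metadata fields mirror the dimensions in the table above):

from collections import Counter

# Illustrative records; in your dataset, every example should carry metadata like this.
examples = [
    {"label": "polite", "topic": "request", "length_bucket": "5-10"},
    {"label": "rude",   "topic": "request", "length_bucket": "5-10"},
    {"label": "polite", "topic": "apology", "length_bucket": "10-20"},
    {"label": "rude",   "topic": "apology", "length_bucket": "10-20"},
]

# Count label x dimension combinations; large imbalances signal a likely confound.
for dim in ("topic", "length_bucket"):
    counts = Counter((ex["label"], ex[dim]) for ex in examples)
    print(dim, dict(counts))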

4. Avoiding Common Pitfalls

Confound: Concept Leakage

Bad dataset:
Positive (polite): "Thank you so much for your help! I really appreciate it."
Negative (rude): "Whatever."

Problem: Length, sentiment, and informativeness all vary—not just politeness!
Better dataset:
Positive (polite): "Thanks for your help."
Negative (direct): "Got it, bye."

Better: Similar length, neutral tone, only politeness marker differs

Pitfall: Lexical Shortcuts

Models might learn "if text contains 'please' → polite" without understanding context.

Pitfall: Training Data Overlap

If your test examples appeared in the model's training data, you're testing memorization, not understanding.

5. Dataset Size Guidelines

How much data do you need? It depends on your research method:

Method (Week) | Minimum | Recommended | Purpose
Benchmarking (Week 2) | 100 | 500-1000 | Test set for measuring accuracy
Steering (Week 1) | 10-20 pairs | 50-100 pairs | Contrastive pairs for activation difference
Probing (Week 5) | 500 | 2000-5000 | Training set for linear classifier
Visualization (Week 3) | 100 | 500-1000 | Examples to visualize in 2D/3D

6. Documentation and Reproducibility

Your dataset is a research artifact. Document it properly:

7. Quick Start: Building Your First Dataset

Week 2 Goal: Create a small, high-quality evaluation dataset (100-200 examples) for your concept.

  1. Day 1-2: Write operational definition + 20 manual examples
  2. Day 3: Create 50 contrastive pairs using templates or LLM assistance
  3. Day 4: Balance across topics/lengths, check for confounds
  4. Day 5: Run initial benchmarks, identify dataset weaknesses
  5. Day 6-7: Expand to 100-200 examples, focusing on edge cases and diversity

Deliverable: A JSON/CSV file with examples, labels, and metadata (topic, length, etc.) ready for benchmarking.
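One possible schema for that file (the field names and values are a suggestion, not a requirement):

import json

dataset = [
    {
        "id": 1,
        "text": '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:',
        "label": "Yes",
        "topic": "professions",       # illustrative metadata fields
        "length_tokens": 20,
        "source": "model-written, human-reviewed",
    },
    # ... 100-200 examples total
]

with open("concept_eval_v0.json", "w") as f:
    json.dump(dataset, f, indent=2)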

Key Takeaways

In-Class Exercise: Pun Evaluation Dataset

Building on the pun theme from Week 1, we'll create a complete evaluation pipeline: generate data, test multiple evaluation methods, and apply logit lens to see where pun understanding emerges.

Part 1: Prompt Engineering for Dataset Generation (15 min)

Work in groups to craft a prompt for Claude that generates high-quality pun evaluation examples. Your prompt should produce examples in cloze format for interpretability tractability.

Starter prompt to iterate on:
Generate pun recognition examples in cloze format. For each example:
1. Provide a sentence that may or may not be a pun
2. Format: "[sentence]" This is a pun. Yes or No? Answer:
3. Provide the correct label (Yes/No)
4. For puns, briefly note the wordplay mechanism

Generate 20 examples:
- 10 puns (varied mechanisms: homophones, polysemy, syntactic)
- 10 non-puns (similar structure but no wordplay)
- Vary topics: professions, food, animals, science, everyday life

Discussion questions:

Part 2: Testing Evaluation Methods (20 min)

Using the NDIF workbench, we'll compare different evaluation approaches on the same pun examples:

Method A: Probability Comparison

Compare P("Yes") vs P("No") at the answer position:

prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:'
# Extract logits at final position
# Compare logit["Yes"] vs logit["No"]
# Classify based on which is higher
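A runnable version of this sketch, assuming nnsight and the same Llama-3.2-1B checkpoint used in the analysis template below; saved-value access differs slightly across nnsight versions, as noted in the comment:

from nnsight import LanguageModel

model = LanguageModel("meta-llama/Llama-3.2-1B", device_map="auto")

prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:'

# Token ids for the candidate answers; the leading space matters for Llama's tokenizer.
yes_id = model.tokenizer.encode(" Yes", add_special_tokens=False)[0]
no_id = model.tokenizer.encode(" No", add_special_tokens=False)[0]

with model.trace(prompt):
    # Logits over the vocabulary at the final prompt position.
    final_logits = model.lm_head.output[0, -1].save()

# After the trace, the saved tensor is available (use .value on older nnsight versions).
print("Yes" if final_logits[yes_id] > final_logits[no_id] else "No")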

Method B: In-Context Learning

Provide 3-4 examples before the test case:

"The bicycle was two-tired." Pun? Yes
"The bicycle was broken." Pun? No
"I lost interest in banking." Pun? Yes
"I lost patience with banking." Pun? No

"Time flies like an arrow." Pun?

Method C: MCQ Format

Why is "I lost interest" a pun when said by a former banker?
A) It references losing money
B) "Interest" means both curiosity and financial returns
C) Banking is boring
D) It's not actually a pun
Answer:

Compare: Which method gives the clearest signal? Which is easiest to interpret mechanistically?

Part 3: Logit Lens on Puns (25 min)

The key question: At which layer does the model "get" the pun?

For a set of pun/non-pun minimal pairs, we'll:

  1. Run logit lens at each layer on the final token position
  2. Track when "Yes" probability emerges for puns vs "No" for non-puns
  3. Compare across model sizes (if time permits)

Hypothesis to test: Pun understanding requires integrating information about word meanings (likely mid-to-late layers). Early layers should show similar predictions for puns and non-puns; divergence indicates where the model processes the wordplay.

Analysis Template

from nnsight import LanguageModel

model = LanguageModel("meta-llama/Llama-3.2-1B", device_map="auto")

pun_prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:'
non_pun_prompt = '"I used to be a banker but I lost patience." This is a pun. Yes or No? Answer:'

# For each layer, decode the residual stream at the final position
# Track: P("Yes") - P("No") across layers
# Plot: layer vs. pun_score for puns and non-puns
# Look for: divergence point where model distinguishes pun from non-pun
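One way to fill in the commented steps, continuing from the template above. This sketch assumes nnsight's pattern of calling submodules (the final norm and lm_head) on intermediate activations inside a trace, with module names following the Llama architecture:

import torch

yes_id = model.tokenizer.encode(" Yes", add_special_tokens=False)[0]
no_id = model.tokenizer.encode(" No", add_special_tokens=False)[0]

def layerwise_pun_score(prompt):
    """P('Yes') - P('No') at the final position, decoded from every layer's residual stream."""
    scores = []
    with model.trace(prompt):
        for layer in model.model.layers:
            h = layer.output[0][:, -1, :]                # residual stream at the final position
            logits = model.lm_head(model.model.norm(h))  # logit lens: final norm + unembedding
            probs = torch.softmax(logits, dim=-1)
            scores.append((probs[0, yes_id] - probs[0, no_id]).save())
    # Saved values are available after the trace (use .value on older nnsight versions).
    return [float(s) for s in scores]

pun_curve = layerwise_pun_score(pun_prompt)
non_pun_curve = layerwise_pun_score(non_pun_prompt)
# Plot layer index vs. score for both curves; the point where they diverge is where
# the model starts to distinguish the pun from its non-pun minimal pair.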

Discussion

Code Exercise

This week's exercise will give you hands-on experience with the core concepts:

Open Pun Dataset Builder in Colab

Project Milestone

Due: Thursday of Week 2

Now that you've identified a promising concept through steering, it's time to formalize it with a systematic benchmark and test it across multiple models to find the best target for deep investigation.

Part 1: Design Your Benchmark

Part 2: Test Across Models

Deliverables:

Tip: Don't overthink the "perfect" dataset at this stage. You'll refine it throughout the semester. Focus on getting a solid baseline that lets you move forward with confidence.