Before we can understand what's happening inside a large language model, we need to know how to measure what it does. This week introduces the fundamental methods for evaluating LLM behavior, from basic probability outputs to sophisticated benchmarking strategies. You'll learn how to design effective prompts, interpret model outputs, and assess whether a model has truly learned a concept or simply memorized training data.
By the end of this week, you should be able to:
At its core, a large language model is a function that takes a sequence of tokens as input and produces a
probability distribution over a single next token. Given a sequence of tokens x₁, x₂, ..., xₙ, the model
computes:
P(xₙ₊₁ | x₁, x₂, ..., xₙ)
This is a distribution over the entire vocabulary, typically 50,000 to 100,000+ tokens. The model doesn't directly generate paragraphs or essays; it generates probabilities for one token at a time. Understanding this distinction is crucial for interpretability research.
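To make this concrete, here is a minimal sketch of reading that distribution off a model, assuming the Hugging Face transformers library with GPT-2 as a stand-in small model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, seq_len, vocab_size)

# The distribution P(x_{n+1} | x_1, ..., x_n) lives at the final position
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")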
The one-token-at-a-time nature of autoregressive models is a gift for interpretability. When a model makes a single prediction, we know exactly where and when the decision happens:
This is why cloze-style evaluation ("The capital of France is ___") is so powerful for mechanistic interpretability: the model's entire reasoning must culminate in a single token prediction at a known position.
But not everything reduces to single-token predictions. Sometimes we need to evaluate:
When the model generates 100 tokens, the "decision" is distributed across all of them. There's no single position where we can say "this is where the model decided to be helpful/harmful/creative." This makes interpretability much harder.
To generate longer sequences, we sample from the model autoregressively, one token at a time:
Each strategy affects the diversity, coherence, and randomness of generated text. For benchmarking, greedy decoding or low-temperature sampling provides deterministic, reproducible results.
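As a sketch of what that loop looks like, assuming a transformers model and tokenizer (GPT-2 again as a placeholder; EOS handling and batching omitted), note that greedy decoding is simply the temperature-zero case:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sample_next_token(logits, temperature=0.0):
    if temperature == 0.0:
        return int(torch.argmax(logits))                  # greedy: deterministic, reproducible
    probs = torch.softmax(logits / temperature, dim=-1)   # higher temperature -> more randomness
    return int(torch.multinomial(probs, num_samples=1))

def generate(prompt, max_new_tokens=20, temperature=0.0):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            next_logits = model(ids).logits[0, -1]        # distribution over the next token only
        next_id = sample_next_token(next_logits, temperature)
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
    return tokenizer.decode(ids[0])

print(generate("The capital of France is", temperature=0.0))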
To make interpretability tractable, we want to reduce complex tasks to single-token predictions. Three strategies:
Format the task as text completion, measuring the probability of specific completions:
"The capital of France is ___"
We can evaluate whether "Paris" has higher probability than alternatives. The decision happens at a known position.
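A minimal scoring sketch under the same transformers/GPT-2 assumptions; only the first token of each candidate is scored, and the leading space matters because that is how the tokenizer represents mid-sentence words:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

with torch.no_grad():
    logits = model(**tokenizer("The capital of France is", return_tensors="pt")).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

for city in [" Paris", " London", " Berlin"]:      # leading space to match the tokenizer
    token_id = tokenizer.encode(city)[0]           # score the candidate's first token
    print(city, f"{probs[token_id].item():.4f}")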
Force selection from predefined options by comparing token probabilities for "A", "B", "C", "D":
Which is the capital of France?
A) London B) Paris C) Berlin D) Madrid
Answer:
The model's prediction at the final position tells us everything. We can extract activations here and compare logits across options.
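A sketch of that comparison, again assuming transformers with GPT-2 as a placeholder model (a small model may well get the answer wrong; the point is the mechanics):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Which is the capital of France?\n"
    "A) London B) Paris C) Berlin D) Madrid\n"
    "Answer:"
)
with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]

# Compare only the four option letters (leading space to match the tokenizer)
option_ids = {opt: tokenizer.encode(" " + opt)[0] for opt in "ABCD"}
scores = {opt: logits[tid].item() for opt, tid in option_ids.items()}
print(max(scores, key=scores.get), scores)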
Provide examples in the prompt to demonstrate the task, then use cloze format for the test case:
Input: The movie was terrible.
Sentiment: Negative
Input: I loved every minute!
Sentiment: Positive
Input: It was okay, nothing special.
Sentiment:
ICL is valuable for interpretability because you can compare activations with and without the in-context examples to see how they reshape processing.
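One way to see that reshaping at the behavioral level is to score the same label tokens with and without the examples; a sketch under the same transformers/GPT-2 assumptions (only the first token of each label is scored):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

few_shot = (
    "Input: The movie was terrible.\nSentiment: Negative\n"
    "Input: I loved every minute!\nSentiment: Positive\n"
)
test = "Input: It was okay, nothing special.\nSentiment:"

def label_probs(prompt):
    with torch.no_grad():
        logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return {lab: probs[tokenizer.encode(" " + lab)[0]].item() for lab in ("Positive", "Negative")}

print("zero-shot:", label_probs(test))
print("few-shot: ", label_probs(few_shot + test))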
Some capabilities genuinely require evaluating longer outputs: writing quality, reasoning coherence, helpfulness. When you must evaluate runs of text rather than single tokens, LLM-as-judge provides an automated approach:
MT-Bench demonstrated that LLM judges correlate well with human judgments on many tasks, but you should still validate a subset against human evaluation.
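A skeletal judge loop might look like the sketch below. The call_llm function is a placeholder for whatever chat API you actually use (it is not a real library call), and the 1-10 helpfulness rubric is only an illustration:

JUDGE_TEMPLATE = """You are grading a model response.
Question: {question}
Response: {response}
Rate the response's helpfulness from 1 (useless) to 10 (excellent).
Reply with only the number."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def judge(question: str, response: str) -> int:
    reply = call_llm(JUDGE_TEMPLATE.format(question=question, response=response))
    return int(reply.strip().split()[0])   # parse defensively; judges sometimes add extra text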
The bottleneck in interpretability research is often data: you need hundreds or thousands of examples that cleanly test your concept. Model-written evaluations (Perez et al. 2022) offer a solution: use LLMs to generate evaluation datasets.
The core workflow:
For interpretability, design your generation prompts to produce cloze-format or MCQ examples. This ensures the resulting dataset has the token-localization properties you need.
See the supplementary essay for detailed prompts and a complete pipeline example using puns.
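Whatever generation prompt you use, plan on a post-processing pass before trusting the data. A sketch, assuming each generated example arrives as a dict with "text" and "label" fields (the field names are illustrative):

def clean_generated_examples(raw_examples):
    # Drop malformed labels and near-verbatim duplicates from model-written data
    seen, kept = set(), []
    for ex in raw_examples:
        text = ex["text"].strip()
        if ex["label"] not in {"Yes", "No"}:
            continue
        if text.lower() in seen:
            continue
        seen.add(text.lower())
        kept.append({"text": text, "label": ex["label"]})
    return kept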
Different tasks require different metrics:
A model scoring 85% and one scoring 83% might not be meaningfully different. Quantify the uncertainty:
Rule of thumb: differences smaller than 2-3% on typical benchmark sizes often aren't statistically significant.
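A quick way to quantify this is a normal-approximation confidence interval on accuracy (a bootstrap over examples works too). The numbers below correspond to the 85% vs 83% comparison on a hypothetical 500-example benchmark:

import math

def accuracy_ci(correct, total, z=1.96):
    # Normal-approximation 95% confidence interval for benchmark accuracy
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

print(accuracy_ci(425, 500))   # roughly (0.82, 0.88)
print(accuracy_ci(415, 500))   # roughly (0.80, 0.86) -- the intervals overlap heavily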
A fundamental question: Did the model learn the concept, or did it memorize training examples?
If test examples appeared in training data, the model might simply recall them rather than demonstrate true understanding. This is especially problematic because:
Beyond contamination, test for genuine understanding:
The Holistic Evaluation of Language Models (HELM) benchmark exemplifies comprehensive evaluation:
Effective benchmarking requires:
For your research project, the quality of your findings depends heavily on your dataset construction. This section covers principles and strategies for building datasets that enable rigorous concept characterization.
Before constructing a dataset, you must operationalize your concept—make it measurable.
Exercise: For your chosen concept, write down:
The gold standard for concept research: pairs of examples that differ only in your target concept.
Positive: "Could you please pass the salt?"
Negative: "Pass the salt."
What changed: politeness marker ("Could you please")
What stayed the same: meaning, length, topic
For binary concepts (polite/rude, factual/fictional), aim for 50-50 split unless you have specific research reasons.
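A simple way to store contrastive pairs is as a list of dicts carrying the metadata you will need for balance checks; the field names below are just one possible scheme:

from collections import Counter

pairs = [
    {"positive": "Could you please pass the salt?", "negative": "Pass the salt.",
     "concept": "politeness", "topic": "requests"},
    # ... more pairs, spanning the dimensions in the table below ...
]

print(Counter(p["topic"] for p in pairs))   # is any topic over-represented?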
Ensure diversity across relevant dimensions:
| Dimension | Why It Matters | Example |
|---|---|---|
| Topic | Prevent concept from being confounded with domain | Politeness in: requests, apologies, greetings, refusals |
| Length | Avoid "polite = longer" spurious correlation | 5-10 tokens, 10-20 tokens, 20-40 tokens (balanced) |
| Syntactic structure | Ensure concept isn't just syntax | Questions, statements, commands |
| Formality | Separate from related concepts | Formal-polite, informal-polite, formal-rude, informal-rude |
Bad dataset:
Positive (polite): "Thank you so much for your help! I really appreciate it."
Negative (rude): "Whatever."
Problem: Length, sentiment, and informativeness all vary—not just politeness!
Better dataset:
Positive (polite): "Thanks for your help."
Negative (direct): "Got it, bye."
Better: Similar length, neutral tone, only politeness marker differs
Models might learn "if text contains 'please' → polite" without understanding context.
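A cheap sanity check is to run a trivial keyword baseline over your dataset: if it scores close to your model, the task may be solvable from surface cues alone. A sketch, assuming examples stored as dicts with "text" and "label" fields:

dataset = [
    {"text": "Could you please pass the salt?", "label": "polite"},
    {"text": "Pass the salt.", "label": "rude"},
    # ... the rest of your dataset ...
]

def keyword_baseline(text):
    # Does the single cue "please" already solve the task?
    return "polite" if "please" in text.lower() else "rude"

accuracy = sum(keyword_baseline(ex["text"]) == ex["label"] for ex in dataset) / len(dataset)
print(f"keyword-baseline accuracy: {accuracy:.2f}")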
If your test examples appeared in the model's training data, you're testing memorization, not understanding.
How much data do you need? It depends on your research method:
| Method (Week) | Minimum | Recommended | Purpose |
|---|---|---|---|
| Benchmarking (Week 2) | 100 | 500-1000 | Test set for measuring accuracy |
| Steering (Week 1) | 10-20 pairs | 50-100 pairs | Contrastive pairs for activation difference |
| Probing (Week 5) | 500 | 2000-5000 | Training set for linear classifier |
| Visualization (Week 3) | 100 | 500-1000 | Examples to visualize in 2D/3D |
Your dataset is a research artifact. Document it properly:
Week 2 Goal: Create a small, high-quality evaluation dataset (100-200 examples) for your concept.
Deliverable: A JSON/CSV file with examples, labels, and metadata (topic, length, etc.) ready for benchmarking.
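One possible layout for that file, written as JSON from Python (the exact fields are up to you; these are illustrative):

import json

examples = [
    {"id": 1, "text": "Could you please review my draft?", "label": "polite", "topic": "requests"},
    {"id": 2, "text": "Review my draft.", "label": "direct", "topic": "requests"},
]
for ex in examples:
    ex["num_words"] = len(ex["text"].split())   # crude length metadata for balance checks

with open("concept_eval.json", "w") as f:
    json.dump(examples, f, indent=2)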
Building on the pun theme from Week 1, we'll create a complete evaluation pipeline: generate data, test multiple evaluation methods, and apply logit lens to see where pun understanding emerges.
Work in groups to craft a prompt for Claude that generates high-quality pun evaluation examples. Your prompt should produce examples in cloze format so that the resulting dataset stays tractable for interpretability.
Generate pun recognition examples in cloze format. For each example:
1. Provide a sentence that may or may not be a pun
2. Format: "[sentence]" This is a pun. Yes or No? Answer:
3. Provide the correct label (Yes/No)
4. For puns, briefly note the wordplay mechanism

Generate 20 examples:
- 10 puns (varied mechanisms: homophones, polysemy, syntactic)
- 10 non-puns (similar structure but no wordplay)
- Vary topics: professions, food, animals, science, everyday life
Discussion questions:
Using the NDIF workbench, we'll compare different evaluation approaches on the same pun examples:
Compare P("Yes") vs P("No") at the answer position:
prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:' # Extract logits at final position # Compare logit["Yes"] vs logit["No"] # Classify based on which is higher
Provide 3-4 examples before the test case:
"The bicycle was two-tired." Pun? Yes "The bicycle was broken." Pun? No "I lost interest in banking." Pun? Yes "I lost patience with banking." Pun? No "Time flies like an arrow." Pun?
Why is "I lost interest" a pun when said by a former banker? A) It references losing money B) "Interest" means both curiosity and financial returns C) Banking is boring D) It's not actually a pun Answer:
Compare: Which method gives the clearest signal? Which is easiest to interpret mechanistically?
The key question: At which layer does the model "get" the pun?
For a set of pun/non-pun minimal pairs, we'll:
from nnsight import LanguageModel

model = LanguageModel("meta-llama/Llama-3.2-1B", device_map="auto")
pun_prompt = '"I used to be a banker but I lost interest." This is a pun. Yes or No? Answer:'
non_pun_prompt = '"I used to be a banker but I lost patience." This is a pun. Yes or No? Answer:'
# Token ids for " Yes" and " No" (note the leading space)
yes_id = model.tokenizer.encode(" Yes", add_special_tokens=False)[0]
no_id = model.tokenizer.encode(" No", add_special_tokens=False)[0]

# Logit lens: decode the final-position residual stream after each layer, tracking logit("Yes") - logit("No");
# repeat for non_pun_prompt, plot both curves by layer, and look for the point where they diverge.
with model.trace(pun_prompt):
    pun_score = []
    for layer in model.model.layers:
        logits = model.lm_head(model.model.norm(layer.output[0][:, -1, :]))
        pun_score.append((logits[0, yes_id] - logits[0, no_id]).save())
This week's exercise will give you hands-on experience with the core concepts:
Due: Thursday of Week 2
Now that you've identified a promising concept through steering, it's time to formalize it with a systematic benchmark and test it across multiple models to find the best target for deep investigation.
Tip: Don't overthink the "perfect" dataset at this stage. You'll refine it throughout the semester. Focus on getting a solid baseline that lets you move forward with confidence.