
Week 1: Foundations

Overview

How can we peer inside a running language model to see what it's "thinking"? The key insight is that transformer intermediate layers encode evolving predictions that we can decode and inspect. This week introduces the conceptual vocabulary and core techniques for mechanistic interpretability, including the logit lens—a simple but powerful idea that opened a window into the progressive refinement of representations through the network.

Learning Objectives

By the end of this week, you should be able to:

Required Readings

Ferrando, Sarti, Bisazza & Costa-jussà (2024). Accessible pedagogical overview of interpretability techniques, establishing shared vocabulary for the course.
nostalgebraist (2020). The foundational blog post introducing logit lens—decoding intermediate layer activations directly into vocabulary space.
Wendler et al. (2024). Applies logit lens to multilingual models, revealing that models often pivot through English in intermediate representations.

Supplementary Readings

Belrose et al. (2023). Refined version of logit lens with learned probes for more accurate intermediate decoding.
Elhage et al. (2021). Deeper mathematical treatment of the residual stream view. Dense but foundational.
Professor Bryce. Visual explanation of transformer architecture. Helpful background for understanding the models we're interpreting.
Yocum, Olshausen & Weiss (2025). Extends linear representations from scalar features to function-valued "feature fields" over topological domains (e.g., positions, directions). Shows that eigendecomposition of a learned domain embedding determines which functions can be linearly probed. Advanced/theoretical.

Tutorial: Understanding Transformer Internals

1. The Transformer as a Grid of States

A transformer language model processes text through three phases, creating a grid of internal states:

[Figure: the transformer state grid, showing the three phases: encoder, layers, decoder]
  1. Encoder: First, the encoder turns each input token into a vector of neural activations. The word "the" becomes a high-dimensional vector.
  2. Layers: Then a series of neural layers mixes and transforms the vectors for each token. Information flows both along each token's "column" and across tokens via attention.
  3. Decoder: Finally, the decoder turns each vector into a prediction for the next word. After "Miles Davis plays the", the model predicts "trumpet".
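
If you want to see the three phases in code, here is a minimal sketch using the HuggingFace transformers library with GPT-2. The model choice and variable names are illustrative assumptions, not the course's NDIF tooling, and a model this small may well not produce "trumpet".

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # small model, for illustration only
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    # Encoder phase: text -> token ids (the embedding into vectors happens inside the model).
    inputs = tok("Miles Davis plays the", return_tensors="pt")

    # Layers + decoder phase: the forward pass mixes information across positions,
    # then the unembedding turns the last state into next-token logits.
    with torch.no_grad():
        logits = model(**inputs).logits

    next_id = int(logits[0, -1].argmax())
    print(tok.decode(next_id))                             # a model this small may not say " trumpet"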

2. The Residual Stream View

A transformer can be understood as a series of operations that read from and write to a shared "residual stream." At each layer, attention heads and MLP modules add information to this stream rather than replacing it entirely.

x_0 = embed(token)
x_1 = x_0 + attn_1(x_0) + mlp_1(x_0)
x_2 = x_1 + attn_2(x_1) + mlp_2(x_1)
...
output = unembed(x_final)

(Schematic: in most architectures the MLP actually reads the stream after attention has written to it, but the additive picture is the same.)

This view enables a key insight: we can peek at intermediate states x_L to see what the model "knows" at layer L.
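
Here is a minimal sketch of that peek, reusing the tok and model objects from the sketch above: output_hidden_states=True returns the embedding output x_0 followed by one residual-stream state per layer.

    # Reusing tok and model from the sketch above.
    with torch.no_grad():
        out = model(**tok("Miles Davis plays the", return_tensors="pt"),
                    output_hidden_states=True)

    # out.hidden_states is a tuple: x_0 (the embedding output) followed by one
    # residual-stream state per layer, each of shape (batch, seq_len, d_model).
    for L, x_L in enumerate(out.hidden_states):
        print(L, tuple(x_L.shape))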

3. The Logit Lens: Early Exit Decoding

What if we don't wait until the final layer to decode? The logit lens applies the decoder (unembedding matrix) directly to intermediate activations:

[Figure: early-exit decoding of "Miles Davis plays the ___", showing how the prediction evolves across layers: thing → jazz → horn → horn → trumpet]

If you decode each vector early, you can see how the prediction evolves. In this example, the model's prediction for "Miles Davis plays the ___" progresses through: "thing" → "jazz" → "horn" → "horn" → "trumpet". The correct answer emerges gradually as information is refined through the layers.

intermediate_logits = W_U · x_L
probs = softmax(intermediate_logits)
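
Putting the two previous sketches together, here is a rough logit lens helper, again assuming the illustrative HuggingFace GPT-2 setup (model.transformer.ln_f and model.lm_head are GPT-2's names for the final layer norm and unembedding; other architectures name these modules differently). One practical detail the formula above omits: logit lens implementations usually pass x_L through the model's final layer norm before unembedding, which is what the helper does.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # same illustrative setup as above
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def logit_lens(prompt, top_k=3):
        # Decode every residual-stream state at the final token position
        # through the final layer norm and the unembedding matrix W_U.
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
            rows = []
            for layer, h in enumerate(out.hidden_states):
                x = h[0, -1]                                       # x_L at the last position
                if layer < len(out.hidden_states) - 1:
                    x = model.transformer.ln_f(x)                  # last entry already includes ln_f
                probs = torch.softmax(model.lm_head(x), dim=-1)    # softmax(W_U · x_L)
                top = torch.topk(probs, top_k)
                rows.append((layer, [tok.decode(int(i)) for i in top.indices]))
        return rows

    for layer, tokens in logit_lens("Miles Davis plays the"):
        print(layer, tokens)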

Key observations from logit lens analysis:

4. What Does Latent Language Reveal?

When applying logit lens to multilingual models processing non-English text, a striking pattern emerges: regardless of input language, intermediate representations often decode to English tokens.

Wendler et al. (2024) studied this systematically with translation tasks. When prompting a model to translate French to Chinese (e.g., Français: "fleur" - 中文:), the intermediate layers decode to... English! The model appears to pivot through English internally, even when English is neither the input nor output language.
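
A rough way to script that measurement, reusing the logit_lens helper sketched above, is to classify each layer's top token as CJK or not with a crude character test. Caveats: the character test is only a proxy for "Chinese", GPT-2 is an English-only model that will not reproduce the finding, and the multilingual models in the workbench exercise below are what you would actually run this on.

    # Few-shot French -> Chinese translation prompt (the last answer is left blank).
    prompt = ('Français: "cinq" - 中文: "五"\n'
              'Français: "coeur" - 中文: "心"\n'
              'Français: "trois" - 中文: "三"\n'
              'Français: "nuage" - 中文: "')

    def is_cjk(s):
        # Crude proxy: does the decoded token contain any CJK character?
        return any("\u4e00" <= ch <= "\u9fff" for ch in s)

    for layer, tokens in logit_lens(prompt, top_k=1):
        tag = "CJK" if is_cjk(tokens[0]) else "latin/other"
        print(f"layer {layer:2d}  top token {tokens[0]!r}  ({tag})")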

Possible implications:

But what does this actually show? The observation that intermediate layers decode to English is striking, but its interpretation is far from settled. Consider: the unembedding matrix was trained predominantly on English data—perhaps it simply projects any high-dimensional vector more readily onto English tokens. Or perhaps the pattern reflects tokenization artifacts rather than genuine linguistic processing. And even if the model does "pivot through English," we have no causal evidence that these intermediate English representations are actually used in the computation, rather than being epiphenomenal byproducts. We'll return to experiments along these lines later in the course; for now, it is worth debating what this observation really proves.

5. Variations on Logit Lens

The basic logit lens assumes we can directly project intermediate activations through the unembedding matrix. But representations at early layers may not be in the same "format" as final-layer representations, and researchers have developed many variations:

Each variation asks a slightly different question about what information is present in intermediate representations. The proliferation of "lenses" reflects both the fertility of the basic idea and the difficulty of interpreting what these projections really mean.
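
To make the "different format" point concrete, here is a rough sketch in the spirit of the tuned lens (Belrose et al., 2023): instead of projecting x_L directly, train a small affine "translator" for that layer so that its output, pushed through the usual layer norm and unembedding, matches the model's final-layer distribution. The layer choice, optimizer, and single-prompt training step below are illustrative assumptions, not the authors' exact recipe.

    import torch
    import torch.nn.functional as F

    # Reuses tok and model from the earlier sketches; freeze the model itself.
    for p in model.parameters():
        p.requires_grad_(False)

    LAYER = 6                                        # illustrative layer choice
    d_model = model.config.hidden_size
    translator = torch.nn.Linear(d_model, d_model)   # a learned affine map for this layer
    opt = torch.optim.Adam(translator.parameters(), lr=1e-3)

    def train_step(prompt):
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        target = F.log_softmax(out.logits[0, -1], dim=-1)            # final-layer distribution
        h = out.hidden_states[LAYER][0, -1]                          # x_L at the last position
        pred = model.lm_head(model.transformer.ln_f(translator(h)))  # translate, then unembed
        loss = F.kl_div(F.log_softmax(pred, dim=-1), target,
                        reduction="sum", log_target=True)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Real training uses many prompts and positions; one step only shows the objective.
    print(train_step("Miles Davis plays the"))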

6. From Observation to Understanding: What Experiments Are Needed?

Logit lens generates striking observations. But observations are not explanations. The central challenge is: what additional experiments would you need to turn a logit lens observation into a scientific claim?

Class Discussion: Designing Follow-up Experiments

Consider the Wendler observation: intermediate layers decode to English during French→Chinese translation.

  1. Hypothesis: "The model uses English as an internal lingua franca."
  2. Alternative hypothesis: "The decoder matrix is biased toward English tokens."
  3. Another alternative: "This is a tokenization artifact."

Your task: For each alternative, design an experiment that would distinguish it from the main hypothesis. What would you measure? What controls would you need? What result would be convincing?
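
As one possible starting point for alternative 2, and only as a sketch rather than a validated control: if the unembedding itself were biased toward English, then even random vectors with a realistic norm should tend to decode to English-looking tokens. Comparing that baseline rate against the rate for real intermediate states is one crude check; the "looks English" test below is just an ASCII-letters proxy and is exactly the kind of thing your design would need to improve.

    import torch

    # Reuses tok and model from the tutorial sketches.
    def looks_english(s):
        s = s.strip()
        return s.isascii() and s.isalpha() and len(s) > 1   # crude proxy for "an English word"

    with torch.no_grad():
        out = model(**tok("Miles Davis plays the", return_tensors="pt"),
                    output_hidden_states=True)
        typical_norm = out.hidden_states[6][0, -1].norm()    # norm of a mid-layer state

        hits, n = 0, 200
        for _ in range(n):
            v = torch.randn(model.config.hidden_size)
            v = v / v.norm() * typical_norm                  # random direction, realistic norm
            top = int(model.lm_head(model.transformer.ln_f(v)).argmax())
            hits += looks_english(tok.decode(top))

    print(f"random-vector 'English' rate: {hits / n:.2f}")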

Similar questions arise for any logit lens observation:

We'll develop tools to address these questions throughout the course: causal interventions (Week 5), systematic evaluation methodology (Week 3), and validation techniques (Weeks 6-7). For now, the goal is to recognize the gap between observation and understanding—and to start thinking about how to bridge it.

In-Class Exercise: Logit Lens Workbench

In this hands-on exercise, we'll use the NDIF Logit Lens Workbench—a code-free tool for exploring transformer internals. No programming required!

Open NDIF Logit Lens Workbench

Part 1: Reproduce the Wendler Results

First, we'll replicate the "latent language" finding from Wendler et al. (2024). Does the model really pivot through English when translating between non-English languages?

Instructions:

  1. Open the workbench and select a multilingual model (e.g., Llama-2)
  2. Enter a French-to-Chinese translation prompt:
    Français: "cinq" - 中文: "五"
    Français: "coeur" - 中文: "心"
    Français: "trois" - 中文: "三"
    Français: "nuage" - 中文:
  3. Examine the logit lens heatmap for the final token position
  4. At which layers do you see English words appearing? At which layers does Chinese emerge?

Discussion Questions:

Part 2: Exploring a Research Question—Do Models Understand Puns?

Now let's apply logit lens to investigate a research question we'll revisit throughout the course: how do language models process puns?

Puns are interesting because they require recognizing that a word has multiple meanings simultaneously. If a model "gets" a pun, we might expect to see both meanings active in its intermediate representations.

Example Puns to Explore:

"I used to be a banker, but I lost interest."

"The past, present, and future walked into a bar. It was tense."

"I'm reading a book about anti-gravity. It's impossible to put down."

Instructions:

  1. Enter a pun-setup prompt that leads to a punchline word
  2. Look at the logit lens output for the position just before the punchline
  3. Do you see evidence of both meanings of the pun word?
  4. At what layer does the "correct" (pun) completion emerge?
  5. Compare with a non-pun sentence using the same words—does the pattern differ?

In-Context Learning of Puns

Here's a more sophisticated experiment: can we use in-context learning to teach the model the "pun pattern"?

Try comparing the prediction for an incomplete pun alone versus prefixed with other puns:

Without context:
"I used to be a banker, but I lost ___"
(Model might predict: "my job", "everything", "money")

With pun context:
"I used to be a tailor, but I wasn't suited for it.
I used to be a train conductor, but I switched tracks.
I used to be a banker, but I lost ___"
(Model might now predict: "interest")

Use logit lens to examine: at which layers does "interest" emerge? Does the pun context change what appears in intermediate layers, or only at the final layers? This connects back to the function vectors idea from Week 0—is there a "pun function" that gets activated by the context?
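
One way to script the comparison, reusing the tok and model setup from the tutorial sketches: track the probability of the pun completion at every layer, with and without the pun context. The probe " interest" is assumed here to be a single GPT-2 token, and a small model may never rank it highly.

    # Reuses torch, tok, and model from the tutorial sketches.
    bare = "I used to be a banker, but I lost"
    in_context = ("I used to be a tailor, but I wasn't suited for it.\n"
                  "I used to be a train conductor, but I switched tracks.\n"
                  "I used to be a banker, but I lost")

    probe_id = tok.encode(" interest")[0]   # assumed single token; verify with tok.encode(" interest")

    def probe_prob(prompt):
        # Probability assigned to the pun completion at every layer (logit lens style).
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
            probs = []
            for layer, h in enumerate(out.hidden_states):
                x = h[0, -1]
                if layer < len(out.hidden_states) - 1:
                    x = model.transformer.ln_f(x)
                probs.append(torch.softmax(model.lm_head(x), dim=-1)[probe_id].item())
        return probs

    for name, prompt in [("without context", bare), ("with pun context", in_context)]:
        print(name, [f"{p:.3f}" for p in probe_prob(prompt)])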

Comparing Model Sizes

Try the same pun experiments on models of different sizes. At what scale does pun understanding emerge? A small model might never predict "interest" even with pun context, while a larger model might get it immediately. Use this to explore: how large does a model need to be to "get" puns? Does the pattern in intermediate layers look different in models that succeed versus those that fail?
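
A rough way to sweep sizes, using the GPT-2 family purely as an example (the workbench offers other model families, and larger checkpoints take a while to download):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    prompt = ("I used to be a tailor, but I wasn't suited for it.\n"
              "I used to be a train conductor, but I switched tracks.\n"
              "I used to be a banker, but I lost")

    for name in ["gpt2", "gpt2-medium", "gpt2-large"]:   # illustrative sizes
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        model.eval()
        with torch.no_grad():
            logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
        print(f"{name:12s} top prediction: {tok.decode(int(logits.argmax()))!r}")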

Using SOTA Models as a Reference

State-of-the-art models like ChatGPT or Claude can help in two ways:

Research Questions to Consider:

We'll return to the pun research question throughout the semester as we learn new interpretability methods. Each technique will give us a different lens on the same phenomenon.

Optional: Code Exercise

For those who want to dive deeper into the implementation, this exercise provides hands-on coding experience:

Open Logit Lens Exercise in Colab

Project Milestone

Due: Thursday of Week 1

Deliverable: Initial Pitch

Thursday In-Class Activity

Each team will present their five-minute elevator pitch. As a class, we will read each other's proposals and discuss them using the FINER framework:

Next Assignment (Due: Week 2)

After receiving feedback, begin exploratory analysis to test the "F" in FINER: Is this feasible? Are there signs of life?

Deliverable for Week 2: Start a running project slide deck in your team's Google Drive. Include your preliminary observations and be ready to present them in class.